ReasonScape Runner (runner.py)
runner.py executes the experiment configurations in configs/, applying a template from templates/ and a sampler from sampler/ to perform inference, and writes the output to results/.
usage: runner.py [-h] --config CONFIG --template TEMPLATE [--seed SEED] [--precision PRECISION] [--density DENSITY] [--degree DEGREE] --model MODEL --sampler SAMPLER --apibase APIBASE [--apikey APIKEY] [--parallel PARALLEL] [--threads THREADS] [--timeout TIMEOUT] [--cache CACHE] [--shuffle] [--offline ROUNDS]
Unified Experiment Runner
options:
-h, --help show this help message and exit
--config CONFIG YAML experiment configuration file
--template TEMPLATE Prompt template file
--seed SEED Random seed
--precision PRECISION Precision level to use from config (defaults to first defined)
--density DENSITY Grid sampling density value: normal (default), lowdef, corner
--degree DEGREE Degree value to apply to tasks that do not already have degree specified
--model MODEL Model to use
--sampler SAMPLER Sampler parameters file
--apibase APIBASE API base URL
--apikey APIKEY API key
--parallel PARALLEL Parallel completions per step
--threads THREADS Parallel steps
--timeout TIMEOUT Request timeout
--cache CACHE Cache database file
--shuffle Shuffle steps into random order for even GPU workload
--offline ROUNDS Offline mode: generate ROUNDS worth of samples without LLM inference or evaluation
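The help text above maps naturally onto an argparse declaration. The following is a minimal sketch of how such a parser could be written, not the actual runner.py source: the `build_parser` name, the `int` argument types, and the file paths in the example call are assumptions made for illustration. Only the flag names, the required/optional split, and the `--density` choices are taken from the usage text.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented runner.py options."""
    p = argparse.ArgumentParser(prog="runner.py", description="Unified Experiment Runner")
    # Required arguments (shown without brackets in the usage line).
    p.add_argument("--config", required=True, help="YAML experiment configuration file")
    p.add_argument("--template", required=True, help="Prompt template file")
    p.add_argument("--model", required=True, help="Model to use")
    p.add_argument("--sampler", required=True, help="Sampler parameters file")
    p.add_argument("--apibase", required=True, help="API base URL")
    # Optional arguments. The int types below are assumptions.
    p.add_argument("--seed", type=int, help="Random seed")
    p.add_argument("--precision", help="Precision level to use from config")
    p.add_argument("--density", default="normal", choices=["normal", "lowdef", "corner"],
                   help="Grid sampling density")
    p.add_argument("--degree", help="Degree for tasks that do not specify one")
    p.add_argument("--apikey", help="API key")
    p.add_argument("--parallel", type=int, help="Parallel completions per step")
    p.add_argument("--threads", type=int, help="Parallel steps")
    p.add_argument("--timeout", type=int, help="Request timeout")
    p.add_argument("--cache", help="Cache database file")
    p.add_argument("--shuffle", action="store_true", help="Shuffle steps into random order")
    p.add_argument("--offline", metavar="ROUNDS", type=int,
                   help="Generate ROUNDS worth of samples without inference")
    return p


# Example invocation; every path and the URL here are hypothetical placeholders.
example_args = build_parser().parse_args([
    "--config", "configs/example.yaml",
    "--template", "templates/example.json",
    "--model", "Phi-4-mini-instruct-fp16",
    "--sampler", "sampler/example.json",
    "--apibase", "http://localhost:8000",
    "--shuffle",
])
print(example_args.density, example_args.shuffle, example_args.offline)
```

With only the required flags and `--shuffle` supplied, the sketch falls back to the documented default density of normal and leaves `--offline` unset.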
Manifold Enhancement Arguments
The runner supports several arguments for controlling difficulty manifold generation:
--precision
: Selects which precision level to use from the config file's precision section. If not specified, uses the first defined precision level.
--density
: Controls grid sampling density for manifold tasks. Options are:
  - normal (default): Standard grid sampling
  - lowdef: Lower-definition sampling for faster evaluation
  - corner: Corner sampling for boundary analysis
--degree
: Sets the degree value for tasks that don't already specify one. Higher degrees typically increase task difficulty.
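The two fallback rules described above (use the first defined precision level, and apply the degree only to tasks that lack one) can be sketched as follows. This assumes a config layout that is not confirmed by this page: a `precision` mapping from level name to settings, and tasks represented as dictionaries with an optional `degree` key. The function names and the sample level names are hypothetical.

```python
from typing import Any, Dict, List, Mapping, Optional, Tuple


def select_precision(config: Mapping[str, Any], name: Optional[str] = None) -> Tuple[str, Any]:
    """Return (level_name, settings), defaulting to the first defined level."""
    levels = config["precision"]  # assumed layout: {level_name: settings}
    if not levels:
        raise ValueError("config defines no precision levels")
    if name is None:
        # Dicts preserve insertion order, so this is the first level in the file.
        name = next(iter(levels))
    if name not in levels:
        raise KeyError(f"unknown precision level: {name}")
    return name, levels[name]


def apply_default_degree(tasks: List[Dict[str, Any]], degree: Any) -> List[Dict[str, Any]]:
    """Fill in `degree` only for tasks that do not already specify one."""
    return [task if "degree" in task else {**task, "degree": degree} for task in tasks]


# Illustrative data only; the level names and task names are made up.
config = {"precision": {"low": {"count": 32}, "high": {"count": 128}}}
print(select_precision(config))
print(apply_default_degree([{"name": "a"}, {"name": "b", "degree": 2}], degree=1))
```

Returning new task dictionaries rather than mutating the input keeps the loaded config unchanged, which makes it easier to rerun the same config with a different `--degree`.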
Execution Modes
--shuffle
: Randomizes step execution order for more even GPU workload distribution across parallel execution.
--offline ROUNDS
: Generates test samples without performing LLM inference or evaluation; useful for validating test generation and for dataset creation.
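The effect of --shuffle can be illustrated with a small sketch. It assumes that steps are held in a list and that the shuffle is driven by the run's seed; both are assumptions, and `order_steps` is a hypothetical helper rather than a function from runner.py.

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def order_steps(steps: Sequence[T], shuffle: bool, seed: Optional[int] = None) -> List[T]:
    """Return steps in execution order, optionally shuffled reproducibly."""
    ordered = list(steps)  # copy so the caller's sequence is untouched
    if shuffle:
        # A dedicated Random instance avoids disturbing global random state.
        random.Random(seed).shuffle(ordered)
    return ordered


steps = list(range(8))
print(order_steps(steps, shuffle=False))
print(order_steps(steps, shuffle=True, seed=42))
```

If steps are generated in order of increasing difficulty, neighbouring steps tend to produce responses of similar length, so mixing the order spreads long and short generations across the parallel workers.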
Runner Output Schema
The resulting files in results/<timestamp>_<experiment>_<model>_<template>_<sampler>/ contain one JSON record per line. A single record, expanded here for readability:
{
"task": "000_movies_choice_count-3_reference_count-4",
"version": 1,
"base_task": "movies",
"key": "000_movies_choice_count-3_reference_count-4-0",
"input": "Which movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge",
"target": "(C)",
"response_enum": [
"(A)",
"(B)",
"(C)"
],
"genresult": {
"reference_movies": [
"12 Years a Slave",
"Braveheart",
"Ford v Ferrari"
],
"choices": [
"Dark Phoenix",
"The Incredible Hulk",
"Hacksaw Ridge"
],
"selected_genres": [
"Biography",
"Drama"
],
"selected_themes": []
},
"messages": [
{
"role": "user",
"content": "Given a list of movies a user has watched and liked, recommend a similar film. Think step by step, then respond with Final Answer: (A) - (Z)\nWhich movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge"
}
],
"answer": "To recommend a similar film, let's analyze the given movies and the options:\n\n1. **12 Years a Slave**: A historical drama about the brutal life of a slave in the 19th century.\n2. **Braveheart**: A historical drama about William Wallace, a Scottish warrior who led a rebellion against English rule in the 13th century.\n3. **Ford v Ferrari: Ferrari**: A biographical drama about the rivalry between Henry Ford and Enzo Ferrari, focusing on the development of the Ford Model T and Ferrari's racing cars.\n\nAll three movies are historical dramas with strong emotional and dramatic elements, focusing on significant historical events and figures.\n\nNow, let's evaluate the options:\n\n(A) **Dark Phoenix**: A superhero film about the transformation of Jean Grey into the Dark Phoenix, focusing on a fictional character and a sci-fi theme.\n(B) **The Incredible Hulk**: A superhero film about Bruce Banner turning into the Hulk, focusing on a fictional character and a sci-fi theme.\n(C) **Hacksaw Ridge**: A historical drama about Desmond Doss, a conscientious objector who saved many lives during World War II.\n\nGiven the preference for historical dramas, the most similar film to the group would be:\n\n(C) **Hacksaw Ridge**\n\nFinal Answer: (C) Hacksaw Ridge",
"usage": {
"completion_tokens": 269,
"prompt_tokens": 76,
"prompt_tokens_details": null,
"total_tokens": 345
},
"timings": {
"termination_reason": null,
"finish_reason": "stop"
},
"step": 0,
"params": {
"reference_count": 4,
"choice_count": 3,
"count": 128
},
"model": "Phi-4-mini-instruct-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-4k",
"guess_chance": 0.3333333333333333,
"normalized_answer": "(C) Hacksaw Ridge",
"extracted_answer": "(C)",
"is_valid": true,
"is_correct": true,
"is_truncated": false
}
The exact fields in genresult depend on the generator itself. They make it possible to filter the data in post-processing and look for patterns arising from generator architecture or sub-test definitions.
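As an illustration of this kind of post-processing, the sketch below reads one-record-per-line output and groups accuracy by a genresult field. It uses the `selected_genres` and `is_correct` fields from the movies record shown above; the helper names are hypothetical, and the two sample lines are trimmed, made-up records rather than real results.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON record per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def accuracy_by_genre(records: Iterable[dict]) -> Dict[str, float]:
    """Fraction of correct answers for each genre listed in genresult."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for record in records:
        # Other generators may not emit selected_genres, so default to empty.
        for genre in record["genresult"].get("selected_genres", []):
            totals[genre] += 1
            correct[genre] += int(record["is_correct"])
    return {genre: correct[genre] / totals[genre] for genre in totals}


# Illustrative records only, reduced to the two fields this sketch needs.
sample_lines = [
    '{"genresult": {"selected_genres": ["Biography", "Drama"]}, "is_correct": true}',
    '{"genresult": {"selected_genres": ["Drama"]}, "is_correct": false}',
]
print(accuracy_by_genre(read_records(sample_lines)))
# → {'Biography': 1.0, 'Drama': 0.5}
```

To run this against real output, pass an open file handle for one of the result files to `read_records` in place of `sample_lines`.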