ReasonScape Runner (runner.py)
runner.py executes configurations in configs/ by applying a template from templates/ and a sampler from samplers/ to perform inference, writing the output to results/.
usage: runner.py [-h] --config CONFIG --template TEMPLATE
[--seed SEED] [--precision PRECISION] [--density DENSITY]
[--degree DEGREE] --model MODEL --sampler SAMPLER
--apibase APIBASE [--apikey APIKEY] [--parallel PARALLEL]
[--threads THREADS] [--timeout TIMEOUT] [--cache CACHE]
[--shuffle] [--offline]
Unified Experiment Runner
options:
-h, --help show this help message and exit
--config CONFIG YAML experiment configuration file
--template TEMPLATE Prompt template file (must render a JSON list of messages)
--seed SEED Random seed (default: 42)
--precision PRECISION Precision level to use from the config (defaults to first defined)
--density DENSITY Grid sampling density (normal|lowdef|corner, default: normal)
--degree DEGREE Degree value applied to tasks that lack one (default: 0)
--model MODEL Model identifier for the LLM
--sampler SAMPLER JSON file with sampler parameters
--apibase APIBASE Base URL of the API (trailing “/v1” is added automatically)
--apikey APIKEY API key (defaults to $OPENAI_API_KEY)
--parallel PARALLEL Concurrent completions per step (default: 1)
--threads THREADS Concurrent steps (default: 1)
--timeout TIMEOUT Request timeout in seconds (default: 3600)
--cache CACHE SQLite cache file (default: cache.db)
--shuffle Randomise step order for more even GPU load
--offline Generate samples only – no LLM calls, no evaluation
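As an illustration of how the required arguments fit together, the sketch below assembles one invocation in Python. The config, template, and sampler file names and the endpoint URL are hypothetical placeholders, not files known to ship with the project; only the flag names come from the usage text above.

```python
import shlex

# All paths and the endpoint are hypothetical placeholders; substitute your own.
args = [
    "python", "runner.py",
    "--config", "configs/example.yaml",             # hypothetical experiment config
    "--template", "templates/zerocot-nosys.json",   # hypothetical template path
    "--sampler", "samplers/greedy-4k.json",         # hypothetical sampler path
    "--model", "Phi-4-mini-instruct-fp16",
    "--apibase", "http://localhost:8000",           # the runner appends "/v1" itself
    "--parallel", "8",
    "--shuffle",
]

# shlex.join quotes each argument so the string can be pasted into a shell.
command = shlex.join(args)
print(command)

# To launch from Python instead: subprocess.run(args, check=True)
```

Adding --offline to the list would generate samples without contacting the endpoint, which is a cheap way to check a new configuration before spending inference time on it.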
Manifold Enhancement Arguments
The runner provides several arguments that affect task generation and difficulty:
- --precision PRECISION: selects a named precision block from the experiment YAML. If omitted, the first block is used. The chosen block must contain a count field; offline mode aborts without it.
- --density DENSITY: grid sampling density for manifold tasks. Options: normal (default), lowdef, corner. The value is applied only when a task definition does not already specify a density key.
- --degree DEGREE: degree value applied to tasks that lack a degree entry. The override is used only if the task JSON does not already contain degree.
- Task modes (value of mode in the experiment YAML):
  - grid: explicit parameter grid (default).
  - list: list of concrete parameter dictionaries.
  - manifold: high-level description resolved to a grid via resolve_manifold.
  - static: deprecated (raises an exception).
- Adaptive stopping criteria (optional fields in a task definition):
  - targetci: desired half-width of the 95% confidence interval.
  - targetciht: tighter half-width used when truncation is high.
  - abortht: abort the step if (truncated / total) > abortht.
  - backoffht: switch to targetciht when truncation exceeds this value.
  - minci: lower bound for the confidence target.
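The fallback rule for --density and --degree (the command-line value applies only when the task does not define its own) can be sketched as follows. The function name apply_cli_defaults and the "name" key are made up for illustration; this is not the runner's actual code.

```python
def apply_cli_defaults(task: dict, density: str = "normal", degree: float = 0) -> dict:
    """Return a copy of `task` with CLI values filled in only where keys are missing.

    Hypothetical helper: it mirrors the documented precedence, not the
    runner's real implementation.
    """
    resolved = dict(task)                     # leave the caller's dict untouched
    resolved.setdefault("density", density)   # task-level density wins if present
    resolved.setdefault("degree", degree)     # task-level degree wins if present
    return resolved


# A task that pins its own density keeps it; the missing degree is filled in.
explicit = apply_cli_defaults({"name": "movies", "density": "corner"},
                              density="lowdef", degree=1)
# A task with neither key receives both command-line values.
implicit = apply_cli_defaults({"name": "movies"}, density="lowdef", degree=1)

print(explicit)
print(implicit)
```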
These parameters together let you control difficulty, sampling granularity, and when the runner stops collecting more samples.
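To make the interaction between the stopping fields concrete, here is a minimal sketch of a per-step decision. It assumes a normal-approximation interval for accuracy, which may differ from the interval the runner actually computes. The function names are hypothetical, and minci is left out because its exact role is not specified above.

```python
import math
from typing import Optional


def half_width_95(correct: int, total: int) -> float:
    """Half-width of a 95% normal-approximation interval for an accuracy estimate."""
    if total == 0:
        return float("inf")
    p = correct / total
    return 1.96 * math.sqrt(p * (1.0 - p) / total)


def next_action(correct: int, total: int, truncated: int, *,
                targetci: float,
                targetciht: Optional[float] = None,
                abortht: Optional[float] = None,
                backoffht: Optional[float] = None) -> str:
    """Return "abort", "stop", or "continue" for one step (illustrative only)."""
    if total == 0:
        return "continue"
    truncation = truncated / total
    # abortht: give up on the step when too many responses are truncated.
    if abortht is not None and truncation > abortht:
        return "abort"
    # backoffht: above this truncation ratio, use targetciht as the target.
    target = targetci
    if backoffht is not None and targetciht is not None and truncation > backoffht:
        target = targetciht
    return "stop" if half_width_95(correct, total) <= target else "continue"


print(next_action(90, 100, 0, targetci=0.05))                # continue
print(next_action(180, 200, 0, targetci=0.05))               # stop
print(next_action(50, 100, 60, targetci=0.05, abortht=0.5))  # abort
```

One caveat of this approximation: the half-width collapses to zero when every sample so far is correct (or every one is wrong), so a practical implementation would also enforce a minimum sample count or use an interval that behaves better at the extremes.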
Execution Modes
- --shuffle: randomizes step execution order to spread GPU workload more evenly across parallel execution.
- --offline: generates test samples without performing LLM inference or evaluation; useful for validating test generation and for creating datasets. It is a plain flag and takes no argument.
Runner Output Schema
The resulting files in results/<timestamp>_<experiment>_<model>_<template>_<sampler>/ contain one record per line. A single record is shown here, expanded across multiple lines for readability:
{
"task": "000_movies_choice_count-3_reference_count-4",
"version": 1,
"base_task": "movies",
"key": "000_movies_choice_count-3_reference_count-4-0",
"input": "Which movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge",
"target": "(C)",
"response_enum": [
"(A)",
"(B)",
"(C)"
],
"genresult": {
"reference_movies": [
"12 Years a Slave",
"Braveheart",
"Ford v Ferrari"
],
"choices": [
"Dark Phoenix",
"The Incredible Hulk",
"Hacksaw Ridge"
],
"selected_genres": [
"Biography",
"Drama"
],
"selected_themes": []
},
"messages": [
{
"role": "user",
"content": "Given a list of movies a user has watched and liked, recommend a similar film. Think step by step, then respond with Final Answer: (A) - (Z)\nWhich movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge"
}
],
"answer": "To recommend a similar film, let's analyze the given movies and the options:\n\n1. **12 Years a Slave**: A historical drama about the brutal life of a slave in the 19th century.\n2. **Braveheart**: A historical drama about William Wallace, a Scottish warrior who led a rebellion against English rule in the 13th century.\n3. **Ford v Ferrari: Ferrari**: A biographical drama about the rivalry between Henry Ford and Enzo Ferrari, focusing on the development of the Ford Model T and Ferrari's racing cars.\n\nAll three movies are historical dramas with strong emotional and dramatic elements, focusing on significant historical events and figures.\n\nNow, let's evaluate the options:\n\n(A) **Dark Phoenix**: A superhero film about the transformation of Jean Grey into the Dark Phoenix, focusing on a fictional character and a sci-fi theme.\n(B) **The Incredible Hulk**: A superhero film about Bruce Banner turning into the Hulk, focusing on a fictional character and a sci-fi theme.\n(C) **Hacksaw Ridge**: A historical drama about Desmond Doss, a conscientious objector who saved many lives during World War II.\n\nGiven the preference for historical dramas, the most similar film to the group would be:\n\n(C) **Hacksaw Ridge**\n\nFinal Answer: (C) Hacksaw Ridge",
"usage": {
"completion_tokens": 269,
"prompt_tokens": 76,
"prompt_tokens_details": null,
"total_tokens": 345
},
"timings": {
"termination_reason": null,
"finish_reason": "stop"
},
"step": 0,
"params": {
"reference_count": 4,
"choice_count": 3,
"count": 128
},
"model": "Phi-4-mini-instruct-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-4k",
"guess_chance": 0.3333333333333333,
"normalized_answer": "(C) Hacksaw Ridge",
"extracted_answer": "(C)",
"is_valid": true,
"is_correct": true,
"is_truncated": false
}
The exact fields in genresult depend on the generator. They make it possible to filter the results in post-processing and look for patterns that arise from the generator's architecture or from sub-test definitions.
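As one way to consume these records, the sketch below aggregates per-task counts from line-delimited JSON and shows a genresult-based filter. The summarize function and its keep parameter are hypothetical. The field names come from the record above, and the sample records are synthetic stand-ins rather than real results.

```python
import json
from collections import defaultdict


def summarize(lines, keep=lambda rec: True):
    """Aggregate per-base_task counts from an iterable of JSON-encoded records.

    `keep` is an optional predicate for post-process filtering, for example
    on generator-specific fields inside "genresult".
    """
    stats = defaultdict(lambda: {"total": 0, "correct": 0, "invalid": 0, "truncated": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        rec = json.loads(line)
        if not keep(rec):
            continue
        s = stats[rec["base_task"]]
        s["total"] += 1
        s["correct"] += bool(rec.get("is_correct"))
        s["invalid"] += not rec.get("is_valid", True)
        s["truncated"] += bool(rec.get("is_truncated"))
    return dict(stats)


# Synthetic records that carry only the fields used by the summary.
sample = [
    json.dumps({"base_task": "movies", "is_correct": True, "is_valid": True,
                "is_truncated": False, "genresult": {"selected_genres": ["Biography", "Drama"]}}),
    json.dumps({"base_task": "movies", "is_correct": False, "is_valid": True,
                "is_truncated": True, "genresult": {"selected_genres": ["Comedy"]}}),
]

summary = summarize(sample)
dramas = summarize(sample, keep=lambda r: "Drama" in r["genresult"].get("selected_genres", []))

for task, s in summary.items():
    print(task, s["correct"] / s["total"])
```

For a real run, pass an open file handle instead of the sample list, for example `with open(path) as fh: summarize(fh)`, where path points at a file inside the results directory.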