
ReasonScape Runner (runner.py)

runner.py executes configurations in configs/ by applying a template from templates/ and a sampler from samplers/ to perform inference, writing the output to results/.

usage: runner.py [-h] --config CONFIG --template TEMPLATE
                 [--seed SEED] [--precision PRECISION] [--density DENSITY]
                 [--degree DEGREE] --model MODEL --sampler SAMPLER
                 --apibase APIBASE [--apikey APIKEY] [--parallel PARALLEL]
                 [--threads THREADS] [--timeout TIMEOUT] [--cache CACHE]
                 [--shuffle] [--offline]

Unified Experiment Runner

options:
  -h, --help            show this help message and exit
  --config CONFIG       YAML experiment configuration file
  --template TEMPLATE   Prompt template file (must render a JSON list of messages)
  --seed SEED           Random seed (default: 42)
  --precision PRECISION Precision level to use from the config (defaults to first defined)
  --density DENSITY     Grid sampling density (normal|lowdef|corner, default: normal)
  --degree DEGREE       Degree value applied to tasks that lack one (default: 0)
  --model MODEL         Model identifier for the LLM
  --sampler SAMPLER     JSON file with sampler parameters
  --apibase APIBASE     Base URL of the API (trailing “/v1” is added automatically)
  --apikey APIKEY       API key (defaults to $OPENAI_API_KEY)
  --parallel PARALLEL   Concurrent completions per step (default: 1)
  --threads THREADS     Concurrent steps (default: 1)
  --timeout TIMEOUT     Request timeout in seconds (default: 3600)
  --cache CACHE         SQLite cache file (default: cache.db)
  --shuffle             Randomise step order for more even GPU load
  --offline             Generate samples only – no LLM calls, no evaluation
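As a rough illustration of how the required and optional flags above combine, the sketch below assembles an argument list for runner.py from Python. The helper name build_runner_argv, the file paths, the file extensions, and the API URL are all hypothetical placeholders; only the flag names come from the help text above.

```python
import sys


def build_runner_argv(config, template, model, sampler, apibase,
                      *, parallel=None, shuffle=False, offline=False):
    """Assemble a runner.py command line (hypothetical helper, not part of ReasonScape)."""
    argv = [
        sys.executable, "runner.py",
        "--config", config,        # required: YAML experiment configuration
        "--template", template,    # required: prompt template file
        "--model", model,          # required: model identifier
        "--sampler", sampler,      # required: JSON sampler parameters
        "--apibase", apibase,      # required: base URL of the API
    ]
    if parallel is not None:
        argv += ["--parallel", str(parallel)]
    if shuffle:
        argv.append("--shuffle")
    if offline:
        argv.append("--offline")   # a plain flag: it takes no value
    return argv


# All paths and the URL below are placeholders chosen for illustration.
argv = build_runner_argv(
    config="configs/example.yaml",
    template="templates/zerocot-nosys.json",
    model="Phi-4-mini-instruct-fp16",
    sampler="samplers/greedy-4k.json",
    apibase="http://localhost:8000",
    parallel=4,
    offline=True,
)
print(" ".join(argv[1:]))
```

The resulting list could be handed to subprocess.run(argv, check=True) from the directory that contains runner.py.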

Config file format

For details on task modes (grid, list, manifold) and adaptive stopping criteria (targetci, targetciht, abortht, etc.), see the Configs documentation.

Manifold Enhancement Arguments

The runner provides several arguments that affect task generation and difficulty:

  • --precision PRECISION – selects a named precision block from the experiment YAML.

If omitted, the first block is used. The chosen block must contain a count field; offline mode aborts without it.

  • --density DENSITY – grid sampling density for manifold tasks.

Options: normal (default), lowdef, corner.

The value is only applied when a task definition does not already specify a density key.

  • --degree DEGREE – degree value applied to tasks that lack a degree entry.

The override is used only if the task JSON does not already contain degree.
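The fallback rules described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: it assumes the precision blocks are available as a name-to-block mapping in definition order and that each task is a plain dict, and the names select_precision, apply_task_defaults, and the sample block names are hypothetical.

```python
def select_precision(precisions, name=None):
    """Return (name, block); fall back to the first defined block when name is None."""
    if not precisions:
        raise ValueError("config defines no precision blocks")
    if name is None:
        name = next(iter(precisions))  # dicts preserve definition order
    if name not in precisions:
        raise KeyError(f"unknown precision: {name}")
    block = precisions[name]
    if "count" not in block:
        raise ValueError(f"precision block {name!r} has no count field")
    return name, block


def apply_task_defaults(task, density="normal", degree=0):
    """Fill in density/degree only where the task does not already define them."""
    resolved = dict(task)  # copy so the original task definition is untouched
    resolved.setdefault("density", density)
    resolved.setdefault("degree", degree)
    return resolved


# Hypothetical precision blocks and task definitions for illustration.
precisions = {"low": {"count": 32}, "high": {"count": 128}}
print(select_precision(precisions))           # no --precision given
print(select_precision(precisions, "high"))   # --precision high

task_with_degree = {"name": "movies", "degree": 2}
print(apply_task_defaults(task_with_degree, density="corner", degree=1))
```

In this sketch the task's own degree of 2 survives, while density is taken from the command-line value because the task did not set one.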

Execution Modes

  • --shuffle: Randomizes step execution order so that GPU load is spread more evenly during parallel execution
  • --offline: Generates test samples without performing LLM inference or evaluation; useful for validating test generation and creating datasets

Runner Output Schema

The resulting files in results/<timestamp>_<experiment>_<model>_<template>_<sampler>/ contain one JSON record per line (pretty-printed here for readability):

{
    "task": "000_movies_choice_count-3_reference_count-4",
    "version": 1,
    "base_task": "movies",
    "key": "000_movies_choice_count-3_reference_count-4-0",
    "input": "Which movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge",
    "target": "(C)",
    "response_enum": [
        "(A)",
        "(B)",
        "(C)"
    ],
    "genresult": {
        "reference_movies": [
            "12 Years a Slave",
            "Braveheart",
            "Ford v Ferrari"
        ],
        "choices": [
            "Dark Phoenix",
            "The Incredible Hulk",
            "Hacksaw Ridge"
        ],
        "selected_genres": [
            "Biography",
            "Drama"
        ],
        "selected_themes": []
    },
    "messages": [
        {
            "role": "user",
            "content": "Given a list of movies a user has watched and liked, recommend a similar film. Think step by step, then respond with Final Answer: (A) - (Z)\nWhich movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge"
        }
    ],
    "answer": "To recommend a similar film, let's analyze the given movies and the options:\n\n1. **12 Years a Slave**: A historical drama about the brutal life of a slave in the 19th century.\n2. **Braveheart**: A historical drama about William Wallace, a Scottish warrior who led a rebellion against English rule in the 13th century.\n3. **Ford v Ferrari: Ferrari**: A biographical drama about the rivalry between Henry Ford and Enzo Ferrari, focusing on the development of the Ford Model T and Ferrari's racing cars.\n\nAll three movies are historical dramas with strong emotional and dramatic elements, focusing on significant historical events and figures.\n\nNow, let's evaluate the options:\n\n(A) **Dark Phoenix**: A superhero film about the transformation of Jean Grey into the Dark Phoenix, focusing on a fictional character and a sci-fi theme.\n(B) **The Incredible Hulk**: A superhero film about Bruce Banner turning into the Hulk, focusing on a fictional character and a sci-fi theme.\n(C) **Hacksaw Ridge**: A historical drama about Desmond Doss, a conscientious objector who saved many lives during World War II.\n\nGiven the preference for historical dramas, the most similar film to the group would be:\n\n(C) **Hacksaw Ridge**\n\nFinal Answer: (C) Hacksaw Ridge",
    "usage": {
        "completion_tokens": 269,
        "prompt_tokens": 76,
        "prompt_tokens_details": null,
        "total_tokens": 345
    },
    "timings": {
        "termination_reason": null,
        "finish_reason": "stop"
    },
    "step": 0,
    "params": {
        "reference_count": 4,
        "choice_count": 3,
        "count": 128
    },
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "sampler": "greedy-4k",
    "guess_chance": 0.3333333333333333,
    "normalized_answer": "(C) Hacksaw Ridge",
    "extracted_answer": "(C)",
    "is_valid": true,
    "is_correct": true,
    "is_truncated": false
}

The exact fields in genresult depend on the generator. They make it possible to filter the data in post-processing and look for patterns that arise from the generator's architecture or from sub-test definitions.
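As one way to use these records, the sketch below parses one-record-per-line output and groups accuracy by a genresult field. It assumes each line is a complete JSON object carrying the is_correct and genresult fields shown above; the helper names and the in-memory sample lines are hypothetical, and real input would come from a file in the results directory.

```python
import json
from collections import defaultdict


def load_records(lines):
    """Parse an iterable of JSON lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def accuracy_by(records, key_fn):
    """Group records by key_fn(record) and return the fraction correct per group."""
    totals = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for rec in records:
        key = key_fn(rec)
        totals[key][0] += 1 if rec.get("is_correct") else 0
        totals[key][1] += 1
    return {key: correct / total for key, (correct, total) in totals.items()}


def genre_key(rec):
    """Use the generator's selected_genres (movies task) as a hashable group key."""
    return tuple(rec["genresult"].get("selected_genres", []))


# Hypothetical, heavily trimmed records standing in for a results file.
sample_lines = [
    '{"is_correct": true, "genresult": {"selected_genres": ["Biography", "Drama"]}}',
    '{"is_correct": false, "genresult": {"selected_genres": ["Biography", "Drama"]}}',
    '{"is_correct": false, "genresult": {"selected_genres": ["Comedy"]}}',
    '',
]

records = load_records(sample_lines)
by_genre = accuracy_by(records, genre_key)
for genres, acc in by_genre.items():
    print(genres, acc)
```

To run this against real output, pass an open file handle to load_records instead of sample_lines. Since genresult differs between generators, the key function has to be written per task.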