ReasonScape Runner (runner.py)

runner.py executes configurations in configs/ by applying a template from templates/ and a sampler from samplers/ to perform inference, writing the output to results/.

usage: runner.py [-h] --config CONFIG --template TEMPLATE
                 [--seed SEED] [--precision PRECISION] [--density DENSITY]
                 [--degree DEGREE] --model MODEL --sampler SAMPLER
                 --apibase APIBASE [--apikey APIKEY] [--parallel PARALLEL]
                 [--threads THREADS] [--timeout TIMEOUT] [--cache CACHE]
                 [--shuffle] [--offline]

Unified Experiment Runner

options:
  -h, --help            show this help message and exit
  --config CONFIG       YAML experiment configuration file
  --template TEMPLATE   Prompt template file (must render a JSON list of messages)
  --seed SEED           Random seed (default: 42)
  --precision PRECISION Precision level to use from the config (defaults to first defined)
  --density DENSITY     Grid sampling density (normal|lowdef|corner, default: normal)
  --degree DEGREE       Degree value applied to tasks that lack one (default: 0)
  --model MODEL         Model identifier for the LLM
  --sampler SAMPLER     JSON file with sampler parameters
  --apibase APIBASE     Base URL of the API (trailing “/v1” is added automatically)
  --apikey APIKEY       API key (defaults to $OPENAI_API_KEY)
  --parallel PARALLEL   Concurrent completions per step (default: 1)
  --threads THREADS     Concurrent steps (default: 1)
  --timeout TIMEOUT     Request timeout in seconds (default: 3600)
  --cache CACHE         SQLite cache file (default: cache.db)
  --shuffle             Randomise step order for more even GPU load
  --offline             Generate samples only – no LLM calls, no evaluation
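
The --apibase and --apikey behaviour described in the help text can be sketched as follows. This is a minimal illustration, not the runner's actual code: the helper name resolve_api_settings is hypothetical, and whether runner.py checks for an existing "/v1" suffix before appending one is an assumption.

```python
import os


def resolve_api_settings(apibase, apikey=None, env=None):
    """Hypothetical helper mirroring the documented --apibase/--apikey defaults."""
    env = os.environ if env is None else env

    # Append "/v1" to the base URL. Skipping the append when the suffix is
    # already present is an assumption of this sketch.
    base = apibase.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"

    # Fall back to $OPENAI_API_KEY when no key is passed explicitly.
    key = apikey if apikey is not None else env.get("OPENAI_API_KEY")
    return base, key
```

Passing env explicitly keeps the helper easy to test without touching the real process environment.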

Manifold Enhancement Arguments

The runner provides several arguments that affect task generation and difficulty:

  • --precision PRECISION – selects a named precision block from the experiment YAML.
    If omitted the first block is used. The chosen block must contain a count field; offline mode aborts without it.

  • --density DENSITY – grid sampling density for manifold tasks.
    Options: normal (default), lowdef, corner.
    The value is only applied when a task definition does not already specify a density key.

  • --degree DEGREE – degree value applied to tasks that lack a degree entry.
    The override is used only if the task JSON does not already contain degree.

  • Task modes (value of mode in the experiment YAML):

      • grid – explicit parameter grid (default).
      • list – list of concrete parameter dictionaries.
      • manifold – high‑level description resolved to a grid via resolve_manifold.
      • static – deprecated (raises an exception).

  • Adaptive stopping criteria (optional fields in a task definition):

      • targetci – desired half‑width of the 95 % confidence interval.
      • targetciht – tighter half‑width used when truncation is high.
      • abortht – abort step if (truncated / total) > abortht.
      • backoffht – switch to targetciht when truncation exceeds this value.
      • minci – lower bound for the confidence target.
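
The fallback rules for --precision, --density and --degree described above can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: the function names are hypothetical, and the top-level config key "precision" is a guess at how the named blocks are stored in the YAML.

```python
def select_precision(config, name=None):
    """Pick a named precision block, defaulting to the first one defined."""
    blocks = config["precision"]  # assumed key name for the precision blocks
    if name is None:
        name = next(iter(blocks))  # dicts keep definition order
    if name not in blocks:
        raise KeyError(f"unknown precision level: {name}")
    block = blocks[name]
    # The text says offline mode aborts without "count"; this sketch
    # rejects the block in every mode for simplicity.
    if "count" not in block:
        raise ValueError(f"precision block {name!r} has no 'count' field")
    return name, block


def apply_cli_defaults(task, density="normal", degree=0):
    """Fill in density/degree only where the task does not define them."""
    merged = dict(task)  # leave the caller's task definition untouched
    merged.setdefault("density", density)
    merged.setdefault("degree", degree)
    return merged
```

Using setdefault makes the precedence explicit: a value in the task definition always wins over the command-line value.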

These parameters together let you control difficulty, sampling granularity, and when the runner stops collecting more samples.
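
One way the adaptive stopping fields could interact is sketched below. Several points are assumptions rather than documented behaviour: the function names are hypothetical, the Wilson score interval is used here as one reasonable choice of 95 % interval (the runner may compute it differently), and minci is interpreted as a floor on the effective target.

```python
import math


def ci_half_width(correct, total, z=1.96):
    """Half-width of the Wilson score interval for an accuracy estimate."""
    if total == 0:
        return float("inf")
    p = correct / total
    spread = math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return z * spread / (1 + z * z / total)


def step_decision(correct, truncated, total, task):
    """Return 'abort', 'stop' or 'continue' for one step (hypothetical)."""
    if total == 0:
        return "continue"
    trunc_ratio = truncated / total

    # abortht: give up on the step when too many responses are truncated.
    abortht = task.get("abortht")
    if abortht is not None and trunc_ratio > abortht:
        return "abort"

    target = task.get("targetci")
    if target is None:
        return "continue"  # no adaptive target configured

    # backoffht: switch to targetciht once truncation exceeds the threshold.
    backoffht = task.get("backoffht")
    if backoffht is not None and trunc_ratio > backoffht and "targetciht" in task:
        target = task["targetciht"]

    # minci: assumed to be a floor on the target half-width.
    if "minci" in task:
        target = max(target, task["minci"])

    return "stop" if ci_half_width(correct, total) <= target else "continue"
```

With 64 correct answers out of 128, the Wilson half-width is roughly 0.085, so a targetci of 0.1 would stop the step while a targetci of 0.05 would keep sampling.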

Execution Modes

  • --shuffle: Randomizes the step execution order so that GPU load is spread more evenly when steps run in parallel.
  • --offline: Generates test samples without performing LLM inference or evaluation; useful for validating test generation and for creating datasets.
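
The effect of --shuffle can be illustrated with a small sketch. The function name order_steps is hypothetical, and seeding the shuffle from --seed is an assumption; the runner may randomise the order differently.

```python
import random


def order_steps(steps, shuffle=False, seed=42):
    """Return the steps in execution order, optionally shuffled."""
    ordered = list(steps)
    if shuffle:
        # A dedicated, seeded generator keeps the order reproducible
        # without disturbing the global random state.
        random.Random(seed).shuffle(ordered)
    return ordered
```

Shuffling mixes cheap and expensive steps, so concurrent workers are less likely to all hit the longest prompts at the same time.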

Runner Output Schema

The resulting files in results/<timestamp>_<experiment>_<model>_<template>_<sampler>/ contain one JSON record per line. A single record, pretty-printed here for readability:

{
    "task": "000_movies_choice_count-3_reference_count-4",
    "version": 1,
    "base_task": "movies",
    "key": "000_movies_choice_count-3_reference_count-4-0",
    "input": "Which movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge",
    "target": "(C)",
    "response_enum": [
        "(A)",
        "(B)",
        "(C)"
    ],
    "genresult": {
        "reference_movies": [
            "12 Years a Slave",
            "Braveheart",
            "Ford v Ferrari"
        ],
        "choices": [
            "Dark Phoenix",
            "The Incredible Hulk",
            "Hacksaw Ridge"
        ],
        "selected_genres": [
            "Biography",
            "Drama"
        ],
        "selected_themes": []
    },
    "messages": [
        {
            "role": "user",
            "content": "Given a list of movies a user has watched and liked, recommend a similar film. Think step by step, then respond with Final Answer: (A) - (Z)\nWhich movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge"
        }
    ],
    "answer": "To recommend a similar film, let's analyze the given movies and the options:\n\n1. **12 Years a Slave**: A historical drama about the brutal life of a slave in the 19th century.\n2. **Braveheart**: A historical drama about William Wallace, a Scottish warrior who led a rebellion against English rule in the 13th century.\n3. **Ford v Ferrari: Ferrari**: A biographical drama about the rivalry between Henry Ford and Enzo Ferrari, focusing on the development of the Ford Model T and Ferrari's racing cars.\n\nAll three movies are historical dramas with strong emotional and dramatic elements, focusing on significant historical events and figures.\n\nNow, let's evaluate the options:\n\n(A) **Dark Phoenix**: A superhero film about the transformation of Jean Grey into the Dark Phoenix, focusing on a fictional character and a sci-fi theme.\n(B) **The Incredible Hulk**: A superhero film about Bruce Banner turning into the Hulk, focusing on a fictional character and a sci-fi theme.\n(C) **Hacksaw Ridge**: A historical drama about Desmond Doss, a conscientious objector who saved many lives during World War II.\n\nGiven the preference for historical dramas, the most similar film to the group would be:\n\n(C) **Hacksaw Ridge**\n\nFinal Answer: (C) Hacksaw Ridge",
    "usage": {
        "completion_tokens": 269,
        "prompt_tokens": 76,
        "prompt_tokens_details": null,
        "total_tokens": 345
    },
    "timings": {
        "termination_reason": null,
        "finish_reason": "stop"
    },
    "step": 0,
    "params": {
        "reference_count": 4,
        "choice_count": 3,
        "count": 128
    },
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "sampler": "greedy-4k",
    "guess_chance": 0.3333333333333333,
    "normalized_answer": "(C) Hacksaw Ridge",
    "extracted_answer": "(C)",
    "is_valid": true,
    "is_correct": true,
    "is_truncated": false
}

The exact fields in genresult depend on the generator. They make it possible to filter results in post-processing and look for patterns that arise from the generator architecture or from sub-test definitions.
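
As an example of such post-processing, the sketch below reads records line by line and groups accuracy by a genresult field. The helper names are hypothetical, and selected_genres is specific to the movies generator shown in the record above; other generators will expose different fields.

```python
import json
from collections import defaultdict


def load_records(lines):
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def accuracy_by(records, key_fn):
    """Group records by every key from key_fn(record).

    Returns {key: (correct, total)}.
    """
    counts = defaultdict(lambda: [0, 0])
    for rec in records:
        for key in key_fn(rec):
            counts[key][0] += int(bool(rec.get("is_correct")))
            counts[key][1] += 1
    return {key: tuple(value) for key, value in counts.items()}


def genres(rec):
    """Genres recorded by the movies generator (empty for other tasks)."""
    return rec.get("genresult", {}).get("selected_genres", [])


def summarise_file(path):
    """Load a results file and report accuracy per genre."""
    with open(path, encoding="utf-8") as fh:
        return accuracy_by(load_records(fh), genres)
```

A record with several genres counts once towards each of them, so the per-genre totals can add up to more than the number of records.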