ReasonScape Runner (runner.py)

runner.py executes configurations in configs/ by applying a template from templates/ and a sampler from sampler/ to perform inference, writing the output to results/.

usage: runner.py [-h] --config CONFIG --template TEMPLATE [--seed SEED]
                 [--precision PRECISION] [--density DENSITY] [--degree DEGREE]
                 --model MODEL --sampler SAMPLER --apibase APIBASE
                 [--apikey APIKEY] [--parallel PARALLEL] [--threads THREADS]
                 [--timeout TIMEOUT] [--cache CACHE] [--shuffle]
                 [--offline ROUNDS]

Unified Experiment Runner

options:
  -h, --help            show this help message and exit
  --config CONFIG       YAML experiment configuration file
  --template TEMPLATE   Prompt template file
  --seed SEED           Random seed
  --precision PRECISION
                        Precision level to use from config (defaults to first defined)
  --density DENSITY     Grid sampling density value: normal (default), lowdef, corner
  --degree DEGREE       Degree value to apply to tasks that do not already have degree specified
  --model MODEL         Model to use
  --sampler SAMPLER     Sampler parameters file
  --apibase APIBASE     API base URL
  --apikey APIKEY       API key
  --parallel PARALLEL   Parallel completions per step
  --threads THREADS     Parallel steps
  --timeout TIMEOUT     Request timeout
  --cache CACHE         Cache database file
  --shuffle             Shuffle steps into random order for even GPU workload
  --offline ROUNDS      Offline mode: generate ROUNDS worth of samples without LLM inference or evaluation

Manifold Enhancement Arguments

The runner supports several arguments for controlling difficulty manifold generation:

--precision: Selects which precision level to use from the config file's precision section. If not specified, uses the first defined precision level.
--density: Controls grid sampling density for manifold tasks. Options are:
    normal (default): Standard grid sampling
    lowdef: Lower definition sampling for faster evaluation
    corner: Corner sampling for boundary analysis
--degree: Sets the degree value for tasks that don't already specify one. Higher degrees typically increase task difficulty.

Execution Modes

--shuffle: Randomizes the order in which steps execute, so GPU workload is spread more evenly across parallel steps.
--offline ROUNDS: Generates ROUNDS worth of test samples without performing LLM inference or evaluation. This is useful for validating test generation and for creating datasets.
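
The flags described above can be combined into a single invocation. The following is a minimal sketch that assembles such a command line in Python. The helper `build_runner_command`, the file paths, the API base URL, and the `python runner.py` launch form are all assumptions made for illustration; only the flag names and the density value come from the usage text.

```python
import shlex


def build_runner_command(config, template, model, sampler, apibase,
                         precision=None, density=None, degree=None,
                         shuffle=False, offline_rounds=None):
    """Assemble an argv list for runner.py (hypothetical helper)."""
    # Required arguments, as listed in the usage text.
    cmd = ["python", "runner.py",
           "--config", config,
           "--template", template,
           "--model", model,
           "--sampler", sampler,
           "--apibase", apibase]
    # Optional manifold arguments are only added when set.
    if precision is not None:
        cmd += ["--precision", str(precision)]
    if density is not None:
        cmd += ["--density", str(density)]
    if degree is not None:
        cmd += ["--degree", str(degree)]
    # Execution modes.
    if shuffle:
        cmd.append("--shuffle")
    if offline_rounds is not None:
        cmd += ["--offline", str(offline_rounds)]
    return cmd


cmd = build_runner_command(
    config="configs/example.yaml",           # hypothetical path
    template="templates/zerocot-nosys.json",  # hypothetical path
    model="Phi-4-mini-instruct-fp16",
    sampler="sampler/greedy-4k.json",         # hypothetical path
    apibase="http://localhost:8000/v1",       # hypothetical endpoint
    density="corner",
    shuffle=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.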

Runner Output Schema

The resulting files in results/<timestamp>_<experiment>_<model>_<template>_<sampler>/ contain one JSON record per line. A single record, pretty-printed here for readability:

{
    "task": "000_movies_choice_count-3_reference_count-4",
    "version": 1,
    "base_task": "movies",
    "key": "000_movies_choice_count-3_reference_count-4-0",
    "input": "Which movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge",
    "target": "(C)",
    "response_enum": [
        "(A)",
        "(B)",
        "(C)"
    ],
    "genresult": {
        "reference_movies": [
            "12 Years a Slave",
            "Braveheart",
            "Ford v Ferrari"
        ],
        "choices": [
            "Dark Phoenix",
            "The Incredible Hulk",
            "Hacksaw Ridge"
        ],
        "selected_genres": [
            "Biography",
            "Drama"
        ],
        "selected_themes": []
    },
    "messages": [
        {
            "role": "user",
            "content": "Given a list of movies a user has watched and liked, recommend a similar film. Think step by step, then respond with Final Answer: (A) - (Z)\nWhich movie belongs with this group: 12 Years a Slave, Braveheart, Ford v Ferrari?\n\nOptions:\n(A) Dark Phoenix\n(B) The Incredible Hulk\n(C) Hacksaw Ridge"
        }
    ],
    "answer": "To recommend a similar film, let's analyze the given movies and the options:\n\n1. **12 Years a Slave**: A historical drama about the brutal life of a slave in the 19th century.\n2. **Braveheart**: A historical drama about William Wallace, a Scottish warrior who led a rebellion against English rule in the 13th century.\n3. **Ford v Ferrari: Ferrari**: A biographical drama about the rivalry between Henry Ford and Enzo Ferrari, focusing on the development of the Ford Model T and Ferrari's racing cars.\n\nAll three movies are historical dramas with strong emotional and dramatic elements, focusing on significant historical events and figures.\n\nNow, let's evaluate the options:\n\n(A) **Dark Phoenix**: A superhero film about the transformation of Jean Grey into the Dark Phoenix, focusing on a fictional character and a sci-fi theme.\n(B) **The Incredible Hulk**: A superhero film about Bruce Banner turning into the Hulk, focusing on a fictional character and a sci-fi theme.\n(C) **Hacksaw Ridge**: A historical drama about Desmond Doss, a conscientious objector who saved many lives during World War II.\n\nGiven the preference for historical dramas, the most similar film to the group would be:\n\n(C) **Hacksaw Ridge**\n\nFinal Answer: (C) Hacksaw Ridge",
    "usage": {
        "completion_tokens": 269,
        "prompt_tokens": 76,
        "prompt_tokens_details": null,
        "total_tokens": 345
    },
    "timings": {
        "termination_reason": null,
        "finish_reason": "stop"
    },
    "step": 0,
    "params": {
        "reference_count": 4,
        "choice_count": 3,
        "count": 128
    },
    "model": "Phi-4-mini-instruct-fp16",
    "template": "zerocot-nosys",
    "sampler": "greedy-4k",
    "guess_chance": 0.3333333333333333,
    "normalized_answer": "(C) Hacksaw Ridge",
    "extracted_answer": "(C)",
    "is_valid": true,
    "is_correct": true,
    "is_truncated": false
}
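
Because each line is a standalone JSON record, the scoring fields can be aggregated with a few lines of Python. The sketch below is one possible way to do that, not the project's own evaluation code. The helpers `load_records` and `summarize` are hypothetical, and the two sample records are abbreviated: the first reuses values from the record above, the second is invented for illustration.

```python
import json


def load_records(lines):
    """Parse one JSON record per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records):
    """Aggregate the boolean scoring fields across records."""
    n = len(records)
    if n == 0:
        return {"count": 0}
    return {
        "count": n,
        "accuracy": sum(r["is_correct"] for r in records) / n,
        "valid_rate": sum(r["is_valid"] for r in records) / n,
        "truncated_rate": sum(r["is_truncated"] for r in records) / n,
        "mean_guess_chance": sum(r["guess_chance"] for r in records) / n,
    }


# Abbreviated sample records; a real file has the full schema shown above.
sample_lines = [
    json.dumps({"base_task": "movies", "guess_chance": 0.3333333333333333,
                "is_valid": True, "is_correct": True, "is_truncated": False}),
    json.dumps({"base_task": "movies", "guess_chance": 0.25,  # invented record
                "is_valid": False, "is_correct": False, "is_truncated": True}),
]

records = load_records(sample_lines)
summary = summarize(records)
print(summary)
```

To run this against real output, replace `sample_lines` with the lines of an opened results file. Records produced with --offline may lack the scoring fields, since that mode skips inference and evaluation.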

The exact fields in genresult depend on the generator. They allow us to filter the data in post-processing and look for patterns that arise from the generator architecture or sub-test definitions.
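
As an illustration of this kind of post-processing, the sketch below groups accuracy by the selected_genres field that the movies generator emits in the record above. The helper `accuracy_by_genre` is hypothetical, other generators will expose different genresult fields, and the second sample record is invented for the example.

```python
from collections import defaultdict


def accuracy_by_genre(records):
    """Return {genre: accuracy} using genresult.selected_genres."""
    totals = defaultdict(lambda: [0, 0])  # genre -> [correct, total]
    for record in records:
        # Use .get() because genresult fields vary by generator.
        genres = record.get("genresult", {}).get("selected_genres", [])
        for genre in genres:
            totals[genre][1] += 1
            if record.get("is_correct"):
                totals[genre][0] += 1
    return {genre: correct / total
            for genre, (correct, total) in totals.items()}


# Abbreviated records: the first mirrors the sample above,
# the second is invented for illustration.
records = [
    {"genresult": {"selected_genres": ["Biography", "Drama"]},
     "is_correct": True},
    {"genresult": {"selected_genres": ["Drama"]},
     "is_correct": False},
]

by_genre = accuracy_by_genre(records)
print(by_genre)
```

The same pattern applies to any other genresult key: swap the field name and group on whatever values the generator records.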