
Experiment Configuration Reference

This document describes how to define evaluation experiments in ReasonScape. Experiment configs are the Stage 1 (Definition) component of the pipeline: they define the parametric test cases generated for model evaluation.

See architecture.md for how configs fit into the five-stage pipeline.


Overview: What is an Experiment Config?

An experiment config is a YAML file (e.g., configs/r12.yaml) that defines:

  1. Task definitions - Which evaluation tasks to run and how to sample their parameter spaces
  2. Task metadata - View definitions for analysis (grouping and filtering recipes)

The config tells runner.py exactly what test cases to generate and how to sample them efficiently. Each test case is deterministically created from coordinate seeds, enabling reproducibility and caching.
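The coordinate-seeding idea can be sketched in a few lines of Python. This is illustrative, not runner.py's actual implementation — the helper name and hashing scheme are assumptions — but any stable hash of a test case's coordinates yields the reproducibility property described above:

```python
import hashlib
import json
import random

def coordinate_seed(task: str, params: dict) -> int:
    """Derive a stable RNG seed from a test case's coordinates.

    Hypothetical helper: a stable serialization of (task, params),
    hashed, gives the same seed for the same coordinates every run.
    """
    key = json.dumps({"task": task, "params": params}, sort_keys=True)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same coordinates (in any key order) -> same seed -> same generated test case.
rng_a = random.Random(coordinate_seed("arithmetic", {"max_depth": 2, "length": 8}))
rng_b = random.Random(coordinate_seed("arithmetic", {"length": 8, "max_depth": 2}))
assert rng_a.random() == rng_b.random()
```

Because the seed depends only on the coordinates, regenerating a test case is free, which is what makes caching safe.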

Config format: YAML (not JSON)

Config location: configs/ directory

Usage: python runner.py --config configs/r12.yaml --template ... --sampler ... --model ...


Core Structure

Every experiment config has two main sections:

name: "experiment_identifier"

tasks:
  [task_definitions...]

Name

A simple string identifier for the experiment:

name: r12

This name appears in filenames and logs but doesn't affect test generation.


Task Definitions

Tasks define the evaluation scenarios and how to sample their parameter spaces. Each task maps to a generator function (in tasks/) and specifies a sampling strategy.

Task Structure

tasks:
  - name: task_identifier
    file: tasks/taskname.json
    mode: calibrate|grid
    [mode_specific_parameters...]
    label: "Human-Readable Task Name"
    views: [...]

Task Parameters

  • name: Internal identifier for the task (used in outputs and logs)
  • file: Path to task generator file (JSON format)
  • mode: Which sampling strategy to use for the parameter space (calibrate or grid)
  • label: Human-readable name for visualizations and reports
  • views (optional): View definitions for analysis

Sampling Modes

ReasonScape supports two parameter space sampling strategies. Choose based on your research question and computational budget.

Grid Mode: Cartesian Product

Define parameter ranges; the system generates all combinations:

tasks:
  - name: "arithmetic_simple"
    file: "tasks/arithmetic.json"
    mode: "grid"
    grid:
      _id: arithmetic_all
      min_number: [-9, -99]
      max_number: [9, 99]
      max_depth: [0, 1, 2, 4]
      length: [8, 16, 32]

This grid generates 2 × 2 × 4 × 3 = 48 points. All keys starting with _ (here _id) are carried into manifold{} for downstream filtering and grouping.
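The grid expansion amounts to a Cartesian product over the non-underscore keys. A minimal sketch (not runner.py's actual code) of how the example above becomes 48 points:

```python
from itertools import product

# The grid block from the YAML example above, as a Python dict.
grid = {
    "_id": "arithmetic_all",
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}

# Keys beginning with "_" are manifold metadata, not swept parameters.
manifold = {k.lstrip("_"): v for k, v in grid.items() if k.startswith("_")}
axes = {k: v for k, v in grid.items() if not k.startswith("_")}

# Cartesian product over the remaining axes.
points = [dict(zip(axes, combo)) for combo in product(*axes.values())]
print(len(points))  # -> 48
```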

Calibrate Mode: Token Targets

Define calibration targets; the system searches for parameter values that hit the desired input token counts:

tasks:
  - name: tables
    label: Tables
    file: tasks/tables.json
    mode: calibrate
    calibrate:
    - _id:          "csv_tall"
      num_rows:    { token_targets: [2000, 4000, 6000, 8000, 10000, 12000, 14000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      num_columns: { values: [4] }
      format:      { values: [1] }
      operation:   { values: [2, 3, 5] }
      filter_type: { values: [1, 2, 3, 4] }
    - _id:          "markdown_tall"
      num_rows:    { token_targets: [2000, 4000, 6000, 8000, 10000, 12000, 14000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      num_columns: { values: [4] }
      format:      { values: [2] }
      operation:   { values: [2, 3, 5] }
      filter_type: { values: [1, 2, 3, 4] }

Use case: Evaluating performance at specific input sizes when the relationship between a parameter (e.g., num_rows) and token count is nonlinear or task-dependent.

Behavior: The calibrator searches for parameter values whose generated inputs land near each token_targets value, within the given min/max bounds and guided by the slope hint. The resolved target_tokens coordinate is stored in manifold{} alongside the _id.

Calibrate Entry Parameters

  • _id: Named identifier for this calibration entry; stored in manifold.id
  • token_targets: List of input token counts to target
  • min/max: Search bounds for the calibrated parameter
  • slope: Hint for the search (expected tokens-per-unit range)
  • step: Minimum step size for the search
  • values: Fixed values for non-calibrated parameters (Cartesian product over all fixed parameters)
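The source doesn't specify the calibrator's exact algorithm; as an illustration, a bisection over a monotonically increasing tokens(param) relationship captures the idea. This sketch honors min/max and step but omits the slope hint, which would seed the starting interval:

```python
def calibrate(token_count, target, lo, hi, step):
    """Find a parameter value whose generated input is near `target` tokens.

    Illustrative bisection, assuming token_count(param) increases with param.
    Candidates stay on the lattice lo, lo+step, lo+2*step, ...
    """
    while hi - lo > step:
        mid = lo + ((hi - lo) // (2 * step)) * step  # midpoint, snapped to step
        if mid == lo:
            break
        if token_count(mid) < target:
            lo = mid
        else:
            hi = mid
    # Return whichever endpoint lands closer to the target.
    return lo if abs(token_count(lo) - target) <= abs(token_count(hi) - target) else hi

# Toy cost model: ~12 tokens per row plus a 200-token preamble.
tokens = lambda rows: 200 + 12 * rows
rows = calibrate(tokens, target=4000, lo=10, hi=1500, step=5)
print(rows, tokens(rows))  # -> 315 3980
```

In practice one such search runs per token_targets entry, and the resolved value is crossed with the fixed values parameters.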

Multiple Calibrate Entries

A single task can define multiple calibrate entries to cover different parameter regions:

tasks:
  - name: tables
    file: tasks/tables.json
    mode: calibrate
    calibrate:
    - _id: "csv_tall"
      format: { values: [1] }
      num_rows: { token_targets: [2000, 4000, 8000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      ...
    - _id: "markdown_tall"
      format: { values: [2] }
      num_rows: { token_targets: [2000, 4000, 8000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      ...

All entries are evaluated. This enables testing multiple format or structural variants in a single task run.


Views: Analysis Recipes

Views define how to slice and group results for analysis. They are named query recipes — a group_by, optional filters, and optional facet_by — that operate directly on manifold.* and params.* columns. Nothing is pre-computed; views are a config concept, not a database concept.

views:
  - view: formats
    label: "Format x Input Tokens {operation}"
    group_by: ["params.format", "manifold.target_tokens"]
    facet_by: ["params.operation"]
  - view: filters
    label: "Filter x Input Tokens {operation}"
    group_by: ["params.filter_type", "manifold.target_tokens"]
    facet_by: ["params.operation"]
  - view: project_formats
    label: "Format"
    group_by: ["params.format"]
    filters: { "manifold.target_tokens": 2000 }

View Parameters

  • view: Unique view identifier (used with --view flag in analysis tools)
  • label: Human-readable name; {param} tokens are interpolated with facet values
  • group_by: Ordered list of manifold.* or params.* columns defining the primary axes
  • facet_by (optional): Columns used to split results into separate panels or series
  • filters (optional): Fixed column values to restrict the view to a specific slice

Use case: Understanding how one or two parameters affect performance while controlling confounds — equivalent to what projections and surfaces provided previously, but expressed as a unified config-level recipe rather than pre-computed database entries.
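Applying a view recipe is just filter-then-group over flat result rows. A minimal sketch in plain Python, using the project_formats view from above (the row schema and accuracy aggregation are illustrative, not the analysis tools' actual code):

```python
from collections import defaultdict

# The project_formats view from the YAML example above.
view = {
    "group_by": ["params.format"],
    "filters": {"manifold.target_tokens": 2000},
}

# Flat result rows with manifold.*/params.* columns (synthetic data).
rows = [
    {"params.format": 1, "manifold.target_tokens": 2000, "correct": 1},
    {"params.format": 1, "manifold.target_tokens": 4000, "correct": 0},
    {"params.format": 2, "manifold.target_tokens": 2000, "correct": 0},
    {"params.format": 2, "manifold.target_tokens": 2000, "correct": 1},
]

# filters: keep only rows matching the fixed column values.
kept = [r for r in rows
        if all(r[col] == val for col, val in view.get("filters", {}).items())]

# group_by: bucket rows by the view's axes, then aggregate.
groups = defaultdict(list)
for r in kept:
    groups[tuple(r[c] for c in view["group_by"])].append(r["correct"])

accuracy = {key: sum(v) / len(v) for key, v in groups.items()}
print(accuracy)  # -> {(1,): 1.0, (2,): 0.5}
```

facet_by works the same way: one more grouping level whose keys become separate panels or series rather than axes.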


Integration with runner.py

runner.py consumes the YAML config to generate test cases:

python runner.py \
    --config configs/r12.yaml \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-16k.json \
    --model claude-opus-4-5 \
    --apibase http://localhost:3333

Config Execution Parameters

From the config file:

  • Task definitions (what tests to create)

From command-line arguments:

  • --template - Which prompting strategy (see templates-samplers.md)
  • --sampler - Which generation parameters (see templates-samplers.md)

Output

runner.py writes results as NDJSON (newline-delimited JSON) to results/:

results/
  {eval_id}/
    steps.ndjson  # One test case per line

Each line contains:

  • Test metadata (task, parameters)
  • Input prompt (system + user messages)
  • Model output
  • Token counts
  • Truncation flags
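NDJSON can be consumed one line at a time without loading the whole file. A minimal reader sketch — the field names below (task, truncated) are illustrative, not a guaranteed schema:

```python
import json
import os
import tempfile

def iter_steps(path):
    """Yield one test-case record per non-empty line of an NDJSON file."""
    with open(path) as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Demo with a tiny synthetic file standing in for results/{eval_id}/steps.ndjson.
with tempfile.NamedTemporaryFile("w", suffix=".ndjson", delete=False) as fh:
    fh.write('{"task": "tables", "truncated": false}\n')
    fh.write('{"task": "tables", "truncated": true}\n')
    path = fh.name

# e.g. count how many steps hit the sampler's output limit
truncated = sum(1 for step in iter_steps(path) if step.get("truncated"))
print(truncated)  # -> 1
os.remove(path)
```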

These raw steps are consumed by evaluate.py in Stage 3 to produce aggregated points for the PointsDB.


Complete Examples

Grid mode: configs/r12.yaml — 12 task definitions with Cartesian parameter grids and views for Stage 4/5 analysis.

Calibrate mode: configs/tables-16k.yaml — Multiple calibrate entries covering different table formats, with views for format, filter, and token-size analysis.


See Also

  • architecture.md - How Stage 1 fits into the five-stage pipeline
  • manifold.md - Deep-dive on two-plane data model and manifold design
  • tasks.md - Task generator specifications and API