
Experiment Configuration Reference

This document describes how to define evaluation experiments in ReasonScape. Experiment configs are the Stage 1 (Definition) component of the pipeline: they define the parametric test cases generated for model evaluation.

See architecture.md for how configs fit into the five-stage pipeline.


Overview: What is an Experiment Config?

An experiment config is a YAML file (e.g., configs/r12.yaml) that defines:

  1. Task definitions - Which evaluation tasks to run and how to sample their parameter spaces
  2. Task metadata - View definitions for analysis (grouping and filtering recipes)

The config tells runner.py exactly what test cases to generate and how to sample them efficiently. Each test case is deterministically created from coordinate seeds, enabling reproducibility and caching.
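The coordinate-seeding idea can be sketched in a few lines of Python. This is illustrative, not runner.py's actual implementation — the helper name and hashing scheme are assumptions — but any stable hash of a test case's coordinates yields the reproducibility property described above:

```python
import hashlib
import json
import random

def coordinate_seed(task: str, params: dict) -> int:
    """Derive a stable RNG seed from a test case's coordinates.

    Hypothetical helper: a stable serialization of (task, params),
    hashed, gives the same seed for the same coordinates every run.
    """
    key = json.dumps({"task": task, "params": params}, sort_keys=True)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same coordinates (in any key order) -> same seed -> same generated test case.
rng_a = random.Random(coordinate_seed("arithmetic", {"max_depth": 2, "length": 8}))
rng_b = random.Random(coordinate_seed("arithmetic", {"length": 8, "max_depth": 2}))
assert rng_a.random() == rng_b.random()
```

Because the seed depends only on the coordinates, regenerating a test case is free, which is what makes caching safe.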

Config format: YAML (not JSON)

Config location: configs/ directory

Usage: python runner.py --config configs/r12.yaml --template ... --sampler ... --model ...


Core Structure

Every experiment config has two main sections:

name: "experiment_identifier"

tasks:
  [task_definitions...]

Name

A simple string identifier for the experiment:

name: r12

This name appears in filenames and logs but doesn't affect test generation.


Task Definitions

Tasks define the evaluation scenarios and how to sample their parameter spaces. Each task maps to a generator function (in tasks/) and specifies a sampling strategy.

Task Structure

tasks:
  - name: task_identifier
    file: tasks/taskname.json
    mode: calibrate|grid
    [mode_specific_parameters...]
    label: "Human-Readable Task Name"
    views: [...]

Task Parameters

  • name: Internal identifier for the task (used in outputs and logs)
  • file: Path to task generator file (JSON format)
  • mode: Which sampling strategy to use for the parameter space (calibrate or grid)
  • label: Human-readable name for visualizations and reports
  • views (optional): View definitions for analysis

Sampling Modes

ReasonScape supports two parameter space sampling strategies. Choose based on your research question and computational budget.

Grid Mode: Cartesian Product

Define parameter ranges; the system generates all combinations:

tasks:
  - name: "arithmetic_simple"
    file: "tasks/arithmetic.json"
    mode: "grid"
    grid:
      _id: arithmetic_all
      min_number: [-9, -99]
      max_number: [9, 99]
      max_depth: [0, 1, 2, 4]
      length: [8, 16, 32]

This grid generates 2 × 2 × 4 × 3 = 48 points. All keys starting with _ (here _id) are carried into manifold{} for downstream filtering and grouping.
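The grid expansion amounts to a Cartesian product over the non-underscore keys. A minimal sketch (not runner.py's actual code) of how the example above becomes 48 points:

```python
from itertools import product

# The grid block from the YAML example above, as a Python dict.
grid = {
    "_id": "arithmetic_all",
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}

# Keys beginning with "_" are manifold metadata, not swept parameters.
manifold = {k.lstrip("_"): v for k, v in grid.items() if k.startswith("_")}
axes = {k: v for k, v in grid.items() if not k.startswith("_")}

# Cartesian product over the remaining axes.
points = [dict(zip(axes, combo)) for combo in product(*axes.values())]
print(len(points))  # -> 48
```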

Calibrate Mode: Token Targets

Define calibration targets; the system searches for parameter values that hit the desired input token counts:

tasks:
  - name: tables
    label: Tables
    file: tasks/tables.json
    mode: calibrate
    calibrate:
    - _id:          "csv_tall"
      num_rows:    { token_targets: [2000, 4000, 6000, 8000, 10000, 12000, 14000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      num_columns: { values: [4] }
      format:      { values: [1] }
      operation:   { values: [2, 3, 5] }
      filter_type: { values: [1, 2, 3, 4] }
    - _id:          "markdown_tall"
      num_rows:    { token_targets: [2000, 4000, 6000, 8000, 10000, 12000, 14000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      num_columns: { values: [4] }
      format:      { values: [2] }
      operation:   { values: [2, 3, 5] }
      filter_type: { values: [1, 2, 3, 4] }

Use case: Evaluating performance at specific input sizes when the relationship between a parameter (e.g., num_rows) and token count is nonlinear or task-dependent.

Behavior: The calibrator searches for parameter values whose generated inputs land near each token_targets value, within the given min/max bounds and guided by the slope hint. The resolved target_tokens coordinate is stored in manifold{} alongside the _id.

Calibrate Entry Parameters

  • _id: Named identifier for this calibration entry; stored in manifold.id
  • token_targets: List of input token counts to target
  • min/max: Search bounds for the calibrated parameter
  • slope: Hint for the search (expected tokens-per-unit range)
  • step: Minimum step size for the search
  • values: Fixed values for non-calibrated parameters (Cartesian product over all fixed parameters)
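The source doesn't specify the calibrator's exact algorithm; as an illustration, a bisection over a monotonically increasing tokens(param) relationship captures the idea. This sketch honors min/max and step but omits the slope hint, which would seed the starting interval:

```python
def calibrate(token_count, target, lo, hi, step):
    """Find a parameter value whose generated input is near `target` tokens.

    Illustrative bisection, assuming token_count(param) increases with param.
    Candidates stay on the lattice lo, lo+step, lo+2*step, ...
    """
    while hi - lo > step:
        mid = lo + ((hi - lo) // (2 * step)) * step  # midpoint, snapped to step
        if mid == lo:
            break
        if token_count(mid) < target:
            lo = mid
        else:
            hi = mid
    # Return whichever endpoint lands closer to the target.
    return lo if abs(token_count(lo) - target) <= abs(token_count(hi) - target) else hi

# Toy cost model: ~12 tokens per row plus a 200-token preamble.
tokens = lambda rows: 200 + 12 * rows
rows = calibrate(tokens, target=4000, lo=10, hi=1500, step=5)
print(rows, tokens(rows))  # -> 315 3980
```

In practice one such search runs per token_targets entry, and the resolved value is crossed with the fixed values parameters.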

Multiple Calibrate Entries

A single task can define multiple calibrate entries to cover different parameter regions:

tasks:
  - name: tables
    file: tasks/tables.json
    mode: calibrate
    calibrate:
    - _id: "csv_tall"
      format: { values: [1] }
      num_rows: { token_targets: [2000, 4000, 8000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      ...
    - _id: "markdown_tall"
      format: { values: [2] }
      num_rows: { token_targets: [2000, 4000, 8000], min: 10, max: 1500, slope: [100, 500], step: 5 }
      ...

All entries are evaluated. This enables testing multiple format or structural variants in a single task run.


Views: Analysis Recipes

Views define how to slice and group results for analysis. They are named query recipes — a group_by, optional filters, and optional facet_by — that operate directly on manifold.* and params.* columns. Nothing is pre-computed; views are a config concept, not a database concept.

views:
  - view: formats
    label: "Format x Input Tokens {operation}"
    group_by: ["params.format", "manifold.target_tokens"]
    facet_by: ["params.operation"]
  - view: filters
    label: "Filter x Input Tokens {operation}"
    group_by: ["params.filter_type", "manifold.target_tokens"]
    facet_by: ["params.operation"]
  - view: project_formats
    label: "Format"
    group_by: ["params.format"]
    filters: { "manifold.target_tokens": 2000 }

View Parameters

  • view: Unique view identifier (used with --view flag in analysis tools)
  • label: Human-readable name; {param} tokens are interpolated with facet values
  • group_by: Ordered list of manifold.* or params.* columns defining the primary axes
  • facet_by (optional): Columns used to split results into separate panels or series
  • filters (optional): Fixed column values to restrict the view to a specific slice

Use case: Understanding how one or two parameters affect performance while controlling confounds — equivalent to what projections and surfaces provided previously, but expressed as a unified config-level recipe rather than pre-computed database entries.
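Applying a view recipe is just filter-then-group over flat result rows. A minimal sketch in plain Python, using the project_formats view from above (the row schema and accuracy aggregation are illustrative, not the analysis tools' actual code):

```python
from collections import defaultdict

# The project_formats view from the YAML example above.
view = {
    "group_by": ["params.format"],
    "filters": {"manifold.target_tokens": 2000},
}

# Flat result rows with manifold.*/params.* columns (synthetic data).
rows = [
    {"params.format": 1, "manifold.target_tokens": 2000, "correct": 1},
    {"params.format": 1, "manifold.target_tokens": 4000, "correct": 0},
    {"params.format": 2, "manifold.target_tokens": 2000, "correct": 0},
    {"params.format": 2, "manifold.target_tokens": 2000, "correct": 1},
]

# filters: keep only rows matching the fixed column values.
kept = [r for r in rows
        if all(r[col] == val for col, val in view.get("filters", {}).items())]

# group_by: bucket rows by the view's axes, then aggregate.
groups = defaultdict(list)
for r in kept:
    groups[tuple(r[c] for c in view["group_by"])].append(r["correct"])

accuracy = {key: sum(v) / len(v) for key, v in groups.items()}
print(accuracy)  # -> {(1,): 1.0, (2,): 0.5}
```

facet_by works the same way: one more grouping level whose keys become separate panels or series rather than axes.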


Integration with runner.py

runner.py consumes the YAML config to generate test cases:

python runner.py \
    --config configs/r12.yaml \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-16k.json \
    --model claude-opus-4-5 \
    --apibase http://localhost:3333

Config Execution Parameters

From the config file:

  • Task definitions (what tests to create)

From command-line arguments:

  • --template - Which prompting strategy (see templates-samplers.md)
  • --sampler - Which generation parameters (see templates-samplers.md)

Output

runner.py writes results as NDJSON (newline-delimited JSON) to results/:

results/
  {eval_id}/
    steps.ndjson  # One test case per line

Each line contains:

  • Test metadata (task, parameters)
  • Input prompt (system + user messages)
  • Model output
  • Token counts
  • Truncation flags
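NDJSON can be consumed one line at a time without loading the whole file. A minimal reader sketch — the field names below (task, truncated) are illustrative, not a guaranteed schema:

```python
import json
import os
import tempfile

def iter_steps(path):
    """Yield one test-case record per non-empty line of an NDJSON file."""
    with open(path) as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)

# Demo with a tiny synthetic file standing in for results/{eval_id}/steps.ndjson.
with tempfile.NamedTemporaryFile("w", suffix=".ndjson", delete=False) as fh:
    fh.write('{"task": "tables", "truncated": false}\n')
    fh.write('{"task": "tables", "truncated": true}\n')
    path = fh.name

# e.g. count how many steps hit the sampler's output limit
truncated = sum(1 for step in iter_steps(path) if step.get("truncated"))
print(truncated)  # -> 1
os.remove(path)
```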

These raw steps are consumed by evaluate.py in Stage 3 to produce aggregated points for the PointsDB.


Complete Examples

Grid mode: configs/r12.yaml — 12 task definitions with Cartesian parameter grids and views for Stage 4/5 analysis.

Calibrate mode: configs/tables-16k.yaml — Multiple calibrate entries covering different table formats, with views for format, filter, and token-size analysis.


See Also

  • architecture.md - How Stage 1 fits into the five-stage pipeline
  • manifold.md - Deep-dive on two-plane data model and manifold design
  • tasks.md - Task generator specifications and API