Configuration

Templates: Prompting Strategy

Templates transform test cases into model inputs, enabling systematic comparison of different reasoning elicitation strategies:

| Template | System Prompt | Turn Structure | Examples | CoT |
|----------|---------------|----------------|----------|-----|
| zeroshot | Task description | Single user input | None | No |
| zeroshot-nosys | None | Task + input as user | None | No |
| zerocot-nosys | None | Task + input as user | None | Yes |
| multishot | Task description | Multi-turn examples | Input/answer pairs | No |
| multishot-nosys | None | Multi-turn examples | Task+input/answer pairs | No |
| multishot-cot | Task description | Multi-turn examples | Input/reasoning/answer | Yes |
| unified-cot | None | Single user message | Input/reasoning/answer | Yes |

Templates enable research questions like:

  • System Prompt Dependency: How much do models rely on system vs. user instructions?
  • Few-Shot Effectiveness: Do examples improve performance without reasoning chains?
  • Chain-of-Thought Impact: How much do reasoning demonstrations help?
  • Turn Structure Effects: Does conversation structure affect reasoning quality?
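As a rough illustration (not ReasonScape's actual template engine, and with hypothetical field names), two of these strategies might map a test case onto OpenAI-style chat messages like this:

# Illustrative sketch only: shows how two template strategies might build
# OpenAI-style chat messages. The function names and test-case fields
# (task, input, examples) are hypothetical, not the real template schema.

def zeroshot(task: str, test_input: str) -> list[dict]:
    # Task description in the system prompt, single user turn, no examples.
    return [
        {"role": "system", "content": task},
        {"role": "user", "content": test_input},
    ]

def multishot_cot(task: str, test_input: str, examples: list[dict]) -> list[dict]:
    # Task description in the system prompt, multi-turn input/reasoning/answer
    # demonstrations, then the real input as the final user turn.
    messages = [{"role": "system", "content": task}]
    for ex in examples:
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({
            "role": "assistant",
            "content": f"{ex['reasoning']}\n\nAnswer: {ex['answer']}",
        })
    messages.append({"role": "user", "content": test_input})
    return messages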

Samplers: Generation Parameter Control

Samplers define the generation parameters used during LLM inference, controlling how models produce responses. ReasonScape includes optimized sampling configurations for different model families and use cases.

Available Samplers

| Sampler | Context | Strategy | Description | Use Case |
|---------|---------|----------|-------------|----------|
| greedy-2k.json | 2K | Greedy | Deterministic sampling, 2K token limit | Resource-constrained reproducible benchmarking |
| greedy-4k.json | 4K | Greedy | Deterministic sampling, 4K token limit | Standard reproducible benchmarking |
| greedy-8k.json | 8K | Greedy | Deterministic sampling, 8K token limit | Extended context reproducible benchmarking |
| greedy-max.json | - | Greedy | Deterministic sampling, no token limit | Maximum context reproducible benchmarking |
| magistral-2k.json | 2K | Magistral | Mistral-recommended parameters, 2K limit | Magistral models, constrained context |
| magistral-6k.json | 6K | Magistral | Mistral-recommended parameters, 6K limit | Magistral models, complex reasoning |
| magistral-8k.json | 8K | Magistral | Mistral-recommended parameters, 8K limit | Magistral models, extended context |
| o1-high.json | - | O1 | High reasoning effort control | OpenAI O1 models, maximum reasoning |
| o1-medium.json | - | O1 | Medium reasoning effort control | OpenAI O1 models, balanced reasoning |
| o1-low.json | - | O1 | Low reasoning effort control | OpenAI O1 models, minimal reasoning |
| o1-none.json | - | O1 | No reasoning effort control | OpenAI O1 models, baseline |
| qwen3-think-2k.json | 2K | Qwen3 | Think mode enabled, 2K limit | Qwen models with explicit reasoning |
| qwen3-think-4k.json | 4K | Qwen3 | Think mode enabled, 4K limit | Qwen models with explicit reasoning |
| qwen3-think-max.json | - | Qwen3 | Think mode enabled, no limit | Qwen models with maximum reasoning |
| qwen3-nothink-2k.json | 2K | Qwen3 | Think mode disabled, 2K limit | Qwen models without explicit reasoning |
| qwen3-nothink-4k.json | 4K | Qwen3 | Think mode disabled, 4K limit | Qwen models without explicit reasoning |
| rc-high-4k.json | 4K | Ruminate | High reasoning intensity | Open-source reasoning control, maximum effort |
| rc-medium-4k.json | 4K | Ruminate | Medium reasoning intensity | Open-source reasoning control, balanced |
| rc-low-4k.json | 4K | Ruminate | Low reasoning intensity | Open-source reasoning control, minimal |
| rc-none-2k.json | 2K | Ruminate | No reasoning control | Open-source baseline, constrained context |
| rc-none-4k.json | 4K | Ruminate | No reasoning control | Open-source baseline, standard context |

Sampling Strategy Selection

Choose samplers based on your evaluation goals:

| Goal | Recommended Samplers | Rationale |
|------|----------------------|-----------|
| Reproducible Benchmarking | greedy-4k, greedy-max | Deterministic results, consistent across runs |
| Mistral Magistral Models | magistral-6k, magistral-8k | Vendor-optimized parameters |
| OpenAI O1 Models | o1-medium, o1-high | Model-specific reasoning controls |
| Qwen Models | qwen3-think-4k, qwen3-think-max | Explicit reasoning mode enabled |
| Resource Constraints | *-2k variants | Lower token limits for efficiency |
| Complex Tasks | *-8k, *-max variants | Extended context for difficult problems |

Sampler File Format

All sampler files use JSON format with OpenAI-compatible parameters:

{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 4096
}

Additional model-specific parameters may include:

  • reasoning_effort: For reasoning-capable models
  • repetition_penalty: For open-source models
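For illustration, because the parameters are OpenAI-compatible, a sampler file can be merged directly into a chat-completions request; the endpoint and model name below are placeholders:

import json
import requests

# Load generation parameters from a sampler file and forward them verbatim
# to an OpenAI-compatible /v1/chat/completions endpoint.
with open("samplers/greedy-4k.json") as f:
    sampler = json.load(f)   # e.g. {"temperature": 0.0, "top_p": 1.0, "max_tokens": 4096}

payload = {
    "model": "your-model",
    "messages": [{"role": "user", "content": "2 + 2 * 3 = ?"}],
    **sampler,               # sampler keys merge directly into the request body
}
response = requests.post("http://localhost:3333/v1/chat/completions", json=payload)
print(response.json()["choices"][0]["message"]["content"])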

Integration with Runner

Samplers are specified via the --sampler argument:

python runner.py \
    --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json \
    --model your-model \
    --apibase http://localhost:3333

The sampler name becomes part of the result identifier, enabling systematic comparison of generation strategies across identical test conditions.

Configs: Experiment Configuration Format

Experiment configurations define complete evaluation runs with hierarchical sampling strategies and adaptive precision control.

Basic Structure

name: "experiment_name"

precision:
  [precision_levels...]

tasks:
  [task_definitions...]

Precision Levels

Precision levels control the dynamic sampling behavior for statistical confidence. Each level defines sampling parameters that can be selected at test execution time.

precision:
  low:
    count: 32           # Base sample size per batch
    maxrounds: 6        # Maximum batches (max 192 tests total)
    targetci: 0.09      # Target confidence interval
    abortht: 0.2        # Abort if hit rate exceeds this threshold
  medium:
    count: 64
    targetci: 0.06
    maxrounds: 8        # max 512 tests
    targetciht: 0.1     # Alternative CI target for high-truncation scenarios
    abortht: 0.15
  high:
    count: 128          # max 1280 tests
    targetci: 0.04
    targetciht: 0.06
    abortht: 0.1

Precision Parameters

  • count: Number of samples per batch. The system samples in whole batches, rather than re-checking the stopping criteria after every test, to avoid statistical p-hacking.
  • targetci: Target confidence interval. Sampling continues until this precision is reached.
  • targetciht: Alternative confidence interval for high-truncation scenarios. Used when truncation ratio exceeds 2*targetci.
  • maxrounds: Maximum number of batches to run (default: 10). Total max samples = count * maxrounds.
  • abortht: Abort threshold for hit/truncation rate. Stops sampling if this rate is exceeded to avoid wasting tokens.
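A simplified sketch of the stopping rule these parameters imply is shown below; run_batch is a hypothetical stand-in for actual test execution, and the confidence interval uses a plain normal approximation:

import math

def adaptive_sample(run_batch, count, targetci, maxrounds=10,
                    abortht=None, targetciht=None):
    """Sketch of the batch-wise stopping rule (not the real runner).
    run_batch(n) is a hypothetical callable returning
    (num_correct, num_truncated) for n fresh test samples."""
    correct = truncated = total = 0
    for _ in range(maxrounds):
        c, t = run_batch(count)
        correct += c
        truncated += t
        total += count

        ht = truncated / total                 # hit/truncation rate
        if abortht is not None and ht > abortht:
            break                              # stop wasting tokens

        # Relax the CI target in high-truncation scenarios.
        target = targetci
        if targetciht is not None and ht > 2 * targetci:
            target = targetciht

        # Normal-approximation half-width of a 95% CI on accuracy.
        p = correct / total
        halfwidth = 1.96 * math.sqrt(p * (1 - p) / total)
        if halfwidth <= target:
            break                              # desired precision reached
    return correct, truncated, total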

Tasks

Tasks define the evaluation scenarios and their parameter spaces. Each task references a JSON file containing the task logic and specifies how to sample the parameter space.

List Mode

Direct specification of parameter combinations as a list of dictionaries:

- name: "boolean_legacy"
  file: "tasks/boolean.json"
  mode: "list"
  params:
    - { length: 10, max_depth: 2 }
    - { length: 20, max_depth: 4 }
    - { length: 40, max_depth: 8 }
    - { length: 60, max_depth: 16 }
    - { length: 90, max_depth: 32 }

List mode evaluates each parameter combination exactly as specified. While still technically supported, it's primarily used for legacy configurations and simple debugging scenarios.

Grid Mode

Direct sweep over parameter combinations via Cartesian product:

- name: "arithmetic_simple"
  file: "tasks/arithmetic.json"
  mode: "grid"
  grid:
    min_number: [-9, -99]
    max_number: [9, 99]
    max_depth: [0, 1, 2, 4]
    length: [8, 16, 32]

Grid mode generates all combinations: 2 × 2 × 4 × 3 = 48 parameter combinations.
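The expansion is a standard Cartesian product, e.g. in Python:

from itertools import product

grid = {
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}

# Cartesian product over the value lists yields every parameter combination.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
assert len(combos) == 48   # 2 * 2 * 4 * 3
print(combos[0])           # {'min_number': -9, 'max_number': 9, 'max_depth': 0, 'length': 8}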

Manifold Mode

Hierarchical sampling with degree-based difficulty progression and density control:

- name: "arithmetic_adaptive"
  file: "tasks/arithmetic.json"
  mode: "manifold"
  manifolds: [
    {
      "length": {
        "range": [8, 16, 24, 32, 40, 48],
        "window": {"skip": "degree", "body": 4},
        "resample:corner": {"first": 1, "last": 1},
        "resample:lowdef": {"first": 1, "middle": 1, "last": 1}
      },
      "max_depth": {
        "range": [0, 1, 2, 4, 8],
        "window": {"head": 2, "body": "degree"}
      }
    }
  ]

Manifold System

Manifolds provide sophisticated parameter space sampling with three key concepts.

Degree-Based Progression via Window Sampling

The degree parameter (set at execution time) controls difficulty by determining window sizes and sampling patterns. Higher degrees typically increase complexity and parameter coverage.

Each parameter defines a window that extracts values from its range based on the degree:

"parameter_name": {
  "range": [val1, val2, val3, val4, val5, val6],
  "window": {
    "head": 2,           # Always include first 2 values
    "skip": "degree",    # Skip 'degree' values after head
    "body": 4            # Take 4 values after skip
  }
}

Window Logic:

1. head: Always included values from the start of the range
2. skip: Number of values to skip after head (supports degree expressions)
3. body: Number of values to take after skip (supports degree expressions)

If skip + head + body exceeds range length, body values are taken from the end instead.

Degree Expressions:

  • Simple integers: "head": 2
  • Degree-relative: "skip": "degree", "body": "degree+1"
  • Mathematical: "skip": "max(0, degree-1)", "body": "2*degree"
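A minimal Python sketch of this window logic (not the actual resolver; the degree-expression handling here is a simplified eval):

def resolve_window(values, window, degree):
    """Sketch of the documented window logic. Degree expressions like
    "degree+1" are evaluated with `degree` bound; a real implementation
    would want a safer expression parser than eval."""
    def resolve(expr, default=0):
        if expr is None:
            return default
        if isinstance(expr, int):
            return expr
        return eval(expr, {"max": max, "min": min}, {"degree": degree})

    head = resolve(window.get("head"))
    skip = resolve(window.get("skip"))
    body = resolve(window.get("body"))

    out = list(values[:head])
    if head + skip + body <= len(values):
        out += values[head + skip : head + skip + body]
    else:
        # Window overruns the range: take body values from the end instead.
        out += values[len(values) - body:]
    return out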

Density Resampling

The density parameter (set at execution time) applies secondary sampling to reduce parameter combinations:

"resample:corner": {"first": 1, "last": 1}           # Take first and last only
"resample:lowdef": {"first": 1, "middle": 1, "last": 1}  # Take first, middle, last

Density Types:

  • corner: Samples boundary values (first and last)
  • lowdef: Low-definition sampling (first, middle, last)
  • medium: Medium-density sampling (custom combinations)
  • normal (no density): Uses all windowed values
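A matching sketch of the resample step, covering the corner and lowdef specs shown above:

def resample(values, spec):
    """Sketch of density resampling: pick first/middle/last values
    according to the resample spec (e.g. {"first": 1, "last": 1})."""
    out = []
    out += values[:spec.get("first", 0)]
    if spec.get("middle", 0):
        mid = len(values) // 2
        out += values[mid : mid + spec["middle"]]
    last = spec.get("last", 0)
    if last:
        out += values[-last:]
    return out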

Multiple Manifolds

Tasks can define multiple manifolds to cover different parameter regions:

manifolds: [
  {
    # Manifold 1: Small numbers, variable whitespace
    "min_number": {"range": [-9], "window": {"head": 1}},
    "max_number": {"range": [9], "window": {"head": 1}},
    "prob_dewhitespace": {"range": [0.0, 1.0], "window": {"head": 2}}
  },
  {
    # Manifold 2: Large numbers, fixed whitespace
    "min_number": {"range": [-99], "window": {"head": 1}},
    "max_number": {"range": [99], "window": {"head": 1}},
    "prob_dewhitespace": {"range": [0.5], "window": {"head": 1}}
  }
]

Manifold Resolution Example

For a manifold parameter:

"length": {
  "range": [8, 16, 24, 32, 40, 48],
  "window": {"head": 1, "skip": "degree", "body": 3},
  "resample:corner": {"first": 1, "last": 1}
}

Resolution at degree=1, density=corner:

1. Window sampling: head=1 ([8]), skip=1, body=3 ([24,32,40]) → [8,24,32,40]
2. Density resampling: first=1, last=1 → [8,40]

Resolution at degree=2, density=normal:

1. Window sampling: head=1 ([8]), skip=2, body=3 ([32,40,48]) → [8,32,40,48]
2. No density resampling → [8,32,40,48]
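Using the two sketches above, these resolutions reproduce directly:

length = [8, 16, 24, 32, 40, 48]
window = {"head": 1, "skip": "degree", "body": 3}

# degree=1, density=corner
w1 = resolve_window(length, window, degree=1)      # [8, 24, 32, 40]
print(resample(w1, {"first": 1, "last": 1}))       # [8, 40]

# degree=2, density=normal (no resampling)
print(resolve_window(length, window, degree=2))    # [8, 32, 40, 48]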

Debugging Manifolds

Use the included resolver script to preview manifold resolution:

python resolver.py config.yaml 1  # Preview at degree=1

This shows the concrete grids that will be generated for each density level, helping debug complex manifold definitions before running expensive evaluations.

DataSets: Visualization Display Format

Dataset configurations define how processed evaluation data is structured, labeled, and visualized across ReasonScape's analysis tools. They serve as the crucial bridge between raw evaluation results (bucket.json files from evaluate.py) and the visualization ecosystem (leaderboard.py, explorer.py, and comparison tools).

Basic Structure

{
  "name": "dataset_identifier",
  "inputs": ["data/experiment/buckets-*.json"],
  "manifolds": { /* difficulty tier definitions */ },
  "scenarios": { /* model+template+sampler combinations */ },
  "basetasks": { /* task-specific visualization configuration */ }
}

Core Components

1. Input Aggregation

{
  "name": "m6",
  "inputs": [
    "data/m6/buckets-*.json"
  ]
}
  • inputs: Glob patterns matching bucket.json files from evaluate.py
  • Aggregation: Multiple bucket files are merged into unified dataset
  • Validation: System verifies data availability and completeness
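A minimal sketch of the aggregation step, assuming a naive merge (the real pipeline also validates the data):

import glob
import json

# Expand each glob pattern and merge the matching bucket files into one
# dataset. The merge here is a simple list concatenation for illustration.
buckets = []
for pattern in ["data/m6/buckets-*.json"]:
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            buckets.append(json.load(f))
print(f"merged {len(buckets)} bucket files")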

2. Manifold Definitions

Manifolds define difficulty groupings that map the degree/density parameters from experiment configs into human-interpretable complexity tiers:

{
  "manifolds": {
    "lowdef+low+0": { 
      "label": "(easy)", 
      "groups": ["degree0"], 
      "points": {
        "objects": 18, 
        "arithmetic": 24, 
        "dates": 8, 
        "boolean": 6, 
        "movies": 4, 
        "shuffle": 18
      } 
    },
    "normal+low+1": { 
      "label": "(medium)", 
      "groups": ["degree1"], 
      "points": {
        "objects": 24, 
        "arithmetic": 39, 
        "dates": 12, 
        "boolean": 20, 
        "movies": 24, 
        "shuffle": 48
      } 
    },
    "normal+low+2": { 
      "label": "(hard)", 
      "groups": ["degree2"], 
      "points": {
        "objects": 24, 
        "arithmetic": 39, 
        "dates": 16, 
        "boolean": 40, 
        "movies": 32, 
        "shuffle": 64
      } 
    }
  }
}

Manifold Parameters

  • Manifold Key: Format {density}+{precision}+{degree} matching experiment execution parameters
  • label: Human-readable difficulty description for leaderboard display
  • groups: Logical groupings for analysis (enables degree0+degree1 = "easy+medium")
  • points: Expected data points per task for statistical validation
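As an illustration of the kind of check points enables, assuming (hypothetically) that each merged record carries task and manifold keys:

from collections import Counter

def validate_points(records, manifolds):
    """Sketch of point validation. Assumes each record carries `task`
    and `manifold` fields; the actual bucket schema may differ."""
    actual = Counter((r["manifold"], r["task"]) for r in records)
    for mkey, mdef in manifolds.items():
        for task, expected in mdef["points"].items():
            got = actual.get((mkey, task), 0)
            if got < expected:
                print(f"{mkey}/{task}: {got}/{expected} points (insufficient)")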

Degree-Density Mapping

The manifold system bridges experiment configuration (configs/) and dataset analysis:

Experiment Execution:

python runner.py --degree 1 --precision low --density normal

Dataset Mapping:

"normal+low+1": { "label": "(medium)", "groups": ["degree1"] }

This enables:

  • Hierarchical Analysis: Compare easy vs medium vs hard across all models
  • Statistical Validation: Verify sufficient data points for confidence intervals
  • Progressive Evaluation: Add higher degrees without reconfiguring datasets

3. Scenario Management

Scenarios define model+template+sampler combinations with display labels and organizational groupings:

{
  "scenarios": {
    "phi-4-fp16+zerocot-nosys+greedy-4k": { 
      "label": "Microsoft Phi-4", 
      "groups": ["phi", "microsoft"] 
    },
    "gpt-oss-20b+zeroshot-nosys+greedy-4k": { 
      "label": "OpenAI GPT-OSS-20B", 
      "groups": ["openai", "proprietary"] 
    },
    "Meta-Llama-3.1-8B-Instruct-Turbo+zerocot-nosys+greedy-4k": { 
      "label": "Meta Llama-3.1-8B", 
      "groups": ["meta", "opensource"] 
    }
  }
}

Scenario Parameters

  • Scenario Key: Exact format {model}+{template}+{sampler} matching bucket identifiers
  • label: Human-readable name for leaderboard and UI display
  • groups: Organizational categories for filtering and comparison analysis
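Because the key format is fixed, scenario keys can be derived directly from run parameters:

# Scenario keys follow the documented {model}+{template}+{sampler} format,
# so they can be constructed straight from the run configuration:
model, template, sampler = "phi-4-fp16", "zerocot-nosys", "greedy-4k"
scenario_key = f"{model}+{template}+{sampler}"
# -> "phi-4-fp16+zerocot-nosys+greedy-4k"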

4. Task Visualization Configuration

Each base task defines visualization parameters for projections (1D analysis) and surfaces (2D analysis):

{
  "basetasks": {
    "arithmetic": {
      "label": "Arithmetic",
      "projections": [ /* 1D line analysis */ ],
      "surfaces": [ /* 2D surface analysis */ ]
    }
  }
}

Projections (1D Analysis)

Projections define controlled parameter sweeps for line-based analysis in compare_project.py:

{
  "projections": [
    {
      "label": "Length (depth=0, whitespace=50%)",
      "axis": "length",
      "filter": {
        "min_number": -9,
        "max_number": 9,
        "prob_dewhitespace": 0.5,
        "max_depth": 0
      },
      "values": [8, 16, 32, 48],
      "labels": ["8","16","32","48"]
    }
  ]
}

Projection Parameters:

  • label: Descriptive name for the analysis
  • axis: Parameter to vary along the projection line
  • filter: Fixed parameters defining experimental controls
  • values: Specific parameter values to include in the analysis
  • labels: Human-readable labels for axis ticks
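As a sketch of how a projection slices data (the record fields params and accuracy here are assumptions, not the real bucket schema):

def select_projection(records, projection):
    """Sketch of projection slicing: keep records matching every `filter`
    key, then read one accuracy per `values` entry along `axis`."""
    axis, flt = projection["axis"], projection["filter"]
    kept = [r for r in records
            if all(r["params"].get(k) == v for k, v in flt.items())]
    by_value = {r["params"][axis]: r["accuracy"] for r in kept}
    return [by_value.get(v) for v in projection["values"]]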

Surfaces (2D Analysis)

Surfaces define two-dimensional parameter sweeps for 3D visualization in explorer.py and compare_surface.py:

{
  "surfaces": [
    {
      "label": "Length x Depth (Random Whitespace, -9 to 9)",
      "filter": {
        "min_number": -9,
        "max_number": 9,
        "prob_dewhitespace": 0.5
      },
      "x_data": "length",
      "x_title": "Length", 
      "x_values": [8, 16, 24, 32, 40, 48],
      "x_labels": ["8","16","24","32","40","48"],
      "y_data": "max_depth",
      "y_title": "Depth",
      "y_values": [0, 1, 4],
      "y_labels": ["0","1","4"]
    }
  ]
}

Surface Parameters:

  • label: Descriptive name for the surface
  • filter: Fixed parameters defining the surface slice
  • x_data / y_data: Parameters varying along each surface dimension
  • x_title / y_title: Axis labels for visualization
  • x_values / y_values: Specific parameter combinations to include
  • x_labels / y_labels: Human-readable labels for axis ticks
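A matching sketch for surfaces, arranging accuracies into a y-by-x grid under the same assumed record fields:

def select_surface(records, surface):
    """Sketch of surface slicing: apply `filter`, then arrange accuracies
    into a y-by-x grid suitable for 3D plotting."""
    flt = surface["filter"]
    kept = [r for r in records
            if all(r["params"].get(k) == v for k, v in flt.items())]
    lookup = {(r["params"][surface["x_data"]],
               r["params"][surface["y_data"]]): r["accuracy"] for r in kept}
    return [[lookup.get((x, y)) for x in surface["x_values"]]
            for y in surface["y_values"]]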

Integration with Visualization Tools

Leaderboard Integration (leaderboard.py)

  • Manifold Groupings: Create difficulty tiers (easy/medium/hard) for ReasonScore calculation
  • Scenario Labels: Display human-readable model names instead of technical identifiers
  • Point Validation: Verify sufficient statistical power for confidence intervals
  • Group Filtering: Enable model family comparisons

Explorer Integration (explorer.py)

  • Surface Definitions: Generate 3D difficulty manifold visualizations
  • Interactive Selection: Map surface grid lines to projection analysis
  • Multi-Panel Sync: Coordinate FFT, accuracy, and histogram displays
  • Parameter Filtering: Apply surface filters for controlled analysis

Comparison Tools Integration (compare_project.py, compare_surface.py)

  • Projection Matrices: Create systematic parameter sweep grids
  • Cross-Model Analysis: Compare identical projections across multiple scenarios
  • Filter Application: Generate controlled experimental comparisons
  • Batch Visualization: Process multiple tasks and models simultaneously

Complete Task Example

The Boolean task demonstrates comprehensive projection and surface definitions:

{
  "boolean": {
    "label": "Boolean",
    "projections": [
      {
        "label": "Length (Depth=0, Python Format)",
        "axis": "length", 
        "filter": {
          "boolean_format": 0,
          "max_depth": 0
        },
        "values": [8, 16, 24, 40],
        "labels": ["8","16","24","40"]
      },
      {
        "label": "Format (Depth=0, Length=24)",
        "axis": "boolean_format",
        "filter": {
          "length": 24,
          "max_depth": 0  
        },
        "values": [0, 1, 2, 3, 4],
        "labels": ["PYTHON","T/F","ON/OFF","BINARY","YES/NO"]
      }
    ],
    "surfaces": [
      {
        "label": "Length x Format (Depth 0)",
        "x_data": "length",
        "x_title": "Length", 
        "x_values": [8, 16, 24, 32, 40, 56],
        "x_labels": ["8","16","24","32","40","56"],
        "y_data": "boolean_format", 
        "y_title": "Format",
        "y_values": [0, 1, 2, 3, 4],
        "y_labels": ["PYTHON","T/F","ON/OFF","BINARY","YES/NO"]
      }
    ]
  }
}

This configuration enables:

  • Length Analysis: How boolean expression length affects accuracy in Python format
  • Format Analysis: How different boolean notations affect parsing at fixed complexity
  • Surface Analysis: Complete 2D exploration of the length × format difficulty space

Best Practices

Manifold Design

  • Progressive Difficulty: Ensure degree progression represents meaningful complexity increases
  • Statistical Power: Balance point counts with computational costs; if the surface is well-behaved, consider corner sampling.
  • Complete Coverage: Define lowdef and corner manifolds for all degree/density combinations in your experiment

Scenario Organization

  • Consistent Naming: Use clear, hierarchical labels (vendor + model + configuration)
  • Logical Groupings: Enable meaningful model family comparisons
  • Template/Sampler Clarity: Make evaluation configuration explicit in labels

Visualization Optimization

  • Projection Selection: Choose parameter sweeps that test specific hypotheses
  • Surface Boundaries: Ensure surface extents capture interesting failure modes
  • Filter Design: Use filters to isolate individual cognitive effects
  • Label Clarity: Prioritize human readability over technical precision

Dataset configurations transform raw evaluation data into structured research tools, enabling systematic exploration of AI reasoning capabilities across multiple dimensions of analysis.