Configuration
Templates: Prompting Strategy¶
Templates transform test cases into model inputs, enabling systematic comparison of different reasoning elicitation strategies:
| Template | System Prompt | Turn Structure | Examples | CoT |
|---|---|---|---|---|
| zeroshot | Task description | Single user input | None | No |
| zeroshot-nosys | None | Task + input as user | None | No |
| zerocot-nosys | Task description | Single user input | None | Yes |
| multishot | Task description | Multi-turn examples | Input/answer pairs | No |
| multishot-nosys | None | Multi-turn examples | Task+input/answer pairs | No |
| multishot-cot | Task description | Multi-turn examples | Input/reasoning/answer | Yes |
| unified-cot | None | Single user message | Input/reasoning/answer | Yes |
Templates enable research questions like:
- System Prompt Dependency: How much do models rely on system vs user instructions?
- Few-Shot Effectiveness: Do examples improve performance without reasoning chains?
- Chain-of-Thought Impact: How much do reasoning demonstrations help?
- Turn Structure Effects: Does conversation structure affect reasoning quality?
Samplers: Generation Parameter Control¶
Samplers define the generation parameters used during LLM inference, controlling how models produce responses. ReasonScape includes optimized sampling configurations for different model families and use cases.
Available Samplers¶
| Sampler | Context | Strategy | Description | Use Case |
|---|---|---|---|---|
| `greedy-2k.json` | 2K | Greedy | Deterministic sampling, 2K token limit | Resource-constrained reproducible benchmarking |
| `greedy-4k.json` | 4K | Greedy | Deterministic sampling, 4K token limit | Standard reproducible benchmarking |
| `greedy-8k.json` | 8K | Greedy | Deterministic sampling, 8K token limit | Extended context reproducible benchmarking |
| `greedy-max.json` | - | Greedy | Deterministic sampling, no token limit | Maximum context reproducible benchmarking |
| `magistral-2k.json` | 2K | Magistral | Mistral-recommended parameters, 2K limit | Magistral models, constrained context |
| `magistral-6k.json` | 6K | Magistral | Mistral-recommended parameters, 6K limit | Magistral models, complex reasoning |
| `magistral-8k.json` | 8K | Magistral | Mistral-recommended parameters, 8K limit | Magistral models, extended context |
| `o1-high.json` | - | O1 | High reasoning effort control | OpenAI O1 models, maximum reasoning |
| `o1-medium.json` | - | O1 | Medium reasoning effort control | OpenAI O1 models, balanced reasoning |
| `o1-low.json` | - | O1 | Low reasoning effort control | OpenAI O1 models, minimal reasoning |
| `o1-none.json` | - | O1 | No reasoning effort control | OpenAI O1 models, baseline |
| `qwen3-think-2k.json` | 2K | Qwen3 | Think mode enabled, 2K limit | Qwen models with explicit reasoning |
| `qwen3-think-4k.json` | 4K | Qwen3 | Think mode enabled, 4K limit | Qwen models with explicit reasoning |
| `qwen3-think-max.json` | - | Qwen3 | Think mode enabled, no limit | Qwen models with maximum reasoning |
| `qwen3-nothink-2k.json` | 2K | Qwen3 | Think mode disabled, 2K limit | Qwen models without explicit reasoning |
| `qwen3-nothink-4k.json` | 4K | Qwen3 | Think mode disabled, 4K limit | Qwen models without explicit reasoning |
| `rc-high-4k.json` | 4K | Ruminate | High reasoning intensity | Open-source reasoning control, maximum effort |
| `rc-medium-4k.json` | 4K | Ruminate | Medium reasoning intensity | Open-source reasoning control, balanced |
| `rc-low-4k.json` | 4K | Ruminate | Low reasoning intensity | Open-source reasoning control, minimal |
| `rc-none-2k.json` | 2K | Ruminate | No reasoning control | Open-source baseline, constrained context |
| `rc-none-4k.json` | 4K | Ruminate | No reasoning control | Open-source baseline, standard context |
| `hermes4-think-max.json` | - | Hermes4 | Thinking mode enabled, Chain-of-Thought preservation | Hermes models with explicit reasoning and CoT retention |
| `hunyuan-think-max.json` | - | Hunyuan | Hunyuan-optimized parameters with repetition penalty | Hunyuan models with specialized sampling controls |
| `granite-max.json` | - | Granite | Thinking mode enabled, deterministic sampling | IBM Granite models with reasoning capabilities |
| `magistral-max.json` | - | Magistral | Enhanced Magistral parameters with min_p control | Magistral models with improved sampling diversity |
| `gpt-oss-high.json` | - | GPT-OSS | High reasoning effort control | Open-source GPT models, maximum reasoning |
| `gpt-oss-medium.json` | - | GPT-OSS | Medium reasoning effort control | Open-source GPT models, balanced reasoning |
| `gpt-oss-low.json` | - | GPT-OSS | Low reasoning effort control | Open-source GPT models, minimal reasoning |
| `seedoss-unltd.json` | - | Seed-OSS | Unlimited thinking budget (-1) | Seed-OSS models with unrestricted reasoning |
| `seedoss-6k.json` | 6K | Seed-OSS | Thinking budget 6K tokens | Seed-OSS models with extended reasoning |
| `seedoss-4k.json` | 4K | Seed-OSS | Thinking budget 4K tokens | Seed-OSS models with moderate reasoning |
| `seedoss-2k.json` | 2K | Seed-OSS | Thinking budget 2K tokens | Seed-OSS models with constrained reasoning |
| `seedoss-0k.json` | - | Seed-OSS | Thinking budget disabled (0) | Seed-OSS models with no explicit reasoning |
Sampling Strategy Selection¶
Choose samplers based on your evaluation goals:
| Goal | Recommended Samplers | Rationale |
|---|---|---|
| Reproducible Benchmarking | `greedy-4k`, `greedy-max` | Deterministic results, consistent across runs |
| Mistral Magistral Models | `magistral-6k`, `magistral-8k` | Vendor-optimized parameters |
| OpenAI O1 Models | `o1-medium`, `o1-high` | Model-specific reasoning controls |
| Qwen Models | `qwen3-think-4k`, `qwen3-think-max` | Explicit reasoning mode enabled |
| Hermes Models | `hermes4-think-max` | Explicit reasoning with CoT preservation |
| Hunyuan Models | `hunyuan-think-max` | Specialized sampling with repetition control |
| IBM Granite Models | `granite-max` | Deterministic sampling with reasoning |
| Seed-OSS Models | `seedoss-4k`, `seedoss-unltd` | Thinking budget control (0/2K/4K/6K/unlimited) |
| Resource Constraints | `*-2k` variants | Lower token limits for efficiency |
| Complex Tasks | `*-8k`, `*-max` variants | Extended context for difficult problems |
Sampler File Format¶
All sampler files use JSON format with OpenAI-compatible parameters:
{
"temperature": 0.0,
"top_p": 1.0,
"max_tokens": 4096
}
Additional model-specific parameters may include:
- `reasoning_effort`: For reasoning-capable models
- `repetition_penalty`: For open-source models
- `chat_template_kwargs`: For models supporting special chat template features
- `min_p`: Minimum probability threshold for sampling
- `skip_special_tokens`: Whether to suppress special tokens (including reasoning tags) in output
- `stop_token_ids`: Specific token IDs that stop generation
Model-Specific Parameter Details¶
`chat_template_kwargs`: JSON object containing special chat template parameters:
- `thinking`: Boolean to enable/disable explicit reasoning mode
- `enable_thinking`: Boolean to enable/disable explicit reasoning mode
- `keep_cots`: Boolean to preserve Chain-of-Thought reasoning in output

`min_p`: Float value (0.0-1.0) setting the minimum probability threshold for token sampling, improving output diversity.

`skip_special_tokens`: Boolean controlling whether special tokens are suppressed in the final output. Some models use special tokens for reasoning tags, and suppressing them can break parsing.

`stop_token_ids`: Array of integer token IDs that immediately stop generation when encountered.

`repetition_penalty`: Float value (>1.0) that penalizes repeated tokens, helping reduce repetitive output patterns.
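As a concrete illustration, the sketch below loads one of these sampler files and forwards it to an OpenAI-compatible endpoint, passing non-standard keys through `extra_body`. This is an assumption about how such a file could be consumed, not how `runner.py` is implemented; the file name, endpoint URL, and key split are placeholders.

```python
import json
from openai import OpenAI

# Keys the OpenAI-compatible client accepts as direct arguments; everything else
# (chat_template_kwargs, min_p, repetition_penalty, ...) is assumed here to be
# forwarded via extra_body so the serving backend can interpret it.
STANDARD_KEYS = {"temperature", "top_p", "max_tokens"}

def load_sampler(path):
    """Split a sampler JSON into standard kwargs and backend-specific extras."""
    with open(path) as f:
        params = json.load(f)
    standard = {k: v for k, v in params.items() if k in STANDARD_KEYS}
    extra = {k: v for k, v in params.items() if k not in STANDARD_KEYS}
    return standard, extra

standard, extra = load_sampler("samplers/qwen3-think-4k.json")
client = OpenAI(base_url="http://localhost:3333/v1", api_key="none")  # placeholder endpoint
response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "What is 2 + 2 * 3?"}],
    extra_body=extra,  # e.g. chat_template_kwargs for think-mode control
    **standard,
)
print(response.choices[0].message.content)
```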
Integration with Runner¶
Samplers are specified via the --sampler argument:
python runner.py \
--config configs/c2.json \
--template templates/zerocot-nosys.json \
--sampler samplers/greedy-4k.json \
--model your-model \
--apibase http://localhost:3333
The sampler name becomes part of the result identifier, enabling systematic comparison of generation strategies across identical test conditions.
Configs: Experiment Configuration Format¶
Experiment configurations define complete evaluation runs with hierarchical sampling strategies and adaptive precision control.
Basic Structure¶
name: "experiment_name"
precision:
[precision_levels...]
tasks:
[task_definitions...]
Precision Levels¶
Precision levels control the dynamic sampling behavior for statistical confidence. Each level defines sampling parameters that can be selected at test execution time.
precision:
low:
count: 32 # Base sample size per batch
maxrounds: 6 # Maximum batches (max 192 tests total)
targetci: 0.09 # Target confidence interval
abortht: 0.2 # Abort if hit rate exceeds this threshold
medium:
count: 64
targetci: 0.06
maxrounds: 8 # max 512 tests
targetciht: 0.1 # Alternative CI target for high-truncation scenarios
abortht: 0.15
high:
count: 128 # max 1280 tests
targetci: 0.04
targetciht: 0.06
abortht: 0.1
Precision Parameters¶
- `count`: Number of samples per batch. The system samples in batches to avoid statistical p-hacking.
- `targetci`: Target confidence interval. Sampling continues until this precision is reached.
- `targetciht`: Alternative confidence interval for high-truncation scenarios. Used when the truncation ratio exceeds `2*targetci`.
- `maxrounds`: Maximum number of batches to run (default: 10). Total max samples = `count * maxrounds`.
- `abortht`: Abort threshold for hit/truncation rate. Stops sampling if this rate is exceeded to avoid wasting tokens.
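To make the interaction of these parameters concrete, here is a minimal sketch of the batch-wise loop they describe, using a normal-approximation confidence interval. The actual estimator and stopping logic in ReasonScape may differ; treat this as illustration only.

```python
import math
import random

def run_adaptive(run_batch, count=64, maxrounds=8, targetci=0.06,
                 targetciht=0.1, abortht=0.15):
    """Sample in batches of `count` until the accuracy CI is tight enough,
    the truncation rate crosses abortht, or maxrounds is reached."""
    correct = truncated = total = 0
    for _ in range(maxrounds):
        results = run_batch(count)          # -> list of (is_correct, is_truncated)
        correct += sum(c for c, _ in results)
        truncated += sum(t for _, t in results)
        total += len(results)

        p = correct / total
        ht = truncated / total
        ci = 1.96 * math.sqrt(p * (1 - p) / total)   # normal-approx half-width

        if ht > abortht:                    # too many truncations: stop wasting tokens
            return p, ci, "aborted-truncation"
        target = targetciht if ht > 2 * targetci else targetci
        if ci <= target:                    # precision target reached
            return p, ci, "converged"
    return p, ci, "maxrounds"

# Toy usage: a fake batch runner with ~70% accuracy and ~5% truncation
fake = lambda n: [(random.random() < 0.7, random.random() < 0.05) for _ in range(n)]
print(run_adaptive(fake))
```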
Tasks¶
Tasks define the evaluation scenarios and their parameter spaces. Each task references a JSON file containing the task logic and specifies how to sample the parameter space.
List Mode¶
Direct specification of parameter combinations as a list of dictionaries:
- name: "boolean_legacy"
file: "tasks/boolean.json"
mode: "list"
params:
- { length: 10, max_depth: 2 }
- { length: 20, max_depth: 4 }
- { length: 40, max_depth: 8 }
- { length: 60, max_depth: 16 }
- { length: 90, max_depth: 32 }
List mode evaluates each parameter combination exactly as specified. While still technically supported, it's primarily used for legacy configurations and simple debugging scenarios.
Grid Mode¶
Direct sweep over parameter combinations via Cartesian product:
- name: "arithmetic_simple"
file: "tasks/arithmetic.json"
mode: "grid"
grid:
min_number: [-9, -99]
max_number: [9, 99]
max_depth: [0, 1, 2, 4]
length: [8, 16, 32]
Grid mode generates all combinations: 2 × 2 × 4 × 3 = 48 parameter combinations.
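The same expansion can be reproduced in a few lines of Python; this sketch only illustrates the counting, not the actual config loader.

```python
from itertools import product

grid = {
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}

# Cartesian product of all parameter value lists
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))   # 2 * 2 * 4 * 3 = 48
print(combos[0])     # {'min_number': -9, 'max_number': 9, 'max_depth': 0, 'length': 8}
```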
Manifold Mode¶
Hierarchical sampling with degree-based difficulty progression and density control:
- name: "arithmetic_adaptive"
file: "tasks/arithmetic.json"
mode: "manifold"
manifolds: [
{
"length": {
"range": [8, 16, 24, 32, 40, 48],
"window": {"skip": "degree", "body": 4},
"resample:corner": {"first": 1, "last": 1},
"resample:lowdef": {"first": 1, "middle": 1, "last": 1}
},
"max_depth": {
"range": [0, 1, 2, 4, 8],
"window": {"head": 2, "body": "degree"}
}
}
]
Manifold System¶
Manifolds provide sophisticated parameter space sampling with three key concepts.
Degree-Based Progression via Window Sampling¶
The degree parameter (set at execution time) controls difficulty by determining window sizes and sampling patterns. Higher degrees typically increase complexity and parameter coverage.
Each parameter defines a window that extracts values from its range based on the degree:
"parameter_name": {
"range": [val1, val2, val3, val4, val5, val6],
"window": {
"head": 2, # Always include first 2 values
"skip": "degree", # Skip 'degree' values after head
"body": 4 # Take 4 values after skip
}
}
Window Logic:
1. `head`: Values always included from the start of the range
2. `skip`: Number of values to skip after the head (supports degree expressions)
3. `body`: Number of values to take after the skip (supports degree expressions)
If skip + head + body exceeds range length, body values are taken from the end instead.
Degree Expressions:
- Simple integers: "head": 2
- Degree-relative: "skip": "degree", "body": "degree+1"
- Mathematical: "skip": "max(0, degree-1)", "body": "2*degree"
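Putting the window rules together, a rough Python sketch of the resolution logic might look like the following. The expression evaluation and overflow handling are assumptions; `resolver.py` remains the source of truth.

```python
def eval_expr(spec, degree):
    """Resolve an integer or a degree expression like 'degree+1' or 'max(0, degree-1)'."""
    if isinstance(spec, int):
        return spec
    return int(eval(spec, {"__builtins__": {}}, {"degree": degree, "max": max, "min": min}))

def resolve_window(rng, window, degree):
    head = eval_expr(window.get("head", 0), degree)
    skip = eval_expr(window.get("skip", 0), degree)
    body = eval_expr(window.get("body", 0), degree)

    values = rng[:head]                        # head: always included
    start = head + skip                        # skip values after the head
    if start + body <= len(rng):
        values += rng[start:start + body]      # body: next `body` values
    else:
        values += rng[-body:] if body else []  # overflow: take body values from the end
    return values

# Worked example from the Manifold Resolution Example below
length_range = [8, 16, 24, 32, 40, 48]
print(resolve_window(length_range, {"head": 1, "skip": "degree", "body": 3}, degree=1))
# -> [8, 24, 32, 40]
```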
Density Resampling¶
The density parameter (set at execution time) applies secondary sampling to reduce parameter combinations:
"resample:corner": {"first": 1, "last": 1} # Take first and last only
"resample:lowdef": {"first": 1, "middle": 1, "last": 1} # Take first, middle, last
Density Types:
- corner: Samples boundary values (first and last)
- lowdef: Low-definition sampling (first, middle, last)
- medium: Medium-density sampling (custom combinations)
- normal (no density): Uses all windowed values
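Continuing the sketch above, density resampling can be pictured as a second pass over the windowed values. The exact selection rules (especially for `middle` and the custom `medium` combinations) are assumptions here.

```python
def resample(values, spec):
    """Apply a density spec like {'first': 1, 'last': 1} to windowed values.
    Illustrative only -- the real resampler may handle overlaps differently."""
    picked = []
    picked += values[:spec.get("first", 0)]                    # from the start
    if spec.get("middle"):
        mid = len(values) // 2
        picked += values[mid:mid + spec["middle"]]             # around the middle
    if spec.get("last"):
        picked += values[-spec["last"]:]                       # from the end
    # preserve original order, drop duplicates when picks overlap
    return sorted(set(picked), key=values.index)

windowed = [8, 24, 32, 40]
print(resample(windowed, {"first": 1, "last": 1}))               # corner -> [8, 40]
print(resample(windowed, {"first": 1, "middle": 1, "last": 1}))  # lowdef -> [8, 32, 40]
```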
Multiple Manifolds¶
Tasks can define multiple manifolds to cover different parameter regions:
manifolds: [
{
# Manifold 1: Small numbers, variable whitespace
"min_number": {"range": [-9], "window": {"head": 1}},
"max_number": {"range": [9], "window": {"head": 1}},
"prob_dewhitespace": {"range": [0.0, 1.0], "window": {"head": 2}}
},
{
# Manifold 2: Large numbers, fixed whitespace
"min_number": {"range": [-99], "window": {"head": 1}},
"max_number": {"range": [99], "window": {"head": 1}},
"prob_dewhitespace": {"range": [0.5], "window": {"head": 1}}
}
]
Manifold Resolution Example¶
For a manifold parameter:
"length": {
"range": [8, 16, 24, 32, 40, 48],
"window": {"head": 1, "skip": "degree", "body": 3},
"resample:corner": {"first": 1, "last": 1}
}
Resolution at degree=1, density=corner:
1. Window sampling: head=1 (`[8]`), skip=1, body=3 (`[24,32,40]`) → `[8,24,32,40]`
2. Density resampling: first=1, last=1 → `[8,40]`

Resolution at degree=2, density=normal:
1. Window sampling: head=1 (`[8]`), skip=2, body=3 (`[32,40,48]`) → `[8,32,40,48]`
2. No density resampling → `[8,32,40,48]`
Debugging Manifolds¶
Use the included resolver script to preview manifold resolution:
python resolver.py config.yaml 1 # Preview at degree=1
This shows the concrete grids that will be generated for each density level, helping debug complex manifold definitions before running expensive evaluations.
DataSets: Analysis Configuration Format¶
Dataset configurations define how evaluation data is structured, labeled, and analyzed using the DuckDB-backed PointsDB system. They bridge raw evaluation results (NDJSON interviews from runner.py) and the analysis ecosystem (analyze.py, leaderboard.py, explorer.py).
Basic Structure¶
{
"name": "dataset_identifier",
"db": "data/experiment/dataset.db",
"evals": [ /* evaluation configurations */ ],
"tiers": [ /* difficulty tier definitions */ ],
"basetasks": { /* task-specific visualization configuration */ }
}
Core Components¶
1. Database Path¶
{
"name": "m12x",
"db": "data/m12x.db"
}
- `db`: Path to DuckDB database file (replaces `inputs[]` glob patterns)
- Created by: `evaluate.py --dataset` automatically processes `evals[]` and writes to this database
- Contains: All evaluation points with native LIST types for grouped dimensions
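Because the output is a standard DuckDB file, it can also be inspected directly from Python. The snippet below is a hedged example: the table and column names in the query are hypothetical, so check the `SHOW TABLES` / `DESCRIBE` output for the actual schema.

```python
import duckdb

con = duckdb.connect("data/m12x.db", read_only=True)
print(con.sql("SHOW TABLES"))   # inspect what evaluate.py actually created

# Hypothetical query -- 'points', 'model', and 'base_task' are assumed names,
# not a documented schema; adapt them to what DESCRIBE reports.
con.sql("""
    SELECT model, base_task, count(*) AS points
    FROM points
    GROUP BY model, base_task
    ORDER BY model, base_task
""").show()
```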
2. Evaluation Definitions¶
Evals define available (model, template, sampler) combinations with explicit filters, evaluation configuration, and metadata:
{
"evals": [
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*"
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max"
},
"label": "Microsoft Phi-4 (FP16)",
"groups": ["family:phi4", "arch:dense", "size:mid"],
"hf_id": "microsoft/Phi-4",
"hf_quant_id": null
},
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max-ctx8192"
},
"label": "Microsoft Phi-4 (FP16, 8k ctx)",
"groups": ["family:phi4", "arch:dense", "size:mid"],
"hf_id": "microsoft/Phi-4"
}
]
}
Eval Parameters¶
- `evaluate`: Evaluation configuration block
  - `glob`: Glob pattern matching raw NDJSON interview files
  - `context`: (Optional) Simulate lower context by clipping responses
- `filters`: Identity dimensions that uniquely identify this evaluation's points
  - `model`: Model identifier (must match runner.py output)
  - `template`: Template identifier
  - `sampler`: Sampler identifier (auto-suffixed with `-ctx{N}` if context specified)
- `label`: Human-readable name for leaderboards and visualizations
- `groups`: Classification tags for peer comparison and filtering (see Group Taxonomy)
- `hf_id`: Hugging Face model ID for tokenizer loading
- `hf_quant_id`: (Optional) Alternative HF ID for quantized variants
Context Simulation¶
Context downsampling creates separate evaluations with modified sampler field:
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max-ctx8192" // Note the suffix
},
"label": "Phi-4 (8k ctx)"
}
How it works:
1. evaluate.py --dataset processes with --context 8192 --tokenizer HF_ID
2. Clips responses to 8192 total tokens (prompt + completion)
3. Clipped responses marked as truncated, incorrect
4. Sampler field auto-suffixed: greedy-max → greedy-max-ctx8192
5. Stored as separate eval in database with unique eval_id
Use cases:
- Compare the same model at different context windows
- Simulate lower-context deployment from high-context runs
- Study truncation patterns vs native context limits
3. Tier Definitions¶
Tiers define difficulty groupings with explicit filter objects:
{
"tiers": [
{
"filters": {
"degrees": ["0"],
"densities": ["normal"]
},
"label": "easy",
"points": {
"objects": 24,
"arithmetic": 26,
"dates": 8,
"boolean": 8,
"movies": 4,
"shuffle": 18
}
},
{
"filters": {
"degrees": ["1"],
"densities": ["normal"]
},
"label": "medium",
"points": {
"objects": 24,
"arithmetic": 39,
"dates": 12,
"boolean": 20,
"movies": 24,
"shuffle": 48
}
},
{
"filters": {
"degrees": ["2"],
"densities": ["normal"]
},
"label": "hard",
"points": {
"objects": 24,
"arithmetic": 39,
"dates": 16,
"boolean": 40,
"movies": 32,
"shuffle": 64
}
}
]
}
Tier Parameters¶
- `filters`: Filter object defining tier membership
  - `degrees`: List of degree values (stored as VARCHAR[] in database)
  - `densities`: List of density values (stored as VARCHAR[] in database)
- `label`: Human-readable difficulty description
- `points`: Expected data points per task for validation
4. Task Visualization Configuration¶
Each base task defines visualization parameters for projections (1D analysis) and surfaces (2D analysis):
{
"basetasks": {
"arithmetic": {
"label": "Arithmetic",
"projections": [ /* 1D line analysis */ ],
"surfaces": [ /* 2D surface analysis */ ]
}
}
}
Projections (1D Analysis)¶
Projections define controlled parameter sweeps for FFT analysis:
{
"projections": [
{
"id": "arith_length_d0",
"label": "Length (depth=0, whitespace=50%)",
"axis": "length",
"filter": {
"min_number": -9,
"max_number": 9,
"prob_dewhitespace": 0.5,
"max_depth": 0
},
"values": [8, 16, 32, 48],
"labels": ["8","16","32","48"]
}
]
}
Projection Parameters:
- id: Unique projection identifier (stored in projections[] column)
- label: Descriptive name for the analysis
- axis: Parameter to vary along the projection line
- filter: Fixed parameters defining experimental controls
- values: Specific parameter values to include in analysis
- labels: Human-readable labels for axis ticks
Surfaces (2D Analysis)¶
Surfaces define parameter sweeps for 3D visualization:
{
"surfaces": [
{
"id": "arith_len_x_depth",
"label": "Length x Depth (Random Whitespace, -9 to 9)",
"filter": {
"min_number": -9,
"max_number": 9,
"prob_dewhitespace": 0.5
},
"x_data": "length",
"x_title": "Length",
"x_values": [8, 16, 24, 32, 40, 48],
"x_labels": ["8","16","24","32","40","48"],
"y_data": "max_depth",
"y_title": "Depth",
"y_values": [0, 1, 4],
"y_labels": ["0","1","4"]
}
]
}
Surface Parameters:
- id: Unique surface identifier (stored in surfaces[] column)
- label: Descriptive name for the surface
- filter: Fixed parameters defining the surface slice
- x_data/y_data: Parameters varying along each surface dimension
- x_title/y_title: Axis labels for visualization
- x_values/y_values: Specific parameter combinations to include
- x_labels/y_labels: Human-readable labels for axis ticks
Integration with Analysis Tools¶
evaluate.py --dataset - Automated Processing¶
# Process all evaluations and update tags
python evaluate.py --dataset data/dataset-m12x.json
# With parallel bucket workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
What it does:
1. Reads evals[] from config
2. For each eval: processes interview files to DuckDB
3. De-duplicates points by 5D identity
4. Post-processing: updates groups[], surfaces[], projections[] tags
analyze.py - Unified Analysis CLI¶
# List evaluations
python analyze.py evals data/dataset-m12x.json
# Generate leaderboard
python analyze.py scores data/dataset-m12x.json --format markdown
# Cluster analysis by task
python analyze.py cluster data/dataset-m12x.json --split base_task --format png
# Surface visualization
python analyze.py surface data/dataset-m12x.json \
--filters '{"base_task": "arithmetic", "groups": [["arch:moe"]]}' \
--output surfaces.png
# FFT analysis
python analyze.py fft data/dataset-m12x.json \
--filters '{"base_task": ["arithmetic", "shuffle"], "projections": ["arith_length_d0"]}' \
--output fft_comparison.png
leaderboard.py - Interactive Web App¶
python leaderboard.py data/dataset-m12x.json 8050
Uses src/scores.py helper functions and PointsDB for data access.
explorer.py - 3D Interactive Visualization¶
python explorer.py data/dataset-m12x.json 8052
Uses surface/FFT/dance helper functions with PointsDB backend.
Group Taxonomy¶
Groups are classification tags that enable peer comparison and lattice navigation.
The specific group dimensions available depend on the dataset configuration. For example, the m12x dataset uses dimensions like architecture type (arch:), size category (size:), and model family (family:).
See datasets/m12x.md for the m12x dataset's group taxonomy.
Complete Example¶
See datasets/m12x.md and data/dataset-m12x.json for a complete real-world example with 62 evaluations, 3 tiers, and comprehensive surface/projection definitions across 12 tasks.