Configuration
Templates: Prompting Strategy¶
Templates transform test cases into model inputs, enabling systematic comparison of different reasoning elicitation strategies:
Template | System Prompt | Turn Structure | Examples | CoT |
---|---|---|---|---|
zeroshot | Task description | Single user input | None | No |
zeroshot-nosys | None | Task + input as user | None | No |
zerocot-nosys | None | Task + input as user | None | Yes |
multishot | Task description | Multi-turn examples | Input/answer pairs | No |
multishot-nosys | None | Multi-turn examples | Task+input/answer pairs | No |
multishot-cot | Task description | Multi-turn examples | Input/reasoning/answer | Yes |
unified-cot | None | Single user message | Input/reasoning/answer | Yes |
Templates enable research questions like:

- System Prompt Dependency: How much do models rely on system vs user instructions?
- Few-Shot Effectiveness: Do examples improve performance without reasoning chains?
- Chain-of-Thought Impact: How much do reasoning demonstrations help?
- Turn Structure Effects: Does conversation structure affect reasoning quality?
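To make the turn structures concrete, here is a minimal sketch of how a template might expand a single test case into OpenAI-style chat messages. The field names (`system_prompt`, `examples`, `cot`) are illustrative assumptions, not ReasonScape's actual template schema:

```python
def build_messages(template: dict, task: str, test_input: str) -> list[dict]:
    """Sketch: expand one test case into OpenAI-style chat messages."""
    messages = []
    prefix = ""
    if template.get("system_prompt"):        # zeroshot, multishot, multishot-cot
        messages.append({"role": "system", "content": task})
    else:                                    # *-nosys variants: task rides in the user turns
        prefix = task + "\n\n"
    for ex in template.get("examples", []):  # multi-turn few-shot structure
        messages.append({"role": "user", "content": prefix + ex["input"]})
        reply = ex["answer"]
        if template.get("cot"):              # prepend the reasoning demonstration
            reply = ex["reasoning"] + "\n\n" + reply
        messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": prefix + test_input})
    return messages
```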
Samplers: Generation Parameter Control¶
Samplers define the generation parameters used during LLM inference, controlling how models produce responses. ReasonScape includes optimized sampling configurations for different model families and use cases.
Available Samplers¶
Sampler | Context | Strategy | Description | Use Case |
---|---|---|---|---|
`greedy-2k.json` | 2K | Greedy | Deterministic sampling, 2K token limit | Resource-constrained reproducible benchmarking |
`greedy-4k.json` | 4K | Greedy | Deterministic sampling, 4K token limit | Standard reproducible benchmarking |
`greedy-8k.json` | 8K | Greedy | Deterministic sampling, 8K token limit | Extended context reproducible benchmarking |
`greedy-max.json` | - | Greedy | Deterministic sampling, no token limit | Maximum context reproducible benchmarking |
`magistral-2k.json` | 2K | Magistral | Mistral-recommended parameters, 2K limit | Magistral models, constrained context |
`magistral-6k.json` | 6K | Magistral | Mistral-recommended parameters, 6K limit | Magistral models, complex reasoning |
`magistral-8k.json` | 8K | Magistral | Mistral-recommended parameters, 8K limit | Magistral models, extended context |
`o1-high.json` | - | O1 | High reasoning effort control | OpenAI O1 models, maximum reasoning |
`o1-medium.json` | - | O1 | Medium reasoning effort control | OpenAI O1 models, balanced reasoning |
`o1-low.json` | - | O1 | Low reasoning effort control | OpenAI O1 models, minimal reasoning |
`o1-none.json` | - | O1 | No reasoning effort control | OpenAI O1 models, baseline |
`qwen3-think-2k.json` | 2K | Qwen3 | Think mode enabled, 2K limit | Qwen models with explicit reasoning |
`qwen3-think-4k.json` | 4K | Qwen3 | Think mode enabled, 4K limit | Qwen models with explicit reasoning |
`qwen3-think-max.json` | - | Qwen3 | Think mode enabled, no limit | Qwen models with maximum reasoning |
`qwen3-nothink-2k.json` | 2K | Qwen3 | Think mode disabled, 2K limit | Qwen models without explicit reasoning |
`qwen3-nothink-4k.json` | 4K | Qwen3 | Think mode disabled, 4K limit | Qwen models without explicit reasoning |
`rc-high-4k.json` | 4K | Ruminate | High reasoning intensity | Open-source reasoning control, maximum effort |
`rc-medium-4k.json` | 4K | Ruminate | Medium reasoning intensity | Open-source reasoning control, balanced |
`rc-low-4k.json` | 4K | Ruminate | Low reasoning intensity | Open-source reasoning control, minimal |
`rc-none-2k.json` | 2K | Ruminate | No reasoning control | Open-source baseline, constrained context |
`rc-none-4k.json` | 4K | Ruminate | No reasoning control | Open-source baseline, standard context |
Sampling Strategy Selection¶
Choose samplers based on your evaluation goals:
Goal | Recommended Samplers | Rationale |
---|---|---|
Reproducible Benchmarking | `greedy-4k`, `greedy-max` | Deterministic results, consistent across runs |
Mistral Magistral Models | `magistral-6k`, `magistral-8k` | Vendor-optimized parameters |
OpenAI O1 Models | `o1-medium`, `o1-high` | Model-specific reasoning controls |
Qwen Models | `qwen3-think-4k`, `qwen3-think-max` | Explicit reasoning mode enabled |
Resource Constraints | `*-2k` variants | Lower token limits for efficiency |
Complex Tasks | `*-8k`, `*-max` variants | Extended context for difficult problems |
Sampler File Format¶
All sampler files use JSON format with OpenAI-compatible parameters:
```json
{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 4096
}
```
Additional model-specific parameters may include:
- `reasoning_effort`: For reasoning-capable models
- `repetition_penalty`: For open-source models
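Since the parameter names are OpenAI-compatible, a sampler file can plausibly be loaded and splatted directly into a chat-completion call. A minimal sketch using the `openai` Python client (the endpoint and model name are placeholders):

```python
import json

from openai import OpenAI  # any OpenAI-compatible server works

# Load generation parameters, e.g. {"temperature": 0.0, "top_p": 1.0, "max_tokens": 4096}
with open("samplers/greedy-4k.json") as f:
    sampler = json.load(f)

client = OpenAI(base_url="http://localhost:3333/v1", api_key="none")
response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "What is 2+2?"}],
    **sampler,  # the sampler file maps straight onto generation kwargs
)
print(response.choices[0].message.content)
```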
Integration with Runner¶
Samplers are specified via the `--sampler` argument:
```bash
python runner.py \
  --config configs/c2.json \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json \
  --model your-model \
  --apibase http://localhost:3333
```
The sampler name becomes part of the result identifier, enabling systematic comparison of generation strategies across identical test conditions.
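For example, a small driver script could sweep several samplers over the same config and template to produce directly comparable result sets (paths below are illustrative):

```python
import subprocess

# Sweep generation strategies while holding config/template/model constant.
for sampler in ("samplers/greedy-2k.json",
                "samplers/greedy-4k.json",
                "samplers/greedy-max.json"):
    subprocess.run([
        "python", "runner.py",
        "--config", "configs/c2.json",
        "--template", "templates/zerocot-nosys.json",
        "--sampler", sampler,
        "--model", "your-model",
        "--apibase", "http://localhost:3333",
    ], check=True)
```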
Configs: Experiment Configuration Format¶
Experiment configurations define complete evaluation runs with hierarchical sampling strategies and adaptive precision control.
Basic Structure¶
name: "experiment_name"
precision:
[precision_levels...]
tasks:
[task_definitions...]
Precision Levels¶
Precision levels control the dynamic sampling behavior for statistical confidence. Each level defines sampling parameters that can be selected at test execution time.
```yaml
precision:
  low:
    count: 32        # Base sample size per batch
    maxrounds: 6     # Maximum batches (max 192 tests total)
    targetci: 0.09   # Target confidence interval
    abortht: 0.2     # Abort if hit rate exceeds this threshold
  medium:
    count: 64
    targetci: 0.06
    maxrounds: 8     # max 512 tests
    targetciht: 0.1  # Alternative CI target for high-truncation scenarios
    abortht: 0.15
  high:
    count: 128       # max 1280 tests
    targetci: 0.04
    targetciht: 0.06
    abortht: 0.1
```
Precision Parameters¶
- `count`: Number of samples per batch. The system samples in batches to avoid statistical p-hacking.
- `targetci`: Target confidence interval. Sampling continues until this precision is reached.
- `targetciht`: Alternative confidence interval for high-truncation scenarios. Used when the truncation ratio exceeds `2*targetci`.
- `maxrounds`: Maximum number of batches to run (default: 10). Total max samples = `count * maxrounds`.
- `abortht`: Abort threshold for the hit/truncation rate. Stops sampling if this rate is exceeded, to avoid wasting tokens.
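Taken together, the parameters describe a batched sampling loop like the sketch below. This is an illustration of the documented behavior, not the actual implementation: `sample_batch` is a hypothetical callable returning (correct, truncated) counts for one batch, and the confidence interval uses a simple normal approximation:

```python
import math

def run_precision_level(sample_batch, count, targetci, maxrounds=10,
                        abortht=None, targetciht=None):
    hits = trunc = total = 0
    for _ in range(maxrounds):
        h, t = sample_batch(count)                  # one batch of `count` tests
        hits, trunc, total = hits + h, trunc + t, total + count
        p = hits / total
        ci = 1.96 * math.sqrt(p * (1 - p) / total)  # 95% CI half-width
        trunc_rate = trunc / total
        if abortht is not None and trunc_rate > abortht:
            return p, ci, "aborted: truncation rate too high"
        # Relax the target when truncation is high (ratio > 2*targetci)
        target = targetciht if (targetciht and trunc_rate > 2 * targetci) else targetci
        if ci <= target:
            return p, ci, "converged"
    return p, ci, "maxrounds reached"
```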
Tasks¶
Tasks define the evaluation scenarios and their parameter spaces. Each task references a JSON file containing the task logic and specifies how to sample the parameter space.
List Mode¶
Direct specification of parameter combinations as a list of dictionaries:
- name: "boolean_legacy"
file: "tasks/boolean.json"
mode: "list"
params:
- { length: 10, max_depth: 2 }
- { length: 20, max_depth: 4 }
- { length: 40, max_depth: 8 }
- { length: 60, max_depth: 16 }
- { length: 90, max_depth: 32 }
List mode evaluates each parameter combination exactly as specified. While still technically supported, it's primarily used for legacy configurations and simple debugging scenarios.
Grid Mode¶
Direct sweep over parameter combinations via Cartesian product:
- name: "arithmetic_simple"
file: "tasks/arithmetic.json"
mode: "grid"
grid:
min_number: [-9, -99]
max_number: [9, 99]
max_depth: [0, 1, 2, 4]
length: [8, 16, 32]
Grid mode generates all combinations: 2 × 2 × 4 × 3 = 48 parameter combinations.
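The expansion is a plain Cartesian product, equivalent to this sketch:

```python
from itertools import product

grid = {
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))  # 2 * 2 * 4 * 3 = 48
```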
Manifold Mode¶
Hierarchical sampling with degree-based difficulty progression and density control:
```yaml
- name: "arithmetic_adaptive"
  file: "tasks/arithmetic.json"
  mode: "manifold"
  manifolds: [
    {
      "length": {
        "range": [8, 16, 24, 32, 40, 48],
        "window": {"skip": "degree", "body": 4},
        "resample:corner": {"first": 1, "last": 1},
        "resample:lowdef": {"first": 1, "middle": 1, "last": 1}
      },
      "max_depth": {
        "range": [0, 1, 2, 4, 8],
        "window": {"head": 2, "body": "degree"}
      }
    }
  ]
```
Manifold System¶
Manifolds provide sophisticated parameter space sampling with three key concepts.
Degree-Based Progression via Window Sampling¶
The degree parameter (set at execution time) controls difficulty by determining window sizes and sampling patterns. Higher degrees typically increase complexity and parameter coverage.
Each parameter defines a `window` that extracts values from its `range` based on the degree:
"parameter_name": {
"range": [val1, val2, val3, val4, val5, val6],
"window": {
"head": 2, # Always include first 2 values
"skip": "degree", # Skip 'degree' values after head
"body": 4 # Take 4 values after skip
}
}
Window Logic:

1. `head`: Always-included values from the start of the range
2. `skip`: Number of values to skip after the head (supports degree expressions)
3. `body`: Number of values to take after the skip (supports degree expressions)
If `skip + head + body` exceeds the range length, the body values are taken from the end of the range instead.
Degree Expressions:

- Simple integers: `"head": 2`
- Degree-relative: `"skip": "degree"`, `"body": "degree+1"`
- Mathematical: `"skip": "max(0, degree-1)"`, `"body": "2*degree"`
Density Resampling¶
The density parameter (set at execution time) applies secondary sampling to reduce parameter combinations:
"resample:corner": {"first": 1, "last": 1} # Take first and last only
"resample:lowdef": {"first": 1, "middle": 1, "last": 1} # Take first, middle, last
Density Types:

- corner: Samples boundary values (`first` and `last`)
- lowdef: Low-definition sampling (`first`, `middle`, `last`)
- medium: Medium-density sampling (custom combinations)
- normal (no density): Uses all windowed values
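A sketch of how a resample spec might be applied to a windowed value list, following the `first`/`middle`/`last` semantics shown above:

```python
def resample(values: list, spec: dict) -> list:
    picked = list(values[:spec.get("first", 0)])
    if spec.get("middle"):
        mid = len(values) // 2
        picked += values[mid: mid + spec["middle"]]
    if spec.get("last"):
        picked += values[-spec["last"]:]
    # Preserve original order; drop duplicates from overlapping picks
    return sorted(set(picked), key=values.index)

print(resample([8, 24, 32, 40], {"first": 1, "last": 1}))               # corner -> [8, 40]
print(resample([8, 24, 32, 40], {"first": 1, "middle": 1, "last": 1}))  # lowdef -> [8, 32, 40]
```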
Multiple Manifolds¶
Tasks can define multiple manifolds to cover different parameter regions:
```yaml
manifolds: [
  {
    # Manifold 1: Small numbers, variable whitespace
    "min_number": {"range": [-9], "window": {"head": 1}},
    "max_number": {"range": [9], "window": {"head": 1}},
    "prob_dewhitespace": {"range": [0.0, 1.0], "window": {"head": 2}}
  },
  {
    # Manifold 2: Large numbers, fixed whitespace
    "min_number": {"range": [-99], "window": {"head": 1}},
    "max_number": {"range": [99], "window": {"head": 1}},
    "prob_dewhitespace": {"range": [0.5], "window": {"head": 1}}
  }
]
```
Manifold Resolution Example¶
For a manifold parameter:
"length": {
"range": [8, 16, 24, 32, 40, 48],
"window": {"head": 1, "skip": "degree", "body": 3},
"resample:corner": {"first": 1, "last": 1}
}
Resolution at degree=1, density=corner:

1. Window sampling: head=1 ([8]), skip=1, body=3 ([24, 32, 40]) → [8, 24, 32, 40]
2. Density resampling: first=1, last=1 → [8, 40]

Resolution at degree=2, density=normal:

1. Window sampling: head=1 ([8]), skip=2, body=3 ([32, 40, 48]) → [8, 32, 40, 48]
2. No density resampling → [8, 32, 40, 48]
Debugging Manifolds¶
Use the included resolver script to preview manifold resolution:
```bash
python resolver.py config.yaml 1   # Preview at degree=1
```
This shows the concrete grids that will be generated for each density level, helping debug complex manifold definitions before running expensive evaluations.
DataSets: Visualization Display Format¶
Dataset configurations define how processed evaluation data is structured, labeled, and visualized across ReasonScape's analysis tools. They serve as the crucial bridge between raw evaluation results (`bucket.json` files from `evaluate.py`) and the visualization ecosystem (`leaderboard.py`, `explorer.py`, and the comparison tools).
Basic Structure¶
```json
{
  "name": "dataset_identifier",
  "inputs": ["data/experiment/buckets-*.json"],
  "manifolds": { /* difficulty tier definitions */ },
  "scenarios": { /* model+template+sampler combinations */ },
  "basetasks": { /* task-specific visualization configuration */ }
}
```
Core Components¶
1. Input Aggregation¶
```json
{
  "name": "m6",
  "inputs": [
    "data/m6/buckets-*.json"
  ]
}
```
- `inputs`: Glob patterns matching `bucket.json` files from `evaluate.py`
- Aggregation: Multiple bucket files are merged into a unified dataset
- Validation: The system verifies data availability and completeness
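The aggregation step amounts to globbing and merging, roughly as sketched here (the assumption that bucket files are JSON objects whose entries can be merged by key is ours):

```python
import glob
import json

dataset = {"inputs": ["data/m6/buckets-*.json"]}

buckets = {}
for pattern in dataset["inputs"]:
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            buckets.update(json.load(f))  # later files extend/override earlier ones
print(f"{len(buckets)} buckets loaded")
```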
2. Manifold Definitions¶
Manifolds define difficulty groupings that map the degree/density parameters from experiment configs into human-interpretable complexity tiers:
```json
{
  "manifolds": {
    "lowdef+low+0": {
      "label": "(easy)",
      "groups": ["degree0"],
      "points": {
        "objects": 18,
        "arithmetic": 24,
        "dates": 8,
        "boolean": 6,
        "movies": 4,
        "shuffle": 18
      }
    },
    "normal+low+1": {
      "label": "(medium)",
      "groups": ["degree1"],
      "points": {
        "objects": 24,
        "arithmetic": 39,
        "dates": 12,
        "boolean": 20,
        "movies": 24,
        "shuffle": 48
      }
    },
    "normal+low+2": {
      "label": "(hard)",
      "groups": ["degree2"],
      "points": {
        "objects": 24,
        "arithmetic": 39,
        "dates": 16,
        "boolean": 40,
        "movies": 32,
        "shuffle": 64
      }
    }
  }
}
```
Manifold Parameters¶
- Manifold Key: Format `{density}+{precision}+{degree}`, matching the experiment execution parameters
- `label`: Human-readable difficulty description for leaderboard display
- `groups`: Logical groupings for analysis (enables degree0+degree1 = "easy+medium")
- `points`: Expected data points per task, for statistical validation
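As an illustration of how `points` supports statistical validation, a hypothetical check might compare expected counts against what the aggregated data actually contains (the `observed` layout is an assumption):

```python
def validate_points(dataset: dict, observed: dict) -> list[str]:
    """Flag (manifold, task) pairs with fewer data points than expected."""
    problems = []
    for key, manifold in dataset["manifolds"].items():
        for task, expected in manifold["points"].items():
            got = observed.get(key, {}).get(task, 0)
            if got < expected:
                problems.append(f"{key}/{task}: {got}/{expected} points")
    return problems
```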
Degree-Density Mapping¶
The manifold system bridges experiment configuration (configs/) and dataset analysis:
Experiment Execution:

```bash
python runner.py --degree 1 --precision low --density normal
```

Dataset Mapping:

```json
"normal+low+1": { "label": "(medium)", "groups": ["degree1"] }
```
This enables:
- Hierarchical Analysis: Compare easy vs medium vs hard across all models
- Statistical Validation: Verify sufficient data points for confidence intervals
- Progressive Evaluation: Add higher degrees without reconfiguring datasets
3. Scenario Management¶
Scenarios define model+template+sampler combinations with display labels and organizational groupings:
```json
{
  "scenarios": {
    "phi-4-fp16+zerocot-nosys+greedy-4k": {
      "label": "Microsoft Phi-4",
      "groups": ["phi", "microsoft"]
    },
    "gpt-oss-20b+zeroshot-nosys+greedy-4k": {
      "label": "OpenAI GPT-OSS-20B",
      "groups": ["openai", "proprietary"]
    },
    "Meta-Llama-3.1-8B-Instruct-Turbo+zerocot-nosys+greedy-4k": {
      "label": "Meta Llama-3.1-8B",
      "groups": ["meta", "opensource"]
    }
  }
}
```
Scenario Parameters¶
- Scenario Key: Exact format `{model}+{template}+{sampler}`, matching bucket identifiers
- `label`: Human-readable name for leaderboard and UI display
- `groups`: Organizational categories for filtering and comparison analysis
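Constructing and resolving a scenario key is then straightforward, as in this sketch:

```python
scenarios = {
    "phi-4-fp16+zerocot-nosys+greedy-4k": {
        "label": "Microsoft Phi-4",
        "groups": ["phi", "microsoft"],
    },
}

def scenario_key(model: str, template: str, sampler: str) -> str:
    return f"{model}+{template}+{sampler}"  # must match the bucket identifier exactly

key = scenario_key("phi-4-fp16", "zerocot-nosys", "greedy-4k")
print(scenarios[key]["label"])  # Microsoft Phi-4
```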
4. Task Visualization Configuration¶
Each base task defines visualization parameters for projections (1D analysis) and surfaces (2D analysis):
```json
{
  "basetasks": {
    "arithmetic": {
      "label": "Arithmetic",
      "projections": [ /* 1D line analysis */ ],
      "surfaces": [ /* 2D surface analysis */ ]
    }
  }
}
```
Projections (1D Analysis)¶
Projections define controlled parameter sweeps for line-based analysis in `compare_project.py`:
```json
{
  "projections": [
    {
      "label": "Length (depth=0, whitespace=50%)",
      "axis": "length",
      "filter": {
        "min_number": -9,
        "max_number": 9,
        "prob_dewhitespace": 0.5,
        "max_depth": 0
      },
      "values": [8, 16, 32, 48],
      "labels": ["8","16","32","48"]
    }
  ]
}
```
Projection Parameters:

- `label`: Descriptive name for the analysis
- `axis`: Parameter to vary along the projection line
- `filter`: Fixed parameters defining the experimental controls
- `values`: Specific parameter values to include in the analysis
- `labels`: Human-readable labels for axis ticks
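Conceptually, applying a projection filters the aggregated records and reads one value per point along the axis, roughly as below (the flat record layout with an `accuracy` field is an assumption about the processed data):

```python
def project(records: list[dict], projection: dict) -> list:
    flt = projection["filter"]
    controlled = [r for r in records
                  if all(r.get(k) == v for k, v in flt.items())]
    line = []
    for v in projection["values"]:
        pts = [r["accuracy"] for r in controlled if r.get(projection["axis"]) == v]
        line.append(sum(pts) / len(pts) if pts else None)  # None = missing data point
    return line
```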
Surfaces (2D Analysis)¶
Surfaces define two-dimensional parameter sweeps for 3D visualization in `explorer.py` and `compare_surface.py`:
```json
{
  "surfaces": [
    {
      "label": "Length x Depth (Random Whitespace, -9 to 9)",
      "filter": {
        "min_number": -9,
        "max_number": 9,
        "prob_dewhitespace": 0.5
      },
      "x_data": "length",
      "x_title": "Length",
      "x_values": [8, 16, 24, 32, 40, 48],
      "x_labels": ["8","16","24","32","40","48"],
      "y_data": "max_depth",
      "y_title": "Depth",
      "y_values": [0, 1, 4],
      "y_labels": ["0","1","4"]
    }
  ]
}
```
Surface Parameters:

- `label`: Descriptive name for the surface
- `filter`: Fixed parameters defining the surface slice
- `x_data`/`y_data`: Parameters varying along each surface dimension
- `x_title`/`y_title`: Axis labels for visualization
- `x_values`/`y_values`: Specific parameter combinations to include
- `x_labels`/`y_labels`: Human-readable labels for axis ticks
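A surface definition resolves to a 2D grid in the same spirit, sketched here under the same record-layout assumption as the projection example:

```python
def surface_grid(records: list[dict], surface: dict) -> list[list]:
    flt = surface.get("filter", {})
    sliced = [r for r in records if all(r.get(k) == v for k, v in flt.items())]
    grid = []
    for y in surface["y_values"]:
        row = []
        for x in surface["x_values"]:
            pts = [r["accuracy"] for r in sliced
                   if r.get(surface["x_data"]) == x and r.get(surface["y_data"]) == y]
            row.append(sum(pts) / len(pts) if pts else None)
        grid.append(row)
    return grid
```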
Integration with Visualization Tools¶
Leaderboard Integration (`leaderboard.py`)¶
- Manifold Groupings: Create difficulty tiers (easy/medium/hard) for ReasonScore calculation
- Scenario Labels: Display human-readable model names instead of technical identifiers
- Point Validation: Verify sufficient statistical power for confidence intervals
- Group Filtering: Enable model family comparisons
Explorer Integration (`explorer.py`)¶
- Surface Definitions: Generate 3D difficulty manifold visualizations
- Interactive Selection: Map surface grid lines to projection analysis
- Multi-Panel Sync: Coordinate FFT, accuracy, and histogram displays
- Parameter Filtering: Apply surface filters for controlled analysis
Comparison Tools Integration (`compare_project.py`, `compare_surface.py`)¶
- Projection Matrices: Create systematic parameter sweep grids
- Cross-Model Analysis: Compare identical projections across multiple scenarios
- Filter Application: Generate controlled experimental comparisons
- Batch Visualization: Process multiple tasks and models simultaneously
Complete Task Example¶
The Boolean task demonstrates comprehensive projection and surface definitions:
```json
{
  "boolean": {
    "label": "Boolean",
    "projections": [
      {
        "label": "Length (Depth=0, Python Format)",
        "axis": "length",
        "filter": {
          "boolean_format": 0,
          "max_depth": 0
        },
        "values": [8, 16, 24, 40],
        "labels": ["8","16","24","40"]
      },
      {
        "label": "Format (Depth=0, Length=24)",
        "axis": "boolean_format",
        "filter": {
          "length": 24,
          "max_depth": 0
        },
        "values": [0, 1, 2, 3, 4],
        "labels": ["PYTHON","T/F","ON/OFF","BINARY","YES/NO"]
      }
    ],
    "surfaces": [
      {
        "label": "Length x Format (Depth 0)",
        "x_data": "length",
        "x_title": "Length",
        "x_values": [8, 16, 24, 32, 40, 56],
        "x_labels": ["8","16","24","32","40","56"],
        "y_data": "boolean_format",
        "y_title": "Format",
        "y_values": [0, 1, 2, 3, 4],
        "y_labels": ["PYTHON","T/F","ON/OFF","BINARY","YES/NO"]
      }
    ]
  }
}
```
This configuration enables:
- Length Analysis: How boolean expression length affects accuracy in Python format
- Format Analysis: How different boolean notations affect parsing at fixed complexity
- Surface Analysis: Complete 2D exploration of length×format difficulty space
Best Practices¶
Manifold Design¶
- Progressive Difficulty: Ensure degree progression represents meaningful complexity increases
- Statistical Power: Balance point counts with computational costs; if the surface is well-behaved, consider corner sampling
- Complete Coverage: Define lowdef and corner manifolds for all degree/density combinations in your experiment
Scenario Organization¶
- Consistent Naming: Use clear, hierarchical labels (vendor + model + configuration)
- Logical Groupings: Enable meaningful model family comparisons
- Template/Sampler Clarity: Make evaluation configuration explicit in labels
Visualization Optimization¶
- Projection Selection: Choose parameter sweeps that test specific hypotheses
- Surface Boundaries: Ensure surface extents capture interesting failure modes
- Filter Design: Use filters to isolate individual cognitive effects
- Label Clarity: Prioritize human readability over technical precision
Dataset configurations transform raw evaluation data into structured research tools, enabling systematic exploration of AI reasoning capabilities across multiple dimensions of analysis.