Configuration
Templates: Prompting Strategy¶
Templates transform test cases into model inputs, enabling systematic comparison of different reasoning elicitation strategies:
| Template | System Prompt | Turn Structure | Examples | CoT |
|---|---|---|---|---|
| zeroshot | Task description | Single user input | None | No |
| zeroshot-nosys | None | Task + input as user | None | No |
| zerocot-nosys | Task description | Single user input | None | Yes |
| multishot | Task description | Multi-turn examples | Input/answer pairs | No |
| multishot-nosys | None | Multi-turn examples | Task+input/answer pairs | No |
| multishot-cot | Task description | Multi-turn examples | Input/reasoning/answer | Yes |
| unified-cot | None | Single user message | Input/reasoning/answer | Yes |
Templates enable research questions like:
- System Prompt Dependency: How much do models rely on system vs user instructions?
- Few-Shot Effectiveness: Do examples improve performance without reasoning chains?
- Chain-of-Thought Impact: How much do reasoning demonstrations help?
- Turn Structure Effects: Does conversation structure affect reasoning quality?
Samplers: Generation Parameter Control¶
Samplers define the generation parameters used during LLM inference, controlling how models produce responses. ReasonScape includes optimized sampling configurations for different model families and use cases.
Available Samplers¶
| Sampler | Context | Strategy | Description | Use Case |
|---|---|---|---|---|
| `greedy-2k.json` | 2K | Greedy | Deterministic sampling, 2K token limit | Resource-constrained reproducible benchmarking |
| `greedy-4k.json` | 4K | Greedy | Deterministic sampling, 4K token limit | Standard reproducible benchmarking |
| `greedy-8k.json` | 8K | Greedy | Deterministic sampling, 8K token limit | Extended context reproducible benchmarking |
| `greedy-max.json` | - | Greedy | Deterministic sampling, no token limit | Maximum context reproducible benchmarking |
| `magistral-2k.json` | 2K | Magistral | Mistral-recommended parameters, 2K limit | Magistral models, constrained context |
| `magistral-6k.json` | 6K | Magistral | Mistral-recommended parameters, 6K limit | Magistral models, complex reasoning |
| `magistral-8k.json` | 8K | Magistral | Mistral-recommended parameters, 8K limit | Magistral models, extended context |
| `o1-high.json` | - | O1 | High reasoning effort control | OpenAI O1 models, maximum reasoning |
| `o1-medium.json` | - | O1 | Medium reasoning effort control | OpenAI O1 models, balanced reasoning |
| `o1-low.json` | - | O1 | Low reasoning effort control | OpenAI O1 models, minimal reasoning |
| `o1-none.json` | - | O1 | No reasoning effort control | OpenAI O1 models, baseline |
| `qwen3-think-2k.json` | 2K | Qwen3 | Think mode enabled, 2K limit | Qwen models with explicit reasoning |
| `qwen3-think-4k.json` | 4K | Qwen3 | Think mode enabled, 4K limit | Qwen models with explicit reasoning |
| `qwen3-think-max.json` | - | Qwen3 | Think mode enabled, no limit | Qwen models with maximum reasoning |
| `qwen3-nothink-2k.json` | 2K | Qwen3 | Think mode disabled, 2K limit | Qwen models without explicit reasoning |
| `qwen3-nothink-4k.json` | 4K | Qwen3 | Think mode disabled, 4K limit | Qwen models without explicit reasoning |
| `rc-high-4k.json` | 4K | Ruminate | High reasoning intensity | Open-source reasoning control, maximum effort |
| `rc-medium-4k.json` | 4K | Ruminate | Medium reasoning intensity | Open-source reasoning control, balanced |
| `rc-low-4k.json` | 4K | Ruminate | Low reasoning intensity | Open-source reasoning control, minimal |
| `rc-none-2k.json` | 2K | Ruminate | No reasoning control | Open-source baseline, constrained context |
| `rc-none-4k.json` | 4K | Ruminate | No reasoning control | Open-source baseline, standard context |
| `hermes4-think-max.json` | - | Hermes4 | Thinking mode enabled, Chain-of-Thought preservation | Hermes models with explicit reasoning and CoT retention |
| `hunyuan-think-max.json` | - | Hunyuan | Hunyuan-optimized parameters with repetition penalty | Hunyuan models with specialized sampling controls |
| `granite-max.json` | - | Granite | Thinking mode enabled, deterministic sampling | IBM Granite models with reasoning capabilities |
| `magistral-max.json` | - | Magistral | Enhanced Magistral parameters with min_p control | Magistral models with improved sampling diversity |
| `gpt-oss-high.json` | - | GPT-OSS | High reasoning effort control | Open-source GPT models, maximum reasoning |
| `gpt-oss-medium.json` | - | GPT-OSS | Medium reasoning effort control | Open-source GPT models, balanced reasoning |
| `gpt-oss-low.json` | - | GPT-OSS | Low reasoning effort control | Open-source GPT models, minimal reasoning |
| `seedoss-unltd.json` | - | Seed-OSS | Unlimited thinking budget (-1) | Seed-OSS models with unrestricted reasoning |
| `seedoss-6k.json` | 6K | Seed-OSS | Thinking budget 6K tokens | Seed-OSS models with extended reasoning |
| `seedoss-4k.json` | 4K | Seed-OSS | Thinking budget 4K tokens | Seed-OSS models with moderate reasoning |
| `seedoss-2k.json` | 2K | Seed-OSS | Thinking budget 2K tokens | Seed-OSS models with constrained reasoning |
| `seedoss-0k.json` | - | Seed-OSS | Thinking budget disabled (0) | Seed-OSS models with no explicit reasoning |
Sampling Strategy Selection¶
Choose samplers based on your evaluation goals:
| Goal | Recommended Samplers | Rationale |
|---|---|---|
| Reproducible Benchmarking | `greedy-4k`, `greedy-max` | Deterministic results, consistent across runs |
| Mistral Magistral Models | `magistral-6k`, `magistral-8k` | Vendor-optimized parameters |
| OpenAI O1 Models | `o1-medium`, `o1-high` | Model-specific reasoning controls |
| Qwen Models | `qwen3-think-4k`, `qwen3-think-max` | Explicit reasoning mode enabled |
| Hermes Models | `hermes4-think-max` | Explicit reasoning with CoT preservation |
| Hunyuan Models | `hunyuan-think-max` | Specialized sampling with repetition control |
| IBM Granite Models | `granite-max` | Deterministic sampling with reasoning |
| Seed-OSS Models | `seedoss-4k`, `seedoss-unltd` | Thinking budget control (0/2K/4K/6K/unlimited) |
| Resource Constraints | `*-2k` variants | Lower token limits for efficiency |
| Complex Tasks | `*-8k`, `*-max` variants | Extended context for difficult problems |
Sampler File Format¶
All sampler files use JSON format with OpenAI-compatible parameters:
{
"temperature": 0.0,
"top_p": 1.0,
"max_tokens": 4096
}
Additional model-specific parameters may include:
- `reasoning_effort`: For reasoning-capable models
- `repetition_penalty`: For open-source models
- `chat_template_kwargs`: For models supporting special chat template features
- `min_p`: Minimum probability threshold for sampling
- `skip_special_tokens`: Whether to suppress special tokens (including reasoning tags) in output
- `stop_token_ids`: Specific token IDs that stop generation
Model-Specific Parameter Details¶
`chat_template_kwargs`: JSON object containing special chat template parameters:
- `thinking`: Boolean to enable/disable explicit reasoning mode
- `enable_thinking`: Boolean to enable/disable explicit reasoning mode
- `keep_cots`: Boolean to preserve Chain-of-Thought reasoning in output

`min_p`: Float value (0.0-1.0) setting the minimum probability threshold for token sampling, improving output diversity.

`skip_special_tokens`: Boolean controlling whether special tokens are suppressed in the final output. Some models use special tokens for reasoning tags, and suppressing them can break parsing.

`stop_token_ids`: Array of integer token IDs that immediately stop generation when encountered.

`repetition_penalty`: Float value (>1.0) that penalizes repeated tokens, helping reduce repetitive output patterns.
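As a concrete illustration, the sketch below loads one of these sampler files and forwards it to an OpenAI-compatible endpoint, passing non-standard keys through `extra_body`. This is an assumption about how such a file could be consumed, not how `runner.py` is implemented; the file name, endpoint URL, and key split are placeholders.

```python
import json
from openai import OpenAI

# Keys the OpenAI-compatible client accepts as direct arguments; everything else
# (chat_template_kwargs, min_p, repetition_penalty, ...) is assumed here to be
# forwarded via extra_body so the serving backend can interpret it.
STANDARD_KEYS = {"temperature", "top_p", "max_tokens"}

def load_sampler(path):
    """Split a sampler JSON into standard kwargs and backend-specific extras."""
    with open(path) as f:
        params = json.load(f)
    standard = {k: v for k, v in params.items() if k in STANDARD_KEYS}
    extra = {k: v for k, v in params.items() if k not in STANDARD_KEYS}
    return standard, extra

standard, extra = load_sampler("samplers/qwen3-think-4k.json")
client = OpenAI(base_url="http://localhost:3333/v1", api_key="none")  # placeholder endpoint
response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "What is 2 + 2 * 3?"}],
    extra_body=extra,  # e.g. chat_template_kwargs for think-mode control
    **standard,
)
print(response.choices[0].message.content)
```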
Integration with Runner¶
Samplers are specified via the --sampler argument:
python runner.py \
--config configs/c2.json \
--template templates/zerocot-nosys.json \
--sampler samplers/greedy-4k.json \
--model your-model \
--apibase http://localhost:3333
The sampler name becomes part of the result identifier, enabling systematic comparison of generation strategies across identical test conditions.
Configs: Experiment Configuration Format¶
Experiment configurations define complete evaluation runs with hierarchical sampling strategies and adaptive precision control.
Basic Structure¶
name: "experiment_name"
precision:
[precision_levels...]
tasks:
[task_definitions...]
Precision Levels¶
Precision levels control the dynamic sampling behavior for statistical confidence. Each level defines sampling parameters that can be selected at test execution time.
precision:
low:
count: 32 # Base sample size per batch
maxrounds: 6 # Maximum batches (max 192 tests total)
targetci: 0.09 # Target confidence interval
abortht: 0.2 # Abort if hit rate exceeds this threshold
medium:
count: 64
targetci: 0.06
maxrounds: 8 # max 512 tests
targetciht: 0.1 # Alternative CI target for high-truncation scenarios
abortht: 0.15
high:
count: 128 # max 1280 tests
targetci: 0.04
targetciht: 0.06
abortht: 0.1
Precision Parameters¶
- `count`: Number of samples per batch. The system samples in batches to avoid statistical p-hacking.
- `targetci`: Target confidence interval. Sampling continues until this precision is reached.
- `targetciht`: Alternative confidence interval for high-truncation scenarios. Used when the truncation ratio exceeds `2*targetci`.
- `maxrounds`: Maximum number of batches to run (default: 10). Total max samples = `count * maxrounds`.
- `abortht`: Abort threshold for hit/truncation rate. Stops sampling if this rate is exceeded to avoid wasting tokens.
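To make the interaction of these parameters concrete, here is a minimal sketch of the batch-wise loop they describe, using a normal-approximation confidence interval. The actual estimator and stopping logic in ReasonScape may differ; treat this as illustration only.

```python
import math
import random

def run_adaptive(run_batch, count=64, maxrounds=8, targetci=0.06,
                 targetciht=0.1, abortht=0.15):
    """Sample in batches of `count` until the accuracy CI is tight enough,
    the truncation rate crosses abortht, or maxrounds is reached."""
    correct = truncated = total = 0
    for _ in range(maxrounds):
        results = run_batch(count)          # -> list of (is_correct, is_truncated)
        correct += sum(c for c, _ in results)
        truncated += sum(t for _, t in results)
        total += len(results)

        p = correct / total
        ht = truncated / total
        ci = 1.96 * math.sqrt(p * (1 - p) / total)   # normal-approx half-width

        if ht > abortht:                    # too many truncations: stop wasting tokens
            return p, ci, "aborted-truncation"
        target = targetciht if ht > 2 * targetci else targetci
        if ci <= target:                    # precision target reached
            return p, ci, "converged"
    return p, ci, "maxrounds"

# Toy usage: a fake batch runner with ~70% accuracy and ~5% truncation
fake = lambda n: [(random.random() < 0.7, random.random() < 0.05) for _ in range(n)]
print(run_adaptive(fake))
```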
Tasks¶
Tasks define the evaluation scenarios and their parameter spaces. Each task references a JSON file containing the task logic and specifies how to sample the parameter space.
List Mode¶
Direct specification of parameter combinations as a list of dictionaries:
- name: "boolean_legacy"
file: "tasks/boolean.json"
mode: "list"
params:
- { length: 10, max_depth: 2 }
- { length: 20, max_depth: 4 }
- { length: 40, max_depth: 8 }
- { length: 60, max_depth: 16 }
- { length: 90, max_depth: 32 }
List mode evaluates each parameter combination exactly as specified. While still technically supported, it's primarily used for legacy configurations and simple debugging scenarios.
Grid Mode¶
Direct sweep over parameter combinations via Cartesian product:
- name: "arithmetic_simple"
file: "tasks/arithmetic.json"
mode: "grid"
grid:
min_number: [-9, -99]
max_number: [9, 99]
max_depth: [0, 1, 2, 4]
length: [8, 16, 32]
Grid mode generates all combinations: 2 × 2 × 4 × 3 = 48 parameter combinations.
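The same expansion can be reproduced in a few lines of Python; this sketch only illustrates the counting, not the actual config loader.

```python
from itertools import product

grid = {
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}

# Cartesian product of all parameter value lists
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combos))   # 2 * 2 * 4 * 3 = 48
print(combos[0])     # {'min_number': -9, 'max_number': 9, 'max_depth': 0, 'length': 8}
```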
Manifold Mode¶
Hierarchical sampling with degree-based difficulty progression and density control:
- name: "arithmetic_adaptive"
file: "tasks/arithmetic.json"
mode: "manifold"
manifolds: [
{
"length": {
"range": [8, 16, 24, 32, 40, 48],
"window": {"skip": "degree", "body": 4},
"resample:corner": {"first": 1, "last": 1},
"resample:lowdef": {"first": 1, "middle": 1, "last": 1}
},
"max_depth": {
"range": [0, 1, 2, 4, 8],
"window": {"head": 2, "body": "degree"}
}
}
]
Manifold System¶
Manifolds provide sophisticated parameter space sampling with three key concepts.
Degree-Based Progression via Window Sampling¶
The degree parameter (set at execution time) controls difficulty by determining window sizes and sampling patterns. Higher degrees typically increase complexity and parameter coverage.
Each parameter defines a window that extracts values from its range based on the degree:
"parameter_name": {
"range": [val1, val2, val3, val4, val5, val6],
"window": {
"head": 2, # Always include first 2 values
"skip": "degree", # Skip 'degree' values after head
"body": 4 # Take 4 values after skip
}
}
Window Logic:
1. `head`: Values always included from the start of the range
2. `skip`: Number of values to skip after the head (supports degree expressions)
3. `body`: Number of values to take after the skip (supports degree expressions)
If skip + head + body exceeds range length, body values are taken from the end instead.
Degree Expressions:
- Simple integers: "head": 2
- Degree-relative: "skip": "degree", "body": "degree+1"
- Mathematical: "skip": "max(0, degree-1)", "body": "2*degree"
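Putting the window rules together, a rough Python sketch of the resolution logic might look like the following. The expression evaluation and overflow handling are assumptions; `resolver.py` remains the source of truth.

```python
def eval_expr(spec, degree):
    """Resolve an integer or a degree expression like 'degree+1' or 'max(0, degree-1)'."""
    if isinstance(spec, int):
        return spec
    return int(eval(spec, {"__builtins__": {}}, {"degree": degree, "max": max, "min": min}))

def resolve_window(rng, window, degree):
    head = eval_expr(window.get("head", 0), degree)
    skip = eval_expr(window.get("skip", 0), degree)
    body = eval_expr(window.get("body", 0), degree)

    values = rng[:head]                        # head: always included
    start = head + skip                        # skip values after the head
    if start + body <= len(rng):
        values += rng[start:start + body]      # body: next `body` values
    else:
        values += rng[-body:] if body else []  # overflow: take body values from the end
    return values

# Worked example from the Manifold Resolution Example below
length_range = [8, 16, 24, 32, 40, 48]
print(resolve_window(length_range, {"head": 1, "skip": "degree", "body": 3}, degree=1))
# -> [8, 24, 32, 40]
```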
Density Resampling¶
The density parameter (set at execution time) applies secondary sampling to reduce parameter combinations:
"resample:corner": {"first": 1, "last": 1} # Take first and last only
"resample:lowdef": {"first": 1, "middle": 1, "last": 1} # Take first, middle, last
Density Types:
- corner: Samples boundary values (first and last)
- lowdef: Low-definition sampling (first, middle, last)
- medium: Medium-density sampling (custom combinations)
- normal (no density): Uses all windowed values
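Continuing the sketch above, density resampling can be pictured as a second pass over the windowed values. The exact selection rules (especially for `middle` and the custom `medium` combinations) are assumptions here.

```python
def resample(values, spec):
    """Apply a density spec like {'first': 1, 'last': 1} to windowed values.
    Illustrative only -- the real resampler may handle overlaps differently."""
    picked = []
    picked += values[:spec.get("first", 0)]                    # from the start
    if spec.get("middle"):
        mid = len(values) // 2
        picked += values[mid:mid + spec["middle"]]             # around the middle
    if spec.get("last"):
        picked += values[-spec["last"]:]                       # from the end
    # preserve original order, drop duplicates when picks overlap
    return sorted(set(picked), key=values.index)

windowed = [8, 24, 32, 40]
print(resample(windowed, {"first": 1, "last": 1}))               # corner -> [8, 40]
print(resample(windowed, {"first": 1, "middle": 1, "last": 1}))  # lowdef -> [8, 32, 40]
```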
Multiple Manifolds¶
Tasks can define multiple manifolds to cover different parameter regions:
manifolds: [
{
# Manifold 1: Small numbers, variable whitespace
"min_number": {"range": [-9], "window": {"head": 1}},
"max_number": {"range": [9], "window": {"head": 1}},
"prob_dewhitespace": {"range": [0.0, 1.0], "window": {"head": 2}}
},
{
# Manifold 2: Large numbers, fixed whitespace
"min_number": {"range": [-99], "window": {"head": 1}},
"max_number": {"range": [99], "window": {"head": 1}},
"prob_dewhitespace": {"range": [0.5], "window": {"head": 1}}
}
]
Manifold Resolution Example¶
For a manifold parameter:
"length": {
"range": [8, 16, 24, 32, 40, 48],
"window": {"head": 1, "skip": "degree", "body": 3},
"resample:corner": {"first": 1, "last": 1}
}
Resolution at degree=1, density=corner:
1. Window sampling: head=1 (`[8]`), skip=1, body=3 (`[24,32,40]`) → `[8,24,32,40]`
2. Density resampling: first=1, last=1 → `[8,40]`

Resolution at degree=2, density=normal:
1. Window sampling: head=1 (`[8]`), skip=2, body=3 (`[32,40,48]`) → `[8,32,40,48]`
2. No density resampling → `[8,32,40,48]`
Debugging Manifolds¶
Use the included resolver script to preview manifold resolution:
python resolver.py config.yaml 1 # Preview at degree=1
This shows the concrete grids that will be generated for each density level, helping debug complex manifold definitions before running expensive evaluations.
DataSets: Analysis Configuration Format¶
Dataset configurations define how evaluation data is structured, labeled, and analyzed using the DuckDB-backed PointsDB system. They bridge raw evaluation results (NDJSON interviews from runner.py) and the analysis ecosystem (analyze.py, leaderboard.py, explorer.py).
Basic Structure¶
{
"name": "dataset_identifier",
"db": "data/experiment/dataset.db",
"evals": [ /* evaluation configurations */ ],
"tiers": [ /* difficulty tier definitions */ ],
"basetasks": { /* task-specific visualization configuration */ }
}
Core Components¶
1. Database Path¶
{
"name": "m12x",
"db": "data/m12x.db"
}
- `db`: Path to DuckDB database file (replaces `inputs[]` glob patterns)
- Created by: `evaluate.py --dataset` automatically processes `evals[]` and writes to this database
- Contains: All evaluation points with native LIST types for grouped dimensions
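Because the output is a standard DuckDB file, it can also be inspected directly from Python. The snippet below is a hedged example: the table and column names in the query are hypothetical, so check the `SHOW TABLES` / `DESCRIBE` output for the actual schema.

```python
import duckdb

con = duckdb.connect("data/m12x.db", read_only=True)
print(con.sql("SHOW TABLES"))   # inspect what evaluate.py actually created

# Hypothetical query -- 'points', 'model', and 'base_task' are assumed names,
# not a documented schema; adapt them to what DESCRIBE reports.
con.sql("""
    SELECT model, base_task, count(*) AS points
    FROM points
    GROUP BY model, base_task
    ORDER BY model, base_task
""").show()
```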
2. Evaluation Definitions¶
Evals define available (model, template, sampler) combinations with explicit filters, evaluation configuration, and metadata:
{
"evals": [
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*"
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max"
},
"label": "Microsoft Phi-4 (FP16)",
"groups": ["family:phi4", "arch:dense", "size:mid"],
"hf_id": "microsoft/Phi-4",
"hf_quant_id": null
},
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max-ctx8192"
},
"label": "Microsoft Phi-4 (FP16, 8k ctx)",
"groups": ["family:phi4", "arch:dense", "size:mid"],
"hf_id": "microsoft/Phi-4"
}
]
}
Eval Parameters¶
- `evaluate`: Evaluation configuration block
  - `glob`: Glob pattern matching raw NDJSON interview files
  - `context`: (Optional) Simulate lower context by clipping responses
- `filters`: Identity dimensions that uniquely identify this evaluation's points
  - `model`: Model identifier (must match runner.py output)
  - `template`: Template identifier
  - `sampler`: Sampler identifier (auto-suffixed with `-ctx{N}` if context specified)
- `label`: Human-readable name for leaderboards and visualizations
- `groups`: Classification tags for peer comparison and filtering (see Group Taxonomy)
- `hf_id`: Hugging Face model ID for tokenizer loading
- `hf_quant_id`: (Optional) Alternative HF ID for quantized variants
Context Simulation¶
Context downsampling creates separate evaluations with modified sampler field:
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max-ctx8192" // Note the suffix
},
"label": "Phi-4 (8k ctx)"
}
How it works:
1. evaluate.py --dataset processes with --context 8192 --tokenizer HF_ID
2. Clips responses to 8192 total tokens (prompt + completion)
3. Clipped responses marked as truncated, incorrect
4. Sampler field auto-suffixed: greedy-max → greedy-max-ctx8192
5. Stored as separate eval in database with unique eval_id
Use cases:
- Compare the same model at different context windows
- Simulate lower-context deployment from high-context runs
- Study truncation patterns vs native context limits
3. Tier Definitions¶
Tiers define difficulty groupings with explicit filter objects:
{
"tiers": [
{
"filters": {
"degrees": ["0"],
"densities": ["normal"]
},
"label": "easy",
"points": {
"objects": 24,
"arithmetic": 26,
"dates": 8,
"boolean": 8,
"movies": 4,
"shuffle": 18
}
},
{
"filters": {
"degrees": ["1"],
"densities": ["normal"]
},
"label": "medium",
"points": {
"objects": 24,
"arithmetic": 39,
"dates": 12,
"boolean": 20,
"movies": 24,
"shuffle": 48
}
},
{
"filters": {
"degrees": ["2"],
"densities": ["normal"]
},
"label": "hard",
"points": {
"objects": 24,
"arithmetic": 39,
"dates": 16,
"boolean": 40,
"movies": 32,
"shuffle": 64
}
}
]
}
Tier Parameters¶
- `filters`: Filter object defining tier membership
  - `degrees`: List of degree values (stored as VARCHAR[] in database)
  - `densities`: List of density values (stored as VARCHAR[] in database)
- `label`: Human-readable difficulty description
- `points`: Expected data points per task for validation
4. Task Visualization Configuration¶
Each base task defines visualization parameters for projections (1D analysis) and surfaces (2D analysis):
{
"basetasks": {
"arithmetic": {
"label": "Arithmetic",
"projections": [ /* 1D line analysis */ ],
"surfaces": [ /* 2D surface analysis */ ]
}
}
}
Projections (1D Analysis)¶
Projections define controlled parameter sweeps for FFT analysis:
{
"projections": [
{
"id": "arith_length_d0",
"label": "Length (depth=0, whitespace=50%)",
"axis": "length",
"filter": {
"min_number": -9,
"max_number": 9,
"prob_dewhitespace": 0.5,
"max_depth": 0
},
"values": [8, 16, 32, 48],
"labels": ["8","16","32","48"]
}
]
}
Projection Parameters:
- id: Unique projection identifier (stored in projections[] column)
- label: Descriptive name for the analysis
- axis: Parameter to vary along the projection line
- filter: Fixed parameters defining experimental controls
- values: Specific parameter values to include in analysis
- labels: Human-readable labels for axis ticks
Surfaces (2D Analysis)¶
Surfaces define parameter sweeps for 3D visualization:
{
"surfaces": [
{
"id": "arith_len_x_depth",
"label": "Length x Depth (Random Whitespace, -9 to 9)",
"filter": {
"min_number": -9,
"max_number": 9,
"prob_dewhitespace": 0.5
},
"x_data": "length",
"x_title": "Length",
"x_values": [8, 16, 24, 32, 40, 48],
"x_labels": ["8","16","24","32","40","48"],
"y_data": "max_depth",
"y_title": "Depth",
"y_values": [0, 1, 4],
"y_labels": ["0","1","4"]
}
]
}
Surface Parameters:
- id: Unique surface identifier (stored in surfaces[] column)
- label: Descriptive name for the surface
- filter: Fixed parameters defining the surface slice
- x_data/y_data: Parameters varying along each surface dimension
- x_title/y_title: Axis labels for visualization
- x_values/y_values: Specific parameter combinations to include
- x_labels/y_labels: Human-readable labels for axis ticks
Integration with Analysis Tools¶
evaluate.py --dataset - Automated Processing¶
# Process all evaluations and update tags
python evaluate.py --dataset data/dataset-m12x.json
# With parallel bucket workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
What it does:
1. Reads evals[] from config
2. For each eval: processes interview files to DuckDB
3. De-duplicates points by 5D identity
4. Post-processing: updates groups[], surfaces[], projections[] tags
analyze.py - Unified Analysis CLI¶
# List evaluations
python analyze.py evals data/dataset-m12x.json
# Generate leaderboard
python analyze.py scores data/dataset-m12x.json --format markdown
# Cluster analysis by task
python analyze.py cluster data/dataset-m12x.json --split base_task --format png
# Surface visualization
python analyze.py surface data/dataset-m12x.json \
--filters '{"base_task": "arithmetic", "groups": [["arch:moe"]]}' \
--output surfaces.png
# FFT analysis
python analyze.py fft data/dataset-m12x.json \
--filters '{"base_task": ["arithmetic", "shuffle"], "projections": ["arith_length_d0"]}' \
--output fft_comparison.png
leaderboard.py - Interactive Web App¶
python leaderboard.py data/dataset-m12x.json 8050
Uses src/scores.py helper functions and PointsDB for data access.
explorer.py - 3D Interactive Visualization¶
python explorer.py data/dataset-m12x.json 8052
Uses surface/FFT/dance helper functions with PointsDB backend.
Group Taxonomy¶
Groups are classification tags that enable peer comparison and lattice navigation.
The specific group dimensions available depend on the dataset configuration. For example, the m12x dataset uses dimensions like architecture type (arch:), size category (size:), and model family (family:).
See datasets/m12x.md for the m12x dataset's group taxonomy.
Complete Example¶
See datasets/m12x.md and data/dataset-m12x.json for a complete real-world example with 62 evaluations, 3 tiers, and comprehensive surface/projection definitions across 12 tasks.