Configuration

Templates: Prompting Strategy

Templates transform test cases into model inputs, enabling systematic comparison of different reasoning elicitation strategies:

| Template | System Prompt | Turn Structure | Examples | CoT |
|----------|---------------|----------------|----------|-----|
| zeroshot | Task description | Single user input | None | No |
| zeroshot-nosys | None | Task + input as user | None | No |
| zerocot-nosys | None | Task + input as user | None | Yes |
| multishot | Task description | Multi-turn examples | Input/answer pairs | No |
| multishot-nosys | None | Multi-turn examples | Task+input/answer pairs | No |
| multishot-cot | Task description | Multi-turn examples | Input/reasoning/answer | Yes |
| unified-cot | None | Single user message | Input/reasoning/answer | Yes |
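
As a concrete illustration, here is roughly what the rendered chat messages might look like for a single test case under zeroshot versus multishot-cot. This is a hypothetical sketch of the message structure only; the actual content is produced by the template JSON files.

// zeroshot: task description as system prompt, single user input
[
  {"role": "system", "content": "Evaluate the boolean expression."},
  {"role": "user", "content": "true AND (false OR true)"}
]

// multishot-cot: input/reasoning/answer examples precede the test input
[
  {"role": "system", "content": "Evaluate the boolean expression."},
  {"role": "user", "content": "false OR true"},
  {"role": "assistant", "content": "One operand is true, so the OR is true. Answer: true"},
  {"role": "user", "content": "true AND (false OR true)"}
]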

Templates enable research questions like:

  • System Prompt Dependency: How much do models rely on system vs user instructions?
  • Few-Shot Effectiveness: Do examples improve performance without reasoning chains?
  • Chain-of-Thought Impact: How much do reasoning demonstrations help?
  • Turn Structure Effects: Does conversation structure affect reasoning quality?

Samplers: Generation Parameter Control

Samplers define the generation parameters used during LLM inference, controlling how models produce responses. ReasonScape includes optimized sampling configurations for different model families and use cases.

Available Samplers

| Sampler | Context | Strategy | Description | Use Case |
|---------|---------|----------|-------------|----------|
| greedy-2k.json | 2K | Greedy | Deterministic sampling, 2K token limit | Resource-constrained reproducible benchmarking |
| greedy-4k.json | 4K | Greedy | Deterministic sampling, 4K token limit | Standard reproducible benchmarking |
| greedy-8k.json | 8K | Greedy | Deterministic sampling, 8K token limit | Extended context reproducible benchmarking |
| greedy-max.json | - | Greedy | Deterministic sampling, no token limit | Maximum context reproducible benchmarking |
| magistral-2k.json | 2K | Magistral | Mistral-recommended parameters, 2K limit | Magistral models, constrained context |
| magistral-6k.json | 6K | Magistral | Mistral-recommended parameters, 6K limit | Magistral models, complex reasoning |
| magistral-8k.json | 8K | Magistral | Mistral-recommended parameters, 8K limit | Magistral models, extended context |
| o1-high.json | - | O1 | High reasoning effort control | OpenAI O1 models, maximum reasoning |
| o1-medium.json | - | O1 | Medium reasoning effort control | OpenAI O1 models, balanced reasoning |
| o1-low.json | - | O1 | Low reasoning effort control | OpenAI O1 models, minimal reasoning |
| o1-none.json | - | O1 | No reasoning effort control | OpenAI O1 models, baseline |
| qwen3-think-2k.json | 2K | Qwen3 | Think mode enabled, 2K limit | Qwen models with explicit reasoning |
| qwen3-think-4k.json | 4K | Qwen3 | Think mode enabled, 4K limit | Qwen models with explicit reasoning |
| qwen3-think-max.json | - | Qwen3 | Think mode enabled, no limit | Qwen models with maximum reasoning |
| qwen3-nothink-2k.json | 2K | Qwen3 | Think mode disabled, 2K limit | Qwen models without explicit reasoning |
| qwen3-nothink-4k.json | 4K | Qwen3 | Think mode disabled, 4K limit | Qwen models without explicit reasoning |
| rc-high-4k.json | 4K | Ruminate | High reasoning intensity | Open-source reasoning control, maximum effort |
| rc-medium-4k.json | 4K | Ruminate | Medium reasoning intensity | Open-source reasoning control, balanced |
| rc-low-4k.json | 4K | Ruminate | Low reasoning intensity | Open-source reasoning control, minimal |
| rc-none-2k.json | 2K | Ruminate | No reasoning control | Open-source baseline, constrained context |
| rc-none-4k.json | 4K | Ruminate | No reasoning control | Open-source baseline, standard context |
| hermes4-think-max.json | - | Hermes4 | Thinking mode enabled, Chain-of-Thought preservation | Hermes models with explicit reasoning and CoT retention |
| hunyuan-think-max.json | - | Hunyuan | Hunyuan-optimized parameters with repetition penalty | Hunyuan models with specialized sampling controls |
| granite-max.json | - | Granite | Thinking mode enabled, deterministic sampling | IBM Granite models with reasoning capabilities |
| magistral-max.json | - | Magistral | Enhanced Magistral parameters with min_p control | Magistral models with improved sampling diversity |
| gpt-oss-high.json | - | GPT-OSS | High reasoning effort control | Open-source GPT models, maximum reasoning |
| gpt-oss-medium.json | - | GPT-OSS | Medium reasoning effort control | Open-source GPT models, balanced reasoning |
| gpt-oss-low.json | - | GPT-OSS | Low reasoning effort control | Open-source GPT models, minimal reasoning |
| seedoss-unltd.json | - | Seed-OSS | Unlimited thinking budget (-1) | Seed-OSS models with unrestricted reasoning |
| seedoss-6k.json | 6K | Seed-OSS | Thinking budget 6K tokens | Seed-OSS models with extended reasoning |
| seedoss-4k.json | 4K | Seed-OSS | Thinking budget 4K tokens | Seed-OSS models with moderate reasoning |
| seedoss-2k.json | 2K | Seed-OSS | Thinking budget 2K tokens | Seed-OSS models with constrained reasoning |
| seedoss-0k.json | - | Seed-OSS | Thinking budget disabled (0) | Seed-OSS models with no explicit reasoning |

Sampling Strategy Selection

Choose samplers based on your evaluation goals:

| Goal | Recommended Samplers | Rationale |
|------|----------------------|-----------|
| Reproducible Benchmarking | greedy-4k, greedy-max | Deterministic results, consistent across runs |
| Mistral Magistral Models | magistral-6k, magistral-8k | Vendor-optimized parameters |
| OpenAI O1 Models | o1-medium, o1-high | Model-specific reasoning controls |
| Qwen Models | qwen3-think-4k, qwen3-think-max | Explicit reasoning mode enabled |
| Hermes Models | hermes4-think-max | Explicit reasoning with CoT preservation |
| Hunyuan Models | hunyuan-think-max | Specialized sampling with repetition control |
| IBM Granite Models | granite-max | Deterministic sampling with reasoning |
| Seed-OSS Models | seedoss-4k, seedoss-unltd | Thinking budget control (0/2K/4K/6K/unlimited) |
| Resource Constraints | *-2k variants | Lower token limits for efficiency |
| Complex Tasks | *-8k, *-max variants | Extended context for difficult problems |

Sampler File Format

All sampler files use JSON format with OpenAI-compatible parameters:

{
  "temperature": 0.0,
  "top_p": 1.0,
  "max_tokens": 4096
}

Additional model-specific parameters may include:

  • reasoning_effort: For reasoning-capable models
  • repetition_penalty: For open-source models
  • chat_template_kwargs: For models supporting special chat template features
  • min_p: Minimum probability threshold for sampling
  • skip_special_tokens: Whether to suppress special tokens (including reasoning tags) in output
  • stop_token_ids: Specific token IDs that stop generation

Model-Specific Parameter Details

chat_template_kwargs: JSON object containing special chat template parameters:

  • thinking: Boolean to enable/disable explicit reasoning mode
  • enable_thinking: Boolean to enable/disable explicit reasoning mode
  • keep_cots: Boolean to preserve Chain-of-Thought reasoning in output

min_p: Float value (0.0-1.0) setting a minimum probability threshold for token sampling: candidate tokens whose probability falls below the threshold (relative to the most likely token) are excluded, filtering out unlikely continuations while preserving sampling diversity.

skip_special_tokens: Boolean controlling whether special tokens are suppressed in the final output. Some models use special tokens for reasoning tags, and suppressing them can break parsing.

stop_token_ids: Array of integer token IDs that will immediately stop generation when encountered.

repetition_penalty: Float value (>1.0) that penalizes repeated tokens, helping reduce repetitive output patterns.
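
Putting these together, a reasoning-mode sampler file might combine the base parameters with model-specific controls as follows. This is a hypothetical sketch using only the parameters documented above; the concrete values in the shipped samplers/ files will differ.

{
  "temperature": 0.6,
  "top_p": 0.95,
  "min_p": 0.05,
  "max_tokens": 4096,
  "repetition_penalty": 1.05,
  "skip_special_tokens": false,
  "chat_template_kwargs": { "enable_thinking": true }
}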

Integration with Runner

Samplers are specified via the --sampler argument:

python runner.py \
    --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json \
    --model your-model \
    --apibase http://localhost:3333

The sampler name becomes part of the result identifier, enabling systematic comparison of generation strategies across identical test conditions.

Configs: Experiment Configuration Format

Experiment configurations define complete evaluation runs with hierarchical sampling strategies and adaptive precision control.

Basic Structure

name: "experiment_name"

precision:
  [precision_levels...]

tasks:
  [task_definitions...]

Precision Levels

Precision levels control the dynamic sampling behavior for statistical confidence. Each level defines sampling parameters that can be selected at test execution time.

precision:
  low:
    count: 32           # Base sample size per batch
    maxrounds: 6        # Maximum batches (max 192 tests total)
    targetci: 0.09      # Target confidence interval
    abortht: 0.2        # Abort if hit rate exceeds this threshold
  medium:
    count: 64
    targetci: 0.06
    maxrounds: 8        # max 512 tests
    targetciht: 0.1     # Alternative CI target for high-truncation scenarios
    abortht: 0.15
  high:
    count: 128          # max 1280 tests
    targetci: 0.04
    targetciht: 0.06
    abortht: 0.1

Precision Parameters

  • count: Number of samples per batch. The system samples in batches, evaluating stopping criteria only between batches, to avoid the statistical bias of continuous peeking (p-hacking); the resulting control loop is sketched below.
  • targetci: Target confidence interval. Sampling continues until this precision is reached.
  • targetciht: Alternative confidence-interval target for high-truncation scenarios. Used when the truncation ratio exceeds 2*targetci.
  • maxrounds: Maximum number of batches to run (default: 10). Total max samples = count * maxrounds.
  • abortht: Abort threshold for the hit/truncation rate. Stops sampling if this rate is exceeded, to avoid wasting tokens.
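
The control loop implied by these parameters can be sketched as follows. This is a simplified model, assuming a hypothetical run_batch() callable and a normal-approximation confidence interval; the actual runner's statistics may differ.

import math

def adaptive_sample(level, run_batch):
    """Sample in batches until the CI target, abort threshold, or round limit is hit.

    level: dict with count, targetci, maxrounds, abortht, and optional targetciht.
    run_batch(n): hypothetical callable returning (correct, truncated) counts for n tests.
    """
    total = wins = truncated = 0
    for _ in range(level.get("maxrounds", 10)):
        ok, trunc = run_batch(level["count"])
        total += level["count"]; wins += ok; truncated += trunc
        if truncated / total > level["abortht"]:
            return None                                  # excessive truncation: stop wasting tokens
        p = wins / total
        ci = 1.96 * math.sqrt(p * (1 - p) / total)       # 95% CI half-width (assumed form)
        # relax the target when the truncation ratio is high (> 2 * targetci)
        high_trunc = truncated / total > 2 * level["targetci"]
        target = level.get("targetciht", level["targetci"]) if high_trunc else level["targetci"]
        if ci <= target:
            break
    return wins / total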

Tasks

Tasks define the evaluation scenarios and their parameter spaces. Each task references a JSON file containing the task logic and specifies how to sample the parameter space.

List Mode

Direct specification of parameter combinations as a list of dictionaries:

- name: "boolean_legacy"
  file: "tasks/boolean.json"
  mode: "list"
  params:
    - { length: 10, max_depth: 2 }
    - { length: 20, max_depth: 4 }
    - { length: 40, max_depth: 8 }
    - { length: 60, max_depth: 16 }
    - { length: 90, max_depth: 32 }

List mode evaluates each parameter combination exactly as specified. While still technically supported, it's primarily used for legacy configurations and simple debugging scenarios.

Grid Mode

Direct sweep over parameter combinations via Cartesian product:

- name: "arithmetic_simple"
  file: "tasks/arithmetic.json"
  mode: "grid"
  grid:
    min_number: [-9, -99]
    max_number: [9, 99]
    max_depth: [0, 1, 2, 4]
    length: [8, 16, 32]

Grid mode generates all combinations: 2 × 2 × 4 × 3 = 48 parameter combinations.
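
The expansion is a plain Cartesian product over the grid axes, equivalent to:

import itertools

grid = {
    "min_number": [-9, -99],
    "max_number": [9, 99],
    "max_depth": [0, 1, 2, 4],
    "length": [8, 16, 32],
}
combos = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
print(len(combos))  # 2 * 2 * 4 * 3 = 48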

Manifold Mode

Hierarchical sampling with degree-based difficulty progression and density control:

- name: "arithmetic_adaptive"
  file: "tasks/arithmetic.json"
  mode: "manifold"
  manifolds: [
    {
      "length": {
        "range": [8, 16, 24, 32, 40, 48],
        "window": {"skip": "degree", "body": 4},
        "resample:corner": {"first": 1, "last": 1},
        "resample:lowdef": {"first": 1, "middle": 1, "last": 1}
      },
      "max_depth": {
        "range": [0, 1, 2, 4, 8],
        "window": {"head": 2, "body": "degree"}
      }
    }
  ]

Manifold System

Manifolds provide sophisticated parameter space sampling with three key concepts.

Degree-Based Progression via Window Sampling

The degree parameter (set at execution time) controls difficulty by determining window sizes and sampling patterns. Higher degrees typically increase complexity and parameter coverage.

Each parameter defines a window that extracts values from its range based on the degree:

"parameter_name": {
  "range": [val1, val2, val3, val4, val5, val6],
  "window": {
    "head": 2,           # Always include first 2 values
    "skip": "degree",    # Skip 'degree' values after head
    "body": 4            # Take 4 values after skip
  }
}

Window Logic:

  1. head: Always-included values from the start of the range
  2. skip: Number of values to skip after head (supports degree expressions)
  3. body: Number of values to take after skip (supports degree expressions)

If head + skip + body exceeds the range length, the body values are taken from the end of the range instead.

Degree Expressions:

  • Simple integers: "head": 2
  • Degree-relative: "skip": "degree", "body": "degree+1"
  • Mathematical: "skip": "max(0, degree-1)", "body": "2*degree"

Density Resampling

The density parameter (set at execution time) applies secondary sampling to reduce parameter combinations:

"resample:corner": {"first": 1, "last": 1}           # Take first and last only
"resample:lowdef": {"first": 1, "middle": 1, "last": 1}  # Take first, middle, last

Density Types:

  • corner: Samples boundary values (first and last)
  • lowdef: Low-definition sampling (first, middle, last)
  • medium: Medium-density sampling (custom combinations)
  • normal (no density): Uses all windowed values
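
A minimal Python sketch of how window sampling and density resampling compose, assuming exactly the semantics described above. The authoritative implementation is resolver.py; details such as expression evaluation and overlap handling may differ.

def resolve_window(values, window, degree):
    # Evaluate plain ints or degree expressions like "degree+1" or "max(0, degree-1)"
    def ev(x):
        return x if isinstance(x, int) else eval(str(x), {"max": max, "min": min, "degree": degree})
    head, skip, body = (ev(window.get(k, 0)) for k in ("head", "skip", "body"))
    out = list(values[:head])
    if body:
        start = head + skip
        # if the window runs past the range, take body values from the end instead
        out += list(values[-body:]) if start + body > len(values) else list(values[start:start + body])
    return out

def resample(values, spec):
    # spec like {"first": 1, "middle": 1, "last": 1}
    out = list(values[:spec.get("first", 0)])
    if "middle" in spec:
        mid = len(values) // 2
        out += values[mid:mid + spec["middle"]]
    if spec.get("last"):
        out += values[-spec["last"]:]
    return out

# Matches the Manifold Resolution Example below:
win = resolve_window([8, 16, 24, 32, 40, 48], {"head": 1, "skip": "degree", "body": 3}, degree=1)
print(win)                                      # [8, 24, 32, 40]
print(resample(win, {"first": 1, "last": 1}))   # [8, 40]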

Multiple Manifolds

Tasks can define multiple manifolds to cover different parameter regions:

manifolds: [
  {
    # Manifold 1: Small numbers, variable whitespace
    "min_number": {"range": [-9], "window": {"head": 1}},
    "max_number": {"range": [9], "window": {"head": 1}},
    "prob_dewhitespace": {"range": [0.0, 1.0], "window": {"head": 2}}
  },
  {
    # Manifold 2: Large numbers, fixed whitespace
    "min_number": {"range": [-99], "window": {"head": 1}},
    "max_number": {"range": [99], "window": {"head": 1}},
    "prob_dewhitespace": {"range": [0.5], "window": {"head": 1}}
  }
]

Manifold Resolution Example

For a manifold parameter:

"length": {
  "range": [8, 16, 24, 32, 40, 48],
  "window": {"head": 1, "skip": "degree", "body": 3},
  "resample:corner": {"first": 1, "last": 1}
}

Resolution at degree=1, density=corner:

  1. Window sampling: head=1 ([8]), skip=1, body=3 ([24,32,40]) → [8,24,32,40]
  2. Density resampling: first=1, last=1 → [8,40]

Resolution at degree=2, density=normal:

  1. Window sampling: head=1 ([8]), skip=2, body=3 ([32,40,48]) → [8,32,40,48]
  2. No density resampling → [8,32,40,48]

Debugging Manifolds

Use the included resolver script to preview manifold resolution:

python resolver.py config.yaml 1  # Preview at degree=1

This shows the concrete grids that will be generated for each density level, helping debug complex manifold definitions before running expensive evaluations.

DataSets: Analysis Configuration Format

Dataset configurations define how evaluation data is structured, labeled, and analyzed using the DuckDB-backed PointsDB system. They bridge raw evaluation results (NDJSON interviews from runner.py) and the analysis ecosystem (analyze.py, leaderboard.py, explorer.py).

Basic Structure

{
  "name": "dataset_identifier",
  "db": "data/experiment/dataset.db",
  "evals": [ /* evaluation configurations */ ],
  "tiers": [ /* difficulty tier definitions */ ],
  "basetasks": { /* task-specific visualization configuration */ }
}

Core Components

1. Database Path

{
  "name": "m12x",
  "db": "data/m12x.db"
}
  • db: Path to DuckDB database file (replaces inputs[] glob patterns)
  • Created by: evaluate.py --dataset automatically processes evals[] and writes to this database
  • Contains: All evaluation points with native LIST types for grouped dimensions

2. Evaluation Definitions

Evals define available (model, template, sampler) combinations with explicit filters, evaluation configuration, and metadata:

{
  "evals": [
    {
      "evaluate": {
        "glob": "data/m12x/*phi-4-fp16*/*"
      },
      "filters": {
        "model": "phi-4-fp16",
        "template": "zerocot-nosys",
        "sampler": "greedy-max"
      },
      "label": "Microsoft Phi-4 (FP16)",
      "groups": ["family:phi4", "arch:dense", "size:mid"],
      "hf_id": "microsoft/Phi-4",
      "hf_quant_id": null
    },
    {
      "evaluate": {
        "glob": "data/m12x/*phi-4-fp16*/*",
        "context": 8192
      },
      "filters": {
        "model": "phi-4-fp16",
        "template": "zerocot-nosys",
        "sampler": "greedy-max-ctx8192"
      },
      "label": "Microsoft Phi-4 (FP16, 8k ctx)",
      "groups": ["family:phi4", "arch:dense", "size:mid"],
      "hf_id": "microsoft/Phi-4"
    }
  ]
}

Eval Parameters

  • evaluate: Evaluation configuration block
  • glob: Glob pattern matching raw NDJSON interview files
  • context: (Optional) Simulate lower context by clipping responses
  • filters: Identity dimensions that uniquely identify this evaluation's points
  • model: Model identifier (must match runner.py output)
  • template: Template identifier
  • sampler: Sampler identifier (auto-suffixed with -ctx{N} if context specified)
  • label: Human-readable name for leaderboards and visualizations
  • groups: Classification tags for peer comparison and filtering (see Group Taxonomy)
  • hf_id: Hugging Face model ID for tokenizer loading
  • hf_quant_id: (Optional) Alternative HF ID for quantized variants

Context Simulation

Context downsampling creates separate evaluations with a modified sampler field:

{
  "evaluate": {
    "glob": "data/m12x/*phi-4-fp16*/*",
    "context": 8192
  },
  "filters": {
    "model": "phi-4-fp16",
    "template": "zerocot-nosys",
    "sampler": "greedy-max-ctx8192"  // Note the suffix
  },
  "label": "Phi-4 (8k ctx)"
}

How it works:

  1. evaluate.py --dataset processes the files with --context 8192 --tokenizer HF_ID
  2. Responses are clipped to 8192 total tokens (prompt + completion)
  3. Clipped responses are marked as truncated and scored incorrect
  4. The sampler field is auto-suffixed: greedy-max → greedy-max-ctx8192
  5. The result is stored as a separate eval in the database with a unique eval_id

Use cases:

  • Compare the same model at different context windows
  • Simulate lower-context deployment from high-context runs
  • Study truncation patterns vs native context limits
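
A sketch of the clipping rule, assuming a Hugging Face tokenizer; the actual marking and storage logic belongs to evaluate.py.

from transformers import AutoTokenizer

def clip_to_context(prompt, completion, context, hf_id):
    """Simulate a smaller context: prompt + completion must fit within `context` tokens."""
    tok = AutoTokenizer.from_pretrained(hf_id)
    prompt_len = len(tok.encode(prompt))
    completion_ids = tok.encode(completion)
    budget = max(context - prompt_len, 0)
    if len(completion_ids) <= budget:
        return completion, False   # fits: response unchanged, not truncated
    clipped = tok.decode(completion_ids[:budget])
    return clipped, True           # truncated: marked incorrect downstream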

3. Tier Definitions

Tiers define difficulty groupings with explicit filter objects:

{
  "tiers": [
    {
      "filters": {
        "degrees": ["0"],
        "densities": ["normal"]
      },
      "label": "easy",
      "points": {
        "objects": 24,
        "arithmetic": 26,
        "dates": 8,
        "boolean": 8,
        "movies": 4,
        "shuffle": 18
      }
    },
    {
      "filters": {
        "degrees": ["1"],
        "densities": ["normal"]
      },
      "label": "medium",
      "points": {
        "objects": 24,
        "arithmetic": 39,
        "dates": 12,
        "boolean": 20,
        "movies": 24,
        "shuffle": 48
      }
    },
    {
      "filters": {
        "degrees": ["2"],
        "densities": ["normal"]
      },
      "label": "hard",
      "points": {
        "objects": 24,
        "arithmetic": 39,
        "dates": 16,
        "boolean": 40,
        "movies": 32,
        "shuffle": 64
      }
    }
  ]
}

Tier Parameters

  • filters: Filter object defining tier membership
      • degrees: List of degree values (stored as VARCHAR[] in the database)
      • densities: List of density values (stored as VARCHAR[] in the database)
  • label: Human-readable difficulty description
  • points: Expected data points per task, used for validation (see the sketch below)
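
For example, a validation pass can compare observed per-task point counts against a tier's points map. A trivial sketch (the observed counts here are hypothetical; real counts come from the database):

expected = {"objects": 24, "arithmetic": 26, "dates": 8, "boolean": 8, "movies": 4, "shuffle": 18}
observed = {"objects": 24, "arithmetic": 25, "dates": 8, "boolean": 8, "movies": 4, "shuffle": 18}

for task, want in expected.items():
    got = observed.get(task, 0)
    if got != want:
        print(f"tier 'easy': task {task} has {got} points, expected {want}")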

4. Task Visualization Configuration

Each base task defines visualization parameters for projections (1D analysis) and surfaces (2D analysis):

{
  "basetasks": {
    "arithmetic": {
      "label": "Arithmetic",
      "projections": [ /* 1D line analysis */ ],
      "surfaces": [ /* 2D surface analysis */ ]
    }
  }
}

Projections (1D Analysis)

Projections define controlled parameter sweeps for FFT analysis:

{
  "projections": [
    {
      "id": "arith_length_d0",
      "label": "Length (depth=0, whitespace=50%)",
      "axis": "length",
      "filter": {
        "min_number": -9,
        "max_number": 9,
        "prob_dewhitespace": 0.5,
        "max_depth": 0
      },
      "values": [8, 16, 32, 48],
      "labels": ["8","16","32","48"]
    }
  ]
}

Projection Parameters:

  • id: Unique projection identifier (stored in the projections[] column)
  • label: Descriptive name for the analysis
  • axis: Parameter to vary along the projection line
  • filter: Fixed parameters defining experimental controls
  • values: Specific parameter values to include in the analysis
  • labels: Human-readable labels for axis ticks
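
Because projection IDs are stored in the projections[] list column, analysis code can pull a projection's points straight from the database. A hypothetical DuckDB query sketch; the table and metric column names here are assumptions, and the real access layer is PointsDB (src/scores.py):

import duckdb

con = duckdb.connect("data/m12x.db", read_only=True)
rows = con.execute("""
    SELECT eval_id, length, count(*) AS n
    FROM points  -- table name assumed
    WHERE base_task = 'arithmetic'
      AND list_contains(projections, 'arith_length_d0')
    GROUP BY eval_id, length
    ORDER BY eval_id, length
""").fetchall()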

Surfaces (2D Analysis)

Surfaces define parameter sweeps for 3D visualization:

{
  "surfaces": [
    {
      "id": "arith_len_x_depth",
      "label": "Length x Depth (Random Whitespace, -9 to 9)",
      "filter": {
        "min_number": -9,
        "max_number": 9,
        "prob_dewhitespace": 0.5
      },
      "x_data": "length",
      "x_title": "Length",
      "x_values": [8, 16, 24, 32, 40, 48],
      "x_labels": ["8","16","24","32","40","48"],
      "y_data": "max_depth",
      "y_title": "Depth",
      "y_values": [0, 1, 4],
      "y_labels": ["0","1","4"]
    }
  ]
}

Surface Parameters:

  • id: Unique surface identifier (stored in the surfaces[] column)
  • label: Descriptive name for the surface
  • filter: Fixed parameters defining the surface slice
  • x_data / y_data: Parameters varying along each surface dimension
  • x_title / y_title: Axis labels for visualization
  • x_values / y_values: Specific parameter combinations to include
  • x_labels / y_labels: Human-readable labels for axis ticks

Integration with Analysis Tools

evaluate.py --dataset - Automated Processing

# Process all evaluations and update tags
python evaluate.py --dataset data/dataset-m12x.json

# With parallel bucket workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

What it does:

  1. Reads evals[] from the config
  2. For each eval: processes interview files into DuckDB
  3. De-duplicates points by 5D identity
  4. Post-processing: updates groups[], surfaces[], projections[] tags

analyze.py - Unified Analysis CLI

# List evaluations
python analyze.py evals data/dataset-m12x.json

# Generate leaderboard
python analyze.py scores data/dataset-m12x.json --format markdown

# Cluster analysis by task
python analyze.py cluster data/dataset-m12x.json --split base_task --format png

# Surface visualization
python analyze.py surface data/dataset-m12x.json \
  --filters '{"base_task": "arithmetic", "groups": [["arch:moe"]]}' \
  --output surfaces.png

# FFT analysis
python analyze.py fft data/dataset-m12x.json \
  --filters '{"base_task": ["arithmetic", "shuffle"], "projections": ["arith_length_d0"]}' \
  --output fft_comparison.png

leaderboard.py - Interactive Web App

python leaderboard.py data/dataset-m12x.json 8050

Uses src/scores.py helper functions and PointsDB for data access.

explorer.py - 3D Interactive Visualization

python explorer.py data/dataset-m12x.json 8052

Uses surface/FFT/dance helper functions with PointsDB backend.

Group Taxonomy

Groups are classification tags that enable peer comparison and lattice navigation.

The specific group dimensions available depend on the dataset configuration. For example, the m12x dataset uses dimensions like architecture type (arch:), size category (size:), and model family (family:).

See datasets/m12x.md for the m12x dataset's group taxonomy.

Complete Example

See datasets/m12x.md and data/dataset-m12x.json for a complete real-world example with 62 evaluations, 3 tiers, and comprehensive surface/projection definitions across 12 tasks.