ReasonScape Evaluator (evaluate.py)

evaluate.py is the unified evaluation tool that processes NDJSON interview files, performs statistical analysis, and manages dataset evaluation workflows. It operates in two distinct modes:

  1. Interview Mode - Process specific interview files (legacy, ad-hoc evaluation)
  2. Dataset Mode - Process entire datasets from configuration (managed workflow)

usage: evaluate.py [-h] (--interview INTERVIEW | --dataset DATASET)
                   [--output OUTPUT] [--offline] [--inplace] [--zstd ZSTD]
                   [--quiet] [--tokenizer TOKENIZER] [--fftsamples FFTSAMPLES]
                   [--context CONTEXT] [--parallel PARALLEL]

Evaluate LLM interview results

options:
  -h, --help            show this help message and exit
  --interview INTERVIEW
                        Path, glob pattern, or comma-separated list of NDJSON files
  --dataset DATASET     Dataset config JSON (processes all evals)
  --quiet               Less output
  --tokenizer TOKENIZER
                        Tokenizer name for FFT analysis (optional)
  --fftsamples FFTSAMPLES
                        Number of samples for FFT (default: 128)
  --context CONTEXT     Clip responses to this context length
  --parallel PARALLEL   Number of parallel workers for bucket processing (default: 8)

Interview mode options:
  --output OUTPUT       Write evaluation results to JSON file (legacy bucket format)
  --offline             Enable answer re-processing
  --inplace             Write results back to NDJSON files (requires --offline)

Dataset mode options:
  --zstd ZSTD          Create .tar.zst archives in this directory

Dataset mode is the standard workflow for managed datasets. It reads dataset configurations, processes all defined evaluations, and maintains a DuckDB database with comprehensive metadata.

Basic Usage

# Process all evaluations in dataset (idempotent - skips existing)
python evaluate.py --dataset data/dataset-m12x.json

# Process with 16 parallel workers for faster compression/FFT
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

# Process and create .tar.zst archives of raw data
python evaluate.py --dataset data/dataset-m12x.json --zstd data/m12x/

What Dataset Mode Does

  1. Discovers Evaluations - Reads evals[] array from dataset config
  2. Processes Missing Evals - Only processes evaluations not yet in database (idempotent)
  3. Writes to DuckDB - Stores points with de-duplication by 5D identity
  4. Computes Compression - Always pre-computes gzip compression data for entropy analysis
  5. Computes FFT (optional) - Analyzes token frequency spectra if tokenizer available
  6. Updates Facets - Applies groups[], surfaces[], and projections[] tags
  7. Creates Archives (optional) - Generates .tar.zst archives with --zstd

Key Features

Idempotent Processing:

  • Safe to re-run without duplicating work
  • Automatically skips evaluations already in database
  • To force re-processing: delete the .db file manually

Automatic Tier Mapping:

  • Maps (degree, density) pairs to semantic tier labels (e.g., "easy", "medium", "hard")
  • Defined in dataset config tiers[] section
  • Multiple tiers can be assigned to the same point if it spans difficulty levels

5D De-duplication:

  • Points are uniquely identified by (model, template, sampler, base_task, params)
  • Buckets with the same identity but different degree/density are merged (see the sketch after this list)
  • Single database point may belong to multiple tiers
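
A minimal sketch of this identity and merge step, assuming buckets are plain dicts parsed from the NDJSON files; the field names and key construction are illustrative, not the actual evaluate.py internals:

def point_identity(bucket):
    # Hypothetical 5D key; 'params' is assumed to be a dict of task parameters
    return (
        bucket["model"],
        bucket["template"],
        bucket["sampler"],
        bucket["base_task"],
        tuple(sorted(bucket["params"].items())),
    )

buckets = []  # parsed bucket dicts would be collected here
merged = {}
for bucket in buckets:
    # Buckets sharing an identity (but differing in degree/density) collapse into one point
    merged.setdefault(point_identity(bucket), []).append(bucket)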

Parallel Bucket Processing:

  • --parallel N spawns N workers for compression/FFT computation
  • Default: 8 workers (a sensible default for typical multi-core machines)
  • Scales nearly linearly up to ~16 workers
  • Most beneficial for large datasets (1000+ buckets)

Always-On Compression:

  • Gzip compression data computed for every sample automatically
  • No flag needed - always available for entropy analysis
  • Minimal storage overhead (~16 bytes per sample)
  • Enables instant compression/hazard analysis later

Dataset Configuration

Dataset configs live in data/dataset-*.json (see the Dataset Configuration Reference for the complete specification) and look like this:

{
  "name": "m12x",
  "db": "data/m12x.db",
  "evals": [
    {
      "evaluate": {
        "glob": "data/m12x/*phi-4-fp16*/*",
        "context": 8192  // optional: simulate lower context
      },
      "filters": {
        "model": "phi-4-fp16",
        "template": "zerocot-nosys",
        "sampler": "greedy-max"
      },
      "label": "Microsoft Phi-4 (FP16)",
      "groups": ["family:phi4", "arch:dense", "size:mid"],
      "hf_id": "microsoft/Phi-4",
      "hf_quant_id": null
    }
  ],
  "tiers": [
    {
      "filters": {"degrees": ["0"], "densities": ["normal"]},
      "label": "easy"
    },
    {
      "filters": {"degrees": ["1"], "densities": ["normal"]},
      "label": "medium"
    }
  ]
}

Context Simulation

You can simulate lower context limits from high-context evaluations:

{
  "evaluate": {
    "glob": "data/m12x/*phi-4-fp16*/*",
    "context": 8192
  },
  "filters": {
    "model": "phi-4-fp16",
    "template": "zerocot-nosys",
    "sampler": "greedy-max-ctx8192"  // Auto-suffixed sampler
  }
}

How it works:

  • Uses the same raw interview data as the base eval
  • Clips responses to fit within the context limit
  • Marks clipped samples as truncated and incorrect
  • Stored as a separate eval with a unique eval_id
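
A rough sketch of the clipping step, assuming per-sample prompt/answer fields and a loaded tokenizer; the field names and exact clipping rule are assumptions, not the verbatim implementation:

def clip_sample(sample, tokenizer, context_limit):
    # Field names ('prompt', 'answer', 'truncated', 'correct') are illustrative
    prompt_tokens = tokenizer.encode(sample["prompt"])
    completion_tokens = tokenizer.encode(sample["answer"])
    if len(prompt_tokens) + len(completion_tokens) > context_limit:
        # The full response would not have fit: mark it truncated and incorrect
        sample["truncated"] = True
        sample["correct"] = False
    return sample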

Use cases:

  • Compare a model at different context windows
  • Simulate lower-context deployment from high-context runs
  • Study truncation patterns vs native context limits


Interview Mode (Legacy)

Interview mode processes specific interview files and writes results to JSON buckets (legacy format) or displays statistics. Use this for ad-hoc evaluation or when you need JSON output.

Basic Usage

# Process specific files and display statistics
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*'

# Write to JSON bucket format
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*' \
  --output buckets.json

# Process with FFT analysis
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*' \
  --tokenizer microsoft/Phi-4 \
  --output buckets.json

Interview Mode Features

Glob Patterns:

# Single pattern
--interview 'data/m12x/*phi-4*/*'

# Comma-separated patterns
--interview 'data/m12x/*phi-4*/*,data/m12x/*llama*/*'

Answer Re-processing (--offline):

# Re-extract and validate answers using current task logic
python evaluate.py --interview 'results/**/*.ndjson' \
  --offline --output buckets.json

Useful for fixing parsing bugs without re-running evaluations.

In-place Updates (--inplace):

# Write corrected answers back to NDJSON files
python evaluate.py --interview 'results/**/*.ndjson' \
  --offline --inplace

Warning: Modifies original NDJSON files. Use with caution.

Context Clipping:

# Simulate 8K context from 16K evaluation
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*' \
  --tokenizer microsoft/Phi-4 \
  --context 8192 \
  --output buckets-8k.json

Requires --tokenizer for accurate token counting.


Statistical Processing

The evaluator performs several statistical computations:

Excess Accuracy Correction

Traditional benchmarks suffer from guessing inflation. ReasonScape removes expected guessing successes:

# For each sample, compute its guess probability
# (0.25 for a 4-choice question, 0.0 for write-in answers with no options)
guess_chances = [1 / sample['num_options'] if sample['num_options'] else 0.0
                 for sample in samples]  # field name illustrative

# Adjust aggregated statistics by the expected number of lucky guesses
adjusted_successes = correct_count - sum(guess_chances)
adjusted_trials = total_count - sum(guess_chances)

# Compute a Wilson confidence interval on the adjusted values
adjusted_center, adjusted_margin = wilson_ci(adjusted_successes, adjusted_trials)

Result: 0.000 = "no better than guessing", 1.000 = "perfect knowledge"

Truncation Tracking

Truncations are not wrong answers—they're context limit failures:

  • Tracked separately from incorrect answers
  • Excluded from accuracy calculations
  • Visible in truncated_ratio metric
  • Models aren't penalized, but limits are visible

Effect on confidence intervals:

  • Fewer valid samples → wider confidence intervals
  • Honest uncertainty representation
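
A hedged sketch of how these rules interact, using made-up per-sample flags (the evaluator's actual field names may differ):

# Toy samples; the 'correct' and 'truncated' flags are illustrative
samples = [
    {"correct": True,  "truncated": False},
    {"correct": False, "truncated": False},
    {"correct": False, "truncated": True},   # hit the context limit
]

valid = [s for s in samples if not s["truncated"]]
truncated_ratio = (len(samples) - len(valid)) / len(samples)

# Accuracy (and the Wilson CI) is computed over valid samples only,
# so heavy truncation means fewer trials and a wider interval
correct_count = sum(1 for s in valid if s["correct"])
print(correct_count, len(valid), truncated_ratio)   # 1 2 0.333...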

Compression Pre-Computation

Compression data is computed for every sample automatically:

import gzip

# For each sample, gzip-compress the response text and record the size
reasoning_trace = sample['answer']
compressed_size = len(gzip.compress(reasoning_trace.encode()))

# Stored as parallel arrays on the point
point.completion_tokens_list = [45, 52, 67, ...]
point.compressed_sizes_list = [89, 102, 134, ...]
point.answer_status_list = [0, 0, 1, ...]  # 0=correct, 1=incorrect, 2=truncated

Benefits:

  • 10-100x speedup for compression analysis
  • Minimal storage overhead (~16 bytes per sample)
  • Enables instant entropy/hazard analysis
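
As one illustration of what such analysis can look like, the sketch below derives a compressed-bits-per-token figure from the parallel arrays shown above; the metric and its use here are illustrative, not necessarily what analyze.py computes:

# Values taken from the example arrays above
completion_tokens_list = [45, 52, 67]
compressed_sizes_list = [89, 102, 134]
answer_status_list = [0, 0, 1]   # 0=correct, 1=incorrect, 2=truncated

# Compressed bits per completion token, per sample (illustrative entropy-style metric)
bits_per_token = [8 * size / tokens
                  for size, tokens in zip(compressed_sizes_list, completion_tokens_list)]

# Split by outcome, e.g. to compare correct vs incorrect traces
correct_bpt = [b for b, s in zip(bits_per_token, answer_status_list) if s == 0]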

FFT Analysis (Optional)

When --tokenizer is provided, the evaluator performs frequency analysis:

import numpy as np

# Tokenize the prompt + completion together
tokens = tokenizer.encode(prompt + completion)

# Apply FFT to analyze patterns in the token-ID sequence
fft_spectrum = np.fft.fft(tokens)

# Store mean and std of the spectra across samples
point.fft_mean_list = [mean_freq_0, mean_freq_1, ...]
point.fft_std_list = [std_freq_0, std_freq_1, ...]

Purpose: Identify tokenization artifacts affecting performance
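
A hedged sketch of how per-sample spectra might be aggregated into fft_mean/fft_std; the fixed-length resampling to --fftsamples and the use of magnitude spectra are assumptions about the implementation:

import numpy as np

token_sequences = [[101, 2023, 2003, 257], [101, 1045, 2572, 999, 42]]  # placeholder token IDs

spectra = []
for tokens in token_sequences:
    fixed = np.resize(np.asarray(tokens, dtype=float), 128)   # --fftsamples default length
    spectra.append(np.abs(np.fft.fft(fixed)))                 # magnitude spectrum

fft_mean = np.mean(spectra, axis=0)
fft_std = np.std(spectra, axis=0)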

Wilson Confidence Intervals

95% confidence intervals are calculated using the Wilson score method (see the sketch after this list):

  • More accurate than normal approximation for small samples
  • Properly handles edge cases (0% or 100% accuracy)
  • Provides honest uncertainty bounds
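
A minimal sketch of the Wilson score interval in the (center, margin) form used above; the helper name mirrors the wilson_ci call in the pseudocode, and the clamping of adjusted (possibly negative) success counts is an assumption about edge-case handling:

import math

def wilson_ci(successes, trials, z=1.96):
    # Illustrative helper, not necessarily the exact evaluate.py implementation
    if trials <= 0:
        return 0.0, 0.0
    p = min(1.0, max(0.0, successes / trials))   # clamp: adjusted successes can dip below 0
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return center, margin

print(wilson_ci(85, 100))   # roughly (0.84, 0.07)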

Parallel Bucket Processing

The --parallel option enables multi-worker processing:

# Process with 16 workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

What it does:

  • Spawns N worker processes using ProcessPoolExecutor
  • Each worker processes buckets independently (see the sketch after this list)
  • Workers have separate tokenizer instances (cached per-process)
  • Progress bar shows overall completion
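
A minimal sketch of this worker pattern, with placeholder buckets and no real tokenizer loaded; the function names are illustrative, not the actual evaluate.py internals:

from concurrent.futures import ProcessPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=1)
def get_tokenizer(name):
    # Cached per worker process, so each worker loads the tokenizer at most once
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name)

def process_bucket(job):
    bucket, tokenizer_name = job
    tokenizer = get_tokenizer(tokenizer_name) if tokenizer_name else None
    # ... per-bucket compression / FFT work would happen here ...
    return bucket["id"]

if __name__ == "__main__":
    jobs = [({"id": i}, None) for i in range(32)]   # placeholder buckets, no tokenizer
    with ProcessPoolExecutor(max_workers=8) as pool:
        for done in pool.map(process_bucket, jobs):
            pass   # a progress bar would update here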

Performance:

  • Near-linear speedup up to ~16 workers (CPU-bound)
  • Most beneficial for large datasets (1000+ buckets)
  • Recommended: --parallel $(nproc) for maximum throughput

When to skip:

  • Small datasets (<100 buckets) - overhead outweighs benefit
  • Low core count (2-4 cores) - minimal speedup
  • Memory-constrained systems - each worker loads tokenizer

Output Formats

DuckDB Database (Dataset Mode)

Single database file with native LIST types:

data/m12x.db

Contains:

  • All evaluation points with de-duplication
  • Facet dimensions as VARCHAR[] (tiers, groups, surfaces, projections)
  • Compression data as INTEGER[] (answer_status, tokens, sizes)
  • FFT data as DOUBLE[] (fft_mean, fft_std)
  • Statistical measures (adjusted accuracy, Wilson CI)

Access via:

  • analyze.py - Unified analysis interface
  • src.points_db.PointsDB - Python API
  • Direct SQL queries via the DuckDB CLI (see the sketch after this list)
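
For quick interactive inspection, the DuckDB Python package can open the same file; the table name below is an assumption, so discover the real schema first rather than relying on these names:

import duckdb

con = duckdb.connect("data/m12x.db", read_only=True)
print(con.execute("SHOW TABLES").fetchall())        # discover the actual table names
print(con.execute("DESCRIBE points").fetchall())    # assumes a table named 'points'
con.close()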

JSON Buckets (Interview Mode)

Legacy bucket format for backward compatibility:

{
  "buckets": {
    "model+template+sampler+density+precision+degree+base_task+task": {
      "btype": "point",
      "total": 100,
      "correct": 85,
      "adjusted_center": 0.847,
      "adjusted_margin": 0.035,
      "compression": [[0, 45, 89], [0, 52, 102], ...],
      "fft": {
        "avg_spectrum": [...],
        "std_spectrum": [...]
      }
    }
  }
}

Use cases:

  • Ad-hoc evaluation without database setup
  • Legacy analysis scripts expecting the JSON format
  • Exploratory data analysis
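
A small sketch for ad-hoc inspection of the bucket format shown above, using only keys that appear in the example:

import json

with open("buckets.json") as f:
    data = json.load(f)

for key, bucket in data["buckets"].items():
    if bucket.get("btype") == "point":
        print(f"{key}: {bucket['adjusted_center']:.3f} "
              f"+/- {bucket['adjusted_margin']:.3f} over {bucket['total']} samples")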


Troubleshooting

No interview files found

Error: No files found matching pattern 'data/m12x/*phi-4*/*'

Fix:

  1. Check that the pattern matches actual files: ls data/m12x/*phi-4*/*
  2. Verify interview data exists in the expected location
  3. Update the glob pattern in the dataset config or on the command line

Missing tokenizer

Error loading tokenizer microsoft/Phi-4

Fix:

# Install tokenizer
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('microsoft/Phi-4')"

Or specify an alternative in the dataset config:

{
  "hf_id": "microsoft/Phi-4",
  "hf_quant_id": "microsoft/Phi-4"  // fallback for quantized models
}