ReasonScape Evaluator (evaluate.py)¶
evaluate.py is the unified evaluation tool that processes NDJSON interview files, performs statistical analysis, and manages dataset evaluation workflows. It operates in two distinct modes:
- Interview Mode - Process specific interview files (legacy, ad-hoc evaluation)
- Dataset Mode - Process entire datasets from configuration (managed workflow)
usage: evaluate.py [-h] (--interview INTERVIEW | --dataset DATASET)
[--output OUTPUT] [--offline] [--inplace] [--zstd ZSTD]
[--quiet] [--tokenizer TOKENIZER] [--fftsamples FFTSAMPLES]
[--context CONTEXT] [--parallel PARALLEL]
Evaluate LLM interview results
options:
-h, --help show this help message and exit
--interview INTERVIEW
Path, glob pattern, or comma-separated list of NDJSON files
--dataset DATASET Dataset config JSON (processes all evals)
--quiet Less output
--tokenizer TOKENIZER
Tokenizer name for FFT analysis (optional)
--fftsamples FFTSAMPLES
Number of samples for FFT (default: 128)
--context CONTEXT Clip responses to this context length
--parallel PARALLEL Number of parallel workers for bucket processing (default: 8)
Interview mode options:
--output OUTPUT Write evaluation results to JSON file (legacy bucket format)
--offline Enable answer re-processing
--inplace Write results back to NDJSON files (requires --offline)
Dataset mode options:
--zstd ZSTD Create .tar.zst archives in this directory
Dataset Mode (Recommended)¶
Dataset mode is the standard workflow for managed datasets. It reads dataset configurations, processes all defined evaluations, and maintains a DuckDB database with comprehensive metadata.
Basic Usage¶
# Process all evaluations in dataset (idempotent - skips existing)
python evaluate.py --dataset data/dataset-m12x.json
# Process with 16 parallel workers for faster compression/FFT
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
# Process and create .tar.zst archives of raw data
python evaluate.py --dataset data/dataset-m12x.json --zstd data/m12x/
What Dataset Mode Does¶
- Discovers Evaluations - Reads the evals[] array from the dataset config
- Processes Missing Evals - Only processes evaluations not yet in the database (idempotent)
- Writes to DuckDB - Stores points with de-duplication by 5D identity
- Computes Compression - Always pre-computes gzip compression data for entropy analysis
- Computes FFT (optional) - Analyzes token frequency spectra if a tokenizer is available
- Updates Facets - Applies groups[], surfaces[], and projections[] tags
- Creates Archives (optional) - Generates .tar.zst archives with --zstd
Key Features¶
Idempotent Processing:
- Safe to re-run without duplicating work
- Automatically skips evaluations already in database
- To force re-processing: delete the .db file manually
Automatic Tier Mapping:
- Maps (degree, density) pairs to semantic tier labels (e.g., "easy", "medium", "hard")
- Defined in the dataset config tiers[] section
- Multiple tiers can be assigned to the same point if it spans difficulty levels (see the sketch below)
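The mapping itself is simple to picture; here is a minimal Python sketch, assuming the tiers[] structure from the config example further down (the helper name tier_labels is hypothetical, not part of evaluate.py):
def tier_labels(degree, density, tiers):
    # Collect every tier label whose filters match this (degree, density) pair
    labels = []
    for tier in tiers:
        f = tier["filters"]
        if str(degree) in f.get("degrees", []) and density in f.get("densities", []):
            labels.append(tier["label"])
    return labels

tiers = [
    {"filters": {"degrees": ["0"], "densities": ["normal"]}, "label": "easy"},
    {"filters": {"degrees": ["1"], "densities": ["normal"]}, "label": "medium"},
]
print(tier_labels(0, "normal", tiers))  # ['easy']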
5D De-duplication:
- Points uniquely identified by (model, template, sampler, base_task, params)
- Buckets with same identity but different degree/density are merged
- A single database point may belong to multiple tiers (see the sketch below)
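A rough sketch of the grouping idea, assuming bucket dicts keyed by the five identity fields (point_identity, the params serialization, and the example task name are hypothetical; the actual schema lives in src.points_db.PointsDB):
import json
from collections import defaultdict

def point_identity(bucket):
    # Serialize params so the identity tuple stays hashable (illustrative choice)
    return (
        bucket["model"],
        bucket["template"],
        bucket["sampler"],
        bucket["base_task"],
        json.dumps(bucket.get("params", {}), sort_keys=True),
    )

buckets = [
    {"model": "phi-4-fp16", "template": "zerocot-nosys", "sampler": "greedy-max",
     "base_task": "arithmetic", "degree": 0, "density": "normal"},
    {"model": "phi-4-fp16", "template": "zerocot-nosys", "sampler": "greedy-max",
     "base_task": "arithmetic", "degree": 1, "density": "normal"},
]

# Buckets sharing an identity but differing in degree/density collapse together
points = defaultdict(list)
for b in buckets:
    points[point_identity(b)].append(b)
print(len(points))  # 1 - both buckets merge into a single database point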
Parallel Bucket Processing:
- --parallel N spawns N workers for compression/FFT computation
- Default: 8 workers (matches modern hardware)
- Scales nearly linearly up to ~16 workers
- Most beneficial for large datasets (1000+ buckets)
Always-On Compression:
- Gzip compression data computed for every sample automatically
- No flag needed - always available for entropy analysis
- Minimal storage overhead (~16 bytes per sample)
- Enables instant compression/hazard analysis later
Dataset Configuration¶
Dataset configs live in data/dataset-*.json and define the database, evaluations, and tiers for a managed dataset (see the Dataset Configuration Reference for the complete specification):
{
"name": "m12x",
"db": "data/m12x.db",
"evals": [
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192 // optional: simulate lower context
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max"
},
"label": "Microsoft Phi-4 (FP16)",
"groups": ["family:phi4", "arch:dense", "size:mid"],
"hf_id": "microsoft/Phi-4",
"hf_quant_id": null
}
],
"tiers": [
{
"filters": {"degrees": ["0"], "densities": ["normal"]},
"label": "easy"
},
{
"filters": {"degrees": ["1"], "densities": ["normal"]},
"label": "medium"
}
]
}
Context Simulation¶
You can simulate lower context limits from high-context evaluations:
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max-ctx8192" // Auto-suffixed sampler
}
}
How it works:
- Uses the same raw interview data as the base eval
- Clips responses to fit within the context limit
- Marks clipped samples as truncated and incorrect
- Stored as a separate eval with a unique eval_id
Use cases:
- Compare a model at different context windows
- Simulate lower-context deployment from high-context runs
- Study truncation patterns vs native context limits
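Conceptually, the clipping step works like the sketch below, assuming a Hugging Face-style tokenizer with encode/decode; the function name clip_to_context is hypothetical and the real logic in evaluate.py may count tokens differently:
def clip_to_context(prompt, completion, tokenizer, context_limit):
    # Reserve whatever the prompt already consumes, then clip the completion
    prompt_len = len(tokenizer.encode(prompt))
    completion_tokens = tokenizer.encode(completion)
    budget = max(context_limit - prompt_len, 0)
    if len(completion_tokens) <= budget:
        return completion, False
    # Clipped samples are marked truncated (and scored incorrect)
    return tokenizer.decode(completion_tokens[:budget]), True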
Interview Mode (Legacy)¶
Interview mode processes specific interview files and writes results to JSON buckets (legacy format) or displays statistics. Use this for ad-hoc evaluation or when you need JSON output.
Basic Usage¶
# Process specific files and display statistics
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*'
# Write to JSON bucket format
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*' \
--output buckets.json
# Process with FFT analysis
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*' \
--tokenizer microsoft/Phi-4 \
--output buckets.json
Interview Mode Features¶
Glob Patterns:
# Single pattern
--interview 'data/m12x/*phi-4*/*'
# Comma-separated patterns
--interview 'data/m12x/*phi-4*/*,data/m12x/*llama*/*'
Answer Re-processing (--offline):
# Re-extract and validate answers using current task logic
python evaluate.py --interview 'results/**/*.ndjson' \
--offline --output buckets.json
Useful for fixing parsing bugs without re-running evaluations.
In-place Updates (--inplace):
# Write corrected answers back to NDJSON files
python evaluate.py --interview 'results/**/*.ndjson' \
--offline --inplace
Warning: Modifies original NDJSON files. Use with caution.
Context Clipping:
# Simulate 8K context from 16K evaluation
python evaluate.py --interview 'data/m12x/*phi-4-fp16*/*' \
--tokenizer microsoft/Phi-4 \
--context 8192 \
--output buckets-8k.json
Requires --tokenizer for accurate token counting.
Statistical Processing¶
The evaluator performs several statistical computations:
Excess Accuracy Correction¶
Traditional benchmarks suffer from guessing inflation. ReasonScape removes expected guessing successes:
# For each sample, compute its guess probability
guess_chance = 1 / num_options   # e.g., 0.25 for 4-choice, 0.0 for write-in

# Adjust aggregated statistics by subtracting the expected guessing successes
# (guess_chance summed over all samples) from both counts
adjusted_successes = correct_count - total_guess_chance
adjusted_trials = total_count - total_guess_chance

# Compute Wilson confidence interval on the adjusted values
adjusted_center, adjusted_margin = wilson_ci(adjusted_successes, adjusted_trials)
Result: 0.000 = "no better than guessing", 1.000 = "perfect knowledge"
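For example, 60 correct answers out of 100 four-choice samples:
# Worked example: 60/100 correct on 4-choice questions (guess_chance = 0.25 each)
adjusted_successes = 60 - 100 * 0.25            # 35.0
adjusted_trials = 100 - 100 * 0.25              # 75.0
print(adjusted_successes / adjusted_trials)     # ~0.467, versus a raw accuracy of 0.600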
Truncation Tracking¶
Truncations are not wrong answers—they're context limit failures:
- Tracked separately from incorrect answers
- Excluded from accuracy calculations
- Visible in the truncated_ratio metric
- Models aren't penalized, but limits are visible
Effect on confidence intervals:
- Fewer valid samples → wider confidence intervals
- Honest uncertainty representation
Compression Pre-Computation¶
Compression data computed for every sample automatically:
import gzip

# For each sample, gzip-compress the response text and record its size
reasoning_trace = sample['answer']
compressed_size = len(gzip.compress(reasoning_trace.encode()))
# Stored as parallel arrays
point.completion_tokens_list = [45, 52, 67, ...]
point.compressed_sizes_list = [89, 102, 134, ...]
point.answer_status_list = [0, 0, 1, ...] # 0=correct, 1=incorrect, 2=truncated
Benefits:
- 10-100x speedup for compression analysis
- Minimal storage overhead (~16 bytes per sample)
- Enables instant entropy/hazard analysis
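As an illustration of what the stored parallel arrays enable, here is a hedged sketch that derives a per-sample compression ratio (the helper name compression_ratios is hypothetical; ReasonScape's own entropy/hazard analysis may use different measures):
def compression_ratios(completion_tokens_list, compressed_sizes_list):
    # Gzip bytes per completion token; higher values suggest less repetitive traces
    return [
        size / tokens if tokens else 0.0
        for tokens, size in zip(completion_tokens_list, compressed_sizes_list)
    ]

print(compression_ratios([45, 52, 67], [89, 102, 134]))  # ~[1.98, 1.96, 2.0]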
FFT Analysis (Optional)¶
When --tokenizer provided, performs frequency analysis:
import numpy as np

# Tokenize prompt + completion into one token-id sequence
tokens = tokenizer.encode(prompt + completion)

# Apply FFT to analyze token-sequence patterns
fft_spectrum = np.fft.fft(tokens)
# Store mean and std across samples
point.fft_mean_list = [mean_freq_0, mean_freq_1, ...]
point.fft_std_list = [std_freq_0, std_freq_1, ...]
Purpose: Identify tokenization artifacts affecting performance
Wilson Confidence Intervals¶
95% confidence intervals calculated using Wilson score method:
- More accurate than normal approximation for small samples
- Properly handles edge cases (0% or 100% accuracy)
- Provides honest uncertainty bounds
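For reference, the standard Wilson score interval can be sketched as below (a minimal illustration; evaluate.py's own wilson_ci may handle adjusted, non-integer counts and edge cases differently):
import math

def wilson_ci(successes, trials, z=1.96):
    # Wilson score interval returned as (center, margin)
    if trials <= 0:
        return 0.0, 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center, margin

print(wilson_ci(35, 75))  # adjusted counts from the excess-accuracy example above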
Parallel Bucket Processing¶
The --parallel option enables multi-worker processing:
# Process with 16 workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
What it does:
- Spawns N worker processes using ProcessPoolExecutor
- Each worker processes buckets independently
- Workers have separate tokenizer instances (cached per-process)
- Progress bar shows overall completion
Performance:
- Near-linear speedup up to ~16 workers (CPU-bound)
- Most beneficial for large datasets (1000+ buckets)
- Recommended: --parallel $(nproc) for maximum throughput
When to skip:
- Small datasets (<100 buckets) - overhead outweighs benefit
- Low core count (2-4 cores) - minimal speedup
- Memory-constrained systems - each worker loads tokenizer
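The fan-out roughly follows the pattern sketched here, assuming one task per bucket (process_bucket and get_tokenizer are hypothetical stand-ins for the real worker code):
from concurrent.futures import ProcessPoolExecutor, as_completed
from functools import lru_cache

@lru_cache(maxsize=1)
def get_tokenizer(name):
    # Cached per worker process, so each worker loads the tokenizer only once
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained(name)

def process_bucket(bucket, tokenizer_name=None):
    tokenizer = get_tokenizer(tokenizer_name) if tokenizer_name else None
    # ... compression / FFT computation for this bucket goes here ...
    return bucket

def process_all(buckets, workers=8):
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_bucket, b) for b in buckets]
        return [f.result() for f in as_completed(futures)]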
Output Formats¶
DuckDB Database (Dataset Mode)¶
Single database file with native LIST types:
data/m12x.db
Contains:
- All evaluation points with de-duplication
- Facet dimensions as VARCHAR[] (tiers, groups, surfaces, projections)
- Compression data as INTEGER[] (answer_status, tokens, sizes)
- FFT data as DOUBLE[] (fft_mean, fft_std)
- Statistical measures (adjusted accuracy, Wilson CI)
Access via:
- analyze.py - Unified analysis interface
- src.points_db.PointsDB - Python API
- Direct SQL queries via DuckDB CLI
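For direct SQL access, something like the following works with the duckdb Python package; the table and column names used here (points, tiers, adjusted_center, adjusted_margin) are assumptions for illustration, so check the PointsDB Architecture reference for the actual schema:
import duckdb

con = duckdb.connect("data/m12x.db", read_only=True)
rows = con.execute("""
    SELECT model, base_task, adjusted_center, adjusted_margin
    FROM points
    WHERE list_contains(tiers, 'easy')
    ORDER BY adjusted_center DESC
""").fetchall()
for model, task, center, margin in rows:
    print(f"{model:30s} {task:20s} {center:.3f} +/- {margin:.3f}")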
JSON Buckets (Interview Mode)¶
Legacy bucket format for backward compatibility:
{
"buckets": {
"model+template+sampler+density+precision+degree+base_task+task": {
"btype": "point",
"total": 100,
"correct": 85,
"adjusted_center": 0.847,
"adjusted_margin": 0.035,
"compression": [[0, 45, 89], [0, 52, 102], ...],
"fft": {
"avg_spectrum": [...],
"std_spectrum": [...]
}
}
}
}
Use cases:
- Ad-hoc evaluation without database setup
- Legacy analysis scripts expecting JSON format
- Exploratory data analysis
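A small sketch for consuming the legacy format, using the field names from the example above:
import json

with open("buckets.json") as f:
    data = json.load(f)

for key, bucket in data["buckets"].items():
    if bucket.get("btype") != "point":
        continue
    print(f"{key}: {bucket['adjusted_center']:.3f} +/- {bucket['adjusted_margin']:.3f} "
          f"({bucket['correct']}/{bucket['total']})")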
Troubleshooting¶
No interview files found¶
Error: No files found matching pattern 'data/m12x/*phi-4*/*'
Fix:
1. Check pattern matches actual files: ls data/m12x/*phi-4*/*
2. Verify interview data exists in expected location
3. Update glob pattern in dataset config or command line
Missing tokenizer¶
Error loading tokenizer microsoft/Phi-4
Fix:
# Download and cache the tokenizer
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('microsoft/Phi-4')"
Or specify alternative in dataset config:
{
"hf_id": "microsoft/Phi-4",
"hf_quant_id": "microsoft/Phi-4" // fallback for quantized models
}
Related Documentation¶
- Adding an Evaluation - Complete guide to adding new models
- Dataset Configuration - Dataset config format reference
- analyze.py Tool - Analysis and discovery interface
- PointsDB Architecture - Database schema and API