ReasonScape Analysis Tool - Complete Documentation¶
Table of Contents¶
- Overview
- Quick Start
- Analysis Workflow
- Subcommand Reference
- Filter System
- Output Formats
- Requirements
- Error Handling
- Getting Help
Overview¶
The analyze.py script provides a unified interface to all ReasonScape analysis operations. It replaces the scattered V1 scripts with a consistent, agent-friendly design backed by DuckDB for efficient data access.
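For reference, the DuckDB file named by a config's db field can also be opened directly with the duckdb Python package. The short sketch below assumes only that the field exists (see Configuration Requirements); it makes no assumptions about table names or schema.
# Peek at the DuckDB database behind a dataset config (read-only).
import json
import duckdb
with open("data/dataset-micro-minimax.json") as f:
    db_path = json.load(f)["db"]
con = duckdb.connect(db_path, read_only=True)
print(con.execute("SHOW TABLES").fetchall())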
Analysis Flow Tiers¶
The ReasonScape analysis flow is organized into tiers, each answering a specific question:
- evals - What evaluations are available? Which tiers have data? (evaluation discovery)
- tasks - What task structures exist? What surfaces and projections? (task discovery)
- scores - Which models struggle overall? (aggregate rankings with fair-sort)
- spiderweb - What's the per-task breakdown? (per-model diagnostic)
- cluster - Which models are statistically indistinguishable? (CI-overlap clustering)
- surface - Where do failures occur in parameter space? (3D visualization)
- fft - Is it tokenizer or model capability? (frequency domain analysis)
- compression - What's the information-theoretic mechanism? (entropy analysis)
- hazard - When do failures occur during generation? (survival analysis)
- modelinfo - What are the model architectures? (metadata collection)
Quick Start¶
Basic Syntax¶
python analyze.py <subcommand> <config> [options]
Quick Examples¶
# List available evaluations
python analyze.py evals data/dataset-micro-minimax.json
# Generate leaderboard
python analyze.py scores data/dataset-micro-minimax.json --format png --output leaderboard.png
# Generate spider plots
python analyze.py spiderweb data/dataset-micro-minimax.json --format webpng
# Cluster analysis
python analyze.py cluster data/dataset-micro-minimax.json --stack base_task --format png
# Surface visualization
python analyze.py surface data/dataset-micro-minimax.json
# FFT analysis
python analyze.py fft data/dataset-micro-minimax.json
# Compression analysis
python analyze.py compression data/dataset-micro-minimax.json --output-dir compression/
# Hazard analysis
python analyze.py hazard data/dataset-micro-minimax.json --output-dir hazard/
# Fetch model metadata
python analyze.py modelinfo data/dataset-micro-minimax.json --output-dir metadata/
Analysis Workflow¶
Tier 1: Discovery¶
Start with evals to see what evaluations are available and which tiers have data:
# List all evaluations with tier availability
python analyze.py evals data/dataset.json
# Search for specific model
python analyze.py evals data/dataset.json --search "phi-4"
# Filter by groups to find comparable baselines
python analyze.py evals data/dataset.json --groups size:small,arch:moe
Use tasks to explore task structures (surfaces and projections):
# Show all tasks with their surfaces and projections
python analyze.py tasks data/dataset.json
# Filter to specific task
python analyze.py tasks data/dataset.json --search "arithmetic"
Tier 2: Aggregate Rankings¶
Use scores to see overall performance:
# Full leaderboard
python analyze.py scores data/dataset.json
# Filtered by groups
python analyze.py scores data/dataset.json --filters '{"groups": ["runtime:vllm"]}'
# PNG output
python analyze.py scores data/dataset.json --format png --output leaderboard.png --width 1400
Tier 3: Per-Model Diagnostics¶
Use spiderweb for detailed per-task breakdowns:
# Spider webs for all evaluations
python analyze.py spiderweb data/dataset.json --format webpng
# Bar plots for specific evaluations
python analyze.py spiderweb data/dataset.json --filters '{"eval_id": [0, 1]}' --format barpng
Tier 4: Statistical Grouping¶
Use cluster to find statistically indistinguishable models:
# Cluster by task (stack)
python analyze.py cluster data/dataset.json --stack base_task
# Cluster by tier (stack)
python analyze.py cluster data/dataset.json --stack tier
# PNG visualization with output directory
python analyze.py cluster data/dataset.json --stack base_task --format png --output-dir clusters/
# Split by model (separate file per model)
python analyze.py cluster data/dataset.json --stack base_task --split model --output-dir by-model/
Tier 5-8: Deep Analysis¶
For deeper investigation:
# 3D surface visualization
python analyze.py surface data/dataset.json --filters '{"base_task": "arithmetic"}'
# FFT frequency analysis
python analyze.py fft data/dataset.json --filters '{"base_task": "arithmetic"}'
# Compression entropy analysis
python analyze.py compression data/dataset.json --split model --output-dir compression/
# Hazard temporal analysis
python analyze.py hazard data/dataset.json --split base_task,model --output-dir hazard/
Subcommand Reference¶
evals - Evaluation Discovery with Tier Availability¶
Purpose: List all evaluations and show which difficulty tiers have data available.
Usage:
python analyze.py evals <config> [options]
Options:
- --search TERM: Filter evaluations by search term (case-insensitive)
- --groups GROUPS: Comma-separated list of groups to filter by (e.g., "size:small,arch:moe"). All specified groups must be present (AND logic).
- --format {json,markdown,table}: Output format (default: table)
- --output PATH: Write output to file instead of stdout
- --show-token-counts: Show token statistics table after main table
Note on Filtering: Unlike other subcommands, evals does not use the standard --filters parameter. This is because evals is a simple wrapper that reads the dataset configuration file, not a database operation. Instead, it provides specific --groups and --search options for filtering the evaluation list.
Output Structure:
Shows tier definitions at the top, then a table with:
- eval_id: Numeric index (0, 1, 2, ...)
- label: Human-readable name
- model: Model identifier
- template: Prompt template name
- sampler: Sampler configuration name
- groups: Tags for filtering
- hf_id: Hugging Face model identifier
- Tier columns: ✓ (has data) or ✗ (no data) for each difficulty tier
When --show-token-counts is used, an additional table is displayed showing:
- eval_id, label: Evaluation identifier
- tests: Total number of test completions
- tokens: Total tokens used across all completions
- avg: Average tokens per completion
- One column per task showing: avg_correct_tokens/avg_incorrect_tokens for that task
Examples:
# List all evaluations with tier availability
python analyze.py evals data/dataset-m12x.json
# Search for specific evaluation
python analyze.py evals data/dataset-m12x.json --search "FP8"
# Filter by groups to find comparable baselines (AND logic)
python analyze.py evals data/dataset-m12x.json --groups size:small,arch:moe
# Combine groups filter with search
python analyze.py evals data/dataset-m12x.json --groups size:large --search "llama"
# JSON output
python analyze.py evals data/dataset-m12x.json --format json --output evals.json
# Show token usage statistics
python analyze.py evals data/dataset-m12x.json --show-token-counts
Execution Time: ~1-2 seconds (includes database checks for tier availability, slightly longer with --show-token-counts)
tasks - Task Structure Discovery¶
Purpose: List all base tasks with their surfaces (2D) and projections (1D).
Usage:
python analyze.py tasks <config> [options]
Options:
- --search TERM: Filter tasks by search term (case-insensitive)
Output Structure:
Shows hierarchical task structure:
- Task name and label
- Surfaces (2D parameter spaces for 3D visualization):
  - Surface ID and label
  - x-axis: parameter name with labels/values
  - y-axis: parameter name with labels/values
  - filter: additional constraints
- Projections (1D parameter sweeps for FFT analysis):
  - Projection ID and label
  - axis: parameter name with labels/values
  - filter: additional constraints
Key Insight: Surfaces and projections are the same concept (parameter space slices) - just 2D vs 1D. Both show axes with labels (preferred) or values, plus filters.
Examples:
# Show all tasks with surfaces and projections
python analyze.py tasks data/dataset-m12x.json
# Filter to specific task
python analyze.py tasks data/dataset-m12x.json --search "arithmetic"
# Search for tasks with "letter" in name
python analyze.py tasks data/dataset-m12x.json --search "letter"
Execution Time: <1 second (config parsing only)
scores - Generate Aggregate Leaderboard¶
Purpose: Ranked performance table with fair-sort algorithm and multi-tier support.
Usage:
python analyze.py scores <config> [options]
Options:
- --filters JSON: Filter evaluations (see Filter System)
- --format {json,markdown,table,csv,png}: Output format (default: table)
- --width PIXELS: PNG image width (default: 1400, only for PNG output)
- --output PATH: Write output to file instead of stdout
Output Structure:
Columns: Rank | Model | Tier | ReasonScore | task1 | task2 | ...
Task format: .XX - .YY [-.ZZ]
- .XX: Center (median accuracy)
- .YY: Upper bound (center + margin)
- [-.ZZ]: Truncation penalty (only shown if > 2%)
For example, .86 - .91 [-.04] reads as a median accuracy of 0.86, an upper bound of 0.91, and a 4% truncation penalty.
Fair-Sort Ranking: Models with overlapping confidence intervals share ranks.
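The sketch below illustrates the rank-sharing idea only; it is not the implementation in src/scores.py. It treats each model's score as a (lower, upper) interval and starts a new rank only when a model's interval is separated from the model ranked directly above it.
# Illustrative sketch of rank sharing under confidence-interval overlap.
def overlaps(a, b):
    # Two (lower, upper) intervals overlap if neither is strictly above the other.
    return a[0] <= b[1] and b[0] <= a[1]
def fair_ranks(models):
    # models: list of (name, lower, upper) tuples
    ordered = sorted(models, key=lambda m: (m[1] + m[2]) / 2, reverse=True)
    ranks, rank = [], 1
    for i, (name, lo, hi) in enumerate(ordered):
        if i > 0 and not overlaps((lo, hi), ordered[i - 1][1:]):
            rank = i + 1  # interval clearly below the previous model: new rank
        ranks.append((rank, name))
    return ranks
print(fair_ranks([("A", 0.80, 0.90), ("B", 0.78, 0.86), ("C", 0.60, 0.70)]))
# [(1, 'A'), (1, 'B'), (3, 'C')] -- A and B overlap and share rank 1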
Examples:
# Full leaderboard
python analyze.py scores data/dataset-micro-minimax.json
# Filter by eval_id
python analyze.py scores data/dataset-micro-minimax.json --filters '{"eval_id": [0, 1]}'
# Filter by groups
python analyze.py scores data/dataset-micro-minimax.json --filters '{"groups": ["runtime:vllm"]}'
# JSON output
python analyze.py scores data/dataset-micro-minimax.json --format json --output scores.json
# PNG output (rendered HTML table)
python analyze.py scores data/dataset-micro-minimax.json --format png --output scores.png --width 1400
Execution Time: ~5-10 seconds
spiderweb - Per-Model Diagnostic Plots¶
Purpose: Visualize per-task breakdown with difficulty tiers.
Usage:
python analyze.py spiderweb <config> [options]
Options:
- --filters JSON: Filter evaluations
- --format {webpng,barpng}: Output format - spider web or bar chart (default: webpng)
- --width PIXELS: PNG width (default: 1000)
- --height PIXELS: PNG height (default: 700)
- --output PATH: Output file path (single eval) or base path (multiple evals)
Features:
- Shows per-task accuracy across difficulty tiers
- Includes token usage and truncation rates
- Creates one file per evaluation matching filters
- Auto-generated filenames include eval_id and label
Examples:
# Spider web plots for all evaluations
python analyze.py spiderweb data/dataset-micro-minimax.json --format webpng
# Filter by eval_id
python analyze.py spiderweb data/dataset-micro-minimax.json --filters '{"eval_id": [0]}' --format webpng
# Bar plots instead
python analyze.py spiderweb data/dataset-micro-minimax.json --format barpng
# Custom dimensions
python analyze.py spiderweb data/dataset-micro-minimax.json --format webpng --width 1200 --height 800
Execution Time: ~15 seconds per evaluation
cluster - Statistical Clustering Analysis¶
Purpose: Group models by confidence interval overlap using maximal clique enumeration.
Usage:
python analyze.py cluster <config> [options]
Options:
- --filters JSON: Filter evaluations
- --stack {tier,base_task,surface}: Dimension to stack as subplots/sections within output (default: base_task)
- --split DIMENSION: Dimension to split into separate output files (e.g., model, tier, template)
- --format {markdown,json,png}: Output format (default: markdown)
- --width PIXELS: PNG width (default: 1200, only for PNG output)
- --output-dir PATH: Directory for output files (default: current directory)
Stack vs Split:
- Stack (--stack): Creates subplots/sections within a single output file
- tier: One section per difficulty tier (easy/medium/hard)
- base_task (default): One section per task type (arithmetic, objects, etc.)
- surface: One section per surface definition
- Split (--split): Creates separate output files for each unique value
- Any valid dimension: model, tier, template, sampler, base_task, etc.
- Uses efficient discovery via PointsDB to find unique values
- Each value gets its own file with complete clustering analysis
Key Insight: Models can belong to multiple clusters. If A, B, C are indistinguishable AND B, C, D are indistinguishable, both clusters are reported (not just one chosen from an arbitrary perspective).
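A minimal sketch of the idea, assuming simple (lower, upper) confidence intervals and using networkx (not one of the tool's listed dependencies) purely to enumerate maximal cliques; it is not the implementation in src/cluster.py.
# Maximal cliques of the CI-overlap graph = reported clusters.
import networkx as nx
intervals = {  # hypothetical (lower, upper) accuracy intervals
    "A": (0.80, 0.90),
    "B": (0.78, 0.88),
    "C": (0.76, 0.84),
    "D": (0.70, 0.78),
}
G = nx.Graph()
G.add_nodes_from(intervals)
names = list(intervals)
for i, m in enumerate(names):
    for n in names[i + 1:]:
        (alo, ahi), (blo, bhi) = intervals[m], intervals[n]
        if alo <= bhi and blo <= ahi:  # overlapping CIs: statistically indistinguishable
            G.add_edge(m, n)
print([sorted(c) for c in nx.find_cliques(G)])
# e.g. [['A', 'B', 'C'], ['B', 'C', 'D']] -- B and C appear in both clusters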
Output Formats:
- markdown: Human-readable summary with cluster tables
- json: Structured data for programmatic analysis
- png: Stacked confidence band visualization with consistent colors
Examples:
# Default: clustering by base_task, markdown output to current directory
python analyze.py cluster data/dataset-m12x.json
# Clustering by difficulty tier with output directory
python analyze.py cluster data/dataset-m12x.json --stack tier --output-dir results/
# Clustering by surface with PNG visualization
python analyze.py cluster data/dataset-m12x.json --stack surface --format png --output-dir figures/
# Split by model, stack by base_task (separate PNG per model)
python analyze.py cluster data/dataset-m12x.json --stack base_task --split model --format png --output-dir clusters/
# Split by tier (separate file per difficulty level)
python analyze.py cluster data/dataset-m12x.json --stack base_task --split tier --output-dir by-tier/
# Filter to specific groups before clustering
python analyze.py cluster data/dataset-m12x.json --filters '{"groups": [["arch:moe", "size:large"]]}'
# Filter to specific task with JSON output
python analyze.py cluster data/dataset-m12x.json --filters '{"base_task": "arithmetic"}' --format json
File Naming:
- Without --split: {dataset}_cluster_{stack}.{ext}
- With --split: {dataset}_cluster_{stack}_{split}-{value}.{ext}
Execution Time: ~10-20 seconds (per output file when using --split)
surface - 3D Difficulty Manifold Visualization¶
Purpose: Visualize accuracy across 2D parameter grids (e.g., length × depth).
Usage:
python analyze.py surface <config> [options]
Options:
- --filters JSON: Filter evaluations (supports eval_id, base_task, surfaces)
- --split DIMENSION: Dimension to split into separate output files (default: base_task)
- Use none for single combined file
- Any valid dimension: model, template, sampler, base_task, tier, etc.
- --output-dir PATH: Directory for output files (default: current directory)
- --output PATH: Output PNG file path (for single file mode, ignored if --output-dir is used)
Grid Layout:
- Rows: Evaluations (models)
- Columns: Surfaces (task parameter grids)
- Z-axis: Accuracy (0-1 scale)
- Confidence intervals: Shown as a semi-transparent lower-bound surface
Split Mode (Default):
By default, surface splits by base_task to avoid creating very tall images. Each unique value of the split dimension generates a separate output file:
- {dataset}_surface_{split_dim}-{value}.png - Multiple files with split
- {dataset}_surface_grid.png - Single file without split
Requirements:
Surfaces must be defined in config under basetasks[task_id]['surfaces']. Each surface specifies:
- x_data/y_data: Parameter names (e.g., "length", "max_depth")
- x_values/y_values: Grid coordinates
- filter: Additional param constraints for this surface
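A hypothetical surface entry, written as a Python dict that mirrors the JSON config. The id and label key names and the concrete values are assumptions for illustration; x_data/y_data, x_values/y_values, and filter follow the field list above.
surface = {
    "id": "arithmetic_length_x_depth",  # referenced by the "surfaces" filter key
    "label": "Length x Max Depth",
    "x_data": "length",
    "x_values": [4, 8, 16, 32],
    "y_data": "max_depth",
    "y_values": [1, 2, 3, 4],
    "filter": {},  # optional extra parameter constraints for this slice
}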
Discovery:
python analyze.py tasks <config> # Discover surfaces and projections
python analyze.py evals <config> # Discover dimensions and values
Examples:
# Split by base_task (default) - one file per task
python analyze.py surface data/dataset-m12x.json
# Output to specific directory
python analyze.py surface data/dataset-m12x.json --output-dir results/
# Split by model instead
python analyze.py surface data/dataset-m12x.json --split model --output-dir figures/
# No split - single combined file
python analyze.py surface data/dataset-m12x.json --split none --output my_surfaces.png
# Filter to specific evaluations then split
python analyze.py surface data/dataset-m12x.json --filters '{"eval_id": [0, 1, 2]}' --output-dir results/
# Filter to specific task (legacy single file mode)
python analyze.py surface data/dataset-m12x.json --filters '{"base_task": "arithmetic"}' --split none --output arithmetic.png
Execution Time: ~2-3 minutes
fft - Token-Frequency Spectral Analysis¶
Purpose: Analyze frequency domain characteristics of token generation patterns.
Usage:
python analyze.py fft <config> [options]
Options:
- --filters JSON: Filter evaluations (supports eval_id, base_task, projections)
- --output PATH: Output PNG file path (default: auto-generated)
Grid Layout:
- Rows: Task projections (parameter sweeps)
- Columns: Evaluations (models)
- Legend: Parameter values with consistent colors
What It Reveals:
- Tokenizer differences vs model differences
- Repetitive patterns in reasoning traces
- Structural differences between models
- Impact of prompt/sampler settings on output patterns
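As a conceptual illustration only (the exact signal analyzed by src/fft.py is not specified here), the sketch below shows how a repetitive token stream produces a sharp spectral peak; token IDs are used purely as a stand-in sequence.
import numpy as np
token_ids = np.array([5, 9] * 250, dtype=float)  # trace that alternates every other token
signal = token_ids - token_ids.mean()            # remove the DC component
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size)
print(freqs[spectrum.argmax()])  # 0.5 -- the alternation shows up as a single sharp peak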
Requirements:
Projections must be defined in config under basetasks[task_id]['projections']. Each projection specifies:
- axis: Parameter name being swept
- values: Parameter values
- labels: Human-readable labels
- filter: Base filter for this projection
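A hypothetical projection entry, written as a Python dict that mirrors the JSON config. The id key name and the concrete values are assumptions for illustration; axis, values, labels, and filter follow the field list above.
projection = {
    "id": "arith_easy",  # referenced by the "projections" filter key
    "axis": "length",    # parameter being swept
    "values": [4, 8, 16, 32],
    "labels": ["len 4", "len 8", "len 16", "len 32"],
    "filter": {},        # optional base constraints for this sweep
}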
Discovery:
python analyze.py tasks <config> # Discover projections and their IDs
Examples:
# All tasks and projections
python analyze.py fft data/dataset-m12x.json
# Only arithmetic task
python analyze.py fft data/dataset-m12x.json --filters '{"base_task": "arithmetic"}'
# Multiple specific tasks
python analyze.py fft data/dataset-m12x.json --filters '{"base_task": ["arithmetic", "objects"]}'
# Specific projections
python analyze.py fft data/dataset-m12x.json --filters '{"projections": ["arith_easy", "arith_hard"]}'
# Specific evaluations
python analyze.py fft data/dataset-m12x.json --filters '{"eval_id": [0, 1]}'
# Combined filters
python analyze.py fft data/dataset-m12x.json --filters '{"eval_id": [0], "base_task": "arithmetic"}'
Execution Time: ~1-2 minutes
compression - Entropy Pattern Analysis¶
Purpose: Analyze information-theoretic failure mechanisms via compression correlation.
Usage:
python analyze.py compression <config> [options]
Options:
- --filters JSON: Filter evaluations
- --split DIMENSION: Dimension for separate files (default: base_task)
- --stack DIMENSION: Dimension for rows within files (default: eval_id, use "none" to disable)
- --series DIMENSION: Dimension for colored series (default: tier)
- --output-dir PATH: Directory for output PNG files (default: current directory)
Processing Hierarchy:
1. --split: Creates separate files
2. --stack: Creates rows within each file (default: eval_id)
3. --series: Creates colored series/columns
Output Structure:
Each file contains a three-panel composite:
- Left panel: Correct answers
- Center panel: Incorrect answers
- Right panel: Truncated answers
Each panel shows: tokens vs compressed bytes, with histograms.
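The sketch below shows what a single scatter point represents, using zlib as a stand-in compressor (the compressor actually used by src/compression.py is not specified here): repetitive reasoning traces compress to far fewer bytes per token than varied ones.
import zlib
def compression_point(text, token_count):
    # One scatter point: (tokens, compressed bytes of the completion text).
    return token_count, len(zlib.compress(text.encode("utf-8")))
print(compression_point("wait, " * 200, 400))  # highly repetitive: tiny compressed size
print(compression_point("a varied chain of thought with little repetition", 9))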
Requirements:
- Database must contain compression data (automatically included with --dataset mode)
- Token-level data is always generated in dataset mode
Examples:
# Default: split by base_task, stack by eval_id, series by tier
python analyze.py compression data/dataset-m12x.json --output-dir compression/
# Disable stacking (one row per file)
python analyze.py compression data/dataset-m12x.json --stack none --output-dir compression/
# Split by model, stack by base_task, series by tier
python analyze.py compression data/dataset-m12x.json --split model --stack base_task --output-dir compression/
# Change series dimension
python analyze.py compression data/dataset-m12x.json --series length --output-dir compression/
# Filter to specific evaluation
python analyze.py compression data/dataset-m12x.json --filters '{"eval_id": [0]}' --output-dir compression/
Execution Time: ~30 seconds
hazard - Survival Analysis¶
Purpose: Model temporal failure risk during token generation.
Usage:
python analyze.py hazard <config> [options]
Options:
- --filters JSON: Filter evaluations
- --split DIMENSIONS: Comma-separated dimension names for separate files (default: base_task,model)
- --bucket-size TOKENS: Token bucket size for time grid (default: 256)
- --output-dir PATH: Directory for output PNG files (default: current directory)
Survival Analysis Framework:
Treats token generation as a temporal process:
- Time: Token count
- Event: Model produces answer (correct or incorrect)
- Censoring: Truncation (stopped observing before completion)
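A conceptual sketch of a discrete-time (life-table) hazard estimate over token buckets, not the implementation in src/hazard.py: within each bucket, the hazard is the fraction of still-at-risk completions that produce an answer there, and truncated completions simply leave the risk set.
from collections import Counter
def hazard_curve(samples, bucket_size=256):
    # samples: list of (token_count, truncated) pairs
    events, censored = Counter(), Counter()
    for tokens, truncated in samples:
        bucket = tokens // bucket_size
        (censored if truncated else events)[bucket] += 1
    at_risk = len(samples)
    curve = []
    last_bucket = max(list(events) + list(censored), default=-1)
    for bucket in range(last_bucket + 1):
        if at_risk > 0:
            curve.append((bucket * bucket_size, events[bucket] / at_risk))
        at_risk -= events[bucket] + censored[bucket]  # answered or truncated: leave the risk set
    return curve
print(hazard_curve([(120, False), (300, False), (310, True), (900, False)]))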
Split Parameter:
Comma-separated dimensions create one file per unique combination:
- base_task,model (default): One file per task+model combo
- base_task: One file per task (all models combined)
- model,tier: One file per model+tier combo
What It Shows:
- When failures occur during generation
- Hazard curves for correct/incorrect outcomes
- Truncation patterns
Requirements:
- Database must contain compression data (includes token counts)
Examples:
# Default: split by base_task and model
python analyze.py hazard data/dataset-m12x.json --output-dir hazard/
# Split by base_task only
python analyze.py hazard data/dataset-m12x.json --split base_task --output-dir hazard/
# Split by model and tier with specific bucket size
python analyze.py hazard data/dataset-m12x.json --split model,tier --bucket-size 128 --output-dir hazard/
# Filter to specific evaluation
python analyze.py hazard data/dataset-m12x.json --filters '{"eval_id": [0]}' --output-dir hazard/
Execution Time: ~1-2 minutes
modelinfo - Collect Model Metadata¶
Purpose: Download and cache model architecture information from Hugging Face hub.
Usage:
python analyze.py modelinfo <config> --output-dir <path> [options]
Options:
- --filters JSON: Filter evaluations (to select specific models)
- --output-dir PATH: Directory for metadata cache (required)
Downloads Per Model:
- All .md files (README, MODEL_CARD, etc.)
- config.json (architecture configuration)
- recipe.yaml (quantization info, if exists)
Output Structure:
metadata/
├── index.json # Quick lookup
├── model-org/
│ └── model-name/
│ ├── README.md
│ ├── config.json
│ └── recipe.yaml (if exists)
index.json Structure:
{
"model-id": {
"model_id": "org/name",
"output_dir": "/absolute/path",
"file_paths": { "README.md": "/path", ... },
"metadata_complete": true,
"missing_files": [],
"fetch_errors": [],
"architecture": "LlamaForCausalLM",
"num_layers": 80,
"hidden_size": 8192,
"vocab_size": 128256,
"cached_at": "2025-01-15T10:30:00Z"
}
}
Examples:
# Collect for all models in config
python analyze.py modelinfo data/dataset-micro-minimax.json --output-dir metadata/
# Filter by eval_id
python analyze.py modelinfo data/dataset-micro-minimax.json --filters '{"eval_id": [0, 1]}' --output-dir metadata/
# Filter by groups
python analyze.py modelinfo data/dataset-micro-minimax.json --filters '{"groups": ["runtime:vllm"]}' --output-dir metadata/
Execution Time: ~30-60 seconds (with 500ms rate limiting between requests)
Filter System¶
All analysis commands (except the evals and tasks discovery commands) support the --filters parameter for flexible data selection.
Filter Structure¶
The --filters parameter accepts JSON with these keys:
{
"eval_id": [0, 1, 2], // List of evaluation indices
"groups": ["runtime:vllm", "size:large"], // All must match (AND logic)
"tiers": ["easy", "medium"], // Difficulty tiers
"base_task": "arithmetic", // Single task or list
"projections": ["arith_easy"], // For FFT analysis
"surfaces": ["arithmetic_length_x_depth"] // For surface analysis
}
Filter Keys¶
| Key | Type | Description | Used By |
|---|---|---|---|
| eval_id | List[int] | Evaluation indices (0, 1, 2, ...) | All commands |
| groups | List[str] | Group tags (all must match) | All commands |
| tiers | List[str] | Difficulty tier labels (e.g., "easy", "medium", "hard") | scores, spiderweb, compression, hazard |
| base_task | str or List[str] | Task name(s) | All commands |
| projections | List[str] | Projection IDs | fft |
| surfaces | List[str] | Surface IDs | surface |
Shell Escaping¶
Important: Use single quotes around JSON, double quotes inside:
# Correct
--filters '{"eval_id": [0, 1]}'
# Wrong (shell will interpret double quotes)
--filters "{"eval_id": [0, 1]}"
Filter Examples¶
# By eval_id only
--filters '{"eval_id": [0, 1, 2]}'
# By groups (all must match)
--filters '{"groups": ["runtime:vllm"]}'
# Multiple groups (AND logic)
--filters '{"groups": ["runtime:vllm", "size:large"]}'
# By base_task (single)
--filters '{"base_task": "arithmetic"}'
# By base_task (multiple)
--filters '{"base_task": ["arithmetic", "letters"]}'
# Combined filters
--filters '{"eval_id": [0], "groups": ["family:phi"], "base_task": "arithmetic"}'
# FFT-specific: filter projections
--filters '{"projections": ["arith_easy", "arith_hard"]}'
# Surface-specific: filter surfaces
--filters '{"surfaces": ["arithmetic_length_x_depth"]}'
Discovering Valid Filter Values¶
Use evals and tasks subcommands to discover valid values:
# List all evaluations (shows eval_id, groups, tier availability, etc.)
python analyze.py evals data/dataset.json
# Search for specific evaluations
python analyze.py evals data/dataset.json --search "vllm"
# Show all tasks with surfaces and projections
python analyze.py tasks data/dataset.json
# Show specific task structure
python analyze.py tasks data/dataset.json --search "arithmetic"
Output Formats¶
Most commands support multiple output formats via the --format parameter.
Available Formats by Command¶
| Command | Formats | Default |
|---|---|---|
| evals | table, json, markdown | table |
| tasks | (always text output) | N/A |
| scores | table, json, markdown, csv, png | table |
| spiderweb | webpng, barpng | webpng |
| cluster | markdown, json, png | markdown |
| surface | png | png |
| fft | png | png |
| compression | png | png |
| hazard | png | png |
| modelinfo | (directory structure) | N/A |
Format Descriptions¶
table (ASCII table)
- Terminal-friendly monospace output
- Uses pandas to_string()
- Best for quick inspection in shell
markdown
- GitHub-flavored markdown tables
- Human-readable and version-controllable
- Uses pandas to_markdown()
json
- Machine-readable structured data
- Suitable for programmatic analysis
- Pretty-printed with 2-space indent
csv
- Spreadsheet-compatible format
- Only available for scores command
- Standard comma-separated values
png
- Visual rendering for charts and plots
- Requires system dependencies (see Requirements)
- Various widths/heights depending on command
webpng (spiderweb only)
- Spider web / radar chart visualization
- Shows multi-dimensional task breakdown
barpng (spiderweb only)
- Vertical bar chart visualization
- Alternative to spider web for the same data
Output Destination¶
Stdout (default):
# Print to terminal
python analyze.py scores data/dataset.json
File output:
# Single file
python analyze.py scores data/dataset.json --output scores.md
# Multiple files (directory)
python analyze.py compression data/dataset.json --output-dir compression/
Auto-Generated Filenames¶
When --output is not specified, commands that generate files use auto-generated names based on:
- Dataset name (from config "name" field)
- Filter values (eval_id, tiers, etc.)
- Command type (e.g., _spider, _cluster)
Example auto-generated names:
micro-minimax_leaderboard.png
micro-minimax_eval0_phi-4_spider.png
micro-minimax_cluster_base_task.png
micro-minimax_surface_grid.png
Requirements¶
Python Dependencies¶
pip install pandas duckdb plotly kaleido imgkit huggingface_hub
Package purposes:
- pandas: DataFrame manipulation and table formatting
- duckdb: Database backend for efficient queries
- plotly: Interactive plotting (spider webs, surfaces, FFT)
- kaleido: Static image export for plotly (PNG output)
- imgkit: HTML to PNG conversion (for scores PNG)
- huggingface_hub: Model metadata fetching (modelinfo)
System Requirements¶
For PNG generation:
- wkhtmltoimage (required by imgkit for scores PNG):
  # Ubuntu/Debian
  sudo apt-get install wkhtmltopdf
  # macOS
  brew install wkhtmltopdf
- kaleido (required by plotly for static export):
  pip install kaleido
Data Requirements¶
Basic analysis (evals, scores, spiderweb, cluster):
- Standard evaluation data in DuckDB
- Generate with: python evaluate.py --dataset data/dataset.json
Compression and hazard analysis:
- Token-level data required (automatically included)
- Generate with: python evaluate.py --dataset data/dataset.json
FFT analysis:
- Projection definitions in config
- Under basetasks[task_id]['projections']
Surface analysis:
- Surface definitions in config
- Under basetasks[task_id]['surfaces']
Configuration Requirements¶
Minimum config structure:
{
"db": "path/to/database.db",
"name": "Dataset Name",
"evals": [
{
"label": "Model Name",
"hf_id": "org/model-name",
"filters": {
"model": "model-id",
"template": "template-id",
"sampler": "sampler-id"
},
"groups": ["runtime:vllm", "family:llama"]
}
],
"basetasks": {
"task_id": {
"label": "Task Name",
"projections": [...], // Optional (for FFT)
"surfaces": [...] // Optional (for surface)
}
},
"tiers": [...] // Optional (for tier-based analysis)
}
Error Handling¶
Common Errors and Solutions¶
Missing Database¶
Error:
Error: Database not found: /path/to/database.db
Run evaluate.py with --db to create it
Solution:
python evaluate.py --dataset /path/to/dataset.json
Invalid JSON Filters¶
Error:
Error: Invalid JSON in --filters: {...}
Solutions:
- Check JSON syntax (use a JSON validator)
- Use single quotes around JSON: '{"key": "value"}'
- Use double quotes inside JSON: "key", not 'key'
- Escape properly if using double quotes in shell
No Matching Data¶
Error:
No results found matching filters
Solutions:
1. Verify filters are correct:
   python analyze.py evals data/dataset.json
2. Check if data exists in the database:
   python analyze.py scores data/dataset.json # Try without filters
3. Relax filter constraints
Missing Config Fields¶
Error:
Error: Config must specify 'db' path
Solution:
Add "db" field to config JSON:
{
"db": "path/to/database.db",
...
}
PNG Generation Fails¶
Error:
Error writing PNG: ...
Note: PNG export requires kaleido package: pip install kaleido
Solution:
pip install kaleido
Error (imgkit):
No wkhtmltoimage executable found
Solution:
# Ubuntu/Debian
sudo apt-get install wkhtmltopdf
# macOS
brew install wkhtmltopdf
Getting Help¶
Command-Line Help¶
# List all subcommands
python analyze.py --help
# Subcommand-specific help
python analyze.py evals --help
python analyze.py scores --help
python analyze.py spiderweb --help
python analyze.py cluster --help
python analyze.py surface --help
python analyze.py fft --help
python analyze.py compression --help
python analyze.py hazard --help
python analyze.py modelinfo --help
Help Output Includes¶
- Description of the subcommand
- Positional arguments
- Optional arguments with defaults
- Output format options
- Examples
- Filter structure (where applicable)
Example Help Output¶
$ python analyze.py scores --help
usage: analyze.py scores [-h] [--filters FILTERS]
[--format {json,markdown,table,csv,png}]
[--width WIDTH] [--output OUTPUT]
config
Generate leaderboard/rankings across evaluations
positional arguments:
config Dataset configuration file
optional arguments:
-h, --help show this help message and exit
--filters FILTERS Filter as JSON (e.g., '{"eval_id": [0, 1],
"groups": ["runtime:vllm"]}')
--format {json,markdown,table,csv,png}
Output format (default: table)
--width WIDTH PNG image width in pixels (default: 1400,
only used with --format png)
--output OUTPUT Write output to file instead of stdout
See Also¶
Documentation¶
- Architecture Overview - System architecture and design
- Methodology - Evaluation methodology and metrics
- Workflow Guide - Analysis workflow guide
Source Code¶
- analyze.py - Main entry point (1345 lines)
- src/scores.py - Leaderboard and fair-sort
- src/cluster.py - Clustering algorithms
- src/surface.py - 3D surface visualization
- src/fft.py - FFT analysis
- src/compression.py - Compression analysis
- src/hazard.py - Hazard analysis
- src/points_db.py - DuckDB interface