ReasonScape Analysis Tool - Complete Documentation

Table of Contents

  1. Overview
  2. Quick Start
  3. Analysis Workflow
  4. Subcommand Reference
  5. Filter System
  6. Output Formats
  7. Requirements
  8. Error Handling
  9. Getting Help

Overview

The analyze.py script provides a unified interface to all ReasonScape analysis operations. It replaces the scattered V1 scripts with a consistent, agent-friendly design backed by DuckDB for efficient data access.

Analysis Flow Tiers

The ReasonScape analysis flow is organized into tiers, each answering a specific question:

  1. evals - What evaluations are available? Which tiers have data? (evaluation discovery)
  2. tasks - What task structures exist? What surfaces and projections? (task discovery)
  3. scores - Which models struggle overall? (aggregate rankings with fair-sort)
  4. spiderweb - What's the per-task breakdown? (per-model diagnostic)
  5. cluster - Which models are statistically indistinguishable? (CI-overlap clustering)
  6. surface - Where do failures occur in parameter space? (3D visualization)
  7. fft - Is it tokenizer or model capability? (frequency domain analysis)
  8. compression - What's the information-theoretic mechanism? (entropy analysis)
  9. hazard - When do failures occur during generation? (survival analysis)
  10. modelinfo - What are the model architectures? (metadata collection)

Quick Start

Basic Syntax

python analyze.py <subcommand> <config> [options]

Quick Examples

# List available evaluations
python analyze.py evals data/dataset-micro-minimax.json

# Generate leaderboard
python analyze.py scores data/dataset-micro-minimax.json --format png --output leaderboard.png

# Generate spider plots
python analyze.py spiderweb data/dataset-micro-minimax.json --format webpng

# Cluster analysis
python analyze.py cluster data/dataset-micro-minimax.json --stack base_task --format png

# Surface visualization
python analyze.py surface data/dataset-micro-minimax.json

# FFT analysis
python analyze.py fft data/dataset-micro-minimax.json

# Compression analysis
python analyze.py compression data/dataset-micro-minimax.json --output-dir compression/

# Hazard analysis
python analyze.py hazard data/dataset-micro-minimax.json --output-dir hazard/

# Fetch model metadata
python analyze.py modelinfo data/dataset-micro-minimax.json --output-dir metadata/

Analysis Workflow

Tier 1: Discovery

Start with evals to see what evaluations are available and which tiers have data:

# List all evaluations with tier availability
python analyze.py evals data/dataset.json

# Search for specific model
python analyze.py evals data/dataset.json --search "phi-4"

# Filter by groups to find comparable baselines
python analyze.py evals data/dataset.json --groups size:small,arch:moe

Use tasks to explore task structures (surfaces and projections):

# Show all tasks with their surfaces and projections
python analyze.py tasks data/dataset.json

# Filter to specific task
python analyze.py tasks data/dataset.json --search "arithmetic"

Tier 2: Aggregate Rankings

Use scores to see overall performance:

# Full leaderboard
python analyze.py scores data/dataset.json

# Filtered by groups
python analyze.py scores data/dataset.json --filters '{"groups": ["runtime:vllm"]}'

# PNG output
python analyze.py scores data/dataset.json --format png --output leaderboard.png --width 1400

Tier 3: Per-Model Diagnostics

Use spiderweb for detailed per-task breakdowns:

# Spider webs for all evaluations
python analyze.py spiderweb data/dataset.json --format webpng

# Bar plots for specific evaluations
python analyze.py spiderweb data/dataset.json --filters '{"eval_id": [0, 1]}' --format barpng

Tier 4: Statistical Grouping

Use cluster to find statistically indistinguishable models:

# Cluster by task (stack)
python analyze.py cluster data/dataset.json --stack base_task

# Cluster by tier (stack)
python analyze.py cluster data/dataset.json --stack tier

# PNG visualization with output directory
python analyze.py cluster data/dataset.json --stack base_task --format png --output-dir clusters/

# Split by model (separate file per model)
python analyze.py cluster data/dataset.json --stack base_task --split model --output-dir by-model/

Tier 5-8: Deep Analysis

For deeper investigation:

# 3D surface visualization
python analyze.py surface data/dataset.json --filters '{"base_task": "arithmetic"}'

# FFT frequency analysis
python analyze.py fft data/dataset.json --filters '{"base_task": "arithmetic"}'

# Compression entropy analysis
python analyze.py compression data/dataset.json --split model --output-dir compression/

# Hazard temporal analysis
python analyze.py hazard data/dataset.json --split base_task,model --output-dir hazard/

Subcommand Reference

evals - Evaluation Discovery with Tier Availability

Purpose: List all evaluations and show which difficulty tiers have data available.

Usage:

python analyze.py evals <config> [options]

Options:

  • --search TERM: Filter evaluations by search term (case-insensitive)
  • --groups GROUPS: Comma-separated list of groups to filter by (e.g., "size:small,arch:moe"). All specified groups must be present (AND logic).
  • --format {json,markdown,table}: Output format (default: table)
  • --output PATH: Write output to file instead of stdout
  • --show-token-counts: Show token statistics table after the main table

Note on Filtering: Unlike other subcommands, evals does not use the standard --filters parameter. This is because evals is a simple wrapper that reads the dataset configuration file, not a database operation. Instead, it provides specific --groups and --search options for filtering the evaluation list.

Output Structure:

Shows tier definitions at the top, then a table with:

  • eval_id: Numeric index (0, 1, 2, ...)
  • label: Human-readable name
  • model: Model identifier
  • template: Prompt template name
  • sampler: Sampler configuration name
  • groups: Tags for filtering
  • hf_id: Hugging Face model identifier
  • Tier columns: ✓ (has data) or ✗ (no data) for each difficulty tier

When --show-token-counts is used, an additional table is displayed showing:

  • eval_id, label: Evaluation identifier
  • tests: Total number of test completions
  • tokens: Total tokens used across all completions
  • avg: Average tokens per completion
  • One column per task showing avg_correct_tokens/avg_incorrect_tokens for that task

Examples:

# List all evaluations with tier availability
python analyze.py evals data/dataset-m12x.json

# Search for specific evaluation
python analyze.py evals data/dataset-m12x.json --search "FP8"

# Filter by groups to find comparable baselines (AND logic)
python analyze.py evals data/dataset-m12x.json --groups size:small,arch:moe

# Combine groups filter with search
python analyze.py evals data/dataset-m12x.json --groups size:large --search "llama"

# JSON output
python analyze.py evals data/dataset-m12x.json --format json --output evals.json

# Show token usage statistics
python analyze.py evals data/dataset-m12x.json --show-token-counts

Execution Time: ~1-2 seconds (includes database checks for tier availability, slightly longer with --show-token-counts)


tasks - Task Structure Discovery

Purpose: List all base tasks with their surfaces (2D) and projections (1D).

Usage:

python analyze.py tasks <config> [options]

Options:

  • --search TERM: Filter tasks by search term (case-insensitive)

Output Structure:

Shows the hierarchical task structure:

  • Task name and label
  • Surfaces (2D parameter spaces for 3D visualization):
      • Surface ID and label
      • x-axis: parameter name with labels/values
      • y-axis: parameter name with labels/values
      • filter: additional constraints
  • Projections (1D parameter sweeps for FFT analysis):
      • Projection ID and label
      • axis: parameter name with labels/values
      • filter: additional constraints

Key Insight: Surfaces and projections are the same concept (parameter space slices) - just 2D vs 1D. Both show axes with labels (preferred) or values, plus filters.
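
Both are read straight from the dataset config, so a short script can dump them for inspection. The following is a minimal sketch (not part of analyze.py), assuming only the basetasks[task_id]['surfaces'] / ['projections'] layout described under the surface and fft subcommands:

import json
import sys

# Usage (hypothetical helper): python list_structures.py data/dataset.json
with open(sys.argv[1]) as f:
    config = json.load(f)

for task_id, task in config.get("basetasks", {}).items():
    print(f"{task_id}: {task.get('label', task_id)}")
    for surface in task.get("surfaces", []):        # 2D slices (surface subcommand)
        print("  surface:", surface)
    for projection in task.get("projections", []):  # 1D slices (fft subcommand)
        print("  projection:", projection)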

Examples:

# Show all tasks with surfaces and projections
python analyze.py tasks data/dataset-m12x.json

# Filter to specific task
python analyze.py tasks data/dataset-m12x.json --search "arithmetic"

# Search for tasks with "letter" in name
python analyze.py tasks data/dataset-m12x.json --search "letter"

Execution Time: <1 second (config parsing only)


scores - Generate Aggregate Leaderboard

Purpose: Ranked performance table with fair-sort algorithm and multi-tier support.

Usage:

python analyze.py scores <config> [options]

Options:

  • --filters JSON: Filter evaluations (see Filter System)
  • --format {json,markdown,table,csv,png}: Output format (default: table)
  • --width PIXELS: PNG image width (default: 1400, only for PNG output)
  • --output PATH: Write output to file instead of stdout

Output Structure:

Columns: Rank | Model | Tier | ReasonScore | task1 | task2 | ...

Task format: .XX - .YY [-.ZZ]

  • .XX: Center (median accuracy)
  • .YY: Upper bound (center + margin)
  • [-.ZZ]: Truncation penalty (only shown if > 2%)

Fair-Sort Ranking: Models with overlapping confidence intervals share ranks.
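
As a rough illustration of rank sharing (the real fair-sort implementation may differ), assume each model carries a confidence interval around its score; a model whose interval overlaps the interval heading its rank group shares that rank. The intervals below are invented:

# Illustrative rank sharing for overlapping confidence intervals (not analyze.py's code).
models = {"A": (0.70, 0.80), "B": (0.74, 0.84), "C": (0.50, 0.60)}

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

ranked = sorted(models.items(), key=lambda kv: -(kv[1][0] + kv[1][1]) / 2)
ranks, current_rank, group_head = {}, 1, None
for i, (name, ci) in enumerate(ranked):
    if group_head is None or not overlaps(ci, group_head):
        current_rank, group_head = i + 1, ci    # start a new rank group
    ranks[name] = current_rank

print(ranks)   # A and B overlap and share rank 1; C gets rank 3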

Examples:

# Full leaderboard
python analyze.py scores data/dataset-micro-minimax.json

# Filter by eval_id
python analyze.py scores data/dataset-micro-minimax.json --filters '{"eval_id": [0, 1]}'

# Filter by groups
python analyze.py scores data/dataset-micro-minimax.json --filters '{"groups": ["runtime:vllm"]}'

# JSON output
python analyze.py scores data/dataset-micro-minimax.json --format json --output scores.json

# PNG output (rendered HTML table)
python analyze.py scores data/dataset-micro-minimax.json --format png --output scores.png --width 1400

Execution Time: ~5-10 seconds


spiderweb - Per-Model Diagnostic Plots

Purpose: Visualize per-task breakdown with difficulty tiers.

Usage:

python analyze.py spiderweb <config> [options]

Options:

  • --filters JSON: Filter evaluations
  • --format {webpng,barpng}: Output format, spider web or bar chart (default: webpng)
  • --width PIXELS: PNG width (default: 1000)
  • --height PIXELS: PNG height (default: 700)
  • --output PATH: Output file path (single eval) or base path (multiple evals)

Features:

  • Shows per-task accuracy across difficulty tiers
  • Includes token usage and truncation rates
  • Creates one file per evaluation matching the filters
  • Auto-generated filenames include eval_id and label

Examples:

# Spider web plots for all evaluations
python analyze.py spiderweb data/dataset-micro-minimax.json --format webpng

# Filter by eval_id
python analyze.py spiderweb data/dataset-micro-minimax.json --filters '{"eval_id": [0]}' --format webpng

# Bar plots instead
python analyze.py spiderweb data/dataset-micro-minimax.json --format barpng

# Custom dimensions
python analyze.py spiderweb data/dataset-micro-minimax.json --format webpng --width 1200 --height 800

Execution Time: ~15 seconds per evaluation


cluster - Statistical Clustering Analysis

Purpose: Group models by confidence interval overlap using maximal clique enumeration.

Usage:

python analyze.py cluster <config> [options]

Options:

  • --filters JSON: Filter evaluations
  • --stack {tier,base_task,surface}: Dimension to stack as subplots/sections within the output (default: base_task)
  • --split DIMENSION: Dimension to split into separate output files (e.g., model, tier, template)
  • --format {markdown,json,png}: Output format (default: markdown)
  • --width PIXELS: PNG width (default: 1200, only for PNG output)
  • --output-dir PATH: Directory for output files (default: current directory)

Stack vs Split:

  • Stack (--stack): Creates subplots/sections within a single output file
      • tier: One section per difficulty tier (easy/medium/hard)
      • base_task (default): One section per task type (arithmetic, objects, etc.)
      • surface: One section per surface definition
  • Split (--split): Creates separate output files for each unique value
      • Any valid dimension: model, tier, template, sampler, base_task, etc.
      • Uses efficient discovery via PointsDB to find unique values
      • Each value gets its own file with complete clustering analysis

Key Insight: Models can belong to multiple clusters. If A,B,C are indistinguishable AND B,C,D are indistinguishable, we report BOTH clusters (not just one from arbitrary perspective).
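
The clustering itself can be pictured as maximal clique enumeration on a CI-overlap graph. A minimal sketch of the idea (intervals invented, networkx is not among the tool's listed dependencies, and this is not the tool's implementation):

import networkx as nx

# Invented confidence intervals for four models.
ci = {"A": (0.70, 0.80), "B": (0.74, 0.84), "C": (0.76, 0.86), "D": (0.82, 0.92)}

G = nx.Graph()
G.add_nodes_from(ci)
for m1 in ci:
    for m2 in ci:
        if m1 < m2:
            lo1, hi1 = ci[m1]
            lo2, hi2 = ci[m2]
            if lo1 <= hi2 and lo2 <= hi1:   # confidence intervals overlap
                G.add_edge(m1, m2)

# Every maximal clique is a set of mutually indistinguishable models;
# a model may appear in more than one clique (B and C here).
print([sorted(c) for c in nx.find_cliques(G)])   # [['A', 'B', 'C'], ['B', 'C', 'D']]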

Output Formats:

  • markdown: Human-readable summary with cluster tables
  • json: Structured data for programmatic analysis
  • png: Stacked confidence band visualization with consistent colors

Examples:

# Default: clustering by base_task, markdown output to current directory
python analyze.py cluster data/dataset-m12x.json

# Clustering by difficulty tier with output directory
python analyze.py cluster data/dataset-m12x.json --stack tier --output-dir results/

# Clustering by surface with PNG visualization
python analyze.py cluster data/dataset-m12x.json --stack surface --format png --output-dir figures/

# Split by model, stack by base_task (separate PNG per model)
python analyze.py cluster data/dataset-m12x.json --stack base_task --split model --format png --output-dir clusters/

# Split by tier (separate file per difficulty level)
python analyze.py cluster data/dataset-m12x.json --stack base_task --split tier --output-dir by-tier/

# Filter to specific groups before clustering
python analyze.py cluster data/dataset-m12x.json --filters '{"groups": [["arch:moe", "size:large"]]}'

# Filter to specific task with JSON output
python analyze.py cluster data/dataset-m12x.json --filters '{"base_task": "arithmetic"}' --format json

File Naming:

  • Without --split: {dataset}_cluster_{stack}.{ext}
  • With --split: {dataset}_cluster_{stack}_{split}-{value}.{ext}

Execution Time: ~10-20 seconds (per output file when using --split)


surface - 3D Difficulty Manifold Visualization

Purpose: Visualize accuracy across 2D parameter grids (e.g., length × depth).

Usage:

python analyze.py surface <config> [options]

Options:

  • --filters JSON: Filter evaluations (supports eval_id, base_task, surfaces)
  • --split DIMENSION: Dimension to split into separate output files (default: base_task)
      • Use none for a single combined file
      • Any valid dimension: model, template, sampler, base_task, tier, etc.
  • --output-dir PATH: Directory for output files (default: current directory)
  • --output PATH: Output PNG file path (for single-file mode, ignored if --output-dir is used)

Grid Layout:

  • Rows: Evaluations (models)
  • Columns: Surfaces (task parameter grids)
  • Z-axis: Accuracy (0-1 scale)
  • Confidence intervals: Shown as a semi-transparent lower-bound surface

Split Mode (Default):

By default, surface splits by base_task to avoid creating very tall images. Each unique value of the split dimension generates a separate output file:

  • {dataset}_surface_{split_dim}-{value}.png - Multiple files with split
  • {dataset}_surface_grid.png - Single file without split

Requirements:

Surfaces must be defined in the config under basetasks[task_id]['surfaces']. Each surface specifies:

  • x_data/y_data: Parameter names (e.g., "length", "max_depth")
  • x_values/y_values: Grid coordinates
  • filter: Additional param constraints for this surface

Discovery:

python analyze.py tasks <config>  # Discover surfaces and projections
python analyze.py evals <config>  # Discover dimensions and values

Examples:

# Split by base_task (default) - one file per task
python analyze.py surface data/dataset-m12x.json

# Output to specific directory
python analyze.py surface data/dataset-m12x.json --output-dir results/

# Split by model instead
python analyze.py surface data/dataset-m12x.json --split model --output-dir figures/

# No split - single combined file
python analyze.py surface data/dataset-m12x.json --split none --output my_surfaces.png

# Filter to specific evaluations then split
python analyze.py surface data/dataset-m12x.json --filters '{"eval_id": [0, 1, 2]}' --output-dir results/

# Filter to specific task (legacy single file mode)
python analyze.py surface data/dataset-m12x.json --filters '{"base_task": "arithmetic"}' --split none --output arithmetic.png

Execution Time: ~2-3 minutes


fft - Token-Frequency Spectral Analysis

Purpose: Analyze frequency domain characteristics of token generation patterns.

Usage:

python analyze.py fft <config> [options]

Options:

  • --filters JSON: Filter evaluations (supports eval_id, base_task, projections)
  • --output PATH: Output PNG file path (default: auto-generated)

Grid Layout:

  • Rows: Task projections (parameter sweeps)
  • Columns: Evaluations (models)
  • Legend: Parameter values with consistent colors

What It Reveals:

  • Tokenizer differences vs model differences
  • Repetitive patterns in reasoning traces
  • Structural differences between models
  • Impact of prompt/sampler settings on output patterns

Requirements:

Projections must be defined in the config under basetasks[task_id]['projections']. Each projection specifies:

  • axis: Parameter name being swept
  • values: Parameter values
  • labels: Human-readable labels
  • filter: Base filter for this projection

Discovery:

python analyze.py tasks <config>  # Discover projections for each task

Examples:

# All tasks and projections
python analyze.py fft data/dataset-m12x.json

# Only arithmetic task
python analyze.py fft data/dataset-m12x.json --filters '{"base_task": "arithmetic"}'

# Multiple specific tasks
python analyze.py fft data/dataset-m12x.json --filters '{"base_task": ["arithmetic", "objects"]}'

# Specific projections
python analyze.py fft data/dataset-m12x.json --filters '{"projections": ["arith_easy", "arith_hard"]}'

# Specific evaluations
python analyze.py fft data/dataset-m12x.json --filters '{"eval_id": [0, 1]}'

# Combined filters
python analyze.py fft data/dataset-m12x.json --filters '{"eval_id": [0], "base_task": "arithmetic"}'

Execution Time: ~1-2 minutes


compression - Entropy Pattern Analysis

Purpose: Analyze information-theoretic failure mechanisms via compression correlation.

Usage:

python analyze.py compression <config> [options]

Options:

  • --filters JSON: Filter evaluations
  • --split DIMENSION: Dimension for separate files (default: base_task)
  • --stack DIMENSION: Dimension for rows within files (default: eval_id, use "none" to disable)
  • --series DIMENSION: Dimension for colored series (default: tier)
  • --output-dir PATH: Directory for output PNG files (default: current directory)

Processing Hierarchy:

  1. --split: Creates separate files
  2. --stack: Creates rows within each file (default: eval_id)
  3. --series: Creates colored series/columns

Output Structure:

Each file contains a three-panel composite:

  • Left panel: Correct answers
  • Center panel: Incorrect answers
  • Right panel: Truncated answers

Each panel shows: tokens vs compressed bytes, with histograms.
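
The underlying signal is how well a completion compresses relative to its length. A toy sketch of that measurement, with zlib standing in for whatever compressor and token counter ReasonScape actually uses:

import zlib

# Invented, deliberately repetitive reasoning trace.
completion = "Step 1: add the numbers. " * 8

n_tokens = len(completion.split())                        # stand-in for real token counts
n_bytes = len(zlib.compress(completion.encode("utf-8")))  # compressed size in bytes

# Highly repetitive output compresses well: many tokens, few compressed bytes.
print(n_tokens, n_bytes, n_bytes / max(n_tokens, 1))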

Requirements:

  • Database must contain compression data (automatically included with --dataset mode)
  • Token-level data is always generated in dataset mode

Examples:

# Default: split by base_task, stack by eval_id, series by tier
python analyze.py compression data/dataset-m12x.json --output-dir compression/

# Disable stacking (one row per file)
python analyze.py compression data/dataset-m12x.json --stack none --output-dir compression/

# Split by model, stack by base_task, series by tier
python analyze.py compression data/dataset-m12x.json --split model --stack base_task --output-dir compression/

# Change series dimension
python analyze.py compression data/dataset-m12x.json --series length --output-dir compression/

# Filter to specific evaluation
python analyze.py compression data/dataset-m12x.json --filters '{"eval_id": [0]}' --output-dir compression/

Execution Time: ~30 seconds


hazard - Survival Analysis

Purpose: Model temporal failure risk during token generation.

Usage:

python analyze.py hazard <config> [options]

Options:

  • --filters JSON: Filter evaluations
  • --split DIMENSIONS: Comma-separated dimension names for separate files (default: base_task,model)
  • --bucket-size TOKENS: Token bucket size for the time grid (default: 256)
  • --output-dir PATH: Directory for output PNG files (default: current directory)

Survival Analysis Framework:

Treats token generation as a temporal process:

  • Time: Token count
  • Event: Model produces an answer (correct or incorrect)
  • Censoring: Truncation (observation stopped before completion)
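
A minimal sketch of this framing (not the tool's implementation): bucket completions by token count, treat truncated runs as censored, and estimate a per-bucket hazard as answers produced divided by generations still running. Records and bucket size below are illustrative:

from collections import defaultdict

# Each record: (tokens_generated, outcome), outcome in {"correct", "incorrect", "truncated"}.
records = [(120, "correct"), (300, "incorrect"), (512, "truncated"), (700, "correct")]
BUCKET = 256  # mirrors the --bucket-size default

events = defaultdict(int)    # answers produced per token bucket
at_risk = defaultdict(int)   # generations still running at the start of each bucket

for tokens, outcome in records:
    last = tokens // BUCKET
    for b in range(last + 1):
        at_risk[b] += 1
    if outcome != "truncated":          # truncated runs are censored, never an event
        events[last] += 1

for b in sorted(at_risk):
    hazard = events[b] / at_risk[b]     # P(answer in this bucket | still generating)
    print(f"tokens {b * BUCKET}-{(b + 1) * BUCKET - 1}: hazard={hazard:.2f}")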

Split Parameter:

Comma-separated dimensions create one file per unique combination:

  • base_task,model (default): One file per task+model combination
  • base_task: One file per task (all models combined)
  • model,tier: One file per model+tier combination

What It Shows:

  • When failures occur during generation
  • Hazard curves for correct/incorrect outcomes
  • Truncation patterns

Requirements:

  • Database must contain compression data (includes token counts)

Examples:

# Default: split by base_task and model
python analyze.py hazard data/dataset-m12x.json --output-dir hazard/

# Split by base_task only
python analyze.py hazard data/dataset-m12x.json --split base_task --output-dir hazard/

# Split by model and tier with specific bucket size
python analyze.py hazard data/dataset-m12x.json --split model,tier --bucket-size 128 --output-dir hazard/

# Filter to specific evaluation
python analyze.py hazard data/dataset-m12x.json --filters '{"eval_id": [0]}' --output-dir hazard/

Execution Time: ~1-2 minutes


modelinfo - Collect Model Metadata

Purpose: Download and cache model architecture information from Hugging Face hub.

Usage:

python analyze.py modelinfo <config> --output-dir <path> [options]

Options:

  • --filters JSON: Filter evaluations (to select specific models)
  • --output-dir PATH: Directory for the metadata cache (required)

Downloads Per Model:

  • All .md files (README, MODEL_CARD, etc.)
  • config.json (architecture configuration)
  • recipe.yaml (quantization info, if it exists)

Output Structure:

metadata/
├── index.json                     # Quick lookup
├── model-org/
│   └── model-name/
│       ├── README.md
│       ├── config.json
│       └── recipe.yaml (if exists)

index.json Structure:

{
  "model-id": {
    "model_id": "org/name",
    "output_dir": "/absolute/path",
    "file_paths": { "README.md": "/path", ... },
    "metadata_complete": true,
    "missing_files": [],
    "fetch_errors": [],
    "architecture": "LlamaForCausalLM",
    "num_layers": 80,
    "hidden_size": 8192,
    "vocab_size": 128256,
    "cached_at": "2025-01-15T10:30:00Z"
  }
}
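
The index can be consumed programmatically. For example, a short loop over the fields shown above, assuming modelinfo was run with --output-dir metadata/:

import json

with open("metadata/index.json") as f:
    index = json.load(f)

for model_id, info in index.items():
    if info.get("metadata_complete"):
        print(model_id, info.get("architecture"), info.get("num_layers"), "layers")
    else:
        print(model_id, "incomplete, missing:", info.get("missing_files"))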

Examples:

# Collect for all models in config
python analyze.py modelinfo data/dataset-micro-minimax.json --output-dir metadata/

# Filter by eval_id
python analyze.py modelinfo data/dataset-micro-minimax.json --filters '{"eval_id": [0, 1]}' --output-dir metadata/

# Filter by groups
python analyze.py modelinfo data/dataset-micro-minimax.json --filters '{"groups": ["runtime:vllm"]}' --output-dir metadata/

Execution Time: ~30-60 seconds (with 500ms rate limiting between requests)


Filter System

All analysis commands support the --filters parameter for flexible data selection, except the discovery commands evals and tasks, which use their own --search and --groups options instead.

Filter Structure

The --filters parameter accepts JSON with these keys:

{
  "eval_id": [0, 1, 2],                    // List of evaluation indices
  "groups": ["runtime:vllm", "size:large"], // All must match (AND logic)
  "tiers": ["easy", "medium"],             // Difficulty tiers
  "base_task": "arithmetic",               // Single task or list
  "projections": ["arith_easy"],           // For FFT analysis
  "surfaces": ["arithmetic_length_x_depth"] // For surface analysis
}
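
If shell quoting becomes error-prone, the same filters can be built and passed from Python by serializing a dict with json.dumps. A sketch (hypothetical wrapper, not a feature of analyze.py):

import json
import shlex
import subprocess

filters = {"eval_id": [0, 1], "groups": ["runtime:vllm"]}
cmd = ["python", "analyze.py", "scores", "data/dataset.json",
       "--filters", json.dumps(filters), "--format", "json"]

print(shlex.join(cmd))            # show the equivalent shell command
subprocess.run(cmd, check=True)   # run analyze.py with the filters applied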

Filter Keys

Key           Type               Description                                                Used By
eval_id       List[int]          Evaluation indices (0, 1, 2, ...)                          All commands
groups        List[str]          Group tags (all must match)                                All commands
tiers         List[str]          Difficulty tier labels (e.g., "easy", "medium", "hard")    scores, spiderweb, compression, hazard
base_task     str or List[str]   Task name(s)                                               All commands
projections   List[str]          Projection IDs                                             fft
surfaces      List[str]          Surface IDs                                                surface

Shell Escaping

Important: Use single quotes around JSON, double quotes inside:

# Correct
--filters '{"eval_id": [0, 1]}'

# Wrong (shell will interpret double quotes)
--filters "{"eval_id": [0, 1]}"

Filter Examples

# By eval_id only
--filters '{"eval_id": [0, 1, 2]}'

# By groups (all must match)
--filters '{"groups": ["runtime:vllm"]}'

# Multiple groups (AND logic)
--filters '{"groups": ["runtime:vllm", "size:large"]}'

# By base_task (single)
--filters '{"base_task": "arithmetic"}'

# By base_task (multiple)
--filters '{"base_task": ["arithmetic", "letters"]}'

# Combined filters
--filters '{"eval_id": [0], "groups": ["family:phi"], "base_task": "arithmetic"}'

# FFT-specific: filter projections
--filters '{"projections": ["arith_easy", "arith_hard"]}'

# Surface-specific: filter surfaces
--filters '{"surfaces": ["arithmetic_length_x_depth"]}'

Discovering Valid Filter Values

Use evals and tasks subcommands to discover valid values:

# List all evaluations (shows eval_id, groups, tier availability, etc.)
python analyze.py evals data/dataset.json

# Search for specific evaluations
python analyze.py evals data/dataset.json --search "vllm"

# Show all tasks with surfaces and projections
python analyze.py tasks data/dataset.json

# Show specific task structure
python analyze.py tasks data/dataset.json --search "arithmetic"

Output Formats

Most commands support multiple output formats via the --format parameter.

Available Formats by Command

Command       Formats                           Default
evals         table, json, markdown             table
tasks         (always text output)              N/A
scores        table, json, markdown, csv, png   table
spiderweb     webpng, barpng                    webpng
cluster       markdown, json, png               markdown
surface       png                               png
fft           png                               png
compression   png                               png
hazard        png                               png
modelinfo     (directory structure)             N/A

Format Descriptions

table (ASCII table)

  • Terminal-friendly monospace output
  • Uses pandas to_string()
  • Best for quick inspection in the shell

markdown

  • GitHub-flavored markdown tables
  • Human-readable and version-controllable
  • Uses pandas to_markdown()

json

  • Machine-readable structured data
  • Suitable for programmatic analysis
  • Pretty-printed with 2-space indent

csv

  • Spreadsheet-compatible format
  • Only available for the scores command
  • Standard comma-separated values

png

  • Visual rendering for charts and plots
  • Requires system dependencies (see Requirements)
  • Widths/heights vary by command

webpng (spiderweb only)

  • Spider web / radar chart visualization
  • Shows the multi-dimensional task breakdown

barpng (spiderweb only)

  • Vertical bar chart visualization
  • Alternative to the spider web for the same data

Output Destination

Stdout (default):

# Print to terminal
python analyze.py scores data/dataset.json

File output:

# Single file
python analyze.py scores data/dataset.json --output scores.md

# Multiple files (directory)
python analyze.py compression data/dataset.json --output-dir compression/

Auto-Generated Filenames

When --output is not specified, commands that generate files use auto-generated names based on:

  • Dataset name (from the config "name" field)
  • Filter values (eval_id, tiers, etc.)
  • Command type (e.g., _spider, _cluster)

Example auto-generated names:

micro-minimax_leaderboard.png
micro-minimax_eval0_phi-4_spider.png
micro-minimax_cluster_base_task.png
micro-minimax_surface_grid.png


Requirements

Python Dependencies

pip install pandas duckdb plotly kaleido imgkit huggingface_hub

Package purposes:

  • pandas: DataFrame manipulation and table formatting
  • duckdb: Database backend for efficient queries
  • plotly: Interactive plotting (spider webs, surfaces, FFT)
  • kaleido: Static image export for plotly (PNG output)
  • imgkit: HTML-to-PNG conversion (for scores PNG)
  • huggingface_hub: Model metadata fetching (modelinfo)

System Requirements

For PNG generation:

  1. wkhtmltoimage (required by imgkit for scores PNG):

    # Ubuntu/Debian
    sudo apt-get install wkhtmltopdf
    
    # macOS
    brew install wkhtmltopdf
    

  2. kaleido (required by plotly for static export):

    pip install kaleido
    

Data Requirements

Basic analysis (evals, scores, spiderweb, cluster):

  • Standard evaluation data in DuckDB
  • Generate with: python evaluate.py --dataset data/dataset.json

Compression and hazard analysis:

  • Token-level data required (automatically included)
  • Generate with: python evaluate.py --dataset data/dataset.json

FFT analysis:

  • Projection definitions in the config under basetasks[task_id]['projections']

Surface analysis:

  • Surface definitions in the config under basetasks[task_id]['surfaces']

Configuration Requirements

Minimum config structure:

{
  "db": "path/to/database.db",
  "name": "Dataset Name",
  "evals": [
    {
      "label": "Model Name",
      "hf_id": "org/model-name",
      "filters": {
        "model": "model-id",
        "template": "template-id",
        "sampler": "sampler-id"
      },
      "groups": ["runtime:vllm", "family:llama"]
    }
  ],
  "basetasks": {
    "task_id": {
      "label": "Task Name",
      "projections": [...],  // Optional (for FFT)
      "surfaces": [...]      // Optional (for surface)
    }
  },
  "tiers": [...]  // Optional (for tier-based analysis)
}


Error Handling

Common Errors and Solutions

Missing Database

Error:

Error: Database not found: /path/to/database.db
Run evaluate.py with --db to create it

Solution:

python evaluate.py --dataset /path/to/dataset.json


Invalid JSON Filters

Error:

Error: Invalid JSON in --filters: {...}

Solutions:

  • Check JSON syntax (use a JSON validator)
  • Use single quotes around the JSON: '{"key": "value"}'
  • Use double quotes inside the JSON: "key", not 'key'
  • Escape properly if you must use double quotes in the shell
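
A quick way to check the JSON locally before passing it (json is in the Python standard library):

import json

# Valid and invalid --filters candidates; single quotes are not valid JSON.
for candidate in ['{"eval_id": [0, 1]}', "{'eval_id': [0, 1]}"]:
    try:
        json.loads(candidate)
        print("valid:  ", candidate)
    except json.JSONDecodeError as err:
        print("invalid:", candidate, "-", err)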


No Matching Data

Error:

No results found matching filters

Solutions:

  1. Verify filters are correct:

    python analyze.py evals data/dataset.json

  2. Check if data exists in the database:

    python analyze.py scores data/dataset.json  # Try without filters

  3. Relax filter constraints


Missing Config Fields

Error:

Error: Config must specify 'db' path

Solution: Add "db" field to config JSON:

{
  "db": "path/to/database.db",
  ...
}


PNG Generation Fails

Error:

Error writing PNG: ...
Note: PNG export requires kaleido package: pip install kaleido

Solution:

pip install kaleido

Error (imgkit):

No wkhtmltoimage executable found

Solution:

# Ubuntu/Debian
sudo apt-get install wkhtmltopdf

# macOS
brew install wkhtmltopdf


Getting Help

Command-Line Help

# List all subcommands
python analyze.py --help

# Subcommand-specific help
python analyze.py evals --help
python analyze.py scores --help
python analyze.py spiderweb --help
python analyze.py cluster --help
python analyze.py surface --help
python analyze.py fft --help
python analyze.py compression --help
python analyze.py hazard --help
python analyze.py modelinfo --help

Help Output Includes

  • Description of the subcommand
  • Positional arguments
  • Optional arguments with defaults
  • Output format options
  • Examples
  • Filter structure (where applicable)

Example Help Output

$ python analyze.py scores --help

usage: analyze.py scores [-h] [--filters FILTERS]
                         [--format {json,markdown,table,csv,png}]
                         [--width WIDTH] [--output OUTPUT]
                         config

Generate leaderboard/rankings across evaluations

positional arguments:
  config                Dataset configuration file

optional arguments:
  -h, --help            show this help message and exit
  --filters FILTERS     Filter as JSON (e.g., '{"eval_id": [0, 1],
                        "groups": ["runtime:vllm"]}')
  --format {json,markdown,table,csv,png}
                        Output format (default: table)
  --width WIDTH         PNG image width in pixels (default: 1400,
                        only used with --format png)
  --output OUTPUT       Write output to file instead of stdout

See Also

Documentation

Source Code