
ReasonScape Tool Reference

Quick reference for all ReasonScape tools. For what to run when, see workflow.md. For architectural context, see implementation.md.

Filter Syntax

Most analyze.py subcommands accept --filters as a JSON string. Use single quotes around the argument for the shell and double quotes inside the JSON.

--filters '{"eval_id": ["a1b2c3", "d4e5f6"]}'
--filters '{"base_task": "arithmetic"}'
--filters '{"eval_id": ["a1b2c3"], "base_task": "arithmetic"}'
| Key | Type | Description |
|---|---|---|
| eval_id | list[str] | 6-char hash(es) identifying model+template+sampler |
| base_task | str or list | Task name(s) |
| groups | list[str] | Group tags, e.g. ["arch:moe", "size:large"] (AND logic within item, OR across items) |

Discover valid values with analyze.py evals and analyze.py tasks before filtering.
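
Quoting JSON inside shell arguments is a common source of errors. A small sketch of building the --filters argument programmatically so the quoting is always correct:

```python
import json
import shlex

# Build the filter dict in Python, then serialize it: json.dumps always
# emits double quotes, which is exactly what --filters expects.
filters = {"eval_id": ["a1b2c3"], "base_task": "arithmetic"}
arg = json.dumps(filters)

# shlex.quote wraps the JSON in single quotes for safe shell interpolation.
cmd = f"python analyze.py points data/r12.json --filters {shlex.quote(arg)}"
print(cmd)
```

This avoids hand-balancing quote levels when filters get long or are generated in a script.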


Webapps

All three webapps take a dataset config and open an interactive browser UI. They share the common flags --port (override the default), --url-base-pathname (for reverse-proxy deployments), and --debug.

leaderboard.py

Interactive rankings with heatmap visualization.

python leaderboard.py data/r12.json
# Open http://localhost:8050

python leaderboard.py data/tables-16k.json --group-by manifold.target_tokens --port 8060
| Flag | Default | Description |
|---|---|---|
| --group-by | base_task | Aggregation dimension. Fixed at startup (not interactive). Use manifold.target_tokens or facets.operations for single-task datasets. |
| --port | 8050 | Web server port |
| --url-base-pathname | / | URL base path (reverse-proxy deployments) |

Features: ReasonScore rankings, token efficiency, heatmap cells with truncation indicators, group/manifold filtering, pagination.

explorer.py

Interactive 3D surface visualization of reasoning landscapes.

python explorer.py data/r12.json
# Open http://localhost:8051
| Flag | Default | Description |
|---|---|---|
| --port | 8051 | Web server port |
| --url-base-pathname | / | URL base path (reverse-proxy deployments) |

Features: 3D accuracy surfaces, multi-panel synchronized analysis (FFT, accuracy, histograms), cross-model comparison, point inspection.

spiderweb.py

Per-model radar chart web UI (distinct from analyze.py spiderweb which outputs static PNG).

python spiderweb.py data/r12.json
# Open http://localhost:8051

# Single-task datasets need an alternative group-by axis
python spiderweb.py data/tables-16k.json --group-by manifold.target_tokens
python spiderweb.py data/tables-16k.json --group-by params.operation
| Flag | Default | Description |
|---|---|---|
| --group-by | base_task | Radar slice dimension. Required for single-task datasets: use manifold.target_tokens or params.operation instead of base_task to get a meaningful web. |
| --port | 8051 | Web server port |
| --url-base-pathname | / | URL base path (reverse-proxy deployments) |

Features: Interactive radar/bar toggle, cognitive archetype identification, token efficiency overlay.


Pipeline CLI Tools

runner.py

Generates prompts and executes an experiment configuration against a concrete LLM.

API mode (default) — streams requests to a live OpenAI-compatible endpoint:

python runner.py \
  --config configs/r12.yaml \
  --template templates/zeroshot-nosys.json \
  --sampler samplers/greedy-max.json \
  --model your-model \
  --apibase http://localhost:3333

Batch mode — for offline inference engines (e.g., vLLM offline batch). Three-step process:

# Step 1: generate batch input NDJSON
python runner.py \
  --config configs/r12.yaml \
  --template templates/zeroshot-nosys.json \
  --sampler samplers/greedy-max.json \
  --model your-model \
  --mode write-batch
# Produces: results/2026-03-30_22-29-15_r12_..._flash-offline.ndjson

# Step 2: run vLLM offline batch
vllm run-batch \
  -i results/2026-03-30_22-29-15_r12_..._flash-offline.ndjson \
  -o output.ndjson \
  --model /path/to/model \
  -tp 4 \
  --max-model-len 16384 \
  --served-model-name your-model

# Step 3: import vLLM output and evaluate
python runner.py \
  --config configs/r12.yaml \
  --template templates/zeroshot-nosys.json \
  --sampler samplers/greedy-max.json \
  --model your-model \
  --mode read-batch \
  --batch output.ndjson
| Flag | Default | Description |
|---|---|---|
| --config | required | YAML experiment configuration (see config.md) |
| --template | required | Prompt template file (see templates-samplers.md) |
| --sampler | required | Generation parameters file (see templates-samplers.md) |
| --model | required | Model identifier |
| --mode | api | api, write-batch, or read-batch |
| --apibase | required (api mode) | API base URL |
| --apikey | | API key |
| --batch | | vLLM output file to import (read-batch mode only) |
| --precision | first defined | Precision level from config |
| --seed | | Random seed |
| --parallel | | Parallel completions (api mode) |
| --timeout | | Request timeout in seconds (api mode) |
| --cache | cache/ | Cache DB file or directory |
| --task | all | Filter to specific tasks (repeatable) |
| --output | results/ | Output base directory |
| --no-shuffle | | Run steps in order instead of randomized order |
| --quiet | | Suppress per-step output |
| --tokenizer | | HuggingFace tokenizer ID (required for calibrate tasks) |

Output: Timestamped folders under results/ containing NDJSON files with complete inference traces. Batch write-mode produces a single flat NDJSON instead of a folder.
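The batch input schema is whatever runner.py emits for your engine; as a rough illustration (field names below follow the OpenAI-style batch format and are an assumption, not ReasonScape's documented schema), each NDJSON line is one self-contained request object:

```python
import json

# Hypothetical shape of one batch-input line; inspect your generated file
# for the real fields runner.py writes.
line = json.dumps({
    "custom_id": "step-0001",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {"model": "your-model",
             "messages": [{"role": "user", "content": "2+2?"}]},
})

# NDJSON = one self-contained JSON object per line, so each line
# round-trips through json.loads independently.
record = json.loads(line)
print(record["custom_id"], record["url"])
```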


evaluate.py

Processes raw NDJSON results into the PointsDB DuckDB database.

# Process all evals in a dataset config
python evaluate.py --dataset data/r12.json --parallel 16

# Process specific NDJSON files directly
python evaluate.py --interview results/2025-*/arithmetic*.ndjson
| Flag | Default | Description |
|---|---|---|
| --dataset | | Dataset config JSON (processes all evals) |
| --interview | | Path, glob, or comma-separated list of NDJSON files |
| --parallel | 16 | Worker count for bucket processing |
| --offline | | Re-process answers (interview mode only) |
| --inplace | | Write results back to NDJSON (requires --offline) |
| --quiet | | Less output |

--dataset and --interview are mutually exclusive. Run this after every new result import; it is idempotent.


cohort.py

Postprocesses cohorts — lists, verifies, and creates context-limited variants.

python cohort.py list                              # All cohorts in data/
python cohort.py list data/r12/Qwen3-Next-80B     # Specific cohort
python cohort.py list --search qwen               # Regex search
python cohort.py verify                           # Verify all cohorts
python cohort.py verify data/r12/Qwen3.5-27B      # Specific cohort
python cohort.py context data/r12/ModelName \
  --eval-id a1b2c3 --context 12288               # Create 12k context variant

Subcommands:

| Subcommand | Description |
|---|---|
| list [path] | List evals with markdown table output. Accepts glob patterns. |
| verify [path] | Verify glob patterns in evals.json match expected scenarios. |
| `context <cohort>` | Create context-limited variant from an existing eval. Non-destructive (new timestamped folder). |

context key flags: --eval-id (required), --context (token limit, required).

After running context, rebuild the database: python evaluate.py --dataset data/<name>.json.


data.py

Content-addressed blob storage for sharing evaluation data.

python data.py pull dataset data/r12.json           # Pull DB only (~2GB)
python data.py pull dataset data/r12.json --full    # Pull everything (~20GB compressed)
python data.py pull cohort data/r12/GLM-4.5        # Pull one cohort (~150MB)
python data.py push cohort data/r12/NewModel-7B    # Publish new cohort
python data.py status                               # Show local/remote state
python data.py prune                                # Remove unreferenced blobs

Subcommands:

| Subcommand | Scope options | Description |
|---|---|---|
| pull | eval, cohort, dataset | Download and unpack blobs |
| push | eval, cohort, dataset | Pack, upload, update manifest |
| status | eval, cohort, dataset | Show local vs remote state |
| prune | | Remove blobs not referenced by manifest |

Blobs stored in data/.blobs/ (gitignored). Manifest at data/manifest.json (tracked).
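
Content addressing means a blob's storage path is derived from a hash of its bytes, so identical content dedupes automatically and the manifest can verify integrity. The hash choice and directory fan-out below are assumptions for illustration, not data.py's actual layout:

```python
import hashlib

def blob_path(content: bytes) -> str:
    # Name the blob after its own digest: same bytes -> same path.
    digest = hashlib.sha256(content).hexdigest()
    # Two-char fan-out subdirectory is a common convention (an assumption here).
    return f"data/.blobs/{digest[:2]}/{digest}"

p1 = blob_path(b"cohort archive bytes")
p2 = blob_path(b"cohort archive bytes")
assert p1 == p2  # identical content resolves to the identical address
```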


analyze.py

Unified analysis CLI. All subcommands share the same invocation pattern:

python analyze.py <subcommand> <dataset> [options]

Discovery

evals

List all evaluations in a dataset. Run this first to discover eval_id values.

python analyze.py evals data/r12.json
python analyze.py evals data/r12.json --search "phi-4"
python analyze.py evals data/r12.json --facets "size:small,arch:moe"
| Flag | Default | Description |
|---|---|---|
| --search | | Regex filter on model name (case-insensitive) |
| --facets | | Comma-separated key:value pairs (AND logic) |
| --format | table | table, markdown, json |

Output: eval_id, label, model, template, sampler, groups, hf_id, point counts.
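
The AND semantics of --facets can be sketched as a tiny matcher (function names are hypothetical; assumes group tags are key:value strings):

```python
def parse_facets(spec: str) -> dict:
    # "size:small,arch:moe" -> {"size": "small", "arch": "moe"}
    return dict(pair.split(":", 1) for pair in spec.split(","))

def matches(eval_groups: list, spec: str) -> bool:
    # AND logic: every requested key:value must appear in the eval's tags.
    wanted = parse_facets(spec)
    tags = dict(g.split(":", 1) for g in eval_groups)
    return all(tags.get(k) == v for k, v in wanted.items())

print(matches(["size:small", "arch:moe", "extra:tag"], "size:small,arch:moe"))
```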

tasks

List base tasks with their facet definitions and point counts.

python analyze.py tasks data/r12.json
python analyze.py tasks data/r12.json --search "arith"
| Flag | Default | Description |
|---|---|---|
| --search | | Regex filter on task name |

Output: task names, view definitions (surfaces and projections), point counts.

tokens

Token utilization statistics grouped by an arbitrary dimension.

python analyze.py tokens data/r12.json
python analyze.py tokens data/r12.json --group-by manifold.id
python analyze.py tokens data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --group-by facets.operations
| Flag | Default | Description |
|---|---|---|
| --group-by | base_task | Grouping dimension (any aggregate dimension) |
| --filters | | Standard filter JSON |
| --format | markdown | markdown, json |
| --output-dir | stdout | Directory for output file |

Output: flat table of prompt/completion/correct/incorrect token averages per (eval, group) pair.

points

List per-point evaluation data. Bridge between aggregated analysis and raw trace inspection.

python analyze.py points data/r12.json --output-dir points/
python analyze.py points data/r12.json \
  --filters '{"eval_id": ["a1b2c3"], "base_task": "arithmetic"}' \
  --sort max_number --sort length \
  --output-dir points/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --split-by | base_task | Dimension for separate output files |
| --sort | | Sort by param column, without the p. prefix (repeatable) |
| --format | md | md, json |
| --output-dir | . | Output directory |

Output: one file per split value with point IDs, params, adjusted center/margin, and raw counters (correct/incorrect/invalid/truncated). Point IDs are inputs to probe.py.

modelinfo

Download model metadata from Hugging Face hub.

python analyze.py modelinfo data/r12.json --output-dir metadata/
python analyze.py modelinfo --hf-id meta-llama/Llama-3.1-70B --output-dir metadata/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --hf-id | | Specific HF model ID (skips dataset loading) |
| --output-dir | required | Cache directory |

Downloads README, *config.json, recipe.yaml per model. Generates MODELINFO.md summary.


Position

scores

Absolute ranking by ReasonScore (distance-from-ideal metric).

python analyze.py scores data/r12.json
python analyze.py scores data/r12.json --format png --output-dir results/
python analyze.py scores data/r12.json --sort ratio --top 10
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | base_task | Aggregation dimension for ReasonScore computation |
| --sort | score | score (ReasonScore) or ratio (score/token) |
| --top | | Limit to top N by rank |
| --format | markdown | markdown, json, png |
| --width | 1400 | PNG width in pixels |
| --output-dir | . | Output directory |

cluster

Statistical clustering via confidence interval overlap. Models that are statistically indistinguishable are grouped together.

python analyze.py cluster data/r12.json
python analyze.py cluster data/r12.json --facet-by base_task --split-by eval_id
python analyze.py cluster data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --format png --output-dir results/
python analyze.py cluster data/r12.json --facet-by none   # collapse all tasks into one ranking
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | eval_id | Series dimension: things being compared within each panel |
| --facet-by | base_task | Facet dimension: sections within each output file |
| --split-by | | Split dimension: one output file per unique value |
| --mode | C_P | Statistical mode: E_I, E_P, E_O, C_I, C_P, C_O |
| --format | markdown | markdown, json, png |
| --width | 1200 | PNG width in pixels |
| --output-dir | . | Output directory |
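
The grouping idea can be sketched as follows (a simplified leader-based overlap rule; ReasonScape's statistical modes such as C_P differ in detail):

```python
def cluster_by_overlap(items):
    # items: (name, ci_low, ci_high) tuples, sorted by descending point estimate.
    clusters = []
    for name, low, high in items:
        if clusters and high >= clusters[-1]["low"]:
            # CI reaches the cluster leader's lower bound -> indistinguishable.
            clusters[-1]["names"].append(name)
        else:
            # No overlap -> statistically distinct, start a new cluster.
            clusters.append({"names": [name], "low": low})
    return [c["names"] for c in clusters]

print(cluster_by_overlap([("A", 0.80, 0.90), ("B", 0.78, 0.86), ("C", 0.50, 0.60)]))
```

Here A and B land in one cluster because B's interval reaches A's lower bound, while C is clearly separated.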

rank

Rank aggregation across facets. Clusters models per facet, then scores each model by its best cluster position in each facet, summing penalties into an overall leaderboard. Missing data in any facet invalidates that model's row.

python analyze.py rank data/r12.json
python analyze.py rank data/r12.json --facet-by base_task --filters '{"groups": [["arch:moe"]]}'
python analyze.py rank data/tables-16k.json --facet-by params.operation
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | eval_id | Series dimension: things being ranked |
| --facet-by | base_task | Facet dimension: what is ranked across (becomes columns) |
| --mode | C_P | Statistical mode: E_I, E_P, E_O, C_I, C_P, C_O |
| --format | markdown | markdown, json |
| --output-dir | . | Output directory |
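
A minimal sketch of the penalty-sum idea, assuming cluster index 0 is the top cluster (the actual scoring in analyze.py rank may differ):

```python
def aggregate_ranks(per_facet_clusters):
    # per_facet_clusters: {facet: [[models in cluster 0], [cluster 1], ...]}
    all_models = {m for clusters in per_facet_clusters.values()
                  for c in clusters for m in c}
    totals = {}
    for model in all_models:
        penalty = 0
        for clusters in per_facet_clusters.values():
            pos = next((i for i, c in enumerate(clusters) if model in c), None)
            if pos is None:
                penalty = None  # missing data in any facet invalidates the row
                break
            penalty += pos      # penalty = best cluster position in this facet
        totals[model] = penalty
    return totals
```

Lower totals rank higher; a None total marks a model excluded for missing data.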

pairwise

Head-to-head win probabilities using Bradley-Terry probabilistic ranking.

python analyze.py pairwise data/r12.json
python analyze.py pairwise data/r12.json --sort bradley-terry --format png --output pairwise.png
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | base_task | Comparison dimension |
| --mode | C_P | C_P (pessimistic) or C_I (independence) |
| --sort | expected-wins | expected-wins or bradley-terry |
| --format | markdown | markdown, png |
| --width | 1200 | PNG width in pixels |
| --output | stdout/auto | Output file path |
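
For intuition, a minimal Bradley-Terry fit via the standard minorization-maximization update (illustrative only, not the estimator analyze.py implements):

```python
def bradley_terry(wins, iters=200):
    # wins[(i, j)] = number of times i beat j
    players = {p for pair in wins for p in pair}
    s = {p: 1.0 for p in players}  # strength parameters
    for _ in range(iters):
        new = {}
        for p in players:
            w = sum(n for (a, _b), n in wins.items() if a == p)  # total wins of p
            denom = sum(
                (wins.get((p, q), 0) + wins.get((q, p), 0)) / (s[p] + s[q])
                for q in players if q != p
            )
            new[p] = w / denom if denom else s[p]
        total = sum(new.values())  # renormalize for identifiability
        s = {p: v * len(players) / total for p, v in new.items()}
    return s
```

The implied head-to-head probability is s_i / (s_i + s_j): with an 8-2 record the fitted strengths give roughly an 80% win probability.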

Profile / Point

spiderweb

Radar or bar chart of per-task accuracy for one or more models. Static PNG output (distinct from the spiderweb.py webapp).

python analyze.py spiderweb data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --output-dir results/
python analyze.py spiderweb data/r12.json --format barpng
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | base_task | Slice dimension |
| --format | webpng | webpng (radar) or barpng (bar chart) |
| --width | 1000 | PNG width |
| --height | 700 | PNG height |
| --output-dir | . | Output directory (one file per eval) |

surface

3D accuracy surfaces across 2D parameter grids (e.g., depth × length). Grid layout: rows = views, cols = evals.

python analyze.py surface data/r12.json --output-dir results/
python analyze.py surface data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --split-by none
python analyze.py surface data/r12.json --view depth_length_single --output-dir results/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --view | all 2-dim views | View ID(s) to include (repeatable) |
| --split-by | base_task | Dimension for separate output files (none for single file) |
| --output-dir | . | Output directory |

compression

Information-theoretic failure analysis via compression correlation. Shows correct/incorrect/truncated answer patterns by entropy characteristics.

python analyze.py compression data/r12.json --output-dir compression/
python analyze.py compression data/r12.json \
  --filters '{"eval_id": ["a1b2c3"]}' \
  --split-by base_task --facet-by eval_id --group-by manifold.id \
  --output-dir compression/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --split-by | base_task | Separate output files |
| --facet-by | eval_id | Subplot rows within each file (none to disable) |
| --group-by | manifold.id | Colored series within each panel |
| --output-dir | . | Output directory |
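
As intuition for the entropy angle: repetitive failure modes compress dramatically better than varied text. A stand-in sketch using zlib compression ratio (not ReasonScape's actual metric):

```python
import hashlib
import zlib

def compression_ratio(text: str) -> float:
    # Smaller ratio = more compressible = lower entropy.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# A stuck, repeating trace compresses to almost nothing...
looping = "Let me re-check the sum. " * 200

# ...while high-entropy text barely compresses (deterministic hash chain
# used here as a stand-in for varied output).
h, chunks = b"seed", []
for _ in range(100):
    h = hashlib.sha256(h).digest()
    chunks.append(h.hex())
varied = "".join(chunks)

print(compression_ratio(looping) < compression_ratio(varied))
```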

hazard

Survival analysis showing when during token generation failures occur.

python analyze.py hazard data/r12.json --output-dir hazard/
python analyze.py hazard data/r12.json \
  --filters '{"eval_id": ["a1b2c3"]}' \
  --split-by base_task --split-by eval_id \
  --output-dir hazard/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --split-by | base_task eval_id | Dimension(s) for separate files (repeatable) |
| --bucket-size | 256 | Token bucket size for time grid |
| --output-dir | . | Output directory |
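
The survival framing can be sketched as a discrete hazard: among traces still generating at the start of each token bucket, the fraction that fail within it (a simplification; the estimator in analyze.py hazard may differ):

```python
def hazard_curve(traces, bucket=256):
    # traces: (tokens_generated, failed) pairs; failed=False means finished OK.
    n_buckets = max(t for t, _ in traces) // bucket + 1
    curve = []
    for b in range(n_buckets):
        # "At risk": traces that were still generating when this bucket began.
        at_risk = sum(1 for t, _ in traces if t >= b * bucket)
        failed = sum(1 for t, f in traces
                     if f and b * bucket <= t < (b + 1) * bucket)
        curve.append(failed / at_risk if at_risk else 0.0)
    return curve
```

A rising curve means failures concentrate late in generation; an early spike points at prompt-level breakdown.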

aggregate

Raw aggregated statistics with optional multi-mode comparison. Lower-level than scores; useful for debugging or custom analysis.

python analyze.py aggregate data/r12.json --filters '{"eval_id": ["a1b2c3"]}'
python analyze.py aggregate data/r12.json --compare-modes --raw-counters --tokens
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | eval_id | Series dimension |
| --facet-by | base_task | Sections within file |
| --split-by | | One file per unique value |
| --mode | C_P | Statistical mode |
| --compare-modes | | Show all 6 estimators side by side |
| --raw-counters | | Include n_u, n_t, n_e, g columns |
| --tokens | | Include token average columns |
| --format | markdown | markdown, json |
| --output-dir | . | Output directory |

capacity

Effective capacity analysis: accuracy vs complexity frontier, or peak capacity vs token budget.

python analyze.py capacity data/r12.json --frontier --output-dir results/
python analyze.py capacity data/r12.json --peak --format png --output-dir results/
python analyze.py capacity data/r12.json --frontier manifold.id --facet-by facets.arch
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --frontier [GROUP_BY] | | Accuracy vs num_rows sigmoid plots |
| --peak [GROUP_BY] | | Effective capacity vs target_tokens plots |
| --facet-by | | Subplot columns dimension |
| --format | markdown | markdown, json, png |
| --width | 1200 | PNG width |
| --output-dir | . | Output directory |

--frontier and --peak are mutually exclusive.


probe.py

Deep inspection of raw evaluation traces. Requires point IDs discovered via analyze.py points.

python probe.py <subcommand> <dataset> [options]

fft

Frequency domain analysis of prompt token sequences. Computed on-the-fly from raw NDJSON files (no pre-computed DB columns needed). Uses 1-dim views from the dataset config.

python probe.py fft data/r12.json --output-dir research/fft/
python probe.py fft data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --split-by none
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --view | all 1-dim views | View ID(s) to include (repeatable) |
| --split-by | base_task | Separate output files (none for single file) |
| --output-dir | . | Output directory |

Grid layout: rows = view entries, cols = evals. Color axis = the view's single group_by dimension.

failure

Analyze failure patterns for a specific point: raw answer text and genresult facet breakdown.

python probe.py failure data/r12.json --point-id 1234
python probe.py failure data/r12.json --point-id 1234 --point-id 5678 --limit 5 --full
| Flag | Default | Description |
|---|---|---|
| --point-id | required | Point ID to analyze (repeatable to aggregate) |
| --limit | all | Show only first N failure examples |
| --full | | Show full answer text without truncation |

Output: raw answer text per failure with extracted/reference answers; facet analysis on genresult metadata.

truncation

Segment-based loop detection in truncated model outputs.

python probe.py truncation data/r12.json --point-id 5
| Flag | Default | Description |
|---|---|---|
| --point-id | required | Point ID to analyze |

Loop classifications:

| Type | Pattern | Interpretation |
|---|---|---|
| 1-LOOP-STATIC | Fixed-length segments (Δtok ≈ 0) | Model stuck, repeating verbatim |
| 1-LOOP-GROWING | Growing segments (Δtok > 0) | Counter loops: appears to progress but isn't |
| 1-LOOP-SHRINKING | Shrinking segments (Δtok < 0) | Attempting to escape a verbose state |
| 2-LOOP-STATIC | ABAB alternation | Two-state oscillation (check-act cycles) |
| 3-LOOP-STATIC | ABCABC pattern | Three-phase reasoning template loop |
| DEGENERATE | Character-level repetition | Sampler collapse |

Loop coverage > 90% = severe failure. 1-LOOP-GROWING at high coverage is the most dangerous: the model appears to be making progress while stuck.
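
A toy version of the segment-based idea, checking verbatim repetition and segment-length drift (classifier names reused for illustration; probe.py's real detector is richer):

```python
def classify_tail(segments):
    # segments: the trailing output split into segments (e.g. at newlines).
    if len(segments) < 3:
        return None
    if len(set(segments)) == 1:
        return "1-LOOP-STATIC"      # identical segments: stuck verbatim
    if segments == segments[:2] * (len(segments) // 2):
        return "2-LOOP-STATIC"      # ABAB alternation
    lengths = [len(s) for s in segments]
    deltas = [b - a for a, b in zip(lengths, lengths[1:])]
    if all(d > 0 for d in deltas):
        return "1-LOOP-GROWING"     # counter loop: segments keep growing
    if all(d < 0 for d in deltas):
        return "1-LOOP-SHRINKING"
    return None
```

Segment lengths here stand in for the Δtok measure in the table above.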

Discover point IDs first: python analyze.py points data/r12.json --filters '{"eval_id": ["a1b2c3"]}'