
ReasonScape Tool Reference

Quick reference for all ReasonScape tools. For what to run when, see workflow.md. For architectural context, see implementation.md.

Filter Syntax

Most analyze.py subcommands accept --filters as a JSON string. Use single quotes around the argument for the shell and double quotes inside the JSON.

--filters '{"eval_id": ["a1b2c3", "d4e5f6"]}'
--filters '{"base_task": "arithmetic"}'
--filters '{"eval_id": ["a1b2c3"], "base_task": "arithmetic"}'
| Key | Type | Description |
|---|---|---|
| eval_id | list[str] | 6-char hash(es) identifying model+template+sampler |
| base_task | str or list | Task name(s) |
| groups | list[str] | Group tags, e.g. ["arch:moe", "size:large"] (AND logic within item, OR across items) |

Discover valid values with analyze.py evals and analyze.py tasks before filtering.
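
Quoting JSON inside shell arguments is a common source of errors. A small sketch of building the --filters argument programmatically so the quoting is always correct:

```python
import json
import shlex

# Build the filter dict in Python, then serialize it: json.dumps always
# emits double quotes, which is exactly what --filters expects.
filters = {"eval_id": ["a1b2c3"], "base_task": "arithmetic"}
arg = json.dumps(filters)

# shlex.quote wraps the JSON in single quotes for safe shell interpolation.
cmd = f"python analyze.py points data/r12.json --filters {shlex.quote(arg)}"
print(cmd)
```

This avoids hand-balancing quote levels when filters get long or are generated in a script.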


Webapps

All three webapps take a dataset config and open an interactive browser UI. They share the common flags --port (override the default), --url-base-pathname (for reverse-proxy deployments), and --debug.

leaderboard.py

Interactive rankings with heatmap visualization.

python leaderboard.py data/r12.json
# Open http://localhost:8050

python leaderboard.py data/tables-16k.json --group-by manifold.target_tokens --port 8060
| Flag | Default | Description |
|---|---|---|
| --group-by | base_task | Aggregation dimension. Fixed at startup (not interactive). Use manifold.target_tokens or facets.operations for single-task datasets. |
| --port | 8050 | Web server port |
| --url-base-pathname | / | URL base path (reverse-proxy deployments) |

Features: ReasonScore rankings, token efficiency, heatmap cells with truncation indicators, group/manifold filtering, pagination.

explorer.py

Interactive 3D surface visualization of reasoning landscapes.

python explorer.py data/r12.json
# Open http://localhost:8051
| Flag | Default | Description |
|---|---|---|
| --port | 8051 | Web server port |
| --url-base-pathname | / | URL base path (reverse-proxy deployments) |

Features: 3D accuracy surfaces, multi-panel synchronized analysis (FFT, accuracy, histograms), cross-model comparison, point inspection.

spiderweb.py

Per-model radar chart web UI (distinct from analyze.py spiderweb which outputs static PNG).

python spiderweb.py data/r12.json
# Open http://localhost:8051

# Single-task datasets need an alternative group-by axis
python spiderweb.py data/tables-16k.json --group-by manifold.target_tokens
python spiderweb.py data/tables-16k.json --group-by params.operation
| Flag | Default | Description |
|---|---|---|
| --group-by | base_task | Radar slice dimension. Required for single-task datasets: use manifold.target_tokens or params.operation instead of base_task to get a meaningful web. |
| --port | 8051 | Web server port |
| --url-base-pathname | / | URL base path (reverse-proxy deployments) |

Features: Interactive radar/bar toggle, cognitive archetype identification, token efficiency overlay.


Pipeline CLI Tools

runner.py

Generates prompts and executes an experiment configuration against a concrete LLM.

API mode (default) — streams requests to a live OpenAI-compatible endpoint:

python runner.py \
  --config configs/r12.yaml \
  --template templates/zeroshot-nosys.json \
  --sampler samplers/greedy-max.json \
  --model your-model \
  --apibase http://localhost:3333

Batch mode — for offline inference engines (e.g., vLLM offline batch). Three-step process:

# Step 1: generate batch input NDJSON
python runner.py \
  --config configs/r12.yaml \
  --template templates/zeroshot-nosys.json \
  --sampler samplers/greedy-max.json \
  --model your-model \
  --mode write-batch
# Produces: results/2026-03-30_22-29-15_r12_..._flash-offline.ndjson

# Step 2: run vLLM offline batch
vllm run-batch \
  -i results/2026-03-30_22-29-15_r12_..._flash-offline.ndjson \
  -o output.ndjson \
  --model /path/to/model \
  -tp 4 \
  --max-model-len 16384 \
  --served-model-name your-model

# Step 3: import vLLM output and evaluate
python runner.py \
  --config configs/r12.yaml \
  --template templates/zeroshot-nosys.json \
  --sampler samplers/greedy-max.json \
  --model your-model \
  --mode read-batch \
  --batch output.ndjson
| Flag | Default | Description |
|---|---|---|
| --config | required | YAML experiment configuration (see config.md) |
| --template | required | Prompt template file (see templates-samplers.md) |
| --sampler | required | Generation parameters file (see templates-samplers.md) |
| --model | required | Model identifier |
| --mode | api | api, write-batch, or read-batch |
| --apibase | required (api mode) | API base URL |
| --apikey | | API key |
| --batch | | vLLM output file to import (read-batch mode only) |
| --precision | first defined | Precision level from config |
| --seed | | Random seed |
| --parallel | | Parallel completions (api mode) |
| --timeout | | Request timeout in seconds (api mode) |
| --cache | cache/ | Cache DB file or directory |
| --task | all | Filter to specific tasks (repeatable) |
| --output | results/ | Output base directory |
| --no-shuffle | | Run steps in order instead of randomized order |
| --quiet | | Suppress per-step output |
| --tokenizer | | HuggingFace tokenizer ID (required for calibrate tasks) |

Output: Timestamped folders under results/ containing NDJSON files with complete inference traces. Batch write-mode produces a single flat NDJSON instead of a folder.
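The batch input schema is whatever runner.py emits for your engine; as a rough illustration (field names below follow the OpenAI-style batch format and are an assumption, not ReasonScape's documented schema), each NDJSON line is one self-contained request object:

```python
import json

# Hypothetical shape of one batch-input line; inspect your generated file
# for the real fields runner.py writes.
line = json.dumps({
    "custom_id": "step-0001",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {"model": "your-model",
             "messages": [{"role": "user", "content": "2+2?"}]},
})

# NDJSON = one self-contained JSON object per line, so each line
# round-trips through json.loads independently.
record = json.loads(line)
print(record["custom_id"], record["url"])
```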


evaluate.py

Processes raw NDJSON results into the PointsDB DuckDB database.

# Process all evals in a dataset config
python evaluate.py --dataset data/r12.json --parallel 16

# Process specific NDJSON files directly
python evaluate.py --interview results/2025-*/arithmetic*.ndjson
| Flag | Default | Description |
|---|---|---|
| --dataset | | Dataset config JSON (processes all evals) |
| --interview | | Path, glob, or comma-separated list of NDJSON files |
| --parallel | 16 | Worker count for bucket processing |
| --offline | | Re-process answers (interview mode only) |
| --inplace | | Write results back to NDJSON (requires --offline) |
| --quiet | | Less output |

--dataset and --interview are mutually exclusive. Run this after every new result import; it is idempotent.


cohort.py

Postprocesses cohorts — lists, verifies, and creates context-limited variants.

python cohort.py list                              # All cohorts in data/
python cohort.py list data/r12/Qwen3-Next-80B     # Specific cohort
python cohort.py list --search qwen               # Regex search
python cohort.py verify                           # Verify all cohorts
python cohort.py verify data/r12/Qwen3.5-27B      # Specific cohort
python cohort.py context data/r12/ModelName \
  --eval-id a1b2c3 --context 12288               # Create 12k context variant

Subcommands:

| Subcommand | Description |
|---|---|
| list [path] | List evals with markdown table output. Accepts glob patterns. |
| verify [path] | Verify glob patterns in evals.json match expected scenarios. |
| `context <cohort>` | Create context-limited variant from an existing eval. Non-destructive (new timestamped folder). |

context key flags: --eval-id (required), --context (token limit, required).

After running context, rebuild the database: python evaluate.py --dataset data/<name>.json.


data.py

Content-addressed blob storage for sharing evaluation data.

python data.py pull dataset data/r12.json           # Pull DB only (~2GB)
python data.py pull dataset data/r12.json --full    # Pull everything (~20GB compressed)
python data.py pull cohort data/r12/GLM-4.5        # Pull one cohort (~150MB)
python data.py push cohort data/r12/NewModel-7B    # Publish new cohort
python data.py status                               # Show local/remote state
python data.py prune                                # Remove unreferenced blobs

Subcommands:

| Subcommand | Scope options | Description |
|---|---|---|
| pull | eval, cohort, dataset | Download and unpack blobs |
| push | eval, cohort, dataset | Pack, upload, update manifest |
| status | eval, cohort, dataset | Show local vs remote state |
| prune | | Remove blobs not referenced by manifest |

Blobs stored in data/.blobs/ (gitignored). Manifest at data/manifest.json (tracked).
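
Content addressing means a blob's storage path is derived from a hash of its bytes, so identical content dedupes automatically and the manifest can verify integrity. The hash choice and directory fan-out below are assumptions for illustration, not data.py's actual layout:

```python
import hashlib

def blob_path(content: bytes) -> str:
    # Name the blob after its own digest: same bytes -> same path.
    digest = hashlib.sha256(content).hexdigest()
    # Two-char fan-out subdirectory is a common convention (an assumption here).
    return f"data/.blobs/{digest[:2]}/{digest}"

p1 = blob_path(b"cohort archive bytes")
p2 = blob_path(b"cohort archive bytes")
assert p1 == p2  # identical content resolves to the identical address
```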


analyze.py

Unified analysis CLI. All subcommands share the same invocation pattern:

python analyze.py <subcommand> <dataset> [options]

Discovery

evals

List all evaluations in a dataset. Run this first to discover eval_id values.

python analyze.py evals data/r12.json
python analyze.py evals data/r12.json --search "phi-4"
python analyze.py evals data/r12.json --facets "size:small,arch:moe"
| Flag | Default | Description |
|---|---|---|
| --search | | Regex filter on model name (case-insensitive) |
| --facets | | Comma-separated key:value pairs (AND logic) |
| --format | table | table, markdown, json |

Output: eval_id, label, model, template, sampler, groups, hf_id, point counts.
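
The AND semantics of --facets can be sketched as a tiny matcher (function names are hypothetical; assumes group tags are key:value strings):

```python
def parse_facets(spec: str) -> dict:
    # "size:small,arch:moe" -> {"size": "small", "arch": "moe"}
    return dict(pair.split(":", 1) for pair in spec.split(","))

def matches(eval_groups: list, spec: str) -> bool:
    # AND logic: every requested key:value must appear in the eval's tags.
    wanted = parse_facets(spec)
    tags = dict(g.split(":", 1) for g in eval_groups)
    return all(tags.get(k) == v for k, v in wanted.items())

print(matches(["size:small", "arch:moe", "extra:tag"], "size:small,arch:moe"))
```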

tasks

List base tasks with their facet definitions and point counts.

python analyze.py tasks data/r12.json
python analyze.py tasks data/r12.json --search "arith"
| Flag | Default | Description |
|---|---|---|
| --search | | Regex filter on task name |

Output: task names, view definitions (surfaces and projections), point counts.

tokens

Token utilization statistics grouped by an arbitrary dimension.

python analyze.py tokens data/r12.json
python analyze.py tokens data/r12.json --group-by manifold.id
python analyze.py tokens data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --group-by facets.operations
| Flag | Default | Description |
|---|---|---|
| --group-by | base_task | Grouping dimension (any aggregate dimension) |
| --filters | | Standard filter JSON |
| --format | markdown | markdown, json |
| --output-dir | stdout | Directory for output file |

Output: flat table of prompt/completion/correct/incorrect token averages per (eval, group) pair.

points

List per-point evaluation data. Bridge between aggregated analysis and raw trace inspection.

python analyze.py points data/r12.json --output-dir points/
python analyze.py points data/r12.json \
  --filters '{"eval_id": ["a1b2c3"], "base_task": "arithmetic"}' \
  --sort max_number --sort length \
  --output-dir points/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --split-by | base_task | Dimension for separate output files |
| --sort | | Sort by param column, without the p. prefix (repeatable) |
| --format | md | md, json |
| --output-dir | . | Output directory |

Output: one file per split value with point IDs, params, adjusted center/margin, and raw counters (correct/incorrect/invalid/truncated). Point IDs are inputs to probe.py.

modelinfo

Download model metadata from Hugging Face hub.

python analyze.py modelinfo data/r12.json --output-dir metadata/
python analyze.py modelinfo --hf-id meta-llama/Llama-3.1-70B --output-dir metadata/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --hf-id | | Specific HF model ID (skips dataset loading) |
| --output-dir | required | Cache directory |

Downloads README, *config.json, recipe.yaml per model. Generates MODELINFO.md summary.


Position

scores

Absolute ranking by ReasonScore (distance-from-ideal metric).

python analyze.py scores data/r12.json
python analyze.py scores data/r12.json --format png --output-dir results/
python analyze.py scores data/r12.json --sort ratio --top 10
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | base_task | Aggregation dimension for ReasonScore computation |
| --sort | score | score (ReasonScore) or ratio (score/token) |
| --top | | Limit to top N by rank |
| --format | markdown | markdown, json, png |
| --width | 1400 | PNG width in pixels |
| --output-dir | . | Output directory |

cluster

Statistical clustering via confidence interval overlap. Models that are statistically indistinguishable are grouped together.

python analyze.py cluster data/r12.json
python analyze.py cluster data/r12.json --facet-by base_task --split-by eval_id
python analyze.py cluster data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --format png --output-dir results/
python analyze.py cluster data/r12.json --facet-by none   # collapse all tasks into one ranking
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | eval_id | Series dimension: things being compared within each panel |
| --facet-by | base_task | Facet dimension: sections within each output file |
| --split-by | | Split dimension: one output file per unique value |
| --mode | C_P | Statistical mode: E_I, E_P, E_O, C_I, C_P, C_O |
| --format | markdown | markdown, json, png |
| --width | 1200 | PNG width in pixels |
| --output-dir | . | Output directory |
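
The grouping idea can be sketched as follows (a simplified leader-based overlap rule; ReasonScape's statistical modes such as C_P differ in detail):

```python
def cluster_by_overlap(items):
    # items: (name, ci_low, ci_high) tuples, sorted by descending point estimate.
    clusters = []
    for name, low, high in items:
        if clusters and high >= clusters[-1]["low"]:
            # CI reaches the cluster leader's lower bound -> indistinguishable.
            clusters[-1]["names"].append(name)
        else:
            # No overlap -> statistically distinct, start a new cluster.
            clusters.append({"names": [name], "low": low})
    return [c["names"] for c in clusters]

print(cluster_by_overlap([("A", 0.80, 0.90), ("B", 0.78, 0.86), ("C", 0.50, 0.60)]))
```

Here A and B land in one cluster because B's interval reaches A's lower bound, while C is clearly separated.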

rank

Rank aggregation across facets. Clusters models per facet, then scores each model by its best cluster position in each facet, summing penalties into an overall leaderboard. Missing data in any facet invalidates that model's row.

python analyze.py rank data/r12.json
python analyze.py rank data/r12.json --facet-by base_task --filters '{"groups": [["arch:moe"]]}'
python analyze.py rank data/tables-16k.json --facet-by params.operation
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | eval_id | Series dimension: things being ranked |
| --facet-by | base_task | Facet dimension: what is ranked across (becomes columns) |
| --mode | C_P | Statistical mode: E_I, E_P, E_O, C_I, C_P, C_O |
| --format | markdown | markdown, json |
| --output-dir | . | Output directory |
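
A minimal sketch of the penalty-sum idea, assuming cluster index 0 is the top cluster (the actual scoring in analyze.py rank may differ):

```python
def aggregate_ranks(per_facet_clusters):
    # per_facet_clusters: {facet: [[models in cluster 0], [cluster 1], ...]}
    all_models = {m for clusters in per_facet_clusters.values()
                  for c in clusters for m in c}
    totals = {}
    for model in all_models:
        penalty = 0
        for clusters in per_facet_clusters.values():
            pos = next((i for i, c in enumerate(clusters) if model in c), None)
            if pos is None:
                penalty = None  # missing data in any facet invalidates the row
                break
            penalty += pos      # penalty = best cluster position in this facet
        totals[model] = penalty
    return totals
```

Lower totals rank higher; a None total marks a model excluded for missing data.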

pairwise

Head-to-head win probabilities using Bradley-Terry probabilistic ranking.

python analyze.py pairwise data/r12.json
python analyze.py pairwise data/r12.json --sort bradley-terry --format png --output pairwise.png
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | base_task | Comparison dimension |
| --mode | C_P | C_P (pessimistic) or C_I (independence) |
| --sort | expected-wins | expected-wins or bradley-terry |
| --format | markdown | markdown, png |
| --width | 1200 | PNG width in pixels |
| --output | stdout/auto | Output file path |
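
For intuition, a minimal Bradley-Terry fit via the standard minorization-maximization update (illustrative only, not the estimator analyze.py implements):

```python
def bradley_terry(wins, iters=200):
    # wins[(i, j)] = number of times i beat j
    players = {p for pair in wins for p in pair}
    s = {p: 1.0 for p in players}  # strength parameters
    for _ in range(iters):
        new = {}
        for p in players:
            w = sum(n for (a, _b), n in wins.items() if a == p)  # total wins of p
            denom = sum(
                (wins.get((p, q), 0) + wins.get((q, p), 0)) / (s[p] + s[q])
                for q in players if q != p
            )
            new[p] = w / denom if denom else s[p]
        total = sum(new.values())  # renormalize for identifiability
        s = {p: v * len(players) / total for p, v in new.items()}
    return s
```

The implied head-to-head probability is s_i / (s_i + s_j): with an 8-2 record the fitted strengths give roughly an 80% win probability.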

Profile / Point

spiderweb

Radar or bar chart of per-task accuracy for one or more models. Static PNG output (distinct from the spiderweb.py webapp).

python analyze.py spiderweb data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --output-dir results/
python analyze.py spiderweb data/r12.json --format barpng
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | base_task | Slice dimension |
| --format | webpng | webpng (radar) or barpng (bar chart) |
| --width | 1000 | PNG width |
| --height | 700 | PNG height |
| --output-dir | . | Output directory (one file per eval) |

surface

3D accuracy surfaces across 2D parameter grids (e.g., depth × length). Grid layout: rows = views, cols = evals.

python analyze.py surface data/r12.json --output-dir results/
python analyze.py surface data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --split-by none
python analyze.py surface data/r12.json --view depth_length_single --output-dir results/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --view | all 2-dim views | View ID(s) to include (repeatable) |
| --split-by | base_task | Dimension for separate output files (none for single file) |
| --output-dir | . | Output directory |

compression

Information-theoretic failure analysis via compression correlation. Shows correct/incorrect/truncated answer patterns by entropy characteristics.

python analyze.py compression data/r12.json --output-dir compression/
python analyze.py compression data/r12.json \
  --filters '{"eval_id": ["a1b2c3"]}' \
  --split-by base_task --facet-by eval_id --group-by manifold.id \
  --output-dir compression/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --split-by | base_task | Separate output files |
| --facet-by | eval_id | Subplot rows within each file (none to disable) |
| --group-by | manifold.id | Colored series within each panel |
| --output-dir | . | Output directory |
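
As intuition for the entropy angle: repetitive failure modes compress dramatically better than varied text. A stand-in sketch using zlib compression ratio (not ReasonScape's actual metric):

```python
import hashlib
import zlib

def compression_ratio(text: str) -> float:
    # Smaller ratio = more compressible = lower entropy.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)

# A stuck, repeating trace compresses to almost nothing...
looping = "Let me re-check the sum. " * 200

# ...while high-entropy text barely compresses (deterministic hash chain
# used here as a stand-in for varied output).
h, chunks = b"seed", []
for _ in range(100):
    h = hashlib.sha256(h).digest()
    chunks.append(h.hex())
varied = "".join(chunks)

print(compression_ratio(looping) < compression_ratio(varied))
```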

hazard

Survival analysis showing when during token generation failures occur.

python analyze.py hazard data/r12.json --output-dir hazard/
python analyze.py hazard data/r12.json \
  --filters '{"eval_id": ["a1b2c3"]}' \
  --split-by base_task --split-by eval_id \
  --output-dir hazard/
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --split-by | base_task eval_id | Dimension(s) for separate files (repeatable) |
| --bucket-size | 256 | Token bucket size for time grid |
| --output-dir | . | Output directory |
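
The survival framing can be sketched as a discrete hazard: among traces still generating at the start of each token bucket, the fraction that fail within it (a simplification; the estimator in analyze.py hazard may differ):

```python
def hazard_curve(traces, bucket=256):
    # traces: (tokens_generated, failed) pairs; failed=False means finished OK.
    n_buckets = max(t for t, _ in traces) // bucket + 1
    curve = []
    for b in range(n_buckets):
        # "At risk": traces that were still generating when this bucket began.
        at_risk = sum(1 for t, _ in traces if t >= b * bucket)
        failed = sum(1 for t, f in traces
                     if f and b * bucket <= t < (b + 1) * bucket)
        curve.append(failed / at_risk if at_risk else 0.0)
    return curve
```

A rising curve means failures concentrate late in generation; an early spike points at prompt-level breakdown.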

aggregate

Raw aggregated statistics with optional multi-mode comparison. Lower-level than scores; useful for debugging or custom analysis.

python analyze.py aggregate data/r12.json --filters '{"eval_id": ["a1b2c3"]}'
python analyze.py aggregate data/r12.json --compare-modes --raw-counters --tokens
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --group-by | eval_id | Series dimension |
| --facet-by | base_task | Sections within file |
| --split-by | | One file per unique value |
| --mode | C_P | Statistical mode |
| --compare-modes | | Show all 6 estimators side by side |
| --raw-counters | | Include n_u, n_t, n_e, g columns |
| --tokens | | Include token average columns |
| --format | markdown | markdown, json |
| --output-dir | . | Output directory |

capacity

Effective capacity analysis: accuracy vs complexity frontier, or peak capacity vs token budget.

python analyze.py capacity data/r12.json --frontier --output-dir results/
python analyze.py capacity data/r12.json --peak --format png --output-dir results/
python analyze.py capacity data/r12.json --frontier manifold.id --facet-by facets.arch
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --frontier [GROUP_BY] | | Accuracy vs num_rows sigmoid plots |
| --peak [GROUP_BY] | | Effective capacity vs target_tokens plots |
| --facet-by | | Subplot columns dimension |
| --format | markdown | markdown, json, png |
| --width | 1200 | PNG width |
| --output-dir | . | Output directory |

--frontier and --peak are mutually exclusive.


probe.py

Deep inspection of raw evaluation traces. Requires point IDs discovered via analyze.py points.

python probe.py <subcommand> <dataset> [options]

fft

Frequency domain analysis of prompt token sequences. Computed on-the-fly from raw NDJSON files (no pre-computed DB columns needed). Uses 1-dim views from the dataset config.

python probe.py fft data/r12.json --output-dir research/fft/
python probe.py fft data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --split-by none
| Flag | Default | Description |
|---|---|---|
| --filters | | Standard filter JSON |
| --view | all 1-dim views | View ID(s) to include (repeatable) |
| --split-by | base_task | Separate output files (none for single file) |
| --output-dir | . | Output directory |

Grid layout: rows = view entries, cols = evals. Color axis = the view's single group_by dimension.

failure

Analyze failure patterns for a specific point: raw answer text and genresult facet breakdown.

python probe.py failure data/r12.json --point-id 1234
python probe.py failure data/r12.json --point-id 1234 --point-id 5678 --limit 5 --full
| Flag | Default | Description |
|---|---|---|
| --point-id | required | Point ID to analyze (repeatable to aggregate) |
| --limit | all | Show only first N failure examples |
| --full | | Show full answer text without truncation |

Output: raw answer text per failure with extracted/reference answers; facet analysis on genresult metadata.

truncation

Segment-based loop detection in truncated model outputs.

python probe.py truncation data/r12.json --point-id 5
| Flag | Default | Description |
|---|---|---|
| --point-id | required | Point ID to analyze |

Loop classifications:

| Type | Pattern | Interpretation |
|---|---|---|
| 1-LOOP-STATIC | Fixed-length segments (Δtok ≈ 0) | Model stuck, repeating verbatim |
| 1-LOOP-GROWING | Growing segments (Δtok > 0) | Counter loops: appears to progress but isn't |
| 1-LOOP-SHRINKING | Shrinking segments (Δtok < 0) | Attempting to escape a verbose state |
| 2-LOOP-STATIC | ABAB alternation | Two-state oscillation (check-act cycles) |
| 3-LOOP-STATIC | ABCABC pattern | Three-phase reasoning template loop |
| DEGENERATE | Character-level repetition | Sampler collapse |

Loop coverage > 90% = severe failure. 1-LOOP-GROWING at high coverage is the most dangerous: the model appears to be making progress while stuck.
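
A toy version of the segment-based idea, checking verbatim repetition and segment-length drift (classifier names reused for illustration; probe.py's real detector is richer):

```python
def classify_tail(segments):
    # segments: the trailing output split into segments (e.g. at newlines).
    if len(segments) < 3:
        return None
    if len(set(segments)) == 1:
        return "1-LOOP-STATIC"      # identical segments: stuck verbatim
    if segments == segments[:2] * (len(segments) // 2):
        return "2-LOOP-STATIC"      # ABAB alternation
    lengths = [len(s) for s in segments]
    deltas = [b - a for a, b in zip(lengths, lengths[1:])]
    if all(d > 0 for d in deltas):
        return "1-LOOP-GROWING"     # counter loop: segments keep growing
    if all(d < 0 for d in deltas):
        return "1-LOOP-SHRINKING"
    return None
```

Segment lengths here stand in for the Δtok measure in the table above.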

Discover point IDs first: python analyze.py points data/r12.json --filters '{"eval_id": ["a1b2c3"]}'