# ReasonScape Tool Reference

Quick reference for all ReasonScape tools. For what to run when, see workflow.md. For architectural context, see implementation.md.
## Filter Syntax

Most analyze.py subcommands accept `--filters` as a JSON string. Always use single quotes outside, double quotes inside:

```bash
--filters '{"eval_id": ["a1b2c3", "d4e5f6"]}'
--filters '{"base_task": "arithmetic"}'
--filters '{"eval_id": ["a1b2c3"], "base_task": "arithmetic"}'
```
| Key | Type | Description |
|---|---|---|
| `eval_id` | list[str] | 6-char hash(es) identifying model+template+sampler |
| `base_task` | str or list | Task name(s) |
| `groups` | list[list[str]] | Group tags, e.g. `[["arch:moe", "size:large"]]` (AND logic within an item, OR across items) |
Discover valid values with `analyze.py evals` and `analyze.py tasks` before filtering.
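The group-matching semantics above (AND within an item, OR across items) and the recommended quoting can be sketched in Python. `json.dumps` sidesteps shell-quoting mistakes when building `--filters`; `matches_groups` is an illustrative reimplementation of the described logic, not the actual filter code:

```python
import json

def matches_groups(eval_groups, groups_filter):
    """OR across filter items; AND within each item (assumed semantics)."""
    return any(all(tag in eval_groups for tag in item)
               for item in groups_filter)

# Building --filters with json.dumps avoids quoting mistakes.
filters = {"eval_id": ["a1b2c3"], "groups": [["arch:moe", "size:large"]]}
arg = "--filters '" + json.dumps(filters) + "'"
```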
## Webapps

All three webapps take a dataset config and open an interactive browser UI. They share common flags: `--port` (override the default), `--url-base-pathname` (for reverse-proxy deployments), and `--debug`.
### leaderboard.py

Interactive rankings with heatmap visualization.

```bash
python leaderboard.py data/r12.json
# Open http://localhost:8050
python leaderboard.py data/tables-16k.json --group-by manifold.target_tokens --port 8060
```
| Flag | Default | Description |
|---|---|---|
| `--group-by` | `base_task` | Aggregation dimension. Fixed at startup (not interactive). Use `manifold.target_tokens` or `facets.operations` for single-task datasets. |
| `--port` | 8050 | Web server port |
| `--url-base-pathname` | `/` | URL base path (reverse-proxy deployments) |
Features: ReasonScore rankings, token efficiency, heatmap cells with truncation indicators, group/manifold filtering, pagination.
### explorer.py

Interactive 3D surface visualization of reasoning landscapes.

```bash
python explorer.py data/r12.json
# Open http://localhost:8051
```
| Flag | Default | Description |
|---|---|---|
| `--port` | 8051 | Web server port |
| `--url-base-pathname` | `/` | URL base path (reverse-proxy deployments) |
Features: 3D accuracy surfaces, multi-panel synchronized analysis (FFT, accuracy, histograms), cross-model comparison, point inspection.
### spiderweb.py

Per-model radar chart web UI (distinct from `analyze.py spiderweb`, which outputs static PNG).

```bash
python spiderweb.py data/r12.json
# Open http://localhost:8051

# Single-task datasets need an alternative group-by axis
python spiderweb.py data/tables-16k.json --group-by manifold.target_tokens
python spiderweb.py data/tables-16k.json --group-by params.operation
```
| Flag | Default | Description |
|---|---|---|
| `--group-by` | `base_task` | Radar slice dimension. Required for single-task datasets: use `manifold.target_tokens` or `params.operation` instead of `base_task` to get a meaningful web. |
| `--port` | 8051 | Web server port |
| `--url-base-pathname` | `/` | URL base path (reverse-proxy deployments) |
Features: Interactive radar/bar toggle, cognitive archetype identification, token efficiency overlay.
## Pipeline CLI Tools

### runner.py

Generates prompts and executes an experiment configuration against a concrete LLM.

API mode (default) streams requests to a live OpenAI-compatible endpoint:

```bash
python runner.py \
    --config configs/r12.yaml \
    --template templates/zeroshot-nosys.json \
    --sampler samplers/greedy-max.json \
    --model your-model \
    --apibase http://localhost:3333
```
Batch mode is for offline inference engines (e.g., vLLM offline batch) and is a three-step process:

```bash
# Step 1: generate batch input NDJSON
python runner.py \
    --config configs/r12.yaml \
    --template templates/zeroshot-nosys.json \
    --sampler samplers/greedy-max.json \
    --model your-model \
    --mode write-batch
# Produces: results/2026-03-30_22-29-15_r12_..._flash-offline.ndjson

# Step 2: run vLLM offline batch
vllm run-batch \
    -i results/2026-03-30_22-29-15_r12_..._flash-offline.ndjson \
    -o output.ndjson \
    --model /path/to/model \
    -tp 4 \
    --max-model-len 16384 \
    --served-model-name your-model

# Step 3: import vLLM output and evaluate
python runner.py \
    --config configs/r12.yaml \
    --template templates/zeroshot-nosys.json \
    --sampler samplers/greedy-max.json \
    --model your-model \
    --mode read-batch \
    --batch output.ndjson
```
| Flag | Default | Description |
|---|---|---|
| `--config` | required | YAML experiment configuration (see config.md) |
| `--template` | required | Prompt template file (see templates-samplers.md) |
| `--sampler` | required | Generation parameters file (see templates-samplers.md) |
| `--model` | required | Model identifier |
| `--mode` | `api` | `api`, `write-batch`, or `read-batch` |
| `--apibase` | required (api mode) | API base URL |
| `--apikey` | — | API key |
| `--batch` | — | vLLM output file to import (read-batch mode only) |
| `--precision` | first defined | Precision level from config |
| `--seed` | — | Random seed |
| `--parallel` | — | Parallel completions (api mode) |
| `--timeout` | — | Request timeout in seconds (api mode) |
| `--cache` | `cache/` | Cache DB file or directory |
| `--task` | all | Filter to specific tasks (repeatable) |
| `--output` | `results/` | Output base directory |
| `--no-shuffle` | — | Run steps in order instead of randomized |
| `--quiet` | — | Suppress per-step output |
| `--tokenizer` | — | HuggingFace tokenizer ID (required for calibrate tasks) |
Output: timestamped folders under results/ containing NDJSON files with complete inference traces. In write-batch mode, a single flat NDJSON file is produced instead of a folder.
### evaluate.py

Processes raw NDJSON results into the PointsDB DuckDB database.

```bash
# Process all evals in a dataset config
python evaluate.py --dataset data/r12.json --parallel 16

# Process specific NDJSON files directly
python evaluate.py --interview results/2025-*/arithmetic*.ndjson
```
| Flag | Default | Description |
|---|---|---|
| `--dataset` | — | Dataset config JSON (processes all evals) |
| `--interview` | — | Path, glob, or comma-separated list of NDJSON files |
| `--parallel` | 16 | Worker count for bucket processing |
| `--offline` | — | Re-process answers (interview mode only) |
| `--inplace` | — | Write results back to NDJSON (requires `--offline`) |
| `--quiet` | — | Less output |
`--dataset` and `--interview` are mutually exclusive. Run this after every new result import; it is idempotent.
### cohort.py

Postprocesses cohorts: lists, verifies, and creates context-limited variants.

```bash
python cohort.py list                            # All cohorts in data/
python cohort.py list data/r12/Qwen3-Next-80B    # Specific cohort
python cohort.py list --search qwen              # Regex search
python cohort.py verify                          # Verify all cohorts
python cohort.py verify data/r12/Qwen3.5-27B     # Specific cohort
python cohort.py context data/r12/ModelName \
    --eval-id a1b2c3 --context 12288             # Create 12k context variant
```
Subcommands:

| Subcommand | Description |
|---|---|
| `list [path]` | List evals with markdown table output. Accepts glob patterns. |
| `verify [path]` | Verify glob patterns in evals.json match expected scenarios. |
| `context <cohort>` | Create a context-limited variant from an existing eval. Non-destructive (new timestamped folder). |
`context` key flags: `--eval-id` (required), `--context` (token limit, required).

After running `context`, rebuild the database: `python evaluate.py --dataset data/<name>.json`.
### data.py

Content-addressed blob storage for sharing evaluation data.

```bash
python data.py pull dataset data/r12.json          # Pull DB only (~2GB)
python data.py pull dataset data/r12.json --full   # Pull everything (~20GB compressed)
python data.py pull cohort data/r12/GLM-4.5        # Pull one cohort (~150MB)
python data.py push cohort data/r12/NewModel-7B    # Publish new cohort
python data.py status                              # Show local/remote state
python data.py prune                               # Remove unreferenced blobs
```
Subcommands:

| Subcommand | Scope options | Description |
|---|---|---|
| `pull` | eval, cohort, dataset | Download and unpack blobs |
| `push` | eval, cohort, dataset | Pack, upload, update manifest |
| `status` | eval, cohort, dataset | Show local vs remote state |
| `prune` | — | Remove blobs not referenced by the manifest |
Blobs are stored in data/.blobs/ (gitignored). The manifest lives at data/manifest.json (tracked).
### analyze.py

Unified analysis CLI. All subcommands share the same invocation pattern:

```bash
python analyze.py <subcommand> <dataset> [options]
```
#### Discovery

##### evals

List all evaluations in a dataset. Run this first to discover eval_id values.

```bash
python analyze.py evals data/r12.json
python analyze.py evals data/r12.json --search "phi-4"
python analyze.py evals data/r12.json --facets "size:small,arch:moe"
```
| Flag | Default | Description |
|---|---|---|
| `--search` | — | Regex filter on model name (case-insensitive) |
| `--facets` | — | Comma-separated key:value pairs (AND logic) |
| `--format` | `table` | `table`, `markdown`, `json` |
Output: eval_id, label, model, template, sampler, groups, hf_id, point counts.
##### tasks

List base tasks with their facet definitions and point counts.

```bash
python analyze.py tasks data/r12.json
python analyze.py tasks data/r12.json --search "arith"
```
| Flag | Default | Description |
|---|---|---|
| `--search` | — | Regex filter on task name |
Output: task names, view definitions (surfaces and projections), point counts.
##### tokens

Token utilization statistics grouped by an arbitrary dimension.

```bash
python analyze.py tokens data/r12.json
python analyze.py tokens data/r12.json --group-by manifold.id
python analyze.py tokens data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --group-by facets.operations
```
| Flag | Default | Description |
|---|---|---|
| `--group-by` | `base_task` | Grouping dimension (any aggregate dimension) |
| `--filters` | — | Standard filter JSON |
| `--format` | `markdown` | `markdown`, `json` |
| `--output-dir` | stdout | Directory for output file |
Output: flat table of prompt/completion/correct/incorrect token averages per (eval, group) pair.
##### points

List per-point evaluation data. Bridge between aggregated analysis and raw trace inspection.

```bash
python analyze.py points data/r12.json --output-dir points/
python analyze.py points data/r12.json \
    --filters '{"eval_id": ["a1b2c3"], "base_task": "arithmetic"}' \
    --sort max_number --sort length \
    --output-dir points/
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--split-by` | `base_task` | Dimension for separate output files |
| `--sort` | — | Sort by param column, without the p. prefix (repeatable) |
| `--format` | `md` | `md`, `json` |
| `--output-dir` | `.` | Output directory |
Output: one file per split value with point IDs, params, adjusted center/margin, and raw counters (correct/incorrect/invalid/truncated). Point IDs are inputs to probe.py.
##### modelinfo

Download model metadata from the Hugging Face hub.

```bash
python analyze.py modelinfo data/r12.json --output-dir metadata/
python analyze.py modelinfo --hf-id meta-llama/Llama-3.1-70B --output-dir metadata/
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--hf-id` | — | Specific HF model ID (skips dataset loading) |
| `--output-dir` | required | Cache directory |
Downloads README, *config.json, recipe.yaml per model. Generates MODELINFO.md summary.
#### Position

##### scores

Absolute ranking by ReasonScore (a distance-from-ideal metric).

```bash
python analyze.py scores data/r12.json
python analyze.py scores data/r12.json --format png --output-dir results/
python analyze.py scores data/r12.json --sort ratio --top 10
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--group-by` | `base_task` | Aggregation dimension for ReasonScore computation |
| `--sort` | `score` | `score` (ReasonScore) or `ratio` (score/token) |
| `--top` | — | Limit to top N by rank |
| `--format` | `markdown` | `markdown`, `json`, `png` |
| `--width` | 1400 | PNG width in pixels |
| `--output-dir` | `.` | Output directory |
##### cluster

Statistical clustering via confidence-interval overlap. Models that are statistically indistinguishable are grouped together.

```bash
python analyze.py cluster data/r12.json
python analyze.py cluster data/r12.json --facet-by base_task --split-by eval_id
python analyze.py cluster data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --format png --output-dir results/
python analyze.py cluster data/r12.json --facet-by none   # collapse all tasks into one ranking
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--group-by` | `eval_id` | Series dimension: the things being compared within each panel |
| `--facet-by` | `base_task` | Facet dimension: sections within each output file |
| `--split-by` | — | Split dimension: one output file per unique value |
| `--mode` | `C_P` | Statistical mode: E_I, E_P, E_O, C_I, C_P, C_O |
| `--format` | `markdown` | `markdown`, `json`, `png` |
| `--width` | 1200 | PNG width in pixels |
| `--output-dir` | `.` | Output directory |
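The clustering idea can be pictured as: sort models by point estimate, then start a new cluster whenever a model's confidence interval stops overlapping the current cluster leader's. This greedy chaining is an illustration of the concept, not necessarily the exact rule analyze.py uses:

```python
def ci_clusters(models):
    """models: (name, lo, hi) confidence intervals. Chain models into a
    cluster while their CI overlaps the cluster leader's CI."""
    ordered = sorted(models, key=lambda m: -(m[1] + m[2]) / 2)
    clusters, leader_lo = [], None
    for name, lo, hi in ordered:
        if clusters and hi >= leader_lo:   # indistinguishable from leader
            clusters[-1].append(name)
        else:
            clusters.append([name])        # new cluster, this model leads
            leader_lo = lo
    return clusters
```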
##### rank

Rank aggregation across facets: clusters models per facet, then scores each model by its best cluster position in each facet, summing the penalties into an overall leaderboard. Missing data in any facet invalidates that model's row.

```bash
python analyze.py rank data/r12.json
python analyze.py rank data/r12.json --facet-by base_task --filters '{"groups": [["arch:moe"]]}'
python analyze.py rank data/tables-16k.json --facet-by params.operation
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--group-by` | `eval_id` | Series dimension: the things being ranked |
| `--facet-by` | `base_task` | Facet dimension: what is ranked across (becomes columns) |
| `--mode` | `C_P` | Statistical mode: E_I, E_P, E_O, C_I, C_P, C_O |
| `--format` | `markdown` | `markdown`, `json` |
| `--output-dir` | `.` | Output directory |
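The described aggregation can be sketched as: take each model's cluster position per facet, sum those positions as penalties, and drop any model missing a facet. A toy illustration of these assumed semantics, not the actual implementation:

```python
def rank_aggregate(cluster_pos):
    """cluster_pos: {model: {facet: cluster_index}}, 0 = top cluster.
    Sum cluster positions as penalties; models missing any facet are
    invalidated; lower total ranks higher."""
    facets = set()
    for positions in cluster_pos.values():
        facets.update(positions)
    rows = sorted((sum(positions.values()), model)
                  for model, positions in cluster_pos.items()
                  if facets <= positions.keys())   # require complete coverage
    return [model for _, model in rows]
```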
##### pairwise

Head-to-head win probabilities using Bradley-Terry probabilistic ranking.

```bash
python analyze.py pairwise data/r12.json
python analyze.py pairwise data/r12.json --sort bradley-terry --format png --output pairwise.png
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--group-by` | `base_task` | Comparison dimension |
| `--mode` | `C_P` | `C_P` (pessimistic) or `C_I` (independence) |
| `--sort` | `expected-wins` | `expected-wins` or `bradley-terry` |
| `--format` | `markdown` | `markdown`, `png` |
| `--width` | 1200 | PNG width in pixels |
| `--output` | stdout/auto | Output file path |
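Bradley-Terry ranking fits a latent strength per model such that P(i beats j) = p_i / (p_i + p_j). A minimal sketch using the standard minorization-maximization updates (illustrative only; not analyze.py's code):

```python
def bradley_terry(wins, iters=200):
    """wins[i][j]: times model i beat model j. Returns a strength per
    model via minorization-maximization updates, normalized to sum n."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            new.append(total_wins / denom if denom else p[i])
        scale = n / sum(new)
        p = [x * scale for x in new]
    return p
```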
#### Profile / Point

##### spiderweb

Radar or bar chart of per-task accuracy for one or more models. Static PNG output (distinct from the spiderweb.py webapp).

```bash
python analyze.py spiderweb data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --output-dir results/
python analyze.py spiderweb data/r12.json --format barpng
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--group-by` | `base_task` | Slice dimension |
| `--format` | `webpng` | `webpng` (radar) or `barpng` (bar chart) |
| `--width` | 1000 | PNG width |
| `--height` | 700 | PNG height |
| `--output-dir` | `.` | Output directory (one file per eval) |
##### surface

3D accuracy surfaces across 2D parameter grids (e.g., depth × length). Grid layout: rows = views, cols = evals.

```bash
python analyze.py surface data/r12.json --output-dir results/
python analyze.py surface data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --split-by none
python analyze.py surface data/r12.json --view depth_length_single --output-dir results/
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--view` | all 2-dim views | View ID(s) to include (repeatable) |
| `--split-by` | `base_task` | Dimension for separate output files (`none` for a single file) |
| `--output-dir` | `.` | Output directory |
##### compression

Information-theoretic failure analysis via compression correlation. Shows correct/incorrect/truncated answer patterns by entropy characteristics.

```bash
python analyze.py compression data/r12.json --output-dir compression/
python analyze.py compression data/r12.json \
    --filters '{"eval_id": ["a1b2c3"]}' \
    --split-by base_task --facet-by eval_id --group-by manifold.id \
    --output-dir compression/
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--split-by` | `base_task` | Separate output files |
| `--facet-by` | `eval_id` | Subplot rows within each file (`none` to disable) |
| `--group-by` | `manifold.id` | Colored series within each panel |
| `--output-dir` | `.` | Output directory |
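Compression ratio is a cheap proxy for entropy: repetitive, loop-like answers compress far better than varied ones. A sketch of the underlying signal (zlib here is illustrative; the tool's actual metric is not specified in this reference):

```python
import zlib

def compression_ratio(text):
    """Compressed size over raw size; lower means more repetitive."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

looping = "Let me check that again. " * 50   # loop-like truncated output
varied = "The sum of 17 and 45 is 62, so the final answer is 62."
```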
##### hazard

Survival analysis showing when during token generation failures occur.

```bash
python analyze.py hazard data/r12.json --output-dir hazard/
python analyze.py hazard data/r12.json \
    --filters '{"eval_id": ["a1b2c3"]}' \
    --split-by base_task --split-by eval_id \
    --output-dir hazard/
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--split-by` | `base_task eval_id` | Dimension(s) for separate files (repeatable) |
| `--bucket-size` | 256 | Token bucket size for the time grid |
| `--output-dir` | `.` | Output directory |
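A discrete hazard curve asks: of the runs still alive entering each token bucket, what fraction fail inside it? A sketch of that computation (assumed semantics; the tool's exact estimator may differ):

```python
def hazard_curve(failure_tokens, survivors, bucket_size=256):
    """Discrete hazard per token bucket: fraction of still-alive runs
    that fail inside each bucket. failure_tokens holds the token
    position of each failure; survivors counts runs that never failed."""
    n_buckets = max(failure_tokens) // bucket_size + 1
    fails = [0] * n_buckets
    for t in failure_tokens:
        fails[t // bucket_size] += 1
    at_risk = len(failure_tokens) + survivors
    curve = []
    for f in fails:
        curve.append(f / at_risk if at_risk else 0.0)
        at_risk -= f
    return curve
```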
##### aggregate

Raw aggregated statistics with optional multi-mode comparison. Lower-level than scores; useful for debugging or custom analysis.

```bash
python analyze.py aggregate data/r12.json --filters '{"eval_id": ["a1b2c3"]}'
python analyze.py aggregate data/r12.json --compare-modes --raw-counters --tokens
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--group-by` | `eval_id` | Series dimension |
| `--facet-by` | `base_task` | Sections within each file |
| `--split-by` | — | One file per unique value |
| `--mode` | `C_P` | Statistical mode |
| `--compare-modes` | — | Show all 6 estimators side by side |
| `--raw-counters` | — | Include n_u, n_t, n_e, g columns |
| `--tokens` | — | Include token average columns |
| `--format` | `markdown` | `markdown`, `json` |
| `--output-dir` | `.` | Output directory |
##### capacity

Effective capacity analysis: accuracy vs complexity frontier, or peak capacity vs token budget.

```bash
python analyze.py capacity data/r12.json --frontier --output-dir results/
python analyze.py capacity data/r12.json --peak --format png --output-dir results/
python analyze.py capacity data/r12.json --frontier manifold.id --facet-by facets.arch
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--frontier [GROUP_BY]` | — | Accuracy vs num_rows sigmoid plots |
| `--peak [GROUP_BY]` | — | Effective capacity vs target_tokens plots |
| `--facet-by` | — | Subplot columns dimension |
| `--format` | `markdown` | `markdown`, `json`, `png` |
| `--width` | 1200 | PNG width |
| `--output-dir` | `.` | Output directory |

`--frontier` and `--peak` are mutually exclusive.
### probe.py

Deep inspection of raw evaluation traces. Requires point IDs discovered via `analyze.py points`.

```bash
python probe.py <subcommand> <dataset> [options]
```

#### fft

Frequency-domain analysis of prompt token sequences. Computed on the fly from raw NDJSON files (no pre-computed DB columns needed). Uses 1-dim views from the dataset config.

```bash
python probe.py fft data/r12.json --output-dir research/fft/
python probe.py fft data/r12.json --filters '{"eval_id": ["a1b2c3"]}' --split-by none
```
| Flag | Default | Description |
|---|---|---|
| `--filters` | — | Standard filter JSON |
| `--view` | all 1-dim views | View ID(s) to include (repeatable) |
| `--split-by` | `base_task` | Separate output files (`none` for a single file) |
| `--output-dir` | `.` | Output directory |
Grid layout: rows = view entries, cols = evals. Color axis = the view's single group_by dimension.
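The core of this kind of analysis is a power spectrum over token-ID sequences: periodic structure in a prompt shows up as energy at the corresponding frequency. A toy illustration with a naive DFT (real use would reach for numpy.fft; this is not probe.py's implementation):

```python
import cmath

def power_spectrum(seq):
    """Naive DFT power spectrum of a mean-removed numeric sequence."""
    n = len(seq)
    mean = sum(seq) / n
    x = [v - mean for v in seq]
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2 + 1)]

# A strictly alternating token-ID pattern puts all energy at Nyquist.
spec = power_spectrum([5, 9] * 8)
```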
#### failure

Analyze failure patterns for a specific point: raw answer text and genresult facet breakdown.

```bash
python probe.py failure data/r12.json --point-id 1234
python probe.py failure data/r12.json --point-id 1234 --point-id 5678 --limit 5 --full
```
| Flag | Default | Description |
|---|---|---|
| `--point-id` | required | Point ID to analyze (repeatable to aggregate) |
| `--limit` | all | Show only the first N failure examples |
| `--full` | — | Show full answer text without truncation |
Output: raw answer text per failure with extracted/reference answers; facet analysis on genresult metadata.
#### truncation

Segment-based loop detection in truncated model outputs.

```bash
python probe.py truncation data/r12.json --point-id 5
```

| Flag | Default | Description |
|---|---|---|
| `--point-id` | required | Point ID to analyze |
Loop classifications:

| Type | Pattern | Interpretation |
|---|---|---|
| 1-LOOP-STATIC | Fixed-length segments (Δtok ≈ 0) | Model stuck, repeating verbatim |
| 1-LOOP-GROWING | Growing segments (Δtok > 0) | Counter loops: appears to progress, isn't |
| 1-LOOP-SHRINKING | Shrinking segments (Δtok < 0) | Attempting to escape a verbose state |
| 2-LOOP-STATIC | ABAB alternation | Two-state oscillation (check-act cycles) |
| 3-LOOP-STATIC | ABCABC pattern | Three-phase reasoning template loop |
| DEGENERATE | Character-level repetition | Sampler collapse |
Loop coverage above 90% indicates severe failure. 1-LOOP-GROWING at high coverage is the most dangerous: the model appears to be making progress while stuck.
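The static classifications above can be approximated at the token level: scan backwards for a verbatim repeating block of period 1, 2, or 3. A simplified sketch (probe.py works on segments, not raw tokens, so this is only an approximation of the idea):

```python
def tail_loop_period(tokens, max_period=3, min_repeats=3):
    """Smallest period (1..max_period) whose block repeats verbatim at
    least min_repeats times at the tail of the sequence, else None.
    Period 1 ~ 1-LOOP-STATIC, 2 ~ 2-LOOP-STATIC, 3 ~ 3-LOOP-STATIC."""
    for period in range(1, max_period + 1):
        block = tokens[-period:]
        repeats, i = 0, len(tokens)
        while i >= period and tokens[i - period:i] == block:
            repeats += 1
            i -= period
        if repeats >= min_repeats:
            return period
    return None
```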
Discover point IDs first: `python analyze.py points data/r12.json --filters '{"eval_id": ["a1b2c3"]}'`