# ReasonScape Tools
ReasonScape provides a complete end-to-end evaluation platform organized around five evaluation stages: Definition, Execution, Evaluation, Exploration, and Research.
This document provides tool references organized by stage. For the five-stage philosophy, the architectural vision, and how these tools interconnect within the unified research platform, see the Architecture Overview.

See Also:

- Architecture Overview - Five-stage philosophy
- Methodology for the underlying concepts and statistical approaches
- Configuration Guide for experiment setup and parameter definitions
- Task Documentation for detailed task generator specifications
## Repository Structure
```text
resolver.py    # Manifold analysis and configuration validation
runner.py      # Test execution engine
evaluate.py    # Unified evaluation (dataset and interview modes)
leaderboard.py # Interactive ranking webapp
spiderweb.py   # Per-model diagnostic webapp
explorer.py    # 3D manifold visualization webapp
analyze.py     # Unified analysis interface with 10 subcommands
configs/       # Experiment configuration files (.yaml)
data/          # Evaluation suites and dataset configurations
tasks/         # Parametric test generators with JSON schemas
templates/     # Chat template definitions (.json)
samplers/      # Generation parameter sets (.json)
results/       # Raw experiment outputs (timestamped folders)
```
## Tool Overview by Stage
### The Five-Stage Pipeline
```mermaid
graph TB
    subgraph "Stage 1: Definition"
        A[resolver.py<br/>tasks/*.py<br/>configs/*.yaml]
    end
    subgraph "Stage 2: Execution"
        B[runner.py<br/>templates/*.json<br/>samplers/*.json]
    end
    subgraph "Stage 3: Evaluation"
        C[evaluate.py]
    end
    subgraph "Stage 4: Exploration"
        D[leaderboard.py<br/>spiderweb.py<br/>explorer.py]
    end
    subgraph "Stage 5: Research"
        E[analyze.py<br/>10 subcommands]
    end
    A --> B --> C --> D
    C --> E
    E -.research loop.-> A
    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#fce4ec
```
## Stage 1: Definition Tools
### resolver.py - Manifold Analysis Utility
Analyzes experiment configurations to predict computational costs and validate manifold definitions before execution.
Core functionality:
- Transforms abstract manifold definitions into concrete parameter grids
- Calculates total difficulty points across degrees and density settings
- Validates configuration correctness before expensive evaluations
- Provides detailed scaling analysis for resource planning
Basic usage:
```bash
python resolver.py configs/m12x.yaml 1
```
Output: Detailed analysis tables showing difficulty points, parameter grids, and computational cost estimates.
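As a mental model for that grid expansion, here is a minimal Python sketch; the axis names (`length`, `depth`) and step values are hypothetical, not taken from any real config:

```python
from itertools import product

# Hypothetical manifold axes; real axes and steps come from configs/*.yaml.
manifold = {
    "length": [4, 8, 16],  # hypothetical axis: problem size steps
    "depth":  [1, 2, 3],   # hypothetical axis: nesting depth steps
}

# Expand the abstract axis definitions into the concrete parameter grid.
axes = list(manifold)
grid = [dict(zip(axes, point)) for point in product(*manifold.values())]

print(f"{len(grid)} difficulty points")  # 9 points for this toy manifold
print(grid[0])                           # {'length': 4, 'depth': 1}
```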
See Also: Configuration Guide for manifold definitions and Methodology for the underlying concepts.
## Stage 2: Execution Tools

### runner.py - Test Execution Engine
Executes experiment configurations by applying templates and samplers to perform LLM inference.
Core functionality:
- Parametric test generation from task manifolds
- Progressive sampling with statistical confidence targeting
- Response caching to eliminate redundant API calls
- Parallel execution with configurable concurrency
- Deterministic seeding for reproducible results
Basic usage:
```bash
python runner.py \
  --config configs/m12x.yaml \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json \
  --model your-model \
  --apibase http://localhost:3333 \
  --degree 1 --precision medium --density normal
```
Output: Timestamped results/ folders containing NDJSON files with complete inference traces.
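Because NDJSON is newline-delimited JSON, traces can be inspected with a few lines of Python. This is a generic sketch: the directory layout and record fields are whatever your run produced, so the first step is simply to discover them:

```python
import json
from pathlib import Path

# Locate a runner output file; assumes at least one completed run exists.
trace_file = next(Path("results").rglob("*.ndjson"))

with trace_file.open() as f:
    record = json.loads(next(f))  # one complete inference trace per line
    print(sorted(record))         # list the actual field names in the schema
```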
See Also: Runner Guide for detailed usage and Methodology for caching and seeding concepts.
## Stage 3: Evaluation Tools
### evaluate.py - Statistical Analysis Processor
Groups raw results into statistical buckets and performs rigorous analysis with bias corrections. Supports both the legacy JSON bucket output and the DuckDB v2 workflow.
Core functionality:
- Excess accuracy correction (removes guessing inflation)
- 95% confidence interval calculation with truncation awareness
- Token usage analysis and frequency spectra (FFT)
- DuckDB v2: a single database file with 5D de-duplication and native LIST types
- Parallel bucket processing for compression/FFT computation
- Context simulation for lower context limits
- Automatic tag updates (groups, surfaces, projections)
- Optional .zstd compression archives for raw data sharing
Usage:
```bash
# Dataset mode - processes all evals from config
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
```
Output: DuckDB database with de-duplicated points (v2) or JSON bucket files (legacy).
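To make the excess-accuracy idea concrete, here is a minimal sketch of the standard guessing correction paired with a normal-approximation 95% CI. It illustrates the concept only; evaluate.py's exact estimator (e.g. its truncation-aware handling) may differ:

```python
import math

def excess_accuracy(correct: int, total: int, chance: float) -> tuple[float, float, float]:
    """Chance-corrected accuracy with a normal-approximation 95% CI (a sketch)."""
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1 - p) / total)  # 95% normal interval

    def remove_guessing(x: float) -> float:
        # Subtract the chance baseline and renormalize; clamp at zero.
        return max(0.0, (x - chance) / (1 - chance))

    return (remove_guessing(p),
            remove_guessing(p - half_width),
            remove_guessing(p + half_width))

# 70/100 correct on 4-way multiple choice (25% chance baseline):
print(excess_accuracy(70, 100, 0.25))  # ≈ (0.60, 0.48, 0.72)
```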
See Also: Methodology for statistical concepts.
## Stage 4: Exploration Tools
### leaderboard.py - Interactive Rankings
Multi-domain LLM performance ranking system with heatmap visualization and pagination.
Key features:
- ReasonScore rankings across reasoning domains
- Token efficiency analysis (score/token ratios)
- Heatmap performance visualization with color-coded cells
- Truncation indicators rendered as increasing cell darkness
- Statistical confidence indicators
- Group and manifold filtering
- Pagination for large model sets
Usage:
```bash
python leaderboard.py dataset-buckets.json
# Open http://localhost:8050
```
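The score/token ratio behind the token-efficiency column can be pictured with made-up numbers:

```python
# Toy ranking by token efficiency (score per 1k output tokens); data is made up.
models = [
    {"name": "model-a", "score": 72.0, "mean_tokens": 1800},
    {"name": "model-b", "score": 68.0, "mean_tokens": 600},
]
for m in models:
    m["efficiency"] = m["score"] / (m["mean_tokens"] / 1000)  # score per 1k tokens
for m in sorted(models, key=lambda m: m["efficiency"], reverse=True):
    print(f'{m["name"]}: {m["efficiency"]:.1f} score/1k tokens')
```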
### explorer.py - 3D Manifold Analysis
Interactive platform for navigating multi-dimensional difficulty manifolds and examining cognitive architecture patterns.
Key features:
- 3D reasoning landscape visualization
- Interactive grid line selection for projection analysis
- Multi-panel synchronized analysis (FFT, accuracy, histograms)
- Cross-model cognitive pattern comparison
- Line and point projection modes
Usage:
```bash
python explorer.py dataset-buckets.json
# Open http://localhost:8051
```
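To picture what a difficulty manifold looks like as a surface, here is a self-contained sketch over synthetic data; explorer.py renders real bucketed results, this only illustrates the visualization idea:

```python
import numpy as np
import plotly.graph_objects as go

# Synthetic accuracy surface over two hypothetical difficulty axes.
length = np.linspace(1, 10, 20)
depth = np.linspace(1, 10, 20)
L, D = np.meshgrid(length, depth)
accuracy = 1.0 / (1.0 + np.exp(0.5 * (L + D - 12)))  # toy logistic falloff

fig = go.Figure(go.Surface(x=L, y=D, z=accuracy))
fig.update_layout(scene=dict(
    xaxis_title="length", yaxis_title="depth", zaxis_title="accuracy"))
fig.show()
```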
### spiderweb.py - Per-Model Diagnostics (V2)
Per-model task breakdown visualization with cognitive archetype identification.
Key features:
- Radar chart (web-png) for pattern recognition
- Bar chart (bar-png) for precise value reading
- Cognitive archetype identification (9 patterns)
- Token efficiency overlay
- JSON/markdown outputs for automation
Usage:
```bash
python spiderweb.py dataset-buckets.json --scenarios model-name --output spider.png

# Or via analyze.py:
python analyze.py spiderweb data/dataset-m12x.json --scenarios model-name --output spider.png
```
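For intuition about the radar rendering, here is a toy sketch with matplotlib; the task names and scores are made up, and spiderweb.py's actual styling and archetype logic are not reproduced:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up per-task scores for one model.
tasks = ["arithmetic", "boolean", "dates", "objects", "shuffle", "letters"]
scores = [0.82, 0.64, 0.71, 0.55, 0.90, 0.48]

angles = np.linspace(0, 2 * np.pi, len(tasks), endpoint=False).tolist()
angles += angles[:1]            # repeat the first angle to close the polygon
values = scores + scores[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(tasks)
ax.set_ylim(0, 1)
plt.savefig("spider-sketch.png")
```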
## Stage 5: Research Tools
### analyze.py - Unified Analysis Interface
Comprehensive multi-tier analysis platform with 10 subcommands for investigating model performance, statistical groupings, and failure mechanisms. DuckDB-backed for efficient queries.
Analysis Flow:
- `evals` - Evaluation discovery with tier availability
- `tasks` - Task structure discovery (surfaces/projections)
- `scores` - Aggregate leaderboard with fair-sort ranking
- `spiderweb` - Per-model task breakdown diagnostics
- `cluster` - Statistical grouping via CI overlap (see the sketch after this list)
- `surface` - 3D difficulty manifold visualization
- `fft` - Token-frequency spectral analysis
- `compression` - Information-theoretic entropy analysis
- `hazard` - Temporal failure risk (survival analysis)
- `modelinfo` - Model metadata collection from HF hub
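As a concrete illustration of the CI-overlap idea behind `cluster`, here is a minimal sketch that groups ranked models whenever their 95% intervals overlap the top of the current group; the actual algorithm in analyze.py may differ:

```python
def ci_overlap_groups(models):
    """Group models whose 95% CIs overlap (a sketch of the idea only)."""
    ranked = sorted(models, key=lambda m: m["score"], reverse=True)
    groups, current = [], [ranked[0]]
    for m in ranked[1:]:
        anchor = current[0]
        # Overlap test against the group's top model.
        if m["score"] + m["ci"] >= anchor["score"] - anchor["ci"]:
            current.append(m)
        else:
            groups.append(current)
            current = [m]
    groups.append(current)
    return groups

models = [  # made-up scores and CI half-widths
    {"name": "a", "score": 0.81, "ci": 0.03},
    {"name": "b", "score": 0.79, "ci": 0.04},
    {"name": "c", "score": 0.62, "ci": 0.05},
]
for i, group in enumerate(ci_overlap_groups(models), 1):
    print(i, [m["name"] for m in group])  # group 1: a, b; group 2: c
```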
Basic usage:
```bash
# Discovery: list evaluations and tasks
python analyze.py evals data/dataset-m12x.json
python analyze.py tasks data/dataset-m12x.json

# Rankings and diagnostics
python analyze.py scores data/dataset-m12x.json --format png
python analyze.py spiderweb data/dataset-m12x.json --format webpng

# Statistical grouping
python analyze.py cluster data/dataset-m12x.json --split base_task --format png

# Deep analysis
python analyze.py surface data/dataset-m12x.json --filters '{"base_task": "arithmetic"}'
python analyze.py fft data/dataset-m12x.json --filters '{"base_task": "arithmetic"}'
python analyze.py compression data/dataset-m12x.json --output-dir compression/
python analyze.py hazard data/dataset-m12x.json --output-dir hazard/

# Metadata collection
python analyze.py modelinfo data/dataset-m12x.json --output-dir metadata/
```
Filter System:
All commands support a `--filters` JSON parameter for flexible data selection by `eval_id`, `groups`, `base_task`, `degrees`, `surfaces`, `projections`, and more.
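When driving analyze.py from scripts, building the filter payload with `json.dumps` avoids shell-quoting mistakes. A sketch; the exact set of accepted keys and value shapes (e.g. whether `degrees` takes a list) is an assumption based on the list above:

```python
import json
import subprocess

# Hypothetical filter payload; key names follow the filter system above.
filters = {"base_task": "arithmetic", "degrees": [1]}

subprocess.run([
    "python", "analyze.py", "surface", "data/dataset-m12x.json",
    "--filters", json.dumps(filters),
])
```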