ReasonScape Tools¶
ReasonScape provides a complete end-to-end evaluation platform with clean separation between components. The tool ecosystem transforms raw experiment configurations into statistical analysis and interactive visualizations.
Repository Structure¶
resolver.py # Manifold analysis and configuration validation
runner.py # Test execution engine
evaluate.py # Statistical analysis processor
leaderboard.py # Interactive ranking webapp
explorer.py # 3D manifold visualization webapp
compare_surface.py # Cross-model 3D surface comparison
compare_project.py # Multi-spectral projection analysis
configs/ # Experiment configuration files (.yaml)
data/ # Evaluation suites and dataset configurations
tasks/ # Parametric test generators with JSON schemas
templates/ # Chat template definitions (.json)
samplers/ # Generation parameter sets (.json)
results/ # Raw experiment outputs (timestamped folders)
Tool Overview¶
Core Pipeline¶
graph LR
A[Configs] --> B[runner.py]
B --> C[Results NDJSON]
C --> D[evaluate.py]
D --> E[Bucket JSON]
E --> F[Visualization Tools]
F --> G[leaderboard.py]
F --> H[explorer.py]
F --> I[compare_*.py]
style B fill:#e8f5e8
style D fill:#e8f5e8
style G fill:#f3e5f5
style H fill:#f3e5f5
style I fill:#f3e5f5
Execution Tools¶
resolver.py - Manifold Analysis Utility¶
Analyzes experiment configurations to predict computational costs and validate manifold definitions before execution.
Core functionality:
- Transforms abstract manifold definitions into concrete parameter grids
- Calculates total difficulty points across degrees and density settings
- Validates configuration correctness before expensive evaluations
- Provides detailed scaling analysis for resource planning
Basic usage:
python resolver.py configs/m6.yaml 1
Output: Detailed analysis tables showing difficulty points, parameter grids, and computational cost estimates.
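To make the "difficulty points" arithmetic concrete, the sketch below expands a hypothetical two-axis manifold into its parameter grid; the axis names and counts are illustrative only, not resolver.py's actual schema:

from itertools import product

# Hypothetical manifold: each axis lists the values swept at a given degree.
manifold = {
    "length": [4, 8, 16],      # illustrative axis, e.g. operand count
    "depth": [1, 2, 3, 4],     # illustrative axis, e.g. nesting depth
}

# Every combination of axis values is one difficulty point on the grid.
grid = list(product(*manifold.values()))
print(len(grid), "difficulty points")          # 3 * 4 = 12

# Total test count scales with the samples requested per point (precision/density).
samples_per_point = 50
print(len(grid) * samples_per_point, "total tests")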
runner.py - Test Execution Engine¶
Executes experiment configurations, applying chat templates and sampler settings to drive LLM inference.
Core functionality:
- Parametric test generation from task manifolds
- Progressive sampling with statistical confidence targeting
- Response caching to eliminate redundant API calls
- Parallel execution with configurable concurrency
- Deterministic seeding for reproducible results
Basic usage:
python runner.py \
--config configs/m6.yaml \
--template templates/zerocot-nosys.json \
--sampler samplers/greedy-4k.json \
--model your-model \
--apibase http://localhost:3333 \
--degree 1 --precision medium --density normal
Output: Timestamped results/ folders containing NDJSON files with complete inference traces.
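Conceptually, the response cache can be thought of as keying each completed call on the exact request that produced it, so re-running a deterministic configuration never re-queries the API. The sketch below is a simplified illustration with made-up field names, not runner.py's internal cache format:

import hashlib
import json

def cache_key(model, prompt, sampler_params, seed):
    # Hash the full request; identical requests map to the same cached response.
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": sampler_params, "seed": seed},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Deterministic seeding means the same config regenerates the same prompts,
# so a repeated run resolves entirely from cache instead of issuing new API calls.
key = cache_key("your-model", "What is 2+2?", {"temperature": 0.0, "max_tokens": 4096}, 42)
print(key[:16])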
evaluate.py - Statistical Analysis Processor¶
Groups raw results into statistical buckets and performs rigorous analysis with bias corrections.
Core functionality:
- Excess accuracy correction (removes guessing inflation)
- 95% confidence interval calculation with truncation awareness
- Token usage analysis and completion histograms
- Token-frequency domain analysis (FFT spectral analysis)
- Statistical validation and quality metrics
Basic usage:
python evaluate.py \
--interview 'results/*/*.ndjson' \
--output dataset-buckets.json \
--histogram 50 30 \
--tokenizer microsoft/Phi-4 \
--fftsamples 128
Output: Bucket JSON files containing statistical summaries, confidence intervals, and spectral analysis data.
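For intuition about the bias correction, excess accuracy is commonly computed by rescaling observed accuracy against the random-guessing baseline, with a 95% interval around the raw proportion; the sketch below is a generic formulation and may differ from evaluate.py's exact estimator:

import math

def excess_accuracy(correct, total, n_choices):
    # Rescale so that random guessing scores 0.0 and a perfect run scores 1.0.
    observed = correct / total
    chance = 1.0 / n_choices
    return (observed - chance) / (1.0 - chance)

def confidence_interval_95(correct, total):
    # Normal-approximation 95% interval on the raw accuracy proportion.
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

print(excess_accuracy(correct=70, total=100, n_choices=4))  # 0.6
print(confidence_interval_95(correct=70, total=100))        # approx (0.61, 0.79)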
Visualization Tools¶
leaderboard.py - Interactive Rankings¶
Multi-domain LLM performance ranking system with embedded difficulty manifold visualization.
Key features:
- ReasonScore rankings across reasoning domains
- Token efficiency analysis (score/token ratios)
- Embedded performance landscape plots
- Statistical confidence indicators
- Model filtering and comparison
Usage:
python leaderboard.py dataset-buckets.json
# Open http://localhost:8050
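The score/token ratios listed above boil down to dividing an aggregate score by the tokens spent earning it. The calculation below is illustrative only, with placeholder numbers rather than leaderboard.py's exact definition:

# Placeholder values; real numbers come from the bucket JSON.
reasonscore = 72.5               # hypothetical aggregate score
mean_completion_tokens = 1850    # hypothetical average tokens per response

efficiency = 1000.0 * reasonscore / mean_completion_tokens
print(round(efficiency, 1), "score per 1k completion tokens")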
explorer.py - 3D Manifold Analysis¶
Interactive platform for navigating multi-dimensional difficulty manifolds and examining cognitive architecture patterns.
Key features:
- 3D reasoning landscape visualization
- Interactive grid line selection for projection analysis
- Multi-panel synchronized analysis (FFT, accuracy, histograms)
- Cross-model cognitive pattern comparison
- Line and point projection modes
Usage:
python explorer.py dataset-buckets.json
# Open http://localhost:8051
Comparison Tools¶
compare_surface.py - Cross-Model 3D Analysis¶
Creates 2D grids of 3D surface plots for systematic model comparison across difficulty manifolds.
Key features:
- Side-by-side 3D surface comparison
- Multiple models × multiple tasks analysis
- Synchronized difficulty manifold views
- Export capabilities for publication
Usage:
python compare_surface.py dataset-buckets.json --models model1,model2 --tasks arithmetic,boolean
→ Detailed Surface Comparison Guide
compare_project.py - Multi-Spectral Projection Analysis¶
Generates systematic parameter sweep comparisons across models and reasoning domains.
Key features:
- 2D grids of projection plots
- Token-frequency spectral analysis
- Cross-model parameter sensitivity
- Multi-task cognitive pattern analysis
Usage:
python compare_project.py dataset-buckets.json --projections length,depth --models model1,model2
→ Detailed Projection Comparison Guide
Tool Integration Workflow¶
Standard Evaluation Pipeline¶
- Configure Experiment
  # Edit configs/m6.yaml for your needs
  # Select appropriate template and sampler

- Analyze Configuration
  python resolver.py configs/m6.yaml 1
  # Review difficulty points and parameter grids

- Execute Evaluation
  python runner.py --config configs/m6.yaml --degree 1 --precision medium \
    --model your-model --apibase http://localhost:3333

- Process Results
  python evaluate.py --interview 'results/*/*.ndjson' --output analysis.json \
    --histogram 50 30 --tokenizer your-tokenizer-id

- Explore Results
  # Quick ranking overview
  python leaderboard.py analysis.json
  # Deep manifold analysis
  python explorer.py analysis.json
  # Cross-model comparison
  python compare_surface.py analysis.json --models model1,model2
Advanced Analysis Workflows¶
Progressive Evaluation Scaling¶
# Stage 1: Quick comparison (degree 0)
python resolver.py configs/m6.yaml 0 # Analyze costs first
python runner.py --config configs/m6.yaml --degree 0 --precision low
python evaluate.py --interview 'results/*degree0*/*.ndjson' --output quick.json
python leaderboard.py quick.json
# Stage 2: Add medium difficulty (degree 1)
python resolver.py configs/m6.yaml 1 # Check scaling
python runner.py --config configs/m6.yaml --degree 1 --precision medium
python evaluate.py --interview 'results/*/*.ndjson' --output full.json
python explorer.py full.json
# Stage 3: Research-grade analysis (degree 2)
python resolver.py configs/m6.yaml 2 # Validate full complexity
python runner.py --config configs/m6.yaml --degree 2 --precision high
python evaluate.py --interview 'results/*/*.ndjson' --output research.json \
--histogram 50 30 --tokenizer your-model --fftsamples 256
Multi-Model Comparison Study¶
# Run multiple models with identical configs
for model in model1 model2 model3; do
python runner.py --model $model --config configs/m6.yaml --degree 1
done
# Aggregate and compare
python evaluate.py --interview 'results/*/*.ndjson' --output comparison.json
python compare_surface.py comparison.json --models model1,model2,model3
python compare_project.py comparison.json --models model1,model2,model3
Performance Optimization¶
Caching and Efficiency¶
- Response caching: Identical API calls are never repeated
- Hierarchical sampling: Smaller evaluations are subsets of larger ones (see the sketch after this list)
- Parallel execution: Configurable concurrency for API throughput
- Progressive precision: Start small, scale up based on results
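The hierarchical sampling property follows naturally from per-cell deterministic seeding: a low-precision run draws the same test cases as the prefix of a higher-precision run, so earlier results remain reusable. A minimal sketch of the idea, not the runner's exact mechanism:

import random

def test_case_ids(cell_seed, n):
    # Seed the RNG per manifold cell so every run draws the same case sequence.
    rng = random.Random(cell_seed)
    return [rng.randrange(10**9) for _ in range(n)]

small = test_case_ids(cell_seed=1234, n=25)    # e.g. a --precision low run
large = test_case_ids(cell_seed=1234, n=100)   # e.g. a --precision high run
assert large[:25] == small                     # the small run is a strict prefix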
Resource Management¶
- Token budgeting: Estimate costs before large evaluations, as sketched below
- Truncation handling: Separate tracking prevents wasted resources
- Context optimization: Adaptive confidence intervals for high-truncation scenarios
- Batch processing: Efficient memory usage for large datasets
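A back-of-the-envelope token budget of the kind resolver.py's cost analysis informs can be derived from the difficulty-point count and assumed per-test token usage; every number below is a placeholder:

# Rough pre-flight cost estimate (all numbers are placeholders).
difficulty_points = 12        # from resolver.py's analysis tables
samples_per_point = 50        # set by the precision/density choice
prompt_tokens = 400           # assumed average prompt length
completion_tokens = 1200      # assumed average completion length

total_tests = difficulty_points * samples_per_point
total_tokens = total_tests * (prompt_tokens + completion_tokens)
print(total_tests, "tests,", total_tokens, "tokens (approx.)")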