ReasonScape Tools

ReasonScape provides a complete end-to-end evaluation platform with clean separation between components. The tool ecosystem carries an experiment from raw configuration through test execution to statistical analysis and interactive visualization.

Repository Structure

resolver.py             # Manifold analysis and configuration validation
runner.py               # Test execution engine
evaluate.py             # Statistical analysis processor
leaderboard.py          # Interactive ranking webapp
explorer.py             # 3D manifold visualization webapp
compare_surface.py      # Cross-model 3D surface comparison
compare_project.py      # Multi-spectral projection analysis

configs/                # Experiment configuration files (.yaml)
data/                   # Evaluation suites and dataset configurations  
tasks/                  # Parametric test generators with JSON schemas
templates/              # Chat template definitions (.json)
samplers/               # Generation parameter sets (.json)
results/                # Raw experiment outputs (timestamped folders)

Tool Overview

Core Pipeline

graph LR
    A[Configs] --> B[runner.py]
    B --> C[Results NDJSON]
    C --> D[evaluate.py]
    D --> E[Bucket JSON]
    E --> F[Visualization Tools]
    F --> G[leaderboard.py]
    F --> H[explorer.py]
    F --> I[compare_*.py]

    style B fill:#e8f5e8
    style D fill:#e8f5e8
    style G fill:#f3e5f5
    style H fill:#f3e5f5
    style I fill:#f3e5f5

Execution Tools

resolver.py - Manifold Analysis Utility

Analyzes experiment configurations to predict computational costs and validate manifold definitions before execution.

Core functionality:

  • Transforms abstract manifold definitions into concrete parameter grids
  • Calculates total difficulty points across degrees and density settings
  • Validates configuration correctness before expensive evaluations
  • Provides detailed scaling analysis for resource planning

Basic usage:

python resolver.py configs/m6.yaml 1

Output: Detailed analysis tables showing difficulty points, parameter grids, and computational cost estimates.
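
The total number of difficulty points is essentially the size of the resolved parameter grid. A minimal sketch of that calculation (the axis names and values below are hypothetical, not the actual configs/m6.yaml schema):

from itertools import product

# Hypothetical manifold: each axis maps to the values the resolver would
# expand for a given degree/density setting.
manifold = {
    "length": [4, 8, 16, 32],
    "depth": [1, 2, 3],
    "distractors": [0, 2, 4],
}

# The resolved grid is the Cartesian product of all axis values;
# each grid cell is one difficulty point that runner.py must sample.
grid = list(product(*manifold.values()))

print(f"difficulty points: {len(grid)}")                # 4 * 3 * 3 = 36
print(f"example point: {dict(zip(manifold, grid[0]))}")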

Detailed Resolver Guide

runner.py - Test Execution Engine

Executes experiment configurations by applying templates and samplers to perform LLM inference.

Core functionality:

  • Parametric test generation from task manifolds
  • Progressive sampling with statistical confidence targeting
  • Response caching to eliminate redundant API calls
  • Parallel execution with configurable concurrency
  • Deterministic seeding for reproducible results

Basic usage:

python runner.py \
    --config configs/m6.yaml \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json \
    --model your-model \
    --apibase http://localhost:3333 \
    --degree 1 --precision medium --density normal

Output: Timestamped results/ folders containing NDJSON files with complete inference traces.
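
Response caching and deterministic seeding are what make re-runs cheap and reproducible: the cache key covers everything that affects a completion, and per-sample seeds are derived rather than drawn at random. A minimal sketch of the general technique (illustrative only, not runner.py's actual implementation):

import hashlib
import json

def cache_key(model: str, prompt: str, sampler: dict) -> str:
    """Hash every field that affects the completion, so identical
    requests resolve to the same cached response."""
    payload = json.dumps({"model": model, "prompt": prompt, "sampler": sampler},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def sample_seed(task: str, point: dict, index: int) -> int:
    """Derive a stable per-sample seed from the task, the difficulty
    point, and the sample index, so test generation is reproducible."""
    payload = json.dumps({"task": task, "point": point, "index": index},
                         sort_keys=True)
    return int.from_bytes(hashlib.sha256(payload.encode("utf-8")).digest()[:8], "big")

key = cache_key("your-model", "What is 2+2?", {"temperature": 0.0, "max_tokens": 4096})
seed = sample_seed("arithmetic", {"length": 8, "depth": 2}, index=0)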

evaluate.py - Statistical Analysis Processor

Groups raw results into statistical buckets and performs rigorous analysis with bias corrections.

Core functionality:

  • Excess accuracy correction (removes guessing inflation)
  • 95% confidence interval calculation with truncation awareness
  • Token usage analysis and completion histograms
  • Token-frequency domain analysis (FFT spectral analysis)
  • Statistical validation and quality metrics

Basic usage:

python evaluate.py \
    --interview 'results/*/*.ndjson' \
    --output dataset-buckets.json \
    --histogram 50 30 \
    --tokenizer microsoft/Phi-4 \
    --fftsamples 128

Output: Bucket JSON files containing statistical summaries, confidence intervals, and spectral analysis data.
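
Excess accuracy is a chance-level correction: raw accuracy is rescaled so that random guessing maps to zero. A sketch of the idea using the standard correction formula and a normal-approximation 95% interval (the exact formulas and bucket fields used by evaluate.py may differ):

import math

def excess_accuracy(correct: int, total: int, guess_rate: float) -> float:
    """Rescale accuracy so chance performance (guess_rate) maps to 0.0
    and perfect performance maps to 1.0."""
    acc = correct / total
    return (acc - guess_rate) / (1.0 - guess_rate)

def confidence_interval_95(correct: int, total: int) -> tuple:
    """Normal-approximation 95% confidence interval on raw accuracy."""
    p = correct / total
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / total)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# e.g. 4-way multiple choice: guessing alone already yields 25% accuracy
print(excess_accuracy(correct=60, total=100, guess_rate=0.25))   # ~0.467
print(confidence_interval_95(correct=60, total=100))             # ~(0.504, 0.696)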

Visualization Tools

leaderboard.py - Interactive Rankings

Multi-domain LLM performance ranking system with embedded difficulty manifold visualization.

Key features:

  • ReasonScore rankings across reasoning domains
  • Token efficiency analysis (score/token ratios)
  • Embedded performance landscape plots
  • Statistical confidence indicators
  • Model filtering and comparison

Usage:

python leaderboard.py dataset-buckets.json
# Open http://localhost:8050

Detailed Leaderboard Guide

explorer.py - 3D Manifold Analysis

Interactive platform for navigating multi-dimensional difficulty manifolds and examining cognitive architecture patterns.

Key features:

  • 3D reasoning landscape visualization
  • Interactive grid line selection for projection analysis
  • Multi-panel synchronized analysis (FFT, accuracy, histograms)
  • Cross-model cognitive pattern comparison
  • Line and point projection modes

Usage:

python explorer.py dataset-buckets.json
# Open http://localhost:8051

Detailed Explorer Guide

Comparison Tools

compare_surface.py - Cross-Model 3D Analysis

Creates 2D grids of 3D surface plots for systematic model comparison across difficulty manifolds.

Key features:

  • Side-by-side 3D surface comparison
  • Multiple models × multiple tasks analysis
  • Synchronized difficulty manifold views
  • Export capabilities for publication

Usage:

python compare_surface.py dataset-buckets.json --models model1,model2 --tasks arithmetic,boolean

Detailed Surface Comparison Guide

compare_project.py - Multi-Spectral Projection Analysis

Generates systematic parameter sweep comparisons across models and reasoning domains.

Key features:

  • 2D grids of projection plots
  • Token-frequency spectral analysis
  • Cross-model parameter sensitivity
  • Multi-task cognitive pattern analysis

Usage:

python compare_project.py dataset-buckets.json --projections length,depth --models model1,model2

Detailed Projection Comparison Guide

Tool Integration Workflow

Standard Evaluation Pipeline

  1. Configure Experiment

    # Edit configs/m6.yaml for your needs
    # Select appropriate template and sampler
    

  2. Analyze Configuration

    python resolver.py configs/m6.yaml 1
    # Review difficulty points and parameter grids
    

  3. Execute Evaluation

    python runner.py --config configs/m6.yaml --degree 1 --precision medium \
        --model your-model --apibase http://localhost:3333
    

  4. Process Results

    python evaluate.py --interview 'results/*/*.ndjson' --output analysis.json \
        --histogram 50 30 --tokenizer your-tokenizer-id
    

  5. Explore Results

    # Quick ranking overview
    python leaderboard.py analysis.json
    
    # Deep manifold analysis
    python explorer.py analysis.json
    
    # Cross-model comparison
    python compare_surface.py analysis.json --models model1,model2
    

Advanced Analysis Workflows

Progressive Evaluation Scaling

# Stage 1: Quick comparison (degree 0)
python resolver.py configs/m6.yaml 0  # Analyze costs first
python runner.py --config configs/m6.yaml --degree 0 --precision low
python evaluate.py --interview 'results/*degree0*/*.ndjson' --output quick.json
python leaderboard.py quick.json

# Stage 2: Add medium difficulty (degree 1)  
python resolver.py configs/m6.yaml 1  # Check scaling
python runner.py --config configs/m6.yaml --degree 1 --precision medium
python evaluate.py --interview 'results/*/*.ndjson' --output full.json
python explorer.py full.json

# Stage 3: Research-grade analysis (degree 2)
python resolver.py configs/m6.yaml 2  # Validate full complexity
python runner.py --config configs/m6.yaml --degree 2 --precision high
python evaluate.py --interview 'results/*/*.ndjson' --output research.json \
    --histogram 50 30 --tokenizer your-model --fftsamples 256

Multi-Model Comparison Study

# Run multiple models with identical configs
for model in model1 model2 model3; do
    python runner.py --model $model --config configs/m6.yaml --degree 1
done

# Aggregate and compare
python evaluate.py --interview 'results/*/*.ndjson' --output comparison.json
python compare_surface.py comparison.json --models model1,model2,model3
python compare_project.py comparison.json --models model1,model2,model3

Performance Optimization

Caching and Efficiency

  • Response caching: Identical API calls are never repeated
  • Hierarchical sampling: Smaller evaluations are subsets of larger ones (see the sketch below)
  • Parallel execution: Configurable concurrency for API throughput
  • Progressive precision: Start small, scale up based on results
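
The hierarchical-sampling property is what makes the progressive workflow above cheap. A minimal sketch of the idea (the precision names match the --precision flag, but the sample counts are hypothetical):

# Higher-precision runs reuse the exact sample indices of lower-precision
# runs as a prefix, so every cached response from a small run remains
# valid in a larger one.
PRECISION_SAMPLES = {"low": 8, "medium": 32, "high": 128}  # hypothetical counts

def sample_indices(precision: str) -> range:
    return range(PRECISION_SAMPLES[precision])

low = set(sample_indices("low"))
high = set(sample_indices("high"))
assert low.issubset(high)   # the small evaluation is a strict subset of the large one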

Resource Management

  • Token budgeting: Estimate costs before large evaluations (see the sketch below)
  • Truncation handling: Separate tracking prevents wasted resources
  • Context optimization: Adaptive confidence intervals for high-truncation scenarios
  • Batch processing: Efficient memory usage for large datasets
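
A token budget can be approximated directly from the resolver output before committing to a large run. A rough sketch (all numbers are assumptions):

# Rough token-budget estimate: plug in resolver.py's reported difficulty-point
# count and your own per-sample token assumptions.
difficulty_points = 36         # from resolver.py output
samples_per_point = 32         # precision setting (assumed)
avg_prompt_tokens = 350        # assumed
avg_completion_tokens = 900    # assumed; reasoning traces run long

total_tokens = difficulty_points * samples_per_point * (avg_prompt_tokens + avg_completion_tokens)
print(f"estimated tokens: {total_tokens:,}")   # 1,440,000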