ReasonScape Tools

ReasonScape provides a complete end-to-end evaluation platform organized around five evaluation stages: Definition, Execution, Evaluation, Exploration, and Research.

This document provides tool references organized by stage. For the five-stage philosophy, stage interconnections, and how these tools fit into the unified research platform, see the Architecture Overview.

See Also:

  • Architecture Overview - Five-stage philosophy
  • Methodology - Underlying concepts and statistical approaches
  • Configuration Guide - Experiment setup and parameter definitions
  • Task Documentation - Detailed task generator specifications

Repository Structure

resolver.py             # Manifold analysis and configuration validation
runner.py               # Test execution engine
evaluate.py             # Unified evaluation (dataset and interview modes)
leaderboard.py          # Interactive ranking webapp
spiderweb.py            # Per-model diagnostic webapp
explorer.py             # 3D manifold visualization webapp
analyze.py              # Unified analysis interface with 10 subcommands

configs/                # Experiment configuration files (.yaml)
data/                   # Evaluation suites and dataset configurations
tasks/                  # Parametric test generators with JSON schemas
templates/              # Chat template definitions (.json)
samplers/               # Generation parameter sets (.json)
results/                # Raw experiment outputs (timestamped folders)

Tool Overview by Stage

The Five-Stage Pipeline

graph TB
    subgraph "Stage 1: Definition"
        A[resolver.py<br/>tasks/*.py<br/>configs/*.yaml]
    end

    subgraph "Stage 2: Execution"
        B[runner.py<br/>templates/*.json<br/>samplers/*.json]
    end

    subgraph "Stage 3: Evaluation"
        C[evaluate.py]
    end

    subgraph "Stage 4: Exploration"
        D[leaderboard.py<br/>spiderweb.py<br/>explorer.py]
    end

    subgraph "Stage 5: Research"
        E[analyze.py<br/>10 subcommands]
    end

    A --> B --> C --> D
    C --> E
    E -.research loop.-> A

    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#f3e5f5
    style E fill:#fce4ec

Stage 1: Definition Tools

resolver.py - Manifold Analysis Utility

Analyzes experiment configurations to predict computational costs and validate manifold definitions before execution.

Core functionality:

  • Transforms abstract manifold definitions into concrete parameter grids
  • Calculates total difficulty points across degrees and density settings
  • Validates configuration correctness before expensive evaluations
  • Provides detailed scaling analysis for resource planning

Basic usage:

python resolver.py configs/m12x.yaml 1

Output: Detailed analysis tables showing difficulty points, parameter grids, and computational cost estimates.
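The grid expansion step can be sketched as a Cartesian product over the manifold's axes. The axis names and value ranges below are illustrative only, not ReasonScape's actual configuration schema:

```python
from itertools import product

# Hypothetical manifold: each axis maps to its list of values.
# Axis names and ranges are illustrative, not the real schema.
manifold = {
    "length": [4, 8, 16],
    "depth": [1, 2],
    "noise": [0.0, 0.25],
}

def expand_grid(axes):
    """Expand an axis -> values mapping into concrete difficulty points."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*(axes[n] for n in names))]

points = expand_grid(manifold)  # 3 * 2 * 2 = 12 difficulty points
```

The total point count is what drives the cost estimate: each additional axis value multiplies the number of difficulty points, and each point requires its own batch of samples.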

Detailed Resolver Guide

See Also: Configuration Guide for manifold definitions and Methodology for the underlying concepts.

runner.py - Test Execution Engine

Executes experiment configurations by applying templates and samplers to perform LLM inference.

Core functionality:

  • Parametric test generation from task manifolds
  • Progressive sampling with statistical confidence targeting
  • Response caching to eliminate redundant API calls
  • Parallel execution with configurable concurrency
  • Deterministic seeding for reproducible results
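The caching behaviour can be illustrated with a content-addressed key: hashing the full request means identical requests resolve to the same key, so the cached response is reused instead of re-calling the API. The key fields below are assumptions for illustration, not runner.py's actual cache layout:

```python
import hashlib
import json

# Illustrative cache-key sketch: hash the complete request so that
# repeated identical requests hit the cache. Field names are assumptions.
def cache_key(model, prompt, sampler, seed):
    payload = json.dumps(
        {"model": model, "prompt": prompt, "sampler": sampler, "seed": seed},
        sort_keys=True,  # stable ordering -> stable hash
    )
    return hashlib.sha256(payload.encode()).hexdigest()

k1 = cache_key("your-model", "2+2=?", {"temp": 0.0}, 42)
k2 = cache_key("your-model", "2+2=?", {"temp": 0.0}, 42)  # same key as k1
```

Deterministic seeding composes with this: if the seed is derived from the configuration, the same config always generates the same tests and therefore the same cache keys.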

Basic usage:

python runner.py \
    --config configs/m12x.yaml \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json \
    --model your-model \
    --apibase http://localhost:3333 \
    --degree 1 --precision medium --density normal

Output: Timestamped results/ folders containing NDJSON files with complete inference traces.
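NDJSON output can be consumed one record per line, which keeps memory usage flat even for large runs. A minimal reader sketch; the field names ("correct", "tokens") are illustrative, not the actual trace schema:

```python
import io
import json

# Stand-in for an open results file; real traces come from results/*.ndjson.
sample = io.StringIO(
    '{"model": "m", "correct": true, "tokens": 120}\n'
    '{"model": "m", "correct": false, "tokens": 340}\n'
)

def iter_ndjson(stream):
    """Yield one parsed record per non-empty line of an NDJSON stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

records = list(iter_ndjson(sample))
accuracy = sum(r["correct"] for r in records) / len(records)
```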

See Also: Runner Guide for detailed usage and Methodology for caching and seeding concepts.


Stage 3: Evaluation Tools

evaluate.py - Statistical Analysis Processor

Groups raw results into statistical buckets and performs rigorous analysis with bias corrections. Supports both the legacy JSON bucket output and the DuckDB v2 workflow.

Core functionality:

  • Excess accuracy correction (removes guessing inflation)
  • 95% confidence interval calculation with truncation awareness
  • Token usage analysis and frequency spectra (FFT)
  • DuckDB v2: single database file with 5D de-duplication and native LIST types
  • Parallel bucket processing for compression/FFT computation
  • Context simulation for lower context limits
  • Automatic tag updates (groups, surfaces, projections)
  • Optional .zstd compression archives for raw data sharing
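The excess-accuracy idea can be sketched with the standard chance correction for k-way answers: raw accuracy is rescaled so that random guessing maps to 0 and perfect performance to 1. ReasonScape's exact adjustment may differ in detail:

```python
def excess_accuracy(raw_acc, n_choices):
    """Chance-corrected accuracy: 0.0 at random guessing, 1.0 at perfect.
    Standard correction sketch; the tool's exact formula may differ."""
    guess = 1.0 / n_choices
    return max(0.0, (raw_acc - guess) / (1.0 - guess))

excess_accuracy(0.25, 4)  # random guessing on 4 choices -> 0.0
excess_accuracy(1.0, 4)   # perfect -> 1.0
```

This is why two models with the same raw accuracy on tasks with different answer-space sizes can receive different corrected scores.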

Usage:

# Dataset mode - processes all evals from config
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

Output: DuckDB database with de-duplicated points (v2) or JSON bucket files (legacy).

Detailed Evaluate Guide

See Also: Methodology for statistical concepts.


Stage 4: Exploration Tools

leaderboard.py - Interactive Rankings

Multi-domain LLM performance ranking system with heatmap visualization and pagination.

Key features:

  • ReasonScore rankings across reasoning domains
  • Token efficiency analysis (score/token ratios)
  • Heatmap performance visualization with color-coded cells
  • Truncation indicators rendered as progressively darker cells
  • Statistical confidence indicators
  • Group and manifold filtering
  • Pagination for large model sets

Usage:

python leaderboard.py dataset-buckets.json
# Open http://localhost:8050

Detailed Leaderboard Guide

explorer.py - 3D Manifold Analysis

Interactive platform for navigating multi-dimensional difficulty manifolds and examining cognitive architecture patterns.

Key features:

  • 3D reasoning landscape visualization
  • Interactive grid line selection for projection analysis
  • Multi-panel synchronized analysis (FFT, accuracy, histograms)
  • Cross-model cognitive pattern comparison
  • Line and point projection modes

Usage:

python explorer.py dataset-buckets.json
# Open http://localhost:8051

Detailed Explorer Guide

spiderweb.py - Per-Model Diagnostics (V2)

Per-model task breakdown visualization with cognitive archetype identification.

Key features:

  • Radar chart (web-png) for pattern recognition
  • Bar chart (bar-png) for precise value reading
  • Cognitive archetype identification (9 patterns)
  • Token efficiency overlay
  • JSON/markdown outputs for automation

Usage:

python spiderweb.py dataset-buckets.json --scenarios model-name --output spider.png
# Or via analyze.py:
python analyze.py spider --scenarios model-name --output spider.png

Detailed Spider Tool Guide


Stage 5: Research Tools

analyze.py - Unified Analysis Interface

Comprehensive multi-tier analysis platform with 10 subcommands for investigating model performance, statistical groupings, and failure mechanisms. DuckDB-backed for efficient queries.

Analysis Flow:

  1. evals - Evaluation discovery with tier availability
  2. tasks - Task structure discovery (surfaces/projections)
  3. scores - Aggregate leaderboard with fair-sort ranking
  4. spiderweb - Per-model task breakdown diagnostics
  5. cluster - Statistical grouping via CI overlap
  6. surface - 3D difficulty manifold visualization
  7. fft - Token-frequency spectral analysis
  8. compression - Information-theoretic entropy analysis
  9. hazard - Temporal failure risk (survival analysis)
  10. modelinfo - Model metadata collection from HF hub
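The cluster subcommand's grouping criterion can be illustrated simply: two models land in the same statistical tier when their confidence intervals overlap, meaning the data cannot separate them. The interval values below are made up for illustration:

```python
# Two closed intervals (lo, hi) overlap iff each starts before the other ends.
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

# Illustrative 95% CIs, not real evaluation results.
ci = {
    "model-a": (0.80, 0.90),
    "model-b": (0.85, 0.95),
    "model-c": (0.60, 0.70),
}

same_tier = overlaps(ci["model-a"], ci["model-b"])  # True: cannot be separated
```

model-c's interval is disjoint from both others, so it would fall into a lower tier on its own.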

Basic usage:

# Discovery: list evaluations and tasks
python analyze.py evals data/dataset-m12x.json
python analyze.py tasks data/dataset-m12x.json

# Rankings and diagnostics
python analyze.py scores data/dataset-m12x.json --format png
python analyze.py spiderweb data/dataset-m12x.json --format webpng

# Statistical grouping
python analyze.py cluster data/dataset-m12x.json --split base_task --format png

# Deep analysis
python analyze.py surface data/dataset-m12x.json --filters '{"base_task": "arithmetic"}'
python analyze.py fft data/dataset-m12x.json --filters '{"base_task": "arithmetic"}'
python analyze.py compression data/dataset-m12x.json --output-dir compression/
python analyze.py hazard data/dataset-m12x.json --output-dir hazard/

# Metadata collection
python analyze.py modelinfo data/dataset-m12x.json --output-dir metadata/

Filter System: All commands support a --filters JSON parameter for flexible data selection by eval_id, groups, base_task, degrees, surfaces, projections, and more.
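Because --filters takes a JSON string, it is easy to build programmatically. The keys below come from the filter dimensions listed above; the values are illustrative:

```python
import json

# Build a --filters argument for analyze.py. Keys match the documented
# filter dimensions; the values here are illustrative.
filters = {"base_task": "arithmetic", "degrees": [1, 2]}
arg = json.dumps(filters)
# Pass `arg` as the --filters value, e.g. from a driver script or subprocess call.
```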

Complete Analysis Guide