# ReasonScape Implementation: The Python Codebase
Prerequisites: Before reading this document, familiarize yourself with:

- architecture.md - The five-stage methodology this codebase implements
## Overview
This document describes the Python implementation that brings the ReasonScape methodology to life. The codebase is organized around the five-stage architecture, with each stage implemented by specific tools and systems.
## Stage 1 Implementation: Definition
Purpose: Parametric test generation with coordinate-based seeding
### Core Components
Test Generators (tasks/)
- 12 reasoning domain implementations (arithmetic, logic, planning, etc.)
- Pydantic schemas for parameter validation
- Deterministic coordinate-based generation
- See tasks.md for complete task list and task API reference
Manifold Configurations (configs/*.yaml)
- Precision settings (sample counts per point)
- View definitions
- See config.md for complete configuration reference
For coordinate-based seeding, hierarchical sampling, and manifold parameter types, see: Technical Details: Parametric Test Generation
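To illustrate the idea behind deterministic coordinate-based generation: hashing the task name, the manifold coordinates, and a per-point sample index yields a seed, so the same coordinates always regenerate the same test instances. The function name `rng_for_point` and the key layout below are illustrative assumptions, not the actual tasks/ API:

```python
import hashlib
import random

def rng_for_point(task: str, coords: tuple, sample_idx: int) -> random.Random:
    """Derive a deterministic RNG from task name, manifold coordinates,
    and sample index, so identical coordinates reproduce identical tests."""
    key = f"{task}|{coords}|{sample_idx}".encode()
    # Fold the first 8 bytes of the SHA-256 digest into an integer seed
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)

# Same coordinates -> same stream of random draws
a = rng_for_point("arithmetic", (3, 0.5), 0)
b = rng_for_point("arithmetic", (3, 0.5), 0)
assert a.random() == b.random()
```

Because the seed depends only on the coordinates, no test corpus needs to be stored: any point on the manifold can be regenerated on demand.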
## Stage 2 Implementation: Execution
Purpose: Efficient inference at scale with caching and adaptive sampling
### Core Components
Execution Orchestrator (runner.py)
- Manages inference workflow
- Response caching (SHA-256 based)
- Hierarchical sampling coordination
- Truncation detection and handling
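A minimal sketch of SHA-256 based response caching, assuming the cache key covers everything that determines a completion (model, prompt, and sampler settings). The names `cache_key` and `complete` are illustrative, not runner.py's actual interface:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, sampler: dict) -> str:
    """Hash every input that determines a completion; identical re-runs
    are then served from cache instead of hitting the inference server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "sampler": sampler},
        sort_keys=True,  # canonical ordering so equal dicts hash equally
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache: dict[str, str] = {}

def complete(model: str, prompt: str, sampler: dict, call_api) -> str:
    key = cache_key(model, prompt, sampler)
    if key not in cache:
        cache[key] = call_api(prompt)  # only pay for a miss
    return cache[key]
```

Keying on the serialized sampler settings means a temperature change invalidates the cache for that prompt, while re-evaluating the same configuration is free.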
Chat Templates (templates/*.json)
- Model-specific prompt formatting
- Zero-shot CoT, few-shot, direct answer formats
- System message configuration
Sampling Configurations (samplers/*.json)
- Temperature, top-p, top-k, min-p settings
- Token budget configurations
- Model-specific optimizations
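A sampler configuration file might look like the following. The field names shown are common sampler parameters (temperature, top-p, top-k, min-p, token budget); the actual schema used in samplers/*.json may differ:

```json
{
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.05,
  "max_tokens": 4096
}
```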
For response caching, hierarchical sampling, and adaptive evaluation, see: Technical Details: Progressive Evaluation Architecture
## Stage 3 Implementation: Evaluation
Purpose: Statistical rigor with excess accuracy correction and forensic pre-computation
### Core Components
Evaluation Processor (evaluate.py)
- Unified evaluation pipeline
- Dataset mode (batch processing)
- Interview mode (interactive testing)
- Pre-computation of forensic data
PointsDB (src/points_db.py)
- See pointsdb.md for complete API
Cohort Postprocessing (cohort.py)
- Creates context-limited variants of existing evaluations
- Non-destructive: new result folders with provenance metadata; re-run evaluate.py to rebuild the DB
Data Distribution (data.py)
- Content-addressed blob storage for sharing evaluation data (pull/push/status/prune)
- Selective pulls: database only (sufficient for Stages 4–5), specific cohorts, or full dataset
For excess accuracy correction, Wilson confidence intervals, truncation handling, and forensic pre-computation, see: Technical Details: Statistical Methodology
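For reference, the Wilson score interval mentioned above has a closed form. The sketch below implements the standard textbook formula (the function name is illustrative); unlike the normal approximation, it stays inside [0, 1] and remains well-behaved at small sample counts:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> 95% CI)."""
    if n == 0:
        return (0.0, 1.0)  # no data: maximally uncertain
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```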
## Deep-Dive Design Documents
These documents explain the design decisions behind Stage 3:
manifold.md - Two-Plane Data Model
- Why two planes (Evaluation × Task-Complexity)?
- Identity dimensions (5D uniqueness)
- Facet dimensions (multi-valued organization)
- Orthogonality principles and query patterns
reasonscore.md - Unified Metric Architecture
- Design philosophy (optimistic/pessimistic/punishing)
- Two-layer computation (samples → points → tasks)
- Why geometric mean across tasks?
- Token efficiency normalization
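To see why a geometric mean is used across tasks: it punishes uneven capability, since a near-zero score on any one task collapses the aggregate rather than being averaged away. A minimal sketch, in which the epsilon floor and function name are assumptions and not the actual computation described in reasonscore.md:

```python
import math

def geometric_mean(task_scores: list[float]) -> float:
    """Geometric mean across per-task scores: weakness on any single
    task drags the aggregate toward zero."""
    eps = 1e-9  # hypothetical floor to keep log() defined at zero
    logs = [math.log(max(s, eps)) for s in task_scores]
    return math.exp(sum(logs) / len(logs))

# Contrast with the arithmetic mean: [0.9, 0.9, 0.0] averages to 0.6,
# but its geometric mean is near zero.
```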
## Stage 4 Implementation: Discovery
Purpose: Visual pattern recognition and hypothesis formation
### Core Components
Leaderboard (leaderboard.py)
- Interactive rankings with ReasonScore
- Heatmap visualization (models × tasks × tiers)
- Group filtering for peer comparison
- Truncation indicators
Spider Plots (spiderweb.py)
- Radar charts across 12 reasoning domains
- Cognitive archetype identification
- Difficulty scaling visualization
- Token efficiency overlay
Explorer (explorer.py)
- Interactive 3D surface plots
- Multi-panel analysis (FFT, accuracy, tokens)
- Point inspection with sample viewing
- Capability boundary visualization
## Stage 5 Implementation: Investigation
Purpose: Systematic forensic analysis and root cause identification
Stage 5 is organized around three hierarchical workflows—the Three P's—each implemented by specific tools and operating at different data levels:
| P | Level | What It Implements |
|---|---|---|
| Position | PointsDB (ranked) | Statistical ranking across models |
| Profile | PointsDB (unranked) | Capability characterization and diagnosis |
| Probe | Raw NDJSON | Raw trace analysis |
### Core Components
Unified Analysis Interface (analyze.py) — discovery, position, and profile tools
Raw Trace Analysis (probe.py) — fft, failure inspection, and loop detection
## Implementation Architecture Summary
| Stage | Purpose | Key Tools | Deep-Dive Docs |
|---|---|---|---|
| 1. Definition | Test generation | tasks/, configs/ | tasks.md, config.md, technical-details.md |
| 2. Execution | Inference at scale | runner.py, templates/, samplers/ | tools.md#runnerpy, technical-details.md |
| 3. Evaluation | Statistical processing | evaluate.py, points_db.py, cohort.py, data.py | pointsdb.md, manifold.md, reasonscore.md, technical-details.md, tools.md#cohortpy, tools.md#datapy |
| 4. Discovery | Visual exploration | leaderboard.py, spiderweb.py, explorer.py | tools.md#leaderboardpy, tools.md#spiderwebpy, tools.md#explorerpy |
| 5. Investigation | The Three P's | analyze.py, probe.py | tools.md, workflow.md |
## See Also
- architecture.md - The five-stage methodology
- technical-details.md - Low-level implementation algorithms
- config.md - Configuration reference
- pointsdb.md - Data structure API
- tasks.md - Abstract task API