ReasonScape Implementation: The Python Codebase¶
Prerequisites: Before reading this document, familiarize yourself with:
- architecture.md - The five-stage methodology this codebase implements
Overview¶
This document describes the Python implementation that brings the ReasonScape methodology to life. The codebase is organized around the five-stage architecture, with each stage implemented by specific tools and systems.
Stage 1 Implementation: Definition¶
Purpose: Parametric test generation with coordinate-based seeding
Core Components¶
Test Generators (tasks/)
- 12 reasoning domain implementations (arithmetic, logic, planning, etc.)
- Pydantic schemas for parameter validation
- Deterministic coordinate-based generation
- See tasks.md for complete task list and task API reference
Manifold Configurations (configs/*.yaml)
- Tier definitions (easy/medium/hard/ultra)
- Surface definitions (2D difficulty slices)
- Projection definitions (1D difficulty sweeps)
- Precision settings (sample counts per point)
- See config.md for complete configuration reference
Configuration System (resolver.py)
- Validates manifold definitions
- Expands (degree, density) into concrete sampling grids
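As an illustration, a resolved surface is simply the cross product of its swept parameters with the precision settings attached. The sketch below is hypothetical: the parameter names (length, depth) and dictionary shapes are illustrative only, and the real schema is documented in config.md.

```python
# Hypothetical resolver sketch: expand a 2D surface definition into concrete
# sampling points. Parameter names and structure are illustrative, not the real schema.
from itertools import product

surface = {"length": [4, 8, 12, 16], "depth": [1, 2, 3]}  # 2D difficulty slice
precision = {"count": 32}                                  # samples per point

grid = [dict(zip(surface, values), **precision) for values in product(*surface.values())]
# -> 12 concrete points, e.g. {'length': 4, 'depth': 1, 'count': 32}
```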
Key Algorithms¶
Coordinate-Based Seeding:
# From runner.py (excerpt): params, global_seed, and generator come from the surrounding context
import hashlib
import json
import random

seed_params = {k: v for k, v in params.items() if k != 'count'}
param_hash = hashlib.sha256(json.dumps(seed_params, sort_keys=True).encode()).hexdigest()
base_seed = int(param_hash[-8:], 16)  # last 8 hex digits -> 32-bit seed offset
generator.rng = random.Random(global_seed + base_seed)
Properties:
- Same coordinates → identical test sequences
- Hierarchical sampling (count=32 ⊂ count=128)
- Reproducible yet infinite test space
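To make the hierarchical property concrete, here is a toy generator built on the seeding scheme above. Because count is excluded from the hash, requesting 128 samples reproduces the 32-sample sequence as a prefix. The make_rng and generate helpers are illustrative sketches, not the actual tasks/ API.

```python
import hashlib
import json
import random

def make_rng(params: dict, global_seed: int = 0) -> random.Random:
    # 'count' is excluded, so the seed depends only on the difficulty coordinates
    seed_params = {k: v for k, v in params.items() if k != 'count'}
    param_hash = hashlib.sha256(json.dumps(seed_params, sort_keys=True).encode()).hexdigest()
    return random.Random(global_seed + int(param_hash[-8:], 16))

def generate(params: dict) -> list:
    rng = make_rng(params)
    return [rng.randint(0, 9) for _ in range(params['count'])]

small = generate({'length': 8, 'depth': 2, 'count': 32})
large = generate({'length': 8, 'depth': 2, 'count': 128})
assert large[:32] == small  # count=32 ⊂ count=128
```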
For detailed algorithms and parameter types, see: Technical Details: Parametric Test Generation
Stage 2 Implementation: Execution¶
Purpose: Efficient inference at scale with caching and adaptive sampling
Core Components¶
Execution Orchestrator (runner.py)
- Manages inference workflow
- Response caching (SHA-256 based)
- Hierarchical sampling coordination
- Truncation detection and handling
Chat Templates (templates/*.json)
- Model-specific prompt formatting
- Zero-shot CoT, few-shot, direct answer formats
- System message configuration
Sampling Configurations (samplers/*.json)
- Temperature, top-p, top-k, min-p settings
- Token budget configurations
- Model-specific optimizations
Key Mechanisms¶
Response Caching:
- Hash: SHA-256 of (model, messages, parameters)
- Storage: NDJSON response traces
- Typical cost reduction: 60-80% for multi-tier evaluation
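A minimal sketch of the cache key, assuming a hypothetical cache_key() helper; the exact payload layout in runner.py may differ, but the principle is a SHA-256 over the full request:

```python
import hashlib
import json

def cache_key(model: str, messages: list, sampler: dict) -> str:
    # Hash the complete request so any change to prompt or sampling invalidates the entry
    payload = json.dumps({"model": model, "messages": messages, "sampler": sampler},
                         sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

# Completed responses are appended to an NDJSON trace; on re-runs, matching keys are
# served from disk instead of re-queried, which is where the 60-80% savings come from.
```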
Adaptive Sampling:
- Start with minimum samples (e.g., 32)
- Compute Wilson CI width
- Continue until confidence target met
- Early stopping for high-truncation points
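A sketch of the adaptive loop under stated assumptions: draw_samples is a hypothetical callback that runs a batch and returns per-sample correctness, and the stopping rule is a target Wilson CI width. The real runner.py logic (including truncation-based early stopping) is more involved.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    # Wilson score interval; well-behaved for small n and extreme proportions
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

def sample_until_confident(draw_samples, min_samples=32, max_samples=512, target_width=0.10):
    successes, total = 0, 0
    while total < max_samples:
        batch = draw_samples(min_samples)          # run another batch of tests
        successes += sum(batch)
        total += len(batch)
        lo, hi = wilson_interval(successes, total)
        if hi - lo <= target_width:                # confidence target met
            break
    return successes, total
```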
For detailed implementation, see: Technical Details: Progressive Evaluation Architecture
Stage 3 Implementation: Evaluation¶
Purpose: Statistical rigor with excess accuracy correction and forensic pre-computation
Core Components¶
Evaluation Processor (evaluate.py)
- Unified evaluation pipeline
- Dataset mode (batch processing)
- Interview mode (interactive testing)
- Pre-computation of forensic data
Data Storage (PointsDB/DuckDB)
- Two-plane data model
- Per-point statistics storage
- Compression arrays for forensic analysis
- See pointsdb.md for complete API
Key Mechanisms¶
Excess Accuracy Correction:
# Remove guessing contributions (results: per-sample records for one point;
# correct, total: raw success count and sample count at that point)
guess_accumulator = sum(r.guess_chance for r in results if not r.truncated)
adjusted_successes = correct - guess_accumulator
adjusted_trials = total - guess_accumulator
accuracy = adjusted_successes / adjusted_trials  # 0.0 = guessing, 1.0 = perfect
Wilson Confidence Intervals:
- Handles small samples and extreme probabilities
- Better than normal approximation
- Used at both point-level and task-level aggregation
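For reference, the Wilson score interval for $k$ successes out of $n$ trials, with $z$ the normal quantile for the chosen confidence level and $\hat{p} = k/n$, is:

$$
\hat{p}_\pm = \frac{\hat{p} + \frac{z^2}{2n} \pm z\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}
$$

Unlike the normal approximation, these bounds always stay within [0, 1], which matters for small samples and accuracies near 0 or 1.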
Truncation Handling:
- Tracked separately from errors
- Reduces effective sample size
- Reported explicitly in all visualizations
Forensic Pre-Computation:
- Compression arrays: gzip(reasoning_trace) for every response (see the sketch after this list)
- FFT arrays: Token-frequency domain analysis ready
- Token distributions: Separate tracking for correct/incorrect/truncated
- 10-100x speedup for Stage 5 investigations
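A minimal sketch of the compression pre-computation, assuming a hypothetical compression_ratio() helper; evaluate.py stores such arrays per point so Stage 5 never has to re-read raw traces.

```python
import gzip

def compression_ratio(reasoning_trace: str) -> float:
    # Ratio of compressed to raw size; lower values mean more repetitive traces
    raw = reasoning_trace.encode("utf-8")
    if not raw:
        return 1.0
    return len(gzip.compress(raw)) / len(raw)

# Highly compressible traces (low ratios) are a common signature of reasoning loops,
# which is the signal the Stage 5 `compression` tool inspects.
```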
For complete algorithms, see: Technical Details: Statistical Methodology
Deep-Dive Design Documents¶
These documents explain the design decisions behind Stage 3:
manifold.md - Two-Plane Data Model
- Why two planes (Evaluation × Task-Complexity)?
- Identity dimensions (5D uniqueness)
- Facet dimensions (multi-valued organization)
- Orthogonality principles and query patterns
reasonscore.md - Unified Metric Architecture
- Design philosophy (optimistic/pessimistic/punishing)
- Four-layer computation (samples → points → tasks → tiers)
- Why geometric mean across tasks?
- Token efficiency normalization
Stage 4 Implementation: Discovery¶
Purpose: Visual pattern recognition and hypothesis formation
Core Components¶
Leaderboard (leaderboard.py)
- Interactive rankings with ReasonScore
- Heatmap visualization (models × tasks × tiers)
- Group filtering for peer comparison
- Truncation indicators
Spider Plots (spiderweb.py)
- Radar charts across 12 reasoning domains
- Cognitive archetype identification
- Difficulty scaling visualization
- Token efficiency overlay
Explorer (explorer.py)
- Interactive 3D surface plots
- Multi-panel analysis (FFT, accuracy, tokens)
- Point inspection with sample viewing
- Capability boundary visualization
Discovery Workflows¶
Progressive Discovery Flow:
BROAD: Leaderboard → Identify interesting models
↓
FOCUSED: Spider → Understand cognitive profiles
↓
SPECIFIC: Explorer → Locate failure boundaries
For complete workflow patterns, see: workflow.md
Stage 5 Implementation: Investigation¶
Purpose: Systematic forensic analysis and root cause identification
Core Components¶
Unified Analysis Interface (analyze.py)
Discovery Support:
- evals - Evaluation discovery with fuzzy search
- tasks - Task structure discovery (surfaces/projections)
- scores - Statistical rankings with confidence intervals
- spiderweb - Per-model diagnostics
The Forensic Quartet:
- surface - Capability boundaries (OUTPUT space)
- fft - Tokenization analysis (INPUT space)
- compression - Information quality (REASONING space - spatial/entropy)
- hazard - Temporal degradation (REASONING space - temporal/timing)
Statistical Validation:
- cluster - CI-overlap grouping (distinguish signal from noise)
- modelinfo - Architecture-aware interpretation
Investigation Workflow¶
Example Root Cause Analysis:
Discovery: "Model X fails at arithmetic length=18"
↓
surface: "Failure boundary confirmed at length=18"
↓
fft: "Tokenization not the issue"
↓
compression: "Reasoning traces become 3x more compressible"
↓
ROOT CAUSE: Information loss / reasoning loops
For complete forensic toolkit reference, see: tools/analyze.md
The Discovery-Investigation Loop¶
Stages 4 and 5 aren't sequential—they form a research loop:
Discovery reveals patterns → Investigation explains mechanisms
↑ ↓
└──── Investigation finds anomalies ─┘
↓
Both inform Stage 1 (new manifolds)
Key insight: After Stage 3, research is iterative. You ping-pong between discovery and investigation based on what you're trying to understand.
For detailed workflow patterns, see: workflow.md
Implementation Architecture Summary¶
| Stage | Purpose | Key Tools | Deep-Dive Docs |
|---|---|---|---|
| 1. Definition | Test generation | tasks/, configs/, resolver.py | technical-details.md, tasks.md, config.md |
| 2. Execution | Inference at scale | runner.py, templates/, samplers/ | technical-details.md, tools/runner.md |
| 3. Evaluation | Statistical processing | evaluate.py, PointsDB | technical-details.md, manifold.md, reasonscore.md, pointsdb.md |
| 4. Discovery | Visual exploration | leaderboard.py, spiderweb.py, explorer.py | workflow.md, tools/leaderboard.md |
| 5. Investigation | Forensic analysis | analyze.py (9 subcommands) | tools/analyze.md, workflow.md |
Technical Details¶
For implementation-level details of the core mechanisms:
technical-details.md - Low-level algorithms and data structures
- Parametric Test Generation (coordinate-based seeding, manifold parameter types)
- Token-Frequency Domain Analysis (FFT methodology, interpretation)
- Progressive Evaluation Architecture (caching, adaptive sampling, truncation handling)
- Statistical Methodology (excess accuracy, Wilson CI, compression pre-computation)
- Mathematical foundations and data formats
See Also¶
Foundation Documents:
- challenges.md - The problems this implementation solves
- insight.md - The information processing paradigm
- architecture.md - The five-stage methodology
- manifold.md - Two-plane data model design decisions
- reasonscore.md - Unified metric architecture and design rationale
- technical-details.md - Low-level implementation algorithms
Reference Documentation:
- m12x.md - Reference evaluation proving this implementation works
- workflow.md - Four research workflow patterns
- tools.md - Complete tool reference
- config.md - Configuration reference
- pointsdb.md - Data structure API
- tasks.md - Abstract task API