# ReasonScape Implementation: The Python Codebase
Prerequisites: Before reading this document, familiarize yourself with:

- architecture.md - The five-stage methodology this codebase implements
## Overview
This document describes the Python implementation that brings the ReasonScape methodology to life. The codebase is organized around the five-stage architecture, with each stage implemented by specific tools and systems.
## Stage 1 Implementation: Definition
Purpose: Parametric test generation with coordinate-based seeding
### Core Components
Test Generators (tasks/)
- 12 reasoning domain implementations (arithmetic, logic, planning, etc.)
- Pydantic schemas for parameter validation
- Deterministic coordinate-based generation
- See tasks.md for complete task list and task API reference
Manifold Configurations (configs/*.yaml)
- Precision settings (sample counts per point)
- View definitions
- See config.md for complete configuration reference
For coordinate-based seeding, hierarchical sampling, and manifold parameter types, see: Technical Details: Parametric Test Generation
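To illustrate the idea behind deterministic coordinate-based generation: hashing the task name, the manifold coordinates, and a per-point sample index yields a seed, so the same coordinates always regenerate the same test instances. The function name `rng_for_point` and the key layout below are illustrative assumptions, not the actual tasks/ API:

```python
import hashlib
import random

def rng_for_point(task: str, coords: tuple, sample_idx: int) -> random.Random:
    """Derive a deterministic RNG from task name, manifold coordinates,
    and sample index, so identical coordinates reproduce identical tests."""
    key = f"{task}|{coords}|{sample_idx}".encode()
    # Fold the first 8 bytes of the SHA-256 digest into an integer seed
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return random.Random(seed)

# Same coordinates -> same stream of random draws
a = rng_for_point("arithmetic", (3, 0.5), 0)
b = rng_for_point("arithmetic", (3, 0.5), 0)
assert a.random() == b.random()
```

Because the seed depends only on the coordinates, no test corpus needs to be stored: any point on the manifold can be regenerated on demand.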
## Stage 2 Implementation: Execution
Purpose: Efficient inference at scale with caching and adaptive sampling
### Core Components
Execution Orchestrator (runner.py)
- Manages inference workflow
- Response caching (SHA-256 based)
- Hierarchical sampling coordination
- Truncation detection and handling
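A minimal sketch of SHA-256 based response caching, assuming the cache key covers everything that determines a completion (model, prompt, and sampler settings). The names `cache_key` and `complete` are illustrative, not runner.py's actual interface:

```python
import hashlib
import json

def cache_key(model: str, prompt: str, sampler: dict) -> str:
    """Hash every input that determines a completion; identical re-runs
    are then served from cache instead of hitting the inference server."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "sampler": sampler},
        sort_keys=True,  # canonical ordering so equal dicts hash equally
    )
    return hashlib.sha256(payload.encode()).hexdigest()

cache: dict[str, str] = {}

def complete(model: str, prompt: str, sampler: dict, call_api) -> str:
    key = cache_key(model, prompt, sampler)
    if key not in cache:
        cache[key] = call_api(prompt)  # only pay for a miss
    return cache[key]
```

Keying on the serialized sampler settings means a temperature change invalidates the cache for that prompt, while re-evaluating the same configuration is free.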
Chat Templates (templates/*.json)
- Model-specific prompt formatting
- Zero-shot CoT, few-shot, direct answer formats
- System message configuration
Sampling Configurations (samplers/*.json)
- Temperature, top-p, top-k, min-p settings
- Token budget configurations
- Model-specific optimizations
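A sampler configuration file might look like the following. The field names shown are common sampler parameters (temperature, top-p, top-k, min-p, token budget); the actual schema used in samplers/*.json may differ:

```json
{
  "temperature": 0.7,
  "top_p": 0.95,
  "top_k": 40,
  "min_p": 0.05,
  "max_tokens": 4096
}
```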
For response caching, hierarchical sampling, and adaptive evaluation, see: Technical Details: Progressive Evaluation Architecture
## Stage 3 Implementation: Evaluation
Purpose: Statistical rigor with excess accuracy correction and forensic pre-computation
### Core Components
Evaluation Processor (evaluate.py)
- Unified evaluation pipeline
- Dataset mode (batch processing)
- Interview mode (interactive testing)
- Pre-computation of forensic data
PointsDB (src/points_db.py)
- See pointsdb.md for complete API
Cohort Postprocessing (cohort.py)
- Creates context-limited variants of existing evaluations
- Non-destructive: new result folders with provenance metadata; re-run evaluate.py to rebuild the DB
Data Distribution (data.py)
- Content-addressed blob storage for sharing evaluation data (pull/push/status/prune)
- Selective pulls: database only (sufficient for Stages 4–5), specific cohorts, or full dataset
For excess accuracy correction, Wilson confidence intervals, truncation handling, and forensic pre-computation, see: Technical Details: Statistical Methodology
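For reference, the Wilson score interval mentioned above has a closed form. The sketch below implements the standard textbook formula (the function name is illustrative); unlike the normal approximation, it stays inside [0, 1] and remains well-behaved at small sample counts:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 -> 95% CI)."""
    if n == 0:
        return (0.0, 1.0)  # no data: maximally uncertain
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)
```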
## Deep-Dive Design Documents
These documents explain the design decisions behind Stage 3:
manifold.md - Two-Plane Data Model
- Why two planes (Evaluation × Task-Complexity)?
- Identity dimensions (5D uniqueness)
- Facet dimensions (multi-valued organization)
- Orthogonality principles and query patterns
reasonscore.md - Unified Metric Architecture
- Design philosophy (optimistic/pessimistic/punishing)
- Two-layer computation (samples → points → tasks)
- Why geometric mean across tasks?
- Token efficiency normalization
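To see why a geometric mean is used across tasks: it punishes uneven capability, since a near-zero score on any one task collapses the aggregate rather than being averaged away. A minimal sketch, in which the epsilon floor and function name are assumptions and not the actual computation described in reasonscore.md:

```python
import math

def geometric_mean(task_scores: list[float]) -> float:
    """Geometric mean across per-task scores: weakness on any single
    task drags the aggregate toward zero."""
    eps = 1e-9  # hypothetical floor to keep log() defined at zero
    logs = [math.log(max(s, eps)) for s in task_scores]
    return math.exp(sum(logs) / len(logs))

# Contrast with the arithmetic mean: [0.9, 0.9, 0.0] averages to 0.6,
# but its geometric mean is near zero.
```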
## Stage 4 Implementation: Discovery
Purpose: Visual pattern recognition and hypothesis formation
### Core Components
Leaderboard (leaderboard.py)
- Interactive rankings with ReasonScore
- Heatmap visualization (models × tasks × tiers)
- Group filtering for peer comparison
- Truncation indicators
Spider Plots (spiderweb.py)
- Radar charts across 12 reasoning domains
- Cognitive archetype identification
- Difficulty scaling visualization
- Token efficiency overlay
Explorer (explorer.py)
- Interactive 3D surface plots
- Multi-panel analysis (FFT, accuracy, tokens)
- Point inspection with sample viewing
- Capability boundary visualization
## Stage 5 Implementation: Investigation
Purpose: Systematic forensic analysis and root cause identification
Stage 5 is organized around three hierarchical workflows—the Three P's—each implemented by specific tools and operating at different data levels:
| P | Level | What It Implements |
|---|---|---|
| Position | PointsDB (ranked) | Statistical ranking across models |
| Profile | PointsDB (unranked) | Capability characterization and diagnosis |
| Probe | Raw NDJSON | Raw trace analysis |
### Core Components
Unified Analysis Interface (analyze.py) — discovery, position, and profile tools
Raw Trace Analysis (probe.py) — fft, failure inspection, and loop detection
## Implementation Architecture Summary
| Stage | Purpose | Key Tools | Deep-Dive Docs |
|---|---|---|---|
| 1. Definition | Test generation | tasks/, configs/ | tasks.md, config.md, technical-details.md |
| 2. Execution | Inference at scale | runner.py, templates/, samplers/ | tools.md#runnerpy, technical-details.md |
| 3. Evaluation | Statistical processing | evaluate.py, points_db.py, cohort.py, data.py | pointsdb.md, manifold.md, reasonscore.md, technical-details.md, tools.md#cohortpy, tools.md#datapy |
| 4. Discovery | Visual exploration | leaderboard.py, spiderweb.py, explorer.py | tools.md#leaderboardpy, tools.md#spiderwebpy, tools.md#explorerpy |
| 5. Investigation | The Three P's | analyze.py, probe.py | tools.md, workflow.md |
## See Also
- architecture.md - The five-stage methodology
- technical-details.md - Low-level implementation algorithms
- config.md - Configuration reference
- pointsdb.md - Data structure API
- tasks.md - Abstract task API