ReasonScape Implementation: The Python Codebase

Prerequisites: Before reading this document, familiarize yourself with:

  • architecture.md - The five-stage methodology this codebase implements

Overview

This document describes the Python implementation that brings the ReasonScape methodology to life. The codebase is organized around the five-stage architecture, with each stage implemented by specific tools and systems.

Stage 1 Implementation: Definition

Purpose: Parametric test generation with coordinate-based seeding

Core Components

Test Generators (tasks/)

  • 12 reasoning domain implementations (arithmetic, logic, planning, etc.)
  • Pydantic schemas for parameter validation
  • Deterministic coordinate-based generation
  • See tasks.md for complete task list and task API reference
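
The deterministic coordinate-based generation can be sketched roughly as follows. The function names and the exact seed derivation are hypothetical, but the key property matches the Stage 1 design: the same manifold coordinates always produce the same seed, and therefore the same test items.

```python
import hashlib
import random

def seed_for_point(task: str, coords: tuple) -> int:
    """Derive a deterministic RNG seed from a task name and manifold
    coordinates (illustrative; the real generators may combine fields
    differently)."""
    key = f"{task}:{':'.join(str(c) for c in coords)}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def generate_samples(task: str, coords: tuple, n: int) -> list:
    """Generate n test items reproducibly for one manifold point."""
    rng = random.Random(seed_for_point(task, coords))
    return [rng.randint(0, 99) for _ in range(n)]

# Same coordinates always yield the same test items.
assert generate_samples("arithmetic", (2, 3), 4) == generate_samples("arithmetic", (2, 3), 4)
```

Seeding from a hash of the coordinates, rather than from a global counter, means any point on the manifold can be regenerated in isolation without replaying the whole test suite.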

Manifold Configurations (configs/*.yaml)

  • Precision settings (sample counts per point)
  • View definitions
  • See config.md for complete configuration reference

For coordinate-based seeding, hierarchical sampling, and manifold parameter types, see: Technical Details: Parametric Test Generation


Stage 2 Implementation: Execution

Purpose: Efficient inference at scale with caching and adaptive sampling

Core Components

Execution Orchestrator (runner.py)

  • Manages inference workflow
  • Response caching (SHA-256 based)
  • Hierarchical sampling coordination
  • Truncation detection and handling
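
The SHA-256-based response cache can be sketched as below. The class and method names are illustrative, not runner.py's actual API; the point is that the cache key hashes both the prompt and the sampling parameters, so changing either forces a fresh inference call.

```python
import hashlib
import json

class ResponseCache:
    """Minimal in-memory sketch of a SHA-256-keyed response cache."""

    def __init__(self):
        self._store = {}

    def key(self, prompt: str, params: dict) -> str:
        # sort_keys makes the hash stable regardless of dict ordering
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get_or_call(self, prompt: str, params: dict, infer):
        k = self.key(prompt, params)
        if k not in self._store:
            self._store[k] = infer(prompt)  # cache miss: run inference once
        return self._store[k]

calls = []
def fake_model(prompt):
    calls.append(prompt)
    return prompt.upper()

cache = ResponseCache()
cache.get_or_call("2+2=", {"temperature": 0.0}, fake_model)
cache.get_or_call("2+2=", {"temperature": 0.0}, fake_model)
assert len(calls) == 1  # second request served from cache
```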

Chat Templates (templates/*.json)

  • Model-specific prompt formatting
  • Zero-shot CoT, few-shot, direct answer formats
  • System message configuration
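
A chat template can be thought of as a small JSON document that wraps a question in model-specific control tokens. The field names and tokens below are hypothetical, chosen only to illustrate the idea; the actual schema of templates/*.json may differ.

```python
import json

# Hypothetical template in the spirit of templates/*.json.
template = json.loads("""
{
  "system": "You are a careful reasoner.",
  "user_prefix": "<|user|>\\n",
  "assistant_prefix": "<|assistant|>\\n"
}
""")

def render(tpl: dict, question: str) -> str:
    """Assemble a model-specific prompt from a template and a question."""
    return (
        tpl["system"] + "\n"
        + tpl["user_prefix"] + question + "\n"
        + tpl["assistant_prefix"]
    )

prompt = render(template, "What is 17 * 3?")
```

Keeping the formatting in data rather than code means a new model family only needs a new JSON file, not a code change.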

Sampling Configurations (samplers/*.json)

  • Temperature, top-p, top-k, min-p settings
  • Token budget configurations
  • Model-specific optimizations
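
A sampler preset is just a bag of decoding parameters merged into each request. The keys below are illustrative (the real samplers/*.json files may name fields differently), but they show why presets live in separate files: one preset can be reused across every prompt sent to a model.

```python
import json

# Illustrative sampler preset; field names are assumptions, not the real schema.
preset = json.loads(
    '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "max_tokens": 4096}'
)

def build_request(model: str, prompt: str, sampler: dict) -> dict:
    """Merge a sampler preset into an OpenAI-style completion request."""
    return {"model": model, "prompt": prompt, **sampler}

req = build_request("my-model", "2+2=", preset)
assert req["temperature"] == 0.6 and req["max_tokens"] == 4096
```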

For response caching, hierarchical sampling, and adaptive evaluation, see: Technical Details: Progressive Evaluation Architecture


Stage 3 Implementation: Evaluation

Purpose: Statistical rigor with excess accuracy correction and forensic pre-computation

Core Components

Evaluation Processor (evaluate.py)

  • Unified evaluation pipeline
  • Dataset mode (batch processing)
  • Interview mode (interactive testing)
  • Pre-computation of forensic data

PointsDB (src/points_db.py)

  • Database of pre-computed evaluation points consumed by the Stage 4-5 tools
  • See pointsdb.md for the complete reference

Cohort Postprocessing (cohort.py)

  • Creates context-limited variants of existing evaluations
  • Non-destructive: new result folders with provenance metadata; re-run evaluate.py to rebuild the DB
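
One plausible reading of a "context-limited variant" is sketched below: records whose completions exceed a smaller hypothetical budget are flagged as truncated, and the derived cohort carries provenance metadata back to its source. All names here are illustrative, not cohort.py's actual format.

```python
def context_limited(records: list, max_tokens: int) -> dict:
    """Derive a context-limited cohort from existing evaluation records
    (sketch; field names are hypothetical)."""
    variant = []
    for rec in records:
        clipped = dict(rec)  # non-destructive: the original record is untouched
        if rec["completion_tokens"] > max_tokens:
            clipped["truncated"] = True
        variant.append(clipped)
    return {
        "provenance": {"derived_from": "original-run", "max_tokens": max_tokens},
        "records": variant,
    }

records = [
    {"id": 1, "completion_tokens": 900},
    {"id": 2, "completion_tokens": 3000},
]
variant = context_limited(records, max_tokens=2048)
assert variant["records"][1]["truncated"] is True
```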

Data Distribution (data.py)

  • Content-addressed blob storage for sharing evaluation data (pull/push/status/prune)
  • Selective pulls: database only (sufficient for Stages 4–5), specific cohorts, or full dataset
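
Content-addressed storage names each blob by the hash of its bytes, which is what makes pull/push idempotent and deduplicated. A minimal in-memory sketch (not data.py's actual interface):

```python
import hashlib

class BlobStore:
    """Minimal content-addressed blob store sketch."""

    def __init__(self):
        self._blobs = {}

    def push(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._blobs[digest] = data  # idempotent: same content, same address
        return digest

    def pull(self, digest: str) -> bytes:
        return self._blobs[digest]

store = BlobStore()
addr = store.push(b"evaluation results")
assert store.pull(addr) == b"evaluation results"
assert addr == store.push(b"evaluation results")  # deduplicated by content
```

Because the address is derived from the content, a pulled blob can always be verified by re-hashing it, and pushing the same data twice costs nothing.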

For excess accuracy correction, Wilson confidence intervals, truncation handling, and forensic pre-computation, see: Technical Details: Statistical Methodology
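
The two statistical ideas can be sketched briefly. The Wilson interval formula below is standard; the excess-accuracy correction is shown in a simple rescaled form, which may differ in detail from what evaluate.py implements.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin, center + margin)

def excess_accuracy(p: float, chance: float) -> float:
    """Accuracy above the random-guess baseline, rescaled to [0, 1].
    Sketch only; the codebase's exact correction may differ."""
    return max(0.0, (p - chance) / (1 - chance))

lo, hi = wilson_interval(80, 100)          # 80/100 correct
corrected = excess_accuracy(0.75, 0.25)    # 4-way multiple choice at 75%
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample counts, which matters for sparsely sampled manifold points.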

Deep-Dive Design Documents

These documents explain the design decisions behind Stage 3:

manifold.md - Two-Plane Data Model

  • Why two planes (Evaluation × Task-Complexity)?
  • Identity dimensions (5D uniqueness)
  • Facet dimensions (multi-valued organization)
  • Orthogonality principles and query patterns

reasonscore.md - Unified Metric Architecture

  • Design philosophy (optimistic/pessimistic/punishing)
  • Two-layer computation (samples → points → tasks)
  • Why geometric mean across tasks?
  • Token efficiency normalization
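
The effect of the geometric mean can be shown in a few lines. This is a generic sketch of the aggregation idea, not ReasonScore's full formula: two models with the same arithmetic mean score differently when one of them has a weak domain.

```python
import math

def geometric_mean(task_scores: list) -> float:
    """Geometric mean across per-task scores. A single weak domain drags
    the aggregate down more than an arithmetic mean would."""
    eps = 1e-9  # guard: a literal zero would collapse the product
    logs = [math.log(max(s, eps)) for s in task_scores]
    return math.exp(sum(logs) / len(logs))

balanced = geometric_mean([0.8, 0.8, 0.8])  # uniform profile
spiky = geometric_mean([1.0, 1.0, 0.4])     # same arithmetic mean, one weak task
assert abs(balanced - 0.8) < 1e-6
assert spiky < balanced  # the spiky profile is penalized
```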

Stage 4 Implementation: Discovery

Purpose: Visual pattern recognition and hypothesis formation

Core Components

Leaderboard (leaderboard.py)

  • Interactive rankings with ReasonScore
  • Heatmap visualization (models × tasks × tiers)
  • Group filtering for peer comparison
  • Truncation indicators

Spider Plots (spiderweb.py)

  • Radar charts across 12 reasoning domains
  • Cognitive archetype identification
  • Difficulty scaling visualization
  • Token efficiency overlay

Explorer (explorer.py)

  • Interactive 3D surface plots
  • Multi-panel analysis (FFT, accuracy, tokens)
  • Point inspection with sample viewing
  • Capability boundary visualization

Stage 5 Implementation: Investigation

Purpose: Systematic forensic analysis and root cause identification

Stage 5 is organized around three hierarchical workflows—the Three P's—each implemented by specific tools and operating at different data levels:

| P | Level | What It Implements |
| --- | --- | --- |
| Position | PointsDB (ranked) | Statistical ranking across models |
| Profile | PointsDB (unranked) | Capability characterization and diagnosis |
| Probe | Raw NDJSON | Raw trace analysis |

Core Components

Unified Analysis Interface (analyze.py)

  • Discovery, position, and profile tools

Raw Trace Analysis (probe.py)

  • FFT analysis, failure inspection, and loop detection
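
Loop detection on a raw trace can be as simple as checking whether some chunk near the end repeats back-to-back. This is a deliberately naive sketch; probe.py's detector is presumably more sophisticated, and the thresholds here are arbitrary.

```python
def detect_loop(text: str, min_len: int = 6, min_repeats: int = 4) -> bool:
    """Flag a degenerate generation loop: some chunk at the end of the
    trace repeats consecutively at least min_repeats times."""
    tail = text[-2000:]  # only the end of the trace matters for loops
    for size in range(min_len, len(tail) // min_repeats + 1):
        chunk = tail[-size:]
        if chunk * min_repeats == tail[-size * min_repeats:]:
            return True
    return False

assert detect_loop("The answer is " + "maybe " * 20)
assert not detect_loop("The answer is 42.")
```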


Implementation Architecture Summary

| Stage | Purpose | Key Tools | Deep-Dive Docs |
| --- | --- | --- | --- |
| 1. Definition | Test generation | tasks/, configs/ | tasks.md, config.md, technical-details.md |
| 2. Execution | Inference at scale | runner.py, templates/, samplers/ | tools.md#runnerpy, technical-details.md |
| 3. Evaluation | Statistical processing | evaluate.py, points_db.py, cohort.py, data.py | pointsdb.md, manifold.md, reasonscore.md, technical-details.md, tools.md#cohortpy, tools.md#datapy |
| 4. Discovery | Visual exploration | leaderboard.py, spiderweb.py, explorer.py | tools.md#leaderboardpy, tools.md#spiderwebpy, tools.md#explorerpy |
| 5. Investigation | The Three P's | analyze.py, probe.py | tools.md, workflow.md |

See Also