
ReasonScape Implementation: The Python Codebase

Prerequisites: Before reading this document, familiarize yourself with:

  • architecture.md - The five-stage methodology this codebase implements

Overview

This document describes the Python implementation that brings the ReasonScape methodology to life. The codebase is organized around the five-stage architecture, with each stage implemented by specific tools and systems.

Stage 1 Implementation: Definition

Purpose: Parametric test generation with coordinate-based seeding

Core Components

Test Generators (tasks/)

  • 12 reasoning domain implementations (arithmetic, logic, planning, etc.)
  • Pydantic schemas for parameter validation
  • Deterministic coordinate-based generation
  • See tasks.md for complete task list and task API reference

Manifold Configurations (configs/*.yaml)

  • Tier definitions (easy/medium/hard/ultra)
  • Surface definitions (2D difficulty slices)
  • Projection definitions (1D difficulty sweeps)
  • Precision settings (sample counts per point)
  • See config.md for complete configuration reference

Configuration System (resolver.py)

  • Validates manifold definitions
  • Expands (degree, density) into concrete sampling grids
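
As a rough illustration of this expansion (not the resolver.py API), a (degree, density) pair can be turned into a concrete sampling grid along one axis; the expand_grid helper and its rounding below are illustrative assumptions.

# Hypothetical sketch: expand a (degree, density) pair into a concrete sampling grid
def expand_grid(degree_range, density):
    """degree_range = (min, max) along one difficulty axis; density = number of grid points."""
    lo, hi = degree_range
    step = (hi - lo) / max(density - 1, 1)
    return [round(lo + i * step, 6) for i in range(density)]

# e.g. expand_grid((2, 10), 5) -> [2.0, 4.0, 6.0, 8.0, 10.0]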

Key Algorithms

Coordinate-Based Seeding:

# From runner.py: derive a deterministic RNG seed from the point's coordinates
import hashlib
import json
import random

seed_params = {k: v for k, v in params.items() if k != 'count'}  # drop sample count so it cannot change the seed
param_hash = hashlib.sha256(json.dumps(seed_params, sort_keys=True).encode()).hexdigest()
base_seed = int(param_hash[-8:], 16)  # last 8 hex digits -> 32-bit integer seed
generator.rng = random.Random(global_seed + base_seed)

Properties:

  • Same coordinates → identical test sequences
  • Hierarchical sampling (count=32 ⊂ count=128)
  • Reproducible yet infinite test space
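
A toy illustration of these properties, assuming a generator whose rng drives all randomness; the ToyGenerator class here is a hypothetical stand-in, not a ReasonScape task generator. Because the seed ignores count, a count=128 run begins with exactly the same tests as a count=32 run at the same coordinates.

import random

class ToyGenerator:                        # hypothetical stand-in for a task generator
    def __init__(self, seed):
        self.rng = random.Random(seed)
    def generate(self, count):
        return [self.rng.randint(0, 99) for _ in range(count)]

small = ToyGenerator(seed=1234).generate(32)
large = ToyGenerator(seed=1234).generate(128)
assert small == large[:32]                 # count=32 is a strict prefix of count=128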

For detailed algorithms and parameter types, see: Technical Details: Parametric Test Generation


Stage 2 Implementation: Execution

Purpose: Efficient inference at scale with caching and adaptive sampling

Core Components

Execution Orchestrator (runner.py)

  • Manages inference workflow
  • Response caching (SHA-256 based)
  • Hierarchical sampling coordination
  • Truncation detection and handling

Chat Templates (templates/*.json)

  • Model-specific prompt formatting
  • Zero-shot CoT, few-shot, direct answer formats
  • System message configuration

Sampling Configurations (samplers/*.json)

  • Temperature, top-p, top-k, min-p settings
  • Token budget configurations
  • Model-specific optimizations

Key Mechanisms

Response Caching:

  • Hash: SHA-256 of (model, messages, parameters)
  • Storage: NDJSON response traces
  • Typical cost reduction: 60-80% for multi-tier evaluation
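
A hedged sketch of how such a cache key could be derived; the field names mirror the bullet above, but the exact serialization used by runner.py may differ from this illustration.

import hashlib
import json

def cache_key(model, messages, parameters):
    # Canonical JSON (sorted keys) so that key order does not change the hash
    payload = json.dumps(
        {"model": model, "messages": messages, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

# Cached responses are appended to an NDJSON trace, one JSON object per line.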

Adaptive Sampling:

  • Start with minimum samples (e.g., 32)
  • Compute Wilson CI width
  • Continue until confidence target met
  • Early stopping for high-truncation points
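
A simplified sketch of this loop. wilson_halfwidth implements the standard Wilson score interval half-width; the run_batch callback, batch sizes, and confidence target are illustrative assumptions rather than runner.py defaults.

import math

def wilson_halfwidth(successes, n, z=1.96):
    """Half-width of the Wilson score interval for n trials."""
    if n == 0:
        return 1.0
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / (1 + z**2 / n)

def sample_until_confident(run_batch, target=0.05, start=32, max_samples=512):
    successes = n = 0
    while n < max_samples:
        batch = run_batch(start if n == 0 else 32)  # minimum batch first, then top up
        successes += sum(batch)                     # batch is a list of booleans (correct / incorrect)
        n += len(batch)
        if wilson_halfwidth(successes, n) <= target:
            break                                   # confidence target met, stop sampling
    return successes, n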

For detailed implementation, see: Technical Details: Progressive Evaluation Architecture


Stage 3 Implementation: Evaluation

Purpose: Statistical rigor with excess accuracy correction and forensic pre-computation

Core Components

Evaluation Processor (evaluate.py)

  • Unified evaluation pipeline
  • Dataset mode (batch processing)
  • Interview mode (interactive testing)
  • Pre-computation of forensic data

Data Storage (PointsDB/DuckDB)

  • Two-plane data model
  • Per-point statistics storage
  • Compression arrays for forensic analysis
  • See pointsdb.md for complete API

Key Mechanisms

Excess Accuracy Correction:

# Remove expected guessing contributions from both successes and trials
# (guess_chance = probability that a sample is answered correctly by random guessing)
guess_accumulator = sum(r.guess_chance for r in results if not r.truncated)
adjusted_successes = correct - guess_accumulator
adjusted_trials = total - guess_accumulator
accuracy = adjusted_successes / adjusted_trials  # 0.0 = chance-level guessing, 1.0 = perfect

Wilson Confidence Intervals:

  • Handles small samples and extreme probabilities
  • Better than normal approximation
  • Used at both point-level and task-level aggregation
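
For reference, the Wilson score interval for k successes in n trials with normal quantile z (e.g. z = 1.96 for 95% confidence) is:

\hat{p} = \frac{k}{n}, \qquad
\text{CI} = \frac{\hat{p} + \frac{z^2}{2n} \;\pm\; z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z^2}{4n^2}}}{1 + \dfrac{z^2}{n}}

Unlike the normal approximation, this interval stays inside [0, 1] and remains well-behaved when k is 0 or n, which is exactly the regime near capability boundaries.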

Truncation Handling:

  • Tracked separately from errors
  • Reduces effective sample size
  • Reported explicitly in all visualizations

Forensic Pre-Computation:

  • Compression arrays: gzip(reasoning_trace) for every response
  • FFT arrays: Ready-to-use inputs for token-frequency domain analysis
  • Token distributions: Separate tracking for correct/incorrect/truncated
  • 10-100x speedup for Stage 5 investigations
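
A minimal sketch of the compression-array idea, assuming traces are plain strings; the actual storage layout lives in PointsDB (see pointsdb.md) and may differ from this illustration.

import gzip

def compression_ratio(reasoning_trace: str) -> float:
    raw = reasoning_trace.encode("utf-8")
    compressed = gzip.compress(raw)
    return len(compressed) / max(len(raw), 1)   # lower ratio = more redundant, repetitive text

# Pre-computing this for every response lets Stage 5 ask
# "how compressible are traces near the failure boundary?" without re-reading raw text.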

For complete algorithms, see: Technical Details: Statistical Methodology

Deep-Dive Design Documents

These documents explain the design decisions behind Stage 3:

manifold.md - Two-Plane Data Model

  • Why two planes (Evaluation × Task-Complexity)?
  • Identity dimensions (5D uniqueness)
  • Facet dimensions (multi-valued organization)
  • Orthogonality principles and query patterns

reasonscore.md - Unified Metric Architecture

  • Design philosophy (optimistic/pessimistic/punishing)
  • Four-layer computation (samples → points → tasks → tiers)
  • Why geometric mean across tasks?
  • Token efficiency normalization
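
As a rough illustration of the geometric mean's effect at the task layer (a single weak domain pulls the overall score down far more than an arithmetic mean would), here is a minimal sketch; the actual four-layer computation is documented in reasonscore.md.

import math

def geometric_mean(task_scores):
    # A near-zero task drags the whole score down, unlike the arithmetic mean
    return math.exp(sum(math.log(max(s, 1e-9)) for s in task_scores) / len(task_scores))

balanced = geometric_mean([0.7, 0.7, 0.7])     # ≈ 0.70
spiky    = geometric_mean([0.99, 0.99, 0.12])  # ≈ 0.49, despite an arithmetic mean of 0.70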

Stage 4 Implementation: Discovery

Purpose: Visual pattern recognition and hypothesis formation

Core Components

Leaderboard (leaderboard.py)

  • Interactive rankings with ReasonScore
  • Heatmap visualization (models × tasks × tiers)
  • Group filtering for peer comparison
  • Truncation indicators

Spider Plots (spiderweb.py)

  • Radar charts across 12 reasoning domains
  • Cognitive archetype identification
  • Difficulty scaling visualization
  • Token efficiency overlay

Explorer (explorer.py)

  • Interactive 3D surface plots
  • Multi-panel analysis (FFT, accuracy, tokens)
  • Point inspection with sample viewing
  • Capability boundary visualization

Discovery Workflows

Progressive Discovery Flow:

BROAD: Leaderboard → Identify interesting models
   ↓
FOCUSED: Spider → Understand cognitive profiles
   ↓
SPECIFIC: Explorer → Locate failure boundaries

For complete workflow patterns, see: workflow.md


Stage 5 Implementation: Investigation

Purpose: Systematic forensic analysis and root cause identification

Core Components

Unified Analysis Interface (analyze.py)

Discovery Support:

  • evals - Evaluation discovery with fuzzy search
  • tasks - Task structure discovery (surfaces/projections)
  • scores - Statistical rankings with confidence intervals
  • spiderweb - Per-model diagnostics

The Forensic Quartet:

  • surface - Capability boundaries (OUTPUT space)
  • fft - Tokenization analysis (INPUT space)
  • compression - Information quality (REASONING space - spatial/entropy)
  • hazard - Temporal degradation (REASONING space - temporal/timing)

Statistical Validation:

  • cluster - CI-overlap grouping (distinguish signal from noise)
  • modelinfo - Architecture-aware interpretation
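
A hedged sketch of CI-overlap grouping as performed by the cluster subcommand above: models whose confidence intervals overlap are placed in the same group, so ranking differences inside a group should not be read as signal. The tuple format and greedy grouping rule below are illustrative assumptions.

def cluster_by_ci_overlap(models):
    """models: list of (name, ci_low, ci_high), sorted by score descending (illustrative format)."""
    if not models:
        return []
    groups, current = [], [models[0]]
    for name, low, high in models[1:]:
        leader_low = current[0][1]        # lower CI bound of the group's top-ranked model
        if high >= leader_low:            # overlapping CIs -> statistically indistinguishable
            current.append((name, low, high))
        else:
            groups.append(current)
            current = [(name, low, high)]
    groups.append(current)
    return groups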

Investigation Workflow

Example Root Cause Analysis:

Discovery: "Model X fails at arithmetic length=18"
    ↓
surface: "Failure boundary confirmed at length=18"
    ↓
fft: "Tokenization not the issue"
    ↓
compression: "Reasoning traces become 3x more compressible"
    ↓
ROOT CAUSE: Information loss / reasoning loops

For complete forensic toolkit reference, see: tools/analyze.md


The Discovery-Investigation Loop

Stages 4 and 5 aren't sequential—they form a research loop:

Discovery reveals patterns → Investigation explains mechanisms
       ↑                                    ↓
       └──── Investigation finds anomalies ─┘
                           ↓
                 Both inform Stage 1 (new manifolds)

Key insight: After Stage 3, research is iterative. You ping-pong between discovery and investigation based on what you're trying to understand.

For detailed workflow patterns, see: workflow.md


Implementation Architecture Summary

Stage | Purpose | Key Tools | Deep-Dive Docs
1. Definition | Test generation | tasks/, configs/, resolver.py | technical-details.md, tasks.md, config.md
2. Execution | Inference at scale | runner.py, templates/, samplers/ | technical-details.md, tools/runner.md
3. Evaluation | Statistical processing | evaluate.py, PointsDB | technical-details.md, manifold.md, reasonscore.md, pointsdb.md
4. Discovery | Visual exploration | leaderboard.py, spiderweb.py, explorer.py | workflow.md, tools/leaderboard.md
5. Investigation | Forensic analysis | analyze.py (9 subcommands) | tools/analyze.md, workflow.md

Technical Details

For implementation-level details of the core mechanisms:

technical-details.md - Low-level algorithms and data structures:

  • Parametric Test Generation (coordinate-based seeding, manifold parameter types)
  • Token-Frequency Domain Analysis (FFT methodology, interpretation)
  • Progressive Evaluation Architecture (caching, adaptive sampling, truncation handling)
  • Statistical Methodology (excess accuracy, Wilson CI, compression pre-computation)
  • Mathematical foundations and data formats


See Also

Foundation Documents:

Reference Documentation: