ReasonScape Architecture: The Methodology¶
Prerequisites: Before reading this document, familiarize yourself with:

- challenges.md - The eight fundamental problems this methodology addresses
- insight.md - The information processing paradigm that informs the design
Overview¶
ReasonScape's methodology addresses the eight challenges in current LLM evaluation through a systematic, information-theoretic approach grounded in the insight that LLMs are information processors.
How the methodology addresses each challenge:
| Challenge | Solution |
|---|---|
| 1. Doesn't Know What It's Asking | Parametric manifolds with coordinate-based test generation |
| 2. Doesn't Know Which Answers | Per-point evaluation with Wilson confidence intervals |
| 3. Doesn't Understand Reasoning Process | Information-theoretic forensics (compression, FFT, hazard) |
| 4. Can't Distinguish Signal from Noise | Excess accuracy correction and proper uncertainty quantification |
| 5. Trivially Gameable | Deterministic but unmemorizable coordinate-based generation |
| 6. Ceiling Effects | Parametric difficulty scaling that adapts to model capabilities |
| 7. Ignores Truncations and Context Failures | Explicit truncation tracking and penalty in ReasonScore |
| 8. Ignores Token Budget and Resource Efficiency | score/token metric and per-point token consumption tracking |
The Five-Stage Architecture¶
The architectural solution is a multi-stage data-processing pipeline:
```mermaid
graph TB
    subgraph "Data Pipeline"
        A[Stage 1: Definition] --> B[Stage 2: Execution]
        B --> C[Stage 3: Evaluation]
    end
    C --> D[PointsDB]
    D --> E[Stage 4: Discovery]
    D --> F[Stage 5: Investigation]
    subgraph "Research Loop"
        E <-.ping-pong.-> F
    end
    E --> G[Research Insights]
    F --> G
    G -.inform.-> A

    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#fce4ec
    style G fill:#ffebee
```
Stage 1: Definition — Parametric Test Generation¶
Key innovation: Test generators create infinite unique instances within controlled difficulty manifolds.
Every test is deterministically generated from coordinate seeds:
```
seed = hash(task, parameters, sample_index)
```
Same coordinates always produce same test sequence. Different coordinates produce different tests. The manifold is infinite but reproducible.
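As a rough illustration of the idea, a seed can be derived by hashing the coordinate tuple. The sketch below uses assumed names (`point_seed`, SHA-256, JSON key ordering) and is not ReasonScape's actual seeding code; see technical-details.md for the real algorithm.

```python
import hashlib
import json
import random

def point_seed(task: str, parameters: dict, sample_index: int) -> int:
    """Illustrative only: derive a stable 64-bit seed from manifold coordinates."""
    key = json.dumps({"task": task, "params": parameters, "i": sample_index},
                     sort_keys=True)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same coordinates -> same seed -> the same test instance on every run.
rng = random.Random(point_seed("objects", {"length": 8, "depth": 2}, 0))
```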
Manifold dimensions control difficulty:
- Length (working memory load)
- Depth (structural complexity)
- Interference (selective attention demand)
- Format (tokenization stress)
- Multi-step operations (sequential reasoning)
Progressive complexity controls:
- Precision (low/medium/high): How many tests per point
See technical-details.md for coordinate-based seeding algorithm and manifold resolution mechanics.
Stage 2: Execution — Efficient Inference at Scale¶
Output: runner.py writes per-test steps to results/… as NDJSON (0th-level, unaggregated records: task, degree, precision, eval_id, full inputs/outputs, meta). Nothing else reads these directly except Stage 3.
Key innovations: Response caching and hierarchical evaluation.
Response caching:
- Every unique prompt is cached
- Deterministic generation ensures cache hits
- Typical cost reduction: 30-60% across evaluation runs
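A minimal sketch of the caching idea, assuming a cache keyed on a hash of the full request (prompt plus sampler settings). The names `cached_complete` and `call_model` are illustrative, and a real runner would persist the cache to disk so repeated evaluation runs benefit.

```python
import hashlib
import json

cache: dict[str, str] = {}  # request-key -> model response (on disk in practice)

def cached_complete(prompt: str, sampler: dict, call_model) -> str:
    """Illustrative only: reuse a stored response when the exact request repeats."""
    key = hashlib.sha256(json.dumps({"prompt": prompt, "sampler": sampler},
                                    sort_keys=True).encode()).hexdigest()
    if key not in cache:
        cache[key] = call_model(prompt, **sampler)  # only pay for unseen prompts
    return cache[key]
```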
Hierarchical sampling:
- Tests at count=32 are a perfect subset of count=128
- Can upsample without waste
- Can downsample for quick comparison
- Supports progressive evaluation workflows
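Why the subset property holds, sketched with the hypothetical `point_seed` from the seeding sketch above (`make_test` is likewise illustrative): indices 0-31 of a 128-test point use exactly the same seeds as a 32-test point, so the smaller run is a strict prefix of the larger one.

```python
def point_tests(task, params, count):
    """Illustrative only: generate `count` deterministic tests for one manifold point."""
    return [make_test(point_seed(task, params, i)) for i in range(count)]

small = point_tests("objects", {"length": 8}, 32)
large = point_tests("objects", {"length": 8}, 128)
assert large[:32] == small  # upsampling reuses existing work; downsampling is free
```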
See technical-details.md for caching implementation and truncation-aware execution.
Stage 3: Evaluation — Statistical Rigor Without Lies¶
Output: evaluate.py consumes step NDJSON and writes per-eval_id points into PointsDB (1st-level aggregates: outcome, tokens, compressed length, excess-accuracy adjusted). All downstream tools (Stages 4–5) work from these points and their higher-level aggregations (vectors → KPIs); they never read raw steps.
Key innovations: Excess accuracy correction, truncation awareness, and pre-computed forensics.
Excess accuracy correction:
- Removes expected guessing contributions
- 0.000 = no better than guessing
- 1.000 = perfect knowledge
- Fair comparison across all task types
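A sketch of one standard form of guessing correction, assuming a known per-task chance rate (for example 0.25 for four-option multiple choice). Whether ReasonScape uses exactly this formula, and how the chance rate is derived per task, is covered in technical-details.md.

```python
def excess_accuracy(observed: float, guess_rate: float) -> float:
    """Illustrative only: rescale accuracy so chance performance maps to 0.0."""
    # Clamping below-chance results to zero is an assumption of this sketch.
    return max(0.0, (observed - guess_rate) / (1.0 - guess_rate))

excess_accuracy(0.25, 0.25)  # 0.000 - no better than guessing
excess_accuracy(1.00, 0.25)  # 1.000 - perfect knowledge
```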
Truncation awareness (Challenge #7):
- Truncations tracked separately from errors
- Not wrong answers—context limit failures that waste resources
- Handled via probability multiplication in ReasonScore (joint mode: P[Correct|U] × P[Untrunc])
- Truncations widen confidence intervals (reduced effective sample size)
- Reported explicitly in all visualizations
- Why this matters: Pass@k metrics hide that a model might need 10 attempts to produce valid output, masking deployment reliability issues
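The joint mode described above combines two probabilities; a sketch with illustrative names:

```python
def joint_success(p_correct_given_untruncated: float, p_untruncated: float) -> float:
    """Illustrative only: a truncated response can never count as a success."""
    return p_correct_given_untruncated * p_untruncated

# 90% accurate when it finishes, but truncated 30% of the time:
joint_success(0.90, 0.70)  # 0.63 - context failures drag the score down
```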
Pre-computed forensics:
- Compression arrays (for entropy analysis)
- FFT arrays (for spectral analysis)
- Token distributions (for hazard analysis)
- 10-100x speedup for Stage 5 investigations
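As a rough illustration of what the compression arrays capture (the actual compressor, windowing, and array layout are implementation details), repetitive loop-like output compresses far better than genuinely novel reasoning, so compression ratio acts as a cheap entropy proxy:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Illustrative only: lower ratios indicate more repetitive (lower-entropy) output."""
    raw = text.encode()
    return len(zlib.compress(raw)) / max(1, len(raw))

looping = "Wait, let me re-check the previous step. " * 100  # degenerate loop
compression_ratio(looping)  # very low - the trace is mostly repetition;
                            # novel reasoning of the same length compresses far less
```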
See technical-details.md for Wilson CI algorithm, excess accuracy computation, and pre-computation mechanics.
The Two-Plane Data Model¶
ReasonScape organizes evaluation data using PointsDB, a two-plane, three-layer structure. Each point exists simultaneously in an Evaluation Plane (model, template, sampler) and a Task-Complexity Plane (base_task, params). The two planes are orthogonal: the same model can be tested at many difficulty levels, and many models can be tested at the same difficulty level.
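A sketch of the identity scheme, with illustrative field names (the authoritative schema and identity rules live in manifold.md and pointsdb.md): a point is addressed by one key from each plane, and the two keys vary independently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalKey:        # Evaluation Plane
    model: str
    template: str
    sampler: str

@dataclass(frozen=True)
class TaskKey:        # Task-Complexity Plane
    base_task: str
    params: tuple     # e.g. (("length", 8), ("depth", 2))

# One point per (EvalKey, TaskKey) pair - the planes vary orthogonally.
point_id = (EvalKey("model-a", "chat", "greedy"),
            TaskKey("objects", (("length", 8), ("depth", 2))))
```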
For the complete data model — design rationale, layer definitions, facet computation, and identity rules: See manifold.md
For complete PointsDB API and query patterns: See pointsdb.md
ReasonScore: The Unified Metric¶
ReasonScore captures six dimensions of model performance in a single interpretable number:
What it measures:
- Accuracy - Correctness above random guessing baseline (Challenge #4)
- Statistical confidence - Uncertainty from finite sampling (Challenge #4)
- Context reliability - Truncation and context limit issues (Challenge #7)
- Task balance - Performance consistency across reasoning domains (Challenge #2)
- Difficulty scaling - Capability maintenance under increasing complexity (Challenge #1, #6)
- Token efficiency - Computational cost per unit of quality (Challenge #8)
How it's computed (v2, 2-layer):
- Layer 1: Samples → Task Score [per-task probability-space computation]
- Layer 2: Tasks → ReasonScore [bootstrap geometric mean × 1000]
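A simplified sketch of Layer 2, assuming per-task scores already computed in probability space. The bootstrap procedure, weighting, and confidence handling shown here are deliberately reduced; reasonscore.md specifies the real computation.

```python
import random
from statistics import geometric_mean

def reasonscore(task_scores: list[float], n_boot: int = 1000) -> float:
    """Illustrative only: bootstrap the geometric mean of task scores, scale by 1000."""
    eps = 1e-6  # guard: geometric_mean rejects zeros
    boots = []
    for _ in range(n_boot):
        resampled = random.choices(task_scores, k=len(task_scores))
        boots.append(geometric_mean([max(s, eps) for s in resampled]))
    return 1000 * sum(boots) / n_boot

reasonscore([0.90, 0.85, 0.80, 0.05])  # one catastrophic task drags the whole score down
```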
Design philosophy:
- Probability-space truncation - Truncation modeled as (1 - P[Trunc]) × P[Correct], not subtracted
- Punish imbalance - Geometric mean across tasks (catastrophic failure in one domain hurts overall score)
- Preserved uncertainty - Confidence intervals carried through bootstrap aggregation
- Account for efficiency - score/token ratio makes efficiency a first-class concern (Challenge #8)
Why geometric mean?
Unlike arithmetic mean, geometric mean penalizes inconsistency. Being great at 11 tasks doesn't excuse catastrophic failure at 1 task. In real deployment, users hit all task types.
Why score/token?
Two models with identical accuracy can differ by 5-10x in token consumption. Model A at 500 tokens/problem and Model B at 5,000 tokens/problem have radically different deployment characteristics:
- Cost: 10x difference in API bills
- Latency: 10x difference in user wait times
- Throughput: 10x difference in concurrent users supported
- Environment: 10x difference in energy consumption
Accuracy-only metrics treat these as equivalent. They're not. The final score/token ratio makes efficiency a first-class concern, answering: "How much quality per unit of resource?"
For complete layer-by-layer computation, design rationale, and philosophical motivation: See reasonscore.md
Stage 4: Discovery — Visual Pattern Recognition¶
Purpose: Answer "WHAT is interesting?"
After Stage 3, you have a complete PointsDB. But where do you start? Discovery tools optimize for pattern recognition and hypothesis formation.
Three complementary perspectives:
1. Leaderboard — "What's the big picture?"
- Aggregate rankings with ReasonScore
- Heatmap visualization (models × tasks)
- Color gradients reveal failure patterns
- Truncation indicators show context issues
- Group filtering enables peer comparison
2. Spider Plots — "What's this model's cognitive fingerprint?"
- Radar chart across 12 reasoning domains
- Cognitive archetype identification (9 recognizable patterns)
- Difficulty scaling behavior across parameter space
- Token efficiency overlay
- Cross-task consistency analysis
3. Explorer — "Where in the manifold does behavior change?"
- Interactive 3D surfaces (accuracy = Z-axis, params = X/Y)
- Capability zones (green plateaus = success regions)
- Failure boundaries (red cliffs = performance drop-offs)
- Multi-panel analysis (FFT, accuracy, token distributions)
- Point inspection (click to see test samples and responses)
Progressive discovery flow:
```
BROAD:    Leaderboard → Identify candidates
    ↓
FOCUSED:  Spider → Identify strengths/weaknesses
    ↓
SPECIFIC: Explorer → Identify failure boundaries
```
Stage 5: Investigation — Systematic Forensic Analysis¶
Purpose: Answer "WHY is it interesting?"
Discovery reveals patterns. Investigation explains mechanisms. Real research ping-pongs between both.
Stage 5 is organized around the Three P's: Position (rank models), Profile (characterize and diagnose), and Probe (inspect raw traces). Position and Profile operate on PointsDB; Probe drops to raw NDJSON when you need to see what the model actually produced.
See workflow.md for the complete Three P's methodology and tools.md for tool reference.
The Discovery-Investigation Loop¶
Stages 4 and 5 form a research loop, not a linear pipeline:
```
Discovery reveals patterns → Investigation explains mechanisms
        ↑                                    ↓
        └──── Investigation finds anomalies ─┘
                         ↓
         Both inform Stage 1 (new manifolds)
```
Key insight: After Stage 3, research isn't sequential. You ping-pong based on what you're trying to understand at each moment.
Example ping-pong:
- Discovery (leaderboard): "Model A and B look similar"
- Investigation (cluster): "Overlapping CIs confirm equivalence"
- Discovery (spider): "But different cognitive profiles"
- Investigation (surface): "Model A has cliff at depth=3, Model B smooth degradation"
- Investigation (compression): "Model A enters loops, Model B maintains entropy"
- Finding: Same aggregate score, different failure modes
See workflow.md for the Three P's and when to use discovery vs investigation.
Proving It Works¶
r12: 12 reasoning tasks, improved difficulty calibration, 16k context windows, and a 95% score ceiling. It demonstrates the architecture at its best—comprehensive parametric coverage without a priori difficulty assumptions.
The extraordinary evidence:

- Compression shows underthink/overthink/broken-loop patterns
- Hazard analysis proves models have measurable "thinking budgets"
- Surface plots reveal capability boundaries nobody knew existed
- Statistical rigor confirms these patterns are signal, not noise
For r12 documentation and ReasonScore v2: See r12.md and reasonscore.md
Interconnections: How It All Fits Together¶
The five stages form an interconnected research platform with forward data flow and iterative discovery-investigation loops.
```mermaid
flowchart TB
    subgraph Pipeline["Data Production Pipeline"]
        S1["Stage 1: Definition"]
        S2["Stage 2: Execution"]
        S3["Stage 3: Evaluation"]
        DB[("PointsDB")]
        S1 --> S2 --> S3 --> DB
    end

    subgraph Loop["Analysis Loop (Stages 4-5)"]
        S4["Stage 4: Discovery<br/>(leaderboard, spider, explorer)"]
        S5["Stage 5: Investigation<br/>The Three P's:<br/>Position → Profile → Probe"]
        RF["Research Findings"]
        S4 --> S5
        S5 --> S4
        S5 --> RF
    end

    subgraph Research["Research Loop"]
        NM["New manifold designs<br/>Hypothesis tests<br/>Difficulty refinements"]
    end

    DB --> Loop
    RF --> Research
    Research -.inform.-> S1

    style Pipeline fill:#e1f5fe
    style Loop fill:#f3e5f5
    style Research fill:#ffebee
```
What makes this work:
- Unified Data Layer - Stages 4 and 5 access identical PointsDB via API
- Complementary Modalities - Discovery optimizes for pattern recognition, investigation for root causes
- Flexible Entry Points - Start wherever makes sense for your research question
- Iterative Refinement - Each cycle improves understanding
- Research Loop Closure - Findings drive design, enabling science, not just benchmarking
Next Steps¶
For New Users (Start with r12)¶
- Explore r12 data: `python analyze.py evals data/r12.json`
- Visual discovery: Open leaderboard, spiderweb, and explorer
- Learn by doing: Run forensic analysis on interesting patterns
- Read technical-details.md for statistical concepts
- Follow index.md to add your own models
For Researchers (Use r12 as Your Dataset)¶
- Start analysis immediately: No inference needed, 6.5B tokens ready to explore
- Review tools.md for complete forensic capabilities
- Study workflow.md for discovery-investigation patterns
- Consult tasks.md to understand manifold design
- Extend r12: Add your own models to the reference dataset
For Developers (Fork r12 as Template)¶
- Study r12 structure: `data/r12.json` and `tasks/*.json`
- Examine config.md for manifold and view definitions
- Review implementation.md for pipeline integration
- Adapt for your needs: Copy manifolds, modify difficulty ranges, add new surfaces
For LLM Agents (r12 is Agent-Ready)¶
- Start with `analyze.py evals data/r12.json --format json`
- Use `analyze.py tasks` to discover available views
- Query with `--format json` for machine-readable outputs
- Follow workflow.md for systematic research patterns
See Also¶
Foundation Documents:
- challenges.md - The eight fundamental challenges in current LLM evaluation
- insight.md - LLMs as information processors and system architecture
Core Documentation:
- r12.md - The extraordinary evidence (reference evaluation + research dataset)
- implementation.md - The Python codebase that realizes this methodology
Deep-Dive Design:
- manifold.md - Two-plane data model design decisions
- reasonscore.md - Unified metric architecture and design rationale
Reference Documentation:
- workflow.md - The Three P's research methodology
- tasks.md - Abstract task API specifications
- config.md - Configuration reference (manifolds, templates, samplers)
- pointsdb.md - Complete data structure API
- implementation.md - Complete tool reference