ReasonScape Architecture: The Methodology

Prerequisites: Before reading this document, familiarize yourself with:

  • challenges.md - The eight fundamental problems this methodology addresses
  • insight.md - The information processing paradigm that informs the design


Overview

ReasonScape's methodology addresses the eight challenges in current LLM evaluation through a systematic, information-theoretic approach grounded in the insight that LLMs are information processors.

How the methodology addresses each challenge:

Challenge → Solution

  1. Doesn't Know What It's Asking → Parametric manifolds with coordinate-based test generation
  2. Doesn't Know Which Answers → Per-point evaluation with Wilson confidence intervals
  3. Doesn't Understand Reasoning Process → Information-theoretic forensics (compression, FFT, hazard)
  4. Can't Distinguish Signal from Noise → Excess accuracy correction and proper uncertainty quantification
  5. Trivially Gameable → Deterministic but unmemorizable coordinate-based generation
  6. Ceiling Effects → Parametric difficulty scaling that adapts to model capabilities
  7. Ignores Truncations and Context Failures → Explicit truncation tracking and penalty in ReasonScore
  8. Ignores Token Budget and Resource Efficiency → score/token metric and per-point token consumption tracking

The Five-Stage Architecture

The architectural solution is a multi-stage data-processing pipeline:

graph TB
    subgraph "Data Pipeline"
        A[Stage 1: Definition] --> B[Stage 2: Execution]
        B --> C[Stage 3: Evaluation]
    end

    C --> D[PointsDB]
    D --> E[Stage 4: Discovery]
    D --> F[Stage 5: Investigation]

    subgraph "Research Loop"
        E <-.ping-pong.-> F
    end

    E --> G[Research Insights]
    F --> G
    G -.inform.-> A

    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#fce4ec
    style G fill:#ffebee

Stage 1: Definition — Parametric Test Generation

Key innovation: Test generators create infinite unique instances within controlled difficulty manifolds.

Every test is deterministically generated from coordinate seeds:

seed = hash(task, parameters, sample_index)

Same coordinates always produce same test sequence. Different coordinates produce different tests. The manifold is infinite but reproducible.
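The seed formula above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `point_seed` name, the SHA-256 choice, and the canonical-string encoding are all assumptions.

```python
import hashlib
import random

def point_seed(task: str, parameters: dict, sample_index: int) -> int:
    """Derive a deterministic seed from manifold coordinates (illustrative sketch).

    Hashing a canonical string of the coordinates guarantees that the same
    (task, parameters, sample_index) always yields the same seed, while any
    change to a coordinate yields an unrelated one.
    """
    canonical = f"{task}|{sorted(parameters.items())}|{sample_index}"
    digest = hashlib.sha256(canonical.encode()).hexdigest()
    return int(digest, 16) % (2**32)

# Same coordinates -> same seed -> same generated test.
a = point_seed("arithmetic", {"length": 8, "depth": 2}, 0)
b = point_seed("arithmetic", {"length": 8, "depth": 2}, 0)
c = point_seed("arithmetic", {"length": 8, "depth": 2}, 1)
assert a == b and a != c

rng = random.Random(a)  # this seed would feed the task generator deterministically
```

A cryptographic hash (rather than Python's built-in `hash`, which is salted per process) is what makes the manifold reproducible across runs and machines.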

Manifold dimensions control difficulty:

  • Length (working memory load)
  • Depth (structural complexity)
  • Interference (selective attention demand)
  • Format (tokenization stress)
  • Multi-step operations (sequential reasoning)

Progressive complexity controls:

  • Precision (low/medium/high): How many tests per point

See technical-details.md for coordinate-based seeding algorithm and manifold resolution mechanics.

Stage 2: Execution — Efficient Inference at Scale

Output: runner.py writes per-test steps to results/… as NDJSON (0th-level, unaggregated records: task, degree, precision, eval_id, full inputs/outputs, meta). Nothing else reads these directly except Stage 3.

Key innovations: Response caching and hierarchical evaluation.

Response caching:

  • Every unique prompt is cached
  • Deterministic generation ensures cache hits
  • Typical cost reduction: 30-60% across evaluation runs
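A minimal sketch of prompt-keyed caching; the `ResponseCache` class is hypothetical and not runner.py's actual API. The point it demonstrates: because test generation is deterministic, re-running an evaluation reproduces identical prompts and turns repeated inference into cache hits.

```python
import hashlib

class ResponseCache:
    """Minimal prompt-keyed cache sketch (hypothetical, not the runner.py API)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str, sampler: dict) -> str:
        # The cache key covers everything that changes the completion.
        return hashlib.sha256(f"{prompt}|{sorted(sampler.items())}".encode()).hexdigest()

    def get_or_call(self, prompt, sampler, call_model):
        key = self._key(prompt, sampler)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = call_model(prompt)
        return self._store[key]

cache = ResponseCache()
fake_model = lambda p: f"answer to {p}"          # stand-in for real inference
cache.get_or_call("2+2=?", {"temp": 0.0}, fake_model)   # first run: miss, calls model
cache.get_or_call("2+2=?", {"temp": 0.0}, fake_model)   # rerun: deterministic prompt hits cache
assert (cache.hits, cache.misses) == (1, 1)
```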

Hierarchical sampling:

  • Tests at count=32 are perfect subset of count=128
  • Can upsample without waste
  • Can downsample for quick comparison
  • Supports progressive evaluation workflows
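The subset property can be illustrated with a toy sketch, assuming sample indices are absolute positions on each manifold point (as the coordinate-based seeding implies):

```python
def sample_indices(count: int) -> list[int]:
    # Sample indices are absolute, so the count=32 run is exactly
    # the first 32 indices of the count=128 run.
    return list(range(count))

small = sample_indices(32)
large = sample_indices(128)
assert small == large[:32]       # perfect subset: downsampling is free

extra = large[len(small):]       # upsampling only needs inference for indices 32..127
assert len(extra) == 96
```

Because each index seeds its own test independently, upsampling never re-runs or wastes the already-collected samples.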

See technical-details.md for caching implementation and truncation-aware execution.

Stage 3: Evaluation — Statistical Rigor Without Lies

Output: evaluate.py consumes step NDJSON and writes per-eval_id points into PointsDB (1st-level aggregates: outcome, tokens, compressed length, excess-accuracy adjusted). All downstream tools (Stages 4–5) work from these points and their higher-level aggregations (vectors → KPIs); they never read raw steps.

Key innovations: Excess accuracy correction, truncation awareness, and pre-computed forensics.

Excess accuracy correction:

  • Removes expected guessing contributions
  • 0.000 = no better than guessing
  • 1.000 = perfect knowledge
  • Fair comparison across all task types
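A sketch using the standard correction-for-guessing formula, which matches the 0.000/1.000 endpoints described above. The exact ReasonScape computation lives in evaluate.py and technical-details.md; the `excess_accuracy` helper here is illustrative.

```python
def excess_accuracy(observed: float, guess_rate: float) -> float:
    """Chance-corrected accuracy (standard correction-for-guessing form, as a sketch).

    Maps the guessing baseline to 0.0 and perfect knowledge to 1.0, so tasks
    with different numbers of answer options become directly comparable.
    """
    if guess_rate >= 1.0:
        raise ValueError("guess_rate must be < 1")
    return max(0.0, (observed - guess_rate) / (1.0 - guess_rate))

# A 4-option multiple-choice task: 25% accuracy is pure guessing.
assert excess_accuracy(0.25, 0.25) == 0.0    # no better than guessing
assert excess_accuracy(1.00, 0.25) == 1.0    # perfect knowledge
assert abs(excess_accuracy(0.625, 0.25) - 0.5) < 1e-9
```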

Truncation awareness (Challenge #7):

  • Truncations tracked separately from errors
  • Not wrong answers—context limit failures that waste resources
  • Handled via probability multiplication in ReasonScore (joint mode: P[Correct|U] × P[Untrunc])
  • Widen confidence intervals (reduced effective sample size)
  • Report explicitly in all visualizations
  • Why this matters: Pass@k metrics hide that a model might need 10 attempts to produce valid output, masking deployment reliability issues
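The joint mode described above can be sketched as follows (hypothetical helper, not the evaluate.py implementation):

```python
def joint_score(correct: int, truncated: int, total: int) -> float:
    """Joint-mode sketch: P[Correct | untruncated] x P[untruncated].

    Truncations are not counted as wrong answers; instead they scale the
    score down by the probability of finishing within the context limit.
    """
    untruncated = total - truncated
    if untruncated == 0:
        return 0.0
    p_correct_given_u = correct / untruncated
    p_untrunc = untruncated / total
    return p_correct_given_u * p_untrunc

# 64 samples: 40 correct, 16 truncated -> 40/48 accuracy among finished runs,
# scaled by a 48/64 completion rate.
score = joint_score(correct=40, truncated=16, total=64)
assert abs(score - (40 / 48) * (48 / 64)) < 1e-12
```

Note how the multiplication makes unreliable completion visible: a model that is perfect when it finishes but truncates half the time scores 0.5, not 1.0.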

Pre-computed forensics:

  • Compression arrays (for entropy analysis)
  • FFT arrays (for spectral analysis)
  • Token distributions (for hazard analysis)
  • 10-100x speedup for Stage 5 investigations
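As one illustration of the compression idea, a simple zlib ratio already separates loopy, repetitive output from high-entropy reasoning. This is a hedged sketch; the actual pre-computed forensic arrays are richer than a single scalar ratio.

```python
import random
import zlib

def compression_ratio(text: str) -> float:
    """Compressed/original byte ratio: lower means more repetitive (loopy) text."""
    raw = text.encode()
    return len(zlib.compress(raw, 9)) / len(raw)

# Broken-loop style output: the same phrase over and over compresses extremely well.
looping = "Let me check again. " * 100

# High-entropy stand-in for varied reasoning text (seeded for reproducibility).
rng = random.Random(0)
varied = "".join(chr(rng.randrange(33, 123)) for _ in range(2000))

assert compression_ratio(looping) < compression_ratio(varied)
```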

See technical-details.md for Wilson CI algorithm, excess accuracy computation, and pre-computation mechanics.

The Two-Plane Data Model

ReasonScape organizes evaluation data using PointsDB, a two-plane, three-layer structure. Each point exists simultaneously in an Evaluation Plane (model, template, sampler) and a Task-Complexity Plane (base_task, params). The two planes are orthogonal: the same model can be tested at many difficulty levels, and many models can be tested at the same difficulty level.

For the complete data model — design rationale, layer definitions, facet computation, and identity rules: See manifold.md

For complete PointsDB API and query patterns: See pointsdb.md

ReasonScore: The Unified Metric

ReasonScore captures six dimensions of model performance in a single interpretable number:

What it measures:

  1. Accuracy - Correctness above random guessing baseline (Challenge #4)
  2. Statistical confidence - Uncertainty from finite sampling (Challenge #4)
  3. Context reliability - Truncation and context limit issues (Challenge #7)
  4. Task balance - Performance consistency across reasoning domains (Challenge #2)
  5. Difficulty scaling - Capability maintenance under increasing complexity (Challenge #1, #6)
  6. Token efficiency - Computational cost per unit of quality (Challenge #8)

How it's computed (v2, 2-layer):

Layer 1: Samples → Task Score        [per-task probability-space computation]
Layer 2: Tasks → ReasonScore         [bootstrap geometric mean × 1000]

Design philosophy:

  • Probability-space truncation - Truncation modeled as (1 - P[Trunc]) × P[Correct], not subtracted
  • Punish imbalance - Geometric mean across tasks (catastrophic failure in one domain hurts overall score)
  • Preserved uncertainty - Confidence intervals carried through bootstrap aggregation
  • Account for efficiency - score/token ratio makes efficiency a first-class concern (Challenge #8)

Why geometric mean?

Unlike arithmetic mean, geometric mean penalizes inconsistency. Being great at 11 tasks doesn't excuse catastrophic failure at 1 task. In real deployment, users hit all task types.
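A numeric sketch of the contrast, using hypothetical per-task scores:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # Log-space form; the small floor avoids log(0) for a total failure.
    return math.exp(sum(math.log(max(x, 1e-9)) for x in xs) / len(xs))

balanced = [0.70] * 12               # solid everywhere
spiky = [0.95] * 11 + [0.01]         # great at 11 tasks, catastrophic at 1

# Arithmetic mean rewards the spiky model (~0.87 vs 0.70)...
assert arithmetic_mean(spiky) > arithmetic_mean(balanced)
# ...while geometric mean punishes the collapse (~0.65 vs 0.70).
assert geometric_mean(spiky) < geometric_mean(balanced)
```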

Why score/token?

Two models with identical accuracy can differ by 5-10x in token consumption. Model A using 500 tokens/problem and Model B using 5,000 tokens/problem have radically different deployment characteristics:

  • Cost: 10x difference in API bills
  • Latency: 10x difference in user wait times
  • Throughput: 10x difference in concurrent users supported
  • Environment: 10x difference in energy consumption

Accuracy-only metrics treat these as equivalent. They're not. The final score/token ratio makes efficiency a first-class concern, answering: "How much quality per unit of resource?"

For complete layer-by-layer computation, design rationale, and philosophical motivation: See reasonscore.md

Stage 4: Discovery — Visual Pattern Recognition

Purpose: Answer "WHAT is interesting?"

After Stage 3, you have a complete PointsDB. But where do you start? Discovery tools optimize for pattern recognition and hypothesis formation.

Three complementary perspectives:

1. Leaderboard — "What's the big picture?"

  • Aggregate rankings with ReasonScore
  • Heatmap visualization (models × tasks)
  • Color gradients reveal failure patterns
  • Truncation indicators show context issues
  • Group filtering enables peer comparison

2. Spider Plots — "What's this model's cognitive fingerprint?"

  • Radar chart across 12 reasoning domains
  • Cognitive archetype identification (9 recognizable patterns)
  • Difficulty scaling behavior across parameter space
  • Token efficiency overlay
  • Cross-task consistency analysis

3. Explorer — "Where in the manifold does behavior change?"

  • Interactive 3D surfaces (accuracy = Z-axis, params = X/Y)
  • Capability zones (green plateaus = success regions)
  • Failure boundaries (red cliffs = performance drop-offs)
  • Multi-panel analysis (FFT, accuracy, token distributions)
  • Point inspection (click to see test samples and responses)

Progressive discovery flow:

BROAD: Leaderboard → Identify candidates
    ↓
FOCUSED: Spider → Identify strengths/weaknesses
    ↓
SPECIFIC: Explorer → Identify failure boundaries

Stage 5: Investigation — Systematic Forensic Analysis

Purpose: Answer "WHY is it interesting?"

Discovery reveals patterns. Investigation explains mechanisms. Real research ping-pongs between both.

Stage 5 is organized around the Three P's: Position (rank models), Profile (characterize and diagnose), and Probe (inspect raw traces). Position and Profile operate on PointsDB; Probe drops to raw NDJSON when you need to see what the model actually produced.

See workflow.md for the complete Three P's methodology and tools.md for tool reference.

The Discovery-Investigation Loop

Stages 4 and 5 form a research loop, not a linear pipeline:

Discovery reveals patterns → Investigation explains mechanisms
        ↑                                    ↓
        └──── Investigation finds anomalies ─┘
                            ↓
                  Both inform Stage 1 (new manifolds)

Key insight: After Stage 3, research isn't sequential. You ping-pong based on what you're trying to understand at each moment.

Example ping-pong:

  1. Discovery (leaderboard): "Model A and B look similar"
  2. Investigation (cluster): "Overlapping CIs confirm equivalence"
  3. Discovery (spider): "But different cognitive profiles"
  4. Investigation (surface): "Model A has cliff at depth=3, Model B smooth degradation"
  5. Investigation (compression): "Model A enters loops, Model B maintains entropy"
  6. Finding: Same aggregate score, different failure modes

See workflow.md for the Three P's and when to use discovery vs investigation.


Proving It Works

r12: 12 reasoning tasks, improved difficulty calibration, 16k context windows, 95% score ceiling. It demonstrates the architecture at its best: comprehensive parametric coverage without a priori difficulty assumptions.

The extraordinary evidence:

  • Compression shows underthink/overthink/broken-loop patterns
  • Hazard analysis proves models have measurable "thinking budgets"
  • Surface plots reveal capability boundaries nobody knew existed
  • Statistical rigor confirms these patterns are signal, not noise

For r12 documentation and ReasonScore v2: See r12.md and reasonscore.md


Interconnections: How It All Fits Together

The five stages form an interconnected research platform with forward data flow and iterative discovery-investigation loops.

flowchart TB
    subgraph Pipeline["Data Production Pipeline"]
        S1["Stage 1: Definition"]
        S2["Stage 2: Execution"]
        S3["Stage 3: Evaluation"]
        DB[("PointsDB")]
        S1 --> S2 --> S3 --> DB
    end

    subgraph Loop["Analysis Loop (Stages 4-5)"]
        S4["Stage 4: Discovery<br/>(leaderboard, spider, explorer)"]
        S5["Stage 5: Investigation<br/>The Three P's:<br/>Position → Profile → Probe"]
        RF["Research Findings"]

        S4 --> S5
        S5 --> S4
        S5 --> RF
    end

    subgraph Research["Research Loop"]
        NM["New manifold designs<br/>Hypothesis tests<br/>Difficulty refinements"]
    end

    DB --> Loop
    RF --> Research
    Research -.inform.-> S1

    style Pipeline fill:#e1f5fe
    style Loop fill:#f3e5f5
    style Research fill:#ffebee

What makes this work:

  1. Unified Data Layer - Stages 4 and 5 access identical PointsDB via API
  2. Complementary Modalities - Discovery optimizes for pattern recognition, investigation for root causes
  3. Flexible Entry Points - Start wherever makes sense for your research question
  4. Iterative Refinement - Each cycle improves understanding
  5. Research Loop Closure - Findings drive design, enabling science not just benchmarking

Next Steps

For New Users (Start with r12)

  1. Explore r12 data: python analyze.py evals data/r12.json
  2. Visual discovery: Open leaderboard, spiderweb, and explorer
  3. Learn by doing: Run forensic analysis on interesting patterns
  4. Read technical-details.md for statistical concepts
  5. Follow index.md to add your own models

For Researchers (Use r12 as Your Dataset)

  1. Start analysis immediately: No inference needed, 6.5B tokens ready to explore
  2. Review tools.md for complete forensic capabilities
  3. Study workflow.md for discovery-investigation patterns
  4. Consult tasks.md to understand manifold design
  5. Extend r12: Add your own models to the reference dataset

For Developers (Fork r12 as Template)

  1. Study r12 structure: data/r12.json and tasks/*.json
  2. Examine config.md for manifold and view definitions
  3. Review implementation.md for pipeline integration
  4. Adapt for your needs: Copy manifolds, modify difficulty ranges, add new surfaces

For LLM Agents (r12 is Agent-Ready)

  1. Start with analyze.py evals data/r12.json --format json
  2. Use analyze.py tasks to discover available views
  3. Query with --format json for machine-readable outputs
  4. Follow workflow.md for systematic research patterns

See Also

Foundation Documents:

  • challenges.md - The eight fundamental challenges in current LLM evaluation
  • insight.md - LLMs as information processors and system architecture

Core Documentation:

  • r12.md - The extraordinary evidence (reference evaluation + research dataset)
  • implementation.md - The Python codebase that realizes this methodology

Deep-Dive Design:

Reference Documentation: