ReasonScape Architecture: The Methodology¶
Prerequisites: Before reading this document, familiarize yourself with:

- challenges.md - The eight fundamental problems this methodology addresses
- insight.md - The information processing paradigm that informs the design
Overview¶
ReasonScape's methodology addresses the eight challenges in current LLM evaluation through a systematic, information-theoretic approach grounded in the insight that LLMs are information processors.
How the methodology addresses each challenge:
| Challenge | Solution |
|---|---|
| 1. Doesn't Know What It's Asking | Parametric manifolds with coordinate-based test generation |
| 2. Doesn't Know Which Answers | Per-point evaluation with Wilson confidence intervals |
| 3. Doesn't Understand Reasoning Process | Information-theoretic forensics (compression, FFT, hazard) |
| 4. Can't Distinguish Signal from Noise | Excess accuracy correction and proper uncertainty quantification |
| 5. Trivially Gameable | Deterministic but unmemorizable coordinate-based generation |
| 6. Ceiling Effects | Parametric difficulty scaling that adapts to model capabilities |
| 7. Ignores Truncations and Context Failures | Explicit truncation tracking and penalty in ReasonScore |
| 8. Ignores Token Budget and Resource Efficiency | score/token metric and per-point token consumption tracking |
The Five-Stage Architecture¶
The architectural solution is a multi-stage data-processing pipeline:
```mermaid
graph TB
    subgraph "Data Pipeline"
        A[Stage 1: Definition] --> B[Stage 2: Execution]
        B --> C[Stage 3: Evaluation]
    end
    C --> D[PointsDB]
    D --> E[Stage 4: Discovery]
    D --> F[Stage 5: Investigation]
    subgraph "Research Loop"
        E <-.ping-pong.-> F
    end
    E --> G[Research Insights]
    F --> G
    G -.inform.-> A
    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#fce4ec
    style G fill:#ffebee
```
Stage 1: Definition — Parametric Test Generation¶
Key innovation: Test generators create infinite unique instances within controlled difficulty manifolds.
Every test is deterministically generated from coordinate seeds:
```
seed = hash(task, parameters, sample_index)
```
Same coordinates always produce same test sequence. Different coordinates produce different tests. The manifold is infinite but reproducible.
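A minimal sketch of what coordinate-based seeding can look like (the function name `make_seed` and the hashing scheme are illustrative assumptions; the actual algorithm is documented in technical-details.md):

```python
import hashlib
import random

def make_seed(task: str, parameters: dict, sample_index: int) -> int:
    # Hash the full coordinate tuple into a 64-bit integer seed.
    key = f"{task}|{sorted(parameters.items())}|{sample_index}"
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same coordinates -> same seed -> same generated test instance; changing any
# coordinate (including sample_index) yields a different, equally reproducible test.
rng = random.Random(make_seed("arithmetic", {"length": 18, "depth": 2}, sample_index=7))
```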
Manifold dimensions control difficulty:
- Length (working memory load)
- Depth (structural complexity)
- Interference (selective attention demand)
- Format (tokenization stress)
- Multi-step operations (sequential reasoning)
Progressive complexity controls:
- Degree (0-2): Easy, Medium, Hard difficulty ranges
- Density (corner/lowdef/normal): Which points to sample
- Precision (low/medium/high): How many tests per point
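For concreteness, a hypothetical execution configuration using these three controls (field names are assumptions; see config.md for the real schema):

```python
# Hypothetical execution configuration -- the authoritative schema is in config.md.
execution = {
    "degree": 1,          # 0 = easy, 1 = medium, 2 = hard difficulty range
    "density": "normal",  # corner / lowdef / normal: which points to sample
    "precision": "high",  # low / medium / high: how many tests per point
}
```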
See technical-details.md for coordinate-based seeding algorithm and manifold resolution mechanics.
Stage 2: Execution — Efficient Inference at Scale¶
Key innovations: Response caching, adaptive sampling, and hierarchical evaluation.
Response caching:
- Every unique prompt is cached
- Deterministic generation ensures cache hits
- Typical cost reduction: 60-80% for multi-tier evaluation
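A sketch of the caching idea, assuming a key built from the model, sampler settings, and the exact prompt (`cached_completion` and `run_inference` are illustrative names, not the real API):

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, sampler: dict, run_inference) -> str:
    # Deterministic test generation reproduces identical prompts across runs,
    # so identical (model, sampler, prompt) triples hit the same cache entry.
    key = hashlib.sha256(f"{model}|{sorted(sampler.items())}|{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = run_inference(model, prompt, sampler)
    return _cache[key]
```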
Adaptive sampling:
- Easy points converge quickly (few samples needed)
- Hard points get more samples (more rounds for precision)
- Truncation-heavy points abort early (don't waste tokens)
- Statistical confidence guaranteed by CI tracking
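The stopping rule can be pictured as sampling until the Wilson confidence interval is narrow enough; a sketch of that idea (the actual confidence-targeting algorithm is in technical-details.md):

```python
import math

def wilson_halfwidth(successes: int, n: int, z: float = 1.96) -> float:
    # Half-width of the Wilson score interval for a binomial proportion.
    if n == 0:
        return 1.0
    p = successes / n
    denom = 1 + z * z / n
    return (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))

# Illustrative adaptive loop: stop sampling once the interval is tight enough.
# Easy points converge in a few rounds; hard points need more.
#   while wilson_halfwidth(correct, total) > target_halfwidth:
#       correct, total = run_another_round()
```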
Hierarchical sampling:
- Tests at count=32 are perfect subset of count=128
- Can upsample without waste
- Can downsample for quick comparison
- Supports progressive evaluation workflows
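Because the seed depends only on (task, parameters, sample_index), the first 32 indices generate identical tests whether the target count is 32 or 128; reusing the illustrative `make_seed` sketch from Stage 1:

```python
point_params = {"length": 18, "depth": 2}
seeds_32 = [make_seed("arithmetic", point_params, i) for i in range(32)]
seeds_128 = [make_seed("arithmetic", point_params, i) for i in range(128)]
assert seeds_128[:32] == seeds_32  # count=32 is a strict prefix of count=128
```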
See technical-details.md for caching implementation, confidence targeting algorithm, and truncation-aware execution.
Stage 3: Evaluation — Statistical Rigor Without Lies¶
Key innovations: Excess accuracy correction, truncation awareness, semantic tier mapping, and pre-computed forensics.
Excess accuracy correction:
- Removes expected guessing contributions
- 0.000 = no better than guessing
- 1.000 = perfect knowledge
- Fair comparison across all task types
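One common way to remove the guessing contribution is to rescale accuracy against the chance baseline; a sketch assuming that form (the exact computation is in technical-details.md):

```python
def excess_accuracy(raw_accuracy: float, chance_rate: float) -> float:
    # Rescale so 0.0 = guessing baseline, 1.0 = perfect knowledge.
    return max(0.0, (raw_accuracy - chance_rate) / (1.0 - chance_rate))

# A 4-way multiple-choice task: 70% raw accuracy -> 0.6 excess accuracy.
print(excess_accuracy(0.70, chance_rate=0.25))  # ~0.6
```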
Truncation awareness (Challenge #7):
- Truncations tracked separately from errors
- Not wrong answers—context limit failures that waste resources
- Direct penalty in ReasonScore (subtracted from point score)
- Widen confidence intervals (reduced effective sample size)
- Report explicitly in all visualizations
- Why this matters: Pass@k metrics hide that a model might need 10 attempts to produce valid output, masking deployment reliability issues
Semantic tier mapping:
- `(degree, density)` execution parameters → tier labels
- `(0, normal)` → "easy", `(1, normal)` → "medium", `(2, normal)` → "hard"
- Stable tier labels as execution strategies evolve
- Enables adaptive difficulty (add "ultra" when "hard" saturates)
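As a sketch, the mapping can be thought of as a small lookup table (illustrative; the real mapping lives in the evaluation configuration described in config.md):

```python
# Illustrative lookup: (degree, density) execution parameters -> stable tier labels.
TIER_MAP = {
    (0, "normal"): "easy",
    (1, "normal"): "medium",
    (2, "normal"): "hard",
    # (3, "normal"): "ultra",  # added if "hard" saturates, without renaming existing tiers
}
```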
Pre-computed forensics:
- Compression arrays (for entropy analysis)
- FFT arrays (for spectral analysis)
- Token distributions (for hazard analysis)
- 10-100x speedup for Stage 5 investigations
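One plausible shape for this pre-computation, assuming zlib-style compression ratios over reasoning text and an FFT over the token-ID sequence (function name, fields, and array contents are all assumptions; see technical-details.md for the actual mechanics):

```python
import zlib
import numpy as np

def forensic_arrays(reasoning_text: str, token_ids: list[int]) -> dict:
    raw = reasoning_text.encode()
    return {
        # Low compression ratio suggests repetitive, loopy reasoning (low entropy).
        "compression_ratio": len(zlib.compress(raw)) / max(len(raw), 1),
        # Spectral magnitudes of the token-ID sequence, stored for later FFT analysis.
        "fft_magnitude": np.abs(np.fft.rfft(np.asarray(token_ids, dtype=float))).tolist(),
        # Token count feeds the hazard (timing) analysis.
        "token_count": len(token_ids),
    }
```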
See technical-details.md for Wilson CI algorithm, excess accuracy computation, and pre-computation mechanics.
The Two-Plane Data Model¶
ReasonScape organizes evaluation data using PointsDB, a two-plane structure where each point exists simultaneously in both an Evaluation Plane and a Task-Complexity Plane.
Why two planes?
Traditional benchmarks are flat: (model, task) → score
This can't answer:

- WHERE in complexity space does the model fail?
- HOW does performance change as difficulty increases?
- WHAT architectural patterns emerge across difficulty levels?
The structure:
| | EVALUATION | TASK-COMPLEXITY |
|---|---|---|
| IDENTITY (5D) | `model`, `template`, `sampler` | `base_task`, `params` |
| FACETS | `eval_id`, `groups[]` | `tiers[]`, `surfaces[]`, `projections[]` |
Identity dimensions (5D) uniquely define a point:
- Evaluation Plane: `model`, `template`, `sampler`
- Task-Complexity Plane: `base_task`, `params`
Points with identical 5D identity are de-duplicated.
Facet dimensions provide multi-valued organizational views:
- Evaluation facets: `eval_id` (shorthand), `groups[]` (arch:moe, size:large, etc.)
- Complexity facets: `tiers[]` (easy/medium/hard), `surfaces[]` (2D slices), `projections[]` (1D sweeps)
Points can belong to multiple facets simultaneously.
Key properties:
- Orthogonality: Same model tested at many difficulty levels; many models tested at same difficulty level
- Faceted organization: Filter by tier, group by architecture, slice by surface—all from the same data
- Identity-based de-duplication: Running the same evaluation twice doesn't create duplicates
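For intuition, a single point might carry fields like the following (model names and facet values are hypothetical; the authoritative schema is in pointsdb.md):

```python
# Hypothetical point record -- field names follow the description above,
# values are made up; the authoritative schema is in pointsdb.md.
point = {
    # Identity (5D): the de-duplication key
    "model": "example-model-32b",
    "template": "chatml",
    "sampler": "default",
    "base_task": "arithmetic",
    "params": {"length": 18, "depth": 2},
    # Facets: multi-valued organizational views
    "eval_id": "example-model-32b",
    "groups": ["arch:moe", "size:large"],
    "tiers": ["medium"],
    "surfaces": ["length_x_depth"],
    "projections": ["length_sweep"],
}
```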
For detailed design rationale, orthogonality principles, and facet computation: See manifold.md
For complete PointsDB API and query patterns: See pointsdb.md
ReasonScore: The Unified Metric¶
ReasonScore captures six dimensions of model performance in a single interpretable number:
What it measures:
- Accuracy - Correctness above random guessing baseline (Challenge #4)
- Statistical confidence - Uncertainty from finite sampling (Challenge #4)
- Context reliability - Truncation and context limit issues (Challenge #7)
- Task balance - Performance consistency across reasoning domains (Challenge #2)
- Difficulty scaling - Capability maintenance under increasing complexity (Challenge #1, #6)
- Token efficiency - Computational cost per unit of quality (Challenge #8)
How it's computed:
```
Layer 1: Samples → Point Score      [Wilson CI + truncation penalty]
Layer 2: Points  → Task Score       [Wilson CI re-aggregation]
Layer 3: Tasks   → Tier ReasonScore [Geometric Mean × 1000]
Layer 4: Tiers   → score/token      [Arithmetic Mean ÷ median tokens]
```
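A sketch of Layers 3 and 4, assuming per-task and per-tier scores already aggregated into [0, 1] (Layers 1-2, including the Wilson CI handling, are detailed in reasonscore.md):

```python
import statistics

def tier_reasonscore(task_scores: list[float]) -> float:
    # Layer 3: geometric mean across task scores, scaled by 1000.
    return statistics.geometric_mean(task_scores) * 1000

def score_per_token(tier_scores: list[float], median_tokens: float) -> float:
    # Layer 4: arithmetic mean of tier scores divided by median token consumption.
    return statistics.mean(tier_scores) / median_tokens

task_scores = [0.82, 0.74, 0.05, 0.91]   # one catastrophic task
print(tier_reasonscore(task_scores))      # ~408, well below the ~630 an arithmetic mean would give
```

The worked numbers illustrate the design philosophy below: the geometric mean lets a single catastrophic task drag the whole score down.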
Design philosophy:
- Optimistic about uncertainty - Add confidence margin (statistical uncertainty is "our fault")
- Pessimistic about failures - Subtract truncation penalty (Challenge #7: context limits are "model's fault")
- Punish imbalance - Geometric mean across tasks (catastrophic failure in one domain hurts overall score)
- Account for efficiency - Divide by tokens (Challenge #8: being right isn't enough if it's expensive)
Why geometric mean?
Unlike arithmetic mean, geometric mean penalizes inconsistency. Being great at 11 tasks doesn't excuse catastrophic failure at 1 task. In real deployment, users hit all task types.
Why score/token?
Two models with identical accuracy can differ by 5-10x in token consumption. Model A at 500 tokens per problem and Model B at 5,000 tokens per problem have radically different deployment characteristics:
- Cost: 10x difference in API bills
- Latency: 10x difference in user wait times
- Throughput: 10x difference in concurrent users supported
- Environment: 10x difference in energy consumption
Accuracy-only metrics treat these as equivalent. They're not. The final score/token ratio makes efficiency a first-class concern, answering: "How much quality per unit of resource?"
For complete layer-by-layer computation, design rationale, and philosophical motivation: See reasonscore.md
Stage 4: Discovery — Visual Pattern Recognition¶
Purpose: Answer "WHAT is interesting?"
After Stage 3, you have a complete PointsDB. But where do you start? Discovery tools optimize for pattern recognition and hypothesis formation.
Three complementary perspectives:
1. Leaderboard — "What's the big picture?"
- Aggregate rankings with ReasonScore
- Heatmap visualization (models × tasks × tiers)
- Color gradients reveal failure patterns
- Truncation indicators show context issues
- Group filtering enables peer comparison
2. Spider Plots — "What's this model's cognitive fingerprint?"
- Radar chart across 12 reasoning domains
- Cognitive archetype identification (9 recognizable patterns)
- Difficulty scaling behavior (easy/medium/hard)
- Token efficiency overlay
- Cross-task consistency analysis
3. Explorer — "Where in the manifold does behavior change?"
- Interactive 3D surfaces (accuracy = Z-axis, params = X/Y)
- Capability zones (green plateaus = success regions)
- Failure boundaries (red cliffs = performance drop-offs)
- Multi-panel analysis (FFT, accuracy, token distributions)
- Point inspection (click to see test samples and responses)
Progressive discovery flow:
```
BROAD:    Leaderboard → Identify candidates
              ↓
FOCUSED:  Spider      → Identify strengths/weaknesses
              ↓
SPECIFIC: Explorer    → Identify failure boundaries
```
See workflow.md for complete discovery workflows.
Stage 5: Investigation — Systematic Forensic Analysis¶
Purpose: Answer "WHY is it interesting?"
Discovery reveals patterns. Investigation explains mechanisms. Real research ping-pongs between both.
Information processing analysis tools:
Understanding why a model fails requires investigating four information spaces:
- surface - Where does performance break down? Look at OUTPUT.
- fft - How is the problem represented? Look at INPUT.
- compression - What is the information quality? Look at REASONING (spatial/entropy).
- hazard - When does thinking degrade? Look at REASONING (temporal/timing).
Discovery support tools:
- evals - Evaluation discovery with fuzzy search
- tasks - Task structure discovery (surfaces/projections)
- modelinfo - Architecture-aware interpretation
Statistical validation:
- cluster - CI-overlap grouping (distinguish signal from noise)
- scores - Statistical rankings with CI
- spiderweb - Complete single-model fingerprinting
Example investigation flow:
```
Stage 4: "Model X fails at arithmetic length=18"
    ↓
Stage 5 surface: "Failure boundary confirmed at length=18"
    ↓
Stage 5 fft: "Tokenization not the issue"
    ↓
Stage 5 compression: "Reasoning traces become 3x more compressible"
    ↓
ROOT CAUSE: Information loss / reasoning loops
```
See tools/analyze.md for complete forensic toolkit reference.
The Discovery-Investigation Loop¶
Stages 4 and 5 form a research loop, not a linear pipeline:
```
Discovery reveals patterns → Investigation explains mechanisms
        ↑                              ↓
        └── Investigation finds anomalies
                       ↓
        Both inform Stage 1 (new manifolds)
```
Key insight: After Stage 3, research isn't sequential. You ping-pong based on what you're trying to understand at each moment.
Example ping-pong:
- Discovery (leaderboard): "Model A and B look similar"
- Investigation (cluster): "Overlapping CIs confirm equivalence"
- Discovery (spider): "But different cognitive profiles"
- Investigation (surface): "Model A has cliff at depth=3, Model B smooth degradation"
- Investigation (compression): "Model A enters loops, Model B maintains entropy"
- Finding: Same aggregate score, different failure modes
See workflow.md for four research workflow patterns showing when to use discovery vs investigation.
Proving It Works¶
m12x validates this architecture.
75+ models, 12 reasoning tasks, 6.5B tokens, 150K+ evaluation points.
This isn't hypothetical. Every design pattern—the manifold definitions, the tier mappings, the precision configurations, the cognitive archetypes, the forensic workflows—has been battle-tested through real evaluation at production scale.
m12x serves three purposes:

1. Validates the architecture — Proves ReasonScape works (not vaporware)
2. Provides research-ready data — Enables immediate analysis without inference costs
3. Demonstrates design patterns — Shows concrete choices others can adapt
The extraordinary evidence:

- FFT reveals spectral signatures that differ by tokenizer/architecture
- Compression shows underthink/overthink/broken loops patterns
- Hazard analysis proves models have measurable "thinking budgets"
- Surface plots reveal capability boundaries nobody knew existed
- Statistical rigor confirms these patterns are signal, not noise
For complete m12x documentation, configuration details, and usage guide: See m12x.md
The Four Research Workflows¶
The architecture enables four distinct research workflows, each using different tool combinations:
1. Ranking & Benchmarking¶
Question: "What's the best model overall?"
Tools: Leaderboard (Stage 4) → scores + cluster (Stage 5)
Flow: Discovery → Investigation (quick validation)
Duration: 2-3 minutes
2. Comparative Evaluation¶
Question: "Which models are truly different?"
Tools: cluster (Stage 5) → spiderweb + explorer (Stage 4)
Flow: Investigation → Discovery (visual confirmation)
Duration: 5-10 minutes
3. Model Characterization¶
Question: "What are this model's strengths and weaknesses?"
Tools: spiderweb + explorer (Stage 4) → surface + compression + hazard (Stage 5)
Flow: Discovery ↔ Investigation (heavy ping-pong)
Duration: 5-10 minutes
4. Failure Diagnosis¶
Question: "Why did this model fail?"
Tools: explorer (Stage 4) → surface + fft + compression + hazard (Stage 5)
Flow: Discovery → Investigation → Discovery → Investigation (deep iteration)
Duration: 10-20 minutes
For detailed workflow examples with command sequences: See workflow.md
Interconnections: How It All Fits Together¶
The five stages form an interconnected research platform with forward data flow and iterative discovery-investigation loops.
```mermaid
flowchart TB
    subgraph Pipeline["Data Production Pipeline"]
        S1["Stage 1: Definition"]
        S2["Stage 2: Execution"]
        S3["Stage 3: Evaluation"]
        DB[("PointsDB")]
        S1 --> S2 --> S3 --> DB
    end
    subgraph Loop["Analysis Loop"]
        S4["Stage 4: Discovery<br/>(leaderboard, spider, explorer)"]
        S5["Stage 5: Investigation<br/>(surface, fft, compression, hazard)"]
        RF["Research Findings"]
        S4 --> S5
        S5 --> S4
        S5 --> RF
    end
    subgraph Research["Research Loop"]
        NM["New manifold designs<br/>Hypothesis tests<br/>Difficulty refinements"]
    end
    DB --> Loop
    RF --> Research
    Research -.inform.-> S1
    style Pipeline fill:#e1f5fe
    style Loop fill:#f3e5f5
    style Research fill:#ffebee
```
What makes this work:
- Unified Data Layer - Stages 4 and 5 access identical PointsDB via API
- Complementary Modalities - Discovery optimizes for pattern recognition, investigation for root causes
- Flexible Entry Points - Start wherever makes sense for your research question
- Iterative Refinement - Each cycle improves understanding
- Research Loop Closure - Findings drive design, enabling science, not just benchmarking
Next Steps¶
For New Users (Start with m12x)¶
- Explore m12x data: `python analyze.py evals data/dataset-m12x.json`
- Visual discovery: Open leaderboard, spiderweb, and explorer
- Learn by doing: Run forensic analysis on interesting patterns
- Read technical-details.md for statistical concepts
- Follow index.md to add your own models
For Researchers (Use m12x as Your Dataset)¶
- Start analysis immediately: No inference needed, 6.5B tokens ready to explore
- Review tools/analyze.md for complete forensic capabilities
- Study workflow.md for discovery-investigation patterns
- Consult tasks.md to understand manifold design
- Extend m12x: Add your own models to the reference dataset
For Developers (Fork m12x as Template)¶
- Study m12x structure: `data/dataset-m12x.json` and `tasks/*.json`
- Examine config.md for manifold/tier/surface definitions
- Review tools.md for pipeline integration
- Adapt for your needs: Copy manifolds, modify difficulty ranges, add new surfaces
For LLM Agents (m12x is Agent-Ready)¶
- Start with `analyze.py evals data/dataset-m12x.json --format json`
- Use `analyze.py tasks` to discover available surfaces and projections
- Query with `--format json` for machine-readable outputs
- Follow workflow.md for systematic research patterns
See Also¶
Foundation Documents:
- challenges.md - The eight fundamental challenges in current LLM evaluation
- insight.md - LLMs as information processors and system architecture
Core Documentation:
- m12x.md - The extraordinary evidence (reference evaluation + research dataset)
- implementation.md - The Python codebase that realizes this methodology
Deep-Dive Design:
- manifold.md - Two-plane data model design decisions
- reasonscore.md - Unified metric architecture and design rationale
Reference Documentation:
- workflow.md - Four research workflow patterns with examples
- tasks.md - Abstract task API specifications
- config.md - Configuration reference (manifolds, templates, samplers)
- pointsdb.md - Complete data structure API
- tools.md - Complete tool reference