
ReasonScape Architecture: The Methodology

Prerequisites: Before reading this document, familiarize yourself with:

  • challenges.md - The eight fundamental problems this methodology addresses
  • insight.md - The information processing paradigm that informs the design


Overview

ReasonScape's methodology addresses the eight challenges in current LLM evaluation through a systematic, information-theoretic approach grounded in the insight that LLMs are information processors.

How the methodology addresses each challenge:

Challenge                                         Solution
1. Doesn't Know What It's Asking                  Parametric manifolds with coordinate-based test generation
2. Doesn't Know Which Answers                     Per-point evaluation with Wilson confidence intervals
3. Doesn't Understand Reasoning Process           Information-theoretic forensics (compression, FFT, hazard)
4. Can't Distinguish Signal from Noise            Excess accuracy correction and proper uncertainty quantification
5. Trivially Gameable                             Deterministic but unmemorizable coordinate-based generation
6. Ceiling Effects                                Parametric difficulty scaling that adapts to model capabilities
7. Ignores Truncations and Context Failures       Explicit truncation tracking and penalty in ReasonScore
8. Ignores Token Budget and Resource Efficiency   score/token metric and per-point token consumption tracking

The Five-Stage Architecture

The architectural solution is a multi-stage data-processing pipeline:

graph TB
    subgraph "Data Pipeline"
        A[Stage 1: Definition] --> B[Stage 2: Execution]
        B --> C[Stage 3: Evaluation]
    end

    C --> D[PointsDB]
    D --> E[Stage 4: Discovery]
    D --> F[Stage 5: Investigation]

    subgraph "Research Loop"
        E <-.ping-pong.-> F
    end

    E --> G[Research Insights]
    F --> G
    G -.inform.-> A

    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#fce4ec
    style G fill:#ffebee

Stage 1: Definition — Parametric Test Generation

Key innovation: Test generators create infinite unique instances within controlled difficulty manifolds.

Every test is deterministically generated from coordinate seeds:

seed = hash(task, parameters, sample_index)

Same coordinates always produce same test sequence. Different coordinates produce different tests. The manifold is infinite but reproducible.
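
The idea in a minimal sketch (the hash construction and names below are illustrative, not ReasonScape's actual seeding algorithm, which is documented in technical-details.md):

import hashlib
import random

def coordinate_seed(task: str, params: dict, sample_index: int) -> int:
    """Derive a deterministic seed from a manifold coordinate."""
    key = f"{task}|{sorted(params.items())}|{sample_index}"
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)

# Same coordinates -> same rng state -> same test; different coordinates -> different test.
rng = random.Random(coordinate_seed("arithmetic", {"length": 18, "depth": 3}, sample_index=0))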

Manifold dimensions control difficulty:

  • Length (working memory load)
  • Depth (structural complexity)
  • Interference (selective attention demand)
  • Format (tokenization stress)
  • Multi-step operations (sequential reasoning)

Progressive complexity controls:

  • Degree (0-2): Easy, Medium, Hard difficulty ranges
  • Density (corner/lowdef/normal): Which points to sample
  • Precision (low/medium/high): How many tests per point
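
For illustration only, a hypothetical point specification that ties the manifold dimensions and complexity controls together (field names here are assumptions; the real schema lives in config.md and technical-details.md):

# Hypothetical manifold point: difficulty dimensions plus execution controls.
point_spec = {
    "task": "arithmetic",
    "params": {              # manifold coordinates (difficulty dimensions)
        "length": 18,        # working memory load
        "depth": 3,          # structural complexity
        "interference": 2,   # selective attention demand
    },
    "degree": 1,             # 0-2: easy / medium / hard difficulty range
    "density": "normal",     # corner / lowdef / normal: which points to sample
    "precision": "high",     # low / medium / high: how many tests per point
}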

See technical-details.md for coordinate-based seeding algorithm and manifold resolution mechanics.

Stage 2: Execution — Efficient Inference at Scale

Key innovations: Response caching, adaptive sampling, and hierarchical evaluation.

Response caching:

  • Every unique prompt is cached
  • Deterministic generation ensures cache hits
  • Typical cost reduction: 60-80% for multi-tier evaluation
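
A minimal sketch of prompt-keyed caching, assuming the cache key covers the exact prompt plus sampler settings (the actual implementation is described in technical-details.md):

import hashlib

class ResponseCache:
    """Cache keyed on the exact prompt plus sampler settings.

    Because test generation is deterministic, re-running an evaluation
    reproduces identical prompts and hits the cache instead of the API.
    """
    def __init__(self):
        self._store = {}

    def _key(self, prompt: str, sampler: dict) -> str:
        return hashlib.sha256(f"{prompt}|{sorted(sampler.items())}".encode()).hexdigest()

    def get_or_call(self, prompt: str, sampler: dict, call_model):
        key = self._key(prompt, sampler)
        if key not in self._store:
            self._store[key] = call_model(prompt)   # only pay for cache misses
        return self._store[key]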

Adaptive sampling:

  • Easy points converge quickly (few samples needed)
  • Hard points get more samples (more rounds for precision)
  • Truncation-heavy points abort early (don't waste tokens)
  • Statistical confidence guaranteed by CI tracking
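
A simplified sketch of the adaptive loop, assuming a Wilson half-width target and a truncation-rate abort threshold (the thresholds and helper names are illustrative; the actual confidence targeting algorithm is in technical-details.md):

import math

def wilson_halfwidth(successes: int, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 1.0
    p = successes / n
    denom = 1 + z * z / n
    return (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom

def sample_point(run_one_test, target_halfwidth=0.05, max_samples=256, max_trunc_rate=0.5):
    """Sample until the CI is tight enough, the budget runs out, or truncations dominate."""
    correct = truncated = n = 0
    while n < max_samples:
        result = run_one_test(n)            # deterministic test at sample_index = n
        n += 1
        correct += result == "correct"
        truncated += result == "truncated"
        if truncated / n > max_trunc_rate and n >= 16:
            break                           # truncation-heavy point: stop wasting tokens
        if wilson_halfwidth(correct, n) <= target_halfwidth:
            break                           # easy points converge with few samples
    return correct, truncated, n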

Hierarchical sampling:

  • Tests at count=32 are a perfect subset of count=128
  • Can upsample without waste
  • Can downsample for quick comparison
  • Supports progressive evaluation workflows
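
A quick demonstration of the subset property, using the same illustrative hash-based seeding sketched in Stage 1:

import hashlib

def test_seed(task: str, params: str, sample_index: int) -> int:
    # sample_index alone distinguishes tests within a point, so smaller counts
    # are exact prefixes of larger ones
    key = f"{task}|{params}|{sample_index}".encode()
    return int(hashlib.sha256(key).hexdigest(), 16)

small = [test_seed("arithmetic", "length=18", i) for i in range(32)]
large = [test_seed("arithmetic", "length=18", i) for i in range(128)]
assert large[:32] == small   # upsample later without waste, or downsample for a quick comparison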

See technical-details.md for caching implementation, confidence targeting algorithm, and truncation-aware execution.

Stage 3: Evaluation — Statistical Rigor Without Lies

Key innovations: Excess accuracy correction, truncation awareness, semantic tier mapping, and pre-computed forensics.

Excess accuracy correction:

  • Removes expected guessing contributions
  • 0.000 = no better than guessing
  • 1.000 = perfect knowledge
  • Fair comparison across all task types
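
One standard normalization that matches this description (whether ReasonScape uses exactly this formula is specified in technical-details.md):

def excess_accuracy(observed: float, chance: float) -> float:
    """Rescale accuracy so 0.000 = guessing baseline and 1.000 = perfect knowledge.

    `chance` is the expected accuracy of random guessing for the task,
    e.g. 0.25 for a 4-way multiple choice question.
    """
    if chance >= 1.0:
        return 0.0
    return max(0.0, (observed - chance) / (1.0 - chance))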

Truncation awareness (Challenge #7):

  • Truncations tracked separately from errors
  • Not wrong answers—context limit failures that waste resources
  • Direct penalty in ReasonScore (subtracted from point score)
  • Widen confidence intervals (reduced effective sample size)
  • Report explicitly in all visualizations
  • Why this matters: Pass@k metrics hide that a model might need 10 attempts to produce valid output, masking deployment reliability issues

Semantic tier mapping:

  • (degree, density) execution parameters → tier labels
  • (0, normal) → "easy", (1, normal) → "medium", (2, normal) → "hard"
  • Stable tier labels as execution strategies evolve
  • Enables adaptive difficulty (add "ultra" when "hard" saturates)
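
A minimal sketch of the mapping (the full mapping, including non-normal densities, is part of the configuration; see config.md):

# (degree, density) execution parameters -> stable semantic tier labels.
TIER_MAP = {
    (0, "normal"): "easy",
    (1, "normal"): "medium",
    (2, "normal"): "hard",
    # adaptive difficulty: e.g. (3, "normal") -> "ultra" once "hard" saturates
}

def tier_for(degree: int, density: str) -> str:
    return TIER_MAP.get((degree, density), "unmapped")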

Pre-computed forensics:

  • Compression arrays (for entropy analysis)
  • FFT arrays (for spectral analysis)
  • Token distributions (for hazard analysis)
  • 10-100x speedup for Stage 5 investigations
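
A rough, hypothetical illustration of what such pre-computation can look like; the array definitions here (zlib compression ratio, FFT of the prompt's token-id sequence) are assumptions for the sketch, not ReasonScape's actual arrays, which are documented in technical-details.md:

import zlib
import numpy as np

def precompute_forensics(prompt_token_ids: list[int], reasoning_text: str, completion_tokens: int) -> dict:
    """Illustrative per-sample forensic pre-computation (not the production arrays)."""
    raw = reasoning_text.encode("utf-8")
    # compression ratio: highly compressible traces suggest repetition / reasoning loops
    compression = len(zlib.compress(raw)) / max(len(raw), 1)
    # spectral signature of the prompt's token-id sequence (tokenization stress shows up here)
    ids = np.asarray(prompt_token_ids, dtype=float)
    fft = np.abs(np.fft.rfft(ids - ids.mean())) if ids.size else np.zeros(0)
    return {
        "compression": compression,
        "fft": fft,
        "completion_tokens": completion_tokens,  # aggregated into distributions for hazard analysis
    }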

See technical-details.md for Wilson CI algorithm, excess accuracy computation, and pre-computation mechanics.

The Two-Plane Data Model

ReasonScape organizes evaluation data using PointsDB, a two-plane structure where each point exists simultaneously in both an Evaluation Plane and a Task-Complexity Plane.

Why two planes?

Traditional benchmarks are flat: (model, task) → score

This can't answer:

  • WHERE in complexity space does the model fail?
  • HOW does performance change as difficulty increases?
  • WHAT architectural patterns emerge across difficulty levels?

The structure:

                      EVALUATION PLANE            TASK-COMPLEXITY PLANE

IDENTITY (5D)         model                       base_task
                      template                    params
                      sampler

FACETS                eval_id                     tiers[]
                      groups[]                    surfaces[]
                                                  projections[]

Identity dimensions (5D) uniquely define a point:

  • Evaluation Plane: model, template, sampler
  • Task-Complexity Plane: base_task, params

Points with identical 5D identity are de-duplicated.

Facet dimensions provide multi-valued organizational views:

  • Evaluation facets: eval_id (shorthand), groups[] (arch:moe, size:large, etc.)
  • Complexity facets: tiers[] (easy/medium/hard), surfaces[] (2D slices), projections[] (1D sweeps)

Points can belong to multiple facets simultaneously.
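
A hypothetical sketch of one point's identity and facets, using the field names listed above (the authoritative schema and query API are in pointsdb.md):

from dataclasses import dataclass, field

@dataclass(frozen=True)
class PointIdentity:
    # Evaluation Plane
    model: str
    template: str
    sampler: str
    # Task-Complexity Plane
    base_task: str
    params: tuple          # e.g. (("length", 18), ("depth", 3)) -- hashable for de-duplication

@dataclass
class Point:
    identity: PointIdentity
    # multi-valued facet views; a point may appear in several at once
    eval_id: str = ""
    groups: list = field(default_factory=list)       # e.g. ["arch:moe", "size:large"]
    tiers: list = field(default_factory=list)        # e.g. ["medium"]
    surfaces: list = field(default_factory=list)     # 2D slices
    projections: list = field(default_factory=list)  # 1D sweeps

# Identity-based de-duplication: points with equal 5D identity map to the same key.
points: dict[PointIdentity, Point] = {}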

Key properties:

  • Orthogonality: Same model tested at many difficulty levels; many models tested at same difficulty level
  • Faceted organization: Filter by tier, group by architecture, slice by surface—all from the same data
  • Identity-based de-duplication: Running the same evaluation twice doesn't create duplicates

For detailed design rationale, orthogonality principles, and facet computation: See manifold.md

For complete PointsDB API and query patterns: See pointsdb.md

ReasonScore: The Unified Metric

ReasonScore captures six dimensions of model performance in a single interpretable number:

What it measures:

  1. Accuracy - Correctness above random guessing baseline (Challenge #4)
  2. Statistical confidence - Uncertainty from finite sampling (Challenge #4)
  3. Context reliability - Truncation and context limit issues (Challenge #7)
  4. Task balance - Performance consistency across reasoning domains (Challenge #2)
  5. Difficulty scaling - Capability maintenance under increasing complexity (Challenge #1, #6)
  6. Token efficiency - Computational cost per unit of quality (Challenge #8)

How it's computed:

Layer 1: Samples → Point Score       [Wilson CI + truncation penalty]
Layer 2: Points → Task Score         [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore    [Geometric Mean × 1000]
Layer 4: Tiers → score/token         [Arithmetic Mean ÷ median tokens]
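
A compressed sketch of the layering, assuming Wilson bounds for the confidence margin and a direct per-point truncation subtraction (Layer 2's re-aggregation is omitted for brevity; the authoritative computation is in reasonscore.md):

import math
from statistics import geometric_mean, median

def wilson_upper(successes: int, n: int, z: float = 1.96) -> float:
    """Upper Wilson bound: 'optimistic about uncertainty'."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))) / denom
    return min(1.0, center + half)

def point_score(correct: int, truncated: int, n: int) -> float:
    # Layer 1: optimistic accuracy bound minus a penalty for truncations
    return max(0.0, wilson_upper(correct, n) - truncated / max(n, 1))

def tier_reasonscore(task_scores: list[float]) -> float:
    # Layer 3: geometric mean punishes imbalance across tasks, scaled toward 0-1000
    return geometric_mean([max(s, 1e-6) for s in task_scores]) * 1000

def score_per_token(tier_scores: list[float], tokens_per_point: list[int]) -> float:
    # Layer 4: arithmetic mean of tier scores divided by median token consumption
    return (sum(tier_scores) / len(tier_scores)) / median(tokens_per_point)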

Design philosophy:

  • Optimistic about uncertainty - Add confidence margin (statistical uncertainty is "our fault")
  • Pessimistic about failures - Subtract truncation penalty (Challenge #7: context limits are "model's fault")
  • Punish imbalance - Geometric mean across tasks (catastrophic failure in one domain hurts overall score)
  • Account for efficiency - Divide by tokens (Challenge #8: being right isn't enough if it's expensive)

Why geometric mean?

Unlike arithmetic mean, geometric mean penalizes inconsistency. Being great at 11 tasks doesn't excuse catastrophic failure at 1 task. In real deployment, users hit all task types.
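
A quick worked illustration of the difference:

from statistics import geometric_mean

scores = [0.90] * 11 + [0.05]        # great at 11 tasks, catastrophic at 1
print(sum(scores) / len(scores))     # arithmetic mean ~ 0.83 -- the failure barely shows
print(geometric_mean(scores))        # geometric mean ~ 0.71 -- the failure drags the score down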

Why score/token?

Two models with identical accuracy can differ by 5-10x in token consumption. A Model A that uses 500 tokens/problem and a Model B that uses 5,000 tokens/problem have radically different deployment characteristics:

  • Cost: 10x difference in API bills
  • Latency: 10x difference in user wait times
  • Throughput: 10x difference in concurrent users supported
  • Environment: 10x difference in energy consumption

Accuracy-only metrics treat these as equivalent. They're not. The final score/token ratio makes efficiency a first-class concern, answering: "How much quality per unit of resource?"

For complete layer-by-layer computation, design rationale, and philosophical motivation: See reasonscore.md

Stage 4: Discovery — Visual Pattern Recognition

Purpose: Answer "WHAT is interesting?"

After Stage 3, you have a complete PointsDB. But where do you start? Discovery tools optimize for pattern recognition and hypothesis formation.

Three complementary perspectives:

1. Leaderboard — "What's the big picture?"

  • Aggregate rankings with ReasonScore
  • Heatmap visualization (models × tasks × tiers)
  • Color gradients reveal failure patterns
  • Truncation indicators show context issues
  • Group filtering enables peer comparison

2. Spider Plots — "What's this model's cognitive fingerprint?"

  • Radar chart across 12 reasoning domains
  • Cognitive archetype identification (9 recognizable patterns)
  • Difficulty scaling behavior (easy/medium/hard)
  • Token efficiency overlay
  • Cross-task consistency analysis

3. Explorer — "Where in the manifold does behavior change?"

  • Interactive 3D surfaces (accuracy = Z-axis, params = X/Y)
  • Capability zones (green plateaus = success regions)
  • Failure boundaries (red cliffs = performance drop-offs)
  • Multi-panel analysis (FFT, accuracy, token distributions)
  • Point inspection (click to see test samples and responses)

Progressive discovery flow:

BROAD: Leaderboard → Identify candidates
    ↓
FOCUSED: Spider → Identify strengths/weaknesses
    ↓
SPECIFIC: Explorer → Identify failure boundaries

See workflow.md for complete discovery workflows.

Stage 5: Investigation — Systematic Forensic Analysis

Purpose: Answer "WHY is it interesting?"

Discovery reveals patterns. Investigation explains mechanisms. Real research ping-pongs between both.

Information processing analysis tools

Understanding why a model fails requires investigating four information spaces:

  • surface - Where does performance break down? Look at OUTPUT.
  • fft - How is the problem represented? Look at INPUT.
  • compression - What is the information quality? Look at REASONING (spatial/entropy).
  • hazard - When does thinking degrade? Look at REASONING (temporal/timing).

Discovery support tools:

  • evals - Evaluation discovery with fuzzy search
  • tasks - Task structure discovery (surfaces/projections)
  • modelinfo - Architecture-aware interpretation

Statistical validation:

  • cluster - CI-overlap grouping (distinguish signal from noise)
  • scores - Statistical rankings with CI
  • spiderweb - Complete single-model fingerprinting
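
A rough sketch of what CI-overlap grouping can look like (a greedy grouping pass; the cluster tool's actual algorithm is documented in tools/analyze.md):

def ci_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """Two (low, high) confidence intervals overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def cluster_by_ci(models: dict[str, tuple[float, float]]) -> list[list[str]]:
    """Group models whose score CIs overlap: within a group, differences are noise."""
    clusters: list[list[str]] = []
    # sort by lower bound so the strongest candidates seed the groups
    for name, ci in sorted(models.items(), key=lambda kv: kv[1][0], reverse=True):
        for group in clusters:
            if all(ci_overlap(ci, models[m]) for m in group):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters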

Example investigation flow:

Stage 4: "Model X fails at arithmetic length=18"
    ↓
Stage 5 surface: "Failure boundary confirmed at length=18"
    ↓
Stage 5 fft: "Tokenization not the issue"
    ↓
Stage 5 compression: "Reasoning traces become 3x more compressible"
    ↓
ROOT CAUSE: Information loss / reasoning loops

See tools/analyze.md for complete forensic toolkit reference.

The Discovery-Investigation Loop

Stages 4 and 5 form a research loop, not a linear pipeline:

Discovery reveals patterns → Investigation explains mechanisms
        ↑                                    ↓
        └──── Investigation finds anomalies ─┘
                            ↓
                  Both inform Stage 1 (new manifolds)

Key insight: After Stage 3, research isn't sequential. You ping-pong based on what you're trying to understand at each moment.

Example ping-pong:

  1. Discovery (leaderboard): "Model A and B look similar"
  2. Investigation (cluster): "Overlapping CIs confirm equivalence"
  3. Discovery (spider): "But different cognitive profiles"
  4. Investigation (surface): "Model A has cliff at depth=3, Model B smooth degradation"
  5. Investigation (compression): "Model A enters loops, Model B maintains entropy"
  6. Finding: Same aggregate score, different failure modes

See workflow.md for four research workflow patterns showing when to use discovery vs investigation.


Proving It Works

m12x validates this architecture.

75+ models, 12 reasoning tasks, 6.5B tokens, 150K+ evaluation points.

This isn't hypothetical. Every design pattern—the manifold definitions, the tier mappings, the precision configurations, the cognitive archetypes, the forensic workflows—has been battle-tested through real evaluation at production scale.

m12x serves three purposes:

  1. Validates the architecture — Proves ReasonScape works (not vaporware)
  2. Provides research-ready data — Enables immediate analysis without inference costs
  3. Demonstrates design patterns — Shows concrete choices others can adapt

The extraordinary evidence:

  • FFT reveals spectral signatures that differ by tokenizer/architecture
  • Compression shows underthink, overthink, and broken-loop patterns
  • Hazard analysis proves models have measurable "thinking budgets"
  • Surface plots reveal capability boundaries nobody knew existed
  • Statistical rigor confirms these patterns are signal, not noise

For complete m12x documentation, configuration details, and usage guide: See m12x.md


The Four Research Workflows

The architecture enables four distinct research workflows, each using different tool combinations:

1. Ranking & Benchmarking

Question: "What's the best model overall?"

Tools: Leaderboard (Stage 4) → scores + cluster (Stage 5)

Flow: Discovery → Investigation (quick validation)

Duration: 2-3 minutes

2. Comparative Evaluation

Question: "Which models are truly different?"

Tools: cluster (Stage 5) → spiderweb + explorer (Stage 4)

Flow: Investigation → Discovery (visual confirmation)

Duration: 5-10 minutes

3. Model Characterization

Question: "What are this model's strengths and weaknesses?"

Tools: spiderweb + explorer (Stage 4) → surface + compression + hazard (Stage 5)

Flow: Discovery ↔ Investigation (heavy ping-pong)

Duration: 5-10 minutes

4. Failure Diagnosis

Question: "Why did this model fail?"

Tools: explorer (Stage 4) → surface + fft + compression + hazard (Stage 5)

Flow: Discovery → Investigation → Discovery → Investigation (deep iteration)

Duration: 10-20 minutes

For detailed workflow examples with command sequences: See workflow.md


Interconnections: How It All Fits Together

The five stages form an interconnected research platform with forward data flow and iterative discovery-investigation loops.

flowchart TB
    subgraph Pipeline["Data Production Pipeline"]
        S1["Stage 1: Definition"]
        S2["Stage 2: Execution"]
        S3["Stage 3: Evaluation"]
        DB[("PointsDB")]
        S1 --> S2 --> S3 --> DB
    end

    subgraph Loop["Analysis Loop"]
        S4["Stage 4: Discovery<br/>(leaderboard, spider, explorer)"]
        S5["Stage 5: Investigation<br/>(surface, fft, compression, hazard)"]
        RF["Research Findings"]

        S4 --> S5
        S5 --> S4
        S5 --> RF
    end

    subgraph Research["Research Loop"]
        NM["New manifold designs<br/>Hypothesis tests<br/>Difficulty refinements"]
    end

    DB --> Loop
    RF --> Research
    Research -.inform.-> S1

    style Pipeline fill:#e1f5fe
    style Loop fill:#f3e5f5
    style Research fill:#ffebee

What makes this work:

  1. Unified Data Layer - Stages 4 and 5 access identical PointsDB via API
  2. Complementary Modalities - Discovery optimizes for pattern recognition, investigation for root causes
  3. Flexible Entry Points - Start wherever makes sense for your research question
  4. Iterative Refinement - Each cycle improves understanding
  5. Research Loop Closure - Findings drive design, enabling science not just benchmarking

Next Steps

For New Users (Start with m12x)

  1. Explore m12x data: python analyze.py evals data/dataset-m12x.json
  2. Visual discovery: Open leaderboard, spiderweb, and explorer
  3. Learn by doing: Run forensic analysis on interesting patterns
  4. Read technical-details.md for statistical concepts
  5. Follow index.md to add your own models

For Researchers (Use m12x as Your Dataset)

  1. Start analysis immediately: No inference needed, 6.5B tokens ready to explore
  2. Review tools/analyze.md for complete forensic capabilities
  3. Study workflow.md for discovery-investigation patterns
  4. Consult tasks.md to understand manifold design
  5. Extend m12x: Add your own models to the reference dataset

For Developers (Fork m12x as Template)

  1. Study m12x structure: data/dataset-m12x.json and tasks/*.json
  2. Examine config.md for manifold/tier/surface definitions
  3. Review tools.md for pipeline integration
  4. Adapt for your needs: Copy manifolds, modify difficulty ranges, add new surfaces

For LLM Agents (m12x is Agent-Ready)

  1. Start with analyze.py evals data/dataset-m12x.json --format json
  2. Use analyze.py tasks to discover available surfaces and projections
  3. Query with --format json for machine-readable outputs
  4. Follow workflow.md for systematic research patterns

See Also

Foundation Documents:

  • challenges.md - The eight fundamental challenges in current LLM evaluation
  • insight.md - LLMs as information processors and system architecture

Core Documentation:

  • m12x.md - The extraordinary evidence (reference evaluation + research dataset)
  • implementation.md - The Python codebase that realizes this methodology

Deep-Dive Design:

  • manifold.md - Two-plane data model design, orthogonality principles, and facet computation
  • reasonscore.md - Layer-by-layer ReasonScore computation and design rationale
  • technical-details.md - Seeding, caching, Wilson CI, and pre-computation mechanics

Reference Documentation:

  • workflow.md - Four research workflow patterns with examples
  • tasks.md - Abstract task API specifications
  • config.md - Configuration reference (manifolds, templates, samplers)
  • pointsdb.md - Complete data structure API
  • tools.md - Complete tool reference