ReasonScape Architecture: The Methodology¶
Prerequisites: Before reading this document, familiarize yourself with:

- challenges.md - The eight fundamental problems this methodology addresses
- insight.md - The information processing paradigm that informs the design
Overview¶
ReasonScape's methodology addresses the eight challenges in current LLM evaluation through a systematic, information-theoretic approach grounded in the insight that LLMs are information processors.
How the methodology addresses each challenge:
| Challenge | Solution |
|---|---|
| 1. Doesn't Know What It's Asking | Parametric manifolds with coordinate-based test generation |
| 2. Doesn't Know Which Answers | Per-point evaluation with Wilson confidence intervals |
| 3. Doesn't Understand Reasoning Process | Information-theoretic forensics (compression, FFT, hazard) |
| 4. Can't Distinguish Signal from Noise | Excess accuracy correction and proper uncertainty quantification |
| 5. Trivially Gameable | Deterministic but unmemorizable coordinate-based generation |
| 6. Ceiling Effects | Parametric difficulty scaling that adapts to model capabilities |
| 7. Ignores Truncations and Context Failures | Explicit truncation tracking and penalty in ReasonScore |
| 8. Ignores Token Budget and Resource Efficiency | score/token metric and per-point token consumption tracking |
The Five-Stage Architecture¶
The architectural solution is a multi-stage data-processing pipeline:
```mermaid
graph TB
    subgraph "Data Pipeline"
        A[Stage 1: Definition] --> B[Stage 2: Execution]
        B --> C[Stage 3: Evaluation]
    end
    C --> D[PointsDB]
    D --> E[Stage 4: Discovery]
    D --> F[Stage 5: Investigation]
    subgraph "Research Loop"
        E <-.ping-pong.-> F
    end
    E --> G[Research Insights]
    F --> G
    G -.inform.-> A
    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#fce4ec
    style G fill:#ffebee
```
Stage 1: Definition — Parametric Test Generation¶
Key innovation: Test generators create infinite unique instances within controlled difficulty manifolds.
Every test is deterministically generated from coordinate seeds:
```
seed = hash(task, parameters, sample_index)
```
Same coordinates always produce same test sequence. Different coordinates produce different tests. The manifold is infinite but reproducible.
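A minimal sketch of what coordinate-based seeding can look like (the function name `make_seed` and the hashing scheme are illustrative assumptions; the actual algorithm is documented in technical-details.md):

```python
import hashlib
import random

def make_seed(task: str, parameters: dict, sample_index: int) -> int:
    # Hash the full coordinate tuple into a 64-bit integer seed.
    key = f"{task}|{sorted(parameters.items())}|{sample_index}"
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same coordinates -> same seed -> same generated test instance; changing any
# coordinate (including sample_index) yields a different, equally reproducible test.
rng = random.Random(make_seed("arithmetic", {"length": 18, "depth": 2}, sample_index=7))
```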
Manifold dimensions control difficulty:
- Length (working memory load)
- Depth (structural complexity)
- Interference (selective attention demand)
- Format (tokenization stress)
- Multi-step operations (sequential reasoning)
Progressive complexity controls:
- Degree (0-2): Easy, Medium, Hard difficulty ranges
- Density (corner/lowdef/normal): Which points to sample
- Precision (low/medium/high): How many tests per point
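For concreteness, a hypothetical execution configuration using these three controls (field names are assumptions; see config.md for the real schema):

```python
# Hypothetical execution configuration -- the authoritative schema is in config.md.
execution = {
    "degree": 1,          # 0 = easy, 1 = medium, 2 = hard difficulty range
    "density": "normal",  # corner / lowdef / normal: which points to sample
    "precision": "high",  # low / medium / high: how many tests per point
}
```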
See technical-details.md for coordinate-based seeding algorithm and manifold resolution mechanics.
Stage 2: Execution — Efficient Inference at Scale¶
Key innovations: Response caching, adaptive sampling, and hierarchical evaluation.
Response caching:
- Every unique prompt is cached
- Deterministic generation ensures cache hits
- Typical cost reduction: 60-80% for multi-tier evaluation
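A sketch of the caching idea, assuming a key built from the model, sampler settings, and the exact prompt (`cached_completion` and `run_inference` are illustrative names, not the real API):

```python
import hashlib

_cache: dict[str, str] = {}

def cached_completion(model: str, prompt: str, sampler: dict, run_inference) -> str:
    # Deterministic test generation reproduces identical prompts across runs,
    # so identical (model, sampler, prompt) triples hit the same cache entry.
    key = hashlib.sha256(f"{model}|{sorted(sampler.items())}|{prompt}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = run_inference(model, prompt, sampler)
    return _cache[key]
```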
Adaptive sampling:
- Easy points converge quickly (few samples needed)
- Hard points get more samples (more rounds for precision)
- Truncation-heavy points abort early (don't waste tokens)
- Statistical confidence guaranteed by CI tracking
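The stopping rule can be pictured as sampling until the Wilson confidence interval is narrow enough; a sketch of that idea (the actual confidence-targeting algorithm is in technical-details.md):

```python
import math

def wilson_halfwidth(successes: int, n: int, z: float = 1.96) -> float:
    # Half-width of the Wilson score interval for a binomial proportion.
    if n == 0:
        return 1.0
    p = successes / n
    denom = 1 + z * z / n
    return (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))

# Illustrative adaptive loop: stop sampling once the interval is tight enough.
# Easy points converge in a few rounds; hard points need more.
#   while wilson_halfwidth(correct, total) > target_halfwidth:
#       correct, total = run_another_round()
```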
Hierarchical sampling:
- Tests at count=32 are perfect subset of count=128
- Can upsample without waste
- Can downsample for quick comparison
- Supports progressive evaluation workflows
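Because the seed depends only on (task, parameters, sample_index), the first 32 indices generate identical tests whether the target count is 32 or 128; reusing the illustrative `make_seed` sketch from Stage 1:

```python
point_params = {"length": 18, "depth": 2}
seeds_32 = [make_seed("arithmetic", point_params, i) for i in range(32)]
seeds_128 = [make_seed("arithmetic", point_params, i) for i in range(128)]
assert seeds_128[:32] == seeds_32  # count=32 is a strict prefix of count=128
```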
See technical-details.md for caching implementation, confidence targeting algorithm, and truncation-aware execution.
Stage 3: Evaluation — Statistical Rigor Without Lies¶
Key innovations: Excess accuracy correction, truncation awareness, semantic tier mapping, and pre-computed forensics.
Excess accuracy correction:
- Removes expected guessing contributions
- 0.000 = no better than guessing
- 1.000 = perfect knowledge
- Fair comparison across all task types
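One common way to remove the guessing contribution is to rescale accuracy against the chance baseline; a sketch assuming that form (the exact computation is in technical-details.md):

```python
def excess_accuracy(raw_accuracy: float, chance_rate: float) -> float:
    # Rescale so 0.0 = guessing baseline, 1.0 = perfect knowledge.
    return max(0.0, (raw_accuracy - chance_rate) / (1.0 - chance_rate))

# A 4-way multiple-choice task: 70% raw accuracy -> 0.6 excess accuracy.
print(excess_accuracy(0.70, chance_rate=0.25))  # ~0.6
```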
Truncation awareness (Challenge #7):
- Truncations tracked separately from errors
- Not wrong answers—context limit failures that waste resources
- Direct penalty in ReasonScore (subtracted from point score)
- Widen confidence intervals (reduced effective sample size)
- Report explicitly in all visualizations
- Why this matters: Pass@k metrics hide that a model might need 10 attempts to produce valid output, masking deployment reliability issues
Semantic tier mapping:
- `(degree, density)` execution parameters → tier labels
- `(0, normal)` → "easy", `(1, normal)` → "medium", `(2, normal)` → "hard"
- Stable tier labels as execution strategies evolve
- Enables adaptive difficulty (add "ultra" when "hard" saturates)
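As a sketch, the mapping can be thought of as a small lookup table (illustrative; the real mapping lives in the evaluation configuration described in config.md):

```python
# Illustrative lookup: (degree, density) execution parameters -> stable tier labels.
TIER_MAP = {
    (0, "normal"): "easy",
    (1, "normal"): "medium",
    (2, "normal"): "hard",
    # (3, "normal"): "ultra",  # added if "hard" saturates, without renaming existing tiers
}
```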
Pre-computed forensics:
- Compression arrays (for entropy analysis)
- FFT arrays (for spectral analysis)
- Token distributions (for hazard analysis)
- 10-100x speedup for Stage 5 investigations
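One plausible shape for this pre-computation, assuming zlib-style compression ratios over reasoning text and an FFT over the token-ID sequence (function name, fields, and array contents are all assumptions; see technical-details.md for the actual mechanics):

```python
import zlib
import numpy as np

def forensic_arrays(reasoning_text: str, token_ids: list[int]) -> dict:
    raw = reasoning_text.encode()
    return {
        # Low compression ratio suggests repetitive, loopy reasoning (low entropy).
        "compression_ratio": len(zlib.compress(raw)) / max(len(raw), 1),
        # Spectral magnitudes of the token-ID sequence, stored for later FFT analysis.
        "fft_magnitude": np.abs(np.fft.rfft(np.asarray(token_ids, dtype=float))).tolist(),
        # Token count feeds the hazard (timing) analysis.
        "token_count": len(token_ids),
    }
```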
See technical-details.md for Wilson CI algorithm, excess accuracy computation, and pre-computation mechanics.
The Two-Plane Data Model¶
ReasonScape organizes evaluation data using PointsDB, a two-plane structure where each point exists simultaneously in both an Evaluation Plane and a Task-Complexity Plane.
Why two planes?
Traditional benchmarks are flat: (model, task) → score
This can't answer:

- WHERE in complexity space does the model fail?
- HOW does performance change as difficulty increases?
- WHAT architectural patterns emerge across difficulty levels?
The structure:
| | EVALUATION | TASK-COMPLEXITY |
|---|---|---|
| IDENTITY (5D) | `model`, `template`, `sampler` | `base_task`, `params` |
| FACETS | `eval_id`, `groups[]` | `tiers[]`, `surfaces[]`, `projections[]` |
Identity dimensions (5D) uniquely define a point:
- Evaluation Plane: `model`, `template`, `sampler`
- Task-Complexity Plane: `base_task`, `params`
Points with identical 5D identity are de-duplicated.
Facet dimensions provide multi-valued organizational views:
- Evaluation facets: `eval_id` (shorthand), `groups[]` (arch:moe, size:large, etc.)
- Complexity facets: `tiers[]` (easy/medium/hard), `surfaces[]` (2D slices), `projections[]` (1D sweeps)
Points can belong to multiple facets simultaneously.
Key properties:
- Orthogonality: Same model tested at many difficulty levels; many models tested at same difficulty level
- Faceted organization: Filter by tier, group by architecture, slice by surface—all from the same data
- Identity-based de-duplication: Running the same evaluation twice doesn't create duplicates
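For intuition, a single point might carry fields like the following (model names and facet values are hypothetical; the authoritative schema is in pointsdb.md):

```python
# Hypothetical point record -- field names follow the description above,
# values are made up; the authoritative schema is in pointsdb.md.
point = {
    # Identity (5D): the de-duplication key
    "model": "example-model-32b",
    "template": "chatml",
    "sampler": "default",
    "base_task": "arithmetic",
    "params": {"length": 18, "depth": 2},
    # Facets: multi-valued organizational views
    "eval_id": "example-model-32b",
    "groups": ["arch:moe", "size:large"],
    "tiers": ["medium"],
    "surfaces": ["length_x_depth"],
    "projections": ["length_sweep"],
}
```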
For detailed design rationale, orthogonality principles, and facet computation: See manifold.md
For complete PointsDB API and query patterns: See pointsdb.md
ReasonScore: The Unified Metric¶
ReasonScore captures six dimensions of model performance in a single interpretable number:
What it measures:
- Accuracy - Correctness above random guessing baseline (Challenge #4)
- Statistical confidence - Uncertainty from finite sampling (Challenge #4)
- Context reliability - Truncation and context limit issues (Challenge #7)
- Task balance - Performance consistency across reasoning domains (Challenge #2)
- Difficulty scaling - Capability maintenance under increasing complexity (Challenge #1, #6)
- Token efficiency - Computational cost per unit of quality (Challenge #8)
How it's computed:
```
Layer 1: Samples → Point Score      [Wilson CI + truncation penalty]
Layer 2: Points  → Task Score       [Wilson CI re-aggregation]
Layer 3: Tasks   → Tier ReasonScore [Geometric Mean × 1000]
Layer 4: Tiers   → score/token      [Arithmetic Mean ÷ median tokens]
```
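A sketch of Layers 3 and 4, assuming per-task and per-tier scores already aggregated into [0, 1] (Layers 1-2, including the Wilson CI handling, are detailed in reasonscore.md):

```python
import statistics

def tier_reasonscore(task_scores: list[float]) -> float:
    # Layer 3: geometric mean across task scores, scaled by 1000.
    return statistics.geometric_mean(task_scores) * 1000

def score_per_token(tier_scores: list[float], median_tokens: float) -> float:
    # Layer 4: arithmetic mean of tier scores divided by median token consumption.
    return statistics.mean(tier_scores) / median_tokens

task_scores = [0.82, 0.74, 0.05, 0.91]   # one catastrophic task
print(tier_reasonscore(task_scores))      # ~408, well below the ~630 an arithmetic mean would give
```

The worked numbers illustrate the design philosophy below: the geometric mean lets a single catastrophic task drag the whole score down.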
Design philosophy:
- Optimistic about uncertainty - Add confidence margin (statistical uncertainty is "our fault")
- Pessimistic about failures - Subtract truncation penalty (Challenge #7: context limits are "model's fault")
- Punish imbalance - Geometric mean across tasks (catastrophic failure in one domain hurts overall score)
- Account for efficiency - Divide by tokens (Challenge #8: being right isn't enough if it's expensive)
Why geometric mean?
Unlike arithmetic mean, geometric mean penalizes inconsistency. Being great at 11 tasks doesn't excuse catastrophic failure at 1 task. In real deployment, users hit all task types.
Why score/token?
Two models with identical accuracy can differ by 5-10x in token consumption. Model A at 500 tokens per problem and Model B at 5,000 tokens per problem have radically different deployment characteristics:
- Cost: 10x difference in API bills
- Latency: 10x difference in user wait times
- Throughput: 10x difference in concurrent users supported
- Environment: 10x difference in energy consumption
Accuracy-only metrics treat these as equivalent. They're not. The final score/token ratio makes efficiency a first-class concern, answering: "How much quality per unit of resource?"
For complete layer-by-layer computation, design rationale, and philosophical motivation: See reasonscore.md
Stage 4: Discovery — Visual Pattern Recognition¶
Purpose: Answer "WHAT is interesting?"
After Stage 3, you have a complete PointsDB. But where do you start? Discovery tools optimize for pattern recognition and hypothesis formation.
Three complementary perspectives:
1. Leaderboard — "What's the big picture?"
- Aggregate rankings with ReasonScore
- Heatmap visualization (models × tasks × tiers)
- Color gradients reveal failure patterns
- Truncation indicators show context issues
- Group filtering enables peer comparison
2. Spider Plots — "What's this model's cognitive fingerprint?"
- Radar chart across 12 reasoning domains
- Cognitive archetype identification (9 recognizable patterns)
- Difficulty scaling behavior (easy/medium/hard)
- Token efficiency overlay
- Cross-task consistency analysis
3. Explorer — "Where in the manifold does behavior change?"
- Interactive 3D surfaces (accuracy = Z-axis, params = X/Y)
- Capability zones (green plateaus = success regions)
- Failure boundaries (red cliffs = performance drop-offs)
- Multi-panel analysis (FFT, accuracy, token distributions)
- Point inspection (click to see test samples and responses)
Progressive discovery flow:
```
BROAD:    Leaderboard → Identify candidates
              ↓
FOCUSED:  Spider      → Identify strengths/weaknesses
              ↓
SPECIFIC: Explorer    → Identify failure boundaries
```
See workflow.md for complete discovery workflows.
Stage 5: Investigation — Systematic Forensic Analysis¶
Purpose: Answer "WHY is it interesting?"
Discovery reveals patterns. Investigation explains mechanisms. Real research ping-pongs between both.
Information processing analysis tools:
Understanding why a model fails requires investigating four information spaces:
- surface - Where does performance break down? Look at OUTPUT.
- fft - How is the problem represented? Look at INPUT.
- compression - What is the information quality? Look at REASONING (spatial/entropy).
- hazard - When does thinking degrade? Look at REASONING (temporal/timing).
Discovery support tools:
- evals - Evaluation discovery with fuzzy search
- tasks - Task structure discovery (surfaces/projections)
- modelinfo - Architecture-aware interpretation
Statistical validation:
- cluster - CI-overlap grouping (distinguish signal from noise)
- scores - Statistical rankings with CI
- spiderweb - Complete single-model fingerprinting
Example investigation flow:
```
Stage 4: "Model X fails at arithmetic length=18"
    ↓
Stage 5 surface: "Failure boundary confirmed at length=18"
    ↓
Stage 5 fft: "Tokenization not the issue"
    ↓
Stage 5 compression: "Reasoning traces become 3x more compressible"
    ↓
ROOT CAUSE: Information loss / reasoning loops
```
See tools/analyze.md for complete forensic toolkit reference.
The Discovery-Investigation Loop¶
Stages 4 and 5 form a research loop, not a linear pipeline:
```
Discovery reveals patterns → Investigation explains mechanisms
        ↑                              ↓
        └── Investigation finds anomalies
                       ↓
        Both inform Stage 1 (new manifolds)
```
Key insight: After Stage 3, research isn't sequential. You ping-pong based on what you're trying to understand at each moment.
Example ping-pong:
- Discovery (leaderboard): "Model A and B look similar"
- Investigation (cluster): "Overlapping CIs confirm equivalence"
- Discovery (spider): "But different cognitive profiles"
- Investigation (surface): "Model A has cliff at depth=3, Model B smooth degradation"
- Investigation (compression): "Model A enters loops, Model B maintains entropy"
- Finding: Same aggregate score, different failure modes
See workflow.md for four research workflow patterns showing when to use discovery vs investigation.
Proving It Works¶
m12x validates this architecture.
75+ models, 12 reasoning tasks, 6.5B tokens, 150K+ evaluation points.
This isn't hypothetical. Every design pattern—the manifold definitions, the tier mappings, the precision configurations, the cognitive archetypes, the forensic workflows—has been battle-tested through real evaluation at production scale.
m12x serves three purposes:

1. Validates the architecture — Proves ReasonScape works (not vaporware)
2. Provides research-ready data — Enables immediate analysis without inference costs
3. Demonstrates design patterns — Shows concrete choices others can adapt
The extraordinary evidence:

- FFT reveals spectral signatures that differ by tokenizer/architecture
- Compression shows underthink/overthink/broken loops patterns
- Hazard analysis proves models have measurable "thinking budgets"
- Surface plots reveal capability boundaries nobody knew existed
- Statistical rigor confirms these patterns are signal, not noise
For complete m12x documentation, configuration details, and usage guide: See m12x.md
The Four Research Workflows¶
The architecture enables four distinct research workflows, each using different tool combinations:
1. Ranking & Benchmarking¶
Question: "What's the best model overall?"
Tools: Leaderboard (Stage 4) → scores + cluster (Stage 5)
Flow: Discovery → Investigation (quick validation)
Duration: 2-3 minutes
2. Comparative Evaluation¶
Question: "Which models are truly different?"
Tools: cluster (Stage 5) → spiderweb + explorer (Stage 4)
Flow: Investigation → Discovery (visual confirmation)
Duration: 5-10 minutes
3. Model Characterization¶
Question: "What are this model's strengths and weaknesses?"
Tools: spiderweb + explorer (Stage 4) → surface + compression + hazard (Stage 5)
Flow: Discovery ↔ Investigation (heavy ping-pong)
Duration: 5-10 minutes
4. Failure Diagnosis¶
Question: "Why did this model fail?"
Tools: explorer (Stage 4) → surface + fft + compression + hazard (Stage 5)
Flow: Discovery → Investigation → Discovery → Investigation (deep iteration)
Duration: 10-20 minutes
For detailed workflow examples with command sequences: See workflow.md
Interconnections: How It All Fits Together¶
The five stages form an interconnected research platform with forward data flow and iterative discovery-investigation loops.
```mermaid
flowchart TB
    subgraph Pipeline["Data Production Pipeline"]
        S1["Stage 1: Definition"]
        S2["Stage 2: Execution"]
        S3["Stage 3: Evaluation"]
        DB[("PointsDB")]
        S1 --> S2 --> S3 --> DB
    end
    subgraph Loop["Analysis Loop"]
        S4["Stage 4: Discovery<br/>(leaderboard, spider, explorer)"]
        S5["Stage 5: Investigation<br/>(surface, fft, compression, hazard)"]
        RF["Research Findings"]
        S4 --> S5
        S5 --> S4
        S5 --> RF
    end
    subgraph Research["Research Loop"]
        NM["New manifold designs<br/>Hypothesis tests<br/>Difficulty refinements"]
    end
    DB --> Loop
    RF --> Research
    Research -.inform.-> S1
    style Pipeline fill:#e1f5fe
    style Loop fill:#f3e5f5
    style Research fill:#ffebee
```
What makes this work:
- Unified Data Layer - Stages 4 and 5 access identical PointsDB via API
- Complementary Modalities - Discovery optimizes for pattern recognition, investigation for root causes
- Flexible Entry Points - Start wherever makes sense for your research question
- Iterative Refinement - Each cycle improves understanding
- Research Loop Closure - Findings drive design, enabling science, not just benchmarking
Next Steps¶
For New Users (Start with m12x)¶
- Explore m12x data: `python analyze.py evals data/dataset-m12x.json`
- Visual discovery: Open leaderboard, spiderweb, and explorer
- Learn by doing: Run forensic analysis on interesting patterns
- Read technical-details.md for statistical concepts
- Follow index.md to add your own models
For Researchers (Use m12x as Your Dataset)¶
- Start analysis immediately: No inference needed, 6.5B tokens ready to explore
- Review tools/analyze.md for complete forensic capabilities
- Study workflow.md for discovery-investigation patterns
- Consult tasks.md to understand manifold design
- Extend m12x: Add your own models to the reference dataset
For Developers (Fork m12x as Template)¶
- Study m12x structure: `data/dataset-m12x.json` and `tasks/*.json`
- Examine config.md for manifold/tier/surface definitions
- Review tools.md for pipeline integration
- Adapt for your needs: Copy manifolds, modify difficulty ranges, add new surfaces
For LLM Agents (m12x is Agent-Ready)¶
- Start with `analyze.py evals data/dataset-m12x.json --format json`
- Use `analyze.py tasks` to discover available surfaces and projections
- Query with `--format json` for machine-readable outputs
- Follow workflow.md for systematic research patterns
See Also¶
Foundation Documents:
- challenges.md - The eight fundamental challenges in current LLM evaluation
- insight.md - LLMs as information processors and system architecture
Core Documentation:
- m12x.md - The extraordinary evidence (reference evaluation + research dataset)
- implementation.md - The Python codebase that realizes this methodology
Deep-Dive Design:
- manifold.md - Two-plane data model design decisions
- reasonscore.md - Unified metric architecture and design rationale
Reference Documentation:
- workflow.md - Four research workflow patterns with examples
- tasks.md - Abstract task API specifications
- config.md - Configuration reference (manifolds, templates, samplers)
- pointsdb.md - Complete data structure API
- tools.md - Complete tool reference