ReasonScape Architecture: The Methodology¶
Prerequisites: Before reading this document, familiarize yourself with:

- challenges.md - The eight fundamental problems this methodology addresses
- insight.md - The information processing paradigm that informs the design
Overview¶
ReasonScape's methodology addresses the eight challenges in current LLM evaluation through a systematic, information-theoretic approach grounded in the insight that LLMs are information processors.
How the methodology addresses each challenge:
| Challenge | Solution |
|---|---|
| 1. Doesn't Know What It's Asking | Parametric manifolds with coordinate-based test generation |
| 2. Doesn't Know Which Answers | Per-point evaluation with Wilson confidence intervals |
| 3. Doesn't Understand Reasoning Process | Information-theoretic forensics (compression, FFT, hazard) |
| 4. Can't Distinguish Signal from Noise | Excess accuracy correction and proper uncertainty quantification |
| 5. Trivially Gameable | Deterministic but unmemorizable coordinate-based generation |
| 6. Ceiling Effects | Parametric difficulty scaling that adapts to model capabilities |
| 7. Ignores Truncations and Context Failures | Explicit truncation tracking and penalty in ReasonScore |
| 8. Ignores Token Budget and Resource Efficiency | score/token metric and per-point token consumption tracking |
The Five-Stage Architecture¶
The architectural solution is a multi-stage data-processing pipeline:
```mermaid
graph TB
    subgraph "Data Pipeline"
        A[Stage 1: Definition] --> B[Stage 2: Execution]
        B --> C[Stage 3: Evaluation]
    end
    C --> D[PointsDB]
    D --> E[Stage 4: Discovery]
    D --> F[Stage 5: Investigation]
    subgraph "Research Loop"
        E <-.ping-pong.-> F
    end
    E --> G[Research Insights]
    F --> G
    G -.inform.-> A

    style A fill:#e1f5fe
    style B fill:#e8f5e8
    style C fill:#fff3e0
    style D fill:#fff3e0
    style E fill:#f3e5f5
    style F fill:#fce4ec
    style G fill:#ffebee
```
Stage 1: Definition — Parametric Test Generation¶
Key innovation: Test generators create infinite unique instances within controlled difficulty manifolds.
Every test is deterministically generated from coordinate seeds:
```
seed = hash(task, parameters, sample_index)
```
Same coordinates always produce same test sequence. Different coordinates produce different tests. The manifold is infinite but reproducible.
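As a rough illustration of the idea, a seed can be derived by hashing the coordinate tuple. The sketch below uses assumed names (`point_seed`, SHA-256, JSON key ordering) and is not ReasonScape's actual seeding code; see technical-details.md for the real algorithm.

```python
import hashlib
import json
import random

def point_seed(task: str, parameters: dict, sample_index: int) -> int:
    """Illustrative only: derive a stable 64-bit seed from manifold coordinates."""
    key = json.dumps({"task": task, "params": parameters, "i": sample_index},
                     sort_keys=True)
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

# Same coordinates -> same seed -> the same test instance on every run.
rng = random.Random(point_seed("objects", {"length": 8, "depth": 2}, 0))
```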
Manifold dimensions control difficulty:
- Length (working memory load)
- Depth (structural complexity)
- Interference (selective attention demand)
- Format (tokenization stress)
- Multi-step operations (sequential reasoning)
Progressive complexity controls:
- Precision (low/medium/high): How many tests per point
See technical-details.md for coordinate-based seeding algorithm and manifold resolution mechanics.
Stage 2: Execution — Efficient Inference at Scale¶
Output: runner.py writes per-test steps to results/… as NDJSON (0th-level, unaggregated records: task, degree, precision, eval_id, full inputs/outputs, meta). Nothing else reads these directly except Stage 3.
Key innovations: Response caching and hierarchical evaluation.
Response caching:
- Every unique prompt is cached
- Deterministic generation ensures cache hits
- Typical cost reduction: 30-60% across evaluation runs
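A minimal sketch of the caching idea, assuming a cache keyed on a hash of the full request (prompt plus sampler settings). The names `cached_complete` and `call_model` are illustrative, and a real runner would persist the cache to disk so repeated evaluation runs benefit.

```python
import hashlib
import json

cache: dict[str, str] = {}  # request-key -> model response (on disk in practice)

def cached_complete(prompt: str, sampler: dict, call_model) -> str:
    """Illustrative only: reuse a stored response when the exact request repeats."""
    key = hashlib.sha256(json.dumps({"prompt": prompt, "sampler": sampler},
                                    sort_keys=True).encode()).hexdigest()
    if key not in cache:
        cache[key] = call_model(prompt, **sampler)  # only pay for unseen prompts
    return cache[key]
```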
Hierarchical sampling:
- Tests at count=32 are a perfect subset of count=128
- Can upsample without waste
- Can downsample for quick comparison
- Supports progressive evaluation workflows
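Why the subset property holds, sketched with the hypothetical `point_seed` from the seeding sketch above (`make_test` is likewise illustrative): indices 0-31 of a 128-test point use exactly the same seeds as a 32-test point, so the smaller run is a strict prefix of the larger one.

```python
def point_tests(task, params, count):
    """Illustrative only: generate `count` deterministic tests for one manifold point."""
    return [make_test(point_seed(task, params, i)) for i in range(count)]

small = point_tests("objects", {"length": 8}, 32)
large = point_tests("objects", {"length": 8}, 128)
assert large[:32] == small  # upsampling reuses existing work; downsampling is free
```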
See technical-details.md for caching implementation and truncation-aware execution.
Stage 3: Evaluation — Statistical Rigor Without Lies¶
Output: evaluate.py consumes step NDJSON and writes per-eval_id points into PointsDB (1st-level aggregates: outcome, tokens, compressed length, excess-accuracy adjusted). All downstream tools (Stages 4–5) work from these points and their higher-level aggregations (vectors → KPIs); they never read raw steps.
Key innovations: Excess accuracy correction, truncation awareness, and pre-computed forensics.
Excess accuracy correction:
- Removes expected guessing contributions
- 0.000 = no better than guessing
- 1.000 = perfect knowledge
- Fair comparison across all task types
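A sketch of one standard form of guessing correction, assuming a known per-task chance rate (for example 0.25 for four-option multiple choice). Whether ReasonScape uses exactly this formula, and how the chance rate is derived per task, is covered in technical-details.md.

```python
def excess_accuracy(observed: float, guess_rate: float) -> float:
    """Illustrative only: rescale accuracy so chance performance maps to 0.0."""
    # Clamping below-chance results to zero is an assumption of this sketch.
    return max(0.0, (observed - guess_rate) / (1.0 - guess_rate))

excess_accuracy(0.25, 0.25)  # 0.000 - no better than guessing
excess_accuracy(1.00, 0.25)  # 1.000 - perfect knowledge
```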
Truncation awareness (Challenge #7):
- Truncations tracked separately from errors
- Not wrong answers—context limit failures that waste resources
- Handled via probability multiplication in ReasonScore (joint mode: P[Correct|U] × P[Untrunc])
- Truncations widen confidence intervals (reduced effective sample size)
- Reported explicitly in all visualizations
- Why this matters: Pass@k metrics hide that a model might need 10 attempts to produce valid output, masking deployment reliability issues
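The joint mode described above combines two probabilities; a sketch with illustrative names:

```python
def joint_success(p_correct_given_untruncated: float, p_untruncated: float) -> float:
    """Illustrative only: a truncated response can never count as a success."""
    return p_correct_given_untruncated * p_untruncated

# 90% accurate when it finishes, but truncated 30% of the time:
joint_success(0.90, 0.70)  # 0.63 - context failures drag the score down
```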
Pre-computed forensics:
- Compression arrays (for entropy analysis)
- FFT arrays (for spectral analysis)
- Token distributions (for hazard analysis)
- 10-100x speedup for Stage 5 investigations
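As a rough illustration of what the compression arrays capture (the actual compressor, windowing, and array layout are implementation details), repetitive loop-like output compresses far better than genuinely novel reasoning, so compression ratio acts as a cheap entropy proxy:

```python
import zlib

def compression_ratio(text: str) -> float:
    """Illustrative only: lower ratios indicate more repetitive (lower-entropy) output."""
    raw = text.encode()
    return len(zlib.compress(raw)) / max(1, len(raw))

looping = "Wait, let me re-check the previous step. " * 100  # degenerate loop
compression_ratio(looping)  # very low - the trace is mostly repetition;
                            # novel reasoning of the same length compresses far less
```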
See technical-details.md for Wilson CI algorithm, excess accuracy computation, and pre-computation mechanics.
The Two-Plane Data Model¶
ReasonScape organizes evaluation data using PointsDB, a two-plane, three-layer structure. Each point exists simultaneously in an Evaluation Plane (model, template, sampler) and a Task-Complexity Plane (base_task, params). The two planes are orthogonal: the same model can be tested at many difficulty levels, and many models can be tested at the same difficulty level.
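A sketch of the identity scheme, with illustrative field names (the authoritative schema and identity rules live in manifold.md and pointsdb.md): a point is addressed by one key from each plane, and the two keys vary independently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalKey:        # Evaluation Plane
    model: str
    template: str
    sampler: str

@dataclass(frozen=True)
class TaskKey:        # Task-Complexity Plane
    base_task: str
    params: tuple     # e.g. (("length", 8), ("depth", 2))

# One point per (EvalKey, TaskKey) pair - the planes vary orthogonally.
point_id = (EvalKey("model-a", "chat", "greedy"),
            TaskKey("objects", (("length", 8), ("depth", 2))))
```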
For the complete data model — design rationale, layer definitions, facet computation, and identity rules: See manifold.md
For complete PointsDB API and query patterns: See pointsdb.md
ReasonScore: The Unified Metric¶
ReasonScore captures six dimensions of model performance in a single interpretable number:
What it measures:
- Accuracy - Correctness above random guessing baseline (Challenge #4)
- Statistical confidence - Uncertainty from finite sampling (Challenge #4)
- Context reliability - Truncation and context limit issues (Challenge #7)
- Task balance - Performance consistency across reasoning domains (Challenge #2)
- Difficulty scaling - Capability maintenance under increasing complexity (Challenge #1, #6)
- Token efficiency - Computational cost per unit of quality (Challenge #8)
How it's computed (v2, 2-layer):
- Layer 1: Samples → Task Score [per-task probability-space computation]
- Layer 2: Tasks → ReasonScore [bootstrap geometric mean × 1000]
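A simplified sketch of Layer 2, assuming per-task scores already computed in probability space. The bootstrap procedure, weighting, and confidence handling shown here are deliberately reduced; reasonscore.md specifies the real computation.

```python
import random
from statistics import geometric_mean

def reasonscore(task_scores: list[float], n_boot: int = 1000) -> float:
    """Illustrative only: bootstrap the geometric mean of task scores, scale by 1000."""
    eps = 1e-6  # guard: geometric_mean rejects zeros
    boots = []
    for _ in range(n_boot):
        resampled = random.choices(task_scores, k=len(task_scores))
        boots.append(geometric_mean([max(s, eps) for s in resampled]))
    return 1000 * sum(boots) / n_boot

reasonscore([0.90, 0.85, 0.80, 0.05])  # one catastrophic task drags the whole score down
```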
Design philosophy:
- Probability-space truncation - Truncation modeled as (1 - P[Trunc]) × P[Correct], not subtracted
- Punish imbalance - Geometric mean across tasks (catastrophic failure in one domain hurts overall score)
- Preserved uncertainty - Confidence intervals carried through bootstrap aggregation
- Account for efficiency - score/token ratio makes efficiency a first-class concern (Challenge #8)
Why geometric mean?
Unlike arithmetic mean, geometric mean penalizes inconsistency. Being great at 11 tasks doesn't excuse catastrophic failure at 1 task. In real deployment, users hit all task types.
Why score/token?
Two models with identical accuracy can differ by 5-10x in token consumption. Model A at 500 tokens/problem and Model B at 5,000 tokens/problem have radically different deployment characteristics:
- Cost: 10x difference in API bills
- Latency: 10x difference in user wait times
- Throughput: 10x difference in concurrent users supported
- Environment: 10x difference in energy consumption
Accuracy-only metrics treat these as equivalent. They're not. The final score/token ratio makes efficiency a first-class concern, answering: "How much quality per unit of resource?"
For complete layer-by-layer computation, design rationale, and philosophical motivation: See reasonscore.md
Stage 4: Discovery — Visual Pattern Recognition¶
Purpose: Answer "WHAT is interesting?"
After Stage 3, you have a complete PointsDB. But where do you start? Discovery tools optimize for pattern recognition and hypothesis formation.
Three complementary perspectives:
1. Leaderboard — "What's the big picture?"
- Aggregate rankings with ReasonScore
- Heatmap visualization (models × tasks)
- Color gradients reveal failure patterns
- Truncation indicators show context issues
- Group filtering enables peer comparison
2. Spider Plots — "What's this model's cognitive fingerprint?"
- Radar chart across 12 reasoning domains
- Cognitive archetype identification (9 recognizable patterns)
- Difficulty scaling behavior across parameter space
- Token efficiency overlay
- Cross-task consistency analysis
3. Explorer — "Where in the manifold does behavior change?"
- Interactive 3D surfaces (accuracy = Z-axis, params = X/Y)
- Capability zones (green plateaus = success regions)
- Failure boundaries (red cliffs = performance drop-offs)
- Multi-panel analysis (FFT, accuracy, token distributions)
- Point inspection (click to see test samples and responses)
Progressive discovery flow:
```
BROAD:    Leaderboard → Identify candidates
    ↓
FOCUSED:  Spider → Identify strengths/weaknesses
    ↓
SPECIFIC: Explorer → Identify failure boundaries
```
Stage 5: Investigation — Systematic Forensic Analysis¶
Purpose: Answer "WHY is it interesting?"
Discovery reveals patterns. Investigation explains mechanisms. Real research ping-pongs between both.
Stage 5 is organized around the Three P's: Position (rank models), Profile (characterize and diagnose), and Probe (inspect raw traces). Position and Profile operate on PointsDB; Probe drops to raw NDJSON when you need to see what the model actually produced.
See workflow.md for the complete Three P's methodology and tools.md for tool reference.
The Discovery-Investigation Loop¶
Stages 4 and 5 form a research loop, not a linear pipeline:
```
Discovery reveals patterns → Investigation explains mechanisms
        ↑                                    ↓
        └──── Investigation finds anomalies ─┘
                         ↓
         Both inform Stage 1 (new manifolds)
```
Key insight: After Stage 3, research isn't sequential. You ping-pong based on what you're trying to understand at each moment.
Example ping-pong:
- Discovery (leaderboard): "Model A and B look similar"
- Investigation (cluster): "Overlapping CIs confirm equivalence"
- Discovery (spider): "But different cognitive profiles"
- Investigation (surface): "Model A has cliff at depth=3, Model B smooth degradation"
- Investigation (compression): "Model A enters loops, Model B maintains entropy"
- Finding: Same aggregate score, different failure modes
See workflow.md for the Three P's and when to use discovery vs investigation.
Proving It Works¶
r12: 12 reasoning tasks, improved difficulty calibration, 16k context windows, and a 95% score ceiling. It demonstrates the architecture at its best—comprehensive parametric coverage without a priori difficulty assumptions.
The extraordinary evidence:

- Compression shows underthink/overthink/broken-loop patterns
- Hazard analysis proves models have measurable "thinking budgets"
- Surface plots reveal capability boundaries nobody knew existed
- Statistical rigor confirms these patterns are signal, not noise
For r12 documentation and ReasonScore v2: See r12.md and reasonscore.md
Interconnections: How It All Fits Together¶
The five stages form an interconnected research platform with forward data flow and iterative discovery-investigation loops.
```mermaid
flowchart TB
    subgraph Pipeline["Data Production Pipeline"]
        S1["Stage 1: Definition"]
        S2["Stage 2: Execution"]
        S3["Stage 3: Evaluation"]
        DB[("PointsDB")]
        S1 --> S2 --> S3 --> DB
    end

    subgraph Loop["Analysis Loop (Stages 4-5)"]
        S4["Stage 4: Discovery<br/>(leaderboard, spider, explorer)"]
        S5["Stage 5: Investigation<br/>The Three P's:<br/>Position → Profile → Probe"]
        RF["Research Findings"]
        S4 --> S5
        S5 --> S4
        S5 --> RF
    end

    subgraph Research["Research Loop"]
        NM["New manifold designs<br/>Hypothesis tests<br/>Difficulty refinements"]
    end

    DB --> Loop
    RF --> Research
    Research -.inform.-> S1

    style Pipeline fill:#e1f5fe
    style Loop fill:#f3e5f5
    style Research fill:#ffebee
```
What makes this work:
- Unified Data Layer - Stages 4 and 5 access identical PointsDB via API
- Complementary Modalities - Discovery optimizes for pattern recognition, investigation for root causes
- Flexible Entry Points - Start wherever makes sense for your research question
- Iterative Refinement - Each cycle improves understanding
- Research Loop Closure - Findings drive design, enabling science, not just benchmarking
Next Steps¶
For New Users (Start with r12)¶
- Explore r12 data: `python analyze.py evals data/r12.json`
- Visual discovery: Open leaderboard, spiderweb, and explorer
- Learn by doing: Run forensic analysis on interesting patterns
- Read technical-details.md for statistical concepts
- Follow index.md to add your own models
For Researchers (Use r12 as Your Dataset)¶
- Start analysis immediately: No inference needed, 6.5B tokens ready to explore
- Review tools.md for complete forensic capabilities
- Study workflow.md for discovery-investigation patterns
- Consult tasks.md to understand manifold design
- Extend r12: Add your own models to the reference dataset
For Developers (Fork r12 as Template)¶
- Study r12 structure: `data/r12.json` and `tasks/*.json`
- Examine config.md for manifold and view definitions
- Review implementation.md for pipeline integration
- Adapt for your needs: Copy manifolds, modify difficulty ranges, add new surfaces
For LLM Agents (r12 is Agent-Ready)¶
- Start with `analyze.py evals data/r12.json --format json`
- Use `analyze.py tasks` to discover available views
- Query with `--format json` for machine-readable outputs
- Follow workflow.md for systematic research patterns
See Also¶
Foundation Documents:
- challenges.md - The eight fundamental challenges in current LLM evaluation
- insight.md - LLMs as information processors and system architecture
Core Documentation:
- r12.md - The extraordinary evidence (reference evaluation + research dataset)
- implementation.md - The Python codebase that realizes this methodology
Deep-Dive Design:
- manifold.md - Two-plane data model design decisions
- reasonscore.md - Unified metric architecture and design rationale
Reference Documentation:
- workflow.md - The Three P's research methodology
- tasks.md - Abstract task API specifications
- config.md - Configuration reference (manifolds, templates, samplers)
- pointsdb.md - Complete data structure API
- implementation.md - Complete tool reference