# Methodology
## System Architecture
```mermaid
graph LR
A[Task Definitions] --> C{Test Generator}
B[Difficulty Parameters] --> C
C -->|Test:text| D{Template}
C -->|Target:text| K
D -->|Test:prompt| E[[LLM Tokenizer]]
E -->|Input:tokens| F((LLM Inference))
F -->|Output:logprobs| G{Sampling}
G -->|Output:tokens| H[[LLM Tokenizer]]
H -->|Output:text| I{Parse}
I -->|Thought:text| J[Analysis]
I -->|Answer:text| K{Compare}
K -->|Success:bool| J
style A fill:#e1f5fe
style B fill:#e1f5fe
style J fill:#f3e5f5
style K fill:#fff3e0
style D fill:#fff3e0
style I fill:#fff3e0
style G fill:#fff3e0
style C fill:#e8f5e8
```
## Information Processing Pipeline
ReasonScape treats LLM evaluation as building an information processing system: we stimulate the input and observe the output as text, but in between the information is transformed several times:
- Task Generator creates parametric test instances with controlled difficulty
- Template Engine formats tests according to model-specific chat templates
- Tokenizer converts prompts to token sequences (with measurable spectral signatures)
- LLM Inference processes tokens through layers of learned parameters with non-linear activations in between
- Sampler collapses output distributions into concrete responses
- Parser extracts reasoning traces and final answers
- Evaluator compares results with statistical rigor and bias correction
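As a rough sketch, the whole chain can be written as a composition of these stages (the function names here are illustrative, not the actual ReasonScape API):

```python
from typing import Callable, Tuple

def run_pipeline(generate: Callable, render: Callable, encode: Callable,
                 infer: Callable, sample: Callable, decode: Callable,
                 parse: Callable, compare: Callable, difficulty: dict) -> Tuple[bool, str]:
    # Each stage transforms the information before handing it to the next one.
    test, target = generate(difficulty)       # Task Generator: parametric test instance
    prompt = render(test)                     # Template Engine: chat-template formatting
    tokens = encode(prompt)                   # Tokenizer: text -> tokens
    logprobs = infer(tokens)                  # LLM Inference: tokens -> output distributions
    out_tokens = sample(logprobs)             # Sampler: distributions -> concrete tokens
    text = decode(out_tokens)                 # Tokenizer: tokens -> text
    thought, answer = parse(text)             # Parser: split reasoning trace from final answer
    return compare(answer, target), thought   # Evaluator: success flag + trace for analysis
```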
When viewed through this lens, we can see that LLM-based systems are quite complex - there is clearly more than 'text transformation' going on here.
## LLM Sub-system Breakdown
This architectural breakdown reveals that the LLM system under test has multiple analyzable components:
- Chat Templates: Model-specific prompt formatting affecting performance
- Tokenizers: Text↔token conversion with distinct frequency signatures
- Inference Engines: The actual model parameters and architecture
- Sampling Strategies: Temperature, top-p, and other generation controls
Each of these contributes to the performance of the overall system and must be understood to make sure we are really measuring what we think we are!
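For example, the chat template and tokenizer stages can be inspected directly with the Hugging Face `transformers` library (shown here purely for illustration; ReasonScape's own template handling may differ):

```python
from transformers import AutoTokenizer

# The same logical question yields different token sequences under different
# chat templates - one reason template choice measurably affects performance.
tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")
messages = [{"role": "user", "content": "What is 17 * 23?"}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
tokens = tokenizer(prompt, add_special_tokens=False)["input_ids"]
print(len(tokens), repr(prompt[:80]))
```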
## Statistics, without Lies
At the core of this system is the question of whether we have generated enough samples to achieve statistically significant results. It turns out several corrections are necessary!
### Truncation Removal and Token Counts
Truncations are not errors in task output; they are a distinct KPI representing failures to complete the task due to context limits, and they are handled specially:
- Truncated responses are removed from accuracy calculations (not counted as failures)
- Truncation rates are tracked separately as a quality metric
- Higher truncation rates increase confidence intervals but preserve accuracy measurement
- Models with high truncation rates may need context expansion or different sampling parameters
Token counts and histograms are computed separately for correct and incorrect answers.
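A minimal sketch of this truncation-aware bookkeeping (field names are illustrative):

```python
def summarize(results: list[dict]) -> dict:
    # Truncated responses are excluded from the accuracy denominator;
    # the truncation rate is reported as its own quality metric.
    truncated = [r for r in results if r["truncated"]]
    completed = [r for r in results if not r["truncated"]]
    return {
        "truncation_rate": len(truncated) / len(results) if results else 0.0,
        "accuracy": sum(r["correct"] for r in completed) / len(completed) if completed else None,
        "n_effective": len(completed),  # fewer effective samples -> wider confidence intervals
    }
```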
### Excess Accuracy Correction with 95% CI
Traditional benchmarks suffer from guessing inflation - a model that randomly selects from 4 multiple choice options appears to achieve 25% accuracy despite having zero knowledge. This creates two critical problems:
- Baseline Confusion: Is 60% accuracy actually 60% knowledge or 35% knowledge + 25% luck?
- Task Comparison: Binary tasks (50% guess rate) and 8-option tasks (12.5% guess rate) can't be meaningfully compared
ReasonScape applies excess accuracy correction by computing true knowledge above random chance:
```
For multiple choice: knowledge = (accuracy - 1/n) × n/(n-1)
For binary tasks:    knowledge = 2 × (accuracy - 0.5)
For write-in tasks:  knowledge = accuracy   (no guessing possible)
```
For each evaluation batch:
- Calculate expected guessing successes: Σ(1 - true_knowledge_probability)
- Subtract from both successes and total trials
- Compute 95% Wilson confidence intervals on the excess accuracy
- Report the knowledge-adjusted probability of success, where 0.000 = "no better than guessing" and 1.000 = "perfect knowledge"
This correction enables fair comparison across different question formats and prevents gaming through easier task selection.
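A simplified sketch of the correction (here a fixed per-task guess rate stands in for the per-item Σ(1 - true_knowledge_probability) term, and the Wilson score interval uses z = 1.96; the actual implementation may differ):

```python
import math

def knowledge_adjusted(successes: int, trials: int, guess_rate: float, z: float = 1.96):
    # Remove the successes expected from pure guessing from both the numerator and the
    # denominator, then put a Wilson score interval on the remaining "excess" accuracy.
    expected_guesses = guess_rate * trials
    adj_trials = trials - expected_guesses
    if adj_trials <= 0:
        return 0.0, (0.0, 0.0)
    p = max(successes - expected_guesses, 0.0) / adj_trials

    denom = 1 + z**2 / adj_trials
    center = (p + z**2 / (2 * adj_trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / adj_trials + z**2 / (4 * adj_trials**2))
    return p, (max(center - half, 0.0), min(center + half, 1.0))

# 4-option multiple choice, 70/100 correct -> 0.60 knowledge above chance
print(knowledge_adjusted(70, 100, guess_rate=0.25))
```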
### Dynamic CI with Truncation Awareness
The execution environment supports a target confidence interval: additional tests are generated in batches until the target CI is reached or a safety limit is hit. This allows optimal allocation of compute resources to the points on the manifold that would benefit the most!
For high-truncation points that would otherwise waste resources, an alternative CI target can be provided. This saves up to 30% of tokens and speeds up evaluations 2-3x by preventing KV cache overflow!
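Conceptually, the adaptive loop looks something like this (the batch size, the 25% truncation threshold and the shape of the `run_batch` results are illustrative assumptions):

```python
import math

def evaluate_until_confident(run_batch, ci_target=0.05, ci_target_truncated=0.10,
                             batch_size=32, max_samples=1024, z=1.96):
    # Draw batches until the CI half-width reaches the target or a safety limit is hit.
    # High-truncation points fall back to a looser target instead of burning more tokens.
    results = []
    while len(results) < max_samples:
        results.extend(run_batch(batch_size))  # each item: {"correct": bool, "truncated": bool}
        done = [r for r in results if not r["truncated"]]
        if not done:
            continue
        trunc_rate = 1 - len(done) / len(results)
        p = sum(r["correct"] for r in done) / len(done)
        half_width = z * math.sqrt(p * (1 - p) / len(done))  # normal-approximation half-width
        target = ci_target_truncated if trunc_rate > 0.25 else ci_target
        if half_width <= target:
            break
    return results
```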
## Infinite Parametric Test Manifolds
To support the statistical demands, we build parametric test generation manifolds that can generate infinite task samples grouped by multi-dimensional difficulty.
Sweeping across these dimensions allows us to 'image' a model's capability on a particular task: to understand which attributes of the task allow the model to complete it successfully and with the least effort.
## Token-Frequency Domain Analysis
Massive randomized test generation introduces unique challenges: how can we be sure the performance differences between populations are real, and not caused by unexpected issues inside the random test population?
ReasonScape introduces Token-Frequency Domain Analysis as a multi-faceted tool specifically for studying populations of test instances. We apply chat templates and tokenize the reasoning problems, then apply an FFT to each sample and compute the mean and standard deviation across the resulting population of frequency spectra.
These Token-Frequency Domain Distribution views of the data allow us both to validate that the populations do not differ in unexpected ways (as perceived by the LLM), and to expose unexpected insights - for example, that many of the difficulty manifold parameters correspond to well-understood phenomena from the RF domain, such as gain, compression and modulation.
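A minimal sketch of how such a population spectrum can be computed, assuming numpy and already-tokenized prompts (details of the actual ReasonScape implementation may differ):

```python
import numpy as np

def population_spectrum(token_id_sequences, n_fft=256):
    # FFT each tokenized prompt (cropped/zero-padded to a fixed length), take the
    # magnitude spectrum, then summarize the population with per-bin mean and std-dev.
    spectra = []
    for ids in token_id_sequences:
        x = np.zeros(n_fft)
        ids = np.asarray(ids[:n_fft], dtype=float)
        x[:len(ids)] = ids - ids.mean()  # remove the DC offset before the FFT
        spectra.append(np.abs(np.fft.rfft(x)))
    spectra = np.stack(spectra)
    return spectra.mean(axis=0), spectra.std(axis=0)
```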
## Progressive Evaluation Architecture
ReasonScape solves the fundamental scaling problem in AI evaluation: comprehensive analysis requires millions of tokens per model, making systematic template/sampler exploration prohibitively expensive. Traditional benchmarks force researchers to choose between affordable small-scale evaluation or statistically rigorous large-scale analysis.
ReasonScape introduces Progressive Evaluation - a four-component architecture enabling incremental, hierarchical evaluation at arbitrary scales:
### 1. Deterministic Manifold Seeding
Every difficulty manifold point generates identical test sequences across runs through coordinate-based RNG seeding:
```python
import hashlib
import json

def generate_cache_key(cache_data):
    # Serialize with sorted keys so identical coordinates/payloads always hash to the same key.
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()
```
This ensures that `{length: 16, max_depth: 4}` generates identical tests whether evaluated alone or as part of a larger grid, enabling perfect reproducibility and hierarchical sampling.
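A minimal sketch of coordinate-based seeding built on the same hashing idea (`make_rng` and its task/point arguments are illustrative, not the actual ReasonScape API):

```python
import hashlib
import json
import random

def make_rng(task: str, point: dict) -> random.Random:
    # Derive a deterministic seed from the task name and manifold coordinates:
    # identical coordinates always yield the same RNG and therefore the same tests.
    key = json.dumps({"task": task, "point": point}, sort_keys=True)
    seed = int(hashlib.sha256(key.encode()).hexdigest(), 16) % (2**32)
    return random.Random(seed)

rng = make_rng("arithmetic", {"length": 16, "max_depth": 4})
```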
### 2. Response Caching
All LLM inference requests are cached using payload-based keys:
```python
# Cache key includes: model, messages, temperature, max_tokens, etc.
cache_key = generate_cache_key({
    "model": "phi-4-fp16",
    "messages": [...],
    "temperature": 0.0,
    "max_tokens": 4096
})
```
Identical requests (same model + prompt + parameters) are never re-executed, dramatically reducing costs for:
- Template/sampler exploration: Test multiple configurations on same model
- Ablation studies: Systematic parameter sweeps
- Incremental scaling: Expand sample sizes without redundant inference
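A minimal cache wrapper around `generate_cache_key` might look like this (the on-disk layout and the OpenAI-compatible client call are assumptions for illustration):

```python
import json
from pathlib import Path

CACHE_DIR = Path("cache")

def cached_completion(client, payload: dict) -> dict:
    # Reuse a stored response when an identical payload has been seen before;
    # otherwise call the inference endpoint once and persist the result.
    key = generate_cache_key(payload)
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    response = client.chat.completions.create(**payload).model_dump()
    CACHE_DIR.mkdir(exist_ok=True)
    path.write_text(json.dumps(response))
    return response
```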
### 3. Count-Invariant Generation
Task manifolds guarantee that smaller sample counts are perfect subsets of larger ones:
```
count=32:  [test_0, test_1, test_2, ..., test_31]
count=128: [test_0, test_1, test_2, ..., test_31, test_32, ..., test_127]
```
This hierarchical test generation enables:
- Bidirectional scaling: Downsample existing results or expand sample sizes
- Adaptive confidence targeting: Continue sampling until statistical significance
- Progressive analysis: Quick model comparison → detailed statistical analysis
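A self-contained illustration of count-invariance (the test contents here are placeholders): seeding purely from the manifold coordinates guarantees that smaller runs are exact prefixes of larger ones.

```python
import hashlib
import json
import random

def generate_tests(point: dict, count: int) -> list:
    # Seed only from the coordinates (never from count), so the first 32 tests
    # of a count=128 run are identical to a standalone count=32 run.
    seed = int(hashlib.sha256(json.dumps(point, sort_keys=True).encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return [[rng.randint(0, 9) for _ in range(point["length"])] for _ in range(count)]

assert generate_tests({"length": 16}, 32) == generate_tests({"length": 16}, 128)[:32]
```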
### 4. Manifold-based Test Specification
The configuration file format makes it easy to specify multiple overlapping grids of parameters, and the visualization tools allow merging data collected at multiple levels of both density and precision.
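Purely as an illustration of the idea (this is not ReasonScape's actual configuration schema), overlapping grids can be thought of as a coarse wide sweep plus denser, higher-precision patches:

```python
# Hypothetical manifold specification: a broad low-precision sweep overlapped
# with a dense high-precision patch around the region of interest.
grids = [
    {"length": [4, 8, 16, 32], "max_depth": [2, 4, 8], "samples": 32,  "ci_target": 0.10},
    {"length": [12, 16, 20],   "max_depth": [4, 6],    "samples": 256, "ci_target": 0.03},
]
```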