Methodology
System Architecture¶
graph LR
A[Task Definitions] --> C{Test Generator}
B[Difficulty Parameters] --> C
C -->|Test:text| D{Template}
C -->|Target:text| K
D -->|Test:prompt| E[[LLM Tokenizer]]
E -->|Input:tokens| F((LLM Inference))
F -->|Output:logprobs| G{Sampling}
G -->|Output:tokens| H[[LLM Tokenizer]]
H -->|Output:text| I{Parse}
I -->|Thought:text| J[Analysis]
I -->|Answer:text| K{Compare}
K -->|Success:bool| J
style A fill:#e1f5fe
style B fill:#e1f5fe
style J fill:#f3e5f5
style K fill:#fff3e0
style D fill:#fff3e0
style I fill:#fff3e0
style G fill:#fff3e0
style C fill:#e8f5e8
See Also:
- Task Documentation for detailed task generator implementations
- Configuration Guide for template and sampler definitions
- Tools Overview for the complete evaluation pipeline
Information Processing Pipeline¶
ReasonScape treats LLM evaluation as the creation of an information processing system where we stimulate the input and observe the output as text - but the information is transformed several times:
- Task Generator creates parametric test instances with controlled difficulty
- Template Engine formats tests according to model-specific chat templates
- Tokenizer converts prompts to token sequences (with measurable spectral signatures)
- LLM Inference processes tokens through layers of learned parameters with non-linear activations in between
- Sampler collapses output distributions into concrete responses
- Parser extracts reasoning traces and final answers
- Evaluator compares results with statistical rigor and bias correction
When viewed through this lens we can see that LLMs produce quite complex systems - clearly there is more than 'text transformation' going on here.
LLM Sub-system Breakdown¶
This architectural breakdown reveals that the LLM system under test has multiple analyzable components:
- Chat Templates: Model-specific prompt formatting affecting performance
- Tokenizers: Text↔token conversion with distinct frequency signatures
- Inference Engines: The actual model parameters and architecture
- Sampling Strategies: Temperature, top-p, and other generation controls
Each of these contributes to the performance of the overall system and must be understood to make sure we are really measuring what we think we are!
Statistics, without Lies¶
At the core of this system is the idea that we need to understand if we have generated enough samples to achieve statistically significant results. It turns out several corrections are necessary!
Truncation Removal and Token Counts¶
Truncations are not errors in task output; they are a distinct KPI representing failures to complete the task due to context limits and are handled specially:
- Truncated responses are removed from accuracy calculations (not counted as failures)
- Truncation rates are tracked separately as a quality metric
- Higher truncation rates increase confidence intervals but preserve accuracy measurement
- Models with high truncation rates may need context expansion or different sampling parameters
Token Counts and histograms are separately computed for correct and incorrect answers.
See Also: evaluate.py for implementation details of truncation handling and token analysis.
Excess Accuracy Correction with 95% CI¶
Traditional benchmarks suffer from guessing inflation - a model that randomly selects from 4 multiple choice options appears to achieve 25% accuracy despite having zero knowledge. This creates two critical problems:
- Baseline Confusion: Is 60% accuracy actually 60% knowledge or 35% knowledge + 25% luck?
- Task Comparison: Binary tasks (50% guess rate) and 8-option tasks (12.5% guess rate) can't be meaningfully compared
ReasonScape applies excess accuracy correction by directly removing expected guessing contributions from the success and trial counts before computing confidence intervals.
For each test sample, the guess probability is determined by the response format:
guess_chance = 1/n where n = number of valid options (multiple choice)
guess_chance = 0 for write-in tasks (no guessing possible)
For each evaluation batch:
- Compute guess probability for each sample:
guess_chance = 1/len(response_enum)(or 0 if no enum) - Sum expected guessing successes:
guess_accumulator = Σ(guess_chance)across all samples - Compute adjusted metrics:
adjusted_successes = correct - guess_accumulatoradjusted_trials = total - guess_accumulator- Compute 95% Wilson confidence intervals on the adjusted values
- Report excess-accuracy-adjusted Wilson interval where 0.000 = "no better than guessing" and 1.000 = "perfect knowledge"
This direct subtraction approach enables fair comparison across different question formats and prevents gaming through easier task selection.
See Also: evaluate.py:compute_bucket_stats() for the statistical implementation and Tools Overview for usage examples.
Dynamic CI with Truncation Awareness¶
The execution environment supports a target CI that will continue to generate additional tests in batches until this CI is reached or a safety limit is hit. This allows optimal allocation of compute resources to points on the manifold that would benefit the most!
For high-truncation points that would otherwise waste resources, an alternative CI target can be provided. This saves up to 30% tokens and speeds up evaluations by 2-3x by preventing KV cache overflow!
Infinite Parametric Test Manifolds¶
To support the statistical demands, we build parametric test generation manifolds that can generate infinite task samples grouped by multi-dimensional difficulty.
Sweeping across these dimensions allows us to 'image' a model's capability on a particular task: understand which attributes of the task allow the model to complete it successfully and with least effort.
Token-Frequency Domain Analysis¶
Massive randomized test generation introduces unique challenges: how can we be sure the performance differences between populations are real, and not caused by unexpected issues inside the random test population?
ReasonScape introduces Token-Frequency Domain Analysis as a multi-faceted tool specifically to study populations of test instances. We apply chat templates and tokenize reasoning problems, then apply FFT to each sample and compute the mean and std.dev of the resulting populations of frequency spectrums.
These Token-Frequency Domain Distribution views of the data allow us to validate that the populations are not different in unexpected ways (as perceived by the LLM), and also expose unexpected insights such as that many of the difficulty manifold parameters correspond to well-understood phenomena from the RF domain, such as gain, compression, and modulation.
See Also: evaluate.py for FFT implementation details and Explorer Guide for visualization tools.
Progressive Evaluation Architecture¶
ReasonScape solves the fundamental scaling problem in AI evaluation: comprehensive analysis requires millions of tokens per model, making systematic template/sampler exploration prohibitively expensive. Traditional benchmarks force researchers to choose between affordable small-scale evaluation or statistically rigorous large-scale analysis.
ReasonScape introduces Progressive Evaluation - a three-component architecture enabling incremental, hierarchical evaluation at arbitrary scales:
1. Deterministic Manifold Seeding¶
Every difficulty manifold point generates identical test sequences across runs through coordinate-based RNG seeding:
def generate_cache_key(cache_data):
cache_str = json.dumps(cache_data, sort_keys=True)
return hashlib.sha256(cache_str.encode()).hexdigest()
This ensures {length: 16, max_depth: 4} generates identical tests whether evaluated alone or as part of a larger grid, enabling perfect reproducibility and hierarchical sampling.
See Also: resolver.py for configuration validation and runner.py for execution implementation.
2. Response Caching¶
All LLM inference requests are cached using payload-based keys:
# Cache key includes: model, messages, temperature, max_tokens, etc.
cache_key = generate_cache_key({
"model": "phi-4-fp16",
"messages": [...],
"temperature": 0.0,
"max_tokens": 4096
})
Identical requests (same model + prompt + parameters) are never re-executed, dramatically reducing costs for:
- Template/sampler exploration: Test multiple configurations on same model
- Ablation studies: Systematic parameter sweeps
- Incremental scaling: Expand sample sizes without redundant inference
See Also: runner.py for caching implementation and QuickStart Guide for usage examples.
3. Count-Invariant Generation¶
Task manifolds guarantee that smaller sample counts are perfect subsets of larger ones:
count=32: [test_0, test_1, test_2, ..., test_31]
count=128: [test_0, test_1, test_2, ..., test_31, test_32, ..., test_127]
This hierarchical test generation enables:
- Bidirectional scaling: Downsample existing results or expand sample sizes
- Adaptive confidence targeting: Continue sampling until statistical significance
- Progressive analysis: Quick model comparison → detailed statistical analysis
See Also: M12X Documentation for scaling examples and evaluate.py for statistical processing.
4. Manifold-based Test Specification¶
The configuration file format makes it easy to specify multiple overlapping grids of parameters, and the visualization tools allow merging data collected at multiple levels of both density and precision.
See Also: Configuration Guide for detailed manifold definitions and resolver.py for analysis tools.