ReasonScape Technical Details¶
Prerequisites:
- architecture.md - The five-stage methodology
- implementation.md - How the codebase is organized
- insight.md - The information processing paradigm
This document provides low-level algorithms and implementation details of the core technical mechanisms underlying ReasonScape.
Table of Contents¶
- Parametric Test Generation
- Token-Frequency Domain Analysis
- Progressive Evaluation Architecture
- Statistical Methodology
- Implementation References
Parametric Test Generation¶
See Architecture: Stage 1 for the design philosophy.
Coordinate-Based Seeding¶
Every test sequence is deterministically generated from the parameter coordinates:
```python
# From runner.py:472-476
# Create stable seed based on params (excluding 'count')
seed_params = {k: v for k, v in step_info['params'].items() if k != 'count'}
param_hash = generate_cache_key(seed_params)  # SHA-256 hash of JSON-serialized params
base_seed = int(param_hash[-8:], 16)          # Take last 8 hex digits as seed
generator.rng = random.Random(args.seed + base_seed)
```
Where generate_cache_key() is:
```python
import hashlib
import json

def generate_cache_key(cache_data):
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()
```
How It Works:
- Extract parameter coordinates (e.g., `{"length": 16, "depth": 3}`)
- Remove the `count` parameter (doesn't affect test content, only quantity)
- Compute the SHA-256 hash of the JSON-serialized parameters
- Take the last 8 hex digits as an integer seed
- Add the global seed offset (`args.seed`)
- Initialize `random.Random()` with this seed
- Call `generator.generate_random(**params)`, which uses the seeded RNG
Properties:
- Same coordinates always produce identical test sequences
- Smaller `count` values are perfect subsets of larger ones (hierarchical sampling)
- Different global seeds (`args.seed`) produce different test sets for the same coordinates
- Enables deterministic caching and reproducible evaluations
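As a quick illustration of these properties, the following sketch (assuming the `generate_cache_key()` shown above; the coordinate values are arbitrary) verifies that the derived seed is stable across runs and machines:

```python
# Illustrative reproducibility check; coordinate values are arbitrary
coords = {"length": 16, "depth": 3, "count": 128}
seed_params = {k: v for k, v in coords.items() if k != "count"}  # drop 'count'

seed_a = int(generate_cache_key(seed_params)[-8:], 16)
seed_b = int(generate_cache_key(seed_params)[-8:], 16)
assert seed_a == seed_b  # same coordinates -> same seed, hence identical tests

# A different global offset (args.seed) shifts the final RNG seed, giving a
# different but still deterministic test set for the same coordinates.
```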
Manifold Parameter Types¶
Tasks define difficulty through multiple dimensions:
| Dimension Type | Example Parameters | Effect on Difficulty |
|---|---|---|
| Length | `length`, `num_terms`, `num_steps` | Working memory load |
| Depth | `max_depth`, `nesting` | Structural complexity |
| Interference | `distractors`, `noise_ratio` | Selective attention demand |
| Format | `whitespace`, `case_mutations` | Tokenization stress |
| Multi-step | `num_operations` | Sequential reasoning |
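A single manifold point simply combines values along several of these dimensions. A minimal sketch (parameter names follow the table above; the specific combination is hypothetical, not taken from a real task config):

```python
# Hypothetical manifold point mixing several difficulty dimensions
manifold_point = {
    "length": 16,        # working memory load
    "max_depth": 3,      # structural complexity
    "distractors": 4,    # selective attention demand
    "whitespace": True,  # tokenization stress
}
```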
See Also: Task Documentation for per-task manifold specifications.
Token-Frequency Domain Analysis¶
See insight.md for the information processing paradigm that motivates this analysis.
ReasonScape introduces Token-Frequency Domain Analysis to validate test populations and reveal architectural patterns invisible in text space.
Methodology¶
For each test instance:
- Apply chat template to test text
- Tokenize using model-specific tokenizer
- Compute the token sequence length `N`
- Apply FFT to the token ID sequence: `F = FFT(token_ids)`
- Compute the magnitude spectrum `|F|` (discard phase)
- Normalize by sequence length
For each population of tests at a manifold point:
- Collect magnitude spectra from all samples
- Compute mean spectrum: `μ(f) = mean(|F_i(f)|)` across samples
- Compute standard deviation: `σ(f) = std(|F_i(f)|)`
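A minimal sketch of both steps in NumPy, assuming the chat template has already been applied and the text tokenized (the example token IDs are arbitrary placeholders, not real tokenizer output):

```python
import numpy as np

def magnitude_spectrum(token_ids):
    """Normalized FFT magnitude spectrum of one tokenized test instance."""
    ids = np.asarray(token_ids, dtype=np.float64)
    return np.abs(np.fft.fft(ids)) / len(ids)  # |F|, phase discarded, normalized by N

# Population statistics at one manifold point (assumes equal-length spectra;
# real populations may need padding/interpolation to a common length first)
spectra = np.stack([
    magnitude_spectrum([101, 2023, 2003, 1037, 3231, 102]),
    magnitude_spectrum([101, 2178, 3231, 6251, 2182, 102]),
])
mu = spectra.mean(axis=0)     # mean spectrum μ(f)
sigma = spectra.std(axis=0)   # standard deviation σ(f)
```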
Interpretation¶
Frequency Domain Insights:
- Low frequencies: Long-range patterns, sentence structure
- High frequencies: Token-level variations, formatting noise
- Spectral peaks: Repeated patterns, structural rhythms
- Bandwidth: Diversity of token patterns
Cross-Model Comparisons:
- Different tokenizers create different spectral signatures
- Same test text → different frequency profiles per model
- Reveals tokenization efficiency and pattern recognition
RF Domain Analogies:
- Gain: Spectral amplitude changes with difficulty
- Compression: Bandwidth reduction under cognitive load
- Modulation: Parameter sweeps create predictable spectral shifts
Applications¶
- Population Validation: Ensure test sets have expected spectral properties
- Tokenizer Analysis: Identify efficiency differences between models
- Difficulty Calibration: Verify manifold parameters affect input complexity
- Failure Mode Diagnosis: Correlate spectral patterns with accuracy drops
See Also:
- analyze.py fft for forensic investigation
- Explorer for interactive visualization
Progressive Evaluation Architecture¶
See Architecture: Stage 2 for the execution philosophy.
1. Response Caching Implementation¶
All inference requests are cached using SHA-256 hashes of the complete request payload:
```python
import hashlib
import json

def generate_cache_key(cache_data: dict) -> str:
    """
    Generate deterministic cache key from request payload.

    Args:
        cache_data: {
            "model": "phi-4-fp16",
            "messages": [...],
            "temperature": 0.0,
            "max_tokens": 4096,
            "top_p": 1.0,
            # ... all sampling parameters
        }

    Returns:
        64-character hex string (SHA-256 hash)
    """
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()
```
Cache Behavior:
- Every unique prompt gets its own cache entry
- Changing any parameter (temperature, max_tokens, etc.) creates new entry
- Identical requests (same model + prompt + parameters) are never re-executed
- Deterministic test generation ensures reproducible cache keys
- Typical cost reduction of 30% for 3-tiered evaluation
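A minimal sketch of how such a cache can be consulted before issuing a request (illustrative only: `send_request()` and the one-file-per-key layout are hypothetical, not ReasonScape's actual storage format):

```python
import json
import os

def cached_completion(cache_dir: str, payload: dict) -> dict:
    """Return a cached response for this exact payload, or execute and store it."""
    key = generate_cache_key(payload)              # SHA-256 of sorted JSON payload
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)                    # cache hit: never re-executed
    response = send_request(payload)               # hypothetical inference call
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "w") as f:
        json.dump(response, f)
    return response
```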
2. Hierarchical Sampling¶
Test generators guarantee subset relationships:
```python
# Example: count-invariant generation
manifold_point = {"length": 16, "depth": 3}
param_hash = generate_cache_key(manifold_point)  # SHA-256 of sorted JSON params
seed = int(param_hash[-8:], 16)                  # stable across runs, unlike built-in hash()

# Generate different sample counts
tests_32 = generate_tests(seed, count=32)    # [t0, t1, ..., t31]
tests_128 = generate_tests(seed, count=128)  # [t0, t1, ..., t127]

# Verify subset property
assert tests_32 == tests_128[:32]  # Always True
```
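The guarantee comes from drawing tests sequentially from a single seeded RNG, so requesting more samples never changes the ones already generated. A minimal sketch (hypothetical `generate_tests`; real generators live in tasks/*.py):

```python
import random

def generate_tests(seed: int, count: int) -> list:
    """Draw tests in order from one seeded RNG; the first N are identical
    no matter how many additional tests are requested afterwards."""
    rng = random.Random(seed)
    return [{"a": rng.randint(0, 9), "b": rng.randint(0, 9)} for _ in range(count)]
```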
Enabled Workflows:
- Upsampling: Add more samples to existing evaluation (no waste)
- Downsampling: Use subset of large evaluation for quick comparison
- Adaptive Sampling: Continue until statistical confidence target reached
- Cost Optimization: Start small, scale only where needed
3. Dynamic Confidence Targeting¶
Progressive sampling adapts to the statistical needs of each manifold point:
```python
# Pseudocode for adaptive sampling
min_samples = 32
max_samples = 512
target_ci_width = 0.05

samples = min_samples
while samples < max_samples:
    accuracy, ci_lower, ci_upper = compute_statistics(samples)
    ci_width = ci_upper - ci_lower
    if ci_width <= target_ci_width:
        break  # Sufficient confidence
    samples += 32  # Request more samples
```
Optimization:
- Easy points (near 0% or 100% accuracy) reach confidence quickly
- Hard points (near 50% accuracy) require more samples
- High-truncation points stop early to avoid wasting tokens
4. Truncation-Aware Execution¶
Models that consistently truncate on specific manifold points waste resources:
Standard Approach (wasteful):
- Request 128 samples at difficult point
- 100 samples truncate (context limit exceeded)
- Only 28 valid responses, wide confidence intervals
- Wasted 100 × max_tokens on truncations
ReasonScape Approach (efficient):
- Detect high truncation rate (>50%)
- Apply relaxed confidence target (or stop sampling)
- Allocate saved tokens to more productive points
- Track truncation rate as separate quality metric
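A sketch of the kind of check this implies (the 50% threshold comes from the list above; the relaxed target value and function name are assumptions):

```python
def effective_ci_target(truncated: int, total: int,
                        normal_width: float = 0.05,
                        relaxed_width: float = 0.15) -> float:
    """Relax the confidence target at points dominated by truncations,
    so token budget can flow to more productive manifold points."""
    truncation_rate = truncated / total if total else 0.0
    if truncation_rate > 0.50:
        return relaxed_width  # or stop sampling this point entirely
    return normal_width
```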
See Also: runner.py for execution implementation details.
Statistical Methodology¶
See Architecture: Stage 3 for the evaluation philosophy.
Excess Accuracy Correction¶
Traditional benchmarks suffer from guessing inflation: a model randomly selecting from 4 options appears to achieve 25% accuracy despite zero knowledge.
Problem:
- Is 60% accuracy actually 60% knowledge or 35% knowledge + 25% luck?
- Binary tasks (50% guess) and 8-option tasks (12.5% guess) can't be compared
ReasonScape Solution: Remove expected guessing contributions before statistical analysis.
Algorithm¶
For each test sample, determine guess probability:
```python
if response_format == "multiple_choice":
    guess_chance = 1.0 / len(valid_options)
elif response_format == "write_in":
    guess_chance = 0.0
```
For each evaluation batch:
```python
from math import sqrt
from typing import List, Tuple

def compute_excess_accuracy(results: List[TestResult]) -> Tuple[float, float, float]:
    """
    Compute excess accuracy with 95% confidence intervals.

    Returns:
        (accuracy, ci_lower, ci_upper) where:
        - 0.000 = no better than guessing
        - 1.000 = perfect knowledge
    """
    # Step 1: Accumulate guess contributions
    guess_accumulator = sum(r.guess_chance for r in results if not r.truncated)
    correct = sum(1 for r in results if r.correct and not r.truncated)
    total = sum(1 for r in results if not r.truncated)

    # Step 2: Adjust totals
    adjusted_successes = correct - guess_accumulator
    adjusted_trials = total - guess_accumulator

    # Step 3: Compute Wilson confidence interval
    if adjusted_trials <= 0:
        return 0.0, 0.0, 0.0

    # Clamp to [0, 1] so the Wilson formula stays valid when a model scores below chance
    p = max(0.0, min(1.0, adjusted_successes / adjusted_trials))
    n = adjusted_trials

    # Wilson score interval
    z = 1.96  # 95% confidence
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2*n)) / denominator
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4*n**2)) / denominator

    accuracy = p
    ci_lower = max(0.0, center - margin)
    ci_upper = min(1.0, center + margin)
    return accuracy, ci_lower, ci_upper
```
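Connecting this back to the question above, a worked example with hypothetical numbers:

```python
# 100 non-truncated samples on a 4-option multiple-choice task, 60 answered correctly
guess_accumulator = 100 * 0.25        # 25.0 correct answers expected from pure luck
adjusted_successes = 60 - 25.0        # 35.0
adjusted_trials = 100 - 25.0          # 75.0
excess_accuracy = 35.0 / 75.0         # ≈ 0.467 on the [0, 1] knowledge scale, not 0.60
```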
Properties:
- Fair comparison: All tasks normalized to [0, 1] knowledge scale
- Honest baseline: 0.000 explicitly means "pure guessing"
- Statistical rigor: Wilson intervals handle edge cases (small n, extreme p)
See Also: evaluate.py:compute_bucket_stats() for complete implementation.
Truncation Handling¶
Truncations are not treated as task errors; they are tracked as a distinct quality metric:
Treatment:
- Remove from accuracy calculations: Truncations ≠ wrong answers
- Track separately: Truncation rate as independent KPI
- Widen confidence intervals: Reduced effective sample size increases uncertainty
- Report explicitly: Leaderboard shows truncation indicators
Example:
200 total samples:
- 150 correct
- 30 incorrect
- 20 truncated
Traditional: 150/200 = 75% accuracy
ReasonScape: 150/180 = 83.3% accuracy, 10% truncation rate
Interpretation:
- High accuracy + low truncation = strong performance
- High accuracy + high truncation = capable but context-limited
- Low accuracy + high truncation = both capability and context issues
- Low accuracy + low truncation = capability issue, not context
Token Analysis¶
Completion token counts are tracked and analyzed separately for truncated, correct and incorrect answers.
Applications:
- Efficiency analysis: Identify models that waste tokens
- Pattern detection: Do incorrect answers consistently use more/fewer tokens?
- Truncation prediction: What token counts correlate with truncation?
- Cost optimization: Balance accuracy vs token usage
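A minimal sketch of this per-status breakdown, assuming records already use the (status, tokens, compressed_size) layout described in the next section (the function name is illustrative):

```python
from collections import defaultdict
from statistics import mean

STATUS_NAMES = {0: "incorrect", 1: "correct", 2: "truncated"}

def mean_tokens_by_status(records):
    """Average completion token count per outcome class."""
    by_status = defaultdict(list)
    for status, tokens, _compressed_size in records:
        by_status[STATUS_NAMES[status]].append(tokens)
    return {name: mean(values) for name, values in by_status.items()}
```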
Compression Pre-Computation¶
During processing, we compute gzip(reasoning_trace) for every reasoning trace output by the LLM.
Rationale: Compression size is a proxy for entropy:
- Low compressibility (large compressed size) = high entropy, information-rich reasoning
- High compressibility (small compressed size) = low entropy, repetitive or degenerate reasoning
Storage Format:
```python
(status: int, tokens: int, compressed_size: int)
# Status values: 0 "incorrect", 1 "correct", 2 "truncated"
```
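A minimal sketch of how such a record can be produced (illustrative function name; assumes the trace is UTF-8 text):

```python
import gzip

def trace_record(status: int, tokens: int, reasoning_trace: str) -> tuple:
    """Build a (status, tokens, compressed_size) record for one response.
    A larger compressed size at a given token count indicates higher-entropy,
    less repetitive reasoning."""
    compressed_size = len(gzip.compress(reasoning_trace.encode("utf-8")))
    return (status, tokens, compressed_size)
```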
Applications:
- Failure mechanism investigation: Do failing samples show low entropy (loops)?
- Efficiency comparison: Which models maintain high entropy under load?
- Cognitive analysis: Information-theoretic view of reasoning quality
See Also:
- analyze.py compression
- analyze.py hazard
ReasonScore Calculation¶
ReasonScore provides a single, interpretable metric that aggregates performance across all tasks and difficulty levels while accounting for statistical reliability and context limit issues.
ReasonScore is computed in four layers using mathematically appropriate operations at each level:
Layer 1: Samples → Point Score [Wilson CI]
Layer 2: Points → Task Score [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore [Geometric Mean × 1000]
Layer 4: Tiers → score/token [Arithmetic Mean ÷ tokens]
Point-level formula:
point_score = adjusted_center + adjusted_margin - truncated_ratio
Tier-level aggregation:
ReasonScore = 1000 × Geometric_Mean([bucket_scores])
Key properties:
- Optimistic about uncertainty (add margin)
- Pessimistic about failures (subtract truncation)
- Punishes imbalance (geometric mean across tasks)
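A minimal sketch of the Layer 3 aggregation, assuming per-task scores are already on the [0, 1] scale produced by the layers above (function name is illustrative):

```python
from math import prod

def tier_reasonscore(task_scores: list) -> float:
    """Geometric mean of per-task scores, scaled by 1000. One weak task drags
    the tier down far more than an arithmetic mean would."""
    geometric_mean = prod(task_scores) ** (1.0 / len(task_scores))
    return 1000.0 * geometric_mean
```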
For complete design rationale, layer-by-layer computation, and philosophical motivation: See reasonscore.md
Data Model: Two-Plane Structure¶
ReasonScape stores evaluation results in a two-plane structure where each point exists simultaneously in both an Evaluation Plane (model/template/sampler) and a Task-Complexity Plane (task/params).
Identity dimensions (5D):
- Evaluation Plane: `model`, `template`, `sampler`
- Task-Complexity Plane: `base_task`, `params`

Facet dimensions (multi-valued tags):
- Evaluation Plane: `eval_id`, `groups[]`
- Task-Complexity Plane: `tiers[]`, `surfaces[]`, `projections[]`
Key features:
- Points with identical 5D identity are de-duplicated
- Facets enable multi-view organization (group by tier, filter by surface, etc.)
- Orthogonal planes enable independent variation (same model across difficulties, different models at same difficulty)
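An illustrative point record combining both planes (field names follow the dimensions above; all values are hypothetical):

```python
point = {
    # Evaluation Plane identity
    "model": "phi-4-fp16",
    "template": "chatml",
    "sampler": "greedy",
    # Task-Complexity Plane identity
    "base_task": "arithmetic",
    "params": {"length": 16, "depth": 3},
    # Facet dimensions (multi-valued tags)
    "eval_id": "run-2024-01",
    "groups": ["baseline"],
    "tiers": ["medium"],
    "surfaces": ["length-depth"],
    "projections": ["length"],
}
```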
For complete design decisions, orthogonality rationale, and facet computation: See manifold.md
For complete API reference and query patterns: See pointsdb.md
Implementation References¶
Core Components¶
Stage 1: Definition
- tasks/*.py - Parametric test generators with Pydantic schemas
- configs/*.yaml - Manifold definitions and sampling strategies
- resolver.py - Configuration validator and cost predictor
- tasks.md - Abstract task API reference
- config.md - Templates, samplers, and dataset configuration
Stage 2: Execution
- runner.py - Execution orchestrator with caching
- templates/*.json - Chat template definitions
- samplers/*.json - Generation parameter presets
Stage 3: Evaluation
- evaluate.py - Unified evaluation with dataset and interview modes
- data/dataset-*.json - Bucket organization metadata
- pointsdb.md - Complete data structure API
Stage 4: Exploration
- leaderboard.py - Interactive rankings and heatmaps
- spiderweb.py - Per-model diagnostics (V2)
- explorer.py - 3D manifold navigation
Stage 5: Research
- analyze.py - Unified 9-subcommand interface (V2)
- tools.md - Complete tool reference
See Also¶
- implementation.md - Top-level implementation overview
- architecture.md - Five-stage methodology
- manifold.md - Two-plane data model design
- reasonscore.md - Unified metric design
- config.md - Configuration reference
- tasks.md - Task API specifications
- tools.md - Complete tool reference