ReasonScape Technical Details

This document provides low-level algorithms and implementation details of the core technical mechanisms underlying ReasonScape.


Table of Contents

  1. Parametric Test Generation
  2. Token-Frequency Domain Analysis
  3. Progressive Evaluation Architecture
  4. Statistical Methodology
  5. Implementation References

Parametric Test Generation

See Architecture: Stage 1 for the design philosophy.

Coordinate-Based Seeding

Every test sequence is deterministically generated from the parameter coordinates:

# From runner.py:472-476
# Create stable seed based on params (excluding 'count')
seed_params = {k: v for k, v in step_info['params'].items() if k != 'count'}
param_hash = generate_cache_key(seed_params)  # SHA-256 hash of JSON-serialized params
base_seed = int(param_hash[-8:], 16)  # Take last 8 hex digits as seed
generator.rng = random.Random(args.seed + base_seed)

Where generate_cache_key() is:

import hashlib
import json

def generate_cache_key(cache_data):
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

How It Works:

  1. Extract parameter coordinates (e.g., {"length": 16, "depth": 3})
  2. Remove count parameter (doesn't affect test content, only quantity)
  3. Compute SHA-256 hash of JSON-serialized parameters
  4. Take last 8 hex digits as integer seed
  5. Add global seed offset (args.seed)
  6. Initialize random.Random() with this seed
  7. Call generator.generate_random(**params) which uses the seeded RNG
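
The following minimal sketch stitches these steps together; derive_seed() is an illustrative helper rather than the runner.py API, and the global seed value is arbitrary:

import hashlib
import json
import random

def derive_seed(params: dict, global_seed: int = 42) -> int:
    seed_params = {k: v for k, v in params.items() if k != "count"}    # step 2
    param_hash = hashlib.sha256(
        json.dumps(seed_params, sort_keys=True).encode()).hexdigest()  # step 3
    return global_seed + int(param_hash[-8:], 16)                      # steps 4-5

# 'count' never influences the seed, so both RNGs produce identical streams
rng_a = random.Random(derive_seed({"length": 16, "depth": 3, "count": 32}))
rng_b = random.Random(derive_seed({"length": 16, "depth": 3, "count": 128}))
assert [rng_a.random() for _ in range(5)] == [rng_b.random() for _ in range(5)]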

Properties:

  • Same coordinates always produce identical test sequences
  • Smaller count values are perfect subsets of larger ones (hierarchical sampling)
  • Different global seeds (args.seed) produce different test sets for same coordinates
  • Enables deterministic caching and reproducible evaluations

Manifold Parameter Types

Tasks define difficulty through multiple dimensions:

Dimension Type | Example Parameters           | Effect on Difficulty
---------------|------------------------------|----------------------------
Length         | length, num_terms, num_steps | Working memory load
Depth          | max_depth, nesting           | Structural complexity
Interference   | distractors, noise_ratio     | Selective attention demand
Format         | whitespace, case_mutations   | Tokenization stress
Multi-step     | num_operations               | Sequential reasoning

See Also: Task Documentation for per-task manifold specifications.


Token-Frequency Domain Analysis

See insight.md for the information processing paradigm that motivates this analysis.

ReasonScape introduces Token-Frequency Domain Analysis to validate test populations and reveal architectural patterns invisible in text space.

Methodology

For each test instance:

  1. Apply chat template to test text
  2. Tokenize using model-specific tokenizer
  3. Compute token sequence length N
  4. Apply FFT to token ID sequence: F = FFT(token_ids)
  5. Compute magnitude spectrum: |F| (discard phase)
  6. Normalize by sequence length

For each population of tests at a manifold point:

  1. Collect magnitude spectra from all samples
  2. Compute mean spectrum: μ(f) = mean(|F_i(f)|) across samples
  3. Compute standard deviation: σ(f) = std(|F_i(f)|)
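
A minimal sketch of the per-instance and per-population computations, assuming NumPy; magnitude_spectrum() and population_stats() are illustrative names rather than the ReasonScape API, and chat templating plus tokenization are assumed to have already produced token_ids:

import numpy as np

def magnitude_spectrum(token_ids: list) -> np.ndarray:
    """Steps 3-6: FFT magnitude spectrum of a token ID sequence, length-normalized."""
    ids = np.asarray(token_ids, dtype=np.float64)
    spectrum = np.abs(np.fft.rfft(ids))   # magnitude only, phase discarded
    return spectrum / len(ids)            # normalize by sequence length N

def population_stats(spectra: list, bins: int = 256):
    """Mean and std across samples; spectra are interpolated onto a common
    frequency grid because prompts tokenize to different lengths."""
    grid = np.linspace(0.0, 1.0, bins)
    resampled = [np.interp(grid, np.linspace(0.0, 1.0, len(s)), s) for s in spectra]
    stacked = np.vstack(resampled)
    return stacked.mean(axis=0), stacked.std(axis=0)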

Interpretation

Frequency Domain Insights:

  • Low frequencies: Long-range patterns, sentence structure
  • High frequencies: Token-level variations, formatting noise
  • Spectral peaks: Repeated patterns, structural rhythms
  • Bandwidth: Diversity of token patterns

Cross-Model Comparisons:

  • Different tokenizers create different spectral signatures
  • Same test text → different frequency profiles per model
  • Reveals tokenization efficiency and pattern recognition

RF Domain Analogies:

  • Gain: Spectral amplitude changes with difficulty
  • Compression: Bandwidth reduction under cognitive load
  • Modulation: Parameter sweeps create predictable spectral shifts

Applications

  1. Population Validation: Ensure test sets have expected spectral properties
  2. Tokenizer Analysis: Identify efficiency differences between models
  3. Difficulty Calibration: Verify manifold parameters affect input complexity
  4. Failure Mode Diagnosis: Correlate spectral patterns with accuracy drops

See Also:

  • analyze.py fft for forensic investigation
  • Explorer for interactive visualization


Progressive Evaluation Architecture

See Architecture: Stage 2 for the execution philosophy.

1. Response Caching Implementation

All inference requests are cached using SHA-256 hashes of the complete request payload:

def generate_cache_key(cache_data: dict) -> str:
    """
    Generate deterministic cache key from request payload.

    Args:
        cache_data: {
            "model": "phi-4-fp16",
            "messages": [...],
            "temperature": 0.0,
            "max_tokens": 4096,
            "top_p": 1.0,
            # ... all sampling parameters
        }

    Returns:
        64-character hex string (SHA-256 hash)
    """
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

Cache Behavior:

  • Every unique prompt gets its own cache entry
  • Changing any parameter (temperature, max_tokens, etc.) creates new entry
  • Identical requests (same model + prompt + parameters) are never re-executed
  • Deterministic test generation ensures reproducible cache keys
  • Typical cost reduction of 30% for 3-tiered evaluation
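
As a hedged illustration of this behavior, reusing generate_cache_key() from above with a made-up request payload:

base_request = {
    "model": "phi-4-fp16",
    "messages": [{"role": "user", "content": "What is 17 * 3?"}],
    "temperature": 0.0,
    "max_tokens": 4096,
    "top_p": 1.0,
}

key_a = generate_cache_key(base_request)
key_b = generate_cache_key({**base_request, "temperature": 0.7})

assert key_a != key_b                                    # any parameter change -> new entry
assert key_a == generate_cache_key(dict(base_request))   # identical request -> cache hit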

2. Hierarchical Sampling

Test generators guarantee subset relationships:

# Example: count-invariant generation (illustrative; the real seed is
# derived via the SHA-256 scheme shown above, not Python's salted hash())
manifold_point = {"length": 16, "depth": 3}
seed = hash(("arithmetic", (("depth", 3), ("length", 16))))

# Generate different sample counts
tests_32 = generate_tests(seed, count=32)   # [t0, t1, ..., t31]
tests_128 = generate_tests(seed, count=128) # [t0, t1, ..., t127]

# Verify subset property
assert tests_32 == tests_128[:32]  # Always True

Enabled Workflows:

  • Upsampling: Add more samples to existing evaluation (no waste)
  • Downsampling: Use subset of large evaluation for quick comparison
  • Adaptive Sampling: Continue until statistical confidence target reached
  • Cost Optimization: Start small, scale only where needed

3. Dynamic Confidence Targeting

Progressive sampling adapts to the statistical needs of each manifold point:

# Pseudocode for adaptive sampling
min_samples = 32
max_samples = 512
target_ci_width = 0.05

samples = min_samples
while samples < max_samples:
    accuracy, ci_lower, ci_upper = compute_statistics(samples)
    ci_width = ci_upper - ci_lower

    if ci_width <= target_ci_width:
        break  # Sufficient confidence

    samples += 32  # Request more samples

Optimization:

  • Easy points (near 0% or 100% accuracy) reach confidence quickly
  • Hard points (near 50% accuracy) require more samples
  • High-truncation points stop early to avoid wasting tokens

4. Truncation-Aware Execution

Models that consistently truncate on specific manifold points waste resources:

Standard Approach (wasteful):

  • Request 128 samples at difficult point
  • 100 samples truncate (context limit exceeded)
  • Only 28 valid responses, wide confidence intervals
  • Wasted 100 × max_tokens on truncations

ReasonScape Approach (efficient):

  • Detect high truncation rate (>50%)
  • Apply relaxed confidence target (or stop sampling)
  • Allocate saved tokens to more productive points
  • Track truncation rate as separate quality metric
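
A minimal sketch of this decision rule; the 50% threshold follows the description above, while the helper name and the relaxation strategy are illustrative rather than runner.py's actual logic:

TRUNCATION_THRESHOLD = 0.5

def should_stop_sampling(truncated: int, total: int,
                         ci_width: float, target_ci_width: float = 0.05) -> bool:
    truncation_rate = truncated / total if total else 0.0
    if truncation_rate > TRUNCATION_THRESHOLD:
        return True                       # stop early: tokens are better spent elsewhere
    return ci_width <= target_ci_width    # otherwise stop once the confidence target is met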

See Also: runner.py for execution implementation details.


Statistical Methodology

See Architecture: Stage 3 for the evaluation philosophy.

Excess Accuracy Correction

Traditional benchmarks suffer from guessing inflation: a model randomly selecting from 4 options appears to achieve 25% accuracy despite zero knowledge.

Problem:

  1. Is 60% accuracy actually 60% knowledge or 35% knowledge + 25% luck?
  2. Binary tasks (50% guess) and 8-option tasks (12.5% guess) can't be compared

ReasonScape Solution: Remove expected guessing contributions before statistical analysis.

Algorithm

For each test sample, determine guess probability:

if response_format == "multiple_choice":
    guess_chance = 1.0 / len(valid_options)
elif response_format == "write_in":
    guess_chance = 0.0

For each evaluation batch:

from math import sqrt
from typing import List, Tuple

def compute_excess_accuracy(results: List[TestResult]) -> Tuple[float, float, float]:
    """
    Compute excess accuracy with 95% confidence intervals.

    Returns:
        (accuracy, ci_lower, ci_upper) where:
        - 0.000 = no better than guessing
        - 1.000 = perfect knowledge
    """
    # Step 1: Accumulate guess contributions
    guess_accumulator = sum(r.guess_chance for r in results if not r.truncated)
    correct = sum(1 for r in results if r.correct and not r.truncated)
    total = sum(1 for r in results if not r.truncated)

    # Step 2: Adjust totals
    adjusted_successes = correct - guess_accumulator
    adjusted_trials = total - guess_accumulator

    # Step 3: Compute Wilson confidence interval
    if adjusted_trials <= 0:
        return 0.0, 0.0, 0.0

    # Clamp the point estimate to [0, 1] so the Wilson margin stays well-defined
    p = max(0.0, min(1.0, adjusted_successes / adjusted_trials))
    n = adjusted_trials

    # Wilson score interval
    z = 1.96  # 95% confidence
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2*n)) / denominator
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4*n**2)) / denominator

    accuracy = max(0.0, min(1.0, p))
    ci_lower = max(0.0, center - margin)
    ci_upper = min(1.0, center + margin)

    return accuracy, ci_lower, ci_upper
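
A worked example with illustrative numbers (not taken from a real run):

100 non-truncated samples, 4-option multiple choice (guess_chance = 0.25 each):
- 60 correct

guess_accumulator  = 100 × 0.25 = 25
adjusted_successes = 60 - 25 = 35
adjusted_trials    = 100 - 25 = 75
excess accuracy    = 35 / 75 ≈ 0.467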

Properties:

  • Fair comparison: All tasks normalized to [0, 1] knowledge scale
  • Honest baseline: 0.000 explicitly means "pure guessing"
  • Statistical rigor: Wilson intervals handle edge cases (small n, extreme p)

See Also: evaluate.py:compute_bucket_stats() for complete implementation.

Truncation Handling

Truncations are not scored as errors; they are tracked as a distinct quality metric:

Treatment:

  1. Remove from accuracy calculations: Truncations ≠ wrong answers
  2. Track separately: Truncation rate as independent KPI
  3. Widen confidence intervals: Reduced effective sample size increases uncertainty
  4. Report explicitly: Leaderboard shows truncation indicators

Example:

200 total samples:
- 150 correct
- 30 incorrect
- 20 truncated

Traditional: 150/200 = 75% accuracy
ReasonScape: 150/180 = 83.3% accuracy, 10% truncation rate

Interpretation:

  • High accuracy + low truncation = strong performance
  • High accuracy + high truncation = capable but context-limited
  • Low accuracy + high truncation = both capability and context issues
  • Low accuracy + low truncation = capability issue, not context

Token Analysis

Completion token counts are tracked and analyzed separately for truncated, correct and incorrect answers.

Applications:

  • Efficiency analysis: Identify models that waste tokens
  • Pattern detection: Do incorrect answers consistently use more/fewer tokens?
  • Truncation prediction: What token counts correlate with truncation?
  • Cost optimization: Balance accuracy vs token usage

Compression Pre-Computation

During processing, we compute gzip(reasoning_trace) for every reasoning trace output by the LLM.

Rationale: Compressed size is a proxy for the entropy of the reasoning trace:

  • Low compressibility (large compressed size) = high entropy, information-rich reasoning
  • High compressibility (small compressed size) = low entropy, repetitive or degenerate reasoning

Storage Format:

(status: int, tokens: int, compressed_size: int)
# Status values: 0 "incorrect", 1 "correct",  2 "truncated"
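
A minimal sketch of the pre-computation, assuming Python's gzip module; compression_record() is an illustrative helper, not the actual pipeline function:

import gzip

def compression_record(reasoning_trace: str, tokens: int, status: int) -> tuple:
    """Return (status, tokens, compressed_size) for one reasoning trace."""
    compressed_size = len(gzip.compress(reasoning_trace.encode("utf-8")))
    return (status, tokens, compressed_size)

# A repetitive (degenerate) trace compresses far better than varied reasoning
loop = "Let me check that again. " * 100
assert len(gzip.compress(loop.encode("utf-8"))) < len(loop) // 10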

Applications:

  • Failure mechanism investigation: Do failing samples show low entropy (loops)?
  • Efficiency comparison: Which models maintain high entropy under load?
  • Cognitive analysis: Information-theoretic view of reasoning quality

See Also:

  • analyze.py compression
  • analyze.py hazard

ReasonScore Calculation

ReasonScore provides a single, interpretable metric that aggregates performance across all tasks and difficulty levels while accounting for statistical reliability and context limit issues.

ReasonScore is computed in four layers using mathematically appropriate operations at each level:

Layer 1: Samples → Point Score       [Wilson CI]
Layer 2: Points → Task Score         [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore    [Geometric Mean × 1000]
Layer 4: Tiers → score/token         [Arithmetic Mean ÷ tokens]

Point-level formula:

point_score = adjusted_center + adjusted_margin - truncated_ratio

Tier-level aggregation:

ReasonScore = 1000 × Geometric_Mean([bucket_scores])

Key properties:

  • Optimistic about uncertainty (add margin)
  • Pessimistic about failures (subtract truncation)
  • Punishes imbalance (geometric mean across tasks)
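
A hedged sketch of the Layer 3 aggregation; the bucket scores are made-up values and the zero-guard is an assumption, not part of the published formula:

import math

def tier_reasonscore(bucket_scores: list) -> float:
    """Geometric mean of per-task bucket scores, scaled by 1000."""
    clipped = [max(s, 1e-6) for s in bucket_scores]             # assumed guard against zeros
    log_mean = sum(math.log(s) for s in clipped) / len(clipped)
    return 1000.0 * math.exp(log_mean)

print(tier_reasonscore([0.80, 0.75, 0.70, 0.10]))   # ~453: one weak task drags the tier down
print(tier_reasonscore([0.60, 0.60, 0.60, 0.60]))   # 600: balanced performance scores higher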

For complete design rationale, layer-by-layer computation, and philosophical motivation: See reasonscore.md

Data Model: Two-Plane Structure

ReasonScape stores evaluation results in a two-plane structure where each point exists simultaneously in both an Evaluation Plane (model/template/sampler) and a Task-Complexity Plane (task/params).

Identity dimensions (5D):

  • Evaluation Plane: model, template, sampler
  • Task-Complexity Plane: base_task, params

Facet dimensions (multi-valued tags):

  • Evaluation Plane: eval_id, groups[]
  • Task-Complexity Plane: tiers[], surfaces[], projections[]

Key features:

  • Points with identical 5D identity are de-duplicated
  • Facets enable multi-view organization (group by tier, filter by surface, etc.)
  • Orthogonal planes enable independent variation (same model across difficulties, different models at same difficulty)
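
A purely illustrative sketch of one stored point; the field values are hypothetical and the exact record layout is defined by pointsdb.md, not here:

point = {
    # Evaluation Plane (identity)
    "model": "phi-4-fp16",
    "template": "chatml",
    "sampler": "greedy",
    # Task-Complexity Plane (identity)
    "base_task": "arithmetic",
    "params": {"length": 16, "depth": 3},
    # Facet dimensions (multi-valued tags, not part of the 5D identity)
    "eval_id": "eval-2025-01",
    "groups": ["baseline"],
    "tiers": ["medium"],
    "surfaces": ["length-x-depth"],
    "projections": ["length"],
}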

For complete design decisions, orthogonality rationale, and facet computation: See manifold.md

For complete API reference and query patterns: See pointsdb.md


Implementation References

Core Components

Stage 1: Definition

  • tasks/*.py - Parametric test generators with Pydantic schemas
  • configs/*.yaml - Manifold definitions and sampling strategies
  • resolver.py - Configuration validator and cost predictor
  • tasks.md - Abstract task API reference
  • config.md - Templates, samplers, and dataset configuration

Stage 2: Execution

Stage 3: Evaluation

Stage 4: Exploration

Stage 5: Research

