ReasonScape Technical Details

This document provides low-level algorithms and implementation details of the core technical mechanisms underlying ReasonScape.


Table of Contents

  1. Parametric Test Generation
  2. Token-Frequency Domain Analysis
  3. Progressive Evaluation Architecture
  4. Statistical Methodology
  5. Implementation References

Parametric Test Generation

See Architecture: Stage 1 for the design philosophy.

Coordinate-Based Seeding

Every test sequence is deterministically generated from the parameter coordinates:

# From runner.py:472-476
# Create stable seed based on params (excluding 'count')
seed_params = {k: v for k, v in step_info['params'].items() if k != 'count'}
param_hash = generate_cache_key(seed_params)  # SHA-256 hash of JSON-serialized params
base_seed = int(param_hash[-8:], 16)  # Take last 8 hex digits as seed
generator.rng = random.Random(args.seed + base_seed)

Where generate_cache_key() is:

import hashlib
import json

def generate_cache_key(cache_data):
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

How It Works:

  1. Extract parameter coordinates (e.g., {"length": 16, "depth": 3})
  2. Remove count parameter (doesn't affect test content, only quantity)
  3. Compute SHA-256 hash of JSON-serialized parameters
  4. Take last 8 hex digits as integer seed
  5. Add global seed offset (args.seed)
  6. Initialize random.Random() with this seed
  7. Call generator.generate_random(**params) which uses the seeded RNG
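
The following minimal sketch stitches these steps together; derive_seed() is an illustrative helper rather than the runner.py API, and the global seed value is arbitrary:

import hashlib
import json
import random

def derive_seed(params: dict, global_seed: int = 42) -> int:
    seed_params = {k: v for k, v in params.items() if k != "count"}    # step 2
    param_hash = hashlib.sha256(
        json.dumps(seed_params, sort_keys=True).encode()).hexdigest()  # step 3
    return global_seed + int(param_hash[-8:], 16)                      # steps 4-5

# 'count' never influences the seed, so both RNGs produce identical streams
rng_a = random.Random(derive_seed({"length": 16, "depth": 3, "count": 32}))
rng_b = random.Random(derive_seed({"length": 16, "depth": 3, "count": 128}))
assert [rng_a.random() for _ in range(5)] == [rng_b.random() for _ in range(5)]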

Properties:

  • Same coordinates always produce identical test sequences
  • Smaller count values are perfect subsets of larger ones (hierarchical sampling)
  • Different global seeds (args.seed) produce different test sets for same coordinates
  • Enables deterministic caching and reproducible evaluations

Manifold Parameter Types

Tasks define difficulty through multiple dimensions:

Dimension Type | Example Parameters           | Effect on Difficulty
---------------|------------------------------|----------------------------
Length         | length, num_terms, num_steps | Working memory load
Depth          | max_depth, nesting           | Structural complexity
Interference   | distractors, noise_ratio     | Selective attention demand
Format         | whitespace, case_mutations   | Tokenization stress
Multi-step     | num_operations               | Sequential reasoning

See Also: Task Documentation for per-task manifold specifications.


Token-Frequency Domain Analysis

See insight.md for the information processing paradigm that motivates this analysis.

ReasonScape introduces Token-Frequency Domain Analysis to validate test populations and reveal architectural patterns invisible in text space.

Methodology

For each test instance:

  1. Apply chat template to test text
  2. Tokenize using model-specific tokenizer
  3. Compute token sequence length N
  4. Apply FFT to token ID sequence: F = FFT(token_ids)
  5. Compute magnitude spectrum: |F| (discard phase)
  6. Normalize by sequence length

For each population of tests at a manifold point:

  1. Collect magnitude spectra from all samples
  2. Compute mean spectrum: μ(f) = mean(|F_i(f)|) across samples
  3. Compute standard deviation: σ(f) = std(|F_i(f)|)
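
A minimal sketch of the per-instance and per-population computations, assuming NumPy; magnitude_spectrum() and population_stats() are illustrative names rather than the ReasonScape API, and chat templating plus tokenization are assumed to have already produced token_ids:

import numpy as np

def magnitude_spectrum(token_ids: list) -> np.ndarray:
    """Steps 3-6: FFT magnitude spectrum of a token ID sequence, length-normalized."""
    ids = np.asarray(token_ids, dtype=np.float64)
    spectrum = np.abs(np.fft.rfft(ids))   # magnitude only, phase discarded
    return spectrum / len(ids)            # normalize by sequence length N

def population_stats(spectra: list, bins: int = 256):
    """Mean and std across samples; spectra are interpolated onto a common
    frequency grid because prompts tokenize to different lengths."""
    grid = np.linspace(0.0, 1.0, bins)
    resampled = [np.interp(grid, np.linspace(0.0, 1.0, len(s)), s) for s in spectra]
    stacked = np.vstack(resampled)
    return stacked.mean(axis=0), stacked.std(axis=0)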

Interpretation

Frequency Domain Insights:

  • Low frequencies: Long-range patterns, sentence structure
  • High frequencies: Token-level variations, formatting noise
  • Spectral peaks: Repeated patterns, structural rhythms
  • Bandwidth: Diversity of token patterns

Cross-Model Comparisons:

  • Different tokenizers create different spectral signatures
  • Same test text → different frequency profiles per model
  • Reveals tokenization efficiency and pattern recognition

RF Domain Analogies:

  • Gain: Spectral amplitude changes with difficulty
  • Compression: Bandwidth reduction under cognitive load
  • Modulation: Parameter sweeps create predictable spectral shifts

Applications

  1. Population Validation: Ensure test sets have expected spectral properties
  2. Tokenizer Analysis: Identify efficiency differences between models
  3. Difficulty Calibration: Verify manifold parameters affect input complexity
  4. Failure Mode Diagnosis: Correlate spectral patterns with accuracy drops

See Also:

  • analyze.py fft for forensic investigation
  • Explorer for interactive visualization


Progressive Evaluation Architecture

See Architecture: Stage 2 for the execution philosophy.

1. Response Caching Implementation

All inference requests are cached using SHA-256 hashes of the complete request payload:

def generate_cache_key(cache_data: dict) -> str:
    """
    Generate deterministic cache key from request payload.

    Args:
        cache_data: {
            "model": "phi-4-fp16",
            "messages": [...],
            "temperature": 0.0,
            "max_tokens": 4096,
            "top_p": 1.0,
            # ... all sampling parameters
        }

    Returns:
        64-character hex string (SHA-256 hash)
    """
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

Cache Behavior:

  • Every unique prompt gets its own cache entry
  • Changing any parameter (temperature, max_tokens, etc.) creates new entry
  • Identical requests (same model + prompt + parameters) are never re-executed
  • Deterministic test generation ensures reproducible cache keys
  • Typical cost reduction of 30% for 3-tiered evaluation
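
As a hedged illustration of this behavior, reusing generate_cache_key() from above with a made-up request payload:

base_request = {
    "model": "phi-4-fp16",
    "messages": [{"role": "user", "content": "What is 17 * 3?"}],
    "temperature": 0.0,
    "max_tokens": 4096,
    "top_p": 1.0,
}

key_a = generate_cache_key(base_request)
key_b = generate_cache_key({**base_request, "temperature": 0.7})

assert key_a != key_b                                    # any parameter change -> new entry
assert key_a == generate_cache_key(dict(base_request))   # identical request -> cache hit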

2. Hierarchical Sampling

Test generators guarantee subset relationships:

# Example: count-invariant generation (illustrative; the real seed is
# derived via the SHA-256 scheme shown above, not Python's salted hash())
manifold_point = {"length": 16, "depth": 3}
seed = hash(("arithmetic", (("depth", 3), ("length", 16))))

# Generate different sample counts
tests_32 = generate_tests(seed, count=32)   # [t0, t1, ..., t31]
tests_128 = generate_tests(seed, count=128) # [t0, t1, ..., t127]

# Verify subset property
assert tests_32 == tests_128[:32]  # Always True

Enabled Workflows:

  • Upsampling: Add more samples to existing evaluation (no waste)
  • Downsampling: Use subset of large evaluation for quick comparison
  • Adaptive Sampling: Continue until statistical confidence target reached
  • Cost Optimization: Start small, scale only where needed

3. Dynamic Confidence Targeting

Progressive sampling adapts to the statistical needs of each manifold point:

# Pseudocode for adaptive sampling
min_samples = 32
max_samples = 512
target_ci_width = 0.05

samples = min_samples
while samples < max_samples:
    accuracy, ci_lower, ci_upper = compute_statistics(samples)
    ci_width = ci_upper - ci_lower

    if ci_width <= target_ci_width:
        break  # Sufficient confidence

    samples += 32  # Request more samples

Optimization:

  • Easy points (near 0% or 100% accuracy) reach confidence quickly
  • Hard points (near 50% accuracy) require more samples
  • High-truncation points stop early to avoid wasting tokens

4. Truncation-Aware Execution

Models that consistently truncate on specific manifold points waste resources:

Standard Approach (wasteful):

  • Request 128 samples at difficult point
  • 100 samples truncate (context limit exceeded)
  • Only 28 valid responses, wide confidence intervals
  • Wasted 100 × max_tokens on truncations

ReasonScape Approach (efficient):

  • Detect high truncation rate (>50%)
  • Apply relaxed confidence target (or stop sampling)
  • Allocate saved tokens to more productive points
  • Track truncation rate as separate quality metric
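
A minimal sketch of this decision rule; the 50% threshold follows the description above, while the helper name and the relaxation strategy are illustrative rather than runner.py's actual logic:

TRUNCATION_THRESHOLD = 0.5

def should_stop_sampling(truncated: int, total: int,
                         ci_width: float, target_ci_width: float = 0.05) -> bool:
    truncation_rate = truncated / total if total else 0.0
    if truncation_rate > TRUNCATION_THRESHOLD:
        return True                       # stop early: tokens are better spent elsewhere
    return ci_width <= target_ci_width    # otherwise stop once the confidence target is met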

See Also: runner.py for execution implementation details.


Statistical Methodology

See Architecture: Stage 3 for the evaluation philosophy.

Excess Accuracy Correction

Traditional benchmarks suffer from guessing inflation: a model randomly selecting from 4 options appears to achieve 25% accuracy despite zero knowledge.

Problem:

  1. Is 60% accuracy actually 60% knowledge or 35% knowledge + 25% luck?
  2. Binary tasks (50% guess) and 8-option tasks (12.5% guess) can't be compared

ReasonScape Solution: Remove expected guessing contributions before statistical analysis.

Algorithm

For each test sample, determine guess probability:

if response_format == "multiple_choice":
    guess_chance = 1.0 / len(valid_options)
elif response_format == "write_in":
    guess_chance = 0.0

For each evaluation batch:

from math import sqrt
from typing import List, Tuple

def compute_excess_accuracy(results: List[TestResult]) -> Tuple[float, float, float]:
    """
    Compute excess accuracy with 95% confidence intervals.

    Returns:
        (accuracy, ci_lower, ci_upper) where:
        - 0.000 = no better than guessing
        - 1.000 = perfect knowledge
    """
    # Step 1: Accumulate guess contributions
    guess_accumulator = sum(r.guess_chance for r in results if not r.truncated)
    correct = sum(1 for r in results if r.correct and not r.truncated)
    total = sum(1 for r in results if not r.truncated)

    # Step 2: Adjust totals
    adjusted_successes = correct - guess_accumulator
    adjusted_trials = total - guess_accumulator

    # Step 3: Compute Wilson confidence interval
    if adjusted_trials <= 0:
        return 0.0, 0.0, 0.0

    # Clamp the point estimate to [0, 1] so the Wilson margin stays well-defined
    p = max(0.0, min(1.0, adjusted_successes / adjusted_trials))
    n = adjusted_trials

    # Wilson score interval
    z = 1.96  # 95% confidence
    denominator = 1 + z**2 / n
    center = (p + z**2 / (2*n)) / denominator
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4*n**2)) / denominator

    accuracy = max(0.0, min(1.0, p))
    ci_lower = max(0.0, center - margin)
    ci_upper = min(1.0, center + margin)

    return accuracy, ci_lower, ci_upper
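
A worked example with illustrative numbers (not taken from a real run):

100 non-truncated samples, 4-option multiple choice (guess_chance = 0.25 each):
- 60 correct

guess_accumulator  = 100 × 0.25 = 25
adjusted_successes = 60 - 25 = 35
adjusted_trials    = 100 - 25 = 75
excess accuracy    = 35 / 75 ≈ 0.467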

Properties:

  • Fair comparison: All tasks normalized to [0, 1] knowledge scale
  • Honest baseline: 0.000 explicitly means "pure guessing"
  • Statistical rigor: Wilson intervals handle edge cases (small n, extreme p)

See Also: evaluate.py:compute_bucket_stats() for complete implementation.

Truncation Handling

Truncations are not scored as errors; they are tracked as a distinct quality metric:

Treatment:

  1. Remove from accuracy calculations: Truncations ≠ wrong answers
  2. Track separately: Truncation rate as independent KPI
  3. Widen confidence intervals: Reduced effective sample size increases uncertainty
  4. Report explicitly: Leaderboard shows truncation indicators

Example:

200 total samples:
- 150 correct
- 30 incorrect
- 20 truncated

Traditional: 150/200 = 75% accuracy
ReasonScape: 150/180 = 83.3% accuracy, 10% truncation rate

Interpretation:

  • High accuracy + low truncation = strong performance
  • High accuracy + high truncation = capable but context-limited
  • Low accuracy + high truncation = both capability and context issues
  • Low accuracy + low truncation = capability issue, not context

Token Analysis

Completion token counts are tracked and analyzed separately for truncated, correct and incorrect answers.

Applications:

  • Efficiency analysis: Identify models that waste tokens
  • Pattern detection: Do incorrect answers consistently use more/fewer tokens?
  • Truncation prediction: What token counts correlate with truncation?
  • Cost optimization: Balance accuracy vs token usage

Compression Pre-Computation

During processing, we compute gzip(reasoning_trace) for every reasoning trace output by the LLM.

Rationale: Compressed size is a proxy for the entropy of the reasoning trace:

  • Low compressibility (large compressed size) = high entropy, information-rich reasoning
  • High compressibility (small compressed size) = low entropy, repetitive or degenerate reasoning

Storage Format:

(status: int, tokens: int, compressed_size: int)
# Status values: 0 "incorrect", 1 "correct",  2 "truncated"
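
A minimal sketch of the pre-computation, assuming Python's gzip module; compression_record() is an illustrative helper, not the actual pipeline function:

import gzip

def compression_record(reasoning_trace: str, tokens: int, status: int) -> tuple:
    """Return (status, tokens, compressed_size) for one reasoning trace."""
    compressed_size = len(gzip.compress(reasoning_trace.encode("utf-8")))
    return (status, tokens, compressed_size)

# A repetitive (degenerate) trace compresses far better than varied reasoning
loop = "Let me check that again. " * 100
assert len(gzip.compress(loop.encode("utf-8"))) < len(loop) // 10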

Applications:

  • Failure mechanism investigation: Do failing samples show low entropy (loops)?
  • Efficiency comparison: Which models maintain high entropy under load?
  • Cognitive analysis: Information-theoretic view of reasoning quality

See Also:

  • analyze.py compression
  • analyze.py hazard

ReasonScore Calculation

ReasonScore provides a single, interpretable metric that aggregates performance across all tasks and difficulty levels while accounting for statistical reliability and context limit issues.

ReasonScore is computed in four layers using mathematically appropriate operations at each level:

Layer 1: Samples → Point Score       [Wilson CI]
Layer 2: Points → Task Score         [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore    [Geometric Mean × 1000]
Layer 4: Tiers → score/token         [Arithmetic Mean ÷ tokens]

Point-level formula:

point_score = adjusted_center + adjusted_margin - truncated_ratio

Tier-level aggregation:

ReasonScore = 1000 × Geometric_Mean([bucket_scores])

Key properties:

  • Optimistic about uncertainty (add margin)
  • Pessimistic about failures (subtract truncation)
  • Punishes imbalance (geometric mean across tasks)
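
A hedged sketch of the Layer 3 aggregation; the bucket scores are made-up values and the zero-guard is an assumption, not part of the published formula:

import math

def tier_reasonscore(bucket_scores: list) -> float:
    """Geometric mean of per-task bucket scores, scaled by 1000."""
    clipped = [max(s, 1e-6) for s in bucket_scores]             # assumed guard against zeros
    log_mean = sum(math.log(s) for s in clipped) / len(clipped)
    return 1000.0 * math.exp(log_mean)

print(tier_reasonscore([0.80, 0.75, 0.70, 0.10]))   # ~453: one weak task drags the tier down
print(tier_reasonscore([0.60, 0.60, 0.60, 0.60]))   # 600: balanced performance scores higher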

For complete design rationale, layer-by-layer computation, and philosophical motivation: See reasonscore.md

Data Model: Two-Plane Structure

ReasonScape stores evaluation results in a two-plane structure where each point exists simultaneously in both an Evaluation Plane (model/template/sampler) and a Task-Complexity Plane (task/params).

Identity dimensions (5D):

  • Evaluation Plane: model, template, sampler
  • Task-Complexity Plane: base_task, params

Facet dimensions (multi-valued tags):

  • Evaluation Plane: eval_id, groups[]
  • Task-Complexity Plane: tiers[], surfaces[], projections[]

Key features:

  • Points with identical 5D identity are de-duplicated
  • Facets enable multi-view organization (group by tier, filter by surface, etc.)
  • Orthogonal planes enable independent variation (same model across difficulties, different models at same difficulty)
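
A purely illustrative sketch of one stored point; the field values are hypothetical and the exact record layout is defined by pointsdb.md, not here:

point = {
    # Evaluation Plane (identity)
    "model": "phi-4-fp16",
    "template": "chatml",
    "sampler": "greedy",
    # Task-Complexity Plane (identity)
    "base_task": "arithmetic",
    "params": {"length": 16, "depth": 3},
    # Facet dimensions (multi-valued tags, not part of the 5D identity)
    "eval_id": "eval-2025-01",
    "groups": ["baseline"],
    "tiers": ["medium"],
    "surfaces": ["length-x-depth"],
    "projections": ["length"],
}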

For complete design decisions, orthogonality rationale, and facet computation: See manifold.md

For complete API reference and query patterns: See pointsdb.md


Implementation References

Core Components

Stage 1: Definition

  • tasks/*.py - Parametric test generators with Pydantic schemas
  • configs/*.yaml - Manifold definitions and sampling strategies
  • resolver.py - Configuration validator and cost predictor
  • tasks.md - Abstract task API reference
  • config.md - Templates, samplers, and dataset configuration

Stage 2: Execution

Stage 3: Evaluation

Stage 4: Exploration

Stage 5: Research

