
ReasonScape Technical Details

This document provides low-level algorithms and implementation details of the core technical mechanisms underlying ReasonScape.


Data Model: Two-Plane Structure

ReasonScape stores evaluation results in a two-plane structure where each point exists simultaneously in both an Evaluation Plane (model/template/sampler) and a Task-Complexity Plane (task/params).

Identity dimensions (5D):

  • Evaluation Plane: model, template, sampler
  • Task-Complexity Plane: base_task, params

Key features:

  • Points with identical 5D identity are de-duplicated
  • Orthogonal planes enable independent variation (same model across difficulties, different models at same difficulty)
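
A minimal sketch of this identity structure (the tuple layout and dict-based de-duplication below are illustrative assumptions, not the actual PointsDB storage format):

from typing import NamedTuple

class PointIdentity(NamedTuple):
    # Evaluation Plane
    model: str
    template: str
    sampler: str
    # Task-Complexity Plane
    base_task: str
    params: tuple  # hashable form of the params dict, e.g. tuple(sorted(d.items()))

points: dict[PointIdentity, dict] = {}

def upsert(identity: PointIdentity, result: dict) -> None:
    # Points with identical 5D identity collapse to a single entry
    points[identity] = result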

For complete design decisions, orthogonality rationale, and facet computation: See manifold.md

For complete API reference and query patterns: See pointsdb.md


Parametric Test Generation

See Architecture: Stage 1 for the design philosophy.

Coordinate-Based Seeding

Every test sequence is deterministically generated from the parameter coordinates:

# From runner.py:472-476
# Create stable seed based on params (excluding 'count')
seed_params = {k: v for k, v in step_info['params'].items() if k != 'count'}
param_hash = generate_cache_key(seed_params)  # SHA-256 hash of JSON-serialized params
base_seed = int(param_hash[-8:], 16)  # Take last 8 hex digits as seed
generator.rng = random.Random(args.seed + base_seed)

Where generate_cache_key() is:

import hashlib
import json

def generate_cache_key(cache_data):
    # sort_keys guarantees identical params always serialize to the same string
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

How It Works:

  1. Extract parameter coordinates (e.g., {"length": 16, "depth": 3})
  2. Remove count parameter (doesn't affect test content, only quantity)
  3. Compute SHA-256 hash of JSON-serialized parameters
  4. Take last 8 hex digits as integer seed
  5. Add global seed offset (args.seed)
  6. Initialize random.Random() with this seed
  7. Call generator.generate_random(**params) which uses the seeded RNG
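
As a worked example of steps 1-6 (the coordinate values and the global seed below are hypothetical):

import hashlib
import json
import random

params = {"length": 16, "depth": 3, "count": 32}
seed_params = {k: v for k, v in params.items() if k != "count"}  # step 2

param_hash = hashlib.sha256(
    json.dumps(seed_params, sort_keys=True).encode()
).hexdigest()                                  # step 3
base_seed = int(param_hash[-8:], 16)           # step 4

global_seed = 42                               # stands in for args.seed (step 5)
rng = random.Random(global_seed + base_seed)   # step 6
# rng now yields the same stream every time these coordinates are evaluated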

Properties:

  • Same coordinates always produce identical test sequences
  • Smaller count values are perfect subsets of larger ones (hierarchical sampling)
  • Different global seeds (args.seed) produce different test sets for same coordinates
  • Enables deterministic caching and reproducible evaluations

Manifold Parameter Types

Tasks define difficulty through multiple dimensions:

| Dimension Type | Example Parameters | Effect on Difficulty |
|---|---|---|
| Length | length, num_terms, num_steps | Working memory load |
| Depth | max_depth, nesting | Structural complexity |
| Interference | distractors, noise_ratio | Selective attention demand |
| Format | whitespace, case_mutations | Tokenization stress |
| Multi-step | num_operations | Sequential reasoning |

See Also: Task Documentation for per-task manifold specifications.


Progressive Evaluation Architecture

See Architecture: Stage 2 for the execution philosophy.

Response Caching Implementation

All inference requests are cached using SHA-256 hashes of the complete request payload:

def generate_cache_key(cache_data: dict) -> str:
    """
    Generate deterministic cache key from request payload.

    Args:
        cache_data: {
            "model": "phi-4-fp16",
            "messages": [...],
            "temperature": 0.0,
            "max_tokens": 4096,
            "top_p": 1.0,
            # ... all sampling parameters
        }

    Returns:
        64-character hex string (SHA-256 hash)
    """
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

Cache Behavior:

  • Every unique prompt gets its own cache entry
  • Changing any parameter (temperature, max_tokens, etc.) creates new entry
  • Identical requests (same model + prompt + parameters) are never re-executed
  • Deterministic test generation ensures reproducible cache keys
  • Typical cost reduction of 30% for repeated evaluations
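
A minimal sketch of the cache lookup around an inference call (the on-disk layout and the OpenAI-compatible client are assumptions, not the ReasonScape implementation):

import json
import os

CACHE_DIR = "cache"  # hypothetical location

def cached_completion(client, payload: dict) -> dict:
    key = generate_cache_key(payload)              # defined above
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(path):                       # hit: never re-execute
        with open(path) as f:
            return json.load(f)
    response = client.chat.completions.create(**payload)  # miss: run inference
    result = response.model_dump()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result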

Hierarchical Sampling

Test generators guarantee subset relationships:

# Example: count-invariant generation (generate_tests is illustrative;
# built-in hash() is avoided because it is randomized per interpreter run)
manifold_point = {"length": 16, "depth": 3}
seed = int(generate_cache_key(manifold_point)[-8:], 16)  # stable across runs

# Generate different sample counts
tests_32 = generate_tests(seed, count=32)   # [t0, t1, ..., t31]
tests_128 = generate_tests(seed, count=128) # [t0, t1, ..., t127]

# Verify subset property
assert tests_32 == tests_128[:32]  # Always True

Enabled Workflows:

  • Upsampling: Add more samples to existing evaluation (no waste)
  • Downsampling: Use subset of large evaluation for quick comparison
  • Cost Optimization: Start small, scale only where needed
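
One way a generator can guarantee the subset property is to draw every test from a single seeded RNG in order, so the first k tests are identical for any count >= k (a sketch; the actual generator interface differs):

import random

def generate_tests(seed: int, count: int) -> list:
    rng = random.Random(seed)
    # Sequential draws from one seeded stream: tests[:k] is the same
    # for every count >= k, which is exactly the subset guarantee.
    return [rng.randint(0, 10**9) for _ in range(count)]

assert generate_tests(7, 32) == generate_tests(7, 128)[:32]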

Statistical Methodology

Each ReasonScape evaluation trial has one of three observable outcomes: completed and correct, completed and incorrect, or truncated. These yield four counters: $n$ = total trials, $n_u$ = non-truncated (completed) trials, $n_e$ = correct trials, and $n_t$ = truncated trials, with $n = n_u + n_t$ (so completed-but-incorrect trials number $n_u - n_e$). For finite-option tasks (boolean, multiple-choice), a fifth counter $g$ (guess_accum) accumulates the per-trial guess probabilities $1/|\text{options}_i|$ over completed trials.

Two corrections are needed to recover an unbiased accuracy estimate:

  1. Censoring correction — truncated trials are right-censored observations, not failures. Standard accuracy ($n_e/n_u$) conditions on completion, which enriches the completed pool for easier problems as truncation rises.
  2. Guess correction — on finite-option tasks, some completions agree with the reference by chance. Because truncated trials never produced an answer, the guess correction applies only to completed trials; the two corrections therefore compose non-trivially.

The Six Estimators

The family is indexed by two axes: event (E = equality, C = correctness) and assumption about what truncated trials would have produced.

| Estimator | Event | Assumption | Expression |
|---|---|---|---|
| $\hat{E}_I$ | Equality | Independence ($P[E \mid T] = P[E \mid \neg T]$) | $\text{Wilson}(n_e,\; n_u)$ |
| $\hat{E}_P$ | Equality | Pessimism ($P[E \mid T] = 0$) | $\text{Wilson}(n_e,\; n)$ |
| $\hat{E}_O$ | Equality | Optimism ($P[E \mid T] = 1$) | $\text{Wilson}(n_e + n_t,\; n)$ |
| $\hat{C}_I$ | Correctness | Independence | $\text{Wilson}(n_e - g,\; n_u - g)$ |
| $\hat{C}_P$ | Correctness | Pessimism | $\text{Wilson}(n_e - g,\; n_u - g) \times \text{Wilson}(n_u,\; n)$ |
| $\hat{C}_O$ | Correctness | Optimism | $1 - \text{Wilson}(n_u - n_e,\; n_u - g) \times \text{Wilson}(n_u,\; n)$ |

E-estimators each reduce to a single Wilson interval because $n_u$ telescopes into the denominator. C-estimators for pessimism and optimism are irreducible products: the guess correction enters one factor but not the other, so the telescoping that collapses the E-family is broken. Confidence bounds propagate via interval arithmetic (Bonferroni combination).

Assumptions:

  • Independence — truncated trials are no harder than completed ones; ignore them and score on the completed pool. This is the implicit assumption of standard accuracy reporting.
  • Pessimism — the context limit is the deployment ceiling; a truncated trial is a failure. This is the recommended default: when truncation is low all assumptions agree, and when truncation is high pessimism is the only safe choice.
  • Optimism — the context limit is an evaluation artifact; truncated trials would have been correct given more tokens. Useful as an upper bound.

Boundary behavior: As $P[T] \to 0$, all six estimators converge. As $g \to 0$ (write-in tasks), the C-family collapses to the E-family. When both are zero, all six reduce to $\text{Wilson}(n_e, n_u)$: the standard estimator is the cornerstone of the family.
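
A minimal sketch of the Wilson interval and the pessimistic correctness estimator $\hat{C}_P$ (the element-wise interval product below is a simplification; the Bonferroni combination used in PointsDB may differ):

import math

def wilson(successes: float, trials: float, z: float = 1.96):
    # Wilson score interval: returns (low, center, high)
    if trials <= 0:
        return (0.0, 0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), center, min(1.0, center + margin))

def c_p(n_e: int, n_u: int, n_t: int, g: float):
    # C_P = Wilson(n_e - g, n_u - g) * Wilson(n_u, n): guess-corrected
    # accuracy on completed trials times the completion rate
    n = n_u + n_t
    acc = wilson(n_e - g, n_u - g)
    comp = wilson(n_u, n)
    return tuple(a * c for a, c in zip(acc, comp))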

PointsDB Mode Names

The six estimators are exposed via the mode parameter of aggregate() and query_points(). The naming scheme is <event>_<assumption>:

| Mode | Estimator | Default for |
|---|---|---|
| 'E_I' | $\hat{E}_I$ | |
| 'E_P' | $\hat{E}_P$ | |
| 'E_O' | $\hat{E}_O$ | |
| 'C_I' | $\hat{C}_I$ | query_points() |
| 'C_P' | $\hat{C}_P$ | aggregate() |
| 'C_O' | $\hat{C}_O$ | |

C_P is the recommended mode for leaderboard analysis: it applies pessimistic truncation handling and guess correction together.
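
A hypothetical invocation (db stands in for a PointsDB handle; only the mode parameter is documented here, any other arguments are assumptions):

per_point = db.query_points(mode='C_I')   # default mode for query_points()
leaderboard = db.aggregate(mode='C_P')    # recommended for leaderboard analysis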

See also: Censored Equality paper for full derivations, empirical validation, and boundary-condition proofs. PointsDB Statistical Modes for implementation details and SQL macro definitions.

Compression Pre-Computation

During processing, we compute the gzip-compressed size of every reasoning trace output by the LLM.

Rationale: Compressed size is a proxy for entropy:

  • Compresses poorly (large compressed size) = high entropy, information-rich reasoning
  • Compresses well (small compressed size) = low entropy, repetitive or degenerate reasoning

Storage Format:

(status: int, tokens: int, compressed_size: int)
# Status values: 0 = incorrect, 1 = correct, 2 = truncated
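
A minimal sketch of the pre-computation itself (the function name and packing are illustrative; only the tuple layout above is from the source):

import gzip

def compression_record(status: int, tokens: int, trace: str) -> tuple:
    # Compressed size approximates the entropy of the reasoning trace:
    # repetitive loops compress well, information-dense text does not.
    compressed_size = len(gzip.compress(trace.encode("utf-8")))
    return (status, tokens, compressed_size)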

Applications:

  • Failure mechanism investigation: Do failing samples show low entropy (loops)?
  • Efficiency comparison: Which models maintain high entropy under load?
  • Cognitive analysis: Information-theoretic view of reasoning quality

See Also:

  • analyze.py compression
  • analyze.py hazard

Pairwise Win Rate Statistical Methodology

The pairwise comparison approach uses probabilistic head-to-head win computation with Monte Carlo sampling from confidence interval distributions.

Step 1: Task-Level Aggregation

Rather than comparing at 453 difficulty points (which would produce mostly ties with ±9% CIs), we aggregate to task level first:

  • 12 comparisons per model pair (one per task)
  • Leverages existing Wilson CI re-aggregation (tight confidence intervals)

Step 2: Probabilistic Win Computation

For each (model_a, model_b, task) combination:

import numpy as np

# Model each Wilson CI as a Beta distribution
alpha_a, beta_a = wilson_to_beta(center_a, margin_a)
alpha_b, beta_b = wilson_to_beta(center_b, margin_b)

# Monte Carlo sampling (10,000 samples per model)
samples_a = np.random.beta(alpha_a, beta_a, 10_000)
samples_b = np.random.beta(alpha_b, beta_b, 10_000)

# Win probability P(A > B)
p_a_wins = np.mean(samples_a > samples_b)
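
One plausible wilson_to_beta via moment matching (an assumption; the actual conversion in the source may differ):

def wilson_to_beta(center: float, margin: float, z: float = 1.96):
    # Treat the CI as mean = center and std-error = margin / z, then
    # moment-match to Beta: alpha + beta = mean * (1 - mean) / var - 1
    var = (margin / z) ** 2
    k = center * (1 - center) / var - 1
    return center * k, (1 - center) * k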

Step 3: Aggregate Across Tasks

overall_win_rate[A vs B] = mean(P(A > B) across 12 tasks)

Step 4: Compute Rankings

Expected Wins (Linear):

expected_wins[A] = sum(win_rate[A vs i] for all opponents i)

Bradley-Terry (Iterative ML):

# Solve: P(A > B) = rating_A / (rating_A + rating_B)
# Via iterative proportional fitting
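
A minimal sketch of that iterative fit (a standard Bradley-Terry minorization-maximization update on the pairwise win matrix; simplified relative to any production implementation):

import numpy as np

def bradley_terry(win_matrix: np.ndarray, iters: int = 100) -> np.ndarray:
    # win_matrix[i, j] holds (possibly fractional) wins of i over j.
    # Fits ratings r so that P(i beats j) ~= r[i] / (r[i] + r[j]).
    n = win_matrix.shape[0]
    r = np.ones(n)
    games = win_matrix + win_matrix.T          # total comparisons per pair
    for _ in range(iters):
        for i in range(n):
            mask = np.arange(n) != i
            denom = np.sum(games[i, mask] / (r[i] + r[mask]))
            r[i] = win_matrix[i, mask].sum() / denom
        r /= r.sum()                           # fix the overall scale
    return r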

Properties:

  • Graceful uncertainty handling: Uses full probability distributions, not hard thresholds
  • Statistically principled: Beta distributions naturally represent binomial confidence intervals
  • Computationally efficient: 10,000 samples provides stable estimates (±0.5% precision)
  • Complementary rankings: Expected Wins (intuitive) and Bradley-Terry (sophisticated) provide different perspectives

See Also: Pairwise Win Rate workflow for usage guidance and interpretation.


See Also