
ReasonScape Technical Details

This document provides low-level algorithms and implementation details of the core technical mechanisms underlying ReasonScape.


Data Model: Two-Plane Structure

ReasonScape stores evaluation results in a two-plane structure where each point exists simultaneously in both an Evaluation Plane (model/template/sampler) and a Task-Complexity Plane (task/params).

Identity dimensions (5D):

  • Evaluation Plane: model, template, sampler
  • Task-Complexity Plane: base_task, params

Key features:

  • Points with identical 5D identity are de-duplicated
  • Orthogonal planes enable independent variation (same model across difficulties, different models at same difficulty)
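
A minimal sketch of this identity structure (the tuple layout and dict-based de-duplication below are illustrative assumptions, not the actual PointsDB storage format):

from typing import NamedTuple

class PointIdentity(NamedTuple):
    # Evaluation Plane
    model: str
    template: str
    sampler: str
    # Task-Complexity Plane
    base_task: str
    params: tuple  # hashable form of the params dict, e.g. tuple(sorted(d.items()))

points: dict[PointIdentity, dict] = {}

def upsert(identity: PointIdentity, result: dict) -> None:
    # Points with identical 5D identity collapse to a single entry
    points[identity] = result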

For complete design decisions, orthogonality rationale, and facet computation: See manifold.md

For complete API reference and query patterns: See pointsdb.md


Parametric Test Generation

See Architecture: Stage 1 for the design philosophy.

Coordinate-Based Seeding

Every test sequence is deterministically generated from the parameter coordinates:

# From runner.py:472-476
# Create stable seed based on params (excluding 'count')
seed_params = {k: v for k, v in step_info['params'].items() if k != 'count'}
param_hash = generate_cache_key(seed_params)  # SHA-256 hash of JSON-serialized params
base_seed = int(param_hash[-8:], 16)  # Take last 8 hex digits as seed
generator.rng = random.Random(args.seed + base_seed)

Where generate_cache_key() is:

import hashlib
import json

def generate_cache_key(cache_data):
    # sort_keys guarantees identical params always serialize to the same string
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

How It Works:

  1. Extract parameter coordinates (e.g., {"length": 16, "depth": 3})
  2. Remove count parameter (doesn't affect test content, only quantity)
  3. Compute SHA-256 hash of JSON-serialized parameters
  4. Take last 8 hex digits as integer seed
  5. Add global seed offset (args.seed)
  6. Initialize random.Random() with this seed
  7. Call generator.generate_random(**params) which uses the seeded RNG
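
As a worked example of steps 1-6 (the coordinate values and the global seed below are hypothetical):

import hashlib
import json
import random

params = {"length": 16, "depth": 3, "count": 32}
seed_params = {k: v for k, v in params.items() if k != "count"}  # step 2

param_hash = hashlib.sha256(
    json.dumps(seed_params, sort_keys=True).encode()
).hexdigest()                                  # step 3
base_seed = int(param_hash[-8:], 16)           # step 4

global_seed = 42                               # stands in for args.seed (step 5)
rng = random.Random(global_seed + base_seed)   # step 6
# rng now yields the same stream every time these coordinates are evaluated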

Properties:

  • Same coordinates always produce identical test sequences
  • Smaller count values are perfect subsets of larger ones (hierarchical sampling)
  • Different global seeds (args.seed) produce different test sets for same coordinates
  • Enables deterministic caching and reproducible evaluations

Manifold Parameter Types

Tasks define difficulty through multiple dimensions:

| Dimension Type | Example Parameters | Effect on Difficulty |
|---|---|---|
| Length | length, num_terms, num_steps | Working memory load |
| Depth | max_depth, nesting | Structural complexity |
| Interference | distractors, noise_ratio | Selective attention demand |
| Format | whitespace, case_mutations | Tokenization stress |
| Multi-step | num_operations | Sequential reasoning |

See Also: Task Documentation for per-task manifold specifications.


Progressive Evaluation Architecture

See Architecture: Stage 2 for the execution philosophy.

Response Caching Implementation

All inference requests are cached using SHA-256 hashes of the complete request payload:

def generate_cache_key(cache_data: dict) -> str:
    """
    Generate deterministic cache key from request payload.

    Args:
        cache_data: {
            "model": "phi-4-fp16",
            "messages": [...],
            "temperature": 0.0,
            "max_tokens": 4096,
            "top_p": 1.0,
            # ... all sampling parameters
        }

    Returns:
        64-character hex string (SHA-256 hash)
    """
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

Cache Behavior:

  • Every unique prompt gets its own cache entry
  • Changing any parameter (temperature, max_tokens, etc.) creates new entry
  • Identical requests (same model + prompt + parameters) are never re-executed
  • Deterministic test generation ensures reproducible cache keys
  • Typical cost reduction of 30% for repeated evaluations
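
A minimal sketch of the cache lookup around an inference call (the on-disk layout and the OpenAI-compatible client are assumptions, not the ReasonScape implementation):

import json
import os

CACHE_DIR = "cache"  # hypothetical location

def cached_completion(client, payload: dict) -> dict:
    key = generate_cache_key(payload)              # defined above
    path = os.path.join(CACHE_DIR, f"{key}.json")
    if os.path.exists(path):                       # hit: never re-execute
        with open(path) as f:
            return json.load(f)
    response = client.chat.completions.create(**payload)  # miss: run inference
    result = response.model_dump()
    os.makedirs(CACHE_DIR, exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)
    return result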

Hierarchical Sampling

Test generators guarantee subset relationships:

# Example: count-invariant generation (generate_tests is illustrative;
# built-in hash() is avoided because it is randomized per interpreter run)
manifold_point = {"length": 16, "depth": 3}
seed = int(generate_cache_key(manifold_point)[-8:], 16)  # stable across runs

# Generate different sample counts
tests_32 = generate_tests(seed, count=32)   # [t0, t1, ..., t31]
tests_128 = generate_tests(seed, count=128) # [t0, t1, ..., t127]

# Verify subset property
assert tests_32 == tests_128[:32]  # Always True

Enabled Workflows:

  • Upsampling: Add more samples to existing evaluation (no waste)
  • Downsampling: Use subset of large evaluation for quick comparison
  • Cost Optimization: Start small, scale only where needed
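
One way a generator can guarantee the subset property is to draw every test from a single seeded RNG in order, so the first k tests are identical for any count >= k (a sketch; the actual generator interface differs):

import random

def generate_tests(seed: int, count: int) -> list:
    rng = random.Random(seed)
    # Sequential draws from one seeded stream: tests[:k] is the same
    # for every count >= k, which is exactly the subset guarantee.
    return [rng.randint(0, 10**9) for _ in range(count)]

assert generate_tests(7, 32) == generate_tests(7, 128)[:32]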

Statistical Methodology

Each ReasonScape evaluation trial has one of three observable outcomes: completed and correct, completed and incorrect, or truncated. These yield four counters: $n$ = total trials, $n_u$ = non-truncated (completed) trials, $n_e$ = correct trials, and $n_t$ = truncated trials, with $n = n_u + n_t$ (so completed-but-incorrect trials number $n_u - n_e$). For finite-option tasks (boolean, multiple-choice), a fifth counter $g$ (guess_accum) accumulates the per-trial guess probabilities $1/|\text{options}_i|$ over completed trials.

Two corrections are needed to recover an unbiased accuracy estimate:

  1. Censoring correction — truncated trials are right-censored observations, not failures. Standard accuracy ($n_e/n_u$) conditions on completion, which enriches the completed pool for easier problems as truncation rises.
  2. Guess correction — on finite-option tasks, some completions agree with the reference by chance. Because truncated trials never produced an answer, the guess correction applies only to completed trials; the two corrections therefore compose non-trivially.

The Six Estimators

The family is indexed by two axes: event (E = equality, C = correctness) and assumption about what truncated trials would have produced.

| Estimator | Event | Assumption | Expression |
|---|---|---|---|
| $\hat{E}_I$ | Equality | Independence ($P[E \mid T] = P[E \mid \neg T]$) | $\text{Wilson}(n_e,\; n_u)$ |
| $\hat{E}_P$ | Equality | Pessimism ($P[E \mid T] = 0$) | $\text{Wilson}(n_e,\; n)$ |
| $\hat{E}_O$ | Equality | Optimism ($P[E \mid T] = 1$) | $\text{Wilson}(n_e + n_t,\; n)$ |
| $\hat{C}_I$ | Correctness | Independence | $\text{Wilson}(n_e - g,\; n_u - g)$ |
| $\hat{C}_P$ | Correctness | Pessimism | $\text{Wilson}(n_e - g,\; n_u - g) \times \text{Wilson}(n_u,\; n)$ |
| $\hat{C}_O$ | Correctness | Optimism | $1 - \text{Wilson}(n_u - n_e,\; n_u - g) \times \text{Wilson}(n_u,\; n)$ |

E-estimators each reduce to a single Wilson interval because $n_u$ telescopes into the denominator. C-estimators for pessimism and optimism are irreducible products: the guess correction enters one factor but not the other, so the telescoping that collapses the E-family is broken. Confidence bounds propagate via interval arithmetic (Bonferroni combination).

Assumptions:

  • Independence — truncated trials are no harder than completed ones; ignore them and score on the completed pool. This is the implicit assumption of standard accuracy reporting.
  • Pessimism — the context limit is the deployment ceiling; a truncated trial is a failure. This is the recommended default: when truncation is low all assumptions agree, and when truncation is high pessimism is the only safe choice.
  • Optimism — the context limit is an evaluation artifact; truncated trials would have been correct given more tokens. Useful as an upper bound.

Boundary behavior: As $P[T] \to 0$, all six estimators converge. As $g \to 0$ (write-in tasks), the C-family collapses to the E-family. When both are zero, all six reduce to $\text{Wilson}(n_e, n_u)$: the standard estimator is the cornerstone of the family.
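
A minimal sketch of the Wilson interval and the pessimistic correctness estimator $\hat{C}_P$ (the element-wise interval product below is a simplification; the Bonferroni combination used in PointsDB may differ):

import math

def wilson(successes: float, trials: float, z: float = 1.96):
    # Wilson score interval: returns (low, center, high)
    if trials <= 0:
        return (0.0, 0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), center, min(1.0, center + margin))

def c_p(n_e: int, n_u: int, n_t: int, g: float):
    # C_P = Wilson(n_e - g, n_u - g) * Wilson(n_u, n): guess-corrected
    # accuracy on completed trials times the completion rate
    n = n_u + n_t
    acc = wilson(n_e - g, n_u - g)
    comp = wilson(n_u, n)
    return tuple(a * c for a, c in zip(acc, comp))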

PointsDB Mode Names

The six estimators are exposed via the mode parameter of aggregate() and query_points(). The naming scheme is <event>_<assumption>:

| Mode | Estimator | Default for |
|---|---|---|
| 'E_I' | $\hat{E}_I$ | |
| 'E_P' | $\hat{E}_P$ | |
| 'E_O' | $\hat{E}_O$ | |
| 'C_I' | $\hat{C}_I$ | query_points() |
| 'C_P' | $\hat{C}_P$ | aggregate() |
| 'C_O' | $\hat{C}_O$ | |

C_P is the recommended mode for leaderboard analysis: it applies pessimistic truncation handling and guess correction together.
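
A hypothetical invocation (db stands in for a PointsDB handle; only the mode parameter is documented here, any other arguments are assumptions):

per_point = db.query_points(mode='C_I')   # default mode for query_points()
leaderboard = db.aggregate(mode='C_P')    # recommended for leaderboard analysis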

See also: Censored Equality paper for full derivations, empirical validation, and boundary-condition proofs. PointsDB Statistical Modes for implementation details and SQL macro definitions.

Compression Pre-Computation

During processing, we compute the gzip-compressed size of every reasoning trace output by the LLM.

Rationale: Compressed size is a proxy for entropy:

  • Compresses poorly (large compressed size) = high entropy, information-rich reasoning
  • Compresses well (small compressed size) = low entropy, repetitive or degenerate reasoning

Storage Format:

(status: int, tokens: int, compressed_size: int)
# Status values: 0 = incorrect, 1 = correct, 2 = truncated
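
A minimal sketch of the pre-computation itself (the function name and packing are illustrative; only the tuple layout above is from the source):

import gzip

def compression_record(status: int, tokens: int, trace: str) -> tuple:
    # Compressed size approximates the entropy of the reasoning trace:
    # repetitive loops compress well, information-dense text does not.
    compressed_size = len(gzip.compress(trace.encode("utf-8")))
    return (status, tokens, compressed_size)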

Applications:

  • Failure mechanism investigation: Do failing samples show low entropy (loops)?
  • Efficiency comparison: Which models maintain high entropy under load?
  • Cognitive analysis: Information-theoretic view of reasoning quality

See Also:

  • analyze.py compression
  • analyze.py hazard

Pairwise Win Rate Statistical Methodology

The pairwise comparison approach uses probabilistic head-to-head win computation with Monte Carlo sampling from confidence interval distributions.

Step 1: Task-Level Aggregation

Rather than comparing at 453 difficulty points (which would produce mostly ties with ±9% CIs), we aggregate to task level first:

  • 12 comparisons per model pair (one per task)
  • Leverages existing Wilson CI re-aggregation (tight confidence intervals)

Step 2: Probabilistic Win Computation

For each (model_a, model_b, task) combination:

import numpy as np

# Model each Wilson CI as a Beta distribution
alpha_a, beta_a = wilson_to_beta(center_a, margin_a)
alpha_b, beta_b = wilson_to_beta(center_b, margin_b)

# Monte Carlo sampling (10,000 samples per model)
samples_a = np.random.beta(alpha_a, beta_a, 10_000)
samples_b = np.random.beta(alpha_b, beta_b, 10_000)

# Win probability P(A > B)
p_a_wins = np.mean(samples_a > samples_b)
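
One plausible wilson_to_beta via moment matching (an assumption; the actual conversion in the source may differ):

def wilson_to_beta(center: float, margin: float, z: float = 1.96):
    # Treat the CI as mean = center and std-error = margin / z, then
    # moment-match to Beta: alpha + beta = mean * (1 - mean) / var - 1
    var = (margin / z) ** 2
    k = center * (1 - center) / var - 1
    return center * k, (1 - center) * k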

Step 3: Aggregate Across Tasks

overall_win_rate[A vs B] = mean(P(A > B) across 12 tasks)

Step 4: Compute Rankings

Expected Wins (Linear):

expected_wins[A] = sum(win_rate[A vs i] for all opponents i)

Bradley-Terry (Iterative ML):

# Solve: P(A > B) = rating_A / (rating_A + rating_B)
# Via iterative proportional fitting
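
A minimal sketch of that iterative fit (a standard Bradley-Terry minorization-maximization update on the pairwise win matrix; simplified relative to any production implementation):

import numpy as np

def bradley_terry(win_matrix: np.ndarray, iters: int = 100) -> np.ndarray:
    # win_matrix[i, j] holds (possibly fractional) wins of i over j.
    # Fits ratings r so that P(i beats j) ~= r[i] / (r[i] + r[j]).
    n = win_matrix.shape[0]
    r = np.ones(n)
    games = win_matrix + win_matrix.T          # total comparisons per pair
    for _ in range(iters):
        for i in range(n):
            mask = np.arange(n) != i
            denom = np.sum(games[i, mask] / (r[i] + r[mask]))
            r[i] = win_matrix[i, mask].sum() / denom
        r /= r.sum()                           # fix the overall scale
    return r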

Properties:

  • Graceful uncertainty handling: Uses full probability distributions, not hard thresholds
  • Statistically principled: Beta distributions naturally represent binomial confidence intervals
  • Computationally efficient: 10,000 samples provides stable estimates (±0.5% precision)
  • Complementary rankings: Expected Wins (intuitive) and Bradley-Terry (sophisticated) provide different perspectives

See Also: Pairwise Win Rate workflow for usage guidance and interpretation.


See Also