ReasonScape Technical Details¶
Prerequisites:
- architecture.md - The five-stage methodology
- implementation.md - How the codebase is organized
- insight.md - The information processing paradigm
This document provides low-level algorithms and implementation details of the core technical mechanisms underlying ReasonScape.
Data Model: Two-Plane Structure¶
ReasonScape stores evaluation results in a two-plane structure where each point exists simultaneously in both an Evaluation Plane (model/template/sampler) and a Task-Complexity Plane (task/params).
Identity dimensions (5D):
- Evaluation Plane: model, template, sampler
- Task-Complexity Plane: base_task, params
Key features:
- Points with identical 5D identity are de-duplicated
- Orthogonal planes enable independent variation (same model across difficulties, different models at same difficulty)
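For illustration, a point's 5D identity can be pictured as a tuple over the two planes, with de-duplication falling out of ordinary set semantics. A minimal sketch with made-up field values; the actual storage schema is documented in pointsdb.md:

# Hypothetical 5D identity: (model, template, sampler, base_task, params)
point_a = ("phi-4-fp16", "chatml", "greedy", "arithmetic", (("depth", 3), ("length", 16)))
point_b = ("phi-4-fp16", "chatml", "greedy", "arithmetic", (("depth", 3), ("length", 16)))

# Identical 5D identity -> de-duplicated to a single point
assert len({point_a, point_b}) == 1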
For complete design decisions, orthogonality rationale, and facet computation: See manifold.md
For complete API reference and query patterns: See pointsdb.md
Parametric Test Generation¶
See Architecture: Stage 1 for the design philosophy.
Coordinate-Based Seeding¶
Every test sequence is deterministically generated from the parameter coordinates:
# From runner.py:472-476
# Create stable seed based on params (excluding 'count')
seed_params = {k: v for k, v in step_info['params'].items() if k != 'count'}
param_hash = generate_cache_key(seed_params) # SHA-256 hash of JSON-serialized params
base_seed = int(param_hash[-8:], 16) # Take last 8 hex digits as seed
generator.rng = random.Random(args.seed + base_seed)
Where generate_cache_key() is:
def generate_cache_key(cache_data):
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()
How It Works:
- Extract parameter coordinates (e.g., {"length": 16, "depth": 3})
- Remove the count parameter (it doesn't affect test content, only quantity)
- Compute SHA-256 hash of the JSON-serialized parameters
- Take the last 8 hex digits as an integer seed
- Add the global seed offset (args.seed)
- Initialize random.Random() with this seed
- Call generator.generate_random(**params), which uses the seeded RNG
Properties:
- Same coordinates always produce identical test sequences
- Smaller count values are perfect subsets of larger ones (hierarchical sampling)
- Different global seeds (args.seed) produce different test sets for the same coordinates
- Enables deterministic caching and reproducible evaluations
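Putting the steps above together, a self-contained sketch of the seeding scheme (a literal stands in for args.seed; generator construction is elided):

import hashlib
import json
import random

def generate_cache_key(cache_data):
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

global_seed = 42  # stands in for args.seed
params = {"length": 16, "depth": 3, "count": 32}

# Hash the coordinates (minus 'count') into a stable base seed
seed_params = {k: v for k, v in params.items() if k != "count"}
base_seed = int(generate_cache_key(seed_params)[-8:], 16)

# Seed the RNG that drives test generation
rng = random.Random(global_seed + base_seed)

# Same coordinates -> same seed -> identical test sequence
rng_again = random.Random(global_seed + base_seed)
assert [rng.random() for _ in range(3)] == [rng_again.random() for _ in range(3)]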
Manifold Parameter Types¶
Tasks define difficulty through multiple dimensions:
| Dimension Type | Example Parameters | Effect on Difficulty |
|---|---|---|
| Length | length, num_terms, num_steps | Working memory load |
| Depth | max_depth, nesting | Structural complexity |
| Interference | distractors, noise_ratio | Selective attention demand |
| Format | whitespace, case_mutations | Tokenization stress |
| Multi-step | num_operations | Sequential reasoning |
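For illustration only (parameter names are task-specific), a single manifold point may combine several of these dimensions:

manifold_point = {
    "length": 16,      # working memory load
    "max_depth": 3,    # structural complexity
    "distractors": 4,  # selective attention demand
}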
See Also: Task Documentation for per-task manifold specifications.
Progressive Evaluation Architecture¶
See Architecture: Stage 2 for the execution philosophy.
Response Caching Implementation¶
All inference requests are cached using SHA-256 hashes of the complete request payload:
def generate_cache_key(cache_data: dict) -> str:
    """
    Generate deterministic cache key from request payload.

    Args:
        cache_data: {
            "model": "phi-4-fp16",
            "messages": [...],
            "temperature": 0.0,
            "max_tokens": 4096,
            "top_p": 1.0,
            # ... all sampling parameters
        }

    Returns:
        64-character hex string (SHA-256 hash)
    """
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()
Cache Behavior:
- Every unique prompt gets its own cache entry
- Changing any parameter (temperature, max_tokens, etc.) creates new entry
- Identical requests (same model + prompt + parameters) are never re-executed
- Deterministic test generation ensures reproducible cache keys
- Typical cost reduction of 30% for repeated evaluations
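A minimal sketch of how such keys can gate inference; the in-memory dict and OpenAI-style client call are illustrative stand-ins, not the actual runner's persistent cache:

import hashlib
import json

def generate_cache_key(cache_data: dict) -> str:
    cache_str = json.dumps(cache_data, sort_keys=True)
    return hashlib.sha256(cache_str.encode()).hexdigest()

cache = {}  # illustrative; the real cache persists across runs

def cached_completion(client, payload: dict):
    key = generate_cache_key(payload)
    if key not in cache:  # identical requests are never re-executed
        cache[key] = client.chat.completions.create(**payload)
    return cache[key]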
Hierarchical Sampling¶
Test generators guarantee subset relationships:
# Example: count-invariant generation
manifold_point = {"length": 16, "depth": 3}
# Python's built-in hash() is randomized per process, so the stable
# SHA-256 scheme from generate_cache_key() is used instead
seed = int(generate_cache_key(manifold_point)[-8:], 16)

# Generate different sample counts
tests_32 = generate_tests(seed, count=32)    # [t0, t1, ..., t31]
tests_128 = generate_tests(seed, count=128)  # [t0, t1, ..., t127]

# Verify subset property
assert tests_32 == tests_128[:32]  # Always True
Enabled Workflows:
- Upsampling: Add more samples to existing evaluation (no waste)
- Downsampling: Use subset of large evaluation for quick comparison
- Cost Optimization: Start small, scale only where needed
Statistical Methodology¶
Each ReasonScape evaluation trial produces one of three observable outcomes: completed and correct (counted by $n_e$), completed and incorrect ($n_u - n_e$), or truncated ($n_t$). Together these yield four counters: $n$ = total trials, $n_u$ = non-truncated trials, $n_e$ = correct trials, and $n_t$ = truncated trials, with $n = n_u + n_t$. For finite-option tasks (boolean, multiple-choice), a fifth counter $g$ (guess_accum) accumulates the per-trial guess probabilities $1/|\text{options}_i|$ over completed trials.
Two corrections are needed to recover an unbiased accuracy estimate:
- Censoring correction — truncated trials are right-censored observations, not failures. Standard accuracy ($n_e/n_u$) conditions on completion, which enriches the completed pool for easier problems as truncation rises.
- Guess correction — on finite-option tasks, some completions agree with the reference by chance. Because truncated trials never produced an answer, the guess correction applies only to completed trials; the two corrections therefore compose non-trivially.
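For intuition, consider a hypothetical boolean task with $n = 100$ trials, of which $n_t = 20$ truncate, $n_u = 80$ complete, $n_e = 60$ are correct, and $g = 80 \times \tfrac{1}{2} = 40$. Ignoring Wilson shrinkage, standard accuracy reports $60/80 = 0.75$; the pessimistic censoring correction alone gives $60/100 = 0.60$; the guess correction alone gives $(60 - 40)/(80 - 40) = 0.50$; and composing both yields $0.50 \times 80/100 = 0.40$.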
The Six Estimators¶
The family is indexed by two axes: event (E = equality, C = correctness) and assumption about what truncated trials would have produced.
| Estimator | Event | Assumption | Expression |
|---|---|---|---|
| $\hat{E}_I$ | Equality | Independence ($P[E|T] = P[E|\neg T]$) | $\text{Wilson}(n_e,\; n_u)$ |
| $\hat{E}_P$ | Equality | Pessimism ($P[E|T] = 0$) | $\text{Wilson}(n_e,\; n)$ |
| $\hat{E}_O$ | Equality | Optimism ($P[E|T] = 1$) | $\text{Wilson}(n_e + n_t,\; n)$ |
| $\hat{C}_I$ | Correctness | Independence | $\text{Wilson}(n_e - g,\; n_u - g)$ |
| $\hat{C}_P$ | Correctness | Pessimism | $\text{Wilson}(n_e - g,\; n_u - g) \times \text{Wilson}(n_u,\; n)$ |
| $\hat{C}_O$ | Correctness | Optimism | $1 - \text{Wilson}(n_u - n_e,\; n_u - g) \times \text{Wilson}(n_u,\; n)$ |
E-estimators each reduce to a single Wilson interval because $n_u$ telescopes into the denominator. C-estimators for pessimism and optimism are irreducible products: the guess correction enters one factor but not the other, so the telescoping that collapses the E-family is broken. Confidence bounds propagate via interval arithmetic (Bonferroni combination).
Assumptions:
- Independence — truncated trials are no harder than completed ones; ignore them and score on the completed pool. This is the implicit assumption of standard accuracy reporting.
- Pessimism — the context limit is the deployment ceiling; a truncated trial is a failure. This is the recommended default: when truncation is low all assumptions agree, and when truncation is high pessimism is the only safe choice.
- Optimism — the context limit is an evaluation artifact; truncated trials would have been correct given more tokens. Useful as an upper bound.
Boundary behavior: As $P[T] \to 0$, all six estimators converge. As $g \to 0$ (write-in tasks), the C-family collapses to the E-family. When both are zero, all six reduce to $\text{Wilson}(n_e, n_u)$ — the standard estimator is the cornerstone of the family.
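A minimal Python sketch of the table above, computing point estimates only (Wilson centers; the Bonferroni interval propagation is omitted):

import math

def wilson_center(successes: float, total: float, z: float = 1.96) -> float:
    """Center of the Wilson score interval; accepts fractional counts."""
    if total <= 0:
        return 0.5
    p = successes / total
    return (p + z * z / (2 * total)) / (1 + z * z / total)

def estimate(mode: str, n: int, n_u: int, n_e: int, g: float = 0.0) -> float:
    """Point estimates for the six-estimator family (see table above)."""
    n_t = n - n_u
    if mode == "E_I":
        return wilson_center(n_e, n_u)
    if mode == "E_P":
        return wilson_center(n_e, n)
    if mode == "E_O":
        return wilson_center(n_e + n_t, n)
    if mode == "C_I":
        return wilson_center(n_e - g, n_u - g)
    if mode == "C_P":
        return wilson_center(n_e - g, n_u - g) * wilson_center(n_u, n)
    if mode == "C_O":
        return 1 - wilson_center(n_u - n_e, n_u - g) * wilson_center(n_u, n)
    raise ValueError(f"unknown mode: {mode}")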
PointsDB Mode Names¶
The six estimators are exposed via the mode parameter of aggregate() and query_points(). The naming scheme is <event>_<assumption>:
| Mode | Estimator | Default for |
|---|---|---|
| 'E_I' | $\hat{E}_I$ | — |
| 'E_P' | $\hat{E}_P$ | — |
| 'E_O' | $\hat{E}_O$ | — |
| 'C_I' | $\hat{C}_I$ | query_points() |
| 'C_P' | $\hat{C}_P$ | aggregate() |
| 'C_O' | $\hat{C}_O$ | — |
C_P is the recommended mode for leaderboard analysis: it applies pessimistic truncation handling and guess correction together.
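Illustrative call shapes only: the mode strings are as documented above, but every other argument here is a hypothetical placeholder — see pointsdb.md for the real signatures.

# Hypothetical usage; only the mode parameter is documented here
leaderboard = db.aggregate(group_by="model", mode="C_P")  # leaderboard default
points = db.query_points(model="phi-4-fp16", mode="C_I")  # per-point default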
See also:
- Censored Equality paper — full derivations, empirical validation, and boundary-condition proofs
- PointsDB Statistical Modes — implementation details and SQL macro definitions
Compression Pre-Computation¶
During processing, we compute gzip(reasoning_trace) for every reasoning trace output by the LLM.
Rationale: Compressed size is a proxy for entropy:
- Low compressibility (large compressed size) = high entropy, information-rich reasoning
- High compressibility (small compressed size) = low entropy, repetitive or degenerate reasoning
Storage Format:
(status: int, tokens: int, compressed_size: int)
# Status values: 0 "incorrect", 1 "correct", 2 "truncated"
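A minimal sketch of the pre-computation, assuming traces arrive as strings (the record layout matches the format above):

import gzip

def compression_record(status: int, tokens: int, trace: str) -> tuple:
    # status: 0 = incorrect, 1 = correct, 2 = truncated
    compressed_size = len(gzip.compress(trace.encode("utf-8")))
    return (status, tokens, compressed_size)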
Applications:
- Failure mechanism investigation: Do failing samples show low entropy (loops)?
- Efficiency comparison: Which models maintain high entropy under load?
- Cognitive analysis: Information-theoretic view of reasoning quality
See Also:
- analyze.py compression
- analyze.py hazard
Pairwise Win Rate Statistical Methodology¶
The pairwise comparison approach uses probabilistic head-to-head win computation with Monte Carlo sampling from confidence interval distributions.
Step 1: Task-Level Aggregation
Rather than comparing at 453 difficulty points (which would produce mostly ties with ±9% CIs), we aggregate to task level first:
- 12 comparisons per model pair (one per task)
- Leverages existing Wilson CI re-aggregation (tight confidence intervals)
Step 2: Probabilistic Win Computation
For each (model_a, model_b, task) combination:
# Model each Wilson CI as a beta distribution (numpy assumed as np)
alpha_a, beta_a = wilson_to_beta(center_a, margin_a)
alpha_b, beta_b = wilson_to_beta(center_b, margin_b)

# Monte Carlo sampling (10,000 samples)
samples_a = np.random.beta(alpha_a, beta_a, 10_000)
samples_b = np.random.beta(alpha_b, beta_b, 10_000)

# Win probability P(A > B)
p_win[(model_a, model_b, task)] = (samples_a > samples_b).mean()
Step 3: Aggregate Across Tasks
overall_win_rate[(a, b)] = np.mean([p_win[(a, b, task)] for task in tasks])  # mean over 12 tasks
Step 4: Compute Rankings
Expected Wins (Linear):
expected_wins[a] = sum(overall_win_rate[(a, b)] for b in models if b != a)
Bradley-Terry (Iterative ML):
# Solve: P(A > B) = rating_A / (rating_A + rating_B)
# Via iterative proportional fitting
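A minimal sketch of the Bradley-Terry fit via the standard MM / iterative-proportional-fitting update, assuming a complete matrix of pairwise win rates (not necessarily the project's actual solver):

import numpy as np

def bradley_terry(win_rate: np.ndarray, iters: int = 200) -> np.ndarray:
    """Fit ratings so that P(i beats j) ~ r_i / (r_i + r_j).

    win_rate[i, j] = overall win rate of model i over model j;
    off-diagonal entries satisfy win_rate + win_rate.T == 1.
    """
    win_rate = win_rate.copy()
    n = win_rate.shape[0]
    np.fill_diagonal(win_rate, 0.0)
    wins = win_rate.sum(axis=1)  # expected wins per model (the linear ranking)
    r = np.ones(n)
    for _ in range(iters):
        denom = np.array([
            sum(1.0 / (r[i] + r[j]) for j in range(n) if j != i)
            for i in range(n)
        ])
        r = wins / denom
        r /= r.sum()  # ratings are identifiable only up to scale
    return r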
Properties:
- Graceful uncertainty handling: Uses full probability distributions, not hard thresholds
- Statistically principled: Beta distributions naturally represent binomial confidence intervals
- Computationally efficient: 10,000 samples provides stable estimates (±0.5% precision)
- Complementary rankings: Expected Wins (intuitive) and Bradley-Terry (sophisticated) provide different perspectives
See Also: Pairwise Win Rate workflow for usage guidance and interpretation.
See Also¶
- implementation.md - Top-level implementation overview
- architecture.md - Five-stage methodology
- manifold.md - Two-plane data model design
- reasonscore.md - Unified metric design
- config.md - Configuration reference
- tasks.md - Task API specifications