ReasonScore v2: Statistical Foundations for r12

Introduction

ReasonScore is ReasonScape's unified metric for LLM evaluation. It ranges from 10 to 1000 and combines four aspects of practical LLM deployment into a single scalar value:

  • Accuracy - Correctness of answers (adjusted for guessing)
  • Statistical confidence - Uncertainty from finite sampling
  • Context reliability - Truncation rates and context limit utilization
  • Task balance - Performance consistency across reasoning domains

Two Orthogonal Concerns

ReasonScore's key innovation is the separation of two independent questions:

  1. How do we estimate each task's success probability from sampled responses?

  2. How do we aggregate per-task success probabilities into an overall score?

The Estimation: Per-Task P[Correct] Computation

See Statistical Methodology for the theoretical foundation of the C_P estimator. It applies joint corrections for guessing on MCQ/Boolean questions and for truncation, and produces a finite-sample-aware confidence interval.

The Aggregation: Bootstrap Geometric Mean Across Tasks

This is the defining feature of ReasonScore.

The core insight: models have inverted difficulty preferences. There is no universal definition of "easy" or "hard"—what's easy for one model is hard for another. Therefore, we do not use task-based tiers or task-based grouping. Instead, we treat each task as a distinct entity and aggregate across all of them with a geometric mean, which punishes catastrophic failures.

Why Geometric Mean?

Scenario: Task specialist model

task_scores = [0.95, 0.95, 0.95, ..., 0.05]  # Great at 11 tasks, fails 1

Arithmetic mean: (11 × 0.95 + 0.05) / 12 = 0.875 → Score: 875 (looks good!)
Geometric mean: (0.95¹¹ × 0.05)^(1/12) ≈ 0.743 → Score: 743 (reveals weakness)

Why this matters: In deployment, users encounter all task types. A catastrophic failure on one task makes the model unusable.

Catastrophic failure protection:

task_scores = [0.95, 0.95, ..., 0.01]
geometric_mean = (0.95¹¹ × 0.01)^(1/12) ≈ 0.650 → Score: 650 (heavily penalized)
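
To make the contrast concrete in code, here is a minimal Python sketch of both means on the example scores above; the use of numpy is an illustrative choice, not part of the scoring pipeline:

import numpy as np

# Great at 11 tasks, fails 1 (same scenario as above)
task_scores = np.array([0.95] * 11 + [0.05])

arithmetic = task_scores.mean()                   # 0.875
geometric = np.exp(np.log(task_scores).mean())    # ~0.743

print(f"Arithmetic: {1000 * arithmetic:.0f}")     # 875, hides the weak task
print(f"Geometric:  {1000 * geometric:.0f}")      # 743, exposes it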

Per-Task P[Correct] Bounds as Input

The geometric mean operates on bounds, not point estimates:

Per-task computation → (success_low, success_high)
                       ↓
Bootstrap geometric mean → (overall_low, overall_high)
                          ↓
Scale by 1000 → ReasonScore with preserved CI

Each task produces a confidence interval (low, high) via Wilson CI on adjusted counts. These bounds are sampled uniformly during bootstrap, preserving uncertainty through aggregation.
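
The adjustment details live in Statistical Methodology; as a rough illustration of the interval itself, here is a minimal sketch of a Wilson score interval applied to counts that are assumed to be already adjusted (the function name wilson_ci and the example counts are hypothetical):

import math

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a success proportion (counts assumed pre-adjusted)."""
    if n <= 0:
        return 0.0, 1.0
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return max(0.0, center - margin), min(1.0, center + margin)

# e.g. 870 adjusted-correct out of 1000 samples -> roughly (0.848, 0.889)
low, high = wilson_ci(870, 1000)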

Bootstrap Sampling for Confidence Intervals

Process - Bootstrap sampling:

For i = 1 to 5000:
    For each task j:
        sample[j] = uniform(low_j, high_j)

    geomean[i] = exp(mean(log(sample[j]) for j))

Sort geomean array
geomean_low = geomean[125]     # 2.5th percentile
geomean_high = geomean[4875]   # 97.5th percentile

reasonscore_low = 1000 × geomean_low
reasonscore_high = 1000 × geomean_high
reasonscore_center = (reasonscore_low + reasonscore_high) / 2
reasonscore_margin = (reasonscore_high - reasonscore_low) / 2

Output:

{
    center: reasonscore_center,
    margin: reasonscore_margin,
    ci_low: reasonscore_low,
    ci_high: reasonscore_high
}
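
A minimal Python sketch of the procedure above, assuming per-task bounds arrive as a list of (low, high) tuples; the helper name bootstrap_reasonscore and the use of numpy are illustrative, not the actual ReasonScape implementation:

import numpy as np

def bootstrap_reasonscore(task_bounds, n_samples=5000, seed=42):
    """Aggregate per-task (low, high) P[correct] bounds into a ReasonScore CI."""
    rng = np.random.default_rng(seed)
    lows = np.array([lo for lo, _ in task_bounds])
    highs = np.array([hi for _, hi in task_bounds])

    # One uniform draw per task within its CI, for each bootstrap iteration
    samples = rng.uniform(lows, highs, size=(n_samples, len(task_bounds)))

    # Geometric mean across tasks (assumes all lower bounds are > 0)
    geomeans = np.exp(np.log(samples).mean(axis=1))

    # Empirical 95% interval, scaled to the 0-1000 ReasonScore range
    low, high = 1000 * np.percentile(geomeans, [2.5, 97.5])
    return {
        "center": (low + high) / 2,
        "margin": (high - low) / 2,
        "ci_low": low,
        "ci_high": high,
    }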

Why Bootstrap, Not Min/Max?

Two methods were evaluated for computing overall bounds:

Min/Max (Conservative):

geomean_low = exp(mean(log(low_j) for j))   # All tasks at lower bound
geomean_high = exp(mean(log(high_j) for j)) # All tasks at upper bound

  • Assumes perfect correlation (all tasks move together)
  • Result: Maximum possible margin (±19.3 points average)

Bootstrap (Empirical):

Sample uniformly from each task's CI 5000 times
Compute empirical 95% CI from distribution

  • Assumes independent variation (tasks move independently)
  • Result: Realistic margin accounting for averaging effects (±7.3 points average)

Results (37 models, 12 tasks):

Method       Avg Margin   CI Width   Ratio
Min/Max      ±19.3        38.6       1.00
Bootstrap    ±7.3         14.7       0.38

Bootstrap margins are 62% tighter (a ~2.6× reduction).

Why it works: Independent variation causes averaging effects. High value on task A cancels low value on task B, tightening the overall CI beyond Min/Max assumptions.
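
As a concrete comparison, here is a small sketch that computes both kinds of bounds for one model, reusing the bootstrap_reasonscore helper from the earlier sketch; task_bounds is assumed to hold that model's per-task (low, high) values, and the function name minmax_bounds is illustrative:

import numpy as np

def minmax_bounds(task_bounds):
    """Pin every task to the same end of its CI (perfect-correlation assumption)."""
    lows = np.array([lo for lo, _ in task_bounds])
    highs = np.array([hi for _, hi in task_bounds])
    return 1000 * np.exp(np.log(lows).mean()), 1000 * np.exp(np.log(highs).mean())

mm_low, mm_high = minmax_bounds(task_bounds)
bs = bootstrap_reasonscore(task_bounds)
print(f"Min/Max width:   {mm_high - mm_low:.1f}")
print(f"Bootstrap width: {bs['ci_high'] - bs['ci_low']:.1f}")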

Bootstrap Stability and Sample Size

Question: How small can n (the number of bootstrap samples) be while the margin remains stable?

Test: Mid-tier model (rank 22) across n ∈ {100, 500, 1000, 5000, 10000}, seeds ∈ {0, ..., 9}

Results (range Δ = max margin - min margin across seeds):

n=  100  |  margin: 8.9 ± 0.71  Δ=2.50  ❌
n=  500  |  margin: 8.1 ± 0.34  Δ=1.21  ❌
n= 1000  |  margin: 8.0 ± 0.30  Δ=1.10  ❌
n= 5000  |  margin: 8.0 ± 0.12  Δ=0.37  ✓
n=10000  |  margin: 8.0 ± 0.08  Δ=0.25  ✓

Rule of thumb: Choose n where Δ < 0.5 points

The Decision: Bootstrap with n=5000, seed=42
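
A sketch of how such a sweep could be reproduced with the bootstrap_reasonscore helper from the earlier sketch; task_bounds (one model's per-task bounds) and the loop structure are illustrative, not the experiment code behind the table above:

import numpy as np

for n in (100, 500, 1000, 5000, 10000):
    margins = [bootstrap_reasonscore(task_bounds, n_samples=n, seed=s)["margin"]
               for s in range(10)]               # seeds 0..9
    delta = max(margins) - min(margins)          # Δ across seeds
    print(f"n={n:6d} | margin: {np.mean(margins):.1f} ± {np.std(margins):.2f}  Δ={delta:.2f}")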

Interpretation Guide

ReasonScore Values

Range       Interpretation
900-1000    Near-perfect performance across all tasks
700-900     Strong overall with minor weaknesses
500-700     Good capability but notable gaps
300-500     Limited capability or severe task imbalance
100-300     Significant deficits across multiple domains
10-100      Catastrophic failures (geometric mean penalty)

Confidence Intervals

Tight CIs (< ±10 points):

  • High accuracy models (90%+)
  • Low truncation rates (< 5%)
  • Consistent performance across tasks
  • Example: Qwen3.5-122B: 946 ± 4

Moderate CIs (±10-20 points):

  • Mid-tier models (70-90%)
  • Moderate truncation (5-15%)
  • Variable task performance
  • Example: Qwen3-8B: 752 ± 8

Wide CIs (> ±20 points):

  • High truncation (> 15%)
  • High task-to-task variance
  • Rare in r12

Statistical Significance

Non-overlapping CIs → Significant difference:

Model A: 946 ± 4 [942, 950]
Model B: 935 ± 4 [931, 939]
→ Statistically distinct (#1 vs #2)

Overlapping CIs → Statistical tie:

Model C: 892 ± 17 [875, 909]
Model D: 887 ± 17 [870, 904]
→ Tied despite ranking difference
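
A trivial sketch of this comparison rule (the function name cis_overlap is illustrative):

def cis_overlap(a_low, a_high, b_low, b_high):
    """True if two ReasonScore CIs overlap, i.e. the models are statistically tied."""
    return a_low <= b_high and b_low <= a_high

print(cis_overlap(942, 950, 931, 939))   # False: Model A and Model B are distinct
print(cis_overlap(875, 909, 870, 904))   # True:  Model C and Model D are tied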