# ReasonScore v2: Statistical Foundations for r12
## Introduction
ReasonScore is ReasonScape's unified metric for LLM evaluation. It ranges from 10 to 1000 and combines four aspects of practical LLM deployment into a single scalar value:

- **Accuracy**: correctness of answers (adjusted for guessing)
- **Statistical confidence**: uncertainty from finite sampling
- **Context reliability**: truncation rates and context-limit utilization
- **Task balance**: performance consistency across reasoning domains
## Two Orthogonal Concerns
ReasonScore's innovation separates two independent questions:

1. How do we estimate each task's success probability from sampled responses?
2. How do we aggregate per-task success probabilities into an overall score?
## The Estimation: Per-Task P[Correct] Computation
See Statistical Methodology for the theoretical foundation of the C_P estimator. This estimator applies joint corrections for guessing on MCQ/Boolean questions and for truncation, and produces a finite-sample-aware confidence interval.
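The full estimator is specified in Statistical Methodology; as a rough sketch of the shape it takes, the snippet below assumes a standard guess-rate rescaling followed by a Wilson interval on the adjusted count (the truncation correction is omitted). All names here are illustrative, not ReasonScape's actual API:

```python
import math

def wilson_ci(successes: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a (possibly non-integer) success count."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)

def adjusted_bounds(correct: int, total: int, guess_rate: float) -> tuple[float, float]:
    # Guess correction (illustrative): rescale the raw count so that pure
    # guessing maps to zero, then apply Wilson to the adjusted count.
    # The actual C_P estimator also corrects for truncation.
    adjusted = max(0.0, (correct - guess_rate * total) / (1 - guess_rate))
    return wilson_ci(adjusted, total)

# e.g. 85/100 correct on a 4-way MCQ task (guess rate 0.25)
print(adjusted_bounds(correct=85, total=100, guess_rate=0.25))
```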
## The Aggregation: Bootstrap Geometric Mean Across Tasks
This is the defining feature of ReasonScore.
The core insight: models have inverted difficulty preferences. There is no universal definition of "easy" or "hard"—what's easy for one model is hard for another. Therefore, we do not use task-based tiers or task-based grouping. Instead, we treat each task as a distinct entity and aggregate across all of them with a geometric mean, which punishes catastrophic failures.
### Why Geometric Mean?
**Scenario: task-specialist model**

```
task_scores = [0.95, 0.95, 0.95, ..., 0.05]  # great at 11 tasks, fails 1

Arithmetic mean: (11 × 0.95 + 0.05) / 12 = 0.875 → Score: 875 (looks good!)
Geometric mean:  (0.95¹¹ × 0.05)^(1/12) ≈ 0.743 → Score: 743 (reveals the weakness)
```
**Why this matters:** In deployment, users encounter all task types, so a catastrophic failure on even one task makes the model unusable.
**Catastrophic failure protection:**

```
task_scores = [0.95, 0.95, ..., 0.01]
geometric_mean = (0.95¹¹ × 0.01)^(1/12) ≈ 0.650 → Score: 650 (heavily penalized)
```
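The gap between the two means is easy to reproduce in plain Python:

```python
import math

task_scores = [0.95] * 11 + [0.05]  # specialist: strong on 11 tasks, fails 1

arithmetic = sum(task_scores) / len(task_scores)
geometric = math.exp(sum(math.log(s) for s in task_scores) / len(task_scores))

print(f"arithmetic: {arithmetic:.3f}")  # 0.875 → Score 875
print(f"geometric:  {geometric:.3f}")   # 0.743 → Score 743
```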
### Per-Task P[Correct] Bounds as Input
The geometric mean operates on bounds, not point estimates:
```
Per-task computation      →  (success_low, success_high)
            ↓
Bootstrap geometric mean  →  (overall_low, overall_high)
            ↓
Scale by 1000             →  ReasonScore with preserved CI
```
Each task produces a confidence interval (low, high) via a Wilson CI on adjusted counts. These bounds are sampled uniformly during the bootstrap, preserving per-task uncertainty through aggregation.
### Bootstrap Sampling for Confidence Intervals
The bootstrap sampling process, expressed as runnable Python; here `task_bounds` is an array of per-task `(low, high)` CI bounds:

```python
import numpy as np

def bootstrap_reasonscore(task_bounds: np.ndarray, n_boot: int = 5000,
                          seed: int = 42) -> dict:
    rng = np.random.default_rng(seed)
    low, high = task_bounds[:, 0], task_bounds[:, 1]

    # One uniform draw per task per bootstrap iteration, then the
    # geometric mean across tasks for each iteration.
    samples = rng.uniform(low, high, size=(n_boot, len(low)))
    geomean = np.exp(np.log(samples).mean(axis=1))

    # Empirical 95% CI: the 2.5th and 97.5th percentiles of the
    # bootstrap distribution (indices 125 and 4875 of 5000 sorted draws).
    geomean_low, geomean_high = np.percentile(geomean, [2.5, 97.5])

    # Scale to ReasonScore units and report center/margin alongside the CI.
    reasonscore_low = 1000 * geomean_low
    reasonscore_high = 1000 * geomean_high
    return {
        "center": (reasonscore_low + reasonscore_high) / 2,
        "margin": (reasonscore_high - reasonscore_low) / 2,
        "ci_low": reasonscore_low,
        "ci_high": reasonscore_high,
    }
```
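A usage sketch reusing `bootstrap_reasonscore` from above, with made-up bounds (not real r12 data):

```python
import numpy as np

# Eleven strong tasks plus one weak one (hypothetical Wilson bounds).
bounds = np.array([[0.90, 0.96]] * 11 + [[0.55, 0.70]])
result = bootstrap_reasonscore(bounds)
print(f"{result['center']:.0f} ± {result['margin']:.0f}")
```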
### Why Bootstrap, Not Min/Max?
Two methods were evaluated for computing overall bounds:
**Min/Max (conservative):**

```
geomean_low  = exp(mean(log(low_j)  for j))   # all tasks at lower bound
geomean_high = exp(mean(log(high_j) for j))   # all tasks at upper bound
```

**Bootstrap (empirical):**

```
Sample uniformly from each task's CI 5000 times
Compute the empirical 95% CI from the resulting distribution
```
Results (37 models, 12 tasks):
| Method | Avg Margin | CI Width | Ratio |
|---|---|---|---|
| Min/Max | ±19.3 | 38.6 | 1.00 |
| Bootstrap | ±7.3 | 14.7 | 0.38 |
Bootstrap margins are 62% tighter (a 2.6× reduction in CI width).
Why it works: independent variation averages out. A high draw on task A cancels a low draw on task B, tightening the overall CI well beyond the Min/Max worst case, which pins every task to the same extreme simultaneously.
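A minimal demonstration of this averaging effect, using hypothetical identical per-task bounds:

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = np.full(12, 0.85), np.full(12, 0.95)  # hypothetical bounds

# Min/Max pins every task to the same extreme at once: [0.85, 0.95].
print(np.exp(np.log(low).mean()), np.exp(np.log(high).mean()))

# Bootstrap draws vary independently, so highs and lows mostly cancel
# and the empirical 95% CI is far narrower (roughly [0.88, 0.92]).
samples = rng.uniform(low, high, size=(5000, 12))
geomean = np.exp(np.log(samples).mean(axis=1))
print(np.percentile(geomean, [2.5, 97.5]))
```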
### Bootstrap Stability and Sample Size
Question: How low can we set n (bootstrap samples) while remaining stable?
Test: Mid-tier model (rank 22) across n ∈ {100, 500, 1000, 5000, 10000}, seeds ∈ {0, ..., 9}
Results (Δ = max margin − min margin across seeds):

```
n=  100 | margin: 8.9 ± 0.71  Δ=2.50  ❌
n=  500 | margin: 8.1 ± 0.34  Δ=1.21  ❌
n= 1000 | margin: 8.0 ± 0.30  Δ=1.10  ❌
n= 5000 | margin: 8.0 ± 0.12  Δ=0.37  ✓
n=10000 | margin: 8.0 ± 0.08  Δ=0.25  ✓
```
**Rule of thumb:** choose n where Δ < 0.5 points.

**The decision:** bootstrap with n=5000, seed=42.
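A sketch of how such a stability sweep can be run, reusing `bootstrap_reasonscore` and the illustrative `bounds` from above and applying the Δ < 0.5 rule:

```python
import numpy as np

for n_boot in (100, 500, 1000, 5000, 10000):
    margins = [bootstrap_reasonscore(bounds, n_boot=n_boot, seed=s)["margin"]
               for s in range(10)]
    delta = max(margins) - min(margins)
    flag = "✓" if delta < 0.5 else "❌"
    print(f"n={n_boot:>6} | margin: {np.mean(margins):.1f} "
          f"± {np.std(margins):.2f}  Δ={delta:.2f} {flag}")
```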
## Interpretation Guide
### ReasonScore Values
| Range | Interpretation |
|---|---|
| 900-1000 | Near-perfect performance across all tasks |
| 700-900 | Strong overall with minor weaknesses |
| 500-700 | Good capability but notable gaps |
| 300-500 | Limited capability or severe task imbalance |
| 100-300 | Significant deficits across multiple domains |
| 10-100 | Catastrophic failures (geometric mean penalty) |
### Confidence Intervals
**Tight CIs (< ±10 points):**

- High-accuracy models (90%+)
- Low truncation rates (< 5%)
- Consistent performance across tasks
- Example: Qwen3.5-122B: 946 ± 4

**Moderate CIs (±10-20 points):**

- Mid-tier models (70-90%)
- Moderate truncation (5-15%)
- Variable task performance
- Example: Qwen3-8B: 752 ± 8

**Wide CIs (> ±20 points):**

- High truncation (> 15%)
- High task-to-task variance
- Rare in r12
### Statistical Significance
**Non-overlapping CIs → significant difference:**

```
Model A: 946 ± 4   [942, 950]
Model B: 935 ± 4   [931, 939]
→ Statistically distinct (#1 vs #2)
```

**Overlapping CIs → statistical tie:**

```
Model C: 892 ± 17  [875, 909]
Model D: 887 ± 17  [870, 904]
→ Tied despite ranking difference
```
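The overlap test is mechanical; a hypothetical helper (not part of ReasonScape's codebase) that encodes the decision rule:

```python
def statistically_distinct(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two 95% CIs (low, high) do not overlap."""
    return a[1] < b[0] or b[1] < a[0]

print(statistically_distinct((942, 950), (931, 939)))  # True: distinct
print(statistically_distinct((875, 909), (870, 904)))  # False: statistical tie
```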