ReasonScore: A Unified LLM Evaluation Metric¶
Overview¶
ReasonScore is ReasonScape's unified metric for LLM evaluation, designed to capture every practical aspect of model performance in a single, interpretable number. The final output, score/token, typically falls between 0 and 1 and represents the balance between reasoning quality and computational efficiency.
What ReasonScore captures:
- ✅ Accuracy - Correctness of answers (adjusted for guessing)
- ✅ Statistical confidence - Uncertainty from finite sampling
- ✅ Context reliability - Truncation and context limit issues
- ✅ Task balance - Performance consistency across reasoning domains
- ✅ Difficulty scaling - Capability maintenance under increasing complexity
- ✅ Token efficiency - Computational cost per unit of quality
Unlike traditional benchmarks that report only accuracy, ReasonScore provides a deployment-focused metric that answers: "Should I use this model? What will it cost me? Where will it fail?"
Design Philosophy¶
Three Core Principles¶
1. Optimistic About Uncertainty (Statistical Margin)¶
Statistical uncertainty is "our fault" for not sampling infinitely.
→ Give models the benefit of the doubt.
→ Use upper bound of confidence interval (center + margin).
Why: A perfect model with 100 samples might have center=0.95, margin=0.05. If we subtracted margin, we'd penalize the model purely for our sampling decision. By adding margin (upper CI bound = 1.0), the perfect model scores 1.0 regardless of sample count.
2. Pessimistic About Failures (Truncation)¶
Context limits and truncations are "the model's fault."
→ Penalize directly.
→ Subtract truncation ratio from score.
Why: If a model hits context limits, it's a practical deployment failure. This must hurt the score.
3. Punish Imbalance (Geometric Mean)¶
Being great at 11 tasks doesn't excuse catastrophic failure at 1 task.
→ Use geometric mean across tasks.
→ Failures drag down overall score significantly.
Why: In real deployment, users will hit all task types. A model that fails catastrophically on date reasoning is unusable, regardless of arithmetic prowess.
The Four-Layer Architecture¶
ReasonScore is computed in four layers, each using the mathematically appropriate operation for its purpose:
Layer 1: Samples → Point Score [Wilson CI]
Layer 2: Points → Task Score [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore [Geometric Mean × 1000]
Layer 4: Tiers → score/token [Arithmetic Mean ÷ tokens]
Layer 1: Point Score (Wilson CI Per Point)¶
Input¶
Raw test samples at one difficulty coordinate (e.g., 128 samples at length=16, depth=2)
Process¶
Step 1: Excess Accuracy Correction
# Remove guessing contribution
guess_accumulator = sum(sample.guess_chance for sample in samples)
adjusted_successes = correct_count - guess_accumulator
adjusted_trials = total_count - guess_accumulator
Result: 0.0 = pure guessing, 1.0 = perfect knowledge above guessing
Step 2: Wilson Confidence Interval
# Compute 95% CI bounds
adjusted_center, adjusted_margin = wilson_interval(adjusted_successes, adjusted_trials)
Step 3: Truncation Penalty
truncated_ratio = truncated_count / total_count
Step 4: Point Score
point_score = adjusted_center + adjusted_margin - truncated_ratio
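For reference, here is a minimal, self-contained sketch of the helper and the four steps above. The name `wilson_interval` follows the pseudocode; the z value (1.96 for a 95% interval) and the clamp on the adjusted proportion are assumptions of this sketch, and ReasonScape's actual implementation may differ.

```python
import math

def wilson_interval(successes: float, trials: float, z: float = 1.96) -> tuple[float, float]:
    """Return (center, margin) of the Wilson score interval.

    Accepts fractional counts, since excess-accuracy correction produces
    non-integer adjusted successes/trials.
    """
    if trials <= 0:
        return 0.0, 0.0
    p = min(max(successes / trials, 0.0), 1.0)  # clamp: adjusted counts can dip below 0 (sketch assumption)
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return center, margin

def point_score(correct: int, total: int, truncated: int, guess_chances: list[float]) -> float:
    """Combine Steps 1-4 above into a single point score."""
    guess_accumulator = sum(guess_chances)                 # Step 1: excess accuracy correction
    adjusted_successes = correct - guess_accumulator
    adjusted_trials = total - guess_accumulator
    center, margin = wilson_interval(adjusted_successes, adjusted_trials)  # Step 2: 95% Wilson CI
    truncated_ratio = truncated / total                    # Step 3: truncation penalty
    return center + margin - truncated_ratio               # Step 4: point score
```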
Output¶
One score per point, stored in PointsDB:
`adjusted_successes` (DOUBLE), `adjusted_trials` (DOUBLE), `adjusted_center` (DOUBLE), `adjusted_margin` (DOUBLE), `truncated_ratio` (DOUBLE)
Key Insight: Why Add Margin?¶
Perfect model example:
# 100 samples, all correct, no truncation
center = 0.95 # Wilson center approaches 1.0 only at infinity
margin = 0.05 # Uncertainty from finite sampling
# If we subtracted margin:
score = 0.95 - 0.05 - 0 = 0.90 # Penalized for our sampling choice!
# By adding margin (upper CI bound):
score = 0.95 + 0.05 - 0 = 1.00 # Perfect score, as it should be
Wilson CI guarantee: center + margin ≤ 1.0 (the upper bound never exceeds 1.0)
Therefore: point_score ∈ [-1.0, 1.0] theoretically, [0.0, 1.0] typically.
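As a quick check with the `wilson_interval` sketch above: for an all-correct sample the exact center and margin differ from the schematic 0.95/0.05 figures, but they always sum to the upper bound of 1.0, so the point score does not depend on how many samples we happened to draw.

```python
center, margin = wilson_interval(100, 100)   # 100/100 correct, no guessing, no truncation
print(round(center, 3), round(margin, 3))    # 0.982 0.018
print(round(center + margin, 6))             # 1.0 -- upper CI bound, regardless of sample count
```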
Layer 2: Task Score (Wilson CI Re-Aggregation)¶
Input¶
Multiple points within one (tier, task) combination. Example: 26 points for (tier=easy, task=arithmetic)
Process¶
Step 1: Query via PointsDB aggregate()
df = db.aggregate(
filters={"tiers": ["easy"], "base_task": "arithmetic"},
group_by=["eval_id", "base_task"]
)
Step 2: Sum Adjusted Counts
-- aggregate() internally performs:
total_successes = SUM(adjusted_successes) -- across all points
total_trials = SUM(adjusted_trials) -- across all points
total_truncated = SUM(truncated)
total_count = SUM(total)
Example:
Point 1 (length=8): adjusted_successes=118.5, adjusted_trials=126.5
Point 2 (length=16): adjusted_successes=48.2, adjusted_trials=62.2
Point 3 (length=24): adjusted_successes=16.8, adjusted_trials=30.8
Aggregated: total_successes=183.5, total_trials=219.5
Step 3: Re-Compute Wilson CI
# Treat aggregated population as single large sample
task_center, task_margin = wilson_interval(total_successes, total_trials)
task_truncated_ratio = total_truncated / total_count
Step 4: Task Score
task_score = task_center + task_margin - task_truncated_ratio
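A minimal sketch of the re-aggregation, reusing the `wilson_interval` helper from the Layer 1 sketch. The per-point fields mirror the PointsDB columns, but the plain list of dicts and the truncation counts for the example points are illustrative.

```python
def task_score(points: list[dict]) -> float:
    """points: Layer 1 records for every point in one (tier, task) combination."""
    total_successes = sum(p["adjusted_successes"] for p in points)   # Step 2: sum adjusted counts
    total_trials = sum(p["adjusted_trials"] for p in points)
    total_truncated = sum(p["truncated"] for p in points)
    total_count = sum(p["total"] for p in points)

    center, margin = wilson_interval(total_successes, total_trials)  # Step 3: re-compute Wilson CI
    truncated_ratio = total_truncated / total_count
    return center + margin - truncated_ratio                         # Step 4: task score

points = [  # the three example points from Step 2, with illustrative truncation counts
    {"adjusted_successes": 118.5, "adjusted_trials": 126.5, "truncated": 2, "total": 128},
    {"adjusted_successes": 48.2, "adjusted_trials": 62.2, "truncated": 1, "total": 64},
    {"adjusted_successes": 16.8, "adjusted_trials": 30.8, "truncated": 0, "total": 32},
]
print(round(task_score(points), 3))   # ≈ 0.866 for these illustrative points
```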
Output¶
One score per (tier, task) combination. Example:
(easy, arithmetic): 0.849
(easy, boolean): 0.823
(easy, dates): 0.765
...
Key Insight: Why Re-Aggregate with Wilson CI?¶
Alternative (wrong): Geometric mean of point scores
task_score = geomean([point1_score, point2_score, point3_score])
# Problem: Throws away sample size information!
# Point with 128 samples should weigh more than point with 32 samples.
Correct: Wilson CI re-aggregation
# Preserve sample sizes by summing adjusted counts
# Wilson CI naturally weighs points by sample count
task_score = wilson_on_sum(adjusted_successes, adjusted_trials)
Result: Points with more samples contribute more statistical weight, as they should.
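To see the weighting effect, compare both strategies on one large, accurate point plus one small, weak point (helpers from the sketches above; numbers are illustrative):

```python
from statistics import geometric_mean

big = {"adjusted_successes": 120.0, "adjusted_trials": 126.0, "truncated": 0, "total": 128}   # 128 samples
small = {"adjusted_successes": 3.0, "adjusted_trials": 7.5, "truncated": 0, "total": 8}       # 8 samples

# Wrong: every point counts equally, regardless of how many samples back it
per_point = []
for p in (big, small):
    c, m = wilson_interval(p["adjusted_successes"], p["adjusted_trials"])
    per_point.append(c + m)
naive = geometric_mean(per_point)

# Right: sum the adjusted counts first, so the 128-sample point dominates
weighted = task_score([big, small])

print(round(naive, 3), round(weighted, 3))   # ≈ 0.84 vs 0.956
```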
Layer 3: Tier ReasonScore (Geometric Mean Across Tasks)¶
Input¶
12 task scores within one tier (e.g., tier=easy)
Process¶
Step 1: Collect Task Scores
task_scores = [
0.849, # arithmetic
0.823, # boolean
0.765, # dates
0.891, # jsonpath
... # 12 tasks total
]
Step 2: Geometric Mean
tier_score = geometric_mean(task_scores)
Step 3: Scale to [10, 1000] Range
ReasonScore_tier = 1000 × tier_score
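A minimal sketch of Layer 3, using the standard-library `statistics.geometric_mean` and assuming task scores arrive already floored at 0.01 (the clamping described in the interpretation guide); the defensive clamp below just restates that assumption.

```python
from statistics import geometric_mean

def tier_reasonscore(task_scores: list[float]) -> float:
    """task_scores: one Layer 2 score per task within a tier."""
    clamped = [max(s, 0.01) for s in task_scores]   # floor at 0.01, per the clamping rule
    return 1000 * geometric_mean(clamped)           # Step 3: scale to the [10, 1000] range

print(round(tier_reasonscore([0.849, 0.823, 0.765, 0.891])))   # ≈ 831 on this 4-task subset
```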
Output¶
One ReasonScore per tier:
ReasonScore_easy: 850
ReasonScore_medium: 720
ReasonScore_hard: 580
Key Insight: Why Geometric Mean?¶
Scenario: Task specialist model
task_scores = [0.95, 0.95, 0.95, ..., 0.05] # Great at 11 tasks, fails 1 task
# Arithmetic mean (wrong):
arithmetic_mean = (11 × 0.95 + 0.05) / 12 = 0.875
# → Score: 875 (looks pretty good!)
# Geometric mean (correct):
geometric_mean = (0.95^11 × 0.05)^(1/12) ≈ 0.743
# → Score: 743 (reveals the weakness)
Why this matters: In deployment, users encounter all task types. A catastrophic failure on one task makes the model unusable, regardless of strengths elsewhere.
Catastrophic failure protection:
# If one task scores near 0:
task_scores = [0.95, 0.95, ..., 0.01]
geometric_mean = (0.95^11 × 0.01)^(1/12) ≈ 0.650
# → Score: 650 (heavily penalized)
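These figures can be reproduced directly with the standard-library helpers:

```python
from statistics import geometric_mean, mean

specialist = [0.95] * 11 + [0.05]   # great at 11 tasks, fails 1
print(round(mean(specialist), 3), round(geometric_mean(specialist), 3))   # 0.875 0.743

near_zero = [0.95] * 11 + [0.01]    # one task near zero
print(round(geometric_mean(near_zero), 3))                                # 0.65
```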
The 1000× Scaling: Intentional Design¶
Why multiply by 1000?
- Makes scores human-readable (850 vs 0.850)
- Puts the numerator on the same ~1k scale as typical per-response token counts (the denominator)
- Result: score/token naturally lives in [0, 1] for intuitive interpretation
Layer 4: The Uber-KPI (score/token)¶
Input¶
Three tier ReasonScores plus average token consumption
Process¶
Step 1: Arithmetic Mean Across Tiers
avg_score = (ReasonScore_easy + ReasonScore_medium + ReasonScore_hard) / 3
Step 2: Average Token Consumption
avg_tokens = (tokens_easy + tokens_medium + tokens_hard) / 3
Step 3: Compute Uber-KPI
score_per_token = avg_score / avg_tokens
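A minimal sketch of Layer 4; the helper name is illustrative, and the inputs below are the per-tier scores and token averages from the worked example later in this document.

```python
def score_per_token(tier_scores: list[float], tier_tokens: list[float]) -> float:
    """tier_scores: per-tier ReasonScores; tier_tokens: average tokens per response per tier."""
    avg_score = sum(tier_scores) / len(tier_scores)    # Step 1: arithmetic mean across tiers
    avg_tokens = sum(tier_tokens) / len(tier_tokens)   # Step 2: average token consumption
    return avg_score / avg_tokens                      # Step 3: the Uber-KPI

print(round(score_per_token([850, 720, 580], [1180, 1250, 1380]), 3))   # 0.564
```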
Output¶
Single number in [0, 1] range capturing everything:
Model A: score/token = 0.720
Model B: score/token = 0.680
Model C: score/token = 0.520
Key Insight: Why Switch to Arithmetic Mean?¶
Geometric mean already applied at Layer 3 (within each tier, punishing task imbalance).
At Layer 4, we're asking a different question:
- Layer 3: "Are you consistently good across tasks?" (geometric)
- Layer 4: "What's your overall performance level?" (arithmetic)
Example: Difficulty scaling collapse
ReasonScore_easy = 850
ReasonScore_medium = 720
ReasonScore_hard = 120 # Catastrophic collapse
# Geometric mean (too harsh):
geomean([850, 720, 120]) ≈ 419 # Double jeopardy!
# (Already penalized within hard tier for task failures)
# Arithmetic mean (fair):
mean([850, 720, 120]) = 563 # Still hurt, but not catastrophically
Result: Model gets credit for strengths (easy/medium) while being penalized for collapse (hard).
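A quick check of the two aggregation choices on these tier scores:

```python
from statistics import geometric_mean, mean

tiers = [850, 720, 120]
print(round(geometric_mean(tiers)), round(mean(tiers)))   # 419 563
```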
Why score/token is "The Uber-KPI"¶
Single number captures six dimensions:
| Dimension | How It's Captured |
|---|---|
| Accuracy | Adjusted center in Layer 1 |
| Confidence | Adjusted margin in Layer 1 |
| Reliability | Truncation penalty in Layer 1 |
| Balance | Geometric mean in Layer 3 |
| Scaling | Arithmetic mean in Layer 4 |
| Efficiency | Token division in Layer 4 |
Range [0, 1] is not accidental:
Perfect model: score=1000, tokens=1000 → 1.0
Good model: score=750, tokens=1500 → 0.5
Poor model: score=300, tokens=2000 → 0.15
Terrible model: score=50, tokens=5000 → 0.01
Complete Mathematical Flow¶
Example: Model X Evaluated at All Tiers¶
Tier = Easy¶
Arithmetic task (26 points):
# Layer 1: Each point has pre-computed Wilson CI
point_1: adjusted_successes=118.5, adjusted_trials=126.5, truncated=2/128
point_2: adjusted_successes=48.2, adjusted_trials=62.2, truncated=1/64
...
# Layer 2: Re-aggregate via PointsDB
total_successes = sum([118.5, 48.2, ...]) = 520.4
total_trials = sum([126.5, 62.2, ...]) = 614.8
total_truncated = 12, total_count = 640
task_center, task_margin = wilson_interval(520.4, 614.8)
# = (0.821, 0.048)
arithmetic_easy = 0.821 + 0.048 - (12/640) = 0.849
Repeat for 11 more tasks...
boolean_easy = 0.823
dates_easy = 0.765
...
Layer 3: Geometric mean across tasks
ReasonScore_easy = 1000 × geomean([0.849, 0.823, 0.765, ...])
= 1000 × 0.850
= 850
Tier = Medium¶
ReasonScore_medium = 720
Tier = Hard¶
ReasonScore_hard = 580
Layer 4: Uber-KPI¶
avg_score = (850 + 720 + 580) / 3 = 716.67
avg_tokens = (1180 + 1250 + 1380) / 3 = 1270
score_per_token = 716.67 / 1270 = 0.564
Interpretation Guide¶
ReasonScore Per Tier¶
| Range | Interpretation |
|---|---|
| 900-1000 | Near-perfect performance across all tasks |
| 700-900 | Strong overall with minor weaknesses |
| 500-700 | Good capability but notable gaps or truncation issues |
| 300-500 | Limited capability or severe task imbalance |
| 100-300 | Significant deficits across multiple domains |
| 10-100 | Catastrophic failures (geometric mean pulling score down) |
Minimum ReasonScore: 10 (task scores are clamped to a floor of 0.01 in Layer 2 to prevent negative values, so 1000 × geometric mean never drops below 10)
Difficulty Scaling Profiles¶
Balanced Scaler:
Easy: 850, Medium: 720, Hard: 580
→ Graceful degradation, maintains capability ratios
Catastrophic Scaler:
Easy: 820, Medium: 650, Hard: 120
→ Collapses at high difficulty (task failures at hard tier)
Early Breaker:
Easy: 480, Medium: 460, Hard: 440
→ Already struggling at easy tier, never had the capability
score/token (Uber-KPI)¶
| Range | Interpretation |
|---|---|
| 0.7-1.0 | Excellent quality + efficiency (deployment ready) |
| 0.5-0.7 | Good quality + reasonable efficiency (production viable) |
| 0.3-0.5 | Decent quality but inefficient, or poor quality but efficient |
| 0.1-0.3 | Poor quality and/or very inefficient (questionable viability) |
| 0.0-0.1 | Catastrophic (unusable for most purposes) |
Comparative Rankings¶
Traditional benchmark (accuracy only):
Model B: 90% accuracy → Rank 1
Model A: 85% accuracy → Rank 2
Model C: 82% accuracy → Rank 3
Model D: 42% accuracy → Rank 4
ReasonScape (score/token):
Model A: 0.720 (85% @ 1180 tokens) → Rank 1 (best value)
Model B: 0.680 (90% @ 1320 tokens) → Rank 2 (accurate but inefficient)
Model C: 0.520 (82% @ 1580 tokens) → Rank 3 (decent but wasteful)
Model D: 0.180 (42% @ 2340 tokens) → Rank 4 (expensive failure)
Why rankings differ: Model A has best quality-per-cost ratio, even though Model B has higher raw accuracy.
Innovations¶
Compared to Traditional Benchmarks¶
MMLU, HumanEval, etc.:
- Single aggregate accuracy (hides failure modes)
- Fixed test sets (memorization risk)
- No difficulty control (coarse-grained)
- No truncation handling (conflated with wrong answers)
- No token efficiency (ignores deployment cost)
- No statistical rigor (no confidence intervals)
ReasonScape:
- Multi-dimensional metric (accuracy + confidence + reliability + balance + scaling + efficiency)
- Parametric generation (no memorization)
- Controlled difficulty manifolds (fine-grained analysis)
- Truncation tracked separately (practical deployment concern)
- Token efficiency explicit (cost-aware)
- Statistical rigor (Wilson CI, excess accuracy correction)
Key Innovations¶
1. Excess Accuracy Correction¶
Most benchmarks treat guessing as skill. ReasonScape removes guessing contribution.
2. Optimistic Margin Handling¶
Ensures perfect models score 1000 regardless of sample count (uncertainty is "our fault").
3. Pessimistic Truncation Handling¶
Context limits are practical failures, penalized directly ("model's fault").
4. Geometric Mean for Balance¶
Task imbalance severely penalizes score (reflects deployment reality).
5. Wilson CI Re-Aggregation¶
Preserves sample size information when aggregating across difficulty manifold.
6. Three-Tier Scaling Analysis¶
Reveals how models maintain (or lose) capability under pressure.
7. Token Efficiency Integration¶
Makes cost explicit in the final metric (deployment-focused).
8. [0, 1] Range by Design¶
The numerator is scaled to ~1000, matching typical per-response token counts (~1000) in the denominator, so the ratio reads intuitively in [0, 1].
Design Decisions Summary¶
Why These Specific Choices?¶
| Decision | Rationale |
|---|---|
| Add margin | Perfect models should score 1000 regardless of samples |
| Subtract truncation | Context failures are model limitations, not uncertainty |
| Wilson CI twice | Layer 1: per-point stats; Layer 2: re-aggregate preserving sample sizes |
| Geometric mean (tasks) | Can't hide catastrophic task failures |
| Arithmetic mean (tiers) | Fair difficulty averaging (geometric would double-penalize) |
| 1000× scaling | Makes score/token naturally live in [0, 1] |
| 0.01 clamping | Prevents negative scores while still heavily penalizing failures |
| Divide by tokens | Makes deployment cost explicit |
Usage in ReasonScape¶
Stage 4: Leaderboard (leaderboard.py)¶
python leaderboard.py data/dataset-m12x.json
Displays per-tier ReasonScores and overall score/token rankings.
Stage 5: Analysis (analyze.py scores)¶
python analyze.py scores data/dataset-m12x.json --output scores.md
Computes score/token with per-task breakdowns for filtered model sets.
PointsDB Integration¶
ReasonScore calculation relies on:
- Layer 1 data: Stored in the `points` table (`adjusted_successes`, `adjusted_trials`, etc.)
- Layer 2 aggregation: Via the `aggregate()` function (Wilson CI re-computation)
- Layer 3-4 computation: In `src/scores.py` using tier faceting
References¶
- technical-details.md - Statistical foundations (Wilson CI, excess accuracy correction)
- pointsdb.md - Data structure and aggregation functions
- architecture.md - Integration with five-stage pipeline
- workflows/1-ranking.md - Using ReasonScore for model selection
Appendix: The Perfect Model¶
What should score 1000?
A model that:
1. Gets all answers correct (100% accuracy above guessing)
2. Never truncates (0% context failures)
3. Maintains performance across all tasks (perfect balance)
4. Maintains performance across all difficulty levels (perfect scaling)
With finite samples:
# 100 samples per point
point_score = 0.95 + 0.05 - 0.0 = 1.00 # Upper CI bound
# All points perfect
task_score = wilson_on_sum(perfect_points) = 1.00
# All tasks perfect
ReasonScore_tier = 1000 × geomean([1.0, 1.0, ...]) = 1000
# All tiers perfect
avg_score = (1000 + 1000 + 1000) / 3 = 1000
# Minimum possible tokens (1 token per point required to form an answer)
avg_tokens = 1
# Uber-KPI (maximum possible score_per_token)
score_per_token = 1000 / 1 = 1000
The theoretical ceiling is therefore 1000 score_per_token (a perfect ReasonScore of 1000 delivered in a single token). Real models spend on the order of 1000 tokens per response, which is why score/token lands in the [0, 1] range in practice and fairly ranks models by their practical deployment value.