ReasonScore: A Unified LLM Evaluation Metric¶
Overview¶
ReasonScore is ReasonScape's unified metric for LLM evaluation, designed to capture every practical aspect of model performance in a single, interpretable number. The final output, score/token, typically falls between 0 and 1 and represents the balance between reasoning quality and computational efficiency.
What ReasonScore captures:
- ✅ Accuracy - Correctness of answers (adjusted for guessing)
- ✅ Statistical confidence - Uncertainty from finite sampling
- ✅ Context reliability - Truncation and context limit issues
- ✅ Task balance - Performance consistency across reasoning domains
- ✅ Difficulty scaling - Capability maintenance under increasing complexity
- ✅ Token efficiency - Computational cost per unit of quality
Unlike traditional benchmarks that report only accuracy, ReasonScore provides a deployment-focused metric that answers: "Should I use this model? What will it cost me? Where will it fail?"
Design Philosophy¶
Three Core Principles¶
1. Optimistic About Uncertainty (Statistical Margin)¶
Statistical uncertainty is "our fault" for not sampling infinitely.
→ Give models the benefit of the doubt.
→ Use upper bound of confidence interval (center + margin).
Why: A perfect model with 100 samples might have center=0.95, margin=0.05. If we subtracted margin, we'd penalize the model purely for our sampling decision. By adding margin (upper CI bound = 1.0), the perfect model scores 1.0 regardless of sample count.
2. Pessimistic About Failures (Truncation)¶
Context limits and truncations are "the model's fault."
→ Penalize directly.
→ Subtract truncation ratio from score.
Why: If a model hits context limits, it's a practical deployment failure. This must hurt the score.
3. Punish Imbalance (Geometric Mean)¶
Being great at 11 tasks doesn't excuse catastrophic failure at 1 task.
→ Use geometric mean across tasks.
→ Failures drag down overall score significantly.
Why: In real deployment, users will hit all task types. A model that fails catastrophically on date reasoning is unusable, regardless of arithmetic prowess.
The Four-Layer Architecture¶
ReasonScore is computed in four layers, each using the mathematically appropriate operation for its purpose:
Layer 1: Samples → Point Score [Wilson CI]
Layer 2: Points → Task Score [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore [Geometric Mean × 1000]
Layer 4: Tiers → score/token [Arithmetic Mean ÷ tokens]
Layer 1: Point Score (Wilson CI Per Point)¶
Input¶
Raw test samples at one difficulty coordinate (e.g., 128 samples at length=16, depth=2)
Process¶
Step 1: Excess Accuracy Correction
# Remove guessing contribution
guess_accumulator = sum(sample.guess_chance for sample in samples)
adjusted_successes = correct_count - guess_accumulator
adjusted_trials = total_count - guess_accumulator
Result: 0.0 = pure guessing, 1.0 = perfect knowledge above guessing
Step 2: Wilson Confidence Interval
# Compute 95% CI bounds
adjusted_center, adjusted_margin = wilson_interval(adjusted_successes, adjusted_trials)
Step 3: Truncation Penalty
truncated_ratio = truncated_count / total_count
Step 4: Point Score
point_score = adjusted_center + adjusted_margin - truncated_ratio
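For reference, here is a minimal, self-contained sketch of the helper and the four steps above. The name `wilson_interval` follows the pseudocode; the z value (1.96 for a 95% interval) and the clamp on the adjusted proportion are assumptions of this sketch, and ReasonScape's actual implementation may differ.

```python
import math

def wilson_interval(successes: float, trials: float, z: float = 1.96) -> tuple[float, float]:
    """Return (center, margin) of the Wilson score interval.

    Accepts fractional counts, since excess-accuracy correction produces
    non-integer adjusted successes/trials.
    """
    if trials <= 0:
        return 0.0, 0.0
    p = min(max(successes / trials, 0.0), 1.0)  # clamp: adjusted counts can dip below 0 (sketch assumption)
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1.0 - p) / trials + z * z / (4.0 * trials * trials))
    return center, margin

def point_score(correct: int, total: int, truncated: int, guess_chances: list[float]) -> float:
    """Combine Steps 1-4 above into a single point score."""
    guess_accumulator = sum(guess_chances)                 # Step 1: excess accuracy correction
    adjusted_successes = correct - guess_accumulator
    adjusted_trials = total - guess_accumulator
    center, margin = wilson_interval(adjusted_successes, adjusted_trials)  # Step 2: 95% Wilson CI
    truncated_ratio = truncated / total                    # Step 3: truncation penalty
    return center + margin - truncated_ratio               # Step 4: point score
```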
Output¶
One score per point, stored in PointsDB:
`adjusted_successes` (DOUBLE), `adjusted_trials` (DOUBLE), `adjusted_center` (DOUBLE), `adjusted_margin` (DOUBLE), `truncated_ratio` (DOUBLE)
Key Insight: Why Add Margin?¶
Perfect model example:
# 100 samples, all correct, no truncation
center = 0.95 # Wilson center approaches 1.0 only at infinity
margin = 0.05 # Uncertainty from finite sampling
# If we subtracted margin:
score = 0.95 - 0.05 - 0 = 0.90 # Penalized for our sampling choice!
# By adding margin (upper CI bound):
score = 0.95 + 0.05 - 0 = 1.00 # Perfect score, as it should be
Wilson CI guarantee: center + margin ≤ 1.0 (the upper bound never exceeds 1.0)
Therefore: point_score ∈ [-1.0, 1.0] theoretically, [0.0, 1.0] typically.
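As a quick check with the `wilson_interval` sketch above: for an all-correct sample the exact center and margin differ from the schematic 0.95/0.05 figures, but they always sum to the upper bound of 1.0, so the point score does not depend on how many samples we happened to draw.

```python
center, margin = wilson_interval(100, 100)   # 100/100 correct, no guessing, no truncation
print(round(center, 3), round(margin, 3))    # 0.982 0.018
print(round(center + margin, 6))             # 1.0 -- upper CI bound, regardless of sample count
```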
Layer 2: Task Score (Wilson CI Re-Aggregation)¶
Input¶
Multiple points within one (tier, task) combination. Example: 26 points for (tier=easy, task=arithmetic)
Process¶
Step 1: Query via PointsDB aggregate()
df = db.aggregate(
filters={"tiers": ["easy"], "base_task": "arithmetic"},
group_by=["eval_id", "base_task"]
)
Step 2: Sum Adjusted Counts
-- aggregate() internally performs:
total_successes = SUM(adjusted_successes) -- across all points
total_trials = SUM(adjusted_trials) -- across all points
total_truncated = SUM(truncated)
total_count = SUM(total)
Example:
Point 1 (length=8): adjusted_successes=118.5, adjusted_trials=126.5
Point 2 (length=16): adjusted_successes=48.2, adjusted_trials=62.2
Point 3 (length=24): adjusted_successes=16.8, adjusted_trials=30.8
Aggregated: total_successes=183.5, total_trials=219.5
Step 3: Re-Compute Wilson CI
# Treat aggregated population as single large sample
task_center, task_margin = wilson_interval(total_successes, total_trials)
task_truncated_ratio = total_truncated / total_count
Step 4: Task Score
task_score = task_center + task_margin - task_truncated_ratio
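A minimal sketch of the re-aggregation, reusing the `wilson_interval` helper from the Layer 1 sketch. The per-point fields mirror the PointsDB columns, but the plain list of dicts and the truncation counts for the example points are illustrative.

```python
def task_score(points: list[dict]) -> float:
    """points: Layer 1 records for every point in one (tier, task) combination."""
    total_successes = sum(p["adjusted_successes"] for p in points)   # Step 2: sum adjusted counts
    total_trials = sum(p["adjusted_trials"] for p in points)
    total_truncated = sum(p["truncated"] for p in points)
    total_count = sum(p["total"] for p in points)

    center, margin = wilson_interval(total_successes, total_trials)  # Step 3: re-compute Wilson CI
    truncated_ratio = total_truncated / total_count
    return center + margin - truncated_ratio                         # Step 4: task score

points = [  # the three example points from Step 2, with illustrative truncation counts
    {"adjusted_successes": 118.5, "adjusted_trials": 126.5, "truncated": 2, "total": 128},
    {"adjusted_successes": 48.2, "adjusted_trials": 62.2, "truncated": 1, "total": 64},
    {"adjusted_successes": 16.8, "adjusted_trials": 30.8, "truncated": 0, "total": 32},
]
print(round(task_score(points), 3))   # ≈ 0.866 for these illustrative points
```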
Output¶
One score per (tier, task) combination. Example:
(easy, arithmetic): 0.849
(easy, boolean): 0.823
(easy, dates): 0.765
...
Key Insight: Why Re-Aggregate with Wilson CI?¶
Alternative (wrong): Geometric mean of point scores
task_score = geomean([point1_score, point2_score, point3_score])
# Problem: Throws away sample size information!
# Point with 128 samples should weigh more than point with 32 samples.
Correct: Wilson CI re-aggregation
# Preserve sample sizes by summing adjusted counts
# Wilson CI naturally weighs points by sample count
task_score = wilson_on_sum(adjusted_successes, adjusted_trials)
Result: Points with more samples contribute more statistical weight, as they should.
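To see the weighting effect, compare both strategies on one large, accurate point plus one small, weak point (helpers from the sketches above; numbers are illustrative):

```python
from statistics import geometric_mean

big = {"adjusted_successes": 120.0, "adjusted_trials": 126.0, "truncated": 0, "total": 128}   # 128 samples
small = {"adjusted_successes": 3.0, "adjusted_trials": 7.5, "truncated": 0, "total": 8}       # 8 samples

# Wrong: every point counts equally, regardless of how many samples back it
per_point = []
for p in (big, small):
    c, m = wilson_interval(p["adjusted_successes"], p["adjusted_trials"])
    per_point.append(c + m)
naive = geometric_mean(per_point)

# Right: sum the adjusted counts first, so the 128-sample point dominates
weighted = task_score([big, small])

print(round(naive, 3), round(weighted, 3))   # ≈ 0.84 vs 0.956
```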
Layer 3: Tier ReasonScore (Geometric Mean Across Tasks)¶
Input¶
12 task scores within one tier (e.g., tier=easy)
Process¶
Step 1: Collect Task Scores
task_scores = [
0.849, # arithmetic
0.823, # boolean
0.765, # dates
0.891, # jsonpath
... # 12 tasks total
]
Step 2: Geometric Mean
tier_score = geometric_mean(task_scores)
Step 3: Scale to [10, 1000] Range
ReasonScore_tier = 1000 × tier_score
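A minimal sketch of Layer 3, using the standard-library `statistics.geometric_mean` and assuming task scores arrive already floored at 0.01 (the clamping described in the interpretation guide); the defensive clamp below just restates that assumption.

```python
from statistics import geometric_mean

def tier_reasonscore(task_scores: list[float]) -> float:
    """task_scores: one Layer 2 score per task within a tier."""
    clamped = [max(s, 0.01) for s in task_scores]   # floor at 0.01, per the clamping rule
    return 1000 * geometric_mean(clamped)           # Step 3: scale to the [10, 1000] range

print(round(tier_reasonscore([0.849, 0.823, 0.765, 0.891])))   # ≈ 831 on this 4-task subset
```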
Output¶
One ReasonScore per tier:
ReasonScore_easy: 850
ReasonScore_medium: 720
ReasonScore_hard: 580
Key Insight: Why Geometric Mean?¶
Scenario: Task specialist model
task_scores = [0.95, 0.95, 0.95, ..., 0.05] # Great at 11 tasks, fails 1 task
# Arithmetic mean (wrong):
arithmetic_mean = (11 × 0.95 + 0.05) / 12 = 0.875
# → Score: 875 (looks pretty good!)
# Geometric mean (correct):
geometric_mean = (0.95^11 × 0.05)^(1/12) ≈ 0.743
# → Score: 743 (reveals the weakness)
Why this matters: In deployment, users encounter all task types. A catastrophic failure on one task makes the model unusable, regardless of strengths elsewhere.
Catastrophic failure protection:
# If one task scores near 0:
task_scores = [0.95, 0.95, ..., 0.01]
geometric_mean = (0.95^11 × 0.01)^(1/12) ≈ 0.650
# → Score: 650 (heavily penalized)
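These figures can be reproduced directly with the standard-library helpers:

```python
from statistics import geometric_mean, mean

specialist = [0.95] * 11 + [0.05]   # great at 11 tasks, fails 1
print(round(mean(specialist), 3), round(geometric_mean(specialist), 3))   # 0.875 0.743

near_zero = [0.95] * 11 + [0.01]    # one task near zero
print(round(geometric_mean(near_zero), 3))                                # 0.65
```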
The 1000× Scaling: Intentional Design¶
Why multiply by 1000?
- Makes scores human-readable (850 vs 0.850)
- Puts the numerator on the same ~1k scale as typical per-response token counts (the denominator)
- Result: score/token naturally lives in [0, 1] for intuitive interpretation
Layer 4: The Uber-KPI (score/token)¶
Input¶
Three tier ReasonScores plus average token consumption
Process¶
Step 1: Arithmetic Mean Across Tiers
avg_score = (ReasonScore_easy + ReasonScore_medium + ReasonScore_hard) / 3
Step 2: Average Token Consumption
avg_tokens = (tokens_easy + tokens_medium + tokens_hard) / 3
Step 3: Compute Uber-KPI
score_per_token = avg_score / avg_tokens
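A minimal sketch of Layer 4; the helper name is illustrative, and the inputs below are the per-tier scores and token averages from the worked example later in this document.

```python
def score_per_token(tier_scores: list[float], tier_tokens: list[float]) -> float:
    """tier_scores: per-tier ReasonScores; tier_tokens: average tokens per response per tier."""
    avg_score = sum(tier_scores) / len(tier_scores)    # Step 1: arithmetic mean across tiers
    avg_tokens = sum(tier_tokens) / len(tier_tokens)   # Step 2: average token consumption
    return avg_score / avg_tokens                      # Step 3: the Uber-KPI

print(round(score_per_token([850, 720, 580], [1180, 1250, 1380]), 3))   # 0.564
```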
Output¶
Single number in [0, 1] range capturing everything:
Model A: score/token = 0.720
Model B: score/token = 0.680
Model C: score/token = 0.520
Key Insight: Why Switch to Arithmetic Mean?¶
Geometric mean already applied at Layer 3 (within each tier, punishing task imbalance).
At Layer 4, we're asking a different question:
- Layer 3: "Are you consistently good across tasks?" (geometric)
- Layer 4: "What's your overall performance level?" (arithmetic)
Example: Difficulty scaling collapse
ReasonScore_easy = 850
ReasonScore_medium = 720
ReasonScore_hard = 120 # Catastrophic collapse
# Geometric mean (too harsh):
geomean([850, 720, 120]) ≈ 419 # Double jeopardy!
# (Already penalized within hard tier for task failures)
# Arithmetic mean (fair):
mean([850, 720, 120]) = 563 # Still hurt, but not catastrophically
Result: Model gets credit for strengths (easy/medium) while being penalized for collapse (hard).
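A quick check of the two aggregation choices on these tier scores:

```python
from statistics import geometric_mean, mean

tiers = [850, 720, 120]
print(round(geometric_mean(tiers)), round(mean(tiers)))   # 419 563
```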
Why score/token is "The Uber-KPI"¶
Single number captures six dimensions:
| Dimension | How It's Captured |
|---|---|
| Accuracy | Adjusted center in Layer 1 |
| Confidence | Adjusted margin in Layer 1 |
| Reliability | Truncation penalty in Layer 1 |
| Balance | Geometric mean in Layer 3 |
| Scaling | Arithmetic mean in Layer 4 |
| Efficiency | Token division in Layer 4 |
Range [0, 1] is not accidental:
Perfect model: score=1000, tokens=1000 → 1.0
Good model: score=750, tokens=1500 → 0.5
Poor model: score=300, tokens=2000 → 0.15
Terrible model: score=50, tokens=5000 → 0.01
Complete Mathematical Flow¶
Example: Model X Evaluated at All Tiers¶
Tier = Easy¶
Arithmetic task (26 points):
# Layer 1: Each point has pre-computed Wilson CI
point_1: adjusted_successes=118.5, adjusted_trials=126.5, truncated=2/128
point_2: adjusted_successes=48.2, adjusted_trials=62.2, truncated=1/64
...
# Layer 2: Re-aggregate via PointsDB
total_successes = sum([118.5, 48.2, ...]) = 520.4
total_trials = sum([126.5, 62.2, ...]) = 614.8
total_truncated = 12, total_count = 640
task_center, task_margin = wilson_interval(520.4, 614.8)
# = (0.821, 0.048)
arithmetic_easy = 0.821 + 0.048 - (12/640) = 0.849
Repeat for 11 more tasks...
boolean_easy = 0.823
dates_easy = 0.765
...
Layer 3: Geometric mean across tasks
ReasonScore_easy = 1000 × geomean([0.849, 0.823, 0.765, ...])
= 1000 × 0.850
= 850
Tier = Medium¶
ReasonScore_medium = 720
Tier = Hard¶
ReasonScore_hard = 580
Layer 4: Uber-KPI¶
avg_score = (850 + 720 + 580) / 3 = 716.67
avg_tokens = (1180 + 1250 + 1380) / 3 = 1270
score_per_token = 716.67 / 1270 = 0.564
Interpretation Guide¶
ReasonScore Per Tier¶
| Range | Interpretation |
|---|---|
| 900-1000 | Near-perfect performance across all tasks |
| 700-900 | Strong overall with minor weaknesses |
| 500-700 | Good capability but notable gaps or truncation issues |
| 300-500 | Limited capability or severe task imbalance |
| 100-300 | Significant deficits across multiple domains |
| 10-100 | Catastrophic failures (geometric mean pulling score down) |
Minimum ReasonScore: 10 (task scores are clamped to a floor of 0.01 in Layer 2 to prevent negative values, so 1000 × geometric mean never drops below 10)
Difficulty Scaling Profiles¶
Balanced Scaler:
Easy: 850, Medium: 720, Hard: 580
→ Graceful degradation, maintains capability ratios
Catastrophic Scaler:
Easy: 820, Medium: 650, Hard: 120
→ Collapses at high difficulty (task failures at hard tier)
Early Breaker:
Easy: 480, Medium: 460, Hard: 440
→ Already struggling at easy tier, never had the capability
score/token (Uber-KPI)¶
| Range | Interpretation |
|---|---|
| 0.7-1.0 | Excellent quality + efficiency (deployment ready) |
| 0.5-0.7 | Good quality + reasonable efficiency (production viable) |
| 0.3-0.5 | Decent quality but inefficient, or poor quality but efficient |
| 0.1-0.3 | Poor quality and/or very inefficient (questionable viability) |
| 0.0-0.1 | Catastrophic (unusable for most purposes) |
Comparative Rankings¶
Traditional benchmark (accuracy only):
Model B: 90% accuracy → Rank 1
Model A: 85% accuracy → Rank 2
Model C: 82% accuracy → Rank 3
Model D: 42% accuracy → Rank 4
ReasonScape (score/token):
Model A: 0.720 (85% @ 1180 tokens) → Rank 1 (best value)
Model B: 0.680 (90% @ 1320 tokens) → Rank 2 (accurate but inefficient)
Model C: 0.520 (82% @ 1580 tokens) → Rank 3 (decent but wasteful)
Model D: 0.180 (42% @ 2340 tokens) → Rank 4 (expensive failure)
Why rankings differ: Model A has best quality-per-cost ratio, even though Model B has higher raw accuracy.
Innovations¶
Compared to Traditional Benchmarks¶
MMLU, HumanEval, etc.:
- Single aggregate accuracy (hides failure modes)
- Fixed test sets (memorization risk)
- No difficulty control (coarse-grained)
- No truncation handling (conflated with wrong answers)
- No token efficiency (ignores deployment cost)
- No statistical rigor (no confidence intervals)
ReasonScape:
- Multi-dimensional metric (accuracy + confidence + reliability + balance + scaling + efficiency)
- Parametric generation (no memorization)
- Controlled difficulty manifolds (fine-grained analysis)
- Truncation tracked separately (practical deployment concern)
- Token efficiency explicit (cost-aware)
- Statistical rigor (Wilson CI, excess accuracy correction)
Key Innovations¶
1. Excess Accuracy Correction¶
Most benchmarks treat guessing as skill. ReasonScape removes guessing contribution.
2. Optimistic Margin Handling¶
Ensures perfect models score 1000 regardless of sample count (uncertainty is "our fault").
3. Pessimistic Truncation Handling¶
Context limits are practical failures, penalized directly ("model's fault").
4. Geometric Mean for Balance¶
Task imbalance severely penalizes score (reflects deployment reality).
5. Wilson CI Re-Aggregation¶
Preserves sample size information when aggregating across difficulty manifold.
6. Three-Tier Scaling Analysis¶
Reveals how models maintain (or lose) capability under pressure.
7. Token Efficiency Integration¶
Makes cost explicit in the final metric (deployment-focused).
8. [0, 1] Range by Design¶
The numerator is scaled to ~1000, matching typical per-response token counts (~1000) in the denominator, so the ratio reads intuitively in [0, 1].
Design Decisions Summary¶
Why These Specific Choices?¶
| Decision | Rationale |
|---|---|
| Add margin | Perfect models should score 1000 regardless of samples |
| Subtract truncation | Context failures are model limitations, not uncertainty |
| Wilson CI twice | Layer 1: per-point stats; Layer 2: re-aggregate preserving sample sizes |
| Geometric mean (tasks) | Can't hide catastrophic task failures |
| Arithmetic mean (tiers) | Fair difficulty averaging (geometric would double-penalize) |
| 1000× scaling | Makes score/token naturally live in [0, 1] |
| 0.01 clamping | Prevents negative scores while still heavily penalizing failures |
| Divide by tokens | Makes deployment cost explicit |
Usage in ReasonScape¶
Stage 4: Leaderboard (leaderboard.py)¶
python leaderboard.py data/dataset-m12x.json
Displays per-tier ReasonScores and overall score/token rankings.
Stage 5: Analysis (analyze.py scores)¶
python analyze.py scores data/dataset-m12x.json --output scores.md
Computes score/token with per-task breakdowns for filtered model sets.
PointsDB Integration¶
ReasonScore calculation relies on:
- Layer 1 data: Stored in the `points` table (`adjusted_successes`, `adjusted_trials`, etc.)
- Layer 2 aggregation: Via the `aggregate()` function (Wilson CI re-computation)
- Layer 3-4 computation: In `src/scores.py` using tier faceting
References¶
- technical-details.md - Statistical foundations (Wilson CI, excess accuracy correction)
- pointsdb.md - Data structure and aggregation functions
- architecture.md - Integration with five-stage pipeline
- workflows/1-ranking.md - Using ReasonScore for model selection
Appendix: The Perfect Model¶
What should score 1000?
A model that:
1. Gets all answers correct (100% accuracy above guessing)
2. Never truncates (0% context failures)
3. Maintains performance across all tasks (perfect balance)
4. Maintains performance across all difficulty levels (perfect scaling)
With finite samples:
# 100 samples per point
point_score = 0.95 + 0.05 - 0.0 = 1.00 # Upper CI bound
# All points perfect
task_score = wilson_on_sum(perfect_points) = 1.00
# All tasks perfect
ReasonScore_tier = 1000 × geomean([1.0, 1.0, ...]) = 1000
# All tiers perfect
avg_score = (1000 + 1000 + 1000) / 3 = 1000
# Minimum possible tokens (1 token per point required to form an answer)
avg_tokens = 1
# Uber-KPI (maximum possible score_per_token)
score_per_token = 1000 / 1 = 1000
The theoretical ceiling is therefore 1000 score_per_token (a perfect ReasonScore of 1000 delivered in a single token). Real models spend on the order of 1000 tokens per response, which is why score/token lands in the [0, 1] range in practice and fairly ranks models by their practical deployment value.