
ReasonScore: A Unified LLM Evaluation Metric

Overview

ReasonScore is ReasonScape's unified metric for LLM evaluation, designed to capture every practical aspect of model performance in a single, interpretable number. The ultimate output, score/token, typically falls between 0 and 1 and represents the balance between reasoning quality and computational efficiency.

What ReasonScore captures:

  • Accuracy - Correctness of answers (adjusted for guessing)
  • Statistical confidence - Uncertainty from finite sampling
  • Context reliability - Truncation and context limit issues
  • Task balance - Performance consistency across reasoning domains
  • Difficulty scaling - Capability maintenance under increasing complexity
  • Token efficiency - Computational cost per unit of quality

Unlike traditional benchmarks that report only accuracy, ReasonScore provides a deployment-focused metric that answers: "Should I use this model? What will it cost me? Where will it fail?"


Design Philosophy

Three Core Principles

1. Optimistic About Uncertainty (Statistical Margin)

Statistical uncertainty is "our fault" for not sampling infinitely.
→ Give models the benefit of the doubt.
→ Use upper bound of confidence interval (center + margin).

Why: A perfect model with 100 samples might have center=0.95, margin=0.05. If we subtracted margin, we'd penalize the model purely for our sampling decision. By adding margin (upper CI bound = 1.0), the perfect model scores 1.0 regardless of sample count.

2. Pessimistic About Failures (Truncation)

Context limits and truncations are "the model's fault."
→ Penalize directly.
→ Subtract truncation ratio from score.

Why: If a model hits context limits, it's a practical deployment failure. This must hurt the score.

3. Punish Imbalance (Geometric Mean)

Being great at 11 tasks doesn't excuse catastrophic failure at 1 task.
→ Use geometric mean across tasks.
→ Failures drag down overall score significantly.

Why: In real deployment, users will hit all task types. A model that fails catastrophically on date reasoning is unusable, regardless of arithmetic prowess.


The Four-Layer Architecture

ReasonScore is computed in four layers, each using the mathematically appropriate operation for its purpose:

Layer 1: Samples → Point Score       [Wilson CI]
Layer 2: Points → Task Score         [Wilson CI re-aggregation]
Layer 3: Tasks → Tier ReasonScore    [Geometric Mean × 1000]
Layer 4: Tiers → score/token         [Arithmetic Mean ÷ tokens]

Layer 1: Point Score (Wilson CI Per Point)

Input

Raw test samples at one difficulty coordinate (e.g., 128 samples at length=16, depth=2)

Process

Step 1: Excess Accuracy Correction

# Remove the expected contribution from guessing
# (each sample carries its own chance of being answered correctly by luck)
guess_accumulator = sum(sample.guess_chance for sample in samples)
adjusted_successes = correct_count - guess_accumulator
adjusted_trials = total_count - guess_accumulator

Result: 0.0 = pure guessing, 1.0 = perfect knowledge above guessing
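
A small worked example with hypothetical numbers: 128 four-way multiple-choice samples (guess_chance = 0.25 each), of which 96 were answered correctly.

guess_accumulator = 128 * 0.25                  # 32 expected lucky guesses
adjusted_successes = 96 - guess_accumulator     # 64.0
adjusted_trials = 128 - guess_accumulator       # 96.0
adjusted_accuracy = adjusted_successes / adjusted_trials   # ≈ 0.667

Raw accuracy would be 75%, but only about 0.667 of the headroom above pure guessing is actually earned.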

Step 2: Wilson Confidence Interval

# Compute 95% CI bounds
adjusted_center, adjusted_margin = wilson_interval(adjusted_successes, adjusted_trials)
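
For reference, here is a minimal sketch of a 95% Wilson interval returning the (center, margin) pair used throughout this page; the actual ReasonScape implementation may differ in details such as the z value or edge-case handling.

import math

def wilson_interval(successes: float, trials: float, z: float = 1.96):
    """Return (center, margin) of the Wilson score interval.

    Accepts fractional counts so it also works on guess-adjusted values.
    The upper bound is center + margin, the lower bound is center - margin.
    """
    if trials <= 0:
        return 0.0, 0.0
    # Adjusted successes can dip below zero when a model scores below chance;
    # clamp the proportion into [0, 1] before computing the interval.
    p_hat = min(1.0, max(0.0, successes / trials))
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2.0 * trials)) / denom
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / trials
                           + z * z / (4.0 * trials * trials)) / denom
    return center, margin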

Step 3: Truncation Penalty

truncated_ratio = truncated_count / total_count

Step 4: Point Score

point_score = adjusted_center + adjusted_margin - truncated_ratio
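
Putting the four steps together, a minimal sketch of the per-point computation (names and signature are illustrative, not the actual ReasonScape API; wilson_interval is the sketch shown under Step 2):

def compute_point_score(correct: int, total: int, truncated: int,
                        guess_chances: list[float]) -> float:
    """Sketch of Layer 1: one score for one difficulty coordinate."""
    guess_accumulator = sum(guess_chances)
    adjusted_successes = correct - guess_accumulator
    adjusted_trials = total - guess_accumulator
    center, margin = wilson_interval(adjusted_successes, adjusted_trials)
    truncated_ratio = truncated / total
    # Optimistic about sampling uncertainty (add margin),
    # pessimistic about truncation (subtract the ratio).
    return center + margin - truncated_ratio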

Output

One score per point, stored in PointsDB:

  • adjusted_successes (DOUBLE)
  • adjusted_trials (DOUBLE)
  • adjusted_center (DOUBLE)
  • adjusted_margin (DOUBLE)
  • truncated_ratio (DOUBLE)

Key Insight: Why Add Margin?

Perfect model example:

# 100 samples, all correct, no truncation
center = 0.95  # Wilson center approaches 1.0 only at infinity
margin = 0.05  # Uncertainty from finite sampling

# If we subtracted margin:
score = 0.95 - 0.05 - 0 = 0.90  # Penalized for our sampling choice!

# By adding margin (upper CI bound):
score = 0.95 + 0.05 - 0 = 1.00  # Perfect score, as it should be

The mathematics of the Wilson CI guarantees center + margin ≤ 1.0 (the upper bound never exceeds 1.0).

Therefore: point_score ∈ [-1.0, 1.0] theoretically, [0.0, 1.0] typically.
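
As a quick sanity check of this guarantee (using the wilson_interval sketch from Step 2), the upper bound is exactly 1.0 whenever every sample is correct, regardless of sample count:

for n in (10, 100, 1000, 10000):
    center, margin = wilson_interval(n, n)   # all samples correct
    assert abs((center + margin) - 1.0) < 1e-9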


Layer 2: Task Score (Wilson CI Re-Aggregation)

Input

Multiple points within one (tier, task) combination. Example: 26 points for (tier=easy, task=arithmetic)

Process

Step 1: Query via PointsDB aggregate()

df = db.aggregate(
    filters={"tiers": ["easy"], "base_task": "arithmetic"},
    group_by=["eval_id", "base_task"]
)

Step 2: Sum Adjusted Counts

-- aggregate() internally performs:
total_successes = SUM(adjusted_successes)  -- across all points
total_trials = SUM(adjusted_trials)        -- across all points
total_truncated = SUM(truncated)
total_count = SUM(total)

Example:

Point 1 (length=8):  adjusted_successes=118.5, adjusted_trials=126.5
Point 2 (length=16): adjusted_successes=48.2,  adjusted_trials=62.2
Point 3 (length=24): adjusted_successes=16.8,  adjusted_trials=30.8

Aggregated: total_successes=183.5, total_trials=219.5

Step 3: Re-Compute Wilson CI

# Treat aggregated population as single large sample
task_center, task_margin = wilson_interval(total_successes, total_trials)
task_truncated_ratio = total_truncated / total_count

Step 4: Task Score

task_score = task_center + task_margin - task_truncated_ratio

Output

One score per (tier, task) combination. Example:

(easy, arithmetic): 0.849
(easy, boolean):    0.823
(easy, dates):      0.765
...

Key Insight: Why Re-Aggregate with Wilson CI?

Alternative (wrong): Geometric mean of point scores

task_score = geomean([point1_score, point2_score, point3_score])
# Problem: Throws away sample size information!
# Point with 128 samples should weigh more than point with 32 samples.

Correct: Wilson CI re-aggregation

# Preserve sample sizes by summing adjusted counts
# Wilson CI naturally weighs points by sample count
task_score = wilson_on_sum(adjusted_successes, adjusted_trials)

Result: Points with more samples contribute more statistical weight, as they should.
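
A minimal sketch of this re-aggregation, reusing the three example points above and the wilson_interval sketch from Layer 1 (truncation omitted for brevity):

points = [
    (118.5, 126.5),   # length=8:  (adjusted_successes, adjusted_trials)
    (48.2,  62.2),    # length=16
    (16.8,  30.8),    # length=24
]

total_successes = sum(s for s, _ in points)   # 183.5
total_trials = sum(t for _, t in points)      # 219.5

# Wilson CI on the pooled counts: points with more trials carry more weight.
task_center, task_margin = wilson_interval(total_successes, total_trials)
task_score = task_center + task_margin        # minus truncation ratio, omitted here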


Layer 3: Tier ReasonScore (Geometric Mean Across Tasks)

Input

12 task scores within one tier (e.g., tier=easy)

Process

Step 1: Collect Task Scores

task_scores = [
    0.849,  # arithmetic
    0.823,  # boolean
    0.765,  # dates
    0.891,  # jsonpath
    ...     # 12 tasks total
]

Step 2: Geometric Mean

tier_score = geometric_mean(task_scores)

Step 3: Scale to [10, 1000] Range

ReasonScore_tier = 1000 × tier_score

Output

One ReasonScore per tier:

ReasonScore_easy:   850
ReasonScore_medium: 720
ReasonScore_hard:   580

Key Insight: Why Geometric Mean?

Scenario: Task specialist model

task_scores = [0.95, 0.95, 0.95, ..., 0.05]  # Great at 11 tasks, fails 1 task

# Arithmetic mean (wrong):
arithmetic_mean = (11 × 0.95 + 0.05) / 12 = 0.875
# → Score: 875 (looks pretty good!)

# Geometric mean (correct):
geometric_mean = (0.95^11 × 0.05)^(1/12) = 0.743
# → Score: 743 (reveals the weakness)

Why this matters: In deployment, users encounter all task types. A catastrophic failure on one task makes the model unusable, regardless of strengths elsewhere.

Catastrophic failure protection:

# If one task scores near 0:
task_scores = [0.95, 0.95, ..., 0.01]
geometric_mean = (0.95^11 × 0.01)^(1/12) = 0.650
# → Score: 650 (heavily penalized)
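
Both scenarios can be checked with a short script (a sketch; the production geometric mean may guard against zeros or negatives differently):

import math

def geometric_mean(xs):
    # exp of the mean of logs; assumes all inputs are strictly positive
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

specialist = [0.95] * 11 + [0.05]
print(sum(specialist) / len(specialist))   # arithmetic mean ≈ 0.875
print(geometric_mean(specialist))          # geometric mean  ≈ 0.743

near_zero = [0.95] * 11 + [0.01]
print(geometric_mean(near_zero))           # ≈ 0.650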

The 1000× Scaling: Intentional Design

Why multiply by 1000?

  • Makes scores human-readable (850 vs 0.850)
  • Scales both the numerator and the denominator to the ~1000 range
  • Result: score/token naturally lives in [0, 1] for intuitive interpretation


Layer 4: The Uber-KPI (score/token)

Input

Three tier ReasonScores plus average token consumption

Process

Step 1: Arithmetic Mean Across Tiers

avg_score = (ReasonScore_easy + ReasonScore_medium + ReasonScore_hard) / 3

Step 2: Average Token Consumption

avg_tokens = (tokens_easy + tokens_medium + tokens_hard) / 3

Step 3: Compute Uber-KPI

score_per_token = avg_score / avg_tokens
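
A minimal sketch of Layer 4 (the tier names and dict arguments are assumptions for illustration):

def score_per_token(tier_scores: dict[str, float],
                    tier_tokens: dict[str, float]) -> float:
    """Arithmetic mean of tier ReasonScores divided by mean token consumption."""
    tiers = ("easy", "medium", "hard")
    avg_score = sum(tier_scores[t] for t in tiers) / len(tiers)
    avg_tokens = sum(tier_tokens[t] for t in tiers) / len(tiers)
    return avg_score / avg_tokens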

Output

Single number in [0, 1] range capturing everything:

Model A: score/token = 0.720
Model B: score/token = 0.680
Model C: score/token = 0.520

Key Insight: Why Switch to Arithmetic Mean?

Geometric mean already applied at Layer 3 (within each tier, punishing task imbalance).

At Layer 4, we're asking a different question:

  • Layer 3: "Are you consistently good across tasks?" (geometric)
  • Layer 4: "What's your overall performance level?" (arithmetic)

Example: Difficulty scaling collapse

ReasonScore_easy   = 850
ReasonScore_medium = 720
ReasonScore_hard   = 120  # Catastrophic collapse

# Geometric mean (too harsh):
geomean([850, 720, 120]) = 419  # Double jeopardy!
# (Already penalized within hard tier for task failures)

# Arithmetic mean (fair):
mean([850, 720, 120]) = 563  # Still hurt, but not catastrophically

Result: Model gets credit for strengths (easy/medium) while being penalized for collapse (hard).
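
Both numbers can be reproduced with the geometric_mean sketch from Layer 3:

tiers = [850, 720, 120]
print(geometric_mean(tiers))    # ≈ 419 (double jeopardy)
print(sum(tiers) / len(tiers))  # ≈ 563 (arithmetic mean, the one actually used)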

Why score/token is "The Uber-KPI"

Single number captures six dimensions:

Dimension      How It's Captured
Accuracy       Adjusted center in Layer 1
Confidence     Adjusted margin in Layer 1
Reliability    Truncation penalty in Layer 1
Balance        Geometric mean in Layer 3
Scaling        Arithmetic mean in Layer 4
Efficiency     Token division in Layer 4

Range [0, 1] is not accidental:

Perfect model:  score=1000, tokens=1000 → 1.0
Good model:     score=750,  tokens=1500 → 0.5
Poor model:     score=300,  tokens=2000 → 0.15
Terrible model: score=50,   tokens=5000 → 0.01


Complete Mathematical Flow

Example: Model X Evaluated at All Tiers

Tier = Easy

Arithmetic task (26 points):

# Layer 1: Each point has pre-computed Wilson CI
point_1: adjusted_successes=118.5, adjusted_trials=126.5, truncated=2/128
point_2: adjusted_successes=48.2,  adjusted_trials=62.2,  truncated=1/64
...

# Layer 2: Re-aggregate via PointsDB
total_successes = sum([118.5, 48.2, ...]) = 520.4
total_trials = sum([126.5, 62.2, ...]) = 614.8
total_truncated = 12, total_count = 640

task_center, task_margin = wilson_interval(520.4, 614.8)
# = (0.821, 0.048)

arithmetic_easy = 0.821 + 0.048 - (12/640) = 0.849

Repeat for 11 more tasks...

boolean_easy = 0.823
dates_easy = 0.765
...

Layer 3: Geometric mean across tasks

ReasonScore_easy = 1000 × geomean([0.849, 0.823, 0.765, ...])
                 = 1000 × 0.850
                 = 850

Tier = Medium

ReasonScore_medium = 720

Tier = Hard

ReasonScore_hard = 580

Layer 4: Uber-KPI

avg_score = (850 + 720 + 580) / 3 = 716.67
avg_tokens = (1180 + 1250 + 1380) / 3 = 1270

score_per_token = 716.67 / 1270 = 0.564
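
Plugging the example tier numbers into the score_per_token sketch from Layer 4 reproduces this result:

print(score_per_token(
    {"easy": 850, "medium": 720, "hard": 580},
    {"easy": 1180, "medium": 1250, "hard": 1380},
))  # ≈ 0.564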

Interpretation Guide

ReasonScore Per Tier

Range       Interpretation
900-1000    Near-perfect performance across all tasks
700-900     Strong overall with minor weaknesses
500-700     Good capability but notable gaps or truncation issues
300-500     Limited capability or severe task imbalance
100-300     Significant deficits across multiple domains
10-100      Catastrophic failures (geometric mean pulling score down)

Minimum ReasonScore: 10 (from 0.01 clamping in Layer 2 to prevent negative scores)
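
As a sketch of how the clamp combines with the Layer 3 formula (assuming, as stated above, that the floor is applied to task scores before the geometric mean; geometric_mean is the sketch from Layer 3):

def tier_reasonscore(task_scores: list[float]) -> float:
    # Floor each task score at 0.01 so negative or zero scores cannot
    # break the geometric mean, then scale by 1000.
    clamped = [max(s, 0.01) for s in task_scores]
    return 1000 * geometric_mean(clamped)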

Difficulty Scaling Profiles

Balanced Scaler:

Easy: 850, Medium: 720, Hard: 580
→ Graceful degradation, maintains capability ratios

Catastrophic Scaler:

Easy: 820, Medium: 650, Hard: 120
→ Collapses at high difficulty (task failures at hard tier)

Early Breaker:

Easy: 480, Medium: 460, Hard: 440
→ Already struggling at easy tier, never had the capability

score/token (Uber-KPI)

Range      Interpretation
0.7-1.0    Excellent quality + efficiency (deployment ready)
0.5-0.7    Good quality + reasonable efficiency (production viable)
0.3-0.5    Decent quality but inefficient, or poor quality but efficient
0.1-0.3    Poor quality and/or very inefficient (questionable viability)
0.0-0.1    Catastrophic (unusable for most purposes)

Comparative Rankings

Traditional benchmark (accuracy only):

Model B: 90% accuracy → Rank 1
Model A: 85% accuracy → Rank 2
Model C: 82% accuracy → Rank 3
Model D: 42% accuracy → Rank 4

ReasonScape (score/token):

Model A: 0.720 (85% @ 1180 tokens) → Rank 1 (best value)
Model B: 0.680 (90% @ 1320 tokens) → Rank 2 (accurate but inefficient)
Model C: 0.520 (82% @ 1580 tokens) → Rank 3 (decent but wasteful)
Model D: 0.180 (42% @ 2340 tokens) → Rank 4 (expensive failure)

Why rankings differ: Model A has best quality-per-cost ratio, even though Model B has higher raw accuracy.


Innovations

Compared to Traditional Benchmarks

MMLU, HumanEval, etc.:

  • Single aggregate accuracy (hides failure modes)
  • Fixed test sets (memorization risk)
  • No difficulty control (coarse-grained)
  • No truncation handling (conflated with wrong answers)
  • No token efficiency (ignores deployment cost)
  • No statistical rigor (no confidence intervals)

ReasonScape:

  • Multi-dimensional metric (accuracy + confidence + reliability + balance + scaling + efficiency)
  • Parametric generation (no memorization)
  • Controlled difficulty manifolds (fine-grained analysis)
  • Truncation tracked separately (practical deployment concern)
  • Token efficiency explicit (cost-aware)
  • Statistical rigor (Wilson CI, excess accuracy correction)

Key Innovations

1. Excess Accuracy Correction

Most benchmarks treat guessing as skill. ReasonScape removes guessing contribution.

2. Optimistic Margin Handling

Ensures perfect models score 1000 regardless of sample count (uncertainty is "our fault").

3. Pessimistic Truncation Handling

Context limits are practical failures, penalized directly ("model's fault").

4. Geometric Mean for Balance

Task imbalance severely penalizes score (reflects deployment reality).

5. Wilson CI Re-Aggregation

Preserves sample size information when aggregating across difficulty manifold.

6. Three-Tier Scaling Analysis

Reveals how models maintain (or lose) capability under pressure.

7. Token Efficiency Integration

Makes cost explicit in the final metric (deployment-focused).

8. [0, 1] Range by Design

Both numerator (~1000) and denominator (~1000) scaled for intuitive interpretation.


Design Decisions Summary

Why These Specific Choices?

Decision                  Rationale
Add margin                Perfect models should score 1000 regardless of samples
Subtract truncation       Context failures are model limitations, not uncertainty
Wilson CI twice           Layer 1: per-point stats; Layer 2: re-aggregate preserving sample sizes
Geometric mean (tasks)    Can't hide catastrophic task failures
Arithmetic mean (tiers)   Fair difficulty averaging (geometric would double-penalize)
1000× scaling             Makes score/token naturally live in [0, 1]
0.01 clamping             Prevents negative scores while still heavily penalizing failures
Divide by tokens          Makes deployment cost explicit

Usage in ReasonScape

Stage 4: Leaderboard (leaderboard.py)

python leaderboard.py data/dataset-m12x.json

Displays per-tier ReasonScores and overall score/token rankings.

Stage 5: Analysis (analyze.py scores)

python analyze.py scores data/dataset-m12x.json --output scores.md

Computes score/token with per-task breakdowns for filtered model sets.

PointsDB Integration

ReasonScore calculation relies on:

  • Layer 1 data: Stored in points table (adjusted_successes, adjusted_trials, etc.)
  • Layer 2 aggregation: Via aggregate() function (Wilson CI re-computation)
  • Layer 3-4 computation: In src/scores.py using tier faceting


Appendix: The Perfect Model

What should score 1000?

A model that:

  1. Gets all answers correct (100% accuracy above guessing)
  2. Never truncates (0% context failures)
  3. Maintains performance across all tasks (perfect balance)
  4. Maintains performance across all difficulty levels (perfect scaling)

With finite samples:

# 100 samples per point
point_score = 0.95 + 0.05 - 0.0 = 1.00  # Upper CI bound

# All points perfect
task_score = wilson_on_sum(perfect_points) = 1.00

# All tasks perfect
ReasonScore_tier = 1000 × geomean([1.0, 1.0, ...]) = 1000

# All tiers perfect
avg_score = (1000 + 1000 + 1000) / 3 = 1000

# Minimum possible tokens (1 token per point required to form an answer)
avg_tokens = 1

# Uber-KPI (maximum possible score_per_token)
score_per_token = 1000 / 1 = 1000

The design ensures a ceiling of 1000 on score_per_token (a ReasonScore of 1000 over a single token), with real models fairly ranked by their practical deployment value.