Position: Ranking Models

Definition: Compare models and produce a ranked ordering

Question: "Which model is better?"

Overview

Position workflows answer ranking questions. The right tool depends on two things: how many groups you're ranking across, and what question you're actually asking.

Single Group: Always Use cluster

When you have one task, one operation, one surface — any single group — the answer is always cluster. It tells you which models are statistically distinguishable from each other, with resource-aware tie-breaking via token sub-clusters. The other tools have nothing to add here.

python analyze.py cluster data/r12.json --filters '{"base_task": "arithmetic"}'
python analyze.py cluster data/tables-16k.json --filters '{"params.operation": "2"}'

This applies regardless of what you're comparing — different model families, different quants, different sizes. "Which of these is meaningfully different?" is a valid question for any cohort.

If you have a multi-task dataset but don't want to commit to an aggregation strategy, --facet-by none collapses all trials into one ranking as if they were a single task. This sidesteps the multi-task question entirely and is useful as a sanity check or when task identity genuinely doesn't matter:

python analyze.py cluster data/r12.json --facet-by none
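The core clustering idea can be sketched in a few lines. This is an illustrative simplification, not analyze.py's implementation: the model names, intervals, and the overlap-with-cluster-leader rule are all assumptions made for the example.

```python
# Illustrative sketch: group models into accuracy clusters by CI overlap.
# The overlap rule (compare against the cluster leader's lower bound) and
# all numbers are assumptions, not analyze.py internals.

def cluster_by_ci(models):
    """models: list of (name, ci_low, ci_high), sorted best-first.
    A model joins the current cluster while its CI still overlaps the
    cluster leader's lower bound; otherwise it starts a new cluster."""
    clusters = []
    for name, lo, hi in models:
        if clusters and hi >= clusters[-1][0][1]:
            clusters[-1].append((name, lo, hi))
        else:
            clusters.append([(name, lo, hi)])
    return clusters

models = [
    ("model-a", 0.90, 0.97),
    ("model-b", 0.88, 0.95),  # CI overlaps model-a's -> statistically tied
    ("model-c", 0.60, 0.70),  # no overlap -> distinguishable, new cluster
]
print([[m[0] for m in c] for c in cluster_by_ci(models)])
# [['model-a', 'model-b'], ['model-c']]
```

Models inside one cluster are statistical ties on accuracy; the real tool then breaks those ties with token sub-clusters.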

Multiple Groups: Three Questions, Three Tools

When you're ranking across multiple groups (tasks, operations, surfaces), there is no single correct answer — there are three valid interpretations of the question. Pick the one that matches what you actually want to know.

scores — Absolute Quality

Question: "How good is this model, penalising any weak spot?"

scores computes per-group P[Success] with confidence intervals, then combines them with a bootstrap geometric mean. The geometric mean punishes catastrophic failures: a model that scores 95% on eleven tasks but 5% on one will rank far lower than its average would suggest. This is the right framing when you want an absolute quality measurement and task-specific failures matter.

python analyze.py scores data/r12.json
python analyze.py scores data/r12.json --sort ratio        # rank by score/token efficiency
python analyze.py scores data/r12.json --group-by base_task --top 10

See ReasonScore methodology for the full statistical treatment.
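The geometric mean's penalty for weak spots is easy to see numerically. A minimal sketch of the "95% on eleven tasks, 5% on one" example from above (the aggregation here is the plain geometric mean, without the bootstrap the real tool uses):

```python
import math

# Eleven strong tasks, one catastrophic failure.
scores = [0.95] * 11 + [0.05]

arith = sum(scores) / len(scores)            # arithmetic mean
geo = math.prod(scores) ** (1 / len(scores)) # geometric mean

print(f"arithmetic mean: {arith:.3f}")  # 0.875 -- looks healthy
print(f"geometric mean:  {geo:.3f}")    # 0.743 -- the weak spot drags it down
```

A single collapsed task costs the model roughly thirteen points under the geometric mean versus the arithmetic average, which is exactly the behaviour you want when task-specific failures matter.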

rank — Relative Position by Cluster Aggregation

Question: "Where does each model land, relative to its peers, across every group?"

rank clusters models independently per group, then sums each model's cluster position (penalty) across all groups. A model that consistently lands in cluster 1 accumulates low penalty and ranks high. Token sub-clusters apply a +0.5 penalty for models that are accurate but less efficient than peers in the same accuracy cluster. Task identity is fully preserved: a bad cluster position on one group cannot be diluted by good performance elsewhere.

Models with missing data in any group are excluded from the ranking.

python analyze.py rank data/r12.json
python analyze.py rank data/tables-16k.json --group-by params.operation
python analyze.py rank data/r12.json --filters '{"groups": [["arch:moe"]]}' --format json
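The penalty-sum aggregation can be sketched as follows. All cluster positions and model names here are invented for illustration; the real tool derives positions from per-group clustering of the data.

```python
# Illustrative penalty-sum ranking. Each model's per-group cluster position
# is its penalty (cluster 1 = penalty 1); a +0.5 token penalty marks models
# that are accurate but less efficient than their cluster peers.
per_group = {
    "arithmetic": {"model-a": 1, "model-b": 1.5, "model-c": 2},  # b: +0.5 token penalty
    "sorting":    {"model-a": 2, "model-b": 1,   "model-c": 3},
}

totals = {}
for positions in per_group.values():
    for model, penalty in positions.items():
        totals[model] = totals.get(model, 0) + penalty

leaderboard = sorted(totals.items(), key=lambda kv: kv[1])
print(leaderboard)  # lowest total penalty ranks first
# [('model-b', 2.5), ('model-a', 3), ('model-c', 5)]
```

Note how model-b's bad-for-it token penalty on arithmetic stays visible in its total: a weak showing in one group cannot be averaged away by strength elsewhere.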

pairwise — Head-to-Head Win Rates

Question: "Who would beat whom in a direct matchup?"

pairwise computes N² head-to-head win probabilities using full CI distributions (not hard thresholds), then aggregates to overall rankings via Expected Wins and Bradley-Terry ratings. Task boundaries dissolve into an average win rate — the tool does not preserve per-task failures. This is the right framing when you need a direct substitution answer or want to resolve statistical ties from scores or rank.

python analyze.py pairwise data/r12.json
python analyze.py pairwise data/r12.json --sort bradley-terry
python analyze.py pairwise data/r12.json --format png --output-dir results/

pairwise is purely accuracy-based and does not account for token usage. If cost efficiency matters, use scores --sort ratio or rank first.
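The head-to-head computation can be sketched with paired bootstrap draws: P(A beats B) is the fraction of draws where A's sampled score exceeds B's. The Gaussian stand-in distributions below are assumptions for illustration — the real tool uses the full CI distributions from the data, and additionally aggregates to Expected Wins and Bradley-Terry ratings, which this sketch omits.

```python
import random

random.seed(0)

def win_prob(samples_a, samples_b):
    """Fraction of paired draws in which A outscores B."""
    wins = sum(a > b for a, b in zip(samples_a, samples_b))
    return wins / len(samples_a)

# Stand-in score distributions: A centred at 0.90, B at 0.85.
a = [random.gauss(0.90, 0.03) for _ in range(10_000)]
b = [random.gauss(0.85, 0.03) for _ in range(10_000)]

p = win_prob(a, b)
print(f"P(A beats B) ~ {p:.2f}")
```

Because the comparison uses full distributions rather than a hard threshold, two models with overlapping CIs still get a graded win probability instead of a coin-flip.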

Comparison Summary

| Tool     | Aggregation                    | Task failures | Resource-aware            | Output               |
|----------|--------------------------------|---------------|---------------------------|----------------------|
| cluster  | CI overlap within one group    | Visible       | Yes (sub-clusters)        | Cluster membership   |
| scores   | Geometric mean across groups   | Penalised     | Via --sort ratio          | Absolute score ×1000 |
| rank     | Penalty sum across groups      | Preserved     | Yes (sub-cluster penalty) | Ordinal leaderboard  |
| pairwise | Win rate average across groups | Dissolved     | No                        | Win probabilities    |

These tools give different answers because they answer different questions. Disagreement between them is informative, not a problem.

Filters

All position tools support --filters, --group-by, and --mode. Discover available values first:

python analyze.py evals data/<name>.json
python analyze.py tasks data/<name>.json

For complete filter syntax see the Filter Reference.

Output Formats

| Tool     | Formats                       |
|----------|-------------------------------|
| cluster  | markdown (default), json, png |
| scores   | markdown (default), json, png |
| rank     | markdown (default), json      |
| pairwise | markdown (default), png       |

All four tools use --output-dir to control where files are written (default: current directory). Filenames are generated automatically from the dataset name and active options.

Next Steps

Found unexpected failures? Profile to diagnose, or Probe to inspect raw traces

Got a clear winner? Profile single-model to characterise capabilities

Task-dependent wins and losses? Profile multi-model to compare failure modes

Statistical tie between two models? pairwise for head-to-head tie-breaking

Need to justify model selection? pairwise --format png for stakeholder visualisation

Tool Reference