Position: Ranking Models

Definition: Compare models and produce a ranked ordering

Question: "Which model is better?"

Overview

Position workflows answer ranking questions. The right tool depends on two things: how many groups you're ranking across, and what question you're actually asking.

Single Group: Always Use cluster

When you have one task, one operation, one surface — any single group — the answer is always cluster. It tells you which models are statistically distinguishable from each other, with resource-aware tie-breaking via token sub-clusters. The other tools have nothing to add here.

python analyze.py cluster data/r12.json --filters '{"base_task": "arithmetic"}'
python analyze.py cluster data/tables-16k.json --filters '{"params.operation": "2"}'

This applies regardless of what you're comparing — different model families, different quants, different sizes. "Which of these is meaningfully different?" is a valid question for any cohort.

If you have a multi-task dataset but don't want to commit to an aggregation strategy, --facet-by none collapses all trials into one ranking as if they were a single task. This sidesteps the multi-task question entirely and is useful as a sanity check or when task identity genuinely doesn't matter:

python analyze.py cluster data/r12.json --facet-by none
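The core clustering idea can be sketched in a few lines. This is an illustrative simplification, not analyze.py's implementation: the model names, intervals, and the overlap-with-cluster-leader rule are all assumptions made for the example.

```python
# Illustrative sketch: group models into accuracy clusters by CI overlap.
# The overlap rule (compare against the cluster leader's lower bound) and
# all numbers are assumptions, not analyze.py internals.

def cluster_by_ci(models):
    """models: list of (name, ci_low, ci_high), sorted best-first.
    A model joins the current cluster while its CI still overlaps the
    cluster leader's lower bound; otherwise it starts a new cluster."""
    clusters = []
    for name, lo, hi in models:
        if clusters and hi >= clusters[-1][0][1]:
            clusters[-1].append((name, lo, hi))
        else:
            clusters.append([(name, lo, hi)])
    return clusters

models = [
    ("model-a", 0.90, 0.97),
    ("model-b", 0.88, 0.95),  # CI overlaps model-a's -> statistically tied
    ("model-c", 0.60, 0.70),  # no overlap -> distinguishable, new cluster
]
print([[m[0] for m in c] for c in cluster_by_ci(models)])
# [['model-a', 'model-b'], ['model-c']]
```

Models inside one cluster are statistical ties on accuracy; the real tool then breaks those ties with token sub-clusters.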

Multiple Groups: Three Questions, Three Tools

When you're ranking across multiple groups (tasks, operations, surfaces), there is no single correct answer — there are three valid interpretations of the question. Pick the one that matches what you actually want to know.

scores — Absolute Quality

Question: "How good is this model, penalising any weak spot?"

scores computes per-group P[Success] with confidence intervals, then combines them with a bootstrap geometric mean. The geometric mean punishes catastrophic failures: a model that scores 95% on eleven tasks but 5% on one will rank far lower than its average would suggest. This is the right framing when you want an absolute quality measurement and task-specific failures matter.

python analyze.py scores data/r12.json
python analyze.py scores data/r12.json --sort ratio        # rank by score/token efficiency
python analyze.py scores data/r12.json --group-by base_task --top 10

See ReasonScore methodology for the full statistical treatment.
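The geometric mean's penalty for weak spots is easy to see numerically. A minimal sketch of the "95% on eleven tasks, 5% on one" example from above (the aggregation here is the plain geometric mean, without the bootstrap the real tool uses):

```python
import math

# Eleven strong tasks, one catastrophic failure.
scores = [0.95] * 11 + [0.05]

arith = sum(scores) / len(scores)            # arithmetic mean
geo = math.prod(scores) ** (1 / len(scores)) # geometric mean

print(f"arithmetic mean: {arith:.3f}")  # 0.875 -- looks healthy
print(f"geometric mean:  {geo:.3f}")    # 0.743 -- the weak spot drags it down
```

A single collapsed task costs the model roughly thirteen points under the geometric mean versus the arithmetic average, which is exactly the behaviour you want when task-specific failures matter.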

rank — Relative Position by Cluster Aggregation

Question: "Where does each model land, relative to its peers, across every group?"

rank clusters models independently per group, then sums each model's cluster position (penalty) across all groups. A model that consistently lands in cluster 1 accumulates low penalty and ranks high. Token sub-clusters apply a +0.5 penalty for models that are accurate but less efficient than peers in the same accuracy cluster. Task identity is fully preserved: a bad cluster position on one group cannot be diluted by good performance elsewhere.

Models with missing data in any group are excluded from the ranking.

python analyze.py rank data/r12.json
python analyze.py rank data/tables-16k.json --group-by params.operation
python analyze.py rank data/r12.json --filters '{"groups": [["arch:moe"]]}' --format json
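The penalty-sum aggregation can be sketched as follows. All cluster positions and model names here are invented for illustration; the real tool derives positions from per-group clustering of the data.

```python
# Illustrative penalty-sum ranking. Each model's per-group cluster position
# is its penalty (cluster 1 = penalty 1); a +0.5 token penalty marks models
# that are accurate but less efficient than their cluster peers.
per_group = {
    "arithmetic": {"model-a": 1, "model-b": 1.5, "model-c": 2},  # b: +0.5 token penalty
    "sorting":    {"model-a": 2, "model-b": 1,   "model-c": 3},
}

totals = {}
for positions in per_group.values():
    for model, penalty in positions.items():
        totals[model] = totals.get(model, 0) + penalty

leaderboard = sorted(totals.items(), key=lambda kv: kv[1])
print(leaderboard)  # lowest total penalty ranks first
# [('model-b', 2.5), ('model-a', 3), ('model-c', 5)]
```

Note how model-b's bad-for-it token penalty on arithmetic stays visible in its total: a weak showing in one group cannot be averaged away by strength elsewhere.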

pairwise — Head-to-Head Win Rates

Question: "Who would beat whom in a direct matchup?"

pairwise computes N² head-to-head win probabilities using full CI distributions (not hard thresholds), then aggregates to overall rankings via Expected Wins and Bradley-Terry ratings. Task boundaries dissolve into an average win rate — the tool does not preserve per-task failures. This is the right framing when you need a direct substitution answer or want to resolve statistical ties from scores or rank.

python analyze.py pairwise data/r12.json
python analyze.py pairwise data/r12.json --sort bradley-terry
python analyze.py pairwise data/r12.json --format png --output-dir results/

pairwise is purely accuracy-based and does not account for token usage. If cost efficiency matters, use scores --sort ratio or rank first.
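The head-to-head computation can be sketched with paired bootstrap draws: P(A beats B) is the fraction of draws where A's sampled score exceeds B's. The Gaussian stand-in distributions below are assumptions for illustration — the real tool uses the full CI distributions from the data, and additionally aggregates to Expected Wins and Bradley-Terry ratings, which this sketch omits.

```python
import random

random.seed(0)

def win_prob(samples_a, samples_b):
    """Fraction of paired draws in which A outscores B."""
    wins = sum(a > b for a, b in zip(samples_a, samples_b))
    return wins / len(samples_a)

# Stand-in score distributions: A centred at 0.90, B at 0.85.
a = [random.gauss(0.90, 0.03) for _ in range(10_000)]
b = [random.gauss(0.85, 0.03) for _ in range(10_000)]

p = win_prob(a, b)
print(f"P(A beats B) ~ {p:.2f}")
```

Because the comparison uses full distributions rather than a hard threshold, two models with overlapping CIs still get a graded win probability instead of a coin-flip.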

Comparison Summary

| Tool     | Aggregation                    | Task failures | Resource-aware            | Output               |
|----------|--------------------------------|---------------|---------------------------|----------------------|
| cluster  | CI overlap within one group    | Visible       | Yes (sub-clusters)        | Cluster membership   |
| scores   | Geometric mean across groups   | Penalised     | Via --sort ratio          | Absolute score ×1000 |
| rank     | Penalty sum across groups      | Preserved     | Yes (sub-cluster penalty) | Ordinal leaderboard  |
| pairwise | Win rate average across groups | Dissolved     | No                        | Win probabilities    |

These tools give different answers because they answer different questions. Disagreement between them is informative, not a problem.

Filters

All position tools support --filters, --group-by, and --mode. Discover available values first:

python analyze.py evals data/<name>.json
python analyze.py tasks data/<name>.json

For complete filter syntax see the Filter Reference.

Output Formats

| Tool     | Formats                       |
|----------|-------------------------------|
| cluster  | markdown (default), json, png |
| scores   | markdown (default), json, png |
| rank     | markdown (default), json      |
| pairwise | markdown (default), png       |

All four tools use --output-dir to control where files are written (default: current directory). Filenames are generated automatically from the dataset name and active options.

Next Steps

Found unexpected failures? Profile to diagnose, or Probe to inspect raw traces

Got a clear winner? Profile single-model to characterise capabilities

Task-dependent wins and losses? Profile multi-model to compare failure modes

Statistical tie between two models? pairwise for head-to-head tie-breaking

Need to justify model selection? pairwise --format png for stakeholder visualisation

Tool Reference