
Workflow 2: Comparative Evaluation

Research Question: "Which models are statistically distinguishable?"

Duration: 5-10 minutes

Objective

Perform a rigorous statistical comparison of a set of models by accounting for confidence intervals. Identify which performance differences are real signal and which are measurement noise.

The cluster tool groups models into overlapping clusters where members are statistically indistinguishable. Models can belong to multiple clusters due to overlapping confidence intervals, revealing the true complexity of model comparisons.

When to Use This Workflow

  • You've identified 3-10 candidate models from Workflow 1: Ranking
  • Aggregate scores are similar and you need to know if differences are real
  • Comparing models from the same family (e.g., Llama 3.3 70B vs Llama 3.1 70B)
  • Evaluating whether your new model is truly better than baselines
  • Creating statistically-sound comparison reports

Primary Tool: cluster

The cluster subcommand:

  1. Computes confidence intervals for each model at the specified difficulty level
  2. Identifies maximal sets of models with overlapping CIs
  3. Groups models into clusters (members are statistically indistinguishable)
  4. Handles overlapping memberships (a model can belong to multiple clusters)
  5. Can split the analysis by tier (task groups), base_task (individual tasks), or surface (task sub-spaces)
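
To illustrate how these clusters form, here is a minimal Python sketch of the overlap logic (not the tool's actual implementation; the model names and intervals are hypothetical and match the output examples later in this page):

# Illustrative sketch of the clustering idea: models whose CIs pairwise
# overlap form a cluster, and only maximal such sets are reported.
# For 1-D intervals, every maximal pairwise-overlapping set shares a common
# point, and that point can always be taken to be some member's lower bound.
def maximal_clusters(cis):
    """cis: dict mapping model name -> (low, high) confidence interval."""
    candidates = []
    for _, (low, _high) in cis.items():
        # Every interval containing this lower bound overlaps all the others
        # that contain it, so this set is a valid cluster candidate.
        members = frozenset(m for m, (l, h) in cis.items() if l <= low <= h)
        candidates.append(members)
    # Keep only maximal candidates (drop sets strictly contained in another).
    return [set(c) for c in set(candidates)
            if not any(c < other for other in candidates)]

# Hypothetical CIs: B overlaps both A and C, so it lands in two clusters,
# while D overlaps nobody and remains a singleton.
cis = {"A": (0.86, 0.90), "B": (0.81, 0.87), "C": (0.79, 0.85), "D": (0.72, 0.78)}
print(maximal_clusters(cis))  # [{'A', 'B'}, {'B', 'C'}, {'D'}] in some order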

Basic Workflow

Step 1: Identify Candidates for Comparison

From Workflow 1, you should have:

  • 3-10 candidate models (eval_ids or filter criteria)
  • Tasks of interest (optional - you can compare across all tasks)

cd /home/mike/ai/reasonscape
source venv/bin/activate

# Verify your candidates
python analyze.py evals data/dataset-m12x.json --search "llama-3"
# Note eval_ids or group names

Step 2: Run Cluster Analysis

# Compare across all tasks (tier-level split)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"groups": ["family:llama"]}' \
    --stack tier \
    --format png --output-dir clusters-tier/

# Compare within specific task (base_task-level split)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2]}' \
    --stack base_task \
    --format png --output-dir clusters-task/

# Deep-dive: surface-level split (task sub-spaces)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"base_task": "arithmetic"}' \
    --stack surface \
    --format png --output-dir clusters-surface/

Stack dimensions explained:

  • tier - groups of related tasks (e.g., "memory", "logic", "format")
  • base_task - individual tasks (e.g., "arithmetic", "boolean", "dates")
  • surface - sub-spaces within tasks (e.g., "arithmetic-easy", "arithmetic-hard")

Step 3: Interpret Clustering Results

See Interpretation Guide below.

Step 4: Draw Conclusions

Typical outcomes:

  1. Clear winner: one model shares no cluster with any lower-ranked model → proceed to deployment
  2. Statistical ties: multiple models in the same cluster → select on cost, latency, or other criteria
  3. Task-specific differences: models cluster differently by task → proceed to Workflow 3: Characterization
  4. Unexpected patterns: proceed to Workflow 4: Diagnosis

Advanced Options

Filtering Strategies

# Compare specific eval_ids
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2, 3, 4]}' \
    --stack base_task

# Compare by family
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"groups": ["family:qwen", "family:llama"]}' \
    --stack tier

# Compare within size class
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"groups": ["size:large"]}' \
    --stack base_task

# Combine filters (AND logic)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"groups": ["family:llama", "size:large"]}' \
    --stack tier

Stack Dimension Selection

Use tier when:

  • You want a high-level comparison across task categories
  • Comparing overall reasoning capabilities
  • Creating executive summaries

Use base_task when:

  • You need a task-by-task breakdown
  • Models show high variance across tasks (identified in Workflow 1)
  • Investigating domain-specific performance

Use surface when:

  • You need fine-grained difficulty analysis
  • Models diverge at specific difficulty levels
  • Deep-diving a single task

Output Formats

# Markdown report (for analysis)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2]}' \
    --stack base_task --format markdown --output clusters.md

# PNG visualization (for human visual inspection)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2]}' \
    --stack base_task --format png --output-dir clusters/

Statistical Significance Rules

When comparing two models:

  • Same cluster → cannot confidently claim one is better (insufficient evidence)
  • Different clusters → the performance difference is statistically significant (p < 0.05 implied by non-overlapping 95% CIs)
  • Overlapping membership → the performance difference is marginal (edge case)

When comparing three or more models:

  • Maximal clusters → the tool finds the largest possible groupings
  • Transitivity is not guaranteed → A~B and B~C does not imply A~C, because CI overlap is not transitive (see the example below)
  • Multiple interpretations → focus on which models are NOT in any cluster together
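
A concrete illustration with hypothetical intervals, showing why overlap is not transitive:

# Hypothetical 95% CIs: A~B and B~C overlap, but A and C do not, so the
# maximal clusters are {A, B} and {B, C} and B belongs to both.
cis = {"A": (0.80, 0.86), "B": (0.84, 0.90), "C": (0.88, 0.94)}

def overlap(x, y):
    # Two intervals overlap iff the larger lower bound <= the smaller upper bound.
    return max(x[0], y[0]) <= min(x[1], y[1])

print(overlap(cis["A"], cis["B"]))  # True  (0.84 <= 0.86)
print(overlap(cis["B"], cis["C"]))  # True  (0.88 <= 0.90)
print(overlap(cis["A"], cis["C"]))  # False (0.88 >  0.86)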

Example Research Scenarios

Scenario 1: "Is my fine-tuned model better than base?"

# Step 1: Find eval_ids
python analyze.py evals data/dataset-m12x.json --search "my-model"
python analyze.py evals data/dataset-m12x.json --search "base-model"

# Step 2: Compare across all tasks
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"eval_id": [<my-eval>, <base-eval>]}' \
    --stack base_task --format png --output-dir my-vs-base/

# Step 3: Interpret results
# - If in separate clusters on most tasks → fine-tuning worked
# - If in same cluster → fine-tuning didn't produce significant improvement
# - If task-dependent → characterize which tasks improved

Outcome decision tree:

  • Separate clusters → claim improvement, proceed to deployment
  • Same cluster → need more samples or a different evaluation approach
  • Mixed → proceed to Workflow 3 to understand task-level differences

Scenario 2: "Llama 3.3 70B vs Qwen 2.5 72B - is size the only difference?"

# Compare at tier level (high-level)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"]}' \
    --stack tier --format png --output llama-vs-qwen-tier.png

# Deep-dive to task level
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"]}' \
    --stack base_task --format png --output llama-vs-qwen-tasks.png

# If differences emerge, go even deeper
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --stack surface --format png --output llama-vs-qwen-arithmetic-surface.png

Next step: If models cluster differently by task → Workflow 4: Diagnosis to understand why

Scenario 3: "Which Llama variant should I deploy?"

# Compare all Llama family models
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"groups": ["family:llama"]}' \
    --stack base_task --format png --output-dir llama-comparison/

# Examine output PNGs task-by-task
# - Identify which variants cluster together (interchangeable)
# - Identify which variants separate (meaningful differences)

Decision criteria:

  • Models in the same cluster → select the cheapest/fastest
  • Models in different clusters → select the highest-performing model you can afford

Scenario 4: "Are quantized models as good as full-precision?"

# Compare FP16 vs INT8 variants
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"groups": ["quant:fp16", "quant:int8"], "family": "llama"}' \
    --stack base_task --format png --output-dir quant-comparison/

Interpretation:

  • Same cluster → quantization doesn't hurt (within measurement error)
  • Different clusters → quantization introduces meaningful degradation
  • Task-specific clustering → some tasks are more sensitive to quantization

Scenario 5: "Version upgrade evaluation - is v1.6 better than v1.5?"

Real-world example: Apriel-1.6 vs 1.5 comparative analysis

# Step 1: Get eval_ids for both versions
python analyze.py evals data/dataset-m12x.json --search "Apriel"
# Note: eval_id 29 (v1.5), eval_id 74 (v1.6)

# Step 2: Compare at task level (default markdown output)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"eval_id": [29, 74]}' \
    --stack base_task

# Step 3: Check aggregate tier scores
python analyze.py scores data/dataset-m12x.json \
    --filters '{"eval_id": [29, 74]}' --format markdown

Interpretation checklist:

  1. Count per-task statistical differences:
     • How many tasks separate into different clusters? (expect: 1-3 of 12)
     • Which model wins on which tasks?
     • How many tasks show overlapping CIs (statistical ties)?

  2. Examine reliability metrics from the cluster table:
     • Truncation rates (Trunc % column) - target <10%
     • Invalid rates (Invalid % column) - target <1%
     • Which model has better reliability?

  3. Compare against aggregate scores:
     • Do aggregate scores show larger improvements than per-task clustering suggests?
     • If yes, where do the improvements come from? (reliability, difficulty scaling, etc.)

Typical outcome for version upgrades:

  • Few per-task accuracy wins (most tasks statistically tied)
  • Large aggregate score improvements (from reliability/robustness gains)
  • Dramatic truncation reductions (better context handling)
  • Mixed invalid rate changes (some tasks improve, regression on others)

Example results (Apriel case study):

Dimension           | Finding
--------------------|-----------------------------------------------------------
Per-task clustering | Tied on 10/12 tasks; 1.6 wins Sort, 1.5 wins Sequence
Aggregate scores    | 1.6: 842/757/647 vs 1.5: 798/666/480 (+34.8% hard tier)
Truncation rates    | 1.6: 9.4% avg vs 1.5: 19.5% avg (50% reduction!)
Invalid rates       | Mixed: 1.6 worse on Dates (8.8% vs 1.4%), better elsewhere
True advantage      | Reliability, not per-task accuracy

Recommendation strategy:

  • Upgrade IF: truncation is significantly reduced OR critical-task accuracy wins
  • Reconsider IF: invalid rates increase on production-critical tasks
  • Test further IF: statistically tied on all tasks AND similar reliability

Key lesson: Version upgrades optimize for production robustness (truncation, scaling, edge cases) more than per-task accuracy. Aggregate tier improvements reflect this even when per-task clustering shows ties.

Common Pitfalls

Pitfall 1: Too many models in comparison

Problem: Comparing 20+ models makes cluster plots unreadable.

Solution: Filter to 3-10 models per comparison. Run multiple comparisons if needed.

Pitfall 2: Wrong stack dimension

Problem: Using a surface split with cross-task filters creates an explosion of sub-splits.

Solution: Use surface only when filtering to a single base_task.

Pitfall 3: Ignoring overlapping clusters

Problem: Treating cluster membership as binary.

Solution: Models in multiple clusters are edge cases that need additional evaluation criteria.

Pitfall 4: Confusing statistical vs practical significance

Problem: A small but statistically significant difference may not matter for your application.

Solution: Consider cost, latency, and deployment constraints alongside statistical significance.

Pitfall 5: Misinterpreting aggregate score improvements

Problem: Large aggregate tier score improvements may not reflect per-task statistical superiority.

Example from the Apriel-1.6 vs 1.5 analysis:

  • Aggregate scores: 842/757/647 vs 798/666/480 (+34.8% on hard tier)
  • Clustering reality: tied on 10 of 12 tasks, wins 1, loses 1
  • True advantage: 50% lower truncation rate (9.4% vs 19.5%)

Why the discrepancy? ReasonScore directly penalizes truncation at Layer 1:

point_score = accuracy + margin - truncated_ratio

In the Apriel case:

  • 1.5: -19.5% truncation penalty → ~800 base score - 195 = ~605 effective
  • 1.6: -9.4% truncation penalty → ~800 base score - 94 = ~706 effective
  • Net gain: ~100 points from reliability alone
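
A quick arithmetic check of the numbers above (a sketch only; it assumes the ~1000-point aggregate scale used in this guide, and the tool's actual Layer 1 computation may differ in detail):

# Truncation subtracts its ratio directly from the point score, so on the
# ~1000-point aggregate scale the penalty is roughly truncated_ratio * 1000.
def effective_score(base_score, truncated_ratio):
    return base_score - truncated_ratio * 1000

print(effective_score(800, 0.195))  # ~605 effective for v1.5
print(effective_score(800, 0.094))  # ~706 effective for v1.6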

The large aggregate improvements reflect real production value (fewer failed responses), even when per-task accuracy is statistically tied.

Clustering reveals: where models differ on accuracy (10/12 ties)
Aggregate scores reveal: where models differ on production reliability

Solution: Always run clustering analysis after observing large aggregate differences. Check:

  • Which tasks show statistical separation?
  • What are the truncation rates? (cluster table: Trunc % column)
  • What are the invalid rates? (cluster table: Invalid % column)
  • Where do the aggregate improvements actually come from?

Correct interpretation: "Aggregate scores show +35% improvement, driven by 50% truncation reduction and better difficulty scaling, not per-task accuracy gains."

Wrong interpretation: "Model is 35% better at all tasks."

Pitfall 6: LLM agents trying to read PNG files

Problem: LLM agents attempting to interpret cluster PNG visualizations instead of reading structured output.

Why it's a problem:

  • LLMs have poor PNG comprehension
  • PNG visualizations are for human inspection (visual markers like 🔴 red X for truncation, ⚠️ yellow ? for invalid)
  • All of the data is available in markdown/JSON formats that are trivial to parse

Solution for LLM agents:

# ALWAYS request markdown output (default)
python analyze.py cluster data/dataset-m12x.json \
    --filters '{...}' --stack base_task

# Or JSON for programmatic access
python analyze.py cluster data/dataset-m12x.json \
    --filters '{...}' --stack base_task --format json

The markdown table includes everything an agent needs (see the parsing sketch below):

  • Cluster membership with confidence intervals
  • Truncation % (Trunc column)
  • Invalid % (Invalid column)
  • Token usage (Tokens column)
  • All in an easily parseable format
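
A minimal parsing sketch, assuming the cluster table is a standard pipe-delimited markdown table with headers along the lines of "Trunc %" and "Invalid %"; the file name and column headers here are assumptions, so adjust them to the tool's actual output:

# Hypothetical helper: read a pipe-delimited markdown table and flag rows
# whose reliability columns exceed the thresholds from this guide
# (>10% truncation, >1% invalid). Header names are assumptions.
def parse_markdown_table(text):
    rows = [line for line in text.splitlines()
            if line.strip().startswith("|") and "---" not in line]
    header = [c.strip() for c in rows[0].strip("|").split("|")]
    return [dict(zip(header, [c.strip() for c in row.strip("|").split("|")]))
            for row in rows[1:]]

with open("clusters.md") as f:          # hypothetical --output file
    table = parse_markdown_table(f.read())

for row in table:
    trunc = float(row.get("Trunc %", "0").rstrip("%") or 0)
    invalid = float(row.get("Invalid %", "0").rstrip("%") or 0)
    if trunc > 10 or invalid > 1:
        print("Reliability flag:", row)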

Correct agent workflow:

  1. Read the markdown table
  2. Parse cluster membership, accuracy, truncation, and invalid rates
  3. Report findings: "Models A+B statistically tied on accuracy BUT Model A has 50% lower truncation"
  4. Tell the user: "Please check [filename].png for visual confirmation of clusters and markers"

Don't: Try to interpret PNG - you will miss critical data and make errors.

Output Reference

PNG Format (Primary Output)

Each PNG contains:

  • X-axis: models being compared
  • Y-axis: performance metric (score with 95% CI error bars)
  • Colors: cluster membership (same color = statistically indistinguishable)
  • Title: split dimension and filter criteria

One PNG per split (e.g., one per task when using --stack base_task)

Markdown Format

## Clustering Results: base_task=arithmetic

### Cluster 1
- Model A (score: 0.88, CI: [0.86, 0.90])
- Model B (score: 0.84, CI: [0.81, 0.87])

### Cluster 2
- Model B (score: 0.84, CI: [0.81, 0.87])
- Model C (score: 0.82, CI: [0.79, 0.85])

### Singleton
- Model D (score: 0.75, CI: [0.72, 0.78])

*Note: Model B appears in both clusters due to overlapping confidence intervals*

JSON Format

{
  "clusters": [
    {
      "split": "base_task=arithmetic",
      "cluster_id": 0,
      "members": [
        {"eval_id": 0, "model_name": "Model A", "score": 0.85, "ci": [0.82, 0.88]},
        {"eval_id": 1, "model_name": "Model B", "score": 0.84, "ci": [0.81, 0.87]}
      ]
    },
    {
      "split": "base_task=arithmetic",
      "cluster_id": 1,
      "members": [
        {"eval_id": 1, "model_name": "Model B", "score": 0.84, "ci": [0.81, 0.87]},
        {"eval_id": 2, "model_name": "Model C", "score": 0.82, "ci": [0.79, 0.85]}
      ]
    }
  ],
  "singletons": [
    {"eval_id": 3, "model_name": "Model D", "score": 0.75, "ci": [0.72, 0.78]}
  ]
}
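
A minimal sketch for consuming the JSON output shown above (field names follow the schema illustrated in this guide; "clusters.json" is a hypothetical file holding the tool's --format json output):

import json

with open("clusters.json") as f:
    report = json.load(f)

# Report each maximal cluster per split: members cannot be distinguished.
for cluster in report["clusters"]:
    names = [m["model_name"] for m in cluster["members"]]
    print(f'{cluster["split"]} / cluster {cluster["cluster_id"]}: '
          f'statistically indistinguishable: {", ".join(names)}')

# Singletons are statistically separated from every other model.
for model in report.get("singletons", []):
    lo, hi = model["ci"]
    print(f'{model["model_name"]}: stands alone '
          f'(score {model["score"]}, CI [{lo}, {hi}])')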

Understanding Confidence Intervals

What the error bars mean:

  • 95% confidence interval on the excess-accuracy-corrected score
  • "We are 95% confident the true score lies within this range"
  • Wider bars → more uncertainty (fewer samples, higher variance)
  • Narrower bars → more confidence (more samples, lower variance)

Why CIs matter:

  • Two models with overlapping CIs cannot be distinguished with 95% confidence
  • Non-overlapping CIs → statistically significant difference (p < 0.05)
  • The clustering algorithm maximizes groups of overlapping CIs

How to improve CI width:

  • Increase sample count during evaluation (Stage 2: runner precision)
  • Focus on specific difficulty ranges (reduces variance)
  • Use surface-level splits (more homogeneous sub-populations)
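
For intuition about how sample count drives CI width, here is a generic normal-approximation sketch (the tool's excess-accuracy-corrected interval is computed differently in detail, so treat this only as a rule of thumb):

import math

# Normal-approximation 95% CI for an accuracy-like estimate p_hat from n samples.
# Width shrinks with 1/sqrt(n): 4x the samples gives roughly half the width.
def ci_95(p_hat, n):
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return (p_hat - 1.96 * se, p_hat + 1.96 * se)

print(ci_95(0.85, 400))   # ~(0.815, 0.885)  -> +/- 0.035
print(ci_95(0.85, 1600))  # ~(0.833, 0.867)  -> +/- 0.018, about half as wide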

Tips for LLM Agents

Critical: LLMs have poor PNG comprehension. Always use markdown/JSON output.

If you're an LLM agent using this workflow:

  1. Run cluster with markdown output (default):

    python analyze.py cluster <dataset> --filters <...> --stack base_task
    # Outputs markdown by default - READ IT

  2. Parse the markdown tables directly for three dimensions:
     • Accuracy: cluster membership, confidence intervals
     • Reliability: Truncation % (Trunc column) - flag if >10%
     • Validity: Invalid % (Invalid column) - flag if >1%

  3. Report findings with the complete picture:

    "Models A and B are statistically tied on accuracy (same cluster, overlapping CIs)
    BUT Model A has 50% lower truncation (9% vs 19%)
    AND Model A has better validity (1% invalid vs 9%)
    CONCLUSION: Despite accuracy tie, Model A superior for production reliability"

  4. Handle aggregate vs per-task discrepancies:
     • If aggregate scores show a large improvement but clustering shows ties, check truncation rates - this is usually where the improvement comes from
     • Report: "Aggregate tier improvement driven by reliability (truncation reduction), not per-task accuracy gains"

  5. Tell the user to inspect the PNG for visual confirmation:
     • "Please check [filename].png to see cluster visualization"
     • "Red X markers indicate high truncation, yellow ? indicates invalid responses"
     • Don't try to interpret the PNG yourself

  6. Recommend next steps:
     • Same cluster + similar reliability → select on cost/latency
     • Same cluster + better reliability → recommend the reliability winner
     • Different clusters → characterize the winner (Workflow 3)
     • Task-dependent → diagnose differences (Workflow 4)

Tool Documentation