Workflow 1: Ranking & Benchmarking¶
Research Question: "What's the best model within a group, or overall?"
Duration: 2-3 minutes
Objective¶
Generate aggregate rankings across all models and tasks to identify top candidates for deeper investigation. This workflow produces a single unified metric (score/token) that balances:
- Accuracy (correctness of answers)
- Truncation (the score penalizes models that hit context limits)
- Resource utilization (token consumption per task)
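For example (illustrative numbers only), a model with a ReasonScore of 650 that averages 2,000 tokens per task has a score/token of 650 / 2000 ≈ 0.33; a model reaching the same score on half the tokens would rank twice as high on this metric.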
When to Use This Workflow¶
- Initial triage of multiple models
- Selecting 3-5 candidates for comparative evaluation
- Quick assessment of new model releases
- Creating benchmark reports for publication
- Answering "which model is best overall?" questions
Primary Tool: scores¶
The scores subcommand computes aggregate rankings by:
- Computing excess-accuracy-corrected scores per task
- Weighting by token efficiency (score/token)
- Aggregating across all tasks in the dataset
- Producing sortable rankings with metadata
Basic Workflow¶
Step 1: Discover Available Evaluations¶
cd /home/mike/ai/reasonscape
source venv/bin/activate
# List all evaluations in the dataset
python analyze.py evals data/dataset-m12x.json --format table
What you get:
- Table of all models with eval_ids
- Group memberships (family, size, architecture, runtime)
- Metadata for filtering
Decision point: Note interesting groups or families for filtering (optional)
Step 2: Generate Rankings¶
# Generate rankings for all models
python analyze.py scores data/dataset-m12x.json --output scores.md
# Or filter to specific groups
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["size:large"]}' \
--output scores-large.md
What you get:
- Markdown table sorted by score, with score/token available for comparison
- Per-tier x per-task scores for each model
- Token consumption statistics (if enabled with --show-token-counts)
- Group/family metadata
Output modes:
- scores.md - Human-readable rankings
- scores.json - Machine-readable data (with --format json)
- scores.png - Visual heatmap (with --format png)
Step 3: Interpret Results¶
See Understanding the Markdown Scores Table below.
Step 4: Select Candidates¶
Typical selection criteria:
- Top 3-5 models by score/token
- Look for low truncations and low invalids
- Models with interesting task-level variance
- Models from different families/architectures for comparison
Next step: Proceed to Workflow 2: Comparative Evaluation with selected candidates
Advanced Options¶
Filtering by Groups¶
# Compare only Llama family models
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["family:llama"]}'
# Compare large models only
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["size:large"]}'
# Combine filters (AND logic)
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["family:qwen", "size:large"]}'
Filtering by Specific Models¶
# Compare specific eval_ids
python analyze.py scores data/dataset-m12x.json \
--filters '{"eval_id": [0, 1, 2, 3]}'
# Use fuzzy search first to find eval_ids
python analyze.py evals data/dataset-m12x.json --search "gpt-4"
# Then use discovered eval_ids
Output Formats¶
# Markdown table (default)
python analyze.py scores data/dataset-m12x.json --output scores.md
# JSON for programmatic access
python analyze.py scores data/dataset-m12x.json \
--format json --output scores.json
# PNG heatmap for reports
python analyze.py scores data/dataset-m12x.json \
--format png --output scores.png
Understanding the Markdown Scores Table¶
Columns:
- Model - Model name/identifier
- Tier - Difficulty tier (easy/medium/hard)
- ReasonScore - Unified metric (higher is better, geometric mean × 1000)
- Avg Tokens - Mean token consumption per task
- Score/Token - Efficiency ratio (ReasonScore / Avg Tokens)
- Task1, Task2, ... - Per-task scores with detail indicators
Cell Format in Per-Task Columns:
Each task cell shows an accuracy range with optional indicators:
.XX - .YY [-.ZZ] (N/M)
│     │    │     └──── Optional: Completion ratio (N points collected / M expected)
│     │    └─────────── Optional: Truncation penalty (shown if > 2%)
│     └──────────────── Upper bound (center + margin, 95% CI)
└────────────────────── Lower bound (center accuracy)
Examples:
- .89 - .91 - Normal: 89-91% accuracy, no significant truncation
- .67 - .69 [-.11] - Truncated: 67-69% accuracy, 11% truncation penalty
- .75 - .79 [-.02] - Minor truncation: 75-79% accuracy, 2% truncation penalty
- .45 - .48 (123/200) - Incomplete: 45-48% accuracy, only 123/200 points collected
- .55 - .58 [-.22] (89/100) - Both: Truncated AND incomplete
- — - Missing: No data for this task
Aggregate Score Indicators:
- 771* - Asterisk indicates at least one task has incomplete data
- 771 - No asterisk means all tasks have complete data
What Each Metric Means:
- Center (lower bound): Mean accuracy across all test points
- Margin: 95% confidence margin; the displayed upper bound is center + margin
- Truncation penalty: Fraction of responses that hit context limits
- Completion ratio: Points collected versus expected when the evaluation stopped early (time/cost limits)
Reading Patterns:
- Wide ranges (.67 - .79) → High variance, less confident estimate
- Narrow ranges (.89 - .91) → Consistent performance, confident estimate
- Large truncation [-.48] → Model struggling with context limits
- Incomplete (N/M) → Evaluation stopped early, limited data
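To make the cell format concrete, here is a minimal illustrative sketch in Python (not the tool's actual rendering code) that assembles a cell string from the underlying statistics; the field names (center, margin, truncated_ratio, point_count, expected_points) mirror the JSON output described in the next section:
def format_cell(center, margin, truncated_ratio, point_count, expected_points):
    # Lower bound is the center accuracy; upper bound adds the 95% CI margin
    cell = f"{center:.2f} - {center + margin:.2f}".replace("0.", ".")
    # Truncation penalty is only displayed when it exceeds 2%
    if truncated_ratio > 0.02:
        cell += f" [-{truncated_ratio:.2f}]".replace("0.", ".")
    # Completion ratio appears when fewer points were collected than expected
    if point_count < expected_points:
        cell += f" ({point_count}/{expected_points})"
    return cell

print(format_cell(0.67, 0.02, 0.11, 89, 100))   # ".67 - .69 [-.11] (89/100)"
print(format_cell(0.89, 0.02, 0.00, 200, 200))  # ".89 - .91"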
Understanding the JSON Format (scores.json)¶
The JSON format provides the complete hierarchical structure with full statistical detail:
[
  {
    "eval_id": 1,
    "model": "Phi-4 Reasoning (FP16)",
    "reasonscore": 649.53,
    "tiers": {
      "easy": {
        "reasonscore": 771.44,
        "any_incomplete": false,
        "tasks": {
          "arithmetic": {
            "center": 0.808,
            "margin": 0.018,
            "truncated_ratio": 0.015,
            "completion_tokens_mean": 2232.21,
            "adjusted_score": 0.811,
            "point_count": 26,
            "expected_points": 26,
            "is_incomplete": false
            // ... additional statistical fields
          },
          "boolean": { /* ... */ },
          // ... other tasks
        }
      },
      "medium": { /* ... */ },
      "hard": { /* ... */ }
    }
  },
  // ... other evaluations
]
Key differences from markdown format:
- Array of evaluation objects (not wrapped in a "rankings" key)
- Full tier breakdown with complete task statistics
- All raw statistical values (center, margin, truncated_ratio, etc.)
- Token consumption details per task
- Completion status (point_count, expected_points, is_incomplete)
- No pre-formatted strings - all numeric values for programmatic access
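As a minimal sketch of programmatic access (assuming a file generated with --format json --output scores.json and the field names shown above), the rankings can be consumed like this:
import json

# Load scores.json and list models by aggregate ReasonScore.
# Field names follow the structure shown above; adjust if your version differs.
with open("scores.json") as f:
    evals = json.load(f)

for entry in sorted(evals, key=lambda e: e["reasonscore"], reverse=True):
    print(f'{entry["model"]:40s} {entry["reasonscore"]:8.2f}')

# Drill into one task's statistics for the top-ranked model
top = max(evals, key=lambda e: e["reasonscore"])
arithmetic = top["tiers"]["easy"]["tasks"]["arithmetic"]
print(arithmetic["center"], arithmetic["margin"], arithmetic["truncated_ratio"])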
Understanding the PNG Output Format¶
Heat-bar visualization:
- Rows: Models (sorted by score/token) grouped into Tiers
- Columns: Tasks
- Color: Performance (red=low, green=high)
- Gray Crosshatch: Truncations
- Useful for pattern recognition and presentations
Example Research Scenarios¶
Scenario 1: "I need to select a model for production"¶
# Step 1: Get complete rankings
python analyze.py scores data/dataset-m12x.json --output scores.md
# Step 2: Filter to production-viable sizes
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["size:medium", "size:large"]}' \
--output scores-production.md
# Step 3: Identify top 3 candidates by inspecting scores-production.md
# Step 4: Proceed to Workflow 3 for detailed characterization
Next step: Workflow 3: Model Characterization
Scenario 2: "How does my new model compare to baselines?"¶
# Step 1: Find eval_id for your model
python analyze.py evals data/dataset-m12x.json --search "my-model"
# Step 2: Find comparable baselines by filtering on groups
# Example: Find all small models for fair comparison
python analyze.py evals data/dataset-m12x.json --groups size:small
# Or combine multiple group facets (AND logic)
python analyze.py evals data/dataset-m12x.json --groups size:small,family:llama
# Step 3: Compare your model against discovered baselines
python analyze.py scores data/dataset-m12x.json \
--filters '{"eval_id": [<your-eval-id>, <baseline-1>, <baseline-2>]}' \
--output comparison.md
# Alternatively, compare against all models in a group
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["size:small"]}' \
--output comparison-small.md
# Step 4: Interpret relative rankings by inspecting comparison.md
# Step 5: If differences are small, proceed to statistical comparison
Next step: Workflow 2: Comparative Evaluation
Scenario 3: "Which Llama variant is best?"¶
# Compare all Llama family models
python analyze.py scores data/dataset-m12x.json \
--filters '{"groups": ["family:llama"]}' \
--format png --output llama-comparison.png
Next step: Workflow 2: Comparative Evaluation to determine statistical significance
Common Pitfalls¶
Pitfall 1: Comparing models at different scales¶
Problem: Comparing 7B and 70B models directly
Solution: Use --filters '{"groups": ["size:medium"]}' to compare within size class
Pitfall 2: Ignoring task variance¶
Problem: Selecting a model solely on aggregate score
Solution: Inspect per-task scores for domain-specific requirements
Pitfall 3: Confusing score with accuracy¶
Problem: Expecting scores in the [0, 1] range
Solution: Remember that score/token is a rate metric, not a percentage
Pitfall 4: Not accounting for context limits¶
Problem: High-truncation models appear worse than they are
Solution: Check truncation rates in the per-task breakdown
Next Steps¶
- Top 3 similar scores → Workflow 2: Comparative Evaluation - Statistical comparison of similar candidates
- Candidate selected → Workflow 3: Model Characterization - Deep-dive the selected model
- Unexpected failure → Workflow 4: Failure Diagnosis - Root-cause analysis of poor performance
Tool Documentation¶
- analyze.py scores reference - Complete command documentation
- PointsDB API - Understanding the data structure
- Methodology - Statistical foundations