
Workflow 1: Ranking & Benchmarking

Research Question: "What's the best model within a group, or overall?"

Duration: 2-3 minutes

Objective

Generate aggregate rankings across all models and tasks to identify top candidates for deeper investigation. This workflow produces a single unified metric (score/token) that balances:

  • Accuracy (correctness of answers)
  • Truncation (the score penalizes models that hit context limits)
  • Resource utilization (token consumption per task)

When to Use This Workflow

  • Initial triage of multiple models
  • Selecting 3-5 candidates for comparative evaluation
  • Quick assessment of new model releases
  • Creating benchmark reports for publication
  • Answering "which model is best overall?" questions

Primary Tool: scores

The scores subcommand computes aggregate rankings by:

  1. Computing excess-accuracy-corrected scores per task
  2. Weighting by token efficiency (score/token)
  3. Aggregating across all tasks in the dataset
  4. Producing sortable rankings with metadata
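
A minimal sketch of how such a ranking could be assembled, assuming per-task adjusted scores are combined with a geometric mean scaled by 1000 (as described under Understanding the Markdown Scores Table below) and divided by mean token consumption; the actual analyze.py internals may differ:

from math import prod

def reasonscore(adjusted_scores):
    # Geometric mean of per-task adjusted scores, scaled by 1000
    return prod(adjusted_scores) ** (1 / len(adjusted_scores)) * 1000

def score_per_token(adjusted_scores, mean_tokens):
    # Efficiency ratio used for ranking: ReasonScore / Avg Tokens
    return reasonscore(adjusted_scores) / mean_tokens

# Illustrative values only: three tasks, ~2200 completion tokens per task
print(score_per_token([0.81, 0.74, 0.92], mean_tokens=2200.0))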

Basic Workflow

Step 1: Discover Available Evaluations

cd /home/mike/ai/reasonscape
source venv/bin/activate

# List all evaluations in the dataset
python analyze.py evals data/dataset-m12x.json --format table

What you get:

  • Table of all models with eval_ids
  • Group memberships (family, size, architecture, runtime)
  • Metadata for filtering

Decision point: Note interesting groups or families for filtering (optional)

Step 2: Generate Rankings

# Generate rankings for all models
python analyze.py scores data/dataset-m12x.json --output scores.md

# Or filter to specific groups
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:large"]}' \
    --output scores-large.md

What you get:

  • Markdown table sorted by score, with score/token available for comparison
  • Per-tier x Per-task scores for each model
  • Token consumption statistics (if enabled with --show-token-counts)
  • Group/family metadata

Output modes:

  • scores.md - Human-readable rankings
  • scores.json - Machine-readable data (with --format json)
  • scores.png - Visual heatmap (with --format png)

Step 3: Interpret Results

See Understanding the Markdown Scores Table below.

Step 4: Select Candidates

Typical selection criteria:

  • Top 3-5 models by score/token
  • Look for low truncations, low invalids
  • Models with interesting task-level variance
  • Models from different families/architectures for comparison
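
If you also export the rankings as JSON (--format json), the shortlist step can be scripted. A rough sketch using only fields shown in the scores.json example later on this page (model, reasonscore, tiers, tasks, completion_tokens_mean, any_incomplete); adapt it to your own selection criteria:

import json

with open("scores.json") as f:
    evals = json.load(f)

def mean_tokens(entry):
    # Average completion tokens across every task in every tier
    tokens = [task["completion_tokens_mean"]
              for tier in entry["tiers"].values()
              for task in tier["tasks"].values()]
    return sum(tokens) / len(tokens)

# Top 5 candidates by score/token, flagging entries with incomplete tiers
shortlist = sorted(evals, key=lambda e: e["reasonscore"] / mean_tokens(e), reverse=True)[:5]
for entry in shortlist:
    incomplete = any(tier["any_incomplete"] for tier in entry["tiers"].values())
    flag = " (incomplete data)" if incomplete else ""
    print(f'{entry["model"]}: {entry["reasonscore"] / mean_tokens(entry):.3f}{flag}')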

Next step: Proceed to Workflow 2: Comparative Evaluation with selected candidates

Advanced Options

Filtering by Groups

# Compare only Llama family models
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["family:llama"]}'

# Compare large models only
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:large"]}'

# Combine filters (AND logic)
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["family:qwen", "size:large"]}'

Filtering by Specific Models

# Compare specific eval_ids
python analyze.py scores data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2, 3]}'

# Use fuzzy search first to find eval_ids
python analyze.py evals data/dataset-m12x.json --search "gpt-4"
# Then use discovered eval_ids

Output Formats

# Markdown table (default)
python analyze.py scores data/dataset-m12x.json --output scores.md

# JSON for programmatic access
python analyze.py scores data/dataset-m12x.json \
    --format json --output scores.json

# PNG heatmap for reports
python analyze.py scores data/dataset-m12x.json \
    --format png --output scores.png

Understanding the Markdown Scores Table

Columns:

  • Model - Model name/identifier
  • Tier - Difficulty tier (easy/medium/hard)
  • ReasonScore - Unified metric (higher is better, geometric mean × 1000)
  • Avg Tokens - Mean token consumption per task
  • Score/Token - Efficiency ratio (ReasonScore/Avg Tokens)
  • Task1, Task2, ... - Per-task scores with detail indicators

Cell Format in Per-Task Columns:

Each task cell shows an accuracy range with optional indicators:

.XX - .YY [-.ZZ] (N/M)
│     │    │      └──── Optional: Completion ratio (N points collected / M expected)
│     │    └─────────── Optional: Truncation penalty (shown if > 2%)
│     └──────────────── Upper bound (center + margin, 95% CI)
└────────────────────── Lower bound (center accuracy)

Examples:

  • .89 - .91 - Normal: 89-91% accuracy, no significant truncation
  • .67 - .69 [-.11] - Truncated: 67-69% accuracy, 11% truncation penalty
  • .75 - .79 [-.02] - Minor truncation: 75-79% accuracy, 2% truncation penalty
  • .45 - .48 (123/200) - Incomplete: 45-48% accuracy, only 123/200 points collected
  • .55 - .58 [-.22] (89/100) - Both: Truncated AND incomplete
  • - (dash only) - Missing: No data for this task
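
The sketch below reproduces this notation from raw cell values. The 2% display threshold for the truncation penalty follows the description above; the function name, rounding, and everything else are illustrative rather than the exact analyze.py formatting code.

def format_cell(center, margin, truncated_ratio=0.0, points=None, expected=None):
    """Render one per-task cell in the '.XX - .YY [-.ZZ] (N/M)' notation."""
    def short(x):                           # drop the leading zero: 0.89 -> .89
        return f"{x:.2f}".replace("0.", ".", 1)
    cell = f"{short(center)} - {short(center + margin)}"
    if truncated_ratio > 0.02:              # penalty is only displayed above 2%
        cell += f" [-{short(truncated_ratio)}]"
    if points is not None and expected is not None and points < expected:
        cell += f" ({points}/{expected})"   # evaluation stopped early
    return cell

print(format_cell(0.89, 0.02))                              # .89 - .91
print(format_cell(0.67, 0.02, truncated_ratio=0.11))        # .67 - .69 [-.11]
print(format_cell(0.45, 0.03, points=123, expected=200))    # .45 - .48 (123/200)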

Aggregate Score Indicators:

  • 771* - Asterisk indicates at least one task has incomplete data
  • 771 - No asterisk means all tasks have complete data

What Each Metric Means:

  • Center: Mean accuracy across all test points; this is the lower bound of the displayed range
  • Margin: Half-width of the 95% confidence interval; the upper bound is center + margin
  • Truncation penalty: Fraction of responses that hit context limits
  • Completion ratio: Points collected versus expected when an evaluation stopped early (time/cost limits)

Reading Patterns:

  • Wide ranges (.67 - .79) → High variance, less confident estimate
  • Narrow ranges (.89 - .91) → Consistent performance, confident estimate
  • Large truncation [-.48] → Model struggling with context limits
  • Incomplete (N/M) → Evaluation stopped early, limited data

Understanding the JSON Format (scores.json)

The JSON format provides the complete hierarchical structure with full statistical detail:

[
  {
    "eval_id": 1,
    "model": "Phi-4 Reasoning (FP16)",
    "reasonscore": 649.53,
    "tiers": {
      "easy": {
        "reasonscore": 771.44,
        "any_incomplete": false,
        "tasks": {
          "arithmetic": {
            "center": 0.808,
            "margin": 0.018,
            "truncated_ratio": 0.015,
            "completion_tokens_mean": 2232.21,
            "adjusted_score": 0.811,
            "point_count": 26,
            "expected_points": 26,
            "is_incomplete": false,
            // ... additional statistical fields
          },
          "boolean": { /* ... */ },
          // ... other tasks
        }
      },
      "medium": { /* ... */ },
      "hard": { /* ... */ }
    }
  },
  // ... other evaluations
]

Key differences from markdown format:

  • Array of evaluation objects (not wrapped in a "rankings" key)
  • Full tier breakdown with complete task statistics
  • All raw statistical values (center, margin, truncated_ratio, etc.)
  • Token consumption details per task
  • Completion status (point_count, expected_points, is_incomplete)
  • No pre-formatted strings - all numeric values for programmatic access
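
For programmatic consumers, here is a short sketch that walks this structure and surfaces per-task truncation and completeness issues. Field names follow the example above; the 2% threshold mirrors the markdown table, and the rest is illustrative.

import json

with open("scores.json") as f:
    evals = json.load(f)

for entry in evals:
    print(entry["model"], entry["reasonscore"])
    for tier_name, tier in entry["tiers"].items():
        for task_name, task in tier["tasks"].items():
            if task["truncated_ratio"] > 0.02:      # same 2% threshold as the markdown table
                print(f'  truncation: {tier_name}/{task_name} = {task["truncated_ratio"]:.0%}')
            if task["is_incomplete"]:
                print(f'  incomplete: {tier_name}/{task_name} '
                      f'({task["point_count"]}/{task["expected_points"]} points)')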

Understanding the PNG Output Format

Heat-bar visualization:

  • Rows: Models (sorted by score/token) grouped into Tiers
  • Columns: Tasks
  • Color: Performance (red=low, green=high)
  • Gray Crosshatch: Truncations
  • Useful for pattern recognition and presentations

Example Research Scenarios

Scenario 1: "I need to select a model for production"

# Step 1: Get complete rankings
python analyze.py scores data/dataset-m12x.json --output scores.md

# Step 2: Filter to production-viable sizes
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:medium", "size:large"]}' \
    --output scores-production.md

# Step 3: Identify top 3 candidates by inspecting scores-production.md

# Step 4: Proceed to Workflow 3 for detailed characterization

Next step: Workflow 3: Model Characterization

Scenario 2: "How does my new model compare to baselines?"

# Step 1: Find eval_id for your model
python analyze.py evals data/dataset-m12x.json --search "my-model"

# Step 2: Find comparable baselines by filtering on groups
# Example: Find all small models for fair comparison
python analyze.py evals data/dataset-m12x.json --groups size:small

# Or combine multiple group facets (AND logic)
python analyze.py evals data/dataset-m12x.json --groups size:small,family:llama

# Step 3: Compare your model against discovered baselines
python analyze.py scores data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>, <baseline-1>, <baseline-2>]}' \
    --output comparison.md

# Alternatively, compare against all models in a group
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:small"]}' \
    --output comparison-small.md

# Step 4: Interpret relative rankings by inspecting comparison.md

# Step 5: If differences are small, proceed to statistical comparison

Next step: Workflow 2: Comparative Evaluation

Scenario 3: "Which Llama variant is best?"

# Compare all Llama family models
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["family:llama"]}' \
    --format png --output llama-comparison.png

Next step: Workflow 2: Comparative Evaluation to determine statistical significance

Common Pitfalls

Pitfall 1: Comparing models at different scales

Problem: Comparing 7B and 70B models directly
Solution: Use --filters '{"groups": ["size:medium"]}' to compare within size class

Pitfall 2: Ignoring task variance

Problem: Selecting a model solely on its aggregate score
Solution: Inspect per-task scores for domain-specific requirements

Pitfall 3: Confusing score with accuracy

Problem: Expecting scores in the [0, 1] range
Solution: Remember that score/token is a rate metric, not a percentage; for example, a ReasonScore in the hundreds divided by an average of a few thousand tokens gives a value well below 1

Pitfall 4: Not accounting for context limits

Problem: High-truncation models appear worse than they are
Solution: Check truncation rates in the per-task breakdown

Next Steps

Tool Documentation