
Workflow 1: Ranking & Benchmarking

Research Question: "What's the best model within a group, or overall?"

Duration: 2-3 minutes

Objective

Generate aggregate rankings across all models and tasks to identify top candidates for deeper investigation. This workflow produces a single unified metric (score/token) that balances:

  • Accuracy (correctness of answers)
  • Truncation (the score penalizes models that hit context limits)
  • Resource utilization (token consumption per task)

When to Use This Workflow

  • Initial triage of multiple models
  • Selecting 3-5 candidates for comparative evaluation
  • Quick assessment of new model releases
  • Creating benchmark reports for publication
  • Answering "which model is best overall?" questions

Primary Tool: scores

The scores subcommand computes aggregate rankings by:

  1. Computing excess-accuracy-corrected scores per task
  2. Weighting by token efficiency (score/token)
  3. Aggregating across all tasks in the dataset
  4. Producing sortable rankings with metadata
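
A minimal sketch of how such a ranking could be assembled, assuming per-task adjusted scores are combined with a geometric mean scaled by 1000 (as described under Understanding the Markdown Scores Table below) and divided by mean token consumption; the actual analyze.py internals may differ:

from math import prod

def reasonscore(adjusted_scores):
    # Geometric mean of per-task adjusted scores, scaled by 1000
    return prod(adjusted_scores) ** (1 / len(adjusted_scores)) * 1000

def score_per_token(adjusted_scores, mean_tokens):
    # Efficiency ratio used for ranking: ReasonScore / Avg Tokens
    return reasonscore(adjusted_scores) / mean_tokens

# Illustrative values only: three tasks, ~2200 completion tokens per task
print(score_per_token([0.81, 0.74, 0.92], mean_tokens=2200.0))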

Basic Workflow

Step 1: Discover Available Evaluations

cd /home/mike/ai/reasonscape
source venv/bin/activate

# List all evaluations in the dataset
python analyze.py evals data/dataset-m12x.json --format table

What you get:

  • Table of all models with eval_ids
  • Group memberships (family, size, architecture, runtime)
  • Metadata for filtering

Decision point: Note interesting groups or families for filtering (optional)

Step 2: Generate Rankings

# Generate rankings for all models
python analyze.py scores data/dataset-m12x.json --output scores.md

# Or filter to specific groups
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:large"]}' \
    --output scores-large.md

What you get:

  • Markdown table sorted by score, with score/token available for comparison
  • Per-tier x Per-task scores for each model
  • Token consumption statistics (if enabled with --show-token-counts)
  • Group/family metadata

Output modes:

  • scores.md - Human-readable rankings
  • scores.json - Machine-readable data (with --format json)
  • scores.png - Visual heatmap (with --format png)

Step 3: Interpret Results

See Understanding the Markdown Scores Table below.

Step 4: Select Candidates

Typical selection criteria:

  • Top 3-5 models by score/token
  • Look for low truncations, low invalids
  • Models with interesting task-level variance
  • Models from different families/architectures for comparison
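
If you also export the rankings as JSON (--format json), the shortlist step can be scripted. A rough sketch using only fields shown in the scores.json example later on this page (model, reasonscore, tiers, tasks, completion_tokens_mean, any_incomplete); adapt it to your own selection criteria:

import json

with open("scores.json") as f:
    evals = json.load(f)

def mean_tokens(entry):
    # Average completion tokens across every task in every tier
    tokens = [task["completion_tokens_mean"]
              for tier in entry["tiers"].values()
              for task in tier["tasks"].values()]
    return sum(tokens) / len(tokens)

# Top 5 candidates by score/token, flagging entries with incomplete tiers
shortlist = sorted(evals, key=lambda e: e["reasonscore"] / mean_tokens(e), reverse=True)[:5]
for entry in shortlist:
    incomplete = any(tier["any_incomplete"] for tier in entry["tiers"].values())
    flag = " (incomplete data)" if incomplete else ""
    print(f'{entry["model"]}: {entry["reasonscore"] / mean_tokens(entry):.3f}{flag}')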

Next step: Proceed to Workflow 2: Comparative Evaluation with selected candidates

Advanced Options

Filtering by Groups

# Compare only Llama family models
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["family:llama"]}'

# Compare large models only
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:large"]}'

# Combine filters (AND logic)
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["family:qwen", "size:large"]}'

Filtering by Specific Models

# Compare specific eval_ids
python analyze.py scores data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2, 3]}'

# Use fuzzy search first to find eval_ids
python analyze.py evals data/dataset-m12x.json --search "gpt-4"
# Then use discovered eval_ids

Output Formats

# Markdown table (default)
python analyze.py scores data/dataset-m12x.json --output scores.md

# JSON for programmatic access
python analyze.py scores data/dataset-m12x.json \
    --format json --output scores.json

# PNG heatmap for reports
python analyze.py scores data/dataset-m12x.json \
    --format png --output scores.png

Understanding the Markdown Scores Table

Columns:

  • Model - Model name/identifier
  • Tier - Difficulty tier (easy/medium/hard)
  • ReasonScore - Unified metric (higher is better, geometric mean × 1000)
  • Avg Tokens - Mean token consumption per task
  • Score/Token - Efficiency ratio (ReasonScore/Avg Tokens)
  • Task1, Task2, ... - Per-task scores with detail indicators

Cell Format in Per-Task Columns:

Each task cell shows an accuracy range with optional indicators:

.XX - .YY [-.ZZ] (N/M)
│     │    │      └──── Optional: Completion ratio (N points collected / M expected)
│     │    └─────────── Optional: Truncation penalty (shown if > 2%)
│     └──────────────── Upper bound (center + margin, 95% CI)
└────────────────────── Lower bound (center accuracy)

Examples:

  • .89 - .91 - Normal: 89-91% accuracy, no significant truncation
  • .67 - .69 [-.11] - Truncated: 67-69% accuracy, 11% truncation penalty
  • .75 - .79 [-.02] - Minor truncation: 75-79% accuracy, 2% truncation penalty
  • .45 - .48 (123/200) - Incomplete: 45-48% accuracy, only 123/200 points collected
  • .55 - .58 [-.22] (89/100) - Both: Truncated AND incomplete
  • - (dash only) - Missing: No data for this task
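
The sketch below reproduces this notation from raw cell values. The 2% display threshold for the truncation penalty follows the description above; the function name, rounding, and everything else are illustrative rather than the exact analyze.py formatting code.

def format_cell(center, margin, truncated_ratio=0.0, points=None, expected=None):
    """Render one per-task cell in the '.XX - .YY [-.ZZ] (N/M)' notation."""
    def short(x):                           # drop the leading zero: 0.89 -> .89
        return f"{x:.2f}".replace("0.", ".", 1)
    cell = f"{short(center)} - {short(center + margin)}"
    if truncated_ratio > 0.02:              # penalty is only displayed above 2%
        cell += f" [-{short(truncated_ratio)}]"
    if points is not None and expected is not None and points < expected:
        cell += f" ({points}/{expected})"   # evaluation stopped early
    return cell

print(format_cell(0.89, 0.02))                              # .89 - .91
print(format_cell(0.67, 0.02, truncated_ratio=0.11))        # .67 - .69 [-.11]
print(format_cell(0.45, 0.03, points=123, expected=200))    # .45 - .48 (123/200)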

Aggregate Score Indicators:

  • 771* - Asterisk indicates at least one task has incomplete data
  • 771 - No asterisk means all tasks have complete data

What Each Metric Means:

  • Center: Mean accuracy across all test points; this is the lower bound of the displayed range
  • Margin: Half-width of the 95% confidence interval; the upper bound is center + margin
  • Truncation penalty: Fraction of responses that hit context limits
  • Completion ratio: Points collected versus expected when an evaluation stopped early (time/cost limits)

Reading Patterns:

  • Wide ranges (.67 - .79) → High variance, less confident estimate
  • Narrow ranges (.89 - .91) → Consistent performance, confident estimate
  • Large truncation [-.48] → Model struggling with context limits
  • Incomplete (N/M) → Evaluation stopped early, limited data

Understanding the JSON Format (scores.json)

The JSON format provides the complete hierarchical structure with full statistical detail:

[
  {
    "eval_id": 1,
    "model": "Phi-4 Reasoning (FP16)",
    "reasonscore": 649.53,
    "tiers": {
      "easy": {
        "reasonscore": 771.44,
        "any_incomplete": false,
        "tasks": {
          "arithmetic": {
            "center": 0.808,
            "margin": 0.018,
            "truncated_ratio": 0.015,
            "completion_tokens_mean": 2232.21,
            "adjusted_score": 0.811,
            "point_count": 26,
            "expected_points": 26,
            "is_incomplete": false,
            // ... additional statistical fields
          },
          "boolean": { /* ... */ },
          // ... other tasks
        }
      },
      "medium": { /* ... */ },
      "hard": { /* ... */ }
    }
  },
  // ... other evaluations
]

Key differences from markdown format:

  • Array of evaluation objects (not wrapped in a "rankings" key)
  • Full tier breakdown with complete task statistics
  • All raw statistical values (center, margin, truncated_ratio, etc.)
  • Token consumption details per task
  • Completion status (point_count, expected_points, is_incomplete)
  • No pre-formatted strings - all numeric values for programmatic access
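
For programmatic consumers, here is a short sketch that walks this structure and surfaces per-task truncation and completeness issues. Field names follow the example above; the 2% threshold mirrors the markdown table, and the rest is illustrative.

import json

with open("scores.json") as f:
    evals = json.load(f)

for entry in evals:
    print(entry["model"], entry["reasonscore"])
    for tier_name, tier in entry["tiers"].items():
        for task_name, task in tier["tasks"].items():
            if task["truncated_ratio"] > 0.02:      # same 2% threshold as the markdown table
                print(f'  truncation: {tier_name}/{task_name} = {task["truncated_ratio"]:.0%}')
            if task["is_incomplete"]:
                print(f'  incomplete: {tier_name}/{task_name} '
                      f'({task["point_count"]}/{task["expected_points"]} points)')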

Understanding the PNG Output Format

Heat-bar visualization:

  • Rows: Models (sorted by score/token) grouped into Tiers
  • Columns: Tasks
  • Color: Performance (red=low, green=high)
  • Gray Crosshatch: Truncations
  • Useful for pattern recognition and presentations

Example Research Scenarios

Scenario 1: "I need to select a model for production"

# Step 1: Get complete rankings
python analyze.py scores data/dataset-m12x.json --output scores.md

# Step 2: Filter to production-viable sizes
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:medium", "size:large"]}' \
    --output scores-production.md

# Step 3: Identify top 3 candidates by inspecting scores-production.md

# Step 4: Proceed to Workflow 3 for detailed characterization

Next step: Workflow 3: Model Characterization

Scenario 2: "How does my new model compare to baselines?"

# Step 1: Find eval_id for your model
python analyze.py evals data/dataset-m12x.json --search "my-model"

# Step 2: Find comparable baselines by filtering on groups
# Example: Find all small models for fair comparison
python analyze.py evals data/dataset-m12x.json --groups size:small

# Or combine multiple group facets (AND logic)
python analyze.py evals data/dataset-m12x.json --groups size:small,family:llama

# Step 3: Compare your model against discovered baselines
python analyze.py scores data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>, <baseline-1>, <baseline-2>]}' \
    --output comparison.md

# Alternatively, compare against all models in a group
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["size:small"]}' \
    --output comparison-small.md

# Step 4: Interpret relative rankings by inspecting comparison.md

# Step 5: If differences are small, proceed to statistical comparison

Next step: Workflow 2: Comparative Evaluation

Scenario 3: "Which Llama variant is best?"

# Compare all Llama family models
python analyze.py scores data/dataset-m12x.json \
    --filters '{"groups": ["family:llama"]}' \
    --format png --output llama-comparison.png

Next step: Workflow 2: Comparative Evaluation to determine statistical significance

Common Pitfalls

Pitfall 1: Comparing models at different scales

Problem: Comparing 7B and 70B models directly
Solution: Use --filters '{"groups": ["size:medium"]}' to compare within size class

Pitfall 2: Ignoring task variance

Problem: Selecting a model solely on its aggregate score
Solution: Inspect per-task scores for domain-specific requirements

Pitfall 3: Confusing score with accuracy

Problem: Expecting scores in the [0, 1] range
Solution: Remember that score/token is a rate metric, not a percentage; for example, a ReasonScore in the hundreds divided by an average of a few thousand tokens gives a value well below 1

Pitfall 4: Not accounting for context limits

Problem: High-truncation models appear worse than they are
Solution: Check truncation rates in the per-task breakdown

Next Steps

Tool Documentation