
Workflow 3: Model Characterization

Research Question: "What are this model's strengths, weaknesses, and costs?"

Duration: 5-10 minutes

Objective

Profile a single model's capability landscape to understand:
  • Task-level strengths and weaknesses - Which reasoning domains work well?
  • Difficulty boundaries - Where does performance break down?
  • Capability zones - What difficulty ranges are safe to use?
  • Resource costs - Token consumption and truncation patterns
  • Cognitive archetype - What kind of reasoner is this model?

This workflow answers deployment questions: "Should I use this model? For what tasks? What will it cost me?"

When to Use This Workflow

  • You've selected a candidate from Workflow 1: Ranking or Workflow 2: Comparative
  • Making deployment decisions (cost/performance trade-offs)
  • Understanding a new model release
  • Identifying task-specific strengths for ensemble systems
  • Creating model capability reports
  • Debugging unexpected production behavior

Primary Tools

1. spider - High-Level Cross-Task Fingerprint

Creates a visual "fingerprint" of a model's performance across all tasks:
  • Web-PNG format: Radar chart for pattern recognition and archetype identification
  • Bar-PNG format: Bar chart for precise numerical values
  • Shows accuracy, token usage, and truncation rates per task

2. surface - Low-Level Task-Specific Capability Zones

Visualizes performance across difficulty dimensions within a single task:
  • 3D surface plots showing the accuracy landscape
  • Green spheres indicate "capability zones" (where the model succeeds)
  • Reveals failure boundaries and truncation onset
  • Shows where difficulty parameters cause breakdown

Basic Workflow

Step 1: Identify Model for Characterization

cd /home/mike/ai/reasonscape
source venv/bin/activate

# Find your model's eval_id
python analyze.py evals data/dataset-m12x.json --search "gpt-4o"
# Note the eval_id

Step 2: Generate High-Level Fingerprint

# Web-PNG format (pattern recognition)
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>]}' \
    --format webpng --output spider-pattern.png

# Bar-PNG format (precise values)
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>]}' \
    --format barpng --output spider-values.png

# JSON format (programmatic analysis)
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>]}' \
    --format json --output spider-data.json

Step 3: Identify Cognitive Archetype

See Cognitive Archetypes section below.

Key archetypes:
  • Balanced Generalist
  • Catastrophic Scaler
  • Early Breaker
  • Task Specialist
  • Truncation Victim
  • Inefficient Reasoner
  • [Others to be documented]

Step 4: Deep-Dive Weak Tasks

Based on the spider plot, identify the 1-3 weakest tasks and generate surface plots:

# Generate surface for weak task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>], "base_task": "arithmetic"}' \
    --output-dir surfaces/

# Repeat for other weak tasks
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>], "base_task": "boolean"}' \
    --output-dir surfaces/

Step 5: Synthesize Capability Profile

Combine insights from spider + surfaces to create a deployment guide (a sketch follows this list):
  • Safe tasks: High spider scores + large green capability zones
  • Risky tasks: Low spider scores + small/no capability zones
  • Cost estimate: Token usage from spider plot
  • Scaling limits: Difficulty boundaries from surface plots
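
If you want a machine-readable starting point, the minimal Python sketch below buckets tasks into safe and risky candidates from the spider JSON generated in Step 2. It assumes the JSON schema shown under Spider JSON Format; the 0.8/0.5 score and 5% truncation thresholds are illustrative, and capability zones still need to be confirmed visually against the surface plots.

import json

# Load the spider JSON produced in Step 2 (--format json); filename is an example.
with open("spider-data.json") as f:
    profile = json.load(f)

safe, risky = [], []
for task in profile["tasks"]:
    # Spider score is only half the picture - confirm against surface capability zones.
    if task["score"] >= 0.8 and task["truncation_rate"] < 0.05:
        safe.append(task["base_task"])
    elif task["score"] < 0.5:
        risky.append(task["base_task"])

print("Safe task candidates: ", ", ".join(safe) or "none")
print("Risky task candidates:", ", ".join(risky) or "none")
print("Avg tokens per response:", profile["summary"]["avg_tokens"])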

Advanced Options

Spider Plot Variations

# Compare multiple models side-by-side
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2]}' \
    --format webpng --output multi-spider.png

# Multiple difficulty levels
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [0], "tier": ["easy", "medium", "hard"]}' \
    --format webpng

Surface Plot Variations

# Single task surface
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [0], "base_task": "arithmetic"}'

# Multiple surfaces (one per task)
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [0], "tier": "logic"}' \
    --output-dir surfaces-logic/

# Compare two models on same task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1], "base_task": "arithmetic"}' \
    --output-dir surface-comparison/

Interactive Exploration (Optional)

For hands-on surface exploration:

# Launch 3D explorer web app
python explorer.py data/dataset-m12x.json

# Open http://localhost:8051
# Select your model and task from dropdowns
# Rotate, zoom, inspect capability zones interactively

Interpretation Guide

Cognitive Archetypes

Based on the spider plot pattern, identify which archetype best describes the model:

Archetype: Balanced Generalist

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Catastrophic Scaler

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Early Breaker

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Task Specialist

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Truncation Victim

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Inefficient Reasoner

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Universal Failure

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Chaotic Performer

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Systemic Truncation Crisis

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Surface Plot Interpretation

Green Capability Zones

What they show: [PLACEHOLDER - explain green spheres]
How to use them: [PLACEHOLDER - safe operating ranges]
Warning signs: [PLACEHOLDER - small/missing zones]

Accuracy Cliffs

What they look like: [PLACEHOLDER - steep dropoffs]
What they mean: [PLACEHOLDER - sudden failure]
Implications: [PLACEHOLDER - deployment risk]

Truncation Patterns

Visual indicators: [PLACEHOLDER - how truncation appears]
Root causes: [PLACEHOLDER - context limits]
Mitigation: [PLACEHOLDER - prompt engineering]

Difficulty Dimensions

X/Y axes: [PLACEHOLDER - what parameters mean]
Interpretation: [PLACEHOLDER - which dimensions matter]
Task-specific: [PLACEHOLDER - varies by task]

Example Research Scenarios

Scenario 1: "Should I deploy this model for production?"

# Step 1: Generate capability fingerprint
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "candidate-model"}' \
    --format webpng --output candidate-fingerprint.png

# Step 2: Identify archetype
# (visual inspection of radar chart)

# Step 3: Check critical tasks for your use case
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "candidate-model", "base_task": "arithmetic"}' \
    --output-dir candidate-surfaces/

# Step 4: Estimate costs
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "candidate-model"}' \
    --format json --output candidate-costs.json
# Parse token usage from JSON

Decision criteria:
  • Green light: Balanced generalist + large capability zones on critical tasks + acceptable token costs
  • Yellow light: Task specialist + strong on your use case + acceptable costs
  • Red light: Early breaker / truncation victim / universal failure

Scenario 2: "Why does this model work in dev but fail in production?"

# Step 1: Characterize the model
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "production-model"}' \
    --format webpng

# Step 2: Check if production tasks are in capability zones
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "<production-task>"}' \
    --output prod-surface.png

# Step 3: Compare difficulty of dev vs prod data
# (visual inspection - are prod queries outside green zones?)

Common findings:
  • Production queries exceed difficulty boundaries (outside green zones)
  • Truncation under production context sizes
  • Task distribution mismatch (dev focused on strong tasks, prod hits weak tasks)

Scenario 3: "Which tasks should I use this model for?"

# Generate complete task breakdown
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "target-model"}' \
    --format barpng --output task-breakdown.png

# For each strong task, verify capability zones
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "target-model", "tier": "logic"}' \
    --output-dir capability-zones/

Recommendation template:
  • Strongly recommended: [Tasks with score > 0.8 and large capability zones]
  • Use with caution: [Tasks with score 0.5-0.8 or small capability zones]
  • Avoid: [Tasks with score < 0.5 or no capability zones]

Scenario 4: "How does this model's cost/performance compare?"

# Get detailed statistics
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "target-model"}' \
    --format json --output model-stats.json

# Compare to baseline
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": ["target-model", "baseline-model"]}' \
    --format webpng --output cost-comparison.png

Analysis (a sketch follows this list):
  • Extract average tokens per task from the JSON
  • Multiply by $/1M tokens for a cost estimate
  • Compare accuracy vs cost across models
  • Identify the efficiency frontier (best accuracy per dollar)
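
A minimal sketch of that arithmetic, assuming the spider JSON schema shown under Spider JSON Format and a flat output-token price (both are assumptions to adapt; if the file contains multiple evals, loop over the records):

import json

PRICE_PER_M_TOKENS = 0.60  # assumed USD per 1M generated tokens - substitute your provider's rate

# Load the per-task statistics exported in Step 1 of this scenario.
with open("model-stats.json") as f:
    stats = json.load(f)

tasks = stats["tasks"]
avg_tokens = sum(t["tokens_mean"] for t in tasks) / len(tasks)
avg_score = stats["summary"]["avg_score"]
cost_per_question = avg_tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"Avg tokens per question: {avg_tokens:.0f}")
print(f"Est. cost per question:  ${cost_per_question:.6f}")
print(f"Accuracy per dollar:     {avg_score / cost_per_question:.0f}")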

Common Pitfalls

Pitfall 1: Ignoring capability zones

Problem: Using the model outside safe difficulty ranges
Solution: Always check surface plots for critical tasks

Pitfall 2: Focusing only on aggregate scores

Problem: Missing task-specific weaknesses
Solution: Generate a spider plot to see the task-level breakdown

Pitfall 3: Not accounting for truncation

Problem: Model "works" but truncates frequently
Solution: Check truncation rates in the spider plot and surface plots

Pitfall 4: Misidentifying archetype

Problem: Treating an early breaker as a balanced generalist
Solution: Test at multiple difficulty levels, not just "hard"

Output Reference

Spider Web-PNG Format

Visual radar chart with:
  • Axes: One per task (typically 12)
  • Lines: One per model (multiple models can be overlaid)
  • Shaded area: Represents the model's capability profile
  • Shape: Identifies cognitive archetype

Spider Bar-PNG Format

Grouped bar chart with:
  • X-axis: Tasks
  • Y-axis: Performance metric (score)
  • Bars: One per model (grouped if multiple models)
  • Error bars: Confidence intervals
  • Colors: Distinguish models or difficulty levels

Spider JSON Format

{
  "eval_id": 0,
  "model_name": "Model A",
  "tasks": [
    {
      "base_task": "arithmetic",
      "score": 0.85,
      "ci": [0.82, 0.88],
      "tokens_mean": 1234,
      "tokens_std": 456,
      "truncation_rate": 0.02
    },
    ...
  ],
  "archetype": "balanced_generalist",
  "summary": {
    "avg_score": 0.82,
    "avg_tokens": 1150,
    "total_truncation_rate": 0.03
  }
}

Surface PNG Format

3D visualization with:
  • X-axis: First difficulty parameter (e.g., length)
  • Y-axis: Second difficulty parameter (e.g., depth)
  • Z-axis: Accuracy (color-coded)
  • Green spheres: Capability zones (high accuracy regions)
  • Color gradient: Red (failure) → Yellow (marginal) → Green (success)
  • Grid: Discretized difficulty space

Tips for LLM Agents

If you're an LLM agent using this workflow:

  1. Start with spider JSON for structured data:

    python analyze.py spiderweb <dataset> --filters <...> --format json

  2. Parse for archetype identification (a rough sketch follows this list):
     • Compute mean and stddev of task scores
     • Low stddev (<0.15) → Balanced generalist
     • High stddev (>0.3) → Task specialist or chaotic
     • Check truncation rates for the truncation victim pattern

  3. Identify weak tasks (score < 0.5):
     • Generate surface plots for each weak task
     • Note: Cannot automatically interpret PNGs yet - recommend to user

  4. Synthesize a deployment recommendation:

    "Model X is a [archetype] with strong performance on [high-scoring tasks]
    and weakness on [low-scoring tasks]. Average token consumption: [tokens].
    Truncation rate: [rate]. Recommended for [use cases matching strong tasks]."

  5. Red flags to report:
     • High truncation rate (>10%) → context limit issues
     • High token usage (>2x dataset median) → inefficiency
     • Low scores across all tasks (<0.5 avg) → universal failure
     • High variance (stddev >0.3) → unpredictable performance
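
If a structured check is useful, here is a rough Python sketch of the heuristics in steps 2 and 5. It assumes the spider JSON schema shown under Spider JSON Format (per-task score, base_task, and a summary block); the thresholds simply mirror the lists above and should be tuned to your dataset.

import json
import statistics

# Load the spider JSON produced in step 1 (filename is an example).
with open("spider-data.json") as f:
    profile = json.load(f)

scores = [t["score"] for t in profile["tasks"]]
mean, stddev = statistics.mean(scores), statistics.pstdev(scores)
truncation = profile["summary"]["total_truncation_rate"]

# Rough archetype guess from truncation rate and the spread of task scores.
if truncation > 0.10:
    label = "truncation victim / systemic truncation crisis"
elif mean < 0.5:
    label = "universal failure"
elif stddev < 0.15:
    label = "balanced generalist"
elif stddev > 0.3:
    label = "task specialist or chaotic performer"
else:
    label = "mixed profile - inspect the radar chart"

weak_tasks = [t["base_task"] for t in profile["tasks"] if t["score"] < 0.5]
print(f"mean={mean:.2f} stddev={stddev:.2f} truncation={truncation:.1%} -> {label}")
print("Weak tasks to surface-plot:", ", ".join(weak_tasks) or "none")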

Tool Documentation