Workflow 3: Model Characterization¶
Research Question: "What are this model's strengths, weaknesses, and costs?"
Duration: 5-10 minutes
Objective¶
Profile a single model's capability landscape to understand:
- Task-level strengths and weaknesses - Which reasoning domains work well?
- Difficulty boundaries - Where does performance break down?
- Capability zones - What difficulty ranges are safe to use?
- Resource costs - Token consumption and truncation patterns
- Cognitive archetype - What kind of reasoner is this model?
This workflow answers deployment questions: "Should I use this model? For what tasks? What will it cost me?"
When to Use This Workflow¶
- You've selected a candidate from Workflow 1: Ranking or Workflow 2: Comparative
- Making deployment decisions (cost/performance trade-offs)
- Understanding a new model release
- Identifying task-specific strengths for ensemble systems
- Creating model capability reports
- Debugging unexpected production behavior
Primary Tools¶
1. spider - High-Level Cross-Task Fingerprint¶
Creates a visual "fingerprint" of the model's performance across all tasks:
- Web-PNG format: Radar chart for pattern recognition and archetype identification
- Bar-PNG format: Bar chart for precise numerical values
- Shows accuracy, token usage, and truncation rates per task
2. surface - Low-Level Task-Specific Capability Zones¶
Visualizes performance across difficulty dimensions within a single task:
- 3D surface plots showing the accuracy landscape
- Green spheres indicate "capability zones" (where the model succeeds)
- Reveals failure boundaries and truncation onset
- Shows where difficulty parameters cause breakdown
Basic Workflow¶
Step 1: Identify Model for Characterization¶
cd /home/mike/ai/reasonscape
source venv/bin/activate
# Find your model's eval_id
python analyze.py evals data/dataset-m12x.json --search "gpt-4o"
# Note the eval_id
Step 2: Generate High-Level Fingerprint¶
# Web-PNG format (pattern recognition)
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"eval_id": [<your-eval-id>]}' \
--format webpng --output spider-pattern.png
# Bar-PNG format (precise values)
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"eval_id": [<your-eval-id>]}' \
--format barpng --output spider-values.png
# JSON format (programmatic analysis)
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"eval_id": [<your-eval-id>]}' \
--format json --output spider-data.json
Step 3: Identify Cognitive Archetype¶
See Cognitive Archetypes section below.
Key archetypes:
- Balanced Generalist
- Catastrophic Scaler
- Early Breaker
- Task Specialist
- Truncation Victim
- Inefficient Reasoner
- [Others to be documented]
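If you exported the spider JSON in Step 2, a rough programmatic screen can narrow the archetype down before you confirm it against the radar chart. The sketch below is illustrative only: it reuses the score/variance/truncation thresholds from the Tips for LLM Agents section, and assumes the field names shown in the Spider JSON Format reference.
# archetype_screen.py - heuristic archetype screen (illustrative, not authoritative)
import json
import statistics

def screen_archetype(path):
    with open(path) as f:
        data = json.load(f)
    scores = [t["score"] for t in data["tasks"]]            # per-task scores
    trunc = [t.get("truncation_rate", 0.0) for t in data["tasks"]]
    spread = statistics.pstdev(scores)

    if statistics.mean(trunc) > 0.10:
        return "truncation victim (or systemic truncation crisis) - inspect plots"
    if statistics.mean(scores) < 0.5:
        return "universal failure - inspect plots"
    if spread < 0.15:
        return "balanced generalist - confirm visually"
    if spread > 0.3:
        return "task specialist or chaotic performer - confirm visually"
    return "inconclusive - inspect the radar chart"

print(screen_archetype("spider-data.json"))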
Step 4: Deep-Dive Weak Tasks¶
Based on the spider plot, identify the 1-3 weakest tasks and generate surface plots for them:
# Generate surface for weak task
python analyze.py surface data/dataset-m12x.json \
--filters '{"eval_id": [<your-eval-id>], "base_task": "arithmetic"}' \
--output-dir surfaces/
# Repeat for other weak tasks
python analyze.py surface data/dataset-m12x.json \
--filters '{"eval_id": [<your-eval-id>], "base_task": "boolean"}' \
--output-dir surfaces/
Step 5: Synthesize Capability Profile¶
Combine insights from the spider and surface plots to create a deployment guide:
- Safe tasks: High spider scores + large green capability zones
- Risky tasks: Low spider scores + small/no capability zones
- Cost estimate: Token usage from spider plot
- Scaling limits: Difficulty boundaries from surface plots
Advanced Options¶
Spider Plot Variations¶
# Compare multiple models side-by-side
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"eval_id": [0, 1, 2]}' \
--format webpng --output multi-spider.png
# Multiple difficulty levels
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"eval_id": [0], "tier": ["easy", "medium", "hard"]}' \
--format webpng
Surface Plot Variations¶
# Single task surface
python analyze.py surface data/dataset-m12x.json \
--filters '{"eval_id": [0], "base_task": "arithmetic"}'
# Multiple surfaces (one per task)
python analyze.py surface data/dataset-m12x.json \
--filters '{"eval_id": [0], "tier": "logic"}' \
--output-dir surfaces-logic/
# Compare two models on same task
python analyze.py surface data/dataset-m12x.json \
--filters '{"eval_id": [0, 1], "base_task": "arithmetic"}' \
--output-dir surface-comparison/
Interactive Exploration (Optional)¶
For hands-on surface exploration:
# Launch 3D explorer web app
python explorer.py data/dataset-m12x.json
# Open http://localhost:8051
# Select your model and task from dropdowns
# Rotate, zoom, inspect capability zones interactively
Interpretation Guide¶
Cognitive Archetypes¶
Based on the spider plot pattern, identify which archetype best describes the model:
Archetype: Balanced Generalist¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Catastrophic Scaler¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Early Breaker¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Task Specialist¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Truncation Victim¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Inefficient Reasoner¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Universal Failure¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Chaotic Performer¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Archetype: Systemic Truncation Crisis¶
- Visual pattern: [PLACEHOLDER - describe radar chart shape]
- Characteristics: [PLACEHOLDER]
- Strengths: [PLACEHOLDER]
- Weaknesses: [PLACEHOLDER]
- Deployment recommendation: [PLACEHOLDER]
Surface Plot Interpretation¶
Green Capability Zones¶
- What they show: [PLACEHOLDER - explain green spheres]
- How to use them: [PLACEHOLDER - safe operating ranges]
- Warning signs: [PLACEHOLDER - small/missing zones]
Accuracy Cliffs¶
- What they look like: [PLACEHOLDER - steep dropoffs]
- What they mean: [PLACEHOLDER - sudden failure]
- Implications: [PLACEHOLDER - deployment risk]
Truncation Patterns¶
- Visual indicators: [PLACEHOLDER - how truncation appears]
- Root causes: [PLACEHOLDER - context limits]
- Mitigation: [PLACEHOLDER - prompt engineering]
Difficulty Dimensions¶
- X/Y axes: [PLACEHOLDER - what parameters mean]
- Interpretation: [PLACEHOLDER - which dimensions matter]
- Task-specific: [PLACEHOLDER - varies by task]
Example Research Scenarios¶
Scenario 1: "Should I deploy this model for production?"¶
# Step 1: Generate capability fingerprint
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "candidate-model"}' \
--format webpng --output candidate-fingerprint.png
# Step 2: Identify archetype
# (visual inspection of radar chart)
# Step 3: Check critical tasks for your use case
python analyze.py surface data/dataset-m12x.json \
--filters '{"model_name": "candidate-model", "base_task": "arithmetic"}' \
--output-dir candidate-surfaces/
# Step 4: Estimate costs
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "candidate-model"}' \
--format json --output candidate-costs.json
# Parse token usage from JSON
Decision criteria:
- Green light: Balanced generalist + large capability zones on critical tasks + acceptable token costs
- Yellow light: Task specialist + strong on your use case + acceptable costs
- Red light: Early breaker / truncation victim / universal failure
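These criteria can be codified if you want the decision to be repeatable; the sketch below is only a template. The zone and cost judgments still come from visually inspecting the surface plots and your own pricing, so they are passed in as manual inputs.
# deployment_light.py - encode the green/yellow/red criteria (inputs are manual judgments)
def deployment_light(archetype, zones_cover_critical_tasks, strong_on_use_case, cost_acceptable):
    if archetype in {"early breaker", "truncation victim", "universal failure"}:
        return "red"
    if archetype == "balanced generalist" and zones_cover_critical_tasks and cost_acceptable:
        return "green"
    if archetype == "task specialist" and strong_on_use_case and cost_acceptable:
        return "yellow"
    return "needs closer review"

print(deployment_light("balanced generalist", True, True, True))  # -> green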
Scenario 2: "Why does this model work in dev but fail in production?"¶
# Step 1: Characterize the model
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "production-model"}' \
--format webpng
# Step 2: Check if production tasks are in capability zones
python analyze.py surface data/dataset-m12x.json \
--filters '{"model_name": "production-model", "base_task": "<production-task>"}' \
--output prod-surface.png
# Step 3: Compare difficulty of dev vs prod data
# (visual inspection - are prod queries outside green zones?)
Common findings:
- Production queries exceed difficulty boundaries (outside green zones)
- Truncation under production context sizes
- Task distribution mismatch (dev focused on strong tasks, prod hits weak tasks)
Scenario 3: "Which tasks should I use this model for?"¶
# Generate complete task breakdown
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "target-model"}' \
--format barpng --output task-breakdown.png
# For each strong task, verify capability zones
python analyze.py surface data/dataset-m12x.json \
--filters '{"model_name": "target-model", "tier": "logic"}' \
--output-dir capability-zones/
Recommendation template:
- Strongly recommended: [Tasks with score > 0.8 and large capability zones]
- Use with caution: [Tasks with score 0.5-0.8 or small capability zones]
- Avoid: [Tasks with score < 0.5 or no capability zones]
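If you also export the spider data as JSON (as in Step 2), the score portion of this template can be filled in automatically; the capability-zone checks still require the surface plots. A minimal sketch, assuming the field names from the Spider JSON Format reference:
# bucket_tasks.py - sort tasks into the recommendation template by score
import json

def bucket_tasks(path):
    with open(path) as f:
        tasks = json.load(f)["tasks"]
    buckets = {"strongly_recommended": [], "use_with_caution": [], "avoid": []}
    for t in tasks:
        if t["score"] > 0.8:
            buckets["strongly_recommended"].append(t["base_task"])
        elif t["score"] >= 0.5:
            buckets["use_with_caution"].append(t["base_task"])
        else:
            buckets["avoid"].append(t["base_task"])
    return buckets

print(bucket_tasks("spider-data.json"))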
Scenario 4: "How does this model's cost/performance compare?"¶
# Get detailed statistics
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "target-model"}' \
--format json --output model-stats.json
# Compare to baseline
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": ["target-model", "baseline-model"]}' \
--format webpng --output cost-comparison.png
Analysis:
- Extract average tokens per task from JSON
- Multiply by $/1M tokens for cost estimate
- Compare accuracy vs cost across models
- Identify efficiency frontier (best accuracy per dollar)
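A back-of-the-envelope version of this analysis, assuming tokens_mean in the JSON is the mean completion tokens per question (see the Spider JSON Format reference) and using a placeholder price:
# cost_estimate.py - rough per-question cost per task from the spider JSON
import json

PRICE_PER_MTOK = 0.60  # $/1M output tokens - placeholder, substitute your provider's rate

with open("model-stats.json") as f:
    stats = json.load(f)

for task in stats["tasks"]:
    cost = task["tokens_mean"] / 1_000_000 * PRICE_PER_MTOK
    print(f'{task["base_task"]}: score={task["score"]:.2f}, ~${cost:.5f} per question')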
Common Pitfalls¶
Pitfall 1: Ignoring capability zones¶
- Problem: Using model outside safe difficulty ranges
- Solution: Always check surface plots for critical tasks
Pitfall 2: Focusing only on aggregate scores¶
- Problem: Missing task-specific weaknesses
- Solution: Generate spider plot to see task-level breakdown
Pitfall 3: Not accounting for truncation¶
- Problem: Model "works" but truncates frequently
- Solution: Check truncation rates in spider plot and surface plots
Pitfall 4: Misidentifying archetype¶
- Problem: Treating early breaker as balanced generalist
- Solution: Test at multiple difficulty levels, not just "hard"
Output Reference¶
Spider Web-PNG Format¶
Visual radar chart with:
- Axes: One per task (typically 12)
- Lines: One per model (multiple models can be overlaid)
- Shaded area: Represents model's capability profile
- Shape: Identifies cognitive archetype
Spider Bar-PNG Format¶
Grouped bar chart with:
- X-axis: Tasks
- Y-axis: Performance metric (score)
- Bars: One per model (grouped if multiple models)
- Error bars: Confidence intervals
- Colors: Distinguish models or difficulty levels
Spider JSON Format¶
{
"eval_id": 0,
"model_name": "Model A",
"tasks": [
{
"base_task": "arithmetic",
"score": 0.85,
"ci": [0.82, 0.88],
"tokens_mean": 1234,
"tokens_std": 456,
"truncation_rate": 0.02
},
...
],
"archetype": "balanced_generalist",
"summary": {
"avg_score": 0.82,
"avg_tokens": 1150,
"total_truncation_rate": 0.03
}
}
Surface PNG Format¶
3D visualization with:
- X-axis: First difficulty parameter (e.g., length)
- Y-axis: Second difficulty parameter (e.g., depth)
- Z-axis: Accuracy (color-coded)
- Green spheres: Capability zones (high accuracy regions)
- Color gradient: Red (failure) → Yellow (marginal) → Green (success)
- Grid: Discretized difficulty space
Tips for LLM Agents¶
If you're an LLM agent using this workflow:
- Start with spider JSON for structured data:
  python analyze.py spiderweb <dataset> --filters <...> --format json
- Parse for archetype identification:
  - Compute mean and stddev of task scores
  - Low stddev (<0.15) → Balanced generalist
  - High stddev (>0.3) → Task specialist or chaotic
- Check truncation rates for the truncation victim pattern
- Identify weak tasks (score < 0.5):
  - Generate surface plots for each weak task
- Note: PNGs cannot be interpreted automatically yet - recommend visual review to the user
- Synthesize a deployment recommendation:
  "Model X is a [archetype] with strong performance on [high-scoring tasks] and weakness on [low-scoring tasks]. Average token consumption: [tokens]. Truncation rate: [rate]. Recommended for [use cases matching strong tasks]."
- Red flags to report:
  - High truncation rate (>10%) → context limit issues
  - High token usage (>2x dataset median) → inefficiency
  - Low scores across all tasks (<0.5 avg) → universal failure
  - High variance (stddev >0.3) → unpredictable performance
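A sketch of the red-flag checks as code, for agents that can run Python against the spider JSON. The dataset-wide median token count is not part of the per-model JSON shown in the Output Reference, so it is passed in separately (for example, computed from an unfiltered JSON export); field names again follow that illustrative schema.
# red_flags.py - automated red-flag report from the spider JSON (illustrative)
import json
import statistics

def red_flags(path, dataset_median_tokens):
    with open(path) as f:
        data = json.load(f)
    scores = [t["score"] for t in data["tasks"]]
    flags = []
    if data["summary"]["total_truncation_rate"] > 0.10:
        flags.append("high truncation rate (>10%): context limit issues")
    if data["summary"]["avg_tokens"] > 2 * dataset_median_tokens:
        flags.append("high token usage (>2x dataset median): inefficiency")
    if statistics.mean(scores) < 0.5:
        flags.append("low scores across all tasks (<0.5 avg): universal failure")
    if statistics.pstdev(scores) > 0.3:
        flags.append("high variance (stddev >0.3): unpredictable performance")
    return flags

print(red_flags("spider-data.json", dataset_median_tokens=1200))  # 1200 is a placeholder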
Related Workflows¶
- Workflow 1: Ranking - Select model for characterization
- Workflow 2: Comparative - Compare characterized models
- Workflow 4: Diagnosis - Root-cause identified weaknesses
Tool Documentation¶
- analyze.py spiderweb reference - Complete command documentation
- analyze.py surface reference - Surface plot generation
- explorer.py reference - Interactive 3D exploration
- Architecture - Full methodology overview