
Workflow 3: Model Characterization

Research Question: "What are this model's strengths, weaknesses, and costs?"

Duration: 5-10 minutes

Objective

Profile a single model's capability landscape to understand:
  • Task-level strengths and weaknesses - Which reasoning domains work well?
  • Difficulty boundaries - Where does performance break down?
  • Capability zones - What difficulty ranges are safe to use?
  • Resource costs - Token consumption and truncation patterns
  • Cognitive archetype - What kind of reasoner is this model?

This workflow answers deployment questions: "Should I use this model? For what tasks? What will it cost me?"

When to Use This Workflow

  • You've selected a candidate from Workflow 1: Ranking or Workflow 2: Comparative
  • Making deployment decisions (cost/performance trade-offs)
  • Understanding a new model release
  • Identifying task-specific strengths for ensemble systems
  • Creating model capability reports
  • Debugging unexpected production behavior

Primary Tools

1. spider - High-Level Cross-Task Fingerprint

Creates a visual "fingerprint" of a model's performance across all tasks:
  • Web-PNG format: Radar chart for pattern recognition and archetype identification
  • Bar-PNG format: Bar chart for precise numerical values
  • Shows accuracy, token usage, and truncation rates per task

2. surface - Low-Level Task-Specific Capability Zones

Visualizes performance across difficulty dimensions within a single task:
  • 3D surface plots showing the accuracy landscape
  • Green spheres indicate "capability zones" (where the model succeeds)
  • Reveals failure boundaries and truncation onset
  • Shows where difficulty parameters cause breakdown

Basic Workflow

Step 1: Identify Model for Characterization

cd /home/mike/ai/reasonscape
source venv/bin/activate

# Find your model's eval_id
python analyze.py evals data/dataset-m12x.json --search "gpt-4o"
# Note the eval_id

Step 2: Generate High-Level Fingerprint

# Web-PNG format (pattern recognition)
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>]}' \
    --format webpng --output spider-pattern.png

# Bar-PNG format (precise values)
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>]}' \
    --format barpng --output spider-values.png

# JSON format (programmatic analysis)
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>]}' \
    --format json --output spider-data.json

Step 3: Identify Cognitive Archetype

See Cognitive Archetypes section below.

Key archetypes:
  • Balanced Generalist
  • Catastrophic Scaler
  • Early Breaker
  • Task Specialist
  • Truncation Victim
  • Inefficient Reasoner
  • [Others to be documented]

Step 4: Deep-Dive Weak Tasks

Based on the spider plot, identify the 1-3 weakest tasks and generate surface plots:

# Generate surface for weak task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>], "base_task": "arithmetic"}' \
    --output-dir surfaces/

# Repeat for other weak tasks
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<your-eval-id>], "base_task": "boolean"}' \
    --output-dir surfaces/

Step 5: Synthesize Capability Profile

Combine insights from spider + surfaces to create a deployment guide (a sketch follows this list):
  • Safe tasks: High spider scores + large green capability zones
  • Risky tasks: Low spider scores + small/no capability zones
  • Cost estimate: Token usage from spider plot
  • Scaling limits: Difficulty boundaries from surface plots
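
If you want a machine-readable starting point, the minimal Python sketch below buckets tasks into safe and risky candidates from the spider JSON generated in Step 2. It assumes the JSON schema shown under Spider JSON Format; the 0.8/0.5 score and 5% truncation thresholds are illustrative, and capability zones still need to be confirmed visually against the surface plots.

import json

# Load the spider JSON produced in Step 2 (--format json); filename is an example.
with open("spider-data.json") as f:
    profile = json.load(f)

safe, risky = [], []
for task in profile["tasks"]:
    # Spider score is only half the picture - confirm against surface capability zones.
    if task["score"] >= 0.8 and task["truncation_rate"] < 0.05:
        safe.append(task["base_task"])
    elif task["score"] < 0.5:
        risky.append(task["base_task"])

print("Safe task candidates: ", ", ".join(safe) or "none")
print("Risky task candidates:", ", ".join(risky) or "none")
print("Avg tokens per response:", profile["summary"]["avg_tokens"])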

Advanced Options

Spider Plot Variations

# Compare multiple models side-by-side
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1, 2]}' \
    --format webpng --output multi-spider.png

# Multiple difficulty levels
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"eval_id": [0], "tier": ["easy", "medium", "hard"]}' \
    --format webpng

Surface Plot Variations

# Single task surface
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [0], "base_task": "arithmetic"}'

# Multiple surfaces (one per task)
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [0], "tier": "logic"}' \
    --output-dir surfaces-logic/

# Compare two models on same task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [0, 1], "base_task": "arithmetic"}' \
    --output-dir surface-comparison/

Interactive Exploration (Optional)

For hands-on surface exploration:

# Launch 3D explorer web app
python explorer.py data/dataset-m12x.json

# Open http://localhost:8051
# Select your model and task from dropdowns
# Rotate, zoom, inspect capability zones interactively

Interpretation Guide

Cognitive Archetypes

Based on the spider plot pattern, identify which archetype best describes the model:

Archetype: Balanced Generalist

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Catastrophic Scaler

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Early Breaker

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Task Specialist

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Truncation Victim

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Inefficient Reasoner

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Universal Failure

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Chaotic Performer

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Archetype: Systemic Truncation Crisis

Visual pattern: [PLACEHOLDER - describe radar chart shape]
Characteristics: [PLACEHOLDER]
Strengths: [PLACEHOLDER]
Weaknesses: [PLACEHOLDER]
Deployment recommendation: [PLACEHOLDER]

Surface Plot Interpretation

Green Capability Zones

What they show: [PLACEHOLDER - explain green spheres]
How to use them: [PLACEHOLDER - safe operating ranges]
Warning signs: [PLACEHOLDER - small/missing zones]

Accuracy Cliffs

What they look like: [PLACEHOLDER - steep dropoffs]
What they mean: [PLACEHOLDER - sudden failure]
Implications: [PLACEHOLDER - deployment risk]

Truncation Patterns

Visual indicators: [PLACEHOLDER - how truncation appears]
Root causes: [PLACEHOLDER - context limits]
Mitigation: [PLACEHOLDER - prompt engineering]

Difficulty Dimensions

X/Y axes: [PLACEHOLDER - what parameters mean]
Interpretation: [PLACEHOLDER - which dimensions matter]
Task-specific: [PLACEHOLDER - varies by task]

Example Research Scenarios

Scenario 1: "Should I deploy this model for production?"

# Step 1: Generate capability fingerprint
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "candidate-model"}' \
    --format webpng --output candidate-fingerprint.png

# Step 2: Identify archetype
# (visual inspection of radar chart)

# Step 3: Check critical tasks for your use case
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "candidate-model", "base_task": "arithmetic"}' \
    --output-dir candidate-surfaces/

# Step 4: Estimate costs
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "candidate-model"}' \
    --format json --output candidate-costs.json
# Parse token usage from JSON

Decision criteria:
  • Green light: Balanced generalist + large capability zones on critical tasks + acceptable token costs
  • Yellow light: Task specialist + strong on your use case + acceptable costs
  • Red light: Early breaker / truncation victim / universal failure

Scenario 2: "Why does this model work in dev but fail in production?"

# Step 1: Characterize the model
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "production-model"}' \
    --format webpng

# Step 2: Check if production tasks are in capability zones
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "<production-task>"}' \
    --output prod-surface.png

# Step 3: Compare difficulty of dev vs prod data
# (visual inspection - are prod queries outside green zones?)

Common findings:
  • Production queries exceed difficulty boundaries (outside green zones)
  • Truncation under production context sizes
  • Task distribution mismatch (dev focused on strong tasks, prod hits weak tasks)

Scenario 3: "Which tasks should I use this model for?"

# Generate complete task breakdown
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "target-model"}' \
    --format barpng --output task-breakdown.png

# For each strong task, verify capability zones
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "target-model", "tier": "logic"}' \
    --output-dir capability-zones/

Recommendation template:
  • Strongly recommended: [Tasks with score > 0.8 and large capability zones]
  • Use with caution: [Tasks with score 0.5-0.8 or small capability zones]
  • Avoid: [Tasks with score < 0.5 or no capability zones]

Scenario 4: "How does this model's cost/performance compare?"

# Get detailed statistics
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "target-model"}' \
    --format json --output model-stats.json

# Compare to baseline
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": ["target-model", "baseline-model"]}' \
    --format webpng --output cost-comparison.png

Analysis (a sketch follows this list):
  • Extract average tokens per task from the JSON
  • Multiply by $/1M tokens for a cost estimate
  • Compare accuracy vs cost across models
  • Identify the efficiency frontier (best accuracy per dollar)
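
A minimal sketch of that arithmetic, assuming the spider JSON schema shown under Spider JSON Format and a flat output-token price (both are assumptions to adapt; if the file contains multiple evals, loop over the records):

import json

PRICE_PER_M_TOKENS = 0.60  # assumed USD per 1M generated tokens - substitute your provider's rate

# Load the per-task statistics exported in Step 1 of this scenario.
with open("model-stats.json") as f:
    stats = json.load(f)

tasks = stats["tasks"]
avg_tokens = sum(t["tokens_mean"] for t in tasks) / len(tasks)
avg_score = stats["summary"]["avg_score"]
cost_per_question = avg_tokens / 1_000_000 * PRICE_PER_M_TOKENS

print(f"Avg tokens per question: {avg_tokens:.0f}")
print(f"Est. cost per question:  ${cost_per_question:.6f}")
print(f"Accuracy per dollar:     {avg_score / cost_per_question:.0f}")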

Common Pitfalls

Pitfall 1: Ignoring capability zones

Problem: Using the model outside safe difficulty ranges
Solution: Always check surface plots for critical tasks

Pitfall 2: Focusing only on aggregate scores

Problem: Missing task-specific weaknesses
Solution: Generate a spider plot to see the task-level breakdown

Pitfall 3: Not accounting for truncation

Problem: Model "works" but truncates frequently
Solution: Check truncation rates in the spider plot and surface plots

Pitfall 4: Misidentifying archetype

Problem: Treating an early breaker as a balanced generalist
Solution: Test at multiple difficulty levels, not just "hard"

Output Reference

Spider Web-PNG Format

Visual radar chart with:
  • Axes: One per task (typically 12)
  • Lines: One per model (multiple models can be overlaid)
  • Shaded area: Represents the model's capability profile
  • Shape: Identifies cognitive archetype

Spider Bar-PNG Format

Grouped bar chart with:
  • X-axis: Tasks
  • Y-axis: Performance metric (score)
  • Bars: One per model (grouped if multiple models)
  • Error bars: Confidence intervals
  • Colors: Distinguish models or difficulty levels

Spider JSON Format

{
  "eval_id": 0,
  "model_name": "Model A",
  "tasks": [
    {
      "base_task": "arithmetic",
      "score": 0.85,
      "ci": [0.82, 0.88],
      "tokens_mean": 1234,
      "tokens_std": 456,
      "truncation_rate": 0.02
    },
    ...
  ],
  "archetype": "balanced_generalist",
  "summary": {
    "avg_score": 0.82,
    "avg_tokens": 1150,
    "total_truncation_rate": 0.03
  }
}

Surface PNG Format

3D visualization with:
  • X-axis: First difficulty parameter (e.g., length)
  • Y-axis: Second difficulty parameter (e.g., depth)
  • Z-axis: Accuracy (color-coded)
  • Green spheres: Capability zones (high accuracy regions)
  • Color gradient: Red (failure) → Yellow (marginal) → Green (success)
  • Grid: Discretized difficulty space

Tips for LLM Agents

If you're an LLM agent using this workflow:

  1. Start with spider JSON for structured data:

    python analyze.py spiderweb <dataset> --filters <...> --format json

  2. Parse for archetype identification (a rough sketch follows this list):
     • Compute mean and stddev of task scores
     • Low stddev (<0.15) → Balanced generalist
     • High stddev (>0.3) → Task specialist or chaotic
     • Check truncation rates for the truncation victim pattern

  3. Identify weak tasks (score < 0.5):
     • Generate surface plots for each weak task
     • Note: Cannot automatically interpret PNGs yet - recommend to user

  4. Synthesize a deployment recommendation:

    "Model X is a [archetype] with strong performance on [high-scoring tasks]
    and weakness on [low-scoring tasks]. Average token consumption: [tokens].
    Truncation rate: [rate]. Recommended for [use cases matching strong tasks]."

  5. Red flags to report:
     • High truncation rate (>10%) → context limit issues
     • High token usage (>2x dataset median) → inefficiency
     • Low scores across all tasks (<0.5 avg) → universal failure
     • High variance (stddev >0.3) → unpredictable performance
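
If a structured check is useful, here is a rough Python sketch of the heuristics in steps 2 and 5. It assumes the spider JSON schema shown under Spider JSON Format (per-task score, base_task, and a summary block); the thresholds simply mirror the lists above and should be tuned to your dataset.

import json
import statistics

# Load the spider JSON produced in step 1 (filename is an example).
with open("spider-data.json") as f:
    profile = json.load(f)

scores = [t["score"] for t in profile["tasks"]]
mean, stddev = statistics.mean(scores), statistics.pstdev(scores)
truncation = profile["summary"]["total_truncation_rate"]

# Rough archetype guess from truncation rate and the spread of task scores.
if truncation > 0.10:
    label = "truncation victim / systemic truncation crisis"
elif mean < 0.5:
    label = "universal failure"
elif stddev < 0.15:
    label = "balanced generalist"
elif stddev > 0.3:
    label = "task specialist or chaotic performer"
else:
    label = "mixed profile - inspect the radar chart"

weak_tasks = [t["base_task"] for t in profile["tasks"] if t["score"] < 0.5]
print(f"mean={mean:.2f} stddev={stddev:.2f} truncation={truncation:.1%} -> {label}")
print("Weak tasks to surface-plot:", ", ".join(weak_tasks) or "none")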

Tool Documentation