# Workflow 4: Failure Diagnosis
Research Question: "Why/how/when did this model fail?"
Duration: 10-20 minutes
## Objective
Root-cause analysis of model failures by investigating four information spaces:
- INPUT space (`fft`) - How is the problem represented by tokenization/chat-template?
- REASONING space (`compression`) - How is information being processed?
- OUTPUT space (`surface`) - Where does performance break down?
- TEMPORAL space (`hazard`) - When does thinking degrade over generation time?
This workflow transforms observations ("Model X fails at task Y") into mechanistic understanding ("Model X fails because [tokenization artifacts / reasoning inefficiency / working memory limits / temporal degradation]").
## When to Use This Workflow
- Model shows unexpected failure on specific tasks (identified in Workflow 3: Characterization)
- Comparing why two models with similar aggregate scores fail differently
- Understanding whether failures are fixable (prompt engineering, fine-tuning, architecture)
- Research into LLM reasoning mechanisms
- Forming hypotheses for model improvement
- Diagnosing production failures
## The Diagnostic Toolkit

### 1. `surface` - OUTPUT Space Analysis
Question: "Where does performance break down?"
What it reveals:

- Accuracy boundaries across difficulty dimensions
- Truncation onset zones
- Capability zones (where the model succeeds)
- Task-specific failure patterns
Output: 3D visualization of accuracy landscape
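The underlying idea is simple even though the tool renders it in 3D: bin per-sample correctness over difficulty axes and look at where mean accuracy falls off. A minimal sketch of that idea; the column names (`length`, `confounders`, `correct`) are hypothetical and not the tool's actual schema:

```python
# Illustration only - not the analyze.py surface implementation.
import pandas as pd

def accuracy_surface(records: pd.DataFrame, x: str = "length", y: str = "confounders") -> pd.DataFrame:
    """Mean accuracy per (x, y) difficulty cell: the raw material of a surface plot."""
    return records.pivot_table(index=y, columns=x, values="correct", aggfunc="mean")

def cliff_location(grid: pd.DataFrame, threshold: float = 0.5):
    """First x-level whose column-mean accuracy drops below the threshold, if any."""
    col_means = grid.mean(axis=0)
    below = col_means[col_means < threshold]
    return below.index[0] if len(below) else None
```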
### 2. `fft` - INPUT Space Analysis
Question: "How is the problem represented?"
What it reveals:

- Tokenization artifacts in the difficulty manifold
- Chat-template encoding effects
- Spectral signatures unique to the model's input processing
- Key phenomena:
    - Information peak collapse when confounders are added (signal degradation)
    - Band-limited gain as problem length scales (frequency characteristics)
Output: Frequency-domain visualizations (RF-like spectral analysis)
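To make the frequency-domain framing concrete, here is a generic sketch of spectral inspection: take a per-position signal derived from the tokenized prompt (token IDs are used below purely as a stand-in), compute its magnitude spectrum, and compare peak strength with and without confounders. The signal choice and metric are assumptions, not the tool's actual pipeline:

```python
# Illustration only - not analyze.py fft internals.
import numpy as np

def magnitude_spectrum(signal) -> np.ndarray:
    """Magnitude spectrum of a 1-D signal, mean-removed so the DC term does not dominate."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    return np.abs(np.fft.rfft(x))

def peak_collapse_ratio(clean_signal, confounded_signal) -> float:
    """< 1.0 means the dominant spectral peak shrank once confounders were added."""
    clean_peak = magnitude_spectrum(clean_signal).max()
    confounded_peak = magnitude_spectrum(confounded_signal).max()
    return confounded_peak / max(clean_peak, 1e-12)
```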
### 3. `compression` - REASONING Space Analysis
Question: "How is information being processed?"
What it reveals:

- Entropic content of reasoning traces
- Token vs entropy scatterplots per population (correct/incorrect/truncated)
- Failure patterns:
    - Underthink: low tokens, low entropy (insufficient reasoning)
    - Overthink: high tokens, low entropy (repetition/loops)
    - Broken loops: high tokens, flat entropy (stuck in a degenerate state)
Output: Scatterplots and entropy distributions
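One way to picture the token-vs-entropy scatter is with a compression ratio as a crude entropy proxy: repetitive or looping traces compress far better than genuinely informative ones. The sketch below uses zlib and hypothetical thresholds; the actual compression analysis may use a different estimator entirely:

```python
# Illustration only - the zlib ratio as an entropy proxy and the thresholds are assumptions.
import zlib

def entropy_proxy(trace: str) -> float:
    """Compressed size / raw size; low values indicate repetitive, low-information text."""
    raw = trace.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw) if raw else 0.0

def label_pattern(n_tokens: int, entropy: float, short: int = 200, long: int = 4000) -> str:
    """Map a (token count, entropy proxy) point to one of the failure patterns above."""
    if n_tokens > long and entropy < 0.15:
        return "broken loop"   # very long and nearly flat: stuck in a degenerate state
    if n_tokens > long and entropy < 0.35:
        return "overthink"     # lots of tokens, mostly repetition
    if n_tokens < short and entropy < 0.35:
        return "underthink"    # barely any reasoning, and low-information at that
    return "typical"
```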
### 4. `hazard` - TEMPORAL Space Analysis
Question: "When does thinking degrade?"
What it reveals:

- Token generation as a temporal process (completion tokens ≈ time)
- Population-adjusted incidence rates (failure risk over "thinking time")
- Hazard ratios (how much thinking before breakdown)
- Temporal failure curves
Output: Survival analysis curves and hazard plots
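The survival-analysis framing can be sketched as a discrete hazard: treat completion tokens as time, and for each token bin ask what fraction of the responses that reached it failed within it. A minimal version; the binning scheme and inputs are assumptions, not the tool's estimator:

```python
# Illustration only - a crude discrete-hazard estimate, not analyze.py hazard internals.
import numpy as np

def discrete_hazard(completion_tokens, failed, bin_width: int = 500):
    """Return (bin_edges, hazard): hazard[i] = failures inside bin i / responses that reached it."""
    tokens = np.asarray(completion_tokens)
    failed = np.asarray(failed, dtype=bool)
    edges = np.arange(0, tokens.max() + bin_width, bin_width)
    hazard = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        at_risk = tokens >= lo                        # still generating at the start of the bin
        events = at_risk & failed & (tokens < hi)     # ...and ended in failure within the bin
        hazard.append(events.sum() / max(at_risk.sum(), 1))
    return edges, np.asarray(hazard)
```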
## Basic Workflow: The Four-Lens Investigation

### Prerequisites

You should have:

- Model and task identified (from previous workflows)
- Specific failure observed (e.g., "GPT-4o fails at arithmetic when length > 18")
```bash
cd /home/mike/ai/reasonscape
source venv/bin/activate

# Find eval_id if needed
python analyze.py evals data/dataset-m12x.json --search "target-model"
```
### Step 1: OUTPUT Analysis - Where Does It Fail?
```bash
# Generate surface plot for failing task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/surfaces/
```
What to look for:

- Accuracy cliff locations (where performance drops)
- Truncation zones (where context limits hit)
- Size/shape of capability zones (green spheres)
- Difficulty parameter sensitivity (which dimensions matter most)

Key questions answered:

- At what difficulty does failure occur?
- Is it a gradual decline or a sudden cliff?
- Is truncation a factor?
- Which difficulty dimensions are critical?
### Step 2: INPUT Analysis - Is It the Representation?
```bash
# Generate FFT analysis for same task
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/fft/
```
What to look for:

- Spectral peak collapse (information loss due to confounders)
- Band-limited gain patterns (scaling behavior)
- Tokenization artifacts (frequency anomalies)
- Comparison to other models on the same task (is this model-specific or task-specific?)

Key questions answered:

- Does tokenization create pathological representations?
- Are confounders causing information collapse?
- Is the chat template interfering?
- Would a different tokenizer help?
### Step 3: REASONING Analysis - How Is It Processing?
```bash
# Generate compression analysis
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/compression/
```
What to look for:

- Token vs entropy scatterplot patterns
- Separation between correct/incorrect/truncated populations
- Underthink pattern: Correct answers cluster at low tokens + reasonable entropy; incorrect at similarly low tokens but lower entropy
- Overthink pattern: Incorrect answers at high tokens + low entropy (repetitive reasoning)
- Broken loops: High tokens + flat entropy (stuck generating low-information loops)

Key questions answered:

- Is the model thinking enough? (underthink)
- Is the model wasting tokens? (overthink)
- Are reasoning loops broken? (broken loops)
- Does longer thinking help? (token vs accuracy correlation)
### Step 4: TEMPORAL Analysis - When Does Thinking Break?
```bash
# Generate hazard analysis
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/hazard/
```
What to look for:

- Hazard curves (risk of failure over token count)
- Incidence rates (when failures occur)
- Critical token thresholds (where hazard increases sharply)
- Population-adjusted patterns (correct vs incorrect temporal dynamics)

Key questions answered:

- How many tokens can the model "think" before degradation?
- Is there a token budget sweet spot?
- Does extended generation help or hurt?
- Are there temporal failure modes?
### Step 5: Synthesis - Form Hypothesis
Combine insights from all four lenses:
Example synthesis 1: Tokenization Problem

- OUTPUT: Failure at specific difficulty levels
- INPUT: Spectral collapse for those difficulty levels
- REASONING: Normal entropy patterns
- TEMPORAL: Normal hazard curves
- Hypothesis: Tokenization creates unrepresentable states at high difficulty
- Action: Try a different tokenizer or chat template

Example synthesis 2: Working Memory Limit

- OUTPUT: Accuracy cliff at length=18
- INPUT: Normal spectral patterns
- REASONING: Overthink pattern (high tokens, low entropy)
- TEMPORAL: Hazard increases sharply at token=X
- Hypothesis: Context window pressure causes reasoning breakdown
- Action: Increase context limit or use a sliding window

Example synthesis 3: Reasoning Inefficiency

- OUTPUT: Gradual decline across difficulty
- INPUT: Normal spectral patterns
- REASONING: Broken loops (high tokens, flat entropy)
- TEMPORAL: Early hazard increase
- Hypothesis: Model enters degenerate reasoning states
- Action: Fine-tune on chain-of-thought or use reasoning prompt templates
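If it helps to keep the synthesis step systematic, the three example constellations above can be written down as data and matched against your own observations. The entries and wording below are illustrative only; extend the table as you discover new patterns:

```python
# Illustrative encoding of the example syntheses above - not an exhaustive taxonomy.
CONSTELLATIONS = [
    {"output": "failure at specific difficulty levels",
     "input": "spectral collapse at those levels",
     "reasoning": "normal entropy patterns",
     "temporal": "normal hazard curves",
     "hypothesis": "tokenization creates unrepresentable states at high difficulty",
     "action": "try a different tokenizer or chat template"},
    {"output": "accuracy cliff at a length threshold",
     "input": "normal spectral patterns",
     "reasoning": "overthink (high tokens, low entropy)",
     "temporal": "hazard rises sharply at a token threshold",
     "hypothesis": "context-window pressure causes reasoning breakdown",
     "action": "increase the context limit or use a sliding window"},
    {"output": "gradual decline across difficulty",
     "input": "normal spectral patterns",
     "reasoning": "broken loops (high tokens, flat entropy)",
     "temporal": "early hazard increase",
     "hypothesis": "model enters degenerate reasoning states",
     "action": "fine-tune on chain-of-thought or adjust reasoning prompts"},
]
```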
## Advanced Diagnostic Patterns

### Pattern 1: Cross-Model Comparison
Compare two models on the same task to isolate differences:
```bash
# Surface comparison
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/surface-comparison/

# FFT comparison (critical for tokenizer effects)
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/fft-comparison/

# Compression comparison
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/compression-comparison/
```
Use case: "Model A succeeds where Model B fails - why?"
### Pattern 2: Multi-Task Diagnosis

Investigate whether the failure mechanism is task-specific or systemic:
```bash
# Generate surfaces for all tasks
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>]}' \
    --output-dir diagnosis/all-surfaces/

# FFT for all tasks
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>]}' \
    --output-dir diagnosis/all-fft/

# Compression for weak tasks
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "tier": "logic"}' \
    --output-dir diagnosis/weak-tasks-compression/
```
Use case: "Model fails on multiple tasks - is it the same root cause?"
### Pattern 3: Difficulty-Conditional Analysis
Focus on specific difficulty ranges:
```bash
# Surface for specific difficulty range
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target>], "base_task": "arithmetic", "length": [18, 24]}' \
    --output-dir diagnosis/high-difficulty/

# Compression conditioned on difficulty
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target>], "base_task": "arithmetic"}' \
    --axis length \
    --output-dir diagnosis/compression-by-length/
```
Use case: "Model fails only at high difficulty - what changes?"
## Interpretation Guide

### Surface Plot Interpretation

#### Healthy Pattern

- Visual: [PLACEHOLDER - describe smooth surface with large green zones]
- Meaning: [PLACEHOLDER]
- Implications: [PLACEHOLDER]

#### Accuracy Cliff

- Visual: [PLACEHOLDER - describe steep dropoff]
- Meaning: [PLACEHOLDER]
- Root causes: [PLACEHOLDER]
- Next steps: [PLACEHOLDER]

#### Truncation Dominant

- Visual: [PLACEHOLDER - describe truncation zones]
- Meaning: [PLACEHOLDER]
- Root causes: [PLACEHOLDER]
- Next steps: [PLACEHOLDER]

#### Chaotic Surface

- Visual: [PLACEHOLDER - describe irregular patterns]
- Meaning: [PLACEHOLDER]
- Root causes: [PLACEHOLDER]
- Next steps: [PLACEHOLDER]
### FFT Interpretation

#### Spectral Peak Collapse (Confounder Effect)

- Visual: [PLACEHOLDER - describe peak flattening]
- Meaning: [PLACEHOLDER]
- Discovery: [PLACEHOLDER - your RF-like observation]
- Implications: [PLACEHOLDER]
- Root causes: [PLACEHOLDER]

#### Band-Limited Gain (Length Scaling)

- Visual: [PLACEHOLDER - describe frequency response]
- Meaning: [PLACEHOLDER]
- Discovery: [PLACEHOLDER - your RF-like observation]
- Implications: [PLACEHOLDER]
- Comparison to baselines: [PLACEHOLDER]

#### Tokenization Artifacts

- Visual: [PLACEHOLDER - describe anomalous frequencies]
- Meaning: [PLACEHOLDER]
- Diagnosis: [PLACEHOLDER]
- Mitigation: [PLACEHOLDER]

#### Clean Spectrum

- Visual: [PLACEHOLDER - describe healthy FFT]
- Meaning: [PLACEHOLDER]
- Implications: [PLACEHOLDER]
### Compression Interpretation

#### Underthink Pattern

- Visual: [PLACEHOLDER - describe token/entropy scatter]
- Characteristics:
    - Correct: Low tokens, reasonable entropy
    - Incorrect: Low tokens, lower entropy
- Meaning: [PLACEHOLDER - insufficient reasoning]
- Root causes: [PLACEHOLDER]
- Solutions: [PLACEHOLDER - prompt for more thinking]

#### Overthink Pattern

- Visual: [PLACEHOLDER - describe token/entropy scatter]
- Characteristics:
    - Correct: Moderate tokens, good entropy
    - Incorrect: High tokens, low entropy
- Meaning: [PLACEHOLDER - repetitive reasoning]
- Root causes: [PLACEHOLDER]
- Solutions: [PLACEHOLDER - early stopping, fine-tuning]

#### Broken Loops Pattern

- Visual: [PLACEHOLDER - describe token/entropy scatter]
- Characteristics:
    - Incorrect: Very high tokens, flat/low entropy
- Meaning: [PLACEHOLDER - degenerate states]
- Root causes: [PLACEHOLDER]
- Solutions: [PLACEHOLDER - architecture fixes]

#### Healthy Pattern

- Visual: [PLACEHOLDER - describe token/entropy scatter]
- Characteristics:
    - Good separation between correct/incorrect
    - Entropy scales with difficulty
- Meaning: [PLACEHOLDER - effective reasoning]
- Implications: [PLACEHOLDER]
### Hazard Interpretation

#### Early Hazard Increase

- Visual: [PLACEHOLDER - describe hazard curve]
- Meaning: [PLACEHOLDER - thinking breaks down quickly]
- Root causes: [PLACEHOLDER]
- Implications: [PLACEHOLDER]
- Solutions: [PLACEHOLDER]

#### Delayed Hazard Increase

- Visual: [PLACEHOLDER - describe hazard curve]
- Meaning: [PLACEHOLDER - sustained thinking]
- Root causes: [PLACEHOLDER]
- Implications: [PLACEHOLDER]
- Use cases: [PLACEHOLDER]

#### Constant Hazard

- Visual: [PLACEHOLDER - describe hazard curve]
- Meaning: [PLACEHOLDER - memoryless failure]
- Root causes: [PLACEHOLDER]
- Implications: [PLACEHOLDER]

#### Temporal Threshold

- Visual: [PLACEHOLDER - describe sharp hazard transition]
- Meaning: [PLACEHOLDER - critical token budget]
- Root causes: [PLACEHOLDER]
- Optimization: [PLACEHOLDER - find sweet spot]
## Example Research Scenarios

### Scenario 1: "GPT-4o fails at arithmetic when length > 18"
```bash
# Step 1: Confirm failure boundary
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output gpt4o-arithmetic-surface.png

# Step 2: Check for tokenization issues
python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output gpt4o-arithmetic-fft.png

# Step 3: Analyze reasoning efficiency
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --axis length --output-dir gpt4o-compression/

# Step 4: Check temporal dynamics
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output-dir gpt4o-hazard/

# Step 5: Synthesize findings
# (visual inspection of all outputs)
```
Example finding:

- Surface shows cliff at length=18
- FFT shows spectral collapse at length>15
- Compression shows overthink pattern at high lengths
- Hazard increases sharply at token=8000
- Hypothesis: Tokenizer creates longer sequences → context pressure → overthinking → failure
### Scenario 2: "Llama 3.3 vs Qwen 2.5 - why do they fail differently?"
```bash
# Comparative surface analysis
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-surfaces/

# Comparative FFT (critical for tokenizer differences)
python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-fft/

# Comparative compression
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-compression/

# Comparative hazard
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-hazard/
```
Example finding:

- Surfaces: Llama maintains plateau longer (boundary at length=21 vs 18)
- FFT: Different spectral signatures (Qwen shows earlier peak collapse)
- Compression: Qwen shows broken loops, Llama shows overthink
- Hazard: Qwen hazard increases earlier
- Hypothesis: Different tokenizers create different sequence lengths → different failure modes
### Scenario 3: "My fine-tuned model is worse - what broke?"
```bash
# Compare base vs fine-tuned across all four lenses
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-surfaces/

python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-fft/

python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-compression/

python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-hazard/
```
Possible findings:

- Surface: Capability zones shrunk → overfitting to narrow distribution
- FFT: Unchanged → not a representation issue
- Compression: Underthink pattern appeared → fine-tuning reduced reasoning depth
- Hazard: Earlier increase → temporal reasoning degraded
- Hypothesis: Fine-tuning on short examples harmed long-context reasoning
### Scenario 4: "Production failures at scale - diagnose quickly"
```bash
# Quick diagnostic pipeline
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "production-task"}' \
    --output prod-surface.png

python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "production-task"}' \
    --output-dir prod-compression/

# Check if production queries fall outside capability zones
# Check for overthink/broken loops in compression
```
Fast triage:

- Surface → are production queries in capability zones?
- Compression → is reasoning broken (overthink/loops)?
- If both normal → failure may be out-of-distribution (not covered by eval)
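The triage above reduces to two judgments; a trivial sketch of the decision, where both inputs come from your reading of the plots rather than from tool output:

```python
# Illustration of the triage decision; both inputs come from visual inspection.
def triage(queries_in_capability_zone: bool, reasoning_broken: bool) -> str:
    if not queries_in_capability_zone:
        return "production queries exceed measured capability - revisit routing or difficulty"
    if reasoning_broken:
        return "reasoning failure (overthink / broken loops) - diagnose as in Step 3"
    return "eval looks clean - failure is likely out-of-distribution for this eval"
```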
## Common Pitfalls

### Pitfall 1: Analyzing without context

Problem: Running all four tools without understanding what you're looking for

Solution: Start with surface to identify the failure location, then use FFT/compression/hazard to investigate the mechanism

### Pitfall 2: Ignoring cross-model comparisons

Problem: Diagnosing a single model without baselines

Solution: Always compare to a working model to isolate differences

### Pitfall 3: Over-interpreting a single lens

Problem: Concluding tokenization is the issue based only on FFT

Solution: Use all four lenses - patterns should corroborate across tools

### Pitfall 4: Missing task-specific patterns

Problem: Generalizing from a single-task analysis

Solution: Run the diagnosis on multiple tasks to identify systemic vs task-specific issues
## Tips for LLM Agents
If you're an LLM agent using this workflow:
- Start with surface + compression (fastest diagnostic):

    ```bash
    python analyze.py surface <dataset> --filters <...> --output surface.png
    python analyze.py compression <dataset> --filters <...> --output-dir compression/
    ```

- Recommend FFT if:
    - Surface shows a task-specific cliff
    - Comparing models with different tokenizers
    - User mentions "representation" or "input" issues

- Recommend hazard if:
    - Compression shows temporal patterns
    - User mentions "long-form" or "extended reasoning"
    - High token counts correlate with failures

- Synthesize findings systematically:
    - OUTPUT (surface): "Failure occurs at [boundary]"
    - INPUT (fft): "Tokenization [normal / shows artifacts]"
    - REASONING (compression): "Model exhibits [underthink / overthink / broken loops]"
    - TEMPORAL (hazard): "Thinking degrades after [N] tokens"
    - HYPOTHESIS: "[root cause based on pattern constellation]"

- Red flags to escalate:
    - Broken loops pattern → architecture issue, hard to fix
    - Systemic truncation → context limits, need architectural change
    - Spectral collapse → tokenizer incompatible with task
    - Early hazard increase → reasoning instability
## Related Workflows
- Workflow 3: Characterization - Identify failures to diagnose
- Workflow 2: Comparative - Determine if failures are significant
- Workflow 1: Ranking - Initial triage before diagnosis
## Tool Documentation
- analyze.py surface reference - OUTPUT space analysis
- analyze.py fft reference - INPUT space analysis
- analyze.py compression reference - REASONING space analysis
- analyze.py hazard reference - TEMPORAL space analysis
- Methodology - Statistical foundations
- Architecture: Stage 5 - The forensic triumvirate