
Workflow 4: Failure Diagnosis

Research Question: "Why/how/when did this model fail?"

Duration: 10-20 minutes

Objective

Root-cause analysis of model failures by investigating four information spaces:

  1. INPUT space (fft) - How is the problem represented by tokenization/chat-template?
  2. REASONING space (compression) - How is information being processed?
  3. OUTPUT space (surface) - Where does performance break down?
  4. TEMPORAL space (hazard) - When does thinking degrade over generation time?

This workflow transforms observations ("Model X fails at task Y") into mechanistic understanding ("Model X fails because [tokenization artifacts / reasoning inefficiency / working memory limits / temporal degradation]").

When to Use This Workflow

  • Model shows unexpected failure on specific tasks (identified in Workflow 3: Characterization)
  • Comparing why two models with similar aggregate scores fail differently
  • Understanding whether failures are fixable (prompt engineering, fine-tuning, architecture)
  • Research into LLM reasoning mechanisms
  • Forming hypotheses for model improvement
  • Diagnosing production failures

The Diagnostic Toolkit

1. surface - OUTPUT Space Analysis

Question: "Where does performance break down?"

What it reveals:

  • Accuracy boundaries across difficulty dimensions
  • Truncation onset zones
  • Capability zones (where the model succeeds)
  • Task-specific failure patterns

Output: 3D visualization of accuracy landscape


2. fft - INPUT Space Analysis

Question: "How is the problem represented?"

What it reveals:

  • Tokenization artifacts in the difficulty manifold
  • Chat-template encoding effects
  • Spectral signatures unique to the model's input processing
  • Key phenomena:
      • Information peak collapse when confounders are added (signal degradation)
      • Band-limited gain as problem length scales (frequency characteristics)

Output: Frequency-domain visualizations (RF-like spectral analysis)
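
Purely as a conceptual illustration of this frequency-domain view (not how the fft tool is implemented), the sketch below treats an accuracy-vs-difficulty curve as a signal and inspects its magnitude spectrum. The sampling grid, falloff shape, and periodic artifact are all invented for the example.

import numpy as np

def magnitude_spectrum(accuracy_by_difficulty: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of an accuracy curve sampled on a uniform difficulty grid."""
    signal = accuracy_by_difficulty - accuracy_by_difficulty.mean()  # drop the DC component
    return np.abs(np.fft.rfft(signal))

difficulty = np.linspace(0, 1, 64, endpoint=False)
smooth = 1 / (1 + np.exp(12 * (difficulty - 0.6)))              # clean capability falloff
artifact = smooth + 0.15 * np.sin(2 * np.pi * 16 * difficulty)  # hypothetical periodic tokenization artifact
# The artifact curve gains a spectral peak at bin 16 that the smooth curve lacks.
print(magnitude_spectrum(smooth)[16], magnitude_spectrum(artifact)[16])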


3. compression - REASONING Space Analysis

Question: "How is information being processed?"

What it reveals:

  • Entropic content of reasoning traces
  • Token vs entropy scatterplots per population (correct/incorrect/truncated)
  • Failure patterns:
      • Underthink: Low tokens, low entropy (insufficient reasoning)
      • Overthink: High tokens, low entropy (repetition/loops)
      • Broken loops: High tokens, flat entropy (stuck in degenerate state)

Output: Scatterplots and entropy distributions
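
As a rough mnemonic for this taxonomy (not part of analyze.py), the sketch below maps a single trace's token count and per-token entropy curve onto the three failure patterns. The thresholds and inputs are hypothetical, not calibrated against real traces.

import statistics

def classify_trace(tokens: int, entropy_curve: list[float],
                   token_budget: int = 4000, entropy_floor: float = 2.0) -> str:
    """Rough mapping onto the underthink / overthink / broken-loops taxonomy."""
    mean_entropy = statistics.fmean(entropy_curve)
    flat = statistics.pstdev(entropy_curve) < 0.1   # entropy barely changes over the trace
    if tokens > token_budget and flat:
        return "broken-loops"   # long trace stuck in a degenerate, low-information loop
    if tokens > token_budget and mean_entropy < entropy_floor:
        return "overthink"      # many tokens spent on repetitive reasoning
    if tokens < token_budget // 4 and mean_entropy < entropy_floor:
        return "underthink"     # stopped before producing enough information
    return "unclassified"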


4. hazard - TEMPORAL Space Analysis

Question: "When does thinking degrade?"

What it reveals:

  • Token generation as a temporal process (completion tokens ≈ time)
  • Population-adjusted incidence rates (failure risk over "thinking time")
  • Hazard ratios (how much thinking before breakdown)
  • Temporal failure curves

Output: Survival analysis curves and hazard plots
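
To make the survival framing concrete, here is a minimal sketch (independent of the hazard tool) that treats each response's completion token count as a time-to-event, where the event is an incorrect or truncated answer. It ignores censoring and population adjustment, which the real analysis handles, and the data values are invented.

import numpy as np

def naive_survival(tokens: np.ndarray, failed: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Fraction of responses with no failure at or before each token count on the grid."""
    return np.array([1.0 - np.mean((tokens <= t) & failed) for t in grid])

tokens = np.array([512, 900, 2100, 4100, 7800, 8200])        # hypothetical completion lengths
failed = np.array([False, False, True, False, True, True])   # incorrect or truncated responses
print(naive_survival(tokens, failed, grid=np.array([1000, 4000, 8000])))  # approx. [1.0, 0.833, 0.667]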


Basic Workflow: The Four-Lens Investigation

Prerequisites

You should have:

  • Model and task identified (from previous workflows)
  • A specific failure observed (e.g., "GPT-4o fails at arithmetic when length > 18")

cd /home/mike/ai/reasonscape
source venv/bin/activate

# Find eval_id if needed
python analyze.py evals data/dataset-m12x.json --search "target-model"

Step 1: OUTPUT Analysis - Where Does It Fail?

# Generate surface plot for failing task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/surfaces/

What to look for:

  • Accuracy cliff locations (where performance drops)
  • Truncation zones (where context limits hit)
  • Size/shape of capability zones (green spheres)
  • Difficulty parameter sensitivity (which dimensions matter most)

Key questions answered:

  • At what difficulty does failure occur?
  • Is it gradual decline or a sudden cliff?
  • Is truncation a factor?
  • Which difficulty dimensions are critical?
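
The "gradual decline or sudden cliff" question can also be checked numerically once you have per-bucket accuracies along one difficulty axis. This heuristic is not an analyze.py feature, and the accuracy values below are invented.

import numpy as np

def largest_drop(accuracy: np.ndarray) -> float:
    """Largest single-step drop in accuracy along one difficulty axis."""
    return float(np.max(-np.diff(accuracy), initial=0.0))

acc = np.array([0.95, 0.94, 0.92, 0.55, 0.20])  # hypothetical accuracy per length bucket
print(largest_drop(acc))  # ~0.37: one large step suggests a cliff rather than gradual decline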


Step 2: INPUT Analysis - Is It the Representation?

# Generate FFT analysis for same task
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/fft/

What to look for:

  • Spectral peak collapse (information loss due to confounders)
  • Band-limited gain patterns (scaling behavior)
  • Tokenization artifacts (frequency anomalies)
  • Comparison to other models on the same task (is this model-specific or task-specific?)

Key questions answered:

  • Does tokenization create pathological representations?
  • Are confounders causing information collapse?
  • Is the chat template interfering?
  • Would a different tokenizer help?


Step 3: REASONING Analysis - How Is It Processing?

# Generate compression analysis
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/compression/

What to look for:

  • Token vs entropy scatterplot patterns
  • Separation between correct/incorrect/truncated populations
  • Underthink pattern: Correct answers cluster at low tokens + reasonable entropy; incorrect at similarly low tokens but lower entropy
  • Overthink pattern: Incorrect answers at high tokens + low entropy (repetitive reasoning)
  • Broken loops: High tokens + flat entropy (stuck generating low-information loops)

Key questions answered:

  • Is the model thinking enough? (underthink)
  • Is the model wasting tokens? (overthink)
  • Are reasoning loops broken? (broken loops)
  • Does longer thinking help? (token vs accuracy correlation)
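
The last question ("does longer thinking help?") has a quick numeric check: the Pearson correlation between completion tokens and correctness (equivalently, the point-biserial correlation). This is a side calculation, not an analyze.py feature, and the values below are invented.

import numpy as np

tokens = np.array([800, 1500, 3200, 5200, 7600])   # completion tokens per response
correct = np.array([1, 1, 1, 0, 0])                # 1 = correct answer
print(np.corrcoef(tokens, correct)[0, 1])          # strongly negative here: longer thinking is not helping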


Step 4: TEMPORAL Analysis - When Does Thinking Break?

# Generate hazard analysis
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/hazard/

What to look for:

  • Hazard curves (risk of failure over token count)
  • Incidence rates (when failures occur)
  • Critical token thresholds (where hazard increases sharply)
  • Population-adjusted patterns (correct vs incorrect temporal dynamics)

Key questions answered:

  • How many tokens can the model "think" before degradation?
  • Is there a token-budget sweet spot?
  • Does extended generation help or hurt?
  • Are there temporal failure modes?


Step 5: Synthesis - Form Hypothesis

Combine insights from all four lenses:

Example synthesis 1: Tokenization Problem

  • OUTPUT: Failure at specific difficulty levels
  • INPUT: Spectral collapse for those difficulty levels
  • REASONING: Normal entropy patterns
  • TEMPORAL: Normal hazard curves
  • Hypothesis: Tokenization creates unrepresentable states at high difficulty
  • Action: Try a different tokenizer or chat template

Example synthesis 2: Working Memory Limit

  • OUTPUT: Accuracy cliff at length=18
  • INPUT: Normal spectral patterns
  • REASONING: Overthink pattern (high tokens, low entropy)
  • TEMPORAL: Hazard increases sharply at token=X
  • Hypothesis: Context window pressure causes reasoning breakdown
  • Action: Increase the context limit or use a sliding window

Example synthesis 3: Reasoning Inefficiency

  • OUTPUT: Gradual decline across difficulty
  • INPUT: Normal spectral patterns
  • REASONING: Broken loops (high tokens, flat entropy)
  • TEMPORAL: Early hazard increase
  • Hypothesis: Model enters degenerate reasoning states
  • Action: Fine-tune on chain-of-thought or use reasoning prompt templates
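
As a compact mnemonic for these constellations (purely illustrative, not part of analyze.py, with made-up labels for each lens reading), a synthesis step might look like:

def synthesize(output_lens: str, input_lens: str, reasoning_lens: str, temporal_lens: str) -> str:
    """Map coarse readings of the four lenses onto the candidate root causes above."""
    if input_lens == "spectral-collapse" and reasoning_lens == "normal":
        return "tokenization problem: try a different tokenizer or chat template"
    if output_lens == "cliff" and reasoning_lens == "overthink" and temporal_lens == "sharp-increase":
        return "working-memory limit: raise the context limit or use a sliding window"
    if reasoning_lens == "broken-loops" and temporal_lens == "early-increase":
        return "reasoning inefficiency: fine-tune on chain-of-thought or adjust prompting"
    return "no clear constellation: gather cross-model baselines before concluding"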


Advanced Diagnostic Patterns

Pattern 1: Cross-Model Comparison

Compare two models on same task to isolate differences:

# Surface comparison
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/surface-comparison/

# FFT comparison (critical for tokenizer effects)
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/fft-comparison/

# Compression comparison
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/compression-comparison/

Use case: "Model A succeeds where Model B fails - why?"


Pattern 2: Multi-Task Diagnosis

Investigate if failure mechanism is task-specific or systemic:

# Generate surfaces for all tasks
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>]}' \
    --output-dir diagnosis/all-surfaces/

# FFT for all tasks
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>]}' \
    --output-dir diagnosis/all-fft/

# Compression for weak tasks
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "tier": "logic"}' \
    --output-dir diagnosis/weak-tasks-compression/

Use case: "Model fails on multiple tasks - is it the same root cause?"


Pattern 3: Difficulty-Conditional Analysis

Focus on specific difficulty ranges:

# Surface for specific difficulty range
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target>], "base_task": "arithmetic", "length": [18, 24]}' \
    --output-dir diagnosis/high-difficulty/

# Compression conditioned on difficulty
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target>], "base_task": "arithmetic"}' \
    --axis length \
    --output-dir diagnosis/compression-by-length/

Use case: "Model fails only at high difficulty - what changes?"


Interpretation Guide

Surface Plot Interpretation

Healthy Pattern

Visual: [PLACEHOLDER - describe smooth surface with large green zones]
Meaning: [PLACEHOLDER]
Implications: [PLACEHOLDER]

Accuracy Cliff

Visual: [PLACEHOLDER - describe steep dropoff]
Meaning: [PLACEHOLDER]
Root causes: [PLACEHOLDER]
Next steps: [PLACEHOLDER]

Truncation Dominant

Visual: [PLACEHOLDER - describe truncation zones]
Meaning: [PLACEHOLDER]
Root causes: [PLACEHOLDER]
Next steps: [PLACEHOLDER]

Chaotic Surface

Visual: [PLACEHOLDER - describe irregular patterns]
Meaning: [PLACEHOLDER]
Root causes: [PLACEHOLDER]
Next steps: [PLACEHOLDER]


FFT Interpretation

Spectral Peak Collapse (Confounder Effect)

Visual: [PLACEHOLDER - describe peak flattening]
Meaning: [PLACEHOLDER]
Discovery: [PLACEHOLDER - your RF-like observation]
Implications: [PLACEHOLDER]
Root causes: [PLACEHOLDER]

Band-Limited Gain (Length Scaling)

Visual: [PLACEHOLDER - describe frequency response]
Meaning: [PLACEHOLDER]
Discovery: [PLACEHOLDER - your RF-like observation]
Implications: [PLACEHOLDER]
Comparison to baselines: [PLACEHOLDER]

Tokenization Artifacts

Visual: [PLACEHOLDER - describe anomalous frequencies]
Meaning: [PLACEHOLDER]
Diagnosis: [PLACEHOLDER]
Mitigation: [PLACEHOLDER]

Clean Spectrum

Visual: [PLACEHOLDER - describe healthy FFT]
Meaning: [PLACEHOLDER]
Implications: [PLACEHOLDER]


Compression Interpretation

Underthink Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Correct: Low tokens, reasonable entropy
  • Incorrect: Low tokens, lower entropy
Meaning: [PLACEHOLDER - insufficient reasoning]
Root causes: [PLACEHOLDER]
Solutions: [PLACEHOLDER - prompt for more thinking]

Overthink Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Correct: Moderate tokens, good entropy
  • Incorrect: High tokens, low entropy
Meaning: [PLACEHOLDER - repetitive reasoning]
Root causes: [PLACEHOLDER]
Solutions: [PLACEHOLDER - early stopping, fine-tuning]

Broken Loops Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Incorrect: Very high tokens, flat/low entropy
Meaning: [PLACEHOLDER - degenerate states]
Root causes: [PLACEHOLDER]
Solutions: [PLACEHOLDER - architecture fixes]

Healthy Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Good separation between correct/incorrect
  • Entropy scales with difficulty
Meaning: [PLACEHOLDER - effective reasoning]
Implications: [PLACEHOLDER]


Hazard Interpretation

Early Hazard Increase

Visual: [PLACEHOLDER - describe hazard curve]
Meaning: [PLACEHOLDER - thinking breaks down quickly]
Root causes: [PLACEHOLDER]
Implications: [PLACEHOLDER]
Solutions: [PLACEHOLDER]

Delayed Hazard Increase

Visual: [PLACEHOLDER - describe hazard curve]
Meaning: [PLACEHOLDER - sustained thinking]
Root causes: [PLACEHOLDER]
Implications: [PLACEHOLDER]
Use cases: [PLACEHOLDER]

Constant Hazard

Visual: [PLACEHOLDER - describe hazard curve]
Meaning: [PLACEHOLDER - memoryless failure]
Root causes: [PLACEHOLDER]
Implications: [PLACEHOLDER]

Temporal Threshold

Visual: [PLACEHOLDER - describe sharp hazard transition]
Meaning: [PLACEHOLDER - critical token budget]
Root causes: [PLACEHOLDER]
Optimization: [PLACEHOLDER - find sweet spot]


Example Research Scenarios

Scenario 1: "GPT-4o fails at arithmetic when length > 18"

# Step 1: Confirm failure boundary
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output gpt4o-arithmetic-surface.png

# Step 2: Check for tokenization issues
python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output gpt4o-arithmetic-fft.png

# Step 3: Analyze reasoning efficiency
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --axis length --output-dir gpt4o-compression/

# Step 4: Check temporal dynamics
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output-dir gpt4o-hazard/

# Step 5: Synthesize findings
# (visual inspection of all outputs)

Example finding:

  • Surface shows a cliff at length=18
  • FFT shows spectral collapse at length>15
  • Compression shows overthink pattern at high lengths
  • Hazard increases sharply at token=8000
  • Hypothesis: Tokenizer creates longer sequences → context pressure → overthinking → failure


Scenario 2: "Llama 3.3 vs Qwen 2.5 - why do they fail differently?"

# Comparative surface analysis
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-surfaces/

# Comparative FFT (critical for tokenizer differences)
python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-fft/

# Comparative compression
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-compression/

# Comparative hazard
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-hazard/

Example finding:

  • Surfaces: Llama maintains its plateau longer (boundary at length=21 vs 18)
  • FFT: Different spectral signatures (Qwen shows earlier peak collapse)
  • Compression: Qwen shows broken loops, Llama shows overthink
  • Hazard: Qwen hazard increases earlier
  • Hypothesis: Different tokenizers create different sequence lengths → different failure modes


Scenario 3: "My fine-tuned model is worse - what broke?"

# Compare base vs fine-tuned across all four lenses
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-surfaces/

python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-fft/

python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-compression/

python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-hazard/

Possible findings:

  • Surface: Capability zones shrunk → overfitting to a narrow distribution
  • FFT: Unchanged → not a representation issue
  • Compression: Underthink pattern appeared → fine-tuning reduced reasoning depth
  • Hazard: Earlier increase → temporal reasoning degraded
  • Hypothesis: Fine-tuning on short examples harmed long-context reasoning


Scenario 4: "Production failures at scale - diagnose quickly"

# Quick diagnostic pipeline
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "production-task"}' \
    --output prod-surface.png

python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "production-task"}' \
    --output-dir prod-compression/

# Check if production queries fall outside capability zones
# Check for overthink/broken loops in compression

Fast triage:

  • Surface → are production queries in capability zones?
  • Compression → is reasoning broken (overthink/loops)?
  • If both are normal → the failure may be out-of-distribution (not covered by the eval)


Common Pitfalls

Pitfall 1: Analyzing without context

Problem: Running all four tools without understanding what you're looking for.
Solution: Start with surface to identify the failure location, then use FFT/compression/hazard to investigate the mechanism.

Pitfall 2: Ignoring cross-model comparisons

Problem: Diagnosing a single model without baselines.
Solution: Always compare to a working model to isolate differences.

Pitfall 3: Over-interpreting single lens

Problem: Concluding tokenization is the issue based only on FFT.
Solution: Use all four lenses - patterns should corroborate across tools.

Pitfall 4: Missing task-specific patterns

Problem: Generalizing from a single-task analysis.
Solution: Run the diagnosis on multiple tasks to identify systemic vs task-specific issues.

Tips for LLM Agents

If you're an LLM agent using this workflow:

  1. Start with surface + compression (fastest diagnostic):

    python analyze.py surface <dataset> --filters <...> --output surface.png
    python analyze.py compression <dataset> --filters <...> --output-dir compression/
    

  2. Recommend FFT if:

     • Surface shows a task-specific cliff
     • Comparing models with different tokenizers
     • User mentions "representation" or "input" issues

  3. Recommend hazard if:

     • Compression shows temporal patterns
     • User mentions "long-form" or "extended reasoning"
     • High token counts correlate with failures

  4. Synthesize findings systematically:

     • OUTPUT (surface): "Failure occurs at [boundary]"
     • INPUT (fft): "Tokenization [normal / shows artifacts]"
     • REASONING (compression): "Model exhibits [underthink / overthink / broken loops]"
     • TEMPORAL (hazard): "Thinking degrades after [N] tokens"
     • HYPOTHESIS: "[root cause based on pattern constellation]"

  5. Red flags to escalate:

     • Broken loops pattern → architecture issue, hard to fix
     • Systemic truncation → context limits, needs architectural change
     • Spectral collapse → tokenizer incompatible with the task
     • Early hazard increase → reasoning instability
Tool Documentation