
Workflow 4: Failure Diagnosis

Research Question: "Why/how/when did this model fail?"

Duration: 10-20 minutes

Objective

Root-cause analysis of model failures by investigating four information spaces:

  1. INPUT space (fft) - How is the problem represented by tokenization/chat-template?
  2. REASONING space (compression) - How is information being processed?
  3. OUTPUT space (surface) - Where does performance break down?
  4. TEMPORAL space (hazard) - When does thinking degrade over generation time?

This workflow transforms observations ("Model X fails at task Y") into mechanistic understanding ("Model X fails because [tokenization artifacts / reasoning inefficiency / working memory limits / temporal degradation]").

When to Use This Workflow

  • Model shows unexpected failure on specific tasks (identified in Workflow 3: Characterization)
  • Comparing why two models with similar aggregate scores fail differently
  • Understanding whether failures are fixable (prompt engineering, fine-tuning, architecture)
  • Research into LLM reasoning mechanisms
  • Forming hypotheses for model improvement
  • Diagnosing production failures

The Diagnostic Toolkit

1. surface - OUTPUT Space Analysis

Question: "Where does performance break down?"

What it reveals:

  • Accuracy boundaries across difficulty dimensions
  • Truncation onset zones
  • Capability zones (where the model succeeds)
  • Task-specific failure patterns

Output: 3D visualization of accuracy landscape


2. fft - INPUT Space Analysis

Question: "How is the problem represented?"

What it reveals:

  • Tokenization artifacts in the difficulty manifold
  • Chat-template encoding effects
  • Spectral signatures unique to the model's input processing
  • Key phenomena:
      • Information peak collapse when confounders are added (signal degradation)
      • Band-limited gain as problem length scales (frequency characteristics)

Output: Frequency-domain visualizations (RF-like spectral analysis)
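
Purely as a conceptual illustration of this frequency-domain view (not how the fft tool is implemented), the sketch below treats an accuracy-vs-difficulty curve as a signal and inspects its magnitude spectrum. The sampling grid, falloff shape, and periodic artifact are all invented for the example.

import numpy as np

def magnitude_spectrum(accuracy_by_difficulty: np.ndarray) -> np.ndarray:
    """Magnitude spectrum of an accuracy curve sampled on a uniform difficulty grid."""
    signal = accuracy_by_difficulty - accuracy_by_difficulty.mean()  # drop the DC component
    return np.abs(np.fft.rfft(signal))

difficulty = np.linspace(0, 1, 64, endpoint=False)
smooth = 1 / (1 + np.exp(12 * (difficulty - 0.6)))              # clean capability falloff
artifact = smooth + 0.15 * np.sin(2 * np.pi * 16 * difficulty)  # hypothetical periodic tokenization artifact
# The artifact curve gains a spectral peak at bin 16 that the smooth curve lacks.
print(magnitude_spectrum(smooth)[16], magnitude_spectrum(artifact)[16])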


3. compression - REASONING Space Analysis

Question: "How is information being processed?"

What it reveals:

  • Entropic content of reasoning traces
  • Token vs entropy scatterplots per population (correct/incorrect/truncated)
  • Failure patterns:
      • Underthink: Low tokens, low entropy (insufficient reasoning)
      • Overthink: High tokens, low entropy (repetition/loops)
      • Broken loops: High tokens, flat entropy (stuck in degenerate state)

Output: Scatterplots and entropy distributions
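
As a rough mnemonic for this taxonomy (not part of analyze.py), the sketch below maps a single trace's token count and per-token entropy curve onto the three failure patterns. The thresholds and inputs are hypothetical, not calibrated against real traces.

import statistics

def classify_trace(tokens: int, entropy_curve: list[float],
                   token_budget: int = 4000, entropy_floor: float = 2.0) -> str:
    """Rough mapping onto the underthink / overthink / broken-loops taxonomy."""
    mean_entropy = statistics.fmean(entropy_curve)
    flat = statistics.pstdev(entropy_curve) < 0.1   # entropy barely changes over the trace
    if tokens > token_budget and flat:
        return "broken-loops"   # long trace stuck in a degenerate, low-information loop
    if tokens > token_budget and mean_entropy < entropy_floor:
        return "overthink"      # many tokens spent on repetitive reasoning
    if tokens < token_budget // 4 and mean_entropy < entropy_floor:
        return "underthink"     # stopped before producing enough information
    return "unclassified"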


4. hazard - TEMPORAL Space Analysis

Question: "When does thinking degrade?"

What it reveals:

  • Token generation as a temporal process (completion tokens ≈ time)
  • Population-adjusted incidence rates (failure risk over "thinking time")
  • Hazard ratios (how much thinking before breakdown)
  • Temporal failure curves

Output: Survival analysis curves and hazard plots
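
To make the survival framing concrete, here is a minimal sketch (independent of the hazard tool) that treats each response's completion token count as a time-to-event, where the event is an incorrect or truncated answer. It ignores censoring and population adjustment, which the real analysis handles, and the data values are invented.

import numpy as np

def naive_survival(tokens: np.ndarray, failed: np.ndarray, grid: np.ndarray) -> np.ndarray:
    """Fraction of responses with no failure at or before each token count on the grid."""
    return np.array([1.0 - np.mean((tokens <= t) & failed) for t in grid])

tokens = np.array([512, 900, 2100, 4100, 7800, 8200])        # hypothetical completion lengths
failed = np.array([False, False, True, False, True, True])   # incorrect or truncated responses
print(naive_survival(tokens, failed, grid=np.array([1000, 4000, 8000])))  # approx. [1.0, 0.833, 0.667]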


Basic Workflow: The Four-Lens Investigation

Prerequisites

You should have:

  • Model and task identified (from previous workflows)
  • A specific failure observed (e.g., "GPT-4o fails at arithmetic when length > 18")

cd /home/mike/ai/reasonscape
source venv/bin/activate

# Find eval_id if needed
python analyze.py evals data/dataset-m12x.json --search "target-model"

Step 1: OUTPUT Analysis - Where Does It Fail?

# Generate surface plot for failing task
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/surfaces/

What to look for:

  • Accuracy cliff locations (where performance drops)
  • Truncation zones (where context limits hit)
  • Size/shape of capability zones (green spheres)
  • Difficulty parameter sensitivity (which dimensions matter most)

Key questions answered:

  • At what difficulty does failure occur?
  • Is it gradual decline or a sudden cliff?
  • Is truncation a factor?
  • Which difficulty dimensions are critical?
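
The "gradual decline or sudden cliff" question can also be checked numerically once you have per-bucket accuracies along one difficulty axis. This heuristic is not an analyze.py feature, and the accuracy values below are invented.

import numpy as np

def largest_drop(accuracy: np.ndarray) -> float:
    """Largest single-step drop in accuracy along one difficulty axis."""
    return float(np.max(-np.diff(accuracy), initial=0.0))

acc = np.array([0.95, 0.94, 0.92, 0.55, 0.20])  # hypothetical accuracy per length bucket
print(largest_drop(acc))  # ~0.37: one large step suggests a cliff rather than gradual decline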


Step 2: INPUT Analysis - Is It the Representation?

# Generate FFT analysis for same task
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/fft/

What to look for:

  • Spectral peak collapse (information loss due to confounders)
  • Band-limited gain patterns (scaling behavior)
  • Tokenization artifacts (frequency anomalies)
  • Comparison to other models on the same task (is this model-specific or task-specific?)

Key questions answered:

  • Does tokenization create pathological representations?
  • Are confounders causing information collapse?
  • Is the chat template interfering?
  • Would a different tokenizer help?


Step 3: REASONING Analysis - How Is It Processing?

# Generate compression analysis
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/compression/

What to look for:

  • Token vs entropy scatterplot patterns
  • Separation between correct/incorrect/truncated populations
  • Underthink pattern: Correct answers cluster at low tokens + reasonable entropy; incorrect at similarly low tokens but lower entropy
  • Overthink pattern: Incorrect answers at high tokens + low entropy (repetitive reasoning)
  • Broken loops: High tokens + flat entropy (stuck generating low-information loops)

Key questions answered:

  • Is the model thinking enough? (underthink)
  • Is the model wasting tokens? (overthink)
  • Are reasoning loops broken? (broken loops)
  • Does longer thinking help? (token vs accuracy correlation)
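
The last question ("does longer thinking help?") has a quick numeric check: the Pearson correlation between completion tokens and correctness (equivalently, the point-biserial correlation). This is a side calculation, not an analyze.py feature, and the values below are invented.

import numpy as np

tokens = np.array([800, 1500, 3200, 5200, 7600])   # completion tokens per response
correct = np.array([1, 1, 1, 0, 0])                # 1 = correct answer
print(np.corrcoef(tokens, correct)[0, 1])          # strongly negative here: longer thinking is not helping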


Step 4: TEMPORAL Analysis - When Does Thinking Break?

# Generate hazard analysis
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/hazard/

What to look for:

  • Hazard curves (risk of failure over token count)
  • Incidence rates (when failures occur)
  • Critical token thresholds (where hazard increases sharply)
  • Population-adjusted patterns (correct vs incorrect temporal dynamics)

Key questions answered:

  • How many tokens can the model "think" before degradation?
  • Is there a token-budget sweet spot?
  • Does extended generation help or hurt?
  • Are there temporal failure modes?


Step 5: Synthesis - Form Hypothesis

Combine insights from all four lenses:

Example synthesis 1: Tokenization Problem

  • OUTPUT: Failure at specific difficulty levels
  • INPUT: Spectral collapse for those difficulty levels
  • REASONING: Normal entropy patterns
  • TEMPORAL: Normal hazard curves
  • Hypothesis: Tokenization creates unrepresentable states at high difficulty
  • Action: Try a different tokenizer or chat template

Example synthesis 2: Working Memory Limit

  • OUTPUT: Accuracy cliff at length=18
  • INPUT: Normal spectral patterns
  • REASONING: Overthink pattern (high tokens, low entropy)
  • TEMPORAL: Hazard increases sharply at token=X
  • Hypothesis: Context window pressure causes reasoning breakdown
  • Action: Increase the context limit or use a sliding window

Example synthesis 3: Reasoning Inefficiency

  • OUTPUT: Gradual decline across difficulty
  • INPUT: Normal spectral patterns
  • REASONING: Broken loops (high tokens, flat entropy)
  • TEMPORAL: Early hazard increase
  • Hypothesis: Model enters degenerate reasoning states
  • Action: Fine-tune on chain-of-thought or use reasoning prompt templates
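
As a compact mnemonic for these constellations (purely illustrative, not part of analyze.py, with made-up labels for each lens reading), a synthesis step might look like:

def synthesize(output_lens: str, input_lens: str, reasoning_lens: str, temporal_lens: str) -> str:
    """Map coarse readings of the four lenses onto the candidate root causes above."""
    if input_lens == "spectral-collapse" and reasoning_lens == "normal":
        return "tokenization problem: try a different tokenizer or chat template"
    if output_lens == "cliff" and reasoning_lens == "overthink" and temporal_lens == "sharp-increase":
        return "working-memory limit: raise the context limit or use a sliding window"
    if reasoning_lens == "broken-loops" and temporal_lens == "early-increase":
        return "reasoning inefficiency: fine-tune on chain-of-thought or adjust prompting"
    return "no clear constellation: gather cross-model baselines before concluding"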


Advanced Diagnostic Patterns

Pattern 1: Cross-Model Comparison

Compare two models on same task to isolate differences:

# Surface comparison
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/surface-comparison/

# FFT comparison (critical for tokenizer effects)
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/fft-comparison/

# Compression comparison
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<model-a>, <model-b>], "base_task": "arithmetic"}' \
    --output-dir diagnosis/compression-comparison/

Use case: "Model A succeeds where Model B fails - why?"


Pattern 2: Multi-Task Diagnosis

Investigate if failure mechanism is task-specific or systemic:

# Generate surfaces for all tasks
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>]}' \
    --output-dir diagnosis/all-surfaces/

# FFT for all tasks
python analyze.py fft data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>]}' \
    --output-dir diagnosis/all-fft/

# Compression for weak tasks
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target-eval>], "tier": "logic"}' \
    --output-dir diagnosis/weak-tasks-compression/

Use case: "Model fails on multiple tasks - is it the same root cause?"


Pattern 3: Difficulty-Conditional Analysis

Focus on specific difficulty ranges:

# Surface for specific difficulty range
python analyze.py surface data/dataset-m12x.json \
    --filters '{"eval_id": [<target>], "base_task": "arithmetic", "length": [18, 24]}' \
    --output-dir diagnosis/high-difficulty/

# Compression conditioned on difficulty
python analyze.py compression data/dataset-m12x.json \
    --filters '{"eval_id": [<target>], "base_task": "arithmetic"}' \
    --axis length \
    --output-dir diagnosis/compression-by-length/

Use case: "Model fails only at high difficulty - what changes?"


Interpretation Guide

Surface Plot Interpretation

Healthy Pattern

Visual: [PLACEHOLDER - describe smooth surface with large green zones]
Meaning: [PLACEHOLDER]
Implications: [PLACEHOLDER]

Accuracy Cliff

Visual: [PLACEHOLDER - describe steep dropoff]
Meaning: [PLACEHOLDER]
Root causes: [PLACEHOLDER]
Next steps: [PLACEHOLDER]

Truncation Dominant

Visual: [PLACEHOLDER - describe truncation zones]
Meaning: [PLACEHOLDER]
Root causes: [PLACEHOLDER]
Next steps: [PLACEHOLDER]

Chaotic Surface

Visual: [PLACEHOLDER - describe irregular patterns]
Meaning: [PLACEHOLDER]
Root causes: [PLACEHOLDER]
Next steps: [PLACEHOLDER]


FFT Interpretation

Spectral Peak Collapse (Confounder Effect)

Visual: [PLACEHOLDER - describe peak flattening]
Meaning: [PLACEHOLDER]
Discovery: [PLACEHOLDER - your RF-like observation]
Implications: [PLACEHOLDER]
Root causes: [PLACEHOLDER]

Band-Limited Gain (Length Scaling)

Visual: [PLACEHOLDER - describe frequency response]
Meaning: [PLACEHOLDER]
Discovery: [PLACEHOLDER - your RF-like observation]
Implications: [PLACEHOLDER]
Comparison to baselines: [PLACEHOLDER]

Tokenization Artifacts

Visual: [PLACEHOLDER - describe anomalous frequencies]
Meaning: [PLACEHOLDER]
Diagnosis: [PLACEHOLDER]
Mitigation: [PLACEHOLDER]

Clean Spectrum

Visual: [PLACEHOLDER - describe healthy FFT]
Meaning: [PLACEHOLDER]
Implications: [PLACEHOLDER]


Compression Interpretation

Underthink Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Correct: Low tokens, reasonable entropy
  • Incorrect: Low tokens, lower entropy
Meaning: [PLACEHOLDER - insufficient reasoning]
Root causes: [PLACEHOLDER]
Solutions: [PLACEHOLDER - prompt for more thinking]

Overthink Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Correct: Moderate tokens, good entropy
  • Incorrect: High tokens, low entropy
Meaning: [PLACEHOLDER - repetitive reasoning]
Root causes: [PLACEHOLDER]
Solutions: [PLACEHOLDER - early stopping, fine-tuning]

Broken Loops Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Incorrect: Very high tokens, flat/low entropy
Meaning: [PLACEHOLDER - degenerate states]
Root causes: [PLACEHOLDER]
Solutions: [PLACEHOLDER - architecture fixes]

Healthy Pattern

Visual: [PLACEHOLDER - describe token/entropy scatter]
Characteristics:
  • Good separation between correct/incorrect
  • Entropy scales with difficulty
Meaning: [PLACEHOLDER - effective reasoning]
Implications: [PLACEHOLDER]


Hazard Interpretation

Early Hazard Increase

Visual: [PLACEHOLDER - describe hazard curve]
Meaning: [PLACEHOLDER - thinking breaks down quickly]
Root causes: [PLACEHOLDER]
Implications: [PLACEHOLDER]
Solutions: [PLACEHOLDER]

Delayed Hazard Increase

Visual: [PLACEHOLDER - describe hazard curve]
Meaning: [PLACEHOLDER - sustained thinking]
Root causes: [PLACEHOLDER]
Implications: [PLACEHOLDER]
Use cases: [PLACEHOLDER]

Constant Hazard

Visual: [PLACEHOLDER - describe hazard curve]
Meaning: [PLACEHOLDER - memoryless failure]
Root causes: [PLACEHOLDER]
Implications: [PLACEHOLDER]

Temporal Threshold

Visual: [PLACEHOLDER - describe sharp hazard transition]
Meaning: [PLACEHOLDER - critical token budget]
Root causes: [PLACEHOLDER]
Optimization: [PLACEHOLDER - find sweet spot]


Example Research Scenarios

Scenario 1: "GPT-4o fails at arithmetic when length > 18"

# Step 1: Confirm failure boundary
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output gpt4o-arithmetic-surface.png

# Step 2: Check for tokenization issues
python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output gpt4o-arithmetic-fft.png

# Step 3: Analyze reasoning efficiency
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --axis length --output-dir gpt4o-compression/

# Step 4: Check temporal dynamics
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o", "base_task": "arithmetic"}' \
    --output-dir gpt4o-hazard/

# Step 5: Synthesize findings
# (visual inspection of all outputs)

Example finding:

  • Surface shows a cliff at length=18
  • FFT shows spectral collapse at length>15
  • Compression shows overthink pattern at high lengths
  • Hazard increases sharply at token=8000
  • Hypothesis: Tokenizer creates longer sequences → context pressure → overthinking → failure


Scenario 2: "Llama 3.3 vs Qwen 2.5 - why do they fail differently?"

# Comparative surface analysis
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-surfaces/

# Comparative FFT (critical for tokenizer differences)
python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-fft/

# Comparative compression
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-compression/

# Comparative hazard
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": ["llama-3.3-70b", "qwen-2.5-72b"], "base_task": "arithmetic"}' \
    --output-dir llama-vs-qwen-hazard/

Example finding:

  • Surfaces: Llama maintains its plateau longer (boundary at length=21 vs 18)
  • FFT: Different spectral signatures (Qwen shows earlier peak collapse)
  • Compression: Qwen shows broken loops, Llama shows overthink
  • Hazard: Qwen hazard increases earlier
  • Hypothesis: Different tokenizers create different sequence lengths → different failure modes


Scenario 3: "My fine-tuned model is worse - what broke?"

# Compare base vs fine-tuned across all four lenses
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-surfaces/

python analyze.py fft data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-fft/

python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-compression/

python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": ["base-model", "finetuned-model"], "base_task": "arithmetic"}' \
    --output-dir base-vs-ft-hazard/

Possible findings:

  • Surface: Capability zones shrunk → overfitting to a narrow distribution
  • FFT: Unchanged → not a representation issue
  • Compression: Underthink pattern appeared → fine-tuning reduced reasoning depth
  • Hazard: Earlier increase → temporal reasoning degraded
  • Hypothesis: Fine-tuning on short examples harmed long-context reasoning


Scenario 4: "Production failures at scale - diagnose quickly"

# Quick diagnostic pipeline
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "production-task"}' \
    --output prod-surface.png

python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "production-model", "base_task": "production-task"}' \
    --output-dir prod-compression/

# Check if production queries fall outside capability zones
# Check for overthink/broken loops in compression

Fast triage:

  • Surface → are production queries in capability zones?
  • Compression → is reasoning broken (overthink/loops)?
  • If both are normal → the failure may be out-of-distribution (not covered by the eval)


Common Pitfalls

Pitfall 1: Analyzing without context

Problem: Running all four tools without understanding what you're looking for.
Solution: Start with surface to identify the failure location, then use FFT/compression/hazard to investigate the mechanism.

Pitfall 2: Ignoring cross-model comparisons

Problem: Diagnosing a single model without baselines.
Solution: Always compare to a working model to isolate differences.

Pitfall 3: Over-interpreting single lens

Problem: Concluding tokenization is the issue based only on FFT.
Solution: Use all four lenses - patterns should corroborate across tools.

Pitfall 4: Missing task-specific patterns

Problem: Generalizing from a single-task analysis.
Solution: Run the diagnosis on multiple tasks to identify systemic vs task-specific issues.

Tips for LLM Agents

If you're an LLM agent using this workflow:

  1. Start with surface + compression (fastest diagnostic):

    python analyze.py surface <dataset> --filters <...> --output surface.png
    python analyze.py compression <dataset> --filters <...> --output-dir compression/
    

  2. Recommend FFT if:

     • Surface shows a task-specific cliff
     • Comparing models with different tokenizers
     • User mentions "representation" or "input" issues

  3. Recommend hazard if:

     • Compression shows temporal patterns
     • User mentions "long-form" or "extended reasoning"
     • High token counts correlate with failures

  4. Synthesize findings systematically:

     • OUTPUT (surface): "Failure occurs at [boundary]"
     • INPUT (fft): "Tokenization [normal / shows artifacts]"
     • REASONING (compression): "Model exhibits [underthink / overthink / broken loops]"
     • TEMPORAL (hazard): "Thinking degrades after [N] tokens"
     • HYPOTHESIS: "[root cause based on pattern constellation]"

  5. Red flags to escalate:

     • Broken loops pattern → architecture issue, hard to fix
     • Systemic truncation → context limits, needs architectural change
     • Spectral collapse → tokenizer incompatible with the task
     • Early hazard increase → reasoning instability
Tool Documentation