Probe: Understanding What Failure Actually Looks Like¶
Definition: Raw sample analysis inside a single point (0th-level trace inspection)
Question: "What does this entropic degradation actually look like?"
The Missing Piece¶
Through compression and hazard analysis, we observe that truncated outputs show entropic degradation—their information-theoretic distributions don't match where "lots of productive thinking" should land. High token counts with flat entropy curves signal that something has gone wrong.
But what has gone wrong?
- Is it a degenerate repeating stream where the sampler got stuck on junk?
- Is it a complex progression that starts with effective loops making progress, then degrades into loops that are stuck?
- Is it a metacognitive failure where thinking about some aspect of the problem traps the model in a loop?
- Is it something else entirely?
Probe workflows answer this question. They examine the raw model outputs to reveal the actual mechanics of failure—turning abstract entropy curves into concrete failure modes you can reason about.
When to Use Probe¶
After Profile analysis reveals anomalies:
- compression shows "broken loops" pattern → What do the loops actually say?
- hazard shows late-stage failure spikes → What happened at that moment?
- High truncation rates → What content filled the context window?
For failure mode classification:
- Are truncations degenerate (junk) or structured (semantic loops)?
- Is the model making progress before getting stuck, or stuck from the start?
- Are failures task-specific or systematic across problem types?
For evidence collection:
- Extract concrete examples for reports
- Document failure modes for model developers
- Collect training data for fine-tuning interventions
The Failure Command¶
Inspects raw answer text and genresult{} metadata for a specific point. Use this when you want to read what the model actually said — correct answers, wrong answers, invalid outputs — rather than loop structure.
python probe.py failure data/dataset.json --point-id <point_id>
python probe.py failure data/dataset.json --point-id <point_id> --limit 5 --full
| Flag | Default | Description |
|---|---|---|
| `--point-id` | required | Point ID to inspect (repeatable to aggregate across points) |
| `--limit` | all | Show only the first N failure examples |
| `--full` | — | Show complete answer text without truncation |
Points, params{}, and genresult{}¶
To understand what this command shows, it helps to understand what a "point" actually is.
params{} are the task-complexity coordinates exposed for analysis — the dimensions used to build surfaces, define tiers, and aggregate results in PointsDB. But params{} is a projection, not the full picture. Many tasks have generators with 3-4x more internal dimensions than params{} exposes. params{} defines a region of the true complexity space, not a single location in it.
genresult{} is the complete set of variables that expand the per-task template into one specific prompt. Two samples with identical genresult{} are the same prompt — LLM-cache-level identical. genresult{} is the true complexity-space coordinate of each individual sample.
A point in PointsDB is therefore a population of samples that share the same params{} constraint but differ in their uncollapsed genresult{} dimensions underneath. When you look at a surface plot, each cell is an aggregate over this population.
What the genresult facet breakdown reveals: within a failing point-population, the failure command asks whether failures cluster on a genresult{} dimension that params{} is hiding. A point that looks like uniform 60% accuracy in the surface plot might actually contain two sub-populations: one genresult{} configuration that always succeeds and another that always fails — but params{} collapsed them together. The facet breakdown surfaces that structure.
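The facet breakdown amounts to grouping samples by each genresult{} dimension and comparing outcome rates across its values. A minimal sketch of the idea (the `genresult`/`correct` field names are illustrative, not the tool's actual schema):

```python
from collections import defaultdict

def genresult_facet_breakdown(samples):
    """Group samples by each genresult{} dimension and report pass rates.

    `samples` is assumed to be a list of dicts with a "genresult" dict and
    a boolean "correct" flag -- illustrative field names, not probe.py's
    real schema.
    """
    # dim -> value -> [passed, total]
    facets = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for s in samples:
        for dim, value in s["genresult"].items():
            cell = facets[dim][value]
            cell[0] += s["correct"]
            cell[1] += 1
    # A dimension "predicts failure" when pass rates differ sharply
    # across its values.
    return {
        dim: {v: passed / total for v, (passed, total) in values.items()}
        for dim, values in facets.items()
    }

samples = [
    {"genresult": {"depth": 3, "op": "add"}, "correct": True},
    {"genresult": {"depth": 3, "op": "mul"}, "correct": False},
    {"genresult": {"depth": 4, "op": "add"}, "correct": True},
    {"genresult": {"depth": 4, "op": "mul"}, "correct": False},
]
print(genresult_facet_breakdown(samples))
```

Here `op` splits cleanly (add → 1.0, mul → 0.0) while `depth` does not (0.5 either way): exactly the sub-population structure that params{} would have collapsed into a uniform 50% cell.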
Output: Raw answer text per sample with extracted vs. reference answers, plus a genresult facet breakdown showing outcome distributions across the hidden complexity dimensions.
Use cases:
- Read what the model actually produced for a specific failing point
- Find hidden structure: which genresult{} dimensions predict failure within a params{} region
- Spot-check answer extraction (did the grader correctly identify the answer?)
- Collect concrete failure examples for reports or fine-tuning data
- Understand whether failures are systematic (same wrong answer) or random
The Truncation Command¶
python probe.py truncation <config> --point-id <point_id>
What It Does¶
- Loads raw ndjson samples for the specified point
- Filters to truncated samples (outputs that hit context limits)
- Segments each output by `\n\n` (paragraph boundaries)
- Computes token lengths and inter-segment similarity
- Detects loop patterns using stride analysis
- Classifies each detected loop by type
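The detection steps can be sketched roughly as follows. This is a simplified stand-in, not probe.py's implementation: it uses exact segment matching where the real tool also compares token lengths and fuzzy similarity, and `min_repeats` is an illustrative threshold:

```python
def detect_loops(text, stride_max=3, min_repeats=3):
    """Toy stride-analysis loop detector.

    Split on paragraph boundaries, then for each stride k look for runs
    where every segment matches the one k positions earlier (stride 1 =
    AAA..., stride 2 = ABAB..., stride 3 = ABCABC...).
    """
    segments = [s for s in text.split("\n\n") if s.strip()]
    loops = []
    for stride in range(1, stride_max + 1):
        start = None
        for i in range(stride, len(segments) + 1):
            matches = i < len(segments) and segments[i] == segments[i - stride]
            if matches and start is None:
                start = i - stride          # run begins at the first matched pair
            elif not matches and start is not None:
                if i - start >= stride * min_repeats:
                    loops.append({"stride": stride, "segs": (start, i)})
                start = None
    return loops

# An ABAB oscillation after one setup segment:
text = "\n\n".join(["setup"] + ["check", "act"] * 4)
print(detect_loops(text))   # → [{'stride': 2, 'segs': (1, 9)}]
```

The stride that fires tells you the loop order (1-loop, 2-loop, 3-loop); the token-length deltas within the run then distinguish static, growing, and shrinking variants.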
Discovering Point IDs¶
First, use `data.py pull dataset data/dataset.json --full` to download the raw data necessary to run probe workflows.
Before using probe.py, find points with truncations:
# List points sorted by truncation count
python analyze.py points data/dataset.json \
--filters '{"eval_id": ["<eval_id>"]}' \
--sort truncated
# Find points for a specific task
python analyze.py points data/dataset.json \
--filters '{"eval_id": ["<eval_id>"], "base_task": "arithmetic"}'
The id column gives you point IDs for probe.py.
Failure Mode Taxonomy¶
The truncation analysis classifies loops into distinct failure modes:
1-Loop-Static: Fixed Repetition¶
Classification: 1-LOOP-STATIC
Δtok ≈ 0 (segments have constant token length)
The model repeats the same content verbatim. Each segment is essentially identical to the previous one.
Example pattern:
Segment 45: "Now we have: [ ( ) ] { } [ ] ( )" (25 tokens)
Segment 46: "Now we have: [ ( ) ] { } [ ] ( )" (25 tokens)
Segment 47: "Now we have: [ ( ) ] { } [ ] ( )" (25 tokens)
...
Interpretation: The model has entered a degenerate state where it's producing the same output repeatedly. This often indicates the model has "given up" on making progress and fallen into a stable attractor.
1-Loop-Growing: Counter Loops¶
Classification: 1-LOOP-GROWING
Δtok > 0 (segments grow by constant amount)
Each iteration adds slightly more content—often a counter, step number, or accumulated state.
Example pattern:
Segment 10: "Step 10: Computing sum = 55..." (42 tokens)
Segment 11: "Step 11: Computing sum = 55 + 11..." (44 tokens)
Segment 12: "Step 12: Computing sum = 66 + 12..." (46 tokens)
...
Interpretation: The model is making apparent progress (incrementing counters, accumulating results) but the core reasoning is stuck. This is a metacognitive failure—the model thinks it's working but isn't actually advancing toward the solution.
1-Loop-Shrinking: Diminishing Patterns¶
Classification: 1-LOOP-SHRINKING
Δtok < 0 (segments shrink by constant amount)
The model produces progressively shorter segments, often as it "winds down" from an initially verbose state.
Example pattern:
Segment 5: "Closing brackets: } ] ) } ] ) } ]..." (35 tokens)
Segment 6: "Closing brackets: } ] ) } ] )..." (33 tokens)
Segment 7: "Closing brackets: } ] ) } ]..." (31 tokens)
...
Interpretation: Less common than growing loops. May indicate the model is trying to "escape" a verbose state but can't fully break free.
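The three 1-loop subtypes differ only in the sign of the per-segment token delta, so classification reduces to a sign test on Δtok. A minimal sketch (the tolerance is an illustrative assumption, not probe.py's actual threshold):

```python
def classify_1loop(token_lengths, tol=1):
    """Classify a confirmed stride-1 loop by the sign of its mean
    per-iteration token delta (`tol` absorbs tokenizer jitter)."""
    deltas = [b - a for a, b in zip(token_lengths, token_lengths[1:])]
    mean_delta = sum(deltas) / len(deltas)
    if abs(mean_delta) <= tol:
        return "1-LOOP-STATIC"
    return "1-LOOP-GROWING" if mean_delta > 0 else "1-LOOP-SHRINKING"

print(classify_1loop([25, 25, 25, 25]))   # → 1-LOOP-STATIC
print(classify_1loop([42, 44, 46, 48]))   # → 1-LOOP-GROWING
print(classify_1loop([35, 33, 31, 29]))   # → 1-LOOP-SHRINKING
```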
2-Loop-Static: ABAB Patterns¶
Classification: 2-LOOP-STATIC
Stride 2: alternating segments match (A-B-A-B)
The model alternates between two distinct segment types.
Example pattern:
Segment 20: "Let's check: the brackets are..." (40 tokens)
Segment 21: "So we need to close: } ] )..." (38 tokens)
Segment 22: "Let's check: the brackets are..." (40 tokens)
Segment 23: "So we need to close: } ] )..." (38 tokens)
...
Interpretation: The model is caught in a two-state oscillation—checking then acting, then checking again without making progress. This often occurs when the model's "verification" step keeps triggering the same "action" step.
3-Loop-Static: ABCABC Patterns¶
Classification: 3-LOOP-STATIC
Stride 3: every third segment matches (A-B-C-A-B-C)
A three-phase repeating cycle, often representing a more complex reasoning template.
Interpretation: The model has fallen into a structured reasoning pattern that it can't escape. Each phase of the cycle triggers the next, creating a closed loop.
Degenerate: Pure Token Repetition¶
Classification: DEGENERATE
High autocorrelation at character level
Not structured at the segment level at all—the raw text shows high character-level periodicity.
Example pattern:
"} } } } } } } } } } } } } } } } } } } } } } } } } } } } } } } } } }"
Interpretation: The most severe failure mode. The model has completely collapsed into producing repetitive tokens without any semantic structure. This is a sampler-level failure, not a reasoning failure.
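Character-level periodicity can be estimated with a simple lagged self-match rate: the fraction of positions where a character equals the one `lag` positions earlier, maximized over small lags. This is a sketch of the idea, not probe.py's actual autocorrelation measure, and the 0.9 cutoff is an illustrative assumption:

```python
def char_autocorrelation(text, max_lag=8):
    """Max over small lags of the fraction of positions where
    text[i] == text[i - lag]. Values near 1.0 suggest a degenerate,
    near-periodic character stream."""
    best = 0.0
    for lag in range(1, min(max_lag, len(text) - 1) + 1):
        matches = sum(text[i] == text[i - lag] for i in range(lag, len(text)))
        best = max(best, matches / (len(text) - lag))
    return best

junk = "} " * 40
print(char_autocorrelation(junk) > 0.9)   # degenerate stream → True
prose = "The quick brown fox jumps over the lazy dog"
print(char_autocorrelation(prose) > 0.9)  # normal text → False
```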
Interpreting Coverage Metrics¶
The output reports loop coverage—what percentage of tokens are consumed by detected loops.
| Coverage | Interpretation |
|---|---|
| >90% | Severe failure: nearly all output is loops |
| 50-90% | Significant failure: majority of output is repetitive |
| 20-50% | Partial failure: loops mixed with productive content |
| <20% | Minor or no looping: truncation may be due to legitimate verbosity |
Key insight: High coverage with 1-LOOP-GROWING is often worse than high coverage with 1-LOOP-STATIC, because it means the model "looks" like it's working while actually being stuck.
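Coverage itself is just detected-loop tokens divided by total completion tokens; a small sketch that maps it onto the buckets in the table above (thresholds taken from the table, function name hypothetical):

```python
def coverage_bucket(loop_token_counts, total_tokens):
    """Compute loop coverage and map it to the interpretation buckets
    from the coverage table."""
    coverage = 100.0 * sum(loop_token_counts) / total_tokens
    if coverage > 90:
        label = "severe failure"
    elif coverage >= 50:
        label = "significant failure"
    elif coverage >= 20:
        label = "partial failure"
    else:
        label = "minor or no looping"
    return coverage, label

# Sample 3 from the output below: loops of 624, 2024, and 5368 tokens
# out of 8091 completion tokens.
cov, label = coverage_bucket([624, 2024, 5368], 8091)
print(f"{cov:.1f}% -> {label}")   # → 99.1% -> severe failure
```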
Sample-Level Output¶
--------------------------------------------------------------------------------
Sample 3 (8091 completion tokens)
--------------------------------------------------------------------------------
Total characters: 21131
Total tokens: 8091
Primary segments (\n\n): 2
Unified segments: 103
Loop coverage: 99.1%
DETECTED LOOPS (3 found):
Loop 1: segs 3-14 (12 segs), Δtok=+2, 1-LOOP-GROWING
Tokens: 42-61 (624 total, 7.7% of response)
Example: Now the sequence is: | { ( { } ( ) ...
Loop 2: segs 15-41 (27 segs), Δtok=+2, 1-LOOP-GROWING
Tokens: 61-88 (2024 total, 25.0% of response)
Example: Now the sequence is: | { ( { } ( ) ...
Loop 3: segs 42-102 (61 segs), Δtok=+0, 1-LOOP-STATIC
Tokens: 88 (5368 total, 66.3% of response)
Example: Now the sequence is: | { ( { } ( ) ...
Reading this output:
- Sample had 8091 tokens total
- Only 2 primary \n\n segments, but one was huge and got sub-segmented into 103 units
- 99.1% of all tokens are inside detected loops
- Three distinct loop phases:
1. Growing phase (7.7%): Making apparent progress, segments growing by +2 tokens each
2. Growing phase 2 (25%): Still growing, longer segments
3. Static phase (66.3%): Fully stuck, same content repeating
Failure narrative: The model started with counter-style loops that looked like progress, then gradually "wound up" to a stable size where it got permanently stuck.
Aggregate Summary¶
AGGREGATE SUMMARY BY LOOP TYPE
================================================================================
1-LOOP-STATIC (26 occurrences):
Sample 1: 296 segs, 7992 tokens (98.8%)
Sample 3: 61 segs, 5368 tokens (66.3%)
1-LOOP-GROWING (4 occurrences):
Sample 3: 12 segs, 624 tokens (7.7%)
Sample 3: 27 segs, 2024 tokens (25.0%)
DEGENERATE (1 occurrences):
Sample 2: 1 segs, 8023 tokens (99.2%)
Reading this output:
- Most failures end in STATIC loops (26 occurrences)
- Some failures show GROWING phases before becoming STATIC
- One failure was pure DEGENERATE (character-level repetition)
See also¶
- analyze.py points reference - For discovering point IDs
- Profile workflow - For the analysis that motivates Probe