
Profile: Characterizing and Comparing Models

Questions: "What can this model do?" or "How do these models differ?"

Overview

Profile workflows characterize and compare model capabilities without producing rankings. They reveal shape, boundaries, and patterns rather than collapsing to a single score.

Profile covers two distinct but related objectives:

  • Characterization: Understand a single model's capabilities, boundaries, and cost profile
  • Comparison: Understand how multiple models differ in failure modes, efficiency, or architecture

When to Use Profile

Use Profile workflows when you need to:

  • Characterize a model before deployment: "What can this model do well? Where will it fail?"
  • Compare capabilities between models: "Where exactly do these models differ?"
  • Investigate efficiency trade-offs: "Which model gives me the best cost/performance?"

The Four Performance Dimensions

Profile is the 'scape in ReasonScape: filters and grouping create projections that characterize a model along four fundamental dimensions:

  1. Accuracy - Correctness as a function of task-specific complexity
  2. Reliability - Truncation rates and completion behavior
  3. Difficulty-Scaling - Accuracy degradation patterns under load
  4. Resource-Utilization - Token usage patterns

Each Profile tool projects these dimensions into a different geometric view:

| Tool | View | Space | What It Shows |
|------|------|-------|---------------|
| `spiderweb` | 2D radar/bar | Characterization-Space | Cross-task fingerprint: accuracy + reliability + scaling + resources |
| `surface` | 3D heatmaps | Output-Space | How accuracy scales across 2D parameter grids |
| `compression` | 2D scatterplots | Reasoning-Space | How entropy and resource utilization interact with difficulty |
| `hazard` | 2D CIF plot | Temporal-Space | When failures occur during token generation |

Choosing Your Projection

Use this decision tree based on what you need to see:

START: What aspect of model characterization matters?
│
├─ Overall model profile across all tasks?
│  └─ Use `spiderweb` (Characterization-Space)
│     • Reveals cognitive archetype (generalist vs specialist)
│     • Shows cross-task consistency
│     • Identifies weak/expensive tasks for deeper investigation
│
├─ Where do capabilities break down?
│  └─ Use `surface` (Output-Space)
│     • Maps accuracy across 2D difficulty grids
│     • Shows cliff vs slope failure patterns
│     • Reveals deployment safety boundaries
│
├─ How efficient is reasoning?
│  └─ Use `compression` (Reasoning-Space)
│     • Shows token vs entropy trade-offs
│     • Identifies underthink/overthink/broken loops
│     • Maps resource cost against information content
│
└─ When does the model start failing?
   └─ Use `hazard` (Temporal-Space)
      • Detects positional hazard spikes (4k, 8k boundaries)
      • Distinguishes gradual degradation from sudden collapse
      • Diagnoses RoPE/pretraining context wall effects

Key insight: All tools use the same filter→group→postprocess pattern. What differs is the projection geometry.
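Because the pattern is shared, every tool accepts the same core flags. A sketch of the common invocation skeleton, using placeholder values (`<tool>`, `<eval_id>`, and `<project>` are stand-ins, not real identifiers):

```shell
# Filter → group → postprocess: the same skeleton drives all four tools.
# <tool> is one of: spiderweb, surface, compression, hazard.
python analyze.py <tool> data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/
```

Only the projection-specific flags (`--view`, `--format`, `--facet-by`, and so on) differ between tools.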

> [!TIP]
> Not sure where to start? Run `spiderweb` first. Its merged view identifies weak tasks (for `surface` investigation), high-cost tasks (for `compression`), and whether truncation rates are elevated (motivating `hazard` investigation) in one plot.

Spiderweb: Cross-Task Fingerprint

Creates a visual "fingerprint" showing how a model performs across all tasks simultaneously. The shape reveals the model's cognitive archetype.

Single Model vs Multiple Models

Single model is the most common use. Filter to one eval_id and the web shows that model's full capability profile — strengths, weaknesses, and cross-task consistency in one view.

Multiple models overlays their fingerprints directly. Use this to see exactly where profiles diverge: tasks where one model dominates, tasks where they're equivalent, and whether the gap is broad or narrow. This is the primary comparison tool before committing to deeper per-task analysis.

Grouping

By default, spiderweb slices by base_task — one axis per task. For single-task datasets (e.g. tables-16k), use --group-by manifold.target_tokens or --group-by params.operation to get a meaningful web shape.
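For instance, a single-task run can be re-sliced along a manifold dimension instead of by task. A sketch, with the dataset path, eval id, and project name as placeholders:

```shell
# Single-task dataset: one web axis per target token count instead of per task.
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --group-by manifold.target_tokens \
    --output-dir research/<project>/
```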

Output Formats

  • --format webpng (default): Radar chart. Shape identifies cognitive archetype at a glance.
  • --format barpng: Bar chart with error bars. Better for precise value comparison across models.
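To compare exact values across two models, the bar format can be selected explicitly (a sketch; eval ids and project name are placeholders):

```shell
# Bar chart with error bars: better than the radar for reading precise values.
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>"]}' \
    --format barpng \
    --output-dir research/<project>/
```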

Reading the results:

Identify the cognitive archetype from the fingerprint shape:

| Archetype | Pattern | Deployment Implication |
|-----------|---------|------------------------|
| Balanced Generalist | Low variance, solid across all tasks | Safe for diverse workloads |
| Task Specialist | High variance, excellent at few tasks | Use only for strong tasks |
| Truncation Victim | High truncation rates across all tasks | Consider another model |
| Inefficient Reasoner | High token usage, decent accuracy | Use large context window |

See analyze.py spiderweb reference for flags and examples.

Surface: Capability Zone Mapping

Visualizes accuracy across 2D parameter grids, revealing exactly where performance breaks down and how.

What a Surface Is

A surface is any view with exactly two group_by dimensions. Views are defined in the experiment config (see config.md). The surface command renders all matching 2-dim views by default, or you can select specific ones with --view.
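Selecting a single view might look like this (a sketch: `length_vs_depth` is a hypothetical view name — the actual names come from your experiment config):

```shell
# Render only one named 2-dim view instead of all matching views.
# "length_vs_depth" is a placeholder; list your config's views in config.md.
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --view length_vs_depth \
    --output-dir research/<project>/
```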

Grid layout: rows = views, cols = evals. This means a multi-model comparison produces a grid where each column is a model and each row is a parameter projection — making failure boundary differences immediately visible.

Single Model vs Multiple Models

Single model: Maps capability zones for deployment decisions. Green regions are safe operating ranges; red regions are where the model fails.

Multiple models: Stack them as columns in the same grid. Directly compare where each model's green zone ends. Larger green zones mean more robust capability; different boundary shapes reveal different underlying limitations.

Splitting Output

--split-by base_task (default) produces one file per task — use this when comparing across all tasks. --split-by none produces a single combined file — use this when focused on one task or a small model set.
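A sketch of the combined-file case for a focused two-model comparison (eval ids, task name, and project name are placeholders):

```shell
# One combined output file: useful for a small model set on a single task.
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>"], "base_task": ["critical-task"]}' \
    --split-by none \
    --output-dir research/<project>/
```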

Reading the results:

| Pattern | Meaning | Action |
|---------|---------|--------|
| Large green zones | Robust capability | Safe for production |
| Sharp cliffs | Sudden failure mode | Set conservative boundaries |
| Gradual slopes | Graceful degradation | Acceptable with monitoring |
| Truncation overlay | Context limit issues | Increase context or reduce prompts |
| Small/missing green | Fragile performance | Avoid this task or difficulty range |

See analyze.py surface reference for flags and examples.

Compression: Reasoning Efficiency Analysis

Analyzes token usage versus entropy patterns to understand reasoning efficiency. Each point is a model output plotted as tokens (x-axis) vs compressed entropy (y-axis), colored by outcome (correct/incorrect/invalid/truncated).

What It Reveals

The distribution of outcome populations tells you how the model fails, not just that it fails:

| Pattern | Meaning | Action |
|---------|---------|--------|
| Underthink | Incorrect answers cluster at low tokens, low entropy | Encourage longer reasoning via prompts |
| Overthink | Incorrect answers cluster at high tokens, low entropy | Set max_tokens limits, tune temperature |
| Broken loops | Very high tokens, flat/zero entropy growth | Architecture issue, hard to fix |
| Healthy | Clear population separation by outcome | Reasoning works, failure is elsewhere |

Single Model vs Multiple Models

Single model: Identify the dominant failure mode for each task.

Multiple models: Use --facet-by eval_id to stack models as subplot rows within each task file. Directly compare whether both models overthink, or whether one is efficient where the other loops.

Dimensionality

--split-by controls output files, --facet-by controls subplot rows within a file, --group-by controls colored series within each panel. The defaults (split-by base_task, facet-by eval_id, group-by manifold.id) work well for most comparisons.
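Spelled out explicitly, the default three-level layout looks like this (a sketch; eval ids and project name are placeholders):

```shell
# Defaults made explicit: one file per task, one subplot row per eval,
# one colored series per manifold within each panel.
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>"]}' \
    --split-by base_task \
    --facet-by eval_id \
    --group-by manifold.id \
    --output-dir research/<project>/
```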

See analyze.py compression reference for flags and examples.

Hazard: Temporal Failure Analysis

Treats token generation as a temporal process and computes the Cumulative Incidence Function (CIF) to show when during generation a model produces correct answers, incorrect answers, or gets stuck.

What It Reveals

Most profile tools aggregate across all outputs for a point. Hazard preserves the time dimension — it shows whether failures cluster early (the model gives up quickly), late (the model reasons extensively before failing), or at specific token positions.

The most important pattern is the positional hazard spike: a sharp increase in failure incidence at a specific token count (commonly 4k, 8k, or 16k tokens). A spike at a round number is a strong signal of a context wall — either a pretraining length limit, a misconfigured RoPE scaling, or an attention pattern that degrades sharply at a boundary. This is distinct from gradual degradation, which shows as a smooth rising failure curve across the full token range.

| Pattern | Meaning | Action |
|---------|---------|--------|
| Early failure spike | Model answers (correctly or not) at low token counts | Check for underthinking |
| Smooth late-rising failures | Gradual difficulty scaling | Normal difficulty effect |
| Positional hazard spike at N | Context wall at N tokens | Check RoPE config, pretrain length |
| Flat hazard, high truncation | Model never resolves — fills context with loops | Investigate with `probe.py truncation` |

Single Model vs Multiple Models

Single model: Establish the temporal profile — does this model have a clean CIF or a positional spike?

Multiple models: Compare CIF curves directly. If Model A has a spike at 8k and Model B does not, the difference is architectural, not task difficulty.

Relationship to Probe

Hazard tells you when failures occur across many points in aggregate. When hazard reveals a spike or anomalous pattern, use probe.py truncation on the individual points near that token position to see what the model was doing at that moment.
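A sketch of that handoff, assuming `probe.py` accepts the same `--filters` syntax as `analyze.py` (the task name is a placeholder; consult the probe reference for the actual flags):

```shell
# Step 1: hazard surfaces a positional spike (e.g. at 8k tokens).
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/

# Step 2: inspect individual truncated outputs on the affected task.
# Narrow the filter (e.g. by base_task) to isolate points near the boundary.
python probe.py truncation data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["spiking-task"]}'
```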

See analyze.py hazard reference for flags and examples.

Common Workflows

Workflow 1: Pre-Deployment Characterization

# Step 1: Get the fingerprint
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/

# Step 2: Map capability zones for weak tasks identified in spiderweb
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["weak-task"]}' \
    --output-dir research/<project>/

# Step 3: Check reasoning efficiency for high-cost tasks
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["expensive-task"]}' \
    --output-dir research/<project>/

# Step 4: Check temporal profile — any context walls?
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/

Goal: Determine whether the model is suitable for the production use case.

Workflow 2: Model Comparison

# Step 1: Overlay fingerprints
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

# Step 2: Compare failure boundaries on tasks where fingerprints diverge
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"], "base_task": ["critical-task"]}' \
    --output-dir research/<project>/

# Step 3: Compare reasoning efficiency
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

# Step 4: Compare temporal profiles — do context walls differ?
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

Goal: Choose between statistically similar models from Position, or understand architectural differences.

Filters

All Profile tools support filtering by eval_id, groups, and base_task. Discover available values with python analyze.py evals and python analyze.py tasks.

For complete filter syntax and examples, see the Filter Reference.

Next Steps After Profiling

Found capability boundaries? → Set production guardrails within green zones

Found reasoning inefficiency? → Adjust inference parameters (max_tokens, temperature), return to Profile to verify

Found a positional hazard spike? → Check RoPE configuration and pretraining context length; compare against a model without the spike

Found high truncation with flat hazard? → Use probe.py truncation to classify the loop type

Model not suitable? → Return to Position to select a different candidate

Need to see what the model actually produced? → Use Probe (failure, truncation, fft)

Need deeper root-cause analysis? → Filter tighter (single eval_id + single task) and read the plots diagnostically

Tool Reference