# Profile: Characterizing and Comparing Models
Questions: "What can this model do?" or "How do these models differ?"
## Overview
Profile workflows characterize and compare model capabilities without producing rankings. They reveal shape, boundaries, and patterns rather than collapsing to a single score.
Profile covers two distinct but related objectives:
- Characterization: Understand a single model's capabilities, boundaries, and cost profile
- Comparison: Understand how multiple models differ in failure modes, efficiency, or architecture
## When to Use Profile
Use Profile workflows when you need to:
- Characterize a model before deployment: "What can this model do well? Where will it fail?"
- Compare capabilities between models: "Where exactly do these models differ?"
- Investigate efficiency trade-offs: "Which model gives me the best cost/performance?"
## The Four Performance Dimensions
Profile is the 'scape in ReasonScape: filters and grouping build projections that characterize a model along four fundamental dimensions:
- Accuracy - Correctness as a function of task-specific complexity
- Reliability - Truncation rates and completion behavior
- Difficulty-Scaling - Accuracy degradation patterns under load
- Resource-Utilization - Token usage patterns
Each Profile tool projects these dimensions into a different geometric view:
| Tool | View | Space | What It Shows |
|---|---|---|---|
| spiderweb | 2D radar/bar | Characterization-Space | Cross-task fingerprint: accuracy + reliability + scaling + resources |
| surface | 3D heatmaps | Output-Space | How accuracy scales across 2D parameter grids |
| compression | 2D scatterplots | Reasoning-Space | How entropy and resource utilization interact with difficulty |
| hazard | 2D CIF plot | Temporal-Space | When failures occur during token generation |
## Choosing Your Projection
Use this decision tree based on what you need to see:
```
START: What aspect of model characterization matters?
│
├─ Overall model profile across all tasks?
│   └─ Use `spiderweb` (Characterization-Space)
│       • Reveals cognitive archetype (generalist vs specialist)
│       • Shows cross-task consistency
│       • Identifies weak/expensive tasks for deeper investigation
│
├─ Where do capabilities break down?
│   └─ Use `surface` (Output-Space)
│       • Maps accuracy across 2D difficulty grids
│       • Shows cliff vs slope failure patterns
│       • Reveals deployment safety boundaries
│
├─ How efficient is reasoning?
│   └─ Use `compression` (Reasoning-Space)
│       • Shows token vs entropy trade-offs
│       • Identifies underthink/overthink/broken loops
│       • Maps resource cost against information content
│
└─ When does the model start failing?
    └─ Use `hazard` (Temporal-Space)
        • Detects positional hazard spikes (4k, 8k boundaries)
        • Distinguishes gradual degradation from sudden collapse
        • Diagnoses RoPE/pretraining context wall effects
```
Key insight: All tools use the same filter→group→postprocess pattern. What differs is the projection geometry.
> [!TIP]
> Not sure where to start? Run `spiderweb` first. Its merged view identifies weak tasks (for `surface` investigation), high-cost tasks (for `compression`), and whether truncation rates are elevated (motivating `hazard` investigation) in one plot.
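The shared filter→group→postprocess pipeline can be sketched in a few lines. This is an illustrative reconstruction, not the code in `analyze.py`; the record fields (`eval_id`, `base_task`, `correct`) mirror the filter keys documented below but are otherwise hypothetical.

```python
from collections import defaultdict
from statistics import mean

def profile(records, filters, group_key, postprocess):
    """Filter -> group -> postprocess: the pattern shared by all Profile tools."""
    # Filter: keep records whose fields match one of each filter's allowed values.
    kept = [r for r in records
            if all(r.get(k) in allowed for k, allowed in filters.items())]
    # Group: bucket by the projection key (e.g. "base_task" for spiderweb axes).
    groups = defaultdict(list)
    for r in kept:
        groups[r[group_key]].append(r)
    # Postprocess: reduce each bucket to the plotted statistic.
    return {k: postprocess(v) for k, v in groups.items()}

records = [
    {"eval_id": "m1", "base_task": "arithmetic", "correct": True},
    {"eval_id": "m1", "base_task": "arithmetic", "correct": False},
    {"eval_id": "m1", "base_task": "sorting", "correct": True},
    {"eval_id": "m2", "base_task": "sorting", "correct": False},
]
acc = profile(records,
              filters={"eval_id": ["m1"]},
              group_key="base_task",
              postprocess=lambda rs: mean(r["correct"] for r in rs))
```

Each tool swaps in a different `group_key` and `postprocess`; the projection geometry is the only thing that changes.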
## Spiderweb: Cross-Task Fingerprint
Creates a visual "fingerprint" showing how a model performs across all tasks simultaneously. The shape reveals the model's cognitive archetype.
### Single Model vs Multiple Models
Single model is the most common use. Filter to one eval_id and the web shows that model's full capability profile — strengths, weaknesses, and cross-task consistency in one view.
Multiple models overlays their fingerprints directly. Use this to see exactly where profiles diverge: tasks where one model dominates, tasks where they're equivalent, and whether the gap is broad or narrow. This is the primary comparison tool before committing to deeper per-task analysis.
### Grouping
By default, spiderweb slices by base_task — one axis per task. For single-task datasets (e.g. tables-16k), use --group-by manifold.target_tokens or --group-by params.operation to get a meaningful web shape.
### Output Formats

- `--format webpng` (default): Radar chart. Shape identifies the cognitive archetype at a glance.
- `--format barpng`: Bar chart with error bars. Better for precise value comparison across models.
Reading the results:
Identify the cognitive archetype from the fingerprint shape:
| Archetype | Pattern | Deployment Implication |
|---|---|---|
| Balanced Generalist | Low variance, solid across all tasks | Safe for diverse workloads |
| Task Specialist | High variance, excellent at few tasks | Use only for strong tasks |
| Truncation Victim | High truncation rates across all tasks | Consider another model |
| Inefficient Reasoner | High token usage, decent accuracy | Use large context window |
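The archetype table above can be read as a simple decision rule. The sketch below is illustrative only: the thresholds (`var_thresh`, `trunc_thresh`, `token_thresh`) are placeholder values, not anything ReasonScape defines, and in practice you judge the shape visually.

```python
from statistics import mean, pstdev

def archetype(task_accuracy, truncation_rate, tokens_per_answer,
              var_thresh=0.15, trunc_thresh=0.2, token_thresh=2000):
    """Classify a spiderweb fingerprint. All thresholds are illustrative."""
    accs = list(task_accuracy.values())
    if truncation_rate > trunc_thresh:
        return "Truncation Victim"        # high truncation across tasks
    if tokens_per_answer > token_thresh and mean(accs) > 0.6:
        return "Inefficient Reasoner"     # decent accuracy, high token cost
    if pstdev(accs) > var_thresh:
        return "Task Specialist"          # high cross-task variance
    return "Balanced Generalist"          # low variance, solid everywhere

print(archetype({"arithmetic": 0.9, "sorting": 0.85, "logic": 0.88},
                truncation_rate=0.02, tokens_per_answer=400))
# Balanced Generalist
```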
See analyze.py spiderweb reference for flags and examples.
## Surface: Capability Zone Mapping
Visualizes accuracy across 2D parameter grids, revealing exactly where performance breaks down and how.
### What a Surface Is
A surface is any view with exactly two group_by dimensions. Views are defined in the experiment config (see config.md). The surface command renders all matching 2-dim views by default, or you can select specific ones with --view.
Grid layout: rows = views, cols = evals. This means a multi-model comparison produces a grid where each column is a model and each row is a parameter projection — making failure boundary differences immediately visible.
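Conceptually, each panel of a surface is an accuracy aggregate over two `group_by` dimensions. A minimal sketch, assuming hypothetical record fields (`length`, `depth`, `correct`) standing in for the two parameters of a view:

```python
from collections import defaultdict
from statistics import mean

def surface_grid(records, dim_x, dim_y):
    """Aggregate accuracy over a 2D parameter grid (two group_by dimensions)."""
    cells = defaultdict(list)
    for r in records:
        cells[(r[dim_x], r[dim_y])].append(r["correct"])
    xs = sorted({x for x, _ in cells})
    ys = sorted({y for _, y in cells})
    # rows = dim_y values, cols = dim_x values; None marks unsampled cells
    grid = [[mean(cells[(x, y)]) if (x, y) in cells else None for x in xs]
            for y in ys]
    return grid, xs, ys

records = [
    {"length": 8,  "depth": 1, "correct": True},
    {"length": 8,  "depth": 2, "correct": True},
    {"length": 16, "depth": 1, "correct": True},
    {"length": 16, "depth": 2, "correct": False},
]
grid, xs, ys = surface_grid(records, "length", "depth")
```

Rendering that grid as a heatmap per model, side by side, is what makes boundary differences jump out.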
### Single Model vs Multiple Models
Single model: Maps capability zones for deployment decisions. Green regions are safe operating ranges; red regions are where the model fails.
Multiple models: Stack them as columns in the same grid. Directly compare where each model's green zone ends. Larger green zones mean more robust capability; different boundary shapes reveal different underlying limitations.
### Splitting Output
--split-by base_task (default) produces one file per task — use this when comparing across all tasks. --split-by none produces a single combined file — use this when focused on one task or a small model set.
Reading the results:
| Pattern | Meaning | Action |
|---|---|---|
| Large green zones | Robust capability | Safe for production |
| Sharp cliffs | Sudden failure mode | Set conservative boundaries |
| Gradual slopes | Graceful degradation | Acceptable with monitoring |
| Truncation overlay | Context limit issues | Increase context or reduce prompts |
| Small/missing green | Fragile performance | Avoid this task or difficulty range |
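The cliff-vs-slope distinction in the table can be made precise along one axis of the grid: a cliff is a single large step down, a slope is a steady decline. A rough heuristic sketch (the `cliff_drop` threshold is an illustrative placeholder, not a ReasonScape constant):

```python
def failure_shape(accuracies, cliff_drop=0.4):
    """Classify how accuracy falls along one difficulty axis.

    One step larger than `cliff_drop` is a cliff (sudden failure);
    any other overall decline is a slope (graceful degradation).
    """
    steps = [a - b for a, b in zip(accuracies, accuracies[1:])]
    if not steps:
        return "flat"
    if max(steps) >= cliff_drop:
        return "cliff"
    return "slope" if sum(steps) > 0 else "flat"

print(failure_shape([0.95, 0.93, 0.90, 0.30]))  # cliff: set conservative boundaries
print(failure_shape([0.95, 0.80, 0.65, 0.50]))  # slope: acceptable with monitoring
```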
See analyze.py surface reference for flags and examples.
## Compression: Reasoning Efficiency Analysis
Analyzes token usage versus entropy patterns to understand reasoning efficiency. Each point is a model output plotted as tokens (x-axis) vs compressed entropy (y-axis), colored by outcome (correct/incorrect/invalid/truncated).
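The section above does not define the entropy metric exactly; a common proxy for "compressed entropy" is bits per character after general-purpose compression, which is what this sketch uses. Repetitive (looping) reasoning compresses well and scores low; varied reasoning scores high.

```python
import zlib

def compressed_entropy(text: str) -> float:
    """Bits per character after zlib compression -- a rough proxy for
    information content (the exact ReasonScape metric may differ)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return 8 * len(zlib.compress(raw, level=9)) / len(raw)

looping = "I need to check this again. " * 100  # repetitive loop: low entropy
varied = "Subtract 17 from 240, halve it, then compare against the prime 109."
print(compressed_entropy(looping) < compressed_entropy(varied))  # True
```

A broken loop shows up as many tokens whose compressed entropy barely grows, which is exactly the flat pattern described below.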
### What It Reveals
The distribution of outcome populations tells you how the model fails, not just that it fails:
| Pattern | Meaning | Action |
|---|---|---|
| Underthink | Incorrect answers cluster at low tokens, low entropy | Encourage longer reasoning via prompts |
| Overthink | Incorrect answers cluster at high tokens, low entropy | Set max_tokens limits, tune temperature |
| Broken loops | Very high tokens, flat/zero entropy growth | Architecture issue, hard to fix |
| Healthy | Clear population separation by outcome | Reasoning works, failure is elsewhere |
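The table maps cluster position to failure mode, which suggests a simple rule over the centroid of incorrect answers. A hedged sketch; every threshold here is an illustrative placeholder, and in practice you read the scatterplot rather than a classifier:

```python
def failure_mode(mean_tokens, mean_entropy,
                 low_tok=500, high_tok=4000, low_ent=2.0, flat_ent=0.5):
    """Label an incorrect-answer cluster by its (tokens, entropy) centroid.
    All thresholds are illustrative placeholders, not ReasonScape's."""
    if mean_tokens > high_tok and mean_entropy < flat_ent:
        return "broken loops"   # very high tokens, flat/zero entropy
    if mean_tokens > high_tok and mean_entropy < low_ent:
        return "overthink"      # high tokens, low entropy
    if mean_tokens < low_tok and mean_entropy < low_ent:
        return "underthink"     # low tokens, low entropy
    return "no dominant pattern"

print(failure_mode(mean_tokens=200, mean_entropy=1.0))   # underthink
print(failure_mode(mean_tokens=6000, mean_entropy=0.1))  # broken loops
```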
### Single Model vs Multiple Models
Single model: Identify the dominant failure mode for each task.
Multiple models: Use --facet-by eval_id to stack models as subplot rows within each task file. Directly compare whether both models overthink, or whether one is efficient where the other loops.
### Dimensionality
--split-by controls output files, --facet-by controls subplot rows within a file, --group-by controls colored series within each panel. The defaults (split-by base_task, facet-by eval_id, group-by manifold.id) work well for most comparisons.
See analyze.py compression reference for flags and examples.
## Hazard: Temporal Failure Analysis
Treats token generation as a temporal process and computes the Cumulative Incidence Function (CIF) to show when during generation a model produces correct answers, incorrect answers, or gets stuck.
### What It Reveals
Most profile tools aggregate across all outputs for a point. Hazard preserves the time dimension — it shows whether failures cluster early (the model gives up quickly), late (the model reasons extensively before failing), or at specific token positions.
The most important pattern is the positional hazard spike: a sharp increase in failure incidence at a specific token count (commonly 4k, 8k, or 16k tokens). A spike at a round number is a strong signal of a context wall — either a pretraining length limit, a misconfigured RoPE scaling, or an attention pattern that degrades sharply at a boundary. This is distinct from gradual degradation, which shows as a smooth rising failure curve across the full token range.
| Pattern | Meaning | Action |
|---|---|---|
| Early failure spike | Model answers (correctly or not) at low token counts | Check for underthinking |
| Smooth late-rising failures | Gradual difficulty scaling | Normal difficulty effect |
| Positional hazard spike at N | Context wall at N tokens | Check RoPE config, pretrain length |
| Flat hazard, high truncation | Model never resolves — fills context with loops | Investigate with probe.py truncation |
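The CIF described above is just the running fraction of outputs that resolved with a given outcome by each token position, so a positional hazard spike appears as a jump in the curve. A minimal empirical sketch (event format is hypothetical; ReasonScape's internal representation may differ):

```python
from collections import Counter

def cumulative_incidence(events, event_type, horizon):
    """Empirical Cumulative Incidence Function for one competing outcome.

    `events` is a list of (token_position, outcome) pairs, one per output.
    CIF(t) = fraction of all outputs that resolved with `event_type`
    at or before token position t.
    """
    n = len(events)
    counts = Counter(pos for pos, out in events if out == event_type)
    cif, total = [], 0
    for t in range(horizon + 1):
        total += counts[t]
        cif.append(total / n)
    return cif

# A cluster of incorrect answers at position 4 shows as a jump in the CIF.
events = [(2, "correct"), (3, "correct"), (4, "incorrect"),
          (4, "incorrect"), (6, "correct")]
cif = cumulative_incidence(events, "incorrect", horizon=6)
print(cif[3], cif[4])  # 0.0 0.4 -- the jump marks the hazard spike
```

At real scale the positions are token counts in the thousands, and a jump at a round number (4k, 8k) is the context-wall signature.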
### Single Model vs Multiple Models
Single model: Establish the temporal profile — does this model have a clean CIF or a positional spike?
Multiple models: Compare CIF curves directly. If Model A has a spike at 8k and Model B does not, the difference is architectural, not task difficulty.
### Relationship to Probe
Hazard tells you when failures occur across many points in aggregate. When hazard reveals a spike or anomalous pattern, use probe.py truncation on the individual points near that token position to see what the model was doing at that moment.
See analyze.py hazard reference for flags and examples.
## Common Workflows
### Workflow 1: Pre-Deployment Characterization

```bash
# Step 1: Get the fingerprint
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/

# Step 2: Map capability zones for weak tasks identified in spiderweb
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["weak-task"]}' \
    --output-dir research/<project>/

# Step 3: Check reasoning efficiency for high-cost tasks
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["expensive-task"]}' \
    --output-dir research/<project>/

# Step 4: Check temporal profile — any context walls?
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/
```

Goal: Determine whether the model is suitable for the production use case.
### Workflow 2: Model Comparison

```bash
# Step 1: Overlay fingerprints
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

# Step 2: Compare failure boundaries on tasks where fingerprints diverge
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"], "base_task": ["critical-task"]}' \
    --output-dir research/<project>/

# Step 3: Compare reasoning efficiency
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

# Step 4: Compare temporal profiles — do context walls differ?
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/
```

Goal: Choose between statistically similar models identified in Position, or understand their architectural differences.
## Filters

All Profile tools support filtering by eval_id, groups, and base_task. Discover available values with `python analyze.py evals` and `python analyze.py tasks`.
For complete filter syntax and examples, see the Filter Reference.
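Because `--filters` takes a JSON string, hand-writing the nested quotes is error-prone; building the argument programmatically avoids that. A small sketch (the eval_id values are hypothetical placeholders):

```python
import json
import shlex

# Build the --filters argument from a dict instead of hand-quoting JSON.
filters = {"eval_id": ["model-a", "model-b"], "base_task": ["arithmetic"]}
arg = json.dumps(filters)

# shlex.quote makes the JSON safe to paste into a shell command line.
cmd = f"python analyze.py spiderweb data/dataset.json --filters {shlex.quote(arg)}"
print(cmd)
```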
## Next Steps After Profiling

- Found capability boundaries? → Set production guardrails within green zones
- Found reasoning inefficiency? → Adjust inference parameters (max_tokens, temperature), then return to Profile to verify
- Found a positional hazard spike? → Check RoPE configuration and pretraining context length; compare against a model without the spike
- Found high truncation with flat hazard? → Use probe.py truncation to classify the loop type
- Model not suitable? → Return to Position to select a different candidate
- Need to see what the model actually produced? → Use Probe (failure, truncation, fft)
- Need deeper root-cause analysis? → Filter tighter (single eval_id + single task) and read the plots diagnostically