# Profile: Characterizing and Comparing Models
Questions: "What can this model do?" or "How do these models differ?"
## Overview
Profile workflows characterize and compare model capabilities without producing rankings. They reveal shape, boundaries, and patterns rather than collapsing to a single score.
Profile covers two distinct but related objectives:
- Characterization: Understand a single model's capabilities, boundaries, and cost profile
- Comparison: Understand how multiple models differ in failure modes, efficiency, or architecture
## When to Use Profile
Use Profile workflows when you need to:
- Characterize a model before deployment: "What can this model do well? Where will it fail?"
- Compare capabilities between models: "Where exactly do these models differ?"
- Investigate efficiency trade-offs: "Which model gives me the best cost/performance?"
## The Four Performance Dimensions
Profile is the 'scape in ReasonScape: filters and grouping build projections that characterize a model along four fundamental dimensions:
- Accuracy - Correctness as a function of task-specific complexity
- Reliability - Truncation rates and completion behavior
- Difficulty-Scaling - Accuracy degradation patterns under load
- Resource-Utilization - Token usage patterns
Each Profile tool projects these dimensions into a different geometric view:
| Tool | View | Space | What It Shows |
|---|---|---|---|
| spiderweb | 2D radar/bar | Characterization-Space | Cross-task fingerprint: accuracy + reliability + scaling + resources |
| surface | 3D heatmaps | Output-Space | How accuracy scales across 2D parameter grids |
| compression | 2D scatterplots | Reasoning-Space | How entropy and resource utilization interact with difficulty |
| hazard | 2D CIF plot | Temporal-Space | When failures occur during token generation |
## Choosing Your Projection
Use this decision tree based on what you need to see:
```
START: What aspect of model characterization matters?
│
├─ Overall model profile across all tasks?
│   └─ Use `spiderweb` (Characterization-Space)
│       • Reveals cognitive archetype (generalist vs specialist)
│       • Shows cross-task consistency
│       • Identifies weak/expensive tasks for deeper investigation
│
├─ Where do capabilities break down?
│   └─ Use `surface` (Output-Space)
│       • Maps accuracy across 2D difficulty grids
│       • Shows cliff vs slope failure patterns
│       • Reveals deployment safety boundaries
│
├─ How efficient is reasoning?
│   └─ Use `compression` (Reasoning-Space)
│       • Shows token vs entropy trade-offs
│       • Identifies underthink/overthink/broken loops
│       • Maps resource cost against information content
│
└─ When does the model start failing?
    └─ Use `hazard` (Temporal-Space)
        • Detects positional hazard spikes (4k, 8k boundaries)
        • Distinguishes gradual degradation from sudden collapse
        • Diagnoses RoPE/pretraining context wall effects
```
Key insight: All tools use the same filter→group→postprocess pattern. What differs is the projection geometry.
> [!TIP]
> Not sure where to start? Run `spiderweb` first. Its merged view identifies weak tasks (for `surface` investigation), high-cost tasks (for `compression`), and whether truncation rates are elevated (motivating `hazard` investigation) in one plot.
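The shared filter→group→postprocess pipeline can be sketched in a few lines. This is an illustrative reconstruction, not the code in `analyze.py`; the record fields (`eval_id`, `base_task`, `correct`) mirror the filter keys documented below but are otherwise hypothetical.

```python
from collections import defaultdict
from statistics import mean

def profile(records, filters, group_key, postprocess):
    """Filter -> group -> postprocess: the pattern shared by all Profile tools."""
    # Filter: keep records whose fields match one of each filter's allowed values.
    kept = [r for r in records
            if all(r.get(k) in allowed for k, allowed in filters.items())]
    # Group: bucket by the projection key (e.g. "base_task" for spiderweb axes).
    groups = defaultdict(list)
    for r in kept:
        groups[r[group_key]].append(r)
    # Postprocess: reduce each bucket to the plotted statistic.
    return {k: postprocess(v) for k, v in groups.items()}

records = [
    {"eval_id": "m1", "base_task": "arithmetic", "correct": True},
    {"eval_id": "m1", "base_task": "arithmetic", "correct": False},
    {"eval_id": "m1", "base_task": "sorting", "correct": True},
    {"eval_id": "m2", "base_task": "sorting", "correct": False},
]
acc = profile(records,
              filters={"eval_id": ["m1"]},
              group_key="base_task",
              postprocess=lambda rs: mean(r["correct"] for r in rs))
```

Each tool swaps in a different `group_key` and `postprocess`; the projection geometry is the only thing that changes.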
## Spiderweb: Cross-Task Fingerprint
Creates a visual "fingerprint" showing how a model performs across all tasks simultaneously. The shape reveals the model's cognitive archetype.
### Single Model vs Multiple Models
Single model is the most common use. Filter to one eval_id and the web shows that model's full capability profile — strengths, weaknesses, and cross-task consistency in one view.
Multiple models overlays their fingerprints directly. Use this to see exactly where profiles diverge: tasks where one model dominates, tasks where they're equivalent, and whether the gap is broad or narrow. This is the primary comparison tool before committing to deeper per-task analysis.
### Grouping
By default, spiderweb slices by base_task — one axis per task. For single-task datasets (e.g. tables-16k), use --group-by manifold.target_tokens or --group-by params.operation to get a meaningful web shape.
### Output Formats

- `--format webpng` (default): Radar chart. Shape identifies the cognitive archetype at a glance.
- `--format barpng`: Bar chart with error bars. Better for precise value comparison across models.
Reading the results:
Identify the cognitive archetype from the fingerprint shape:
| Archetype | Pattern | Deployment Implication |
|---|---|---|
| Balanced Generalist | Low variance, solid across all tasks | Safe for diverse workloads |
| Task Specialist | High variance, excellent at few tasks | Use only for strong tasks |
| Truncation Victim | High truncation rates across all tasks | Consider another model |
| Inefficient Reasoner | High token usage, decent accuracy | Use large context window |
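The archetype table above can be read as a simple decision rule. The sketch below is illustrative only: the thresholds (`var_thresh`, `trunc_thresh`, `token_thresh`) are placeholder values, not anything ReasonScape defines, and in practice you judge the shape visually.

```python
from statistics import mean, pstdev

def archetype(task_accuracy, truncation_rate, tokens_per_answer,
              var_thresh=0.15, trunc_thresh=0.2, token_thresh=2000):
    """Classify a spiderweb fingerprint. All thresholds are illustrative."""
    accs = list(task_accuracy.values())
    if truncation_rate > trunc_thresh:
        return "Truncation Victim"        # high truncation across tasks
    if tokens_per_answer > token_thresh and mean(accs) > 0.6:
        return "Inefficient Reasoner"     # decent accuracy, high token cost
    if pstdev(accs) > var_thresh:
        return "Task Specialist"          # high cross-task variance
    return "Balanced Generalist"          # low variance, solid everywhere

print(archetype({"arithmetic": 0.9, "sorting": 0.85, "logic": 0.88},
                truncation_rate=0.02, tokens_per_answer=400))
# Balanced Generalist
```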
See analyze.py spiderweb reference for flags and examples.
## Surface: Capability Zone Mapping
Visualizes accuracy across 2D parameter grids, revealing exactly where performance breaks down and how.
### What a Surface Is
A surface is any view with exactly two group_by dimensions. Views are defined in the experiment config (see config.md). The surface command renders all matching 2-dim views by default, or you can select specific ones with --view.
Grid layout: rows = views, cols = evals. This means a multi-model comparison produces a grid where each column is a model and each row is a parameter projection — making failure boundary differences immediately visible.
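Conceptually, each panel of a surface is an accuracy aggregate over two `group_by` dimensions. A minimal sketch, assuming hypothetical record fields (`length`, `depth`, `correct`) standing in for the two parameters of a view:

```python
from collections import defaultdict
from statistics import mean

def surface_grid(records, dim_x, dim_y):
    """Aggregate accuracy over a 2D parameter grid (two group_by dimensions)."""
    cells = defaultdict(list)
    for r in records:
        cells[(r[dim_x], r[dim_y])].append(r["correct"])
    xs = sorted({x for x, _ in cells})
    ys = sorted({y for _, y in cells})
    # rows = dim_y values, cols = dim_x values; None marks unsampled cells
    grid = [[mean(cells[(x, y)]) if (x, y) in cells else None for x in xs]
            for y in ys]
    return grid, xs, ys

records = [
    {"length": 8,  "depth": 1, "correct": True},
    {"length": 8,  "depth": 2, "correct": True},
    {"length": 16, "depth": 1, "correct": True},
    {"length": 16, "depth": 2, "correct": False},
]
grid, xs, ys = surface_grid(records, "length", "depth")
```

Rendering that grid as a heatmap per model, side by side, is what makes boundary differences jump out.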
### Single Model vs Multiple Models
Single model: Maps capability zones for deployment decisions. Green regions are safe operating ranges; red regions are where the model fails.
Multiple models: Stack them as columns in the same grid. Directly compare where each model's green zone ends. Larger green zones mean more robust capability; different boundary shapes reveal different underlying limitations.
### Splitting Output
--split-by base_task (default) produces one file per task — use this when comparing across all tasks. --split-by none produces a single combined file — use this when focused on one task or a small model set.
Reading the results:
| Pattern | Meaning | Action |
|---|---|---|
| Large green zones | Robust capability | Safe for production |
| Sharp cliffs | Sudden failure mode | Set conservative boundaries |
| Gradual slopes | Graceful degradation | Acceptable with monitoring |
| Truncation overlay | Context limit issues | Increase context or reduce prompts |
| Small/missing green | Fragile performance | Avoid this task or difficulty range |
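The cliff-vs-slope distinction in the table can be made precise along one axis of the grid: a cliff is a single large step down, a slope is a steady decline. A rough heuristic sketch (the `cliff_drop` threshold is an illustrative placeholder, not a ReasonScape constant):

```python
def failure_shape(accuracies, cliff_drop=0.4):
    """Classify how accuracy falls along one difficulty axis.

    One step larger than `cliff_drop` is a cliff (sudden failure);
    any other overall decline is a slope (graceful degradation).
    """
    steps = [a - b for a, b in zip(accuracies, accuracies[1:])]
    if not steps:
        return "flat"
    if max(steps) >= cliff_drop:
        return "cliff"
    return "slope" if sum(steps) > 0 else "flat"

print(failure_shape([0.95, 0.93, 0.90, 0.30]))  # cliff: set conservative boundaries
print(failure_shape([0.95, 0.80, 0.65, 0.50]))  # slope: acceptable with monitoring
```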
See analyze.py surface reference for flags and examples.
## Compression: Reasoning Efficiency Analysis
Analyzes token usage versus entropy patterns to understand reasoning efficiency. Each point is a model output plotted as tokens (x-axis) vs compressed entropy (y-axis), colored by outcome (correct/incorrect/invalid/truncated).
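The section above does not define the entropy metric exactly; a common proxy for "compressed entropy" is bits per character after general-purpose compression, which is what this sketch uses. Repetitive (looping) reasoning compresses well and scores low; varied reasoning scores high.

```python
import zlib

def compressed_entropy(text: str) -> float:
    """Bits per character after zlib compression -- a rough proxy for
    information content (the exact ReasonScape metric may differ)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return 8 * len(zlib.compress(raw, level=9)) / len(raw)

looping = "I need to check this again. " * 100  # repetitive loop: low entropy
varied = "Subtract 17 from 240, halve it, then compare against the prime 109."
print(compressed_entropy(looping) < compressed_entropy(varied))  # True
```

A broken loop shows up as many tokens whose compressed entropy barely grows, which is exactly the flat pattern described below.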
### What It Reveals
The distribution of outcome populations tells you how the model fails, not just that it fails:
| Pattern | Meaning | Action |
|---|---|---|
| Underthink | Incorrect answers cluster at low tokens, low entropy | Encourage longer reasoning via prompts |
| Overthink | Incorrect answers cluster at high tokens, low entropy | Set max_tokens limits, tune temperature |
| Broken loops | Very high tokens, flat/zero entropy growth | Architecture issue, hard to fix |
| Healthy | Clear population separation by outcome | Reasoning works, failure is elsewhere |
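The table maps cluster position to failure mode, which suggests a simple rule over the centroid of incorrect answers. A hedged sketch; every threshold here is an illustrative placeholder, and in practice you read the scatterplot rather than a classifier:

```python
def failure_mode(mean_tokens, mean_entropy,
                 low_tok=500, high_tok=4000, low_ent=2.0, flat_ent=0.5):
    """Label an incorrect-answer cluster by its (tokens, entropy) centroid.
    All thresholds are illustrative placeholders, not ReasonScape's."""
    if mean_tokens > high_tok and mean_entropy < flat_ent:
        return "broken loops"   # very high tokens, flat/zero entropy
    if mean_tokens > high_tok and mean_entropy < low_ent:
        return "overthink"      # high tokens, low entropy
    if mean_tokens < low_tok and mean_entropy < low_ent:
        return "underthink"     # low tokens, low entropy
    return "no dominant pattern"

print(failure_mode(mean_tokens=200, mean_entropy=1.0))   # underthink
print(failure_mode(mean_tokens=6000, mean_entropy=0.1))  # broken loops
```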
### Single Model vs Multiple Models
Single model: Identify the dominant failure mode for each task.
Multiple models: Use --facet-by eval_id to stack models as subplot rows within each task file. Directly compare whether both models overthink, or whether one is efficient where the other loops.
### Dimensionality
--split-by controls output files, --facet-by controls subplot rows within a file, --group-by controls colored series within each panel. The defaults (split-by base_task, facet-by eval_id, group-by manifold.id) work well for most comparisons.
See analyze.py compression reference for flags and examples.
## Hazard: Temporal Failure Analysis
Treats token generation as a temporal process and computes the Cumulative Incidence Function (CIF) to show when during generation a model produces correct answers, incorrect answers, or gets stuck.
### What It Reveals
Most profile tools aggregate across all outputs for a point. Hazard preserves the time dimension — it shows whether failures cluster early (the model gives up quickly), late (the model reasons extensively before failing), or at specific token positions.
The most important pattern is the positional hazard spike: a sharp increase in failure incidence at a specific token count (commonly 4k, 8k, or 16k tokens). A spike at a round number is a strong signal of a context wall — either a pretraining length limit, a misconfigured RoPE scaling, or an attention pattern that degrades sharply at a boundary. This is distinct from gradual degradation, which shows as a smooth rising failure curve across the full token range.
| Pattern | Meaning | Action |
|---|---|---|
| Early failure spike | Model answers (correctly or not) at low token counts | Check for underthinking |
| Smooth late-rising failures | Gradual difficulty scaling | Normal difficulty effect |
| Positional hazard spike at N | Context wall at N tokens | Check RoPE config, pretrain length |
| Flat hazard, high truncation | Model never resolves — fills context with loops | Investigate with probe.py truncation |
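The CIF described above is just the running fraction of outputs that resolved with a given outcome by each token position, so a positional hazard spike appears as a jump in the curve. A minimal empirical sketch (event format is hypothetical; ReasonScape's internal representation may differ):

```python
from collections import Counter

def cumulative_incidence(events, event_type, horizon):
    """Empirical Cumulative Incidence Function for one competing outcome.

    `events` is a list of (token_position, outcome) pairs, one per output.
    CIF(t) = fraction of all outputs that resolved with `event_type`
    at or before token position t.
    """
    n = len(events)
    counts = Counter(pos for pos, out in events if out == event_type)
    cif, total = [], 0
    for t in range(horizon + 1):
        total += counts[t]
        cif.append(total / n)
    return cif

# A cluster of incorrect answers at position 4 shows as a jump in the CIF.
events = [(2, "correct"), (3, "correct"), (4, "incorrect"),
          (4, "incorrect"), (6, "correct")]
cif = cumulative_incidence(events, "incorrect", horizon=6)
print(cif[3], cif[4])  # 0.0 0.4 -- the jump marks the hazard spike
```

At real scale the positions are token counts in the thousands, and a jump at a round number (4k, 8k) is the context-wall signature.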
### Single Model vs Multiple Models
Single model: Establish the temporal profile — does this model have a clean CIF or a positional spike?
Multiple models: Compare CIF curves directly. If Model A has a spike at 8k and Model B does not, the difference is architectural, not task difficulty.
### Relationship to Probe
Hazard tells you when failures occur across many points in aggregate. When hazard reveals a spike or anomalous pattern, use probe.py truncation on the individual points near that token position to see what the model was doing at that moment.
See analyze.py hazard reference for flags and examples.
## Common Workflows
### Workflow 1: Pre-Deployment Characterization

```bash
# Step 1: Get the fingerprint
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/

# Step 2: Map capability zones for weak tasks identified in spiderweb
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["weak-task"]}' \
    --output-dir research/<project>/

# Step 3: Check reasoning efficiency for high-cost tasks
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"], "base_task": ["expensive-task"]}' \
    --output-dir research/<project>/

# Step 4: Check temporal profile — any context walls?
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id>"]}' \
    --output-dir research/<project>/
```

Goal: Determine whether the model is suitable for the production use case.
### Workflow 2: Model Comparison

```bash
# Step 1: Overlay fingerprints
python analyze.py spiderweb data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

# Step 2: Compare failure boundaries on tasks where fingerprints diverge
python analyze.py surface data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"], "base_task": ["critical-task"]}' \
    --output-dir research/<project>/

# Step 3: Compare reasoning efficiency
python analyze.py compression data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/

# Step 4: Compare temporal profiles — do context walls differ?
python analyze.py hazard data/dataset.json \
    --filters '{"eval_id": ["<eval_id_1>", "<eval_id_2>", "<eval_id_3>"]}' \
    --output-dir research/<project>/
```

Goal: Choose between statistically similar models identified in Position, or understand their architectural differences.
## Filters

All Profile tools support filtering by eval_id, groups, and base_task. Discover available values with `python analyze.py evals` and `python analyze.py tasks`.
For complete filter syntax and examples, see the Filter Reference.
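Because `--filters` takes a JSON string, hand-writing the nested quotes is error-prone; building the argument programmatically avoids that. A small sketch (the eval_id values are hypothetical placeholders):

```python
import json
import shlex

# Build the --filters argument from a dict instead of hand-quoting JSON.
filters = {"eval_id": ["model-a", "model-b"], "base_task": ["arithmetic"]}
arg = json.dumps(filters)

# shlex.quote makes the JSON safe to paste into a shell command line.
cmd = f"python analyze.py spiderweb data/dataset.json --filters {shlex.quote(arg)}"
print(cmd)
```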
## Next Steps After Profiling

- Found capability boundaries? → Set production guardrails within green zones
- Found reasoning inefficiency? → Adjust inference parameters (max_tokens, temperature), then return to Profile to verify
- Found a positional hazard spike? → Check RoPE configuration and pretraining context length; compare against a model without the spike
- Found high truncation with flat hazard? → Use probe.py truncation to classify the loop type
- Model not suitable? → Return to Position to select a different candidate
- Need to see what the model actually produced? → Use Probe (failure, truncation, fft)
- Need deeper root-cause analysis? → Filter tighter (single eval_id + single task) and read the plots diagnostically