Spider Plot Visualization (spiderweb.py)
The Spider Plot visualization (spiderweb.py) creates per-model cognitive fingerprints that reveal performance patterns across all 12 reasoning domains simultaneously. Unlike aggregate leaderboards, spider plots expose domain-specific strengths, weaknesses, and characteristic performance signatures.
Overview
Spider plots (also called radar charts or web diagrams) display multi-dimensional data as a 2D shape, where each axis represents a different reasoning task. ReasonScape's implementation adds:
- Dual-axis visualization: Accuracy webs + token usage bars on the same plot
- Three difficulty levels: Easy, Medium, Hard shown simultaneously with color coding
- Statistical confidence: Shaded confidence intervals around accuracy lines
- Truncation awareness: Purple overlays indicate context window failures
- Two visualization modes: Polar spider web or vertical bar chart
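The shaded confidence intervals mentioned above need a lower and upper bound for each accuracy point. This page does not say which interval method spiderweb.py uses, so the following is only a minimal sketch that assumes a Wilson score interval on a binomial proportion; the function name `wilson_interval` is hypothetical.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (low, high) Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval. This is an assumed
    method for illustration, not necessarily what spiderweb.py computes.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to the valid accuracy range.
    return max(0.0, center - half), min(1.0, center + half)

# 80 correct answers out of 100 samples.
low, high = wilson_interval(correct=80, total=100)
print(f"accuracy=0.80, 95% CI=({low:.3f}, {high:.3f})")
```

Unlike the simpler normal approximation, the Wilson interval stays inside the 0 to 1 range even when accuracy is close to 0% or 100%, which matters for tasks a model nearly always solves or nearly always fails.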
Key Features
- Cognitive Archetypes: Identify characteristic performance patterns (see Cognitive Archetypes section)
- Multi-difficulty comparison: See how model performance degrades across difficulty levels
- Token efficiency analysis: Bars show response length for each task/difficulty
- Truncation detection: Purple overlays highlight context window issues
- Interactive webapp: Toggle between spider and bar visualizations
- Export formats: PNG (web-png, bar-png), JSON, Markdown
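The token bars and truncation overlays listed above both reduce to simple per-(task, difficulty) aggregates. As a rough sketch, assuming per-response records with hypothetical `tokens` and `truncated` fields (the real dataset schema is not described on this page), they could be computed like this:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-response records; the field names are illustrative only.
records = [
    {"task": "arithmetic", "difficulty": "easy", "tokens": 310, "truncated": False},
    {"task": "arithmetic", "difficulty": "easy", "tokens": 290, "truncated": False},
    {"task": "arithmetic", "difficulty": "hard", "tokens": 3900, "truncated": True},
    {"task": "arithmetic", "difficulty": "hard", "tokens": 2100, "truncated": False},
]

def summarize(rows):
    """Group rows by (task, difficulty) and compute the bar and overlay heights."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["task"], row["difficulty"])].append(row)
    return {
        key: {
            # Token bar height: average response length.
            "mean_tokens": mean([r["tokens"] for r in members]),
            # Overlay height: fraction of responses that were truncated (0-1).
            "truncation_ratio": sum(r["truncated"] for r in members) / len(members),
        }
        for key, members in groups.items()
    }

summary = summarize(records)
for key, stats in summary.items():
    print(key, stats)
```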
Quick Start
Launch the interactive visualization server:
python spiderweb.py data/dataset-m12x.json --port 8051
# Open http://127.0.0.1:8051 in your browser
Features:
- Dropdown selector to switch between models
- Toggle between spider web and bar chart views
- Interactive hover tooltips with accuracy, tokens, truncation
- Dynamic legend explaining all visual elements
Visualization Modes
Spider Web Plot

Visual Elements:
- Radial Grid
  - 12 axes (one per reasoning domain)
  - Phantom spacing between axes for clarity
  - Accuracy scale: 0% (center) to 100% (outer edge)
  - Token scale: 0-4000 tokens (annotations on right)
- Accuracy Webs (Lines with markers)
  - Green = Easy difficulty
  - Orange = Medium difficulty
  - Red = Hard difficulty
  - Solid lines connect accuracy points
  - Dotted confidence bounds show the 95% CI
  - Shaded confidence regions
- Token Usage Bars (Radial bars)
  - Three bars per task (side-by-side)
  - Bar height = average response tokens
  - Bar opacity increases with difficulty
  - Same color scheme as accuracy lines
- Truncation Overlays (Purple bars)
  - Overlay on top of token bars
  - Height = truncation ratio (0-1)
  - Indicates context window failures
  - Only shown when truncation occurs
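The radial grid described above places each task on its own axis, with accuracy as the distance from the center. A minimal sketch of that mapping follows; it assumes evenly spaced axes and ignores the phantom spacing, and `web_vertices` is a hypothetical helper rather than a function from spiderweb.py.

```python
import math

def web_vertices(accuracies: list[float]) -> list[tuple[float, float]]:
    """Map per-task accuracies (0..1) to (x, y) points on evenly spaced axes."""
    n = len(accuracies)
    if n == 0:
        return []
    points = []
    for i, acc in enumerate(accuracies):
        theta = 2 * math.pi * i / n      # angle of axis i
        r = min(max(acc, 0.0), 1.0)      # 0 = center, 1 = outer edge
        points.append((r * math.cos(theta), r * math.sin(theta)))
    points.append(points[0])             # repeat the first point to close the web
    return points

# Four hypothetical tasks, to keep the output short.
vertices = web_vertices([1.0, 0.5, 0.0, 0.5])
for x, y in vertices:
    print(f"({x:+.2f}, {y:+.2f})")
```

Drawing one such closed polygon per difficulty level, in green, orange, and red, gives the three nested webs.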
Best for:
- Holistic performance overview
- Identifying imbalanced capabilities
- Spotting characteristic cognitive patterns
- Publication-ready circular visualizations
Bar Chart Plot
Visual Elements:
- Horizontal Layout
  - X-axis: 12 reasoning domains
  - Y-axis (left): Accuracy (0-1 scale)
  - Y-axis (right): Tokens (0-4k scale)
- Accuracy Lines (Lines with markers)
  - Same color scheme as spider plot
  - Error bars show the 95% CI
  - Solid markers at each task position
- Token Usage Bars (Vertical bars)
  - Three bars per task (left/center/right)
  - Height = average response tokens
  - Grouped by difficulty at each task
- Truncation Overlays (Purple bars)
  - Overlay on token bars
  - Same interpretation as spider plot
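The left/center/right grouping of token bars described above comes down to a small offset calculation around each task's x position. The sketch below is illustrative only: the `group_width` value and the `bar_positions` helper are assumptions, not settings taken from spiderweb.py.

```python
DIFFICULTIES = ["easy", "medium", "hard"]

def bar_positions(num_tasks: int, group_width: float = 0.6) -> dict:
    """Return {(task_index, difficulty): x_center} for side-by-side bars.

    Task i is centered at x = i; its three bars share `group_width`.
    """
    bar_width = group_width / len(DIFFICULTIES)
    positions = {}
    for task_index in range(num_tasks):
        for j, difficulty in enumerate(DIFFICULTIES):
            # j = 0, 1, 2 maps to offsets of -1, 0, +1 bar widths.
            offset = (j - (len(DIFFICULTIES) - 1) / 2) * bar_width
            positions[(task_index, difficulty)] = task_index + offset
    return positions

positions = bar_positions(num_tasks=2)
for key, x in positions.items():
    print(key, round(x, 2))
```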
Best for:
- Precise accuracy comparisons
- Reading exact values from axes
- Displays with many tasks or domains (easier to read)
- Side-by-side comparison with other charts
Command Line Reference
Positional Arguments
- config: Dataset configuration JSON file (e.g., data/dataset-m12x.json)
Optional Arguments
- --port PORT: Server port (default: 8051)
  - Example: --port 8050
- --url-base-pathname PATH: Base URL path (default: /)
  - Example: --url-base-pathname /spiderweb/
  - Useful for reverse proxy setups
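For readers wrapping the tool in their own scripts, the documented arguments can be mirrored with argparse. This is a sketch reconstructed from the reference above, not the actual source of spiderweb.py; only the argument names and defaults come from this page, and `build_parser` is a hypothetical helper.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line interface."""
    parser = argparse.ArgumentParser(prog="spiderweb.py")
    parser.add_argument("config", help="Dataset configuration JSON file")
    parser.add_argument("--port", type=int, default=8051, help="Server port")
    parser.add_argument("--url-base-pathname", default="/", help="Base URL path")
    return parser

# Parse an explicit argument list instead of sys.argv so the example is repeatable.
args = build_parser().parse_args(
    ["data/dataset-m12x.json", "--url-base-pathname", "/spiderweb/"]
)
print(args.config, args.port, args.url_base_pathname)
```

Note that argparse exposes `--url-base-pathname` as the attribute `url_base_pathname`, with hyphens converted to underscores.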
Cognitive Archetypes
Spider plots reveal characteristic performance patterns that identify cognitive archetypes:
1. Balanced Generalist
- Pattern: Circular or near-circular shape
- Characteristics: Consistent performance across all domains
- Interpretation: Broad training, no obvious architectural biases
2. Math Specialist
- Pattern: Pronounced peaks in Arithmetic, Dates, Sequence
- Characteristics: Strong symbolic/numerical reasoning, weaker language tasks
- Interpretation: Domain-specific training or architectural advantages
3. Language Native
- Pattern: High on Objects, Movies, Letters, Sort
- Characteristics: Strong natural language understanding, weaker pure logic
- Interpretation: Chat/dialogue training emphasis
4. Structural Parser
- Pattern: Peaks in Brackets, Shapes, Boolean
- Characteristics: Strong symbolic parsing, pattern recognition
- Interpretation: Syntax awareness from code training
5. State Tracker
- Pattern: High on Shuffle, Cars, Objects
- Characteristics: Excels at tracking evolving state
- Interpretation: Working memory architectural advantages
6. Fragile Specialist
- Pattern: Spiky with high variance between domains
- Characteristics: Excellent in 2-3 domains, fails others
- Interpretation: Limited capacity forces specialization
7. Scaling Failure
- Pattern: Progressive inward collapse from Easy → Hard
- Characteristics: Performance degrades uniformly as difficulty increases
- Interpretation: Insufficient parameters or training
8. Context Limit Victim
- Pattern: Purple truncation overlays on specific tasks
- Characteristics: Good accuracy until context window exceeded
- Interpretation: Architectural limit, not capability limit
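The Context Limit Victim pattern can also be screened for numerically before opening the plot. The sketch below assumes a per-task truncation ratio is already available; the 0.05 threshold, the sample values, and the `context_limited` helper are illustrative assumptions, not part of ReasonScape.

```python
def context_limited(truncation: dict[str, float], min_ratio: float = 0.05) -> list[str]:
    """Return tasks whose truncation ratio meets the threshold, worst first."""
    flagged = [task for task, ratio in truncation.items() if ratio >= min_ratio]
    return sorted(flagged, key=lambda task: truncation[task], reverse=True)

# Hypothetical truncation ratios at the Hard difficulty level.
hard_truncation = {"arithmetic": 0.0, "brackets": 0.42, "shuffle": 0.08}
flagged_tasks = context_limited(hard_truncation)
print(flagged_tasks)
```

Tasks flagged this way are the ones where low accuracy may reflect the context window rather than reasoning ability, so they deserve a closer look in the plot.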
Archetype Identification Workflow
- Visual inspection: What shape does the spider web make?
- Peak identification: Which 3-4 tasks have highest accuracy?
- Variance analysis: How much does performance vary between domains?
- Difficulty scaling: Does the shape change or shrink across difficulties?
- Truncation check: Are purple bars present? On which tasks?
- Token patterns: Do token bars correlate with accuracy?
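Several steps of this workflow have simple numeric counterparts that can back up a visual impression. The following sketch covers peak identification, variance analysis, and difficulty scaling; the task names are taken from this page, but the accuracy values and helper names are made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical accuracies for one model at two difficulty levels.
profile = {
    "easy": {"arithmetic": 0.95, "brackets": 0.90, "movies": 0.60, "shuffle": 0.85},
    "hard": {"arithmetic": 0.70, "brackets": 0.40, "movies": 0.35, "shuffle": 0.55},
}

def top_tasks(accuracy: dict[str, float], k: int = 3) -> list[str]:
    """Peak identification: the k tasks with the highest accuracy."""
    return sorted(accuracy, key=accuracy.get, reverse=True)[:k]

def spread(accuracy: dict[str, float]) -> float:
    """Variance analysis: population standard deviation across domains."""
    return pstdev(list(accuracy.values()))

def shrinkage(easy: dict[str, float], hard: dict[str, float]) -> float:
    """Difficulty scaling: drop in mean accuracy from Easy to Hard."""
    return mean(list(easy.values())) - mean(list(hard.values()))

peaks = top_tasks(profile["easy"])
drop = shrinkage(profile["easy"], profile["hard"])
print("peaks:", peaks)
print("spread (easy):", round(spread(profile["easy"]), 3))
print("easy-to-hard drop:", round(drop, 3))
```

A low spread with a small drop suggests a Balanced Generalist, while a low spread with a large drop looks more like a Scaling Failure; any cutoffs between these would need to be chosen per dataset.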
Best Practices
Visualization Selection
Use spider plots when:
- Comparing overall cognitive profile
- Identifying domain-specific strengths/weaknesses
- Spotting characteristic patterns (archetypes)
- Creating publication figures
- Initial model assessment
Use bar charts when:
- Precise numerical values needed
- Many tasks (>12) to display
- Audience unfamiliar with radar charts
- Side-by-side comparison with other bar charts
Interpretation Guidelines
- Shape matters: Circular = balanced, spiky = specialized
- Size matters: Larger = better overall performance
- Spacing matters: Easy→Hard shrinkage = scaling issues
- Color matters: Consistent coloring aids difficulty comparison
- Purple matters: Truncation indicates architectural limits, not capability limits
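The "size matters" guideline can be made concrete by measuring how much of the full web a model's polygon fills. With n evenly spaced axes, the area enclosed by radii r_0 ... r_(n-1) is 0.5 * sin(2π/n) * Σ r_i * r_(i+1). The sketch below applies that formula; it assumes evenly spaced axes, and the helper names are hypothetical.

```python
import math

def web_area(radii: list[float]) -> float:
    """Area of a radar polygon whose vertices sit on n evenly spaced axes."""
    n = len(radii)
    if n < 3:
        raise ValueError("a polygon needs at least 3 axes")
    wedge = 0.5 * math.sin(2 * math.pi / n)
    # Sum the triangle areas between each pair of neighbouring axes.
    return wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))

def fill_fraction(radii: list[float]) -> float:
    """Polygon area as a fraction of the area at 100% accuracy on every axis."""
    return web_area(radii) / web_area([1.0] * len(radii))

uniform_half = fill_fraction([0.5] * 12)
print(round(uniform_half, 3))
```

Two caveats apply when reading size this way: area grows with the square of accuracy, so a model at 50% everywhere fills only a quarter of the web, and the value depends on the order of the axes because each term multiplies neighbouring domains.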
See Also
- analyze.py: Unified CLI with spider subcommand
- leaderboard.py: Aggregate rankings (identifies models to investigate with spider plots)
- explorer.py: 3D difficulty manifold deep-dive (complements spider plots)
- Architecture Guide: Stage 4 visualization philosophy
- Configuration Reference: Dataset configuration format