Spider Plot Visualization (spiderweb.py)

The Spider Plot visualization (spiderweb.py) creates per-model cognitive fingerprints that reveal performance patterns across all 12 reasoning domains simultaneously. Unlike aggregate leaderboards, spider plots expose domain-specific strengths, weaknesses, and characteristic performance signatures.

Overview

Spider plots (also called radar charts or web diagrams) display multi-dimensional data as a 2D shape, where each axis represents a different reasoning task. ReasonScape's implementation adds:

  • Dual-axis visualization: Accuracy webs + token usage bars on the same plot
  • Three difficulty levels: Easy, Medium, Hard shown simultaneously with color coding
  • Statistical confidence: Shaded confidence intervals around accuracy lines
  • Truncation awareness: Purple overlays indicate context window failures
  • Two visualization modes: Polar spider web or vertical bar chart

Key Features

  • Cognitive Archetypes: Identify characteristic performance patterns (see Cognitive Archetypes section)
  • Multi-difficulty comparison: See how model performance degrades across difficulty levels
  • Token efficiency analysis: Bars show response length for each task/difficulty
  • Truncation detection: Purple overlays highlight context window issues
  • Interactive webapp: Toggle between spider and bar visualizations
  • Export formats: PNG (web-png, bar-png), JSON, Markdown

Quick Start

Launch the interactive visualization server:

python spiderweb.py data/dataset-m12x.json --port 8051
# Open http://127.0.0.1:8051 in your browser

Features:

  • Dropdown selector to switch between models
  • Toggle between spider web and bar chart views
  • Interactive hover tooltips with accuracy, tokens, truncation
  • Dynamic legend explaining all visual elements

Visualization Modes

Spider Web Plot

Figure: Spider Plot Example

Visual Elements:

  1. Radial Grid
       • 12 axes (one per reasoning domain)
       • Phantom spacing between axes for clarity
       • Accuracy scale: 0% (center) to 100% (outer edge)
       • Token scale: 0-4000 tokens (annotations on right)

  2. Accuracy Webs (lines with markers)
       • Green = Easy difficulty
       • Orange = Medium difficulty
       • Red = Hard difficulty
       • Solid lines connect accuracy points
       • Dotted lines mark the bounds of the 95% confidence interval
       • Shaded regions fill the interval between those bounds

  3. Token Usage Bars (radial bars)
       • Three bars per task (side-by-side)
       • Bar height = average response tokens
       • Bar opacity increases with difficulty
       • Same color scheme as accuracy lines

  4. Truncation Overlays (purple bars)
       • Overlaid on top of token bars
       • Height = truncation ratio (0-1)
       • Indicates context window failures
       • Only shown when truncation occurs
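The radial layout described above can be sketched in a few lines of plain Python. This is only an illustration of the geometry, not the code inside spiderweb.py: the function names are hypothetical, the domain order around the circle is assumed (alphabetical here), and the phantom spacing between axes is ignored.

```python
import math

# Domain names as they appear in the Cognitive Archetypes section.
# ASSUMPTION: the order around the circle is not documented; alphabetical here.
DOMAINS = ["Arithmetic", "Boolean", "Brackets", "Cars", "Dates", "Letters",
           "Movies", "Objects", "Sequence", "Shapes", "Shuffle", "Sort"]

TOKEN_SCALE_MAX = 4000  # upper end of the documented 0-4000 token scale


def axis_angle(index, n_axes=len(DOMAINS)):
    """Angle in degrees of axis `index`, with axes spread evenly over 360."""
    return 360.0 * index / n_axes


def accuracy_to_xy(index, accuracy):
    """Place an accuracy in [0, 1] on its axis: 0 = center, 1 = outer edge."""
    theta = math.radians(axis_angle(index))
    r = min(max(accuracy, 0.0), 1.0)  # clamp to the plotted range
    return (r * math.cos(theta), r * math.sin(theta))


def token_bar_radius(avg_tokens):
    """Scale average response tokens onto the same 0..1 radial range."""
    return min(max(avg_tokens, 0.0), TOKEN_SCALE_MAX) / TOKEN_SCALE_MAX


def web_polygon(accuracies):
    """Closed outline of one accuracy web (first point repeated at the end)."""
    points = [accuracy_to_xy(i, a) for i, a in enumerate(accuracies)]
    return points + points[:1]


if __name__ == "__main__":
    outline = web_polygon([0.5] * len(DOMAINS))
    print(len(outline), "points;", "bar radius:", token_bar_radius(2000))
```

One web would be drawn per difficulty level, each with its own color, and the token bars would share the same angular positions with a small offset per difficulty.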

Best for:

  • Holistic performance overview
  • Identifying imbalanced capabilities
  • Spotting characteristic cognitive patterns
  • Publication-ready circular visualizations

Bar Chart Plot

Visual Elements:

  1. Horizontal Layout
       • X-axis: 12 reasoning domains
       • Y-axis (left): Accuracy (0-1 scale)
       • Y-axis (right): Tokens (0-4k scale)

  2. Accuracy Lines (lines with markers)
       • Same color scheme as spider plot
       • Error bars show the 95% confidence interval
       • Solid markers at each task position

  3. Token Usage Bars (vertical bars)
       • Three bars per task (left/center/right)
       • Height = average response tokens
       • Grouped by difficulty at each task

  4. Truncation Overlays (purple bars)
       • Overlaid on token bars
       • Same interpretation as spider plot
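This page does not say how the 95% confidence intervals behind the error bars and shaded regions are computed. As an illustration only, the sketch below assumes a Wilson score interval on the per-task accuracy (correct answers out of total attempts) with z = 1.96; both function names are hypothetical and the real implementation may use a different method.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (ASSUMED method, z=1.96 ~ 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return (max(0.0, centre - half), min(1.0, centre + half))


def error_bar(correct, total):
    """Return (accuracy, minus, plus): offsets for an asymmetric error bar."""
    p = correct / total
    low, high = wilson_interval(correct, total)
    return (p, max(0.0, p - low), max(0.0, high - p))


if __name__ == "__main__":
    acc, minus, plus = error_bar(50, 100)
    print(f"accuracy={acc:.2f}  -{minus:.3f} / +{plus:.3f}")
```

A Wilson interval stays inside the 0-1 accuracy range even when a model scores 0% or 100% on a task, which is why it is a common choice for plots like these; a simpler normal approximation would behave poorly at those extremes.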

Best for:

  • Precise accuracy comparisons
  • Reading exact values from axes
  • Datasets with many domains (easier to read)
  • Side-by-side comparison with other charts

Command Line Reference

Positional Arguments

  • config: Dataset configuration JSON file (e.g., data/dataset-m12x.json)

Optional Arguments

  • --port PORT: Server port (default: 8051)
      • Example: --port 8050

  • --url-base-pathname PATH: Base URL path (default: /)
      • Example: --url-base-pathname /spiderweb/
      • Useful for reverse proxy setups
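The arguments above map naturally onto an argparse definition. The sketch below is reconstructed from the documented names and defaults only; it is not the actual parser from spiderweb.py, and the help strings and the build_parser name are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented spiderweb.py arguments."""
    parser = argparse.ArgumentParser(
        prog="spiderweb.py",
        description="Spider plot visualization server (illustrative sketch)",
    )
    parser.add_argument(
        "config",
        help="Dataset configuration JSON file, e.g. data/dataset-m12x.json",
    )
    parser.add_argument(
        "--port", type=int, default=8051,
        help="Server port (default: 8051)",
    )
    parser.add_argument(
        "--url-base-pathname", default="/",
        help="Base URL path, useful behind a reverse proxy (default: /)",
    )
    return parser


# Parse an example command line instead of sys.argv so the sketch is runnable.
args = build_parser().parse_args(
    ["data/dataset-m12x.json", "--url-base-pathname", "/spiderweb/"]
)
print(args.config, args.port, args.url_base_pathname)
```

Note that argparse exposes --url-base-pathname as args.url_base_pathname, with the hyphens converted to underscores.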

Cognitive Archetypes

Spider plots reveal characteristic performance patterns that identify cognitive archetypes:

1. Balanced Generalist

  • Pattern: Circular or near-circular shape
  • Characteristics: Consistent performance across all domains
  • Interpretation: Broad training, no obvious architectural biases

2. Math Specialist

  • Pattern: Pronounced peaks in Arithmetic, Dates, Sequence
  • Characteristics: Strong symbolic/numerical reasoning, weaker language tasks
  • Interpretation: Domain-specific training or architectural advantages

3. Language Native

  • Pattern: High on Objects, Movies, Letters, Sort
  • Characteristics: Strong natural language understanding, weaker pure logic
  • Interpretation: Chat/dialogue training emphasis

4. Structural Parser

  • Pattern: Peaks in Brackets, Shapes, Boolean
  • Characteristics: Strong symbolic parsing, pattern recognition
  • Interpretation: Syntax awareness from code training

5. State Tracker

  • Pattern: High on Shuffle, Cars, Objects
  • Characteristics: Excels at tracking evolving state
  • Interpretation: Working memory architectural advantages

6. Fragile Specialist

  • Pattern: Spiky with high variance between domains
  • Characteristics: Excellent in 2-3 domains, fails others
  • Interpretation: Limited capacity forces specialization

7. Scaling Failure

  • Pattern: Progressive inward collapse from Easy → Hard
  • Characteristics: Performance degrades uniformly as difficulty increases
  • Interpretation: Insufficient parameters or training

8. Context Limit Victim

  • Pattern: Purple truncation overlays on specific tasks
  • Characteristics: Good accuracy until context window exceeded
  • Interpretation: Architectural limit, not capability limit

Archetype Identification Workflow

  1. Visual inspection: What shape does the spider web make?
  2. Peak identification: Which 3-4 tasks have highest accuracy?
  3. Variance analysis: How much does performance vary between domains?
  4. Difficulty scaling: Does the shape change or shrink across difficulties?
  5. Truncation check: Are purple bars present? On which tasks?
  6. Token patterns: Do token bars correlate with accuracy?
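Steps 2 through 5 of this workflow can be approximated numerically once the per-task accuracies are available. The sketch below is a rough aid to visual inspection, not part of ReasonScape: the summarize_profile function, the nested-dictionary layout, and the sample numbers are all hypothetical, and step 6 (token patterns) is left out.

```python
from statistics import mean, pstdev

DIFFICULTIES = ("easy", "medium", "hard")


def summarize_profile(accuracy, truncation, top_n=3):
    """Summarize one model's spider plot.

    accuracy / truncation: {difficulty: {task: value}} (ASSUMED layout).
    """
    tasks = sorted(accuracy["easy"])
    # Average each task over the three difficulty levels.
    per_task = {t: mean(accuracy[d][t] for d in DIFFICULTIES) for t in tasks}
    # Step 2: tasks with the highest average accuracy.
    peaks = sorted(tasks, key=lambda t: per_task[t], reverse=True)[:top_n]
    # Step 3: spread between domains (low = balanced, high = spiky).
    spread = pstdev(list(per_task.values()))
    # Step 4: how far the web shrinks from Easy to Hard.
    level_means = {d: mean(accuracy[d].values()) for d in DIFFICULTIES}
    shrinkage = level_means["easy"] - level_means["hard"]
    # Step 5: tasks with any truncation at any difficulty.
    truncated = [t for t in tasks
                 if any(truncation[d].get(t, 0.0) > 0 for d in DIFFICULTIES)]
    return {"peaks": peaks, "spread": spread,
            "shrinkage": shrinkage, "truncated": truncated}


# Made-up numbers for four tasks, purely to exercise the function.
accuracy = {
    "easy":   {"Arithmetic": 0.9, "Brackets": 0.8, "Movies": 0.6, "Shuffle": 0.7},
    "medium": {"Arithmetic": 0.8, "Brackets": 0.6, "Movies": 0.5, "Shuffle": 0.5},
    "hard":   {"Arithmetic": 0.7, "Brackets": 0.4, "Movies": 0.4, "Shuffle": 0.3},
}
truncation = {"easy": {}, "medium": {}, "hard": {"Shuffle": 0.2}}

summary = summarize_profile(accuracy, truncation, top_n=2)
print(summary)
```

A low spread with little shrinkage points toward the Balanced Generalist pattern, a large shrinkage toward Scaling Failure, and a non-empty truncated list toward Context Limit Victim; any thresholds for these calls would need to be chosen by the reader.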

Best Practices

Visualization Selection

Use spider plots when:

  • Comparing overall cognitive profile
  • Identifying domain-specific strengths/weaknesses
  • Spotting characteristic patterns (archetypes)
  • Creating publication figures
  • Initial model assessment

Use bar charts when:

  • Precise numerical values needed
  • Many tasks (>12) to display
  • Audience unfamiliar with radar charts
  • Side-by-side comparison with other bar charts

Interpretation Guidelines

  1. Shape matters: Circular = balanced, spiky = specialized
  2. Size matters: Larger = better overall performance
  3. Spacing matters: Easy→Hard shrinkage = scaling issues
  4. Color matters: Consistent coloring aids difficulty comparison
  5. Purple matters: Truncation indicates architectural limits, not capability limits
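The "size matters" guideline can be made concrete by computing the area enclosed by an accuracy web. The sketch below assumes evenly spaced axes and uses the triangle-fan formula for a polygon whose vertices lie on those axes; web_area and relative_area are hypothetical helper names, not ReasonScape functions.

```python
import math


def web_area(radii):
    """Area of a radar polygon with evenly spaced axes.

    Each pair of neighbouring axes spans a triangle with the center, of area
    0.5 * r_i * r_next * sin(2*pi / n); the web is the sum of those triangles.
    """
    n = len(radii)
    if n < 3:
        raise ValueError("need at least three axes")
    wedge = math.sin(2 * math.pi / n)
    return 0.5 * wedge * sum(radii[i] * radii[(i + 1) % n] for i in range(n))


def relative_area(radii):
    """Web area as a fraction of the area of a perfect (all 100%) web."""
    return web_area(radii) / web_area([1.0] * len(radii))


if __name__ == "__main__":
    print(relative_area([0.5] * 12))
```

Two caveats apply when reading size this way: area grows with the square of accuracy, so a model at 50% on every task covers only a quarter of the full web, and the area depends on which domains sit next to each other. Mean accuracy is the order-independent alternative when an exact comparison is needed.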

See Also