Spider Plot Visualization (spiderweb.py)

The Spider Plot visualization (spiderweb.py) creates per-model cognitive fingerprints that reveal performance patterns across all 12 reasoning domains simultaneously. Unlike aggregate leaderboards, spider plots expose domain-specific strengths, weaknesses, and characteristic performance signatures.

Overview

Spider plots (also called radar charts or web diagrams) display multi-dimensional data as a 2D shape, where each axis represents a different reasoning task. ReasonScape's implementation adds:

Dual-axis visualization: Accuracy webs + token usage bars on the same plot
Three difficulty levels: Easy, Medium, Hard shown simultaneously with color coding
Statistical confidence: Shaded confidence intervals around accuracy lines
Truncation awareness: Purple overlays indicate context window failures
Two visualization modes: Polar spider web or vertical bar chart

Key Features

Cognitive Archetypes: Identify characteristic performance patterns (see Cognitive Archetypes section)
Multi-difficulty comparison: See how model performance degrades across difficulty levels
Token efficiency analysis: Bars show response length for each task/difficulty
Truncation detection: Purple overlays highlight context window issues
Interactive webapp: Toggle between spider and bar visualizations
Export formats: PNG (web-png, bar-png), JSON, Markdown

Quick Start

Launch the interactive visualization server:

python spiderweb.py data/dataset-m12x.json --port 8051
# Open http://127.0.0.1:8051 in your browser

Features:
- Dropdown selector to switch between models
- Toggle between spider web and bar chart views
- Interactive hover tooltips with accuracy, tokens, and truncation
- Dynamic legend explaining all visual elements

Visualization Modes

Spider Web Plot

[Figure: Spider plot example]

Visual Elements:

Radial Grid
12 axes (one per reasoning domain)
Phantom spacing between axes for clarity
Accuracy scale: 0% (center) to 100% (outer edge)
Token scale: 0-4000 tokens (annotations on right)
Accuracy Webs (Lines with markers)
Green = Easy difficulty
Orange = Medium difficulty
Red = Hard difficulty
Solid lines connect accuracy points
Dotted confidence bounds mark the 95% confidence interval
Shaded confidence regions
Token Usage Bars (Radial bars)
Three bars per task (side-by-side)
Bar height = average response tokens
Bar opacity increases with difficulty
Same color scheme as accuracy lines
Truncation Overlays (Purple bars)
Overlay on top of token bars
Height = truncation ratio (0-1)
Indicates context window failures
Only shown when truncation occurs
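
The geometry described above can be sketched in a few lines. This is an illustration only, not the code in spiderweb.py: the function names are hypothetical, the axes are assumed to be evenly spaced (the phantom spacing between axes is ignored), and the accuracy values are made up.

```python
import math

def axis_angles(n_axes):
    """Evenly spaced angles in radians, starting at 12 o'clock and going clockwise."""
    return [math.pi / 2 - 2 * math.pi * i / n_axes for i in range(n_axes)]

def web_vertices(accuracies):
    """Map accuracies in [0, 1] to (x, y) points: 0 is the center, 1 the outer edge."""
    angles = axis_angles(len(accuracies))
    return [(r * math.cos(a), r * math.sin(a)) for r, a in zip(accuracies, angles)]

def token_bar_radius(avg_tokens, max_tokens=4000):
    """Scale a token count onto a 0..1 radius, clipped to the 0-4000 token scale."""
    return min(max(avg_tokens, 0), max_tokens) / max_tokens

# Hypothetical accuracies for the 12 tasks at one difficulty level.
easy = [0.95, 0.90, 0.88, 0.92, 0.85, 0.97, 0.80, 0.91, 0.89, 0.93, 0.87, 0.94]
points = web_vertices(easy)
points.append(points[0])  # repeat the first vertex so the web closes into a polygon
```

Drawing one such polygon per difficulty level, in green, orange, and red, gives the three nested webs; the token bars reuse the same angles with `token_bar_radius` as their length.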

Best for:
- Holistic performance overview
- Identifying imbalanced capabilities
- Spotting characteristic cognitive patterns
- Publication-ready circular visualizations

Bar Chart Plot

Visual Elements:

Horizontal Layout
X-axis: 12 reasoning domains
Y-axis (left): Accuracy (0-1 scale)
Y-axis (right): Tokens (0-4k scale)
Accuracy Lines (Lines with markers)
Same color scheme as spider plot
Error bars show the 95% confidence interval
Solid markers at each task position
Token Usage Bars (Vertical bars)
Three bars per task (left/center/right)
Height = average response tokens
Grouped by difficulty at each task
Truncation Overlays (Purple bars)
Overlay on token bars
Same interpretation as spider plot
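
This page does not state how the 95% confidence intervals are computed. One common choice for a pass/fail accuracy is the Wilson score interval; the sketch below assumes that method, and `wilson_interval` is a hypothetical helper rather than a function from spiderweb.py.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Return (low, high) bounds of the Wilson score interval for a proportion.

    z=1.96 corresponds to a two-sided 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical example: 80 correct answers out of 100 samples.
low, high = wilson_interval(correct=80, total=100)
```

Unlike the simple normal approximation, the Wilson interval stays inside the 0 to 1 range even when accuracy is near 0% or 100%, which keeps error bars from running off the accuracy axis.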

Best for:
- Precise accuracy comparisons
- Reading exact values from axes
- Displays with many tasks (easier to read)
- Side-by-side comparison with other charts

Command Line Reference

Positional Arguments

config: Dataset configuration JSON file (e.g., data/dataset-m12x.json)

Optional Arguments

--port PORT: Server port (default: 8051)
Example: --port 8050
--url-base-pathname PATH: Base URL path (default: /)
Example: --url-base-pathname /spiderweb/
Useful for reverse proxy setups
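
For reference, the interface documented above could be declared with `argparse` roughly as follows. This mirrors only the arguments and defaults listed here; it makes no claim about how spiderweb.py actually parses its command line, and `build_parser` is a hypothetical name.

```python
import argparse

def build_parser():
    """Declare the documented arguments: config, --port, --url-base-pathname."""
    parser = argparse.ArgumentParser(
        prog="spiderweb.py",
        description="Spider plot visualization server",
    )
    parser.add_argument("config", help="Dataset configuration JSON file")
    parser.add_argument("--port", type=int, default=8051, help="Server port")
    parser.add_argument(
        "--url-base-pathname",
        default="/",
        help="Base URL path, e.g. /spiderweb/ behind a reverse proxy",
    )
    return parser

# Parse an example command line instead of sys.argv.
args = build_parser().parse_args(
    ["data/dataset-m12x.json", "--port", "8050", "--url-base-pathname", "/spiderweb/"]
)
```

With these example arguments, the server would be expected at http://127.0.0.1:8050/spiderweb/ rather than at the default root path on port 8051.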

Cognitive Archetypes

Spider plots reveal characteristic performance patterns that identify cognitive archetypes:

1. Balanced Generalist

Pattern: Circular or near-circular shape
Characteristics: Consistent performance across all domains
Interpretation: Broad training, no obvious architectural biases

2. Math Specialist

Pattern: Pronounced peaks in Arithmetic, Dates, Sequence
Characteristics: Strong symbolic/numerical reasoning, weaker language tasks
Interpretation: Domain-specific training or architectural advantages

3. Language Native

Pattern: High on Objects, Movies, Letters, Sort
Characteristics: Strong natural language understanding, weaker pure logic
Interpretation: Chat/dialogue training emphasis

4. Structural Parser

Pattern: Peaks in Brackets, Shapes, Boolean
Characteristics: Strong symbolic parsing, pattern recognition
Interpretation: Syntax awareness from code training

5. State Tracker

Pattern: High on Shuffle, Cars, Objects
Characteristics: Excels at tracking evolving state
Interpretation: Working memory architectural advantages

6. Fragile Specialist

Pattern: Spiky with high variance between domains
Characteristics: Excellent in 2-3 domains, fails others
Interpretation: Limited capacity forces specialization

7. Scaling Failure

Pattern: Progressive inward collapse from Easy → Hard
Characteristics: Performance degrades uniformly as difficulty increases
Interpretation: Insufficient parameters or training

8. Context Limit Victim

Pattern: Purple truncation overlays on specific tasks
Characteristics: Good accuracy until context window exceeded
Interpretation: Architectural limit, not capability limit

Archetype Identification Workflow

Visual inspection: What shape does the spider web make?
Peak identification: Which 3-4 tasks have highest accuracy?
Variance analysis: How much does performance vary between domains?
Difficulty scaling: Does the shape change or shrink across difficulties?
Truncation check: Are purple bars present? On which tasks?
Token patterns: Do token bars correlate with accuracy?
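
The peak, variance, difficulty-scaling, and truncation checks in this workflow can also be done numerically. The sketch below is one possible way to do that; `profile_summary`, its input shapes, and the sample numbers are all hypothetical and are not part of spiderweb.py or its JSON export.

```python
from statistics import mean, pstdev

def profile_summary(acc_by_difficulty, truncation, top_n=3):
    """Summarize one model's spider plot numerically.

    acc_by_difficulty: {"easy" | "medium" | "hard": {task: accuracy in 0..1}}
    truncation: {task: truncation ratio in 0..1}
    """
    easy, hard = acc_by_difficulty["easy"], acc_by_difficulty["hard"]
    # Peak identification: tasks with the highest Easy-level accuracy.
    peaks = sorted(easy, key=easy.get, reverse=True)[:top_n]
    # Variance analysis: spread of Easy-level accuracy across domains.
    spread = pstdev(list(easy.values()))
    # Difficulty scaling: average accuracy drop from Easy to Hard.
    shrinkage = mean(easy[task] - hard[task] for task in easy)
    # Truncation check: tasks where any truncation occurred.
    truncated = sorted(task for task, ratio in truncation.items() if ratio > 0)
    return {"peaks": peaks, "spread": spread,
            "shrinkage": shrinkage, "truncated": truncated}

# Made-up numbers for four of the twelve tasks.
acc = {
    "easy":   {"Arithmetic": 0.9, "Brackets": 0.6, "Shuffle": 0.8, "Movies": 0.7},
    "medium": {"Arithmetic": 0.8, "Brackets": 0.4, "Shuffle": 0.7, "Movies": 0.65},
    "hard":   {"Arithmetic": 0.7, "Brackets": 0.2, "Shuffle": 0.5, "Movies": 0.6},
}
trunc = {"Arithmetic": 0.0, "Brackets": 0.3, "Shuffle": 0.0, "Movies": 0.0}
summary = profile_summary(acc, trunc, top_n=2)
```

A large `spread` points toward a specialist profile, a large `shrinkage` toward a scaling failure, and a non-empty `truncated` list toward a context-limit problem. Visual inspection of the plot is still needed to judge the overall shape.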

Best Practices

Visualization Selection

Use spider plots when:
- Comparing overall cognitive profiles
- Identifying domain-specific strengths/weaknesses
- Spotting characteristic patterns (archetypes)
- Creating publication figures
- Performing an initial model assessment

Use bar charts when:
- Precise numerical values are needed
- There are many tasks (>12) to display
- The audience is unfamiliar with radar charts
- Comparing side by side with other bar charts

Interpretation Guidelines

Shape matters: Circular = balanced, spiky = specialized
Size matters: Larger = better overall performance
Spacing matters: Easy→Hard shrinkage = scaling issues
Color matters: Consistent coloring aids difficulty comparison
Purple matters: Truncation indicates architectural limits, not capability limits
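
The "shape" and "size" guidelines can be made concrete with the area enclosed by the accuracy web. The sketch below assumes evenly spaced axes and uses a hypothetical helper, `web_area_fraction`; it is not a metric that spiderweb.py is documented to report.

```python
def web_area_fraction(accuracies):
    """Area of the accuracy polygon as a fraction of the full (all-100%) polygon.

    With n evenly spaced axes, the triangle between neighboring axes i and j
    has area 0.5 * r_i * r_j * sin(2*pi/n), so the sine term cancels in the ratio.
    """
    n = len(accuracies)
    if n < 3:
        raise ValueError("need at least 3 axes to form a polygon")
    return sum(accuracies[i] * accuracies[(i + 1) % n] for i in range(n)) / n

# Two made-up profiles with the same mean accuracy (0.7).
balanced = [0.7] * 12       # near-circular web
spiky = [1.0, 0.4] * 6      # alternating peaks and dips

balanced_area = web_area_fraction(balanced)
spiky_area = web_area_fraction(spiky)
```

In this example the spiky profile encloses less area than the balanced one even though both average 70%, which is why shape and size should be read together. Note that the enclosed area depends on the order of the axes, so it is only comparable between plots that use the same task ordering.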