ReasonScape Leaderboard (leaderboard.py)

The ReasonScape Leaderboard provides a multi-domain, reasoning-aware LLM performance ranking system with embedded multi-task performance landscape visualization and comprehensive statistical analysis.

[Screenshot: ReasonScape Leaderboard interactive dashboard]

Usage

# Interactive web dashboard
python leaderboard.py data/dataset.json

# Interactive dashboard on a non-standard port
python leaderboard.py data/dataset.json --port 8080

# Generate filtered markdown report
python leaderboard.py data/dataset.json --groups "production" --markdown report.md

Command Line Options

  • config: Configuration JSON file (required)
  • --port PORT: Custom port for web server (default: 8050)
  • --url-base-pathname: Base URL pathname for deployment
  • --groups GROUPS: Comma-separated list of scenario groups to include
  • --markdown MARKDOWN: Write a markdown report to the given file instead of launching the interactive dashboard

Interactive Dashboard Features

Dynamic Filtering

  • Group Filter: Filter models by scenario groups (top, experimental, etc.)
  • Manifold Filter: Show all difficulty levels or filter to specific ranges

Multi-Difficulty Model Display

Each model can display up to three rows, one per difficulty level:

  • Easy: Baseline reasoning tasks
  • Medium: Intermediate complexity scenarios
  • Hard: Advanced reasoning challenges

This multi-row format reveals how models scale with task difficulty and identifies failure modes.

Interactive Rankings with Statistical Confidence

  • Models ranked by ReasonScore with per-task breakdowns and 95% confidence intervals
  • Average Token Usage showing completion efficiency
  • Color-coded performance indicators: Green (top performance) → Red (poor performance)
  • Asterisks mark incomplete evaluations or insufficient statistical data
  • Hover tooltips reveal detailed performance metrics and failure analysis

Performance Landscape Visualization

Novel embedded difficulty manifold plots give an at-a-glance view of each model's performance:

  • Colored squares: Mean accuracy across difficulty ranges (color = performance level)
  • Whiskers: 95% confidence intervals showing statistical reliability
  • Red squares: Truncation issues indicating context limits or reasoning failures
  • Color legend: Task-specific color coding in table header

Token Efficiency Analysis

  • Score-per-token ratios identify models with the best cost-performance trade-off (see the sketch after this list)
  • Average completion length tracking across all tasks
  • Resource utilization summaries for deployment planning
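
The ratio itself is simple arithmetic. A minimal sketch, assuming a model's ReasonScore and average completion length are available (the function and parameter names here are illustrative, not leaderboard.py's actual schema):

def score_per_token(reason_score: float, avg_completion_tokens: float) -> float:
    # Higher values mean more reasoning performance per generated token.
    return reason_score / avg_completion_tokens

# Example: a model scoring 720 with ~1,800-token average completions
print(score_per_token(720, 1800))  # 0.4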

Markdown Report Generation

The --markdown option generates comprehensive static reports including:

Performance Results Table

  • ReasonScore rankings across all models and difficulty levels
  • Per-task accuracy scores with confidence intervals
  • Truncation indicators [-.XX] showing completion reliability issues

Resource Usage Analysis

  • Total token consumption per model
  • Average tokens per completion
  • Test count breakdowns by task category
  • Overall resource utilization summaries

Group-Based Analysis

Use --groups along with --markdown to focus analysis on specific model categories:

# Compare only production models
python leaderboard.py data/dataset.json --groups "production,stable" --markdown experiment_results.md

ReasonScore Calculation

ReasonScore provides a unified performance metric across all tasks and difficulty levels:

ReasonScore = 1000 × Geometric_Mean([
  adjusted_center + adjusted_margin - truncated_ratio
  for each task
])

Where:

  • adjusted_center: Knowledge-adjusted accuracy (performance above random guessing)
  • adjusted_margin: 95% confidence interval half-width (statistical reliability bonus)
  • truncated_ratio: Fraction of responses hitting context limits (reliability penalty)
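
A minimal sketch of this calculation in Python, assuming each task record carries an accuracy, a random-guess (chance) rate, a sample count, and a truncation ratio. The knowledge adjustment and the normal-approximation confidence margin below are illustrative assumptions rather than leaderboard.py's exact implementation; only the outer formula follows the definition above.

from statistics import geometric_mean

def reasonscore(tasks):
    components = []
    for t in tasks:
        # Assumed knowledge adjustment: rescale accuracy so random guessing
        # maps to 0 and perfect accuracy maps to 1.
        adjusted_center = (t["accuracy"] - t["chance"]) / (1.0 - t["chance"])
        # Assumed 95% CI half-width via a normal approximation of the
        # binomial accuracy estimate.
        adjusted_margin = 1.96 * (t["accuracy"] * (1 - t["accuracy"]) / t["n_samples"]) ** 0.5
        # Per-task component from the documented formula; the floor keeps the
        # geometric mean defined if a component is non-positive.
        components.append(max(adjusted_center + adjusted_margin - t["truncated_ratio"], 1e-6))
    return 1000 * geometric_mean(components)

# Example: three tasks of increasing difficulty for a hypothetical model
tasks = [
    {"accuracy": 0.82, "chance": 0.25, "n_samples": 400, "truncated_ratio": 0.02},
    {"accuracy": 0.71, "chance": 0.25, "n_samples": 400, "truncated_ratio": 0.05},
    {"accuracy": 0.64, "chance": 0.25, "n_samples": 400, "truncated_ratio": 0.10},
]
print(round(reasonscore(tasks)))  # roughly 600 for these inputs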

This metric rewards:

  • High accuracy across reasoning domains
  • Statistical confidence with sufficient data
  • Completion reliability without truncation failures
  • Consistency: the geometric mean penalizes models with uneven performance across tasks

The 1000× scaling produces intuitive scores (200-900+ range) rather than decimals.

Understanding ReasonScore

  • 900+: Saturation at this difficulty level (exceptional reasoning across all domains)
  • 700-900: Excellent performance with minor failure modes
  • 500-700: Good reasoning with notable truncation or difficulty scaling issues
  • 300-500: Limited reasoning capability
  • <300: Severe reasoning deficits across most domains

The leaderboard serves as both an interactive exploration tool and a simple reporting system for LLM reasoning capability assessment.