# ReasonScape Leaderboard (leaderboard.py)
The ReasonScape Leaderboard provides a multi-domain, reasoning-aware LLM performance ranking system with embedded multi-task performance landscape visualization and comprehensive statistical analysis.
## Usage

```bash
# Interactive web dashboard
python leaderboard.py data/dataset.json

# Interactive with non-standard port
python leaderboard.py data/dataset.json --port 8080

# Generate filtered markdown report
python leaderboard.py data/dataset.json --groups "production" --markdown report.md
```
## Command Line Options

- `config`: Configuration JSON file (required)
- `--port PORT`: Custom port for web server (default: 8050)
- `--url-base-pathname`: Base URL pathname for deployment
- `--groups GROUPS`: Comma-separated list of scenario groups to include
- `--markdown MARKDOWN`: Output markdown report instead of interactive dashboard
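These options map naturally onto a standard `argparse` setup. The sketch below is a hypothetical reconstruction of that wiring, not the actual source of `leaderboard.py`:

```python
import argparse

def parse_args():
    """Parse the CLI options documented above (illustrative reconstruction)."""
    parser = argparse.ArgumentParser(description="ReasonScape Leaderboard")
    parser.add_argument("config", help="Configuration JSON file")
    parser.add_argument("--port", type=int, default=8050, help="Custom port for the web server")
    parser.add_argument("--url-base-pathname", default=None, help="Base URL pathname for deployment")
    parser.add_argument("--groups", default=None, help="Comma-separated list of scenario groups to include")
    parser.add_argument("--markdown", default=None, help="Write a markdown report instead of serving the dashboard")
    return parser.parse_args()
```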
## Interactive Dashboard Features

### Dynamic Filtering
- Group Filter: Filter models by scenario groups (top, experimental, etc.)
- Manifold Filter: Show all difficulty levels or filter to specific ranges
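A minimal sketch of how these two filters could be applied to a flat list of result records, assuming each record carries `group` and `difficulty` fields (assumed names, not necessarily the real dataset schema):

```python
def filter_results(results, groups=None, difficulties=None):
    """Keep only records matching the selected scenario groups and difficulty levels."""
    group_set = set(groups) if groups else None
    difficulty_set = set(difficulties) if difficulties else None
    return [
        r for r in results
        if (group_set is None or r["group"] in group_set)
        and (difficulty_set is None or r["difficulty"] in difficulty_set)
    ]
```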
### Multi-Difficulty Model Display

Models are displayed with up to three rows each, one per difficulty level:
- Easy: Baseline reasoning tasks
- Medium: Intermediate complexity scenarios
- Hard: Advanced reasoning challenges
This multi-row format reveals how models scale with task difficulty and identifies failure modes.
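As a sketch of this multi-row layout, per-model results could be regrouped by difficulty along the following lines (the record fields `model`, `difficulty`, and `score` are assumptions for illustration):

```python
from collections import defaultdict

DIFFICULTY_ORDER = ["easy", "medium", "hard"]

def rows_per_model(results):
    """Produce up to three display rows per model, ordered easy -> medium -> hard."""
    by_model = defaultdict(dict)
    for r in results:
        by_model[r["model"]][r["difficulty"]] = r["score"]
    rows = []
    for model, scores in by_model.items():
        for level in DIFFICULTY_ORDER:
            if level in scores:
                rows.append({"model": model, "difficulty": level, "score": scores[level]})
    return rows
```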
### Interactive Rankings with Statistical Confidence
- Models ranked by ReasonScore with per-task breakdowns and 95% confidence intervals
- Average Token Usage showing completion efficiency
- Color-coded performance indicators: Green (top performance) → Red (poor performance)
- Asterisks mark incomplete evaluations or insufficient statistical data
- Hover tooltips reveal detailed performance metrics and failure analysis
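The confidence intervals can be illustrated with a standard normal-approximation interval for a binomial accuracy estimate; this is a sketch, and ReasonScape's exact statistical procedure may differ:

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """Return (mean accuracy, 95% CI half-width) for `correct` successes in `total` trials."""
    if total == 0:
        return 0.0, 0.0
    p = correct / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width
```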
### Performance Landscape Visualization

Novel embedded difficulty-manifold plots provide an at-a-glance view of performance:
- Colored squares: Mean accuracy across difficulty ranges (color = performance level)
- Whiskers: 95% confidence intervals showing statistical reliability
- Red squares: Truncation issues indicating context limits or reasoning failures
- Color legend: Task-specific color coding in table header
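Conceptually, each embedded plot is a small strip of square markers with error-bar whiskers. The minimal Plotly sketch below illustrates the idea only (it simplifies the coloring to green/red and is not the leaderboard's actual rendering code); `levels`, `means`, `margins`, and `truncated` are assumed to be parallel lists:

```python
import plotly.graph_objects as go

def manifold_strip(levels, means, margins, truncated):
    """Mean accuracy per difficulty level as colored squares with 95% CI whiskers."""
    colors = ["red" if flag else "green" for flag in truncated]  # red marks truncation issues
    fig = go.Figure(go.Scatter(
        x=levels,
        y=means,
        mode="markers",
        marker=dict(symbol="square", size=14, color=colors),
        error_y=dict(type="data", array=margins, visible=True),
    ))
    fig.update_layout(height=120, margin=dict(l=10, r=10, t=10, b=10), showlegend=False)
    return fig
```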
### Token Efficiency Analysis
- Score/Token ratios identify cost-performance optimal models
- Average completion length tracking across all tasks
- Resource utilization summaries for deployment planning
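A score-per-token ratio of this kind can be computed as below; the per-1000-token scaling and parameter names are illustrative assumptions:

```python
def score_per_kilotoken(reason_score, avg_completion_tokens):
    """Higher values indicate a better cost-performance trade-off."""
    if avg_completion_tokens <= 0:
        return 0.0
    return reason_score / (avg_completion_tokens / 1000.0)
```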
## Markdown Report Generation

The `--markdown` option generates comprehensive static reports including:
### Performance Results Table
- ReasonScore rankings across all models and difficulty levels
- Per-task accuracy scores with confidence intervals
- Truncation indicators `[-.XX]` showing completion reliability issues
### Resource Usage Analysis
- Total token consumption per model
- Average tokens per completion
- Test count breakdowns by task category
- Overall resource utilization summaries
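A sketch of how these summaries could be aggregated from raw records, assuming `model`, `task`, and `completion_tokens` fields (assumed names):

```python
from collections import defaultdict

def resource_summary(records):
    """Aggregate total tokens, average tokens per completion, and per-task test counts by model."""
    totals = defaultdict(lambda: {"tokens": 0, "tests": 0, "by_task": defaultdict(int)})
    for r in records:
        entry = totals[r["model"]]
        entry["tokens"] += r["completion_tokens"]
        entry["tests"] += 1
        entry["by_task"][r["task"]] += 1
    return {
        model: {
            "total_tokens": v["tokens"],
            "avg_tokens_per_completion": v["tokens"] / v["tests"],
            "tests_by_task": dict(v["by_task"]),
        }
        for model, v in totals.items()
    }
```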
### Group-Based Analysis

Use `--groups` along with `--markdown` to focus analysis on specific model categories:

```bash
# Compare only the production and stable model groups
python leaderboard.py data/dataset.json --groups "production,stable" --markdown experiment_results.md
```
## ReasonScore Calculation
ReasonScore provides a unified performance metric across all tasks and difficulty levels:
```
ReasonScore = 1000 × Geometric_Mean([
    adjusted_center + adjusted_margin - truncated_ratio
])
```
Where:
- adjusted_center: Knowledge-adjusted accuracy (performance above random guessing)
- adjusted_margin: 95% confidence interval half-width (statistical reliability bonus)
- truncated_ratio: Fraction of responses hitting context limits (reliability penalty)
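A direct transcription of the formula above into Python might look like the following; the per-task input structure and the clamping of non-positive terms are assumptions, not the reference implementation:

```python
import math

def reason_score(task_results):
    """`task_results`: one dict per task with adjusted_center, adjusted_margin, truncated_ratio."""
    terms = []
    for t in task_results:
        term = t["adjusted_center"] + t["adjusted_margin"] - t["truncated_ratio"]
        terms.append(max(term, 1e-6))  # clamp so the geometric mean stays defined
    geometric_mean = math.exp(sum(math.log(x) for x in terms) / len(terms))
    return 1000.0 * geometric_mean
```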
This metric rewards:
- High accuracy across reasoning domains
- Statistical confidence with sufficient data
- Completion reliability without truncation failures
- Consistency: the geometric mean penalizes outliers
The 1000× scaling produces intuitive scores (200-900+ range) rather than decimals.
### Understanding ReasonScore
- 900+: Saturation at this difficulty level (exceptional reasoning across all domains)
- 700-900: Excellent performance with minor failure modes
- 500-700: Good reasoning with notable truncation or difficulty scaling issues
- 300-500: Limited reasoning capability
- <300: Severe reasoning deficits across most domains
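The bands above translate directly into a small lookup helper (illustrative only):

```python
def interpret_reason_score(score):
    """Map a ReasonScore onto the qualitative bands listed above."""
    if score >= 900:
        return "Saturation at this difficulty level"
    if score >= 700:
        return "Excellent performance with minor failure modes"
    if score >= 500:
        return "Good reasoning with notable truncation or scaling issues"
    if score >= 300:
        return "Limited reasoning capability"
    return "Severe reasoning deficits across most domains"
```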
The leaderboard serves as both an interactive exploration tool and a simple reporting system for LLM reasoning capability assessment.