ReasonScape Leaderboard (leaderboard.py)

The ReasonScape Leaderboard ranks LLMs across multiple reasoning domains and presents the results as an interactive, paginated table with heatmap visualization, confidence intervals, and token-usage statistics.

Usage

# Interactive web dashboard
python leaderboard.py data/dataset.json

# Interactive with non-standard port
python leaderboard.py data/dataset.json --port 8080

# Custom URL base pathname for deployment
python leaderboard.py data/dataset.json --url-base-pathname /leaderboard/

Command Line Options

  • config: Configuration JSON file (required)
  • --port PORT: Custom port for web server (default: 8050)
  • --url-base-pathname: Base URL pathname for deployment
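
The options above could be parsed with a few lines of `argparse`. This is a minimal sketch rather than the actual source of leaderboard.py: the `build_parser` helper and the `None` default for `--url-base-pathname` are assumptions, and only the option names and the default port come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical reconstruction of the documented CLI; the real script may differ.
    parser = argparse.ArgumentParser(prog="leaderboard.py")
    parser.add_argument("config", help="Configuration JSON file (required)")
    parser.add_argument("--port", type=int, default=8050,
                        help="Custom port for web server (default: 8050)")
    # The default of None is an assumption; the documentation does not state one.
    parser.add_argument("--url-base-pathname", default=None,
                        help="Base URL pathname for deployment")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(["data/dataset.json", "--port", "8080"])
print(args.config, args.port, args.url_base_pathname)
```

Note that `argparse` exposes `--url-base-pathname` as the attribute `url_base_pathname`.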

Interactive Dashboard Features

Dynamic Filtering

  • Group Filter: Filter models by scenario groups (e.g., "top", "experimental", "production")
  • Manifold Filter: Show all difficulty levels or filter to specific manifolds (e.g., "Easy", "Medium", "Hard")

Pagination

The leaderboard automatically paginates results, displaying 10 models per page with previous/next navigation controls. This enables efficient browsing of large model sets without overwhelming the interface.
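
The pagination described above amounts to slicing the ranked model list. The sketch below assumes zero-based page indices and clamps out-of-range pages; `page_count` and `get_page` are illustrative names, not functions from leaderboard.py.

```python
PAGE_SIZE = 10  # documented page size


def page_count(n_models: int, page_size: int = PAGE_SIZE) -> int:
    # Ceiling division; an empty leaderboard still has one (empty) page.
    return max(1, -(-n_models // page_size))


def get_page(models: list, page: int, page_size: int = PAGE_SIZE) -> list:
    """Return the models shown on a zero-based page, clamping out-of-range pages."""
    last_page = page_count(len(models), page_size) - 1
    page = min(max(page, 0), last_page)
    start = page * page_size
    return models[start:start + page_size]


# Illustrative data: 23 placeholder model names spread over three pages.
models = [f"model-{i}" for i in range(23)]
print(page_count(len(models)), len(get_page(models, 2)))
```

Clamping keeps the previous/next controls safe to call at either end of the list.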

Multi-Difficulty Model Display

Each model is displayed as three rows, one per difficulty level:

  • Easy: Baseline reasoning tasks
  • Medium: Intermediate complexity scenarios
  • Hard: Advanced reasoning challenges

This multi-row format reveals how models scale with task difficulty and identifies failure modes across the difficulty spectrum.

Interactive Rankings with Statistical Confidence

  • Models ranked by ReasonScore with per-task breakdowns and 95% confidence intervals
  • Average Token Usage showing completion efficiency with score/token ratios
  • Asterisks mark incomplete evaluations or insufficient statistical data
  • Top scores highlighted in green per difficulty level
  • Best/worst token efficiency marked in green/red respectively

Heatmap Performance Visualization

Color-coded heatmap cells summarize performance at a glance:

Performance Fill (Left to Right):

  • Cell fills from left to right based on success rate (0-100%)
  • Color Scale (Red → Yellow → Green):
      • Green (>0.9): Exceptional performance
      • Yellow-Green (0.7-0.9): Good performance
      • Orange-Yellow (0.5-0.7): Moderate performance
      • Red-Orange (0.3-0.5): Poor performance
      • Dark Red (<0.3): Severe failure
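
The color bands above can be written as a simple threshold lookup. This sketch is an assumption about how the bands might be applied, not the leaderboard's actual rendering code: the documented ranges leave the exact boundary handling open, so here a value of exactly 0.9 falls into Yellow-Green and exactly 0.3 into Red-Orange. `color_band` and `fill_width_percent` are hypothetical names.

```python
def color_band(success_rate: float) -> str:
    """Map a success rate in [0, 1] to the documented color band."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate > 0.9:
        return "Green"
    if success_rate >= 0.7:   # boundary handling is an assumption
        return "Yellow-Green"
    if success_rate >= 0.5:
        return "Orange-Yellow"
    if success_rate >= 0.3:
        return "Red-Orange"
    return "Dark Red"


def fill_width_percent(success_rate: float) -> float:
    # The cell fills left to right in proportion to the success rate.
    return round(success_rate * 100, 1)


for rate in (0.95, 0.75, 0.55, 0.35, 0.10):
    print(rate, color_band(rate), fill_width_percent(rate))
```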

Truncation Overlay:

  • Cross-hatch pattern overlays the performance fill from left to right
  • Width represents truncation ratio (0-100%)
  • Indicates context limit issues or reasoning failures
  • Higher truncation = wider cross-hatch pattern

Cell Hover Tooltips:

  • Manifold and task labels
  • Success rate with confidence intervals
  • Token usage
  • Truncation ratio

This visualization enables rapid identification of:

  • Task-specific strengths and weaknesses
  • Context limit issues via cross-hatch patterns
  • Difficulty scaling behavior
  • Performance consistency across domains

Token Efficiency Analysis

  • Score/Token ratios displayed beneath average token counts
  • Identifies cost-performance optimal models
  • Best ratio per difficulty highlighted in green
  • Worst ratio per difficulty highlighted in red
  • Resource utilization summaries for deployment planning
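
The score/token highlighting above can be sketched as follows. The row layout (`model`, `score`, `avg_tokens`), the function names, and the example numbers are all illustrative assumptions; only the rule "best ratio green, worst ratio red, per difficulty" comes from the list above.

```python
def score_per_token(score: float, avg_tokens: float) -> float:
    if avg_tokens <= 0:
        raise ValueError("avg_tokens must be positive")
    return score / avg_tokens


def mark_efficiency(rows: list[dict]) -> dict:
    """Return {model: (ratio, highlight)} for the rows of one difficulty level."""
    ratios = {r["model"]: score_per_token(r["score"], r["avg_tokens"]) for r in rows}
    best = max(ratios, key=ratios.get)
    worst = min(ratios, key=ratios.get)
    return {
        model: (ratio, "green" if model == best else "red" if model == worst else None)
        for model, ratio in ratios.items()
    }


# Made-up numbers for a single difficulty level.
rows = [
    {"model": "a", "score": 800, "avg_tokens": 2000},
    {"model": "b", "score": 700, "avg_tokens": 1000},
    {"model": "c", "score": 600, "avg_tokens": 3000},
]
marks = mark_efficiency(rows)
print(marks)
```

In this made-up data, model `b` has the highest ratio even though model `a` has the highest score, which is the kind of cost-performance trade-off the column is meant to surface.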

Fair Sorting Algorithm

The leaderboard uses a multi-pass sorting algorithm so that models evaluated at different numbers of difficulty levels can be compared fairly:

  1. Models with the most complete evaluations (typically 3 difficulty levels) are sorted by overall ReasonScore
  2. Models with fewer difficulty levels are inserted based on comparable score calculations
  3. For comparison, the algorithm computes what the complete models would score using only the same difficulty subset
  4. This ensures models aren't penalized for having fewer evaluations while maintaining meaningful rankings
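
One way to read the steps above is as an insertion procedure. The sketch below is an interpretation, not the leaderboard's actual code: it assumes each model carries one score per difficulty level, that scores over a subset of levels are combined with an arithmetic mean, and that a partial model is placed just ahead of the first ranked model it beats on their shared levels. `fair_sort` and `subset_score` are hypothetical names and the scores are made up.

```python
from statistics import mean


def subset_score(scores: dict, levels) -> float:
    # Assumption: a model's score over a set of levels is the mean of per-level scores.
    return mean(scores[level] for level in levels)


def fair_sort(models: dict) -> list:
    """models maps name -> {difficulty: score}; returns names ranked best first."""
    max_levels = max(len(s) for s in models.values())
    complete = [n for n, s in models.items() if len(s) == max_levels]
    partial = [n for n, s in models.items() if len(s) < max_levels]

    # Pass 1: rank the most complete models by their overall score.
    ranked = sorted(complete, key=lambda n: subset_score(models[n], models[n]),
                    reverse=True)

    # Pass 2: insert partial models, most complete first.
    partial.sort(key=lambda n: len(models[n]), reverse=True)
    for name in partial:
        levels = list(models[name])
        own = subset_score(models[name], levels)
        position = len(ranked)
        for i, other in enumerate(ranked):
            if not all(level in models[other] for level in levels):
                continue  # no shared basis for comparison
            # Re-score the ranked model on the partial model's levels only.
            if subset_score(models[other], levels) < own:
                position = i
                break
        ranked.insert(position, name)
    return ranked


models = {
    "A": {"easy": 900, "medium": 700, "hard": 500},
    "B": {"easy": 800, "medium": 600, "hard": 300},
    "C": {"easy": 850, "medium": 650},  # evaluated on two levels only
}
print(fair_sort(models))
```

Here `C` averages 750 on easy and medium; restricted to those two levels, `A` scores 800 and `B` scores 700, so `C` lands between them.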

ReasonScore Calculation

ReasonScore provides a unified performance metric across all tasks and difficulty levels:

ReasonScore = 1000 × Geometric_Mean([
  adjusted_center + adjusted_margin - truncated_ratio
])

Where:

  • adjusted_center: Knowledge-adjusted accuracy (performance above random guessing)
  • adjusted_margin: 95% confidence interval half-width (statistical reliability bonus)
  • truncated_ratio: Fraction of responses hitting context limits (reliability penalty)

This metric rewards:

  • High accuracy across reasoning domains
  • Statistical confidence with sufficient data
  • Completion reliability without truncation failures
  • Consistency: Geometric mean penalizes outliers and weak performance in any domain

The 1000× scaling produces intuitive scores (200-900+ range) rather than decimals.
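
As a worked illustration of the formula above, the following sketch computes a score from per-task `(adjusted_center, adjusted_margin, truncated_ratio)` triples. It is a sketch under stated assumptions, not the leaderboard's implementation: the `EPS` floor that keeps each term positive for the logarithm is an assumption, the example values are made up, and `reason_score` is a hypothetical name.

```python
import math

EPS = 1e-3  # assumed floor; a geometric mean needs strictly positive terms


def task_term(adjusted_center: float, adjusted_margin: float,
              truncated_ratio: float) -> float:
    return max(adjusted_center + adjusted_margin - truncated_ratio, EPS)


def reason_score(tasks: list[tuple[float, float, float]]) -> float:
    """1000 x geometric mean of the per-task terms."""
    if not tasks:
        raise ValueError("at least one task is required")
    terms = [task_term(*t) for t in tasks]
    geometric_mean = math.exp(sum(math.log(x) for x in terms) / len(terms))
    return 1000 * geometric_mean


# Made-up values: one strong task with some truncation, one weaker task.
tasks = [(0.80, 0.05, 0.05), (0.45, 0.05, 0.00)]
print(round(reason_score(tasks)))
```

The two terms are 0.8 and 0.5, whose geometric mean is about 0.632 (a score of about 632), compared with 650 from an arithmetic mean. The gap shows how the geometric mean weights the weaker task more heavily.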

Understanding ReasonScore

  • 900+: Saturation at this difficulty level (exceptional reasoning across all domains)
  • 700-900: Excellent performance with minor failure modes
  • 500-700: Good reasoning with notable truncation or difficulty scaling issues
  • 300-500: Limited reasoning capability
  • <300: Severe reasoning deficits across most domains

Markdown Report Generation

For static markdown reports, use the separate report.py tool (see report.py documentation):

python report.py data/dataset.json --output report.md
python report.py data/dataset.json --output report.md --groups "production"

The report tool generates comprehensive tables with:

  • Performance results with confidence intervals
  • Resource usage analysis
  • Truncation indicators in [-.XX] format
  • Overall statistics and totals

Deployment Considerations

Standalone Server

python leaderboard.py dataset.json --port 8050
# Access at http://localhost:8050

Reverse Proxy Deployment

python leaderboard.py dataset.json --url-base-pathname /leaderboard/
# Configure nginx/apache to proxy /leaderboard/ to the app

Mobile Responsiveness

The leaderboard includes mobile viewport configuration with:

  • Initial scale: 0.4 (zoomed out for table visibility)
  • Minimum scale: 0.2 (allows further zoom out)
  • User scalable: yes (pinch to zoom enabled)
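
These settings correspond to a standard HTML viewport meta tag. The sketch below only builds the tag as a dictionary; passing it to the app through Dash's `meta_tags` constructor argument is an assumption about how leaderboard.py is wired, and `viewport_meta` is a hypothetical helper.

```python
# Documented viewport settings, in the order listed above.
VIEWPORT = {
    "initial-scale": "0.4",
    "minimum-scale": "0.2",
    "user-scalable": "yes",
}


def viewport_meta(settings: dict) -> dict:
    """Build a meta-tag dictionary of the form {"name": ..., "content": ...}."""
    content = ", ".join(f"{key}={value}" for key, value in settings.items())
    return {"name": "viewport", "content": content}


meta_tag = viewport_meta(VIEWPORT)
print(meta_tag)
# Assumed usage with Dash: dash.Dash(__name__, meta_tags=[meta_tag])
```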

Performance Optimization

The leaderboard uses several optimization strategies:

  • Data caching: Full dataset loaded once at startup and stored in browser
  • Client-side filtering: Group and manifold filters use cached data
  • Lazy pagination: Only renders current page of 10 models
  • HTML/CSS visualization: Heatmap cells use DIV elements instead of plotly for faster rendering
  • Efficient callbacks: Dash callbacks minimize re-computation

These optimizations enable smooth interaction even with datasets containing dozens of models and hundreds of evaluation points.

Integration with Analysis Pipeline

The leaderboard is the primary visualization tool in the ReasonScape analysis pipeline:

# 1. Run evaluation
python runner.py --config configs/m12x.yaml --degree 1 --model your-model

# 2. Process results
python evaluate.py --interview 'results/*/*.ndjson' --output analysis.json

# 3. Launch leaderboard
python leaderboard.py analysis.json

# 4. Optional: Generate markdown report
python report.py analysis.json --output report.md

The leaderboard serves as both an interactive exploration tool and a presentation-ready ranking system for LLM reasoning capability assessment.