ReasonScape Leaderboard (leaderboard.py)

The ReasonScape Leaderboard provides a multi-domain, reasoning-aware LLM performance ranking system with embedded multi-task performance landscape visualization and comprehensive statistical analysis.

[Screenshot: ReasonScape Leaderboard interactive dashboard]

Usage

# Interactive web dashboard
python leaderboard.py data/dataset.json

# Interactive dashboard on a non-standard port
python leaderboard.py data/dataset.json --port 8080

# Generate filtered markdown report
python leaderboard.py data/dataset.json --groups "production" --markdown report.md

Command Line Options

  • config: Configuration JSON file (required)
  • --port PORT: Custom port for web server (default: 8050)
  • --url-base-pathname: Base URL pathname for deployment
  • --groups GROUPS: Comma-separated list of scenario groups to include
  • --markdown MARKDOWN: Write a markdown report to the given file instead of launching the interactive dashboard

Interactive Dashboard Features

Dynamic Filtering

  • Group Filter: Filter models by scenario groups (top, experimental, etc.)
  • Manifold Filter: Show all difficulty levels or filter to specific ranges

Multi-Difficulty Model Display

Each model can display up to three rows, one per difficulty level:

  • Easy: Baseline reasoning tasks
  • Medium: Intermediate complexity scenarios
  • Hard: Advanced reasoning challenges

This multi-row format reveals how models scale with task difficulty and identifies failure modes.

Interactive Rankings with Statistical Confidence

  • Models ranked by ReasonScore with per-task breakdowns and 95% confidence intervals
  • Average Token Usage showing completion efficiency
  • Color-coded performance indicators: Green (top performance) → Red (poor performance)
  • Asterisks mark incomplete evaluations or insufficient statistical data
  • Hover tooltips reveal detailed performance metrics and failure analysis

Performance Landscape Visualization

Novel embedded difficulty manifold plots give an at-a-glance view of each model's performance:

  • Colored squares: Mean accuracy across difficulty ranges (color = performance level)
  • Whiskers: 95% confidence intervals showing statistical reliability
  • Red squares: Truncation issues indicating context limits or reasoning failures
  • Color legend: Task-specific color coding in table header

Token Efficiency Analysis

  • Score-per-token ratios identify models with the best cost-performance trade-off (see the sketch after this list)
  • Average completion length tracking across all tasks
  • Resource utilization summaries for deployment planning
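
The ratio itself is simple arithmetic. A minimal sketch, assuming a model's ReasonScore and average completion length are available (the function and parameter names here are illustrative, not leaderboard.py's actual schema):

def score_per_token(reason_score: float, avg_completion_tokens: float) -> float:
    # Higher values mean more reasoning performance per generated token.
    return reason_score / avg_completion_tokens

# Example: a model scoring 720 with ~1,800-token average completions
print(score_per_token(720, 1800))  # 0.4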

Markdown Report Generation

The --markdown option generates comprehensive static reports including:

Performance Results Table

  • ReasonScore rankings across all models and difficulty levels
  • Per-task accuracy scores with confidence intervals
  • Truncation indicators [-.XX] showing completion reliability issues

Resource Usage Analysis

  • Total token consumption per model
  • Average tokens per completion
  • Test count breakdowns by task category
  • Overall resource utilization summaries

Group-Based Analysis

Use --groups along with --markdown to focus analysis on specific model categories:

# Compare only production models
python leaderboard.py data/dataset.json --groups "production,stable" --markdown experiment_results.md

ReasonScore Calculation

ReasonScore provides a unified performance metric across all tasks and difficulty levels:

ReasonScore = 1000 × Geometric_Mean([
  adjusted_center + adjusted_margin - truncated_ratio
  for each task
])

Where:

  • adjusted_center: Knowledge-adjusted accuracy (performance above random guessing)
  • adjusted_margin: 95% confidence interval half-width (statistical reliability bonus)
  • truncated_ratio: Fraction of responses hitting context limits (reliability penalty)
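
A minimal sketch of this calculation in Python, assuming each task record carries an accuracy, a random-guess (chance) rate, a sample count, and a truncation ratio. The knowledge adjustment and the normal-approximation confidence margin below are illustrative assumptions rather than leaderboard.py's exact implementation; only the outer formula follows the definition above.

from statistics import geometric_mean

def reasonscore(tasks):
    components = []
    for t in tasks:
        # Assumed knowledge adjustment: rescale accuracy so random guessing
        # maps to 0 and perfect accuracy maps to 1.
        adjusted_center = (t["accuracy"] - t["chance"]) / (1.0 - t["chance"])
        # Assumed 95% CI half-width via a normal approximation of the
        # binomial accuracy estimate.
        adjusted_margin = 1.96 * (t["accuracy"] * (1 - t["accuracy"]) / t["n_samples"]) ** 0.5
        # Per-task component from the documented formula; the floor keeps the
        # geometric mean defined if a component is non-positive.
        components.append(max(adjusted_center + adjusted_margin - t["truncated_ratio"], 1e-6))
    return 1000 * geometric_mean(components)

# Example: three tasks of increasing difficulty for a hypothetical model
tasks = [
    {"accuracy": 0.82, "chance": 0.25, "n_samples": 400, "truncated_ratio": 0.02},
    {"accuracy": 0.71, "chance": 0.25, "n_samples": 400, "truncated_ratio": 0.05},
    {"accuracy": 0.64, "chance": 0.25, "n_samples": 400, "truncated_ratio": 0.10},
]
print(round(reasonscore(tasks)))  # roughly 600 for these inputs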

This metric rewards:

  • High accuracy across reasoning domains
  • Statistical confidence with sufficient data
  • Completion reliability without truncation failures
  • Consistency: the geometric mean penalizes models with uneven performance across tasks

The 1000× scaling produces intuitive scores (200-900+ range) rather than decimals.

Understanding ReasonScore

  • 900+: Saturation at this difficulty level (exceptional reasoning across all domains)
  • 700-900: Excellent performance with minor failure modes
  • 500-700: Good reasoning with notable truncation or difficulty scaling issues
  • 300-500: Limited reasoning capability
  • <300: Severe reasoning deficits across most domains

The leaderboard serves as both an interactive exploration tool and a simple reporting system for LLM reasoning capability assessment.