ReasonScape Report Generator (report.py)

The ReasonScape Report Generator produces static markdown reports from evaluation results, providing comprehensive performance and resource usage tables suitable for documentation, publications, and sharing.

Usage

# Generate basic report
python report.py data/dataset.json --output report.md

# Generate filtered report for specific groups
python report.py data/dataset.json --output report.md --groups "production"

# Generate report for multiple groups
python report.py data/dataset.json --output report.md --groups "production,experimental"

Command Line Options

  • config: Configuration JSON file (required)
  • --output, -o: Output markdown file path (required)
  • --groups: Comma-separated list of scenario groups to include (optional)
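
For orientation, the interface above can be expressed as a minimal argparse sketch. The option names mirror the documentation; the actual parser inside report.py may differ.

import argparse

# Minimal sketch of the documented CLI; the real parser in report.py may differ.
parser = argparse.ArgumentParser(description="Generate a static markdown report")
parser.add_argument("config", help="Configuration JSON file")
parser.add_argument("--output", "-o", required=True, help="Output markdown file path")
parser.add_argument("--groups", default=None,
                    help="Comma-separated list of scenario groups to include")
args = parser.parse_args()
groups = args.groups.split(",") if args.groups else None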

Report Structure

The generated markdown report contains three main sections:

1. Performance Results Table

A comprehensive table showing:

  • Model column: Model name with manifold identifier
  • ReasonScore: Overall performance metric (geometric mean × 1000), marked with * if the evaluation is incomplete
  • Task columns: Per-task accuracy ranges with confidence intervals
      ◦ Format: .XX - .YY (conservative to optimistic estimates)
      ◦ Conservative: adjusted_center - adjusted_margin
      ◦ Optimistic: adjusted_center + adjusted_margin
      ◦ Truncation penalty: [-.ZZ] shown when truncation > 2%
      ◦ N/A for missing or incomplete tasks

Example row:

| Qwen3-32B (Medium) | 784 | .87 - .91 [-.03] | .72 - .86 | .84 - .88 | ... |

This shows:

  • Model scored 784 ReasonScore on medium difficulty
  • First task: 87-91% accuracy with 3% truncation penalty
  • Second task: 72-86% accuracy, no significant truncation
  • And so on...
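
The cell notation can be reproduced with a short formatting helper. The sketch below is illustrative only: format_cell and its inputs are hypothetical names, not the actual code in report.py.

# Hypothetical helper mirroring the documented cell notation; not the real implementation.
def format_cell(adjusted_center, adjusted_margin, truncated_ratio):
    conservative = adjusted_center - adjusted_margin
    optimistic = adjusted_center + adjusted_margin
    cell = f"{conservative:.2f} - {optimistic:.2f}".replace("0.", ".")
    if truncated_ratio > 0.02:                       # penalty shown only above 2%
        cell += f" [-{truncated_ratio:.2f}]".replace("0.", ".")
    return cell

print(format_cell(0.89, 0.02, 0.03))                 # ".87 - .91 [-.03]"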

2. Resource Usage Table

A detailed table showing computational costs:

  • Model column: Model name with manifold identifier
  • Total Tokens: Sum of completion tokens across all tasks
  • Avg Tokens/Completion: Mean completion length
  • Total Tests: Count of all test cases executed
  • Task-specific test columns: Tests executed per task

This enables:

  • Cost estimation for different models
  • Resource planning for batch evaluations
  • Efficiency comparisons across models
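
To illustrate how these columns could be derived, the sketch below aggregates hypothetical per-task statistics; the field names are assumptions, not the internal data structures used by report.py.

# Hypothetical per-task usage stats for one model; field names are illustrative only.
task_stats = [
    {"task": "arithmetic", "completion_tokens": 120_000, "tests": 300},
    {"task": "boolean",    "completion_tokens":  95_000, "tests": 300},
]

total_tokens = sum(t["completion_tokens"] for t in task_stats)
total_tests = sum(t["tests"] for t in task_stats)
avg_tokens = total_tokens / total_tests if total_tests else 0.0

print(total_tokens, total_tests, round(avg_tokens, 1))   # 215000 600 358.3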

3. Overall Totals Section

Summary statistics including:

  • Unique Models: Count of distinct model configurations
  • Total Tokens (All Models): Aggregate token consumption
  • Total Tests (All Models): Aggregate test execution count
  • Legend: Explanation of notation and metrics

Group-Based Filtering

The --groups parameter filters results to specific model categories defined in your configuration:

# Production models only
python report.py data/dataset.json --output production.md --groups "production"

# Multiple groups
python report.py data/dataset.json --output subset.md --groups "top,experimental"

Groups are defined in the configuration file's scenarios section:

{
  "scenarios": {
    "model-name": {
      "groups": ["production", "top"]
    }
  }
}

This enables:

  • Focused reports on specific model categories
  • Comparative analysis of model subsets
  • Publication-ready filtered results
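
A minimal sketch of this filtering step, assuming the scenarios structure shown above; the real logic lives in the shared leaderboard pipeline and may differ.

import json

# Keep only scenarios whose "groups" list intersects the requested groups.
# Mirrors the documented behaviour; not the actual implementation.
def filter_scenarios(config_path, requested_groups):
    with open(config_path) as f:
        config = json.load(f)
    requested = set(requested_groups)
    return {
        name: scenario
        for name, scenario in config.get("scenarios", {}).items()
        if requested & set(scenario.get("groups", []))
    }

production_only = filter_scenarios("data/dataset.json", ["production"])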

Data Processing

The report generator reuses the leaderboard's data processing pipeline:

  1. Data Loading: Loads scenario-task data from configuration
  2. Filtering: Applies group filters if specified
  3. Grouping: Organizes by model and manifold
  4. Calculation: Computes ReasonScores, confidence intervals, and resource metrics
  5. Formatting: Generates markdown tables with proper alignment and notation

This ensures consistency between the interactive leaderboard and static reports.

Report Metrics

ReasonScore

ReasonScore = 1000 × Geometric_Mean([
  adjusted_center + adjusted_margin - truncated_ratio
])
  • Same calculation as interactive leaderboard
  • Marked with * when any task is incomplete or has insufficient samples
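
A worked sketch of the calculation, assuming the bracketed term is evaluated once per task (an interpretation of the formula above; the values are illustrative, not real results):

import math

# Per-task terms: adjusted_center + adjusted_margin - truncated_ratio (illustrative values).
task_scores = [
    0.91 - 0.03,   # task 1: optimistic bound .91, 3% truncation penalty
    0.86,          # task 2: optimistic bound .86, no truncation
    0.88,          # task 3: optimistic bound .88, no truncation
]

geometric_mean = math.prod(task_scores) ** (1 / len(task_scores))
reason_score = round(1000 * geometric_mean)
print(reason_score)   # 873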

Confidence Intervals

  • Conservative: adjusted_center - adjusted_margin (lower bound)
  • Optimistic: adjusted_center + adjusted_margin (upper bound)
  • Margin: 95% Wilson confidence interval half-width
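
For reference, a standard 95% Wilson score interval can be computed as below. Whether ReasonScape applies further adjustments on top of it (the "adjusted" prefix suggests it may) is not specified here, so treat this as a sketch rather than the exact implementation.

import math

# Standard Wilson score interval for k successes out of n trials (z = 1.96 for 95%).
def wilson_interval(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin   # (conservative, optimistic)

print(wilson_interval(87, 100))               # approximately (0.79, 0.92)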

Truncation Indicators

  • Shown as [-.XX] when truncated_ratio > 0.02 (2%)
  • Indicates percentage of responses hitting context limits
  • Helps identify reliability issues

Use Cases

Documentation and Sharing

# Generate comprehensive report
python report.py results.json --output README-RESULTS.md

# Include in repository or documentation
git add README-RESULTS.md

Publication Tables

# Filter to published models only
python report.py results.json --output paper-results.md --groups "published"

# Convert to LaTeX or other formats as needed

Model Selection

# Compare production candidates
python report.py results.json --output comparison.md --groups "production,candidates"

# Review token costs and performance tradeoffs

Progress Tracking

# Generate reports at different evaluation stages
python report.py stage1.json --output stage1-report.md
python report.py stage2.json --output stage2-report.md
python report.py final.json --output final-report.md

Integration with Analysis Pipeline

The report generator complements the interactive leaderboard:

# Standard workflow
python runner.py --config configs/m12x.yaml --degree 1 --model your-model
python evaluate.py --interview 'results/*/*.ndjson' --output analysis.json

# Interactive exploration
python leaderboard.py analysis.json

# Static documentation
python report.py analysis.json --output RESULTS.md

Use the leaderboard for:

  • Interactive exploration
  • Real-time filtering
  • Visual pattern detection
  • Detailed tooltips

Use the report generator for:

  • Static documentation
  • Sharing and collaboration
  • Publication-ready tables
  • Version control tracking

Comparison with Leaderboard

| Feature       | Leaderboard            | Report              |
|---------------|------------------------|---------------------|
| Format        | Interactive web app    | Static markdown     |
| Visualization | Heatmap cells          | Text tables         |
| Filtering     | Real-time dropdown     | Command-line groups |
| Pagination    | Yes (10 per page)      | No (all results)    |
| Tooltips      | Yes (detailed metrics) | No (table format)   |
| Deployment    | Requires server        | Files only          |
| Best for      | Exploration            | Documentation       |

Both tools use identical data processing and scoring algorithms, ensuring consistency across formats.