ReasonScape Report Generator (report.py)

The ReasonScape Report Generator produces static markdown reports from evaluation results, providing comprehensive performance and resource usage tables suitable for documentation, publications, and sharing.

Usage

# Generate basic report
python report.py data/dataset.json --output report.md

# Generate filtered report for specific groups
python report.py data/dataset.json --output report.md --groups "production"

# Generate report for multiple groups
python report.py data/dataset.json --output report.md --groups "production,experimental"

Command Line Options

  • config: Configuration JSON file (required)
  • --output, -o: Output markdown file path (required)
  • --groups: Comma-separated list of scenario groups to include (optional)
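
For orientation, the interface above can be expressed as a minimal argparse sketch. The option names mirror the documentation; the actual parser inside report.py may differ.

import argparse

# Minimal sketch of the documented CLI; the real parser in report.py may differ.
parser = argparse.ArgumentParser(description="Generate a static markdown report")
parser.add_argument("config", help="Configuration JSON file")
parser.add_argument("--output", "-o", required=True, help="Output markdown file path")
parser.add_argument("--groups", default=None,
                    help="Comma-separated list of scenario groups to include")
args = parser.parse_args()
groups = args.groups.split(",") if args.groups else None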

Report Structure

The generated markdown report contains three main sections:

1. Performance Results Table

A comprehensive table showing:

  • Model column: Model name with manifold identifier
  • ReasonScore: Overall performance metric (geometric mean × 1000), marked with * if the evaluation is incomplete
  • Task columns: Per-task accuracy ranges with confidence intervals
      ◦ Format: .XX - .YY (conservative to optimistic estimates)
      ◦ Conservative: adjusted_center - adjusted_margin
      ◦ Optimistic: adjusted_center + adjusted_margin
      ◦ Truncation penalty: [-.ZZ] shown when truncation > 2%
      ◦ N/A for missing or incomplete tasks

Example row:

| Qwen3-32B (Medium) | 784 | .87 - .91 [-.03] | .72 - .86 | .84 - .88 | ... |

This shows:

  • Model scored 784 ReasonScore on medium difficulty
  • First task: 87-91% accuracy with 3% truncation penalty
  • Second task: 72-86% accuracy, no significant truncation
  • And so on...
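
The cell notation can be reproduced with a short formatting helper. The sketch below is illustrative only: format_cell and its inputs are hypothetical names, not the actual code in report.py.

# Hypothetical helper mirroring the documented cell notation; not the real implementation.
def format_cell(adjusted_center, adjusted_margin, truncated_ratio):
    conservative = adjusted_center - adjusted_margin
    optimistic = adjusted_center + adjusted_margin
    cell = f"{conservative:.2f} - {optimistic:.2f}".replace("0.", ".")
    if truncated_ratio > 0.02:                       # penalty shown only above 2%
        cell += f" [-{truncated_ratio:.2f}]".replace("0.", ".")
    return cell

print(format_cell(0.89, 0.02, 0.03))                 # ".87 - .91 [-.03]"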

2. Resource Usage Table

A detailed table showing computational costs:

  • Model column: Model name with manifold identifier
  • Total Tokens: Sum of completion tokens across all tasks
  • Avg Tokens/Completion: Mean completion length
  • Total Tests: Count of all test cases executed
  • Task-specific test columns: Tests executed per task

This enables:

  • Cost estimation for different models
  • Resource planning for batch evaluations
  • Efficiency comparisons across models
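
To illustrate how these columns could be derived, the sketch below aggregates hypothetical per-task statistics; the field names are assumptions, not the internal data structures used by report.py.

# Hypothetical per-task usage stats for one model; field names are illustrative only.
task_stats = [
    {"task": "arithmetic", "completion_tokens": 120_000, "tests": 300},
    {"task": "boolean",    "completion_tokens":  95_000, "tests": 300},
]

total_tokens = sum(t["completion_tokens"] for t in task_stats)
total_tests = sum(t["tests"] for t in task_stats)
avg_tokens = total_tokens / total_tests if total_tests else 0.0

print(total_tokens, total_tests, round(avg_tokens, 1))   # 215000 600 358.3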

3. Overall Totals Section

Summary statistics including:

  • Unique Models: Count of distinct model configurations
  • Total Tokens (All Models): Aggregate token consumption
  • Total Tests (All Models): Aggregate test execution count
  • Legend: Explanation of notation and metrics

Group-Based Filtering

The --groups parameter filters results to specific model categories defined in your configuration:

# Production models only
python report.py data/dataset.json --output production.md --groups "production"

# Multiple groups
python report.py data/dataset.json --output subset.md --groups "top,experimental"

Groups are defined in the configuration file's scenarios section:

{
  "scenarios": {
    "model-name": {
      "groups": ["production", "top"]
    }
  }
}

This enables:

  • Focused reports on specific model categories
  • Comparative analysis of model subsets
  • Publication-ready filtered results
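
A minimal sketch of this filtering step, assuming the scenarios structure shown above; the real logic lives in the shared leaderboard pipeline and may differ.

import json

# Keep only scenarios whose "groups" list intersects the requested groups.
# Mirrors the documented behaviour; not the actual implementation.
def filter_scenarios(config_path, requested_groups):
    with open(config_path) as f:
        config = json.load(f)
    requested = set(requested_groups)
    return {
        name: scenario
        for name, scenario in config.get("scenarios", {}).items()
        if requested & set(scenario.get("groups", []))
    }

production_only = filter_scenarios("data/dataset.json", ["production"])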

Data Processing

The report generator reuses the leaderboard's data processing pipeline:

  1. Data Loading: Loads scenario-task data from configuration
  2. Filtering: Applies group filters if specified
  3. Grouping: Organizes by model and manifold
  4. Calculation: Computes ReasonScores, confidence intervals, and resource metrics
  5. Formatting: Generates markdown tables with proper alignment and notation

This ensures consistency between the interactive leaderboard and static reports.

Report Metrics

ReasonScore

ReasonScore = 1000 × Geometric_Mean([
  adjusted_center + adjusted_margin - truncated_ratio
])
  • Same calculation as interactive leaderboard
  • Marked with * when any task is incomplete or has insufficient samples
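
A worked sketch of the calculation, assuming the bracketed term is evaluated once per task (an interpretation of the formula above; the values are illustrative, not real results):

import math

# Per-task terms: adjusted_center + adjusted_margin - truncated_ratio (illustrative values).
task_scores = [
    0.91 - 0.03,   # task 1: optimistic bound .91, 3% truncation penalty
    0.86,          # task 2: optimistic bound .86, no truncation
    0.88,          # task 3: optimistic bound .88, no truncation
]

geometric_mean = math.prod(task_scores) ** (1 / len(task_scores))
reason_score = round(1000 * geometric_mean)
print(reason_score)   # 873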

Confidence Intervals

  • Conservative: adjusted_center - adjusted_margin (lower bound)
  • Optimistic: adjusted_center + adjusted_margin (upper bound)
  • Margin: 95% Wilson confidence interval half-width
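
For reference, a standard 95% Wilson score interval can be computed as below. Whether ReasonScape applies further adjustments on top of it (the "adjusted" prefix suggests it may) is not specified here, so treat this as a sketch rather than the exact implementation.

import math

# Standard Wilson score interval for k successes out of n trials (z = 1.96 for 95%).
def wilson_interval(k, n, z=1.96):
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - margin, center + margin   # (conservative, optimistic)

print(wilson_interval(87, 100))               # approximately (0.79, 0.92)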

Truncation Indicators

  • Shown as [-.XX] when truncated_ratio > 0.02 (2%)
  • Indicates percentage of responses hitting context limits
  • Helps identify reliability issues

Use Cases

Documentation and Sharing

# Generate comprehensive report
python report.py results.json --output README-RESULTS.md

# Include in repository or documentation
git add README-RESULTS.md

Publication Tables

# Filter to published models only
python report.py results.json --output paper-results.md --groups "published"

# Convert to LaTeX or other formats as needed

Model Selection

# Compare production candidates
python report.py results.json --output comparison.md --groups "production,candidates"

# Review token costs and performance tradeoffs

Progress Tracking

# Generate reports at different evaluation stages
python report.py stage1.json --output stage1-report.md
python report.py stage2.json --output stage2-report.md
python report.py final.json --output final-report.md

Integration with Analysis Pipeline

The report generator complements the interactive leaderboard:

# Standard workflow
python runner.py --config configs/m12x.yaml --degree 1 --model your-model
python evaluate.py --interview 'results/*/*.ndjson' --output analysis.json

# Interactive exploration
python leaderboard.py analysis.json

# Static documentation
python report.py analysis.json --output RESULTS.md

Use the leaderboard for:

  • Interactive exploration
  • Real-time filtering
  • Visual pattern detection
  • Detailed tooltips

Use the report generator for:

  • Static documentation
  • Sharing and collaboration
  • Publication-ready tables
  • Version control tracking

Comparison with Leaderboard

| Feature       | Leaderboard            | Report              |
|---------------|------------------------|---------------------|
| Format        | Interactive web app    | Static markdown     |
| Visualization | Heatmap cells          | Text tables         |
| Filtering     | Real-time dropdown     | Command-line groups |
| Pagination    | Yes (10 per page)      | No (all results)    |
| Tooltips      | Yes (detailed metrics) | No (table format)   |
| Deployment    | Requires server        | Files only          |
| Best for      | Exploration            | Documentation       |

Both tools use identical data processing and scoring algorithms, ensuring consistency across formats.