# ReasonScape Report Generator (report.py)
The ReasonScape Report Generator produces static markdown reports from evaluation results, providing comprehensive performance and resource usage tables suitable for documentation, publications, and sharing.
## Usage

```bash
# Generate basic report
python report.py data/dataset.json --output report.md

# Generate filtered report for specific groups
python report.py data/dataset.json --output report.md --groups "production"

# Generate report for multiple groups
python report.py data/dataset.json --output report.md --groups "production,experimental"
```
## Command Line Options

- `config`: Configuration JSON file (required)
- `--output`, `-o`: Output markdown file path (required)
- `--groups`: Comma-separated list of scenario groups to include (optional)
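A minimal sketch of this command-line surface using `argparse`; the option names mirror the list above, but the parsing details are assumptions rather than report.py's verified implementation:

```python
import argparse

def parse_args(argv=None):
    """Parse the CLI options listed above (illustrative sketch, not report.py's actual code)."""
    parser = argparse.ArgumentParser(
        description="Generate a static markdown report from evaluation results.")
    parser.add_argument("config", help="Configuration JSON file")
    parser.add_argument("--output", "-o", required=True, help="Output markdown file path")
    parser.add_argument("--groups", default=None,
                        help="Comma-separated list of scenario groups to include")
    args = parser.parse_args(argv)
    # Normalize "top,experimental" into ["top", "experimental"]; None means no filtering.
    args.groups = [g.strip() for g in args.groups.split(",")] if args.groups else None
    return args

if __name__ == "__main__":
    print(parse_args(["data/dataset.json", "--output", "report.md", "--groups", "production,top"]))
```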
## Report Structure
The generated markdown report contains three main sections:
### 1. Performance Results Table

A comprehensive table showing:

- Model column: Model name with manifold identifier
- ReasonScore: Overall performance metric (geometric mean × 1000)
    - Marked with `*` if evaluation is incomplete
- Task columns: Per-task accuracy ranges with confidence intervals
    - Format: `.XX - .YY` (conservative to optimistic estimates)
    - Conservative: `adjusted_center - adjusted_margin`
    - Optimistic: `adjusted_center + adjusted_margin`
    - Truncation penalty: `[-.ZZ]` shown when truncation > 2%
    - `N/A` for missing or incomplete tasks
Example row:

```
| Qwen3-32B (Medium) | 784 | .87 - .91 [-.03] | .72 - .86 | .84 - .88 | ... |
```

This shows:

- The model scored a ReasonScore of 784 at medium difficulty
- First task: 87-91% accuracy with a 3% truncation penalty
- Second task: 72-86% accuracy, no significant truncation
- And so on for the remaining tasks
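To make the cell notation concrete, here is a small illustrative sketch of how one task cell could be rendered from `adjusted_center`, `adjusted_margin`, and `truncated_ratio`; the function name and rounding are assumptions for illustration, not report.py's actual code:

```python
def format_task_cell(adjusted_center, adjusted_margin, truncated_ratio):
    """Render a '.XX - .YY [-.ZZ]' task cell (illustrative sketch, not report.py's code)."""
    if adjusted_center is None:
        return "N/A"  # missing or incomplete task
    conservative = adjusted_center - adjusted_margin   # lower bound
    optimistic = adjusted_center + adjusted_margin     # upper bound
    cell = f"{conservative:.2f} - {optimistic:.2f}".replace("0.", ".")
    if truncated_ratio > 0.02:                         # flag truncation above 2%
        cell += f" [-{truncated_ratio:.2f}]".replace("0.", ".")
    return cell

print(format_task_cell(0.89, 0.02, 0.03))  # -> .87 - .91 [-.03]
print(format_task_cell(0.79, 0.07, 0.01))  # -> .72 - .86
```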
### 2. Resource Usage Table

A detailed table showing computational costs:

- Model column: Model name with manifold identifier
- Total Tokens: Sum of completion tokens across all tasks
- Avg Tokens/Completion: Mean completion length
- Total Tests: Count of all test cases executed
- Task-specific test columns: Tests executed per task

This enables:

- Cost estimation for different models
- Resource planning for batch evaluations
- Efficiency comparisons across models
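As a rough sketch of the aggregation behind this table, assuming each task result carries a completion-token count and a test count (the field names below are illustrative, not report.py's actual schema):

```python
def summarize_resources(task_results):
    """Aggregate token and test counts across tasks for one model (illustrative sketch)."""
    total_tokens = sum(t["completion_tokens"] for t in task_results)
    total_tests = sum(t["test_count"] for t in task_results)
    return {
        "total_tokens": total_tokens,
        "avg_tokens_per_completion": round(total_tokens / total_tests, 1) if total_tests else 0.0,
        "total_tests": total_tests,
        "tests_per_task": {t["task"]: t["test_count"] for t in task_results},
    }

print(summarize_resources([
    {"task": "arithmetic", "completion_tokens": 120_000, "test_count": 400},
    {"task": "boolean", "completion_tokens": 95_000, "test_count": 400},
]))
```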
### 3. Overall Totals Section

Summary statistics including:

- Unique Models: Count of distinct model configurations
- Total Tokens (All Models): Aggregate token consumption
- Total Tests (All Models): Aggregate test execution count
- Legend: Explanation of notation and metrics
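These totals are simple roll-ups of the per-model summaries; a minimal sketch, reusing the illustrative field names from the resource-usage example above:

```python
def overall_totals(model_summaries):
    """Roll per-model resource summaries up into the overall totals (illustrative sketch)."""
    return {
        "unique_models": len(model_summaries),
        "total_tokens_all_models": sum(m["total_tokens"] for m in model_summaries),
        "total_tests_all_models": sum(m["total_tests"] for m in model_summaries),
    }

print(overall_totals([
    {"total_tokens": 215_000, "total_tests": 800},
    {"total_tokens": 340_000, "total_tests": 800},
]))
```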
## Group-Based Filtering

The `--groups` parameter filters results to specific model categories defined in your configuration:

```bash
# Production models only
python report.py data/dataset.json --output production.md --groups "production"

# Multiple groups
python report.py data/dataset.json --output subset.md --groups "top,experimental"
```
Groups are defined in the configuration file's `scenarios` section:

```json
{
  "scenarios": {
    "model-name": {
      "groups": ["production", "top"]
    }
  }
}
```

This enables:

- Focused reports on specific model categories
- Comparative analysis of model subsets
- Publication-ready filtered results
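A minimal sketch of how such a filter could be applied to the `scenarios` structure above. The matching rule shown (keep a scenario if it belongs to any requested group) is an assumption for illustration, not confirmed report.py behavior:

```python
def filter_scenarios(config, groups=None):
    """Keep scenarios that belong to at least one requested group (illustrative sketch)."""
    scenarios = config.get("scenarios", {})
    if not groups:
        return scenarios  # no filter: include everything
    wanted = set(groups)
    return {
        name: scenario
        for name, scenario in scenarios.items()
        if wanted & set(scenario.get("groups", []))
    }

config = {"scenarios": {"model-a": {"groups": ["production", "top"]},
                        "model-b": {"groups": ["experimental"]}}}
print(filter_scenarios(config, ["production"]))  # only model-a is kept
```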
## Data Processing
The report generator reuses the leaderboard's data processing pipeline:
- Data Loading: Loads scenario-task data from configuration
- Filtering: Applies group filters if specified
- Grouping: Organizes by model and manifold
- Calculation: Computes ReasonScores, confidence intervals, and resource metrics
- Formatting: Generates markdown tables with proper alignment and notation
This ensures consistency between interactive leaderboard and static reports.
## Report Metrics

### ReasonScore

```
ReasonScore = 1000 × Geometric_Mean([
    adjusted_center + adjusted_margin - truncated_ratio
])
```

- Same calculation as the interactive leaderboard
- Marked with `*` when any task is incomplete or has insufficient samples
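Read literally, the formula takes one `adjusted_center + adjusted_margin - truncated_ratio` term per task and geometric-means them. A sketch of that calculation, assuming one `(adjusted_center, adjusted_margin, truncated_ratio)` triple per task (not report.py's exact code):

```python
import math

def reason_score(tasks):
    """1000 × geometric mean of per-task (center + margin - truncated_ratio) scores (sketch)."""
    scores = [
        max(center + margin - truncated, 1e-9)  # clamping to avoid log(0) is an assumption
        for center, margin, truncated in tasks
    ]
    geo_mean = math.exp(sum(math.log(s) for s in scores) / len(scores))
    return 1000 * geo_mean

# Three illustrative tasks (values chosen for demonstration only)
print(round(reason_score([(0.89, 0.02, 0.03), (0.79, 0.07, 0.0), (0.86, 0.02, 0.0)])))
```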
### Confidence Intervals

- Conservative: `adjusted_center - adjusted_margin` (lower bound)
- Optimistic: `adjusted_center + adjusted_margin` (upper bound)
- Margin: 95% Wilson confidence interval half-width
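For reference, the half-width of a 95% Wilson score interval for k correct answers out of n attempts can be computed as below. This is the standard Wilson formula shown as a sketch; the evaluator's `adjusted_center` and `adjusted_margin` may include additional adjustments not shown here:

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Return (center, margin): Wilson score interval midpoint and half-width at ~95%."""
    if trials == 0:
        return 0.0, 0.0
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return center, margin

center, margin = wilson_interval(87, 100)
print(f"{center - margin:.2f} - {center + margin:.2f}")  # conservative - optimistic range
```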
### Truncation Indicators

- Shown as `[-.XX]` when `truncated_ratio > 0.02` (2%)
- Indicates percentage of responses hitting context limits
- Helps identify reliability issues
## Use Cases

### Documentation and Sharing

```bash
# Generate comprehensive report
python report.py results.json --output README-RESULTS.md

# Include in repository or documentation
git add README-RESULTS.md
```
### Publication Tables

```bash
# Filter to published models only
python report.py results.json --output paper-results.md --groups "published"

# Convert to LaTeX or other formats as needed
```
### Model Selection

```bash
# Compare production candidates
python report.py results.json --output comparison.md --groups "production,candidates"

# Review token costs and performance tradeoffs
```
### Progress Tracking

```bash
# Generate reports at different evaluation stages
python report.py stage1.json --output stage1-report.md
python report.py stage2.json --output stage2-report.md
python report.py final.json --output final-report.md
```
## Integration with Analysis Pipeline

The report generator complements the interactive leaderboard:

```bash
# Standard workflow
python runner.py --config configs/m12x.yaml --degree 1 --model your-model
python evaluate.py --interview 'results/*/*.ndjson' --output analysis.json

# Interactive exploration
python leaderboard.py analysis.json

# Static documentation
python report.py analysis.json --output RESULTS.md
```
Use the leaderboard for:

- Interactive exploration
- Real-time filtering
- Visual pattern detection
- Detailed tooltips

Use the report generator for:

- Static documentation
- Sharing and collaboration
- Publication-ready tables
- Version control tracking
## Comparison with Leaderboard
| Feature | Leaderboard | Report |
|---|---|---|
| Format | Interactive web app | Static markdown |
| Visualization | Heatmap cells | Text tables |
| Filtering | Real-time dropdown | Command-line groups |
| Pagination | Yes (10 per page) | No (all results) |
| Tooltips | Yes (detailed metrics) | No (table format) |
| Deployment | Requires server | Files only |
| Best for | Exploration | Documentation |
Both tools use identical data processing and scoring algorithms, ensuring consistency across formats.