# ReasonScape: Information Processing Evaluation for Large Language Models
ReasonScape is a next-generation evaluation methodology that treats language models as information processing systems rather than text generation black boxes.

ReasonScape reveals cognitive architecture patterns invisible to traditional benchmarks: 3D reasoning landscapes (left), token-frequency spectral analysis (bottom right), and interactive exploration tools (top and middle right) enable systematic comparison of information processing capabilities across models and tasks.
- 🌐 Homepage: https://reasonscape.com/
- 🛠️ GitHub: the-crypt-keeper/reasonscape
- 📊 Live Visualization Tools:
  - M12X Leaderboards: https://reasonscape.com/m12x/leaderboard
  - M12X Explorer: https://reasonscape.com/m12x/explorer (PC required)
- 📁 Raw Experiment Result Data:
  - M12X Dataset: https://reasonscape.com/data/m12x
Keywords: Large language models, AI evaluation, cognitive architectures, spectral analysis, statistical methodology, parametric testing, difficulty manifolds, information processing
## What Makes ReasonScape Different?

### Statistical Rigor
- Excess Accuracy Correction: Remove guessing inflation, enable fair comparison across question formats
- 95% Confidence Intervals: Wilson confidence intervals with truncation awareness
- Dynamic Sample Sizing: Continue sampling until statistical significance or safety limits
- Bias Correction: Handle multiple choice vs binary vs write-in tasks uniformly
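The two core corrections above can be sketched in a few lines. This is a minimal illustration, not ReasonScape's actual implementation; the function names are ours, and we assume a k-way multiple-choice format for the guessing correction.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

def excess_accuracy(raw_accuracy, n_choices):
    """Rescale accuracy so random guessing maps to 0 and perfection to 1.

    A write-in task is the n_choices -> infinity limit (chance -> 0), which
    is what lets multiple-choice, binary, and write-in tasks be compared
    on one scale.
    """
    chance = 1.0 / n_choices
    return (raw_accuracy - chance) / (1.0 - chance)
```

For example, 75% raw accuracy on a 4-way multiple-choice task corresponds to 2/3 excess accuracy, while the same 75% on a binary task is only 50% above chance.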
### Infinite Parametric Testcases
- Deterministic Manifolds: Identical test sequences across runs via coordinate-based seeding
- Response Caching: Never re-execute identical requests, dramatic cost reduction
- Count-Invariant Generation: Smaller samples are perfect subsets of larger ones
- Hierarchical Sampling: Downsample existing results or expand sample sizes seamlessly
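Coordinate-based seeding and count-invariant generation can be illustrated as follows: derive each testcase's seed from (task, difficulty coordinates, sample index) alone, so test i at a given manifold point is identical in every run, and the first k of n samples are a perfect subset of any larger run. This is an illustrative sketch; the hashing scheme and function names are assumptions, not ReasonScape's API.

```python
import hashlib
import random

def seed_for(task, coords, index):
    """Derive a stable seed from (task, difficulty coordinates, sample index)."""
    key = f"{task}|{coords}|{index}".encode()
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def make_tests(task, coords, n):
    """Sample n testcases; the first k are identical for any n >= k."""
    tests = []
    for i in range(n):
        # Each testcase gets its own RNG, so it depends only on its index,
        # never on how many other tests were generated before it.
        rng = random.Random(seed_for(task, coords, i))
        # Illustrative payload: a few random operands at this difficulty point
        tests.append([rng.randint(0, 9) for _ in range(3)])
    return tests
```

Because each testcase owns its RNG, downsampling existing results or expanding the sample size later never invalidates previously generated tests.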
### Token-Frequency Domain Analysis
- Spectral Analysis: FFT analysis of tokenized reasoning problems reveals cognitive architecture patterns
- Population Validation: Verify test populations don't differ in unexpected ways
- Quality Control: Detect systematic biases in test generation or model responses
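The core idea of spectral analysis, treating a tokenized problem as a discrete signal and comparing magnitude spectra across test populations, can be sketched with NumPy. This is a simplified illustration under our own assumptions (mean-removal, linear resampling, mean-absolute-difference divergence); ReasonScape's actual pipeline may differ.

```python
import numpy as np

def token_spectrum(token_ids, n_bins=64):
    """Magnitude spectrum of a token-id sequence, resampled to n_bins."""
    x = np.asarray(token_ids, dtype=float)
    x = x - x.mean()                      # remove the DC offset
    mag = np.abs(np.fft.rfft(x))
    # Resample so sequences of different lengths are comparable
    src = np.linspace(0.0, 1.0, len(mag))
    dst = np.linspace(0.0, 1.0, n_bins)
    return np.interp(dst, src, mag)

def population_divergence(pop_a, pop_b):
    """Mean absolute difference between the average spectra of two populations.

    A large value flags test populations that differ in unexpected ways,
    e.g. a systematic bias in how one batch of tests was generated.
    """
    spec_a = np.mean([token_spectrum(t) for t in pop_a], axis=0)
    spec_b = np.mean([token_spectrum(t) for t in pop_b], axis=0)
    return float(np.mean(np.abs(spec_a - spec_b)))
```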
See Methodology for additional information.
## Multi-Domain Reasoning Assessment
Twelve cognitive domains with thousands of difficulty combinations:
| Domain | Focus | Primary Capabilities | Key Testing Conditions |
|---|---|---|---|
| Arithmetic | Mathematical reasoning | Math, Symbolic Parsing, Structural Analysis | Length scaling, depth nesting, whitespace randomization |
| Boolean | Logical evaluation | Logic, Symbolic Parsing, Structural Analysis | Length scaling, depth nesting, format variation, whitespace randomization |
| Objects | Selective attention | Selective Attention, Semantic Categorization, Language | Length scaling, distraction details, cross-category distractors, multi-category |
| Shuffle | State tracking | State Tracking, Selective Attention, Language | Length scaling, depth nesting, distraction instructions, multi-domain |
| Dates | Temporal reasoning | Math, Pattern Recognition, Temporal Reasoning, Language | Multi-step operations, format variation, multi-domain |
| Movies | Pattern recognition | Pattern Recognition, Semantic Categorization, Language | Length scaling, format variation, multi-domain, multi-category |
| Brackets | Structural parsing | Symbolic Parsing, Pattern Recognition, Structural Analysis | Length scaling, depth nesting, format variation |
| Letters | Character analysis | Math, Selective Attention, Symbolic Parsing, Language | Length scaling, case mutations, cross-category distractors |
| Shapes | Spatial reasoning | Symbolic Parsing, Pattern Recognition, Spatial Reasoning | Format variation, transformations |
| Cars | Logistics planning | State Tracking, Selective Attention, Spatial Reasoning, Language | Length scaling, multi-step operations, distraction instructions |
| Sort | Algorithmic thinking | Symbolic Parsing, Pattern Recognition, Language | Length scaling, case mutations |
| Sequence | Rule-based generation | Math, Logic, Symbolic Parsing, Language | Length scaling, multi-step operations |
See Tasks for further details.
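To make the table's "testing conditions" concrete, here is a toy generator in the spirit of the Arithmetic domain, with length scaling, depth nesting, and whitespace randomization as explicit difficulty knobs. It is a hypothetical sketch for illustration only, not one of ReasonScape's task generators.

```python
import random

def gen_arithmetic(length, depth, random_ws, rng):
    """Build a nested arithmetic expression controlled by difficulty knobs."""
    def expr(d):
        if d == 0:
            return str(rng.randint(1, 9))
        # `length` terms per level, `depth` levels of nesting
        terms = [expr(d - 1) for _ in range(length)]
        body = rng.choice([" + ", " - "]).join(terms)
        return f"({body})" if d < depth else body

    question = expr(depth)
    if random_ws:  # whitespace-randomization condition: pad each space 0-2 extra
        question = "".join(c + " " * rng.randint(0, 2) if c == " " else c
                           for c in question)
    return question, eval(question)  # expression and its ground-truth answer
```

Sweeping `length` and `depth` traces out a 2-D difficulty plane for this task; each extra condition (here, `random_ws`) adds another axis to the manifold.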
## M12X Evaluation Configuration

M12X provides flexible evaluation through three independent parameters, each controlling a different aspect of manifold testing:
Difficulty Level (--degree): Controls the complexity and range of values along the difficulty planes. Higher degrees increase the challenge by expanding parameter ranges and introducing more complex scenarios.
- Degree 0: Easy difficulty across all 12 domains
- Degree 1: Medium difficulty with increased complexity
- Degree 2: Hard difficulty revealing model limitations
Sampling Strategy (--density): Determines which specific points within the difficulty value ranges are actually tested.
- corner: Tests extreme parameter combinations at the edges of the difficulty space
- lowdef: Sparse sampling for quick coverage with minimal computational cost
- normal: Comprehensive sampling providing balanced coverage across the parameter space
Test Generation (--precision): Controls how many individual tests are generated at each sampled point in the parameter space. Higher precision provides better statistical confidence through larger sample sizes.
- Low: Fast evaluation with basic statistical confidence
- Medium: Balanced evaluation with good statistical rigor
- High: Comprehensive evaluation with research-grade precision
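One way to picture how `--density` selects points on a difficulty plane: `corner` keeps only the extremes of each axis, while `lowdef` and `normal` lay down progressively denser grids. The points-per-axis counts and function below are our own illustration, not M12X's actual sampler.

```python
import itertools

def sample_points(axis_ranges, density):
    """Pick grid points on a difficulty plane according to a density strategy."""
    def steps(lo, hi, n):
        if n == 1:
            return [lo]
        # n evenly spaced integer steps from lo to hi inclusive
        return [lo + (hi - lo) * i // (n - 1) for i in range(n)]

    per_axis = {"corner": 2, "lowdef": 3, "normal": 5}[density]
    axes = [steps(lo, hi, per_axis) for lo, hi in axis_ranges]
    return list(itertools.product(*axes))
```

For a plane with a length axis of 1-9 and a depth axis of 0-4, `corner` yields the four extreme combinations, while `normal` tests a 5x5 grid; `--precision` then sets how many testcases are generated at each of those points.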
→ Complete Configuration Guide
## QuickStart: 5-Minute Evaluation

1. Setup:

   ```bash
   git clone https://github.com/the-crypt-keeper/reasonscape.git
   cd reasonscape && pip install -r requirements.txt
   ```

2. Start your LLM (any OpenAI-compatible API):

   ```bash
   # Example with llama.cpp
   ./llama-server --model your-model.gguf --port 3333
   ```

3. Run a quick evaluation (M12X easy mode):

   ```bash
   python runner.py --config configs/m12x.yaml --degree 0 --density normal --precision low \
     --model your-model --apibase http://localhost:3333
   ```

4. Generate analysis and view results:

   ```bash
   python evaluate.py --interview 'results/*/*.ndjson' --output results.json
   python leaderboard.py results.json   # Open http://localhost:8050
   python report.py results.json --output report.md   # Optional: generate markdown report
   ```
## Progressive Evaluation Workflow

ReasonScape enables hierarchical evaluation: start small and scale up.
### Stage 1: Rapid Model Comparison (2-3 hours)

```bash
python runner.py --config configs/m12x.yaml --degree 0 --density normal --precision low
```
### Stage 2: Standard Research Evaluation (8-12 hours)

```bash
python runner.py --config configs/m12x.yaml --degree 1 --density normal --precision medium
```
### Stage 3: Deep Cognitive Analysis (20+ hours)

```bash
python runner.py --config configs/m12x.yaml --degree 2 --density normal --precision high
```
Next steps: See M12X Documentation for comprehensive evaluation workflows.
## Analysis Tools

### Interactive Leaderboard

- ReasonScore rankings across multiple reasoning domains with pagination
- Token efficiency analysis for cost/performance optimization
- Heatmap visualization with color-coded performance cells showing exactly where models break down
- Truncation indicators displayed as rising darkness from the bottom of each cell
- Statistical confidence indicators with 95% confidence intervals
- Group and manifold filtering for focused analysis
### 3D Difficulty Manifold Explorer

- Navigate reasoning landscapes as interactive 3D surfaces
- Multi-panel analysis: FFT spectral analysis, accuracy plots, token distributions
- Line projection analysis for systematic parameter studies
- Cross-model comparison of cognitive architecture patterns
### Comparison Tools
- Surface comparison: Side-by-side 3D manifold analysis across models
- Projection comparison: Multi-model performance across parameter sweeps
- Spectral analysis: Token-frequency domain patterns reveal architectural differences
## Documentation
Start with the basics:
- Methodology: Statistical corrections, progressive evaluation, spectral analysis
- Configuration: Templates, samplers, experiment configs, dataset formats
- Tasks: Parametric test generators, difficulty manifolds, task API
- Tools: Leaderboard, explorer, comparison utilities
Then use the navigation bar on the left side to explore tasks, experiments, and tools in depth!
## Citation
If you use ReasonScape in your research, please cite:
```bibtex
@software{reasonscape2025,
  title={ReasonScape: Information Processing Evaluation for Large Language Models},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape}
}
```
## License
MIT
## Acknowledgments
ReasonScape builds upon insights from BIG-Bench Hard, lm-evaluation-harness, and the broader AI evaluation community.