ReasonScape: Information Processing Evaluation for Large Language Models¶
ReasonScape is a next-generation evaluation methodology that treats language models as information processing systems rather than text generation black boxes.
ReasonScape reveals cognitive architecture patterns invisible to traditional benchmarks: 3D reasoning landscapes (left), token-frequency spectral analysis (bottom right), and interactive exploration tools (top and middle right) enable systematic comparison of information processing capabilities across models and tasks.
🌐 Homepage: https://reasonscape.com/
🛠️ GitHub: the-crypt-keeper/reasonscape
📊 Live Visualization Tools:

- M6 Leaderboards: https://reasonscape.com/m6/leaderboard
- M6 Explorer: https://reasonscape.com/m6/explorer (PC required)
- C2 Leaderboard: https://reasonscape.com/c2/leaderboard (Legacy)
- C2 Explorer: https://reasonscape.com/c2/explorer (Legacy, PC required)

📁 Raw Experiment Result Data:

- M6 Dataset: https://reasonscape.com/data/m6
- C2 Dataset: https://reasonscape.com/data/c2 (Legacy)
Keywords: Large language models, AI evaluation, cognitive architectures, spectral analysis, statistical methodology, parametric testing, difficulty manifolds, information processing
What Makes ReasonScape Different?¶
Statistical Rigor¶
- Excess Accuracy Correction: Remove guessing inflation, enable fair comparison across question formats
- 95% Confidence Intervals: Wilson confidence intervals with truncation awareness (both corrections are sketched below)
- Dynamic Sample Sizing: Continue sampling until statistical significance or safety limits
- Bias Correction: Handle multiple choice vs binary vs write-in tasks uniformly
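
A minimal sketch of the two core corrections, assuming standard guess-rate adjustment and Wilson score intervals (the truncation-awareness piece is omitted); this is illustrative, not ReasonScape's actual implementation:

```python
import math

def excess_accuracy(correct: int, total: int, guess_rate: float) -> float:
    """Accuracy above chance, rescaled to [0, 1].

    guess_rate is the expected accuracy of blind guessing: e.g. 0.25 for
    4-option multiple choice, 0.5 for binary, ~0.0 for write-in answers.
    """
    raw = correct / total
    return max(0.0, (raw - guess_rate) / (1.0 - guess_rate))

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (max(0.0, center - half), min(1.0, center + half))

if __name__ == "__main__":
    # 70/100 correct on 4-option multiple choice: chance explains 25 points of it.
    print(excess_accuracy(70, 100, guess_rate=0.25))  # 0.6
    print(wilson_interval(70, 100))                   # approx (0.604, 0.781)
```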
Infinite Parametric Testcases¶
- Deterministic Manifolds: Identical test sequences across runs via coordinate-based seeding (see the sketch below)
- Response Caching: Never re-execute identical requests, dramatic cost reduction
- Count-Invariant Generation: Smaller samples are perfect subsets of larger ones
- Hierarchical Sampling: Downsample existing results or expand sample sizes seamlessly
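
A minimal sketch of how coordinate-based seeding yields deterministic, count-invariant generation; `seed_for`, `generate`, and the arithmetic-style parameters are illustrative stand-ins, not ReasonScape's task API:

```python
import hashlib
import random

def seed_for(task: str, coords: tuple) -> int:
    """Derive a stable seed from the task name and difficulty coordinate."""
    key = f"{task}:{coords}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def generate(task: str, coords: tuple, n: int) -> list:
    """Generate n test cases at one manifold coordinate.

    The RNG is seeded only by (task, coords), so the sequence is identical
    across runs, and generate(..., 10) is a strict prefix of
    generate(..., 100): smaller samples are perfect subsets of larger ones.
    """
    rng = random.Random(seed_for(task, coords))
    terms, max_operand = coords
    cases = []
    for _ in range(n):
        # Build a nested arithmetic expression whose size is set by the coordinate.
        expr = str(rng.randint(0, max_operand))
        for _ in range(terms - 1):
            expr = f"({expr} {rng.choice('+-*')} {rng.randint(0, max_operand)})"
        cases.append(expr)
    return cases

if __name__ == "__main__":
    small = generate("arithmetic", (4, 99), 10)
    large = generate("arithmetic", (4, 99), 100)
    assert large[:10] == small   # count-invariant: downsampling is free
    print(small[0])              # identical on every run
```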
Token-Frequency Domain Analysis¶
- Spectral Analysis: FFT analysis of tokenized reasoning problems reveals cognitive architecture patterns (see the sketch below)
- Population Validation: Verify test populations don't differ in unexpected ways
- Quality Control: Detect systematic biases in test generation or model responses
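
A minimal sketch of the spectral idea: treat each tokenized prompt as a signal, average the power spectra over a test population, and compare populations to catch unintended differences. Byte values stand in for real token IDs, and the window length is an arbitrary assumption:

```python
import numpy as np

def mean_power_spectrum(prompts: list, n_fft: int = 128) -> np.ndarray:
    """Average power spectrum of a population of prompts."""
    spectra = []
    for text in prompts:
        # In practice the target model's own tokenizer would produce the IDs;
        # raw byte values are used here to keep the sketch self-contained.
        ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8).astype(float)
        ids -= ids.mean()                              # remove DC offset
        if len(ids) < n_fft:                           # zero-pad or truncate
            ids = np.pad(ids, (0, n_fft - len(ids)))
        else:
            ids = ids[:n_fft]
        spectra.append(np.abs(np.fft.rfft(ids)) ** 2)  # power spectrum
    return np.mean(spectra, axis=0)

if __name__ == "__main__":
    # Two prompt populations that look equally "hard" can still differ
    # spectrally; comparing their mean spectra flags systematic generation biases.
    pop_a = [f"({i} + {i + 1}) * {i + 2} = ?" for i in range(50)]
    pop_b = [f"{i} + {i + 1} * {i + 2} = ?" for i in range(50)]
    delta = np.abs(mean_power_spectrum(pop_a) - mean_power_spectrum(pop_b))
    print(f"max spectral difference: {delta.max():.1f}")
```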
Multi-Domain Reasoning Assessment¶
Six cognitive domains with thousands of difficulty combinations:
| Domain | Focus | Key Challenges |
|---|---|---|
| Arithmetic | Mathematical reasoning | Operator precedence, nested parentheses, working memory |
| Boolean | Logical evaluation | Multiple notations, negation chains, operator precedence |
| Objects | Selective attention | Semantic categorization, distractor resistance, quantity aggregation |
| Shuffle | State tracking | Sequential transformations, confounding information filtering |
| Dates | Temporal reasoning | Calendar arithmetic, format recognition, multi-step inference |
| Movies | Pattern recognition | Thematic similarity, cultural knowledge, preference modeling |
Which ReasonScape Suite?¶
```mermaid
flowchart TD
    F{How much time/compute?}
    F -->|2-3 hours| G[M6 Degree 0<br/>Quick comparison]
    F -->|8-12 hours| H[M6 Degree 1<br/>Standard evaluation]
    F -->|20+ hours| I[M6 Degree 2<br/>Research grade]
```
→ Complete Suite Comparison Guide
QuickStart: 5-Minute Evaluation¶
- Setup:

  ```bash
  git clone https://github.com/the-crypt-keeper/reasonscape.git
  cd reasonscape && pip install -r requirements.txt
  ```

- Start your LLM (any OpenAI-compatible API):

  ```bash
  # Example with llama.cpp
  ./llama-server --model your-model.gguf --port 3333
  ```

- Run a quick evaluation (M6 easy mode, ~18M tokens):

  ```bash
  python runner.py --config configs/m6.yaml --degree 0 --precision low \
    --model your-model --apibase http://localhost:3333
  ```

- Generate analysis and launch leaderboard:

  ```bash
  python evaluate.py --interview 'results/*/*.ndjson' --output results.json
  python leaderboard.py results.json
  # Open http://localhost:8050
  ```
Progressive Evaluation Workflow¶
ReasonScape enables hierarchical evaluation, so you can start small and scale up:
Stage 1: Rapid Model Comparison (2-3 hours)¶
```bash
python runner.py --config configs/m6.yaml --degree 0 --precision low --density normal
```
Stage 2: Standard Research Evaluation (8-12 hours)¶
```bash
python runner.py --config configs/m6.yaml --degree 1 --precision medium --density normal
```
- Publication-ready statistical rigor
Stage 3: Deep Cognitive Analysis (20+ hours)¶
```bash
python runner.py --config configs/m6.yaml --degree 2 --precision high --density normal
```
Next steps: See M6 Documentation for comprehensive evaluation workflows.
Analysis Tools¶
Interactive Leaderboard¶
- ReasonScore rankings across multiple reasoning domains
- Token efficiency analysis for cost/performance optimization
- Embedded difficulty manifold visualization showing exactly where models break down
- Statistical confidence indicators with 95% confidence intervals
3D Difficulty Manifold Explorer¶
- Navigate reasoning landscapes as interactive 3D surfaces (see the sketch after this list)
- Multi-panel analysis: FFT spectral analysis, accuracy plots, token distributions
- Line projection analysis for systematic parameter studies
- Cross-model comparison of cognitive architecture patterns
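
A rough sketch of how a difficulty manifold can be rendered as a 3D surface (accuracy over a 2D grid of difficulty parameters), using plotly and synthetic data; the axis names and decay model are invented for illustration, while the bundled Explorer renders real results interactively:

```python
import numpy as np
import plotly.graph_objects as go

# Hypothetical difficulty axes for an arithmetic-style task.
depths = np.arange(1, 9)           # nesting depth
lengths = np.arange(2, 17, 2)      # number of terms

# Synthetic accuracy surface: performance decays as both axes grow.
D, L = np.meshgrid(depths, lengths)
accuracy = np.clip(1.05 - 0.06 * D - 0.035 * L, 0.0, 1.0)

fig = go.Figure(go.Surface(x=depths, y=lengths, z=accuracy, colorscale="Viridis"))
fig.update_layout(
    title="Illustrative difficulty manifold",
    scene=dict(xaxis_title="nesting depth",
               yaxis_title="term count",
               zaxis_title="accuracy"),
)
fig.show()
```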
Comparison Tools¶
- Surface comparison: Side-by-side 3D manifold analysis across models
- Projection comparison: Multi-model performance across parameter sweeps
- Spectral analysis: Token-frequency domain patterns reveal architectural differences
Documentation¶
Start with the basics:
- Methodology: Statistical corrections, progressive evaluation, spectral analysis
- Configuration: Templates, samplers, experiment configs, dataset formats
- Tasks: Parametric test generators, difficulty manifolds, task API
- Tools: Leaderboard, explorer, comparison utilities
Then use the navigation bar on the left to explore tasks, experiments, and tools in depth!
Citation¶
If you use ReasonScape in your research, please cite:
```bibtex
@software{reasonscape2025,
  title={ReasonScape: Information Processing Evaluation for Large Language Models},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape}
}
```
License¶
MIT
Acknowledgments¶
ReasonScape builds upon insights from BIG-Bench Hard, lm-evaluation-harness, and the broader AI evaluation community.