ReasonScape: Information Processing Evaluation for Large Language Models¶
ReasonScape is a next-generation evaluation methodology that treats language models as information processing systems rather than text generation black boxes.
ReasonScape reveals cognitive architecture patterns invisible to traditional benchmarks: 3D reasoning landscapes (left), token-frequency spectral analysis (bottom right), and interactive exploration tools (top and middle right) enable systematic comparison of information processing capabilities across models and tasks.
🌐 Homepage: https://reasonscape.com/
🛠️ GitHub: the-crypt-keeper/reasonscape
📊 Live Visualization Tools:

- M6 Leaderboards: https://reasonscape.com/m6/leaderboard
- M6 Explorer: https://reasonscape.com/m6/explorer (PC required)
- C2 Leaderboard: https://reasonscape.com/c2/leaderboard (Legacy)
- C2 Explorer: https://reasonscape.com/c2/explorer (Legacy, PC required)

📁 Raw Experiment Result Data:

- M6 Dataset: https://reasonscape.com/data/m6
- C2 Dataset: https://reasonscape.com/data/c2 (Legacy)
Keywords: Large language models, AI evaluation, cognitive architectures, spectral analysis, statistical methodology, parametric testing, difficulty manifolds, information processing
What Makes ReasonScape Different?¶
Statistical Rigor¶
- Excess Accuracy Correction: Remove guessing inflation, enable fair comparison across question formats
- 95% Confidence Intervals: Wilson confidence intervals with truncation awareness (both corrections are sketched below)
- Dynamic Sample Sizing: Continue sampling until statistical significance or safety limits
- Bias Correction: Handle multiple choice vs binary vs write-in tasks uniformly
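
A minimal sketch of the two core corrections, assuming standard guess-rate adjustment and Wilson score intervals (the truncation-awareness piece is omitted); this is illustrative, not ReasonScape's actual implementation:

```python
import math

def excess_accuracy(correct: int, total: int, guess_rate: float) -> float:
    """Accuracy above chance, rescaled to [0, 1].

    guess_rate is the expected accuracy of blind guessing: e.g. 0.25 for
    4-option multiple choice, 0.5 for binary, ~0.0 for write-in answers.
    """
    raw = correct / total
    return max(0.0, (raw - guess_rate) / (1.0 - guess_rate))

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple:
    """95% Wilson score interval for a binomial proportion."""
    if total == 0:
        return (0.0, 1.0)
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (max(0.0, center - half), min(1.0, center + half))

if __name__ == "__main__":
    # 70/100 correct on 4-option multiple choice: chance explains 25 points of it.
    print(excess_accuracy(70, 100, guess_rate=0.25))  # 0.6
    print(wilson_interval(70, 100))                   # approx (0.604, 0.781)
```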
Infinite Parametric Testcases¶
- Deterministic Manifolds: Identical test sequences across runs via coordinate-based seeding (see the sketch below)
- Response Caching: Never re-execute identical requests, dramatic cost reduction
- Count-Invariant Generation: Smaller samples are perfect subsets of larger ones
- Hierarchical Sampling: Downsample existing results or expand sample sizes seamlessly
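
A minimal sketch of how coordinate-based seeding yields deterministic, count-invariant generation; `seed_for`, `generate`, and the arithmetic-style parameters are illustrative stand-ins, not ReasonScape's task API:

```python
import hashlib
import random

def seed_for(task: str, coords: tuple) -> int:
    """Derive a stable seed from the task name and difficulty coordinate."""
    key = f"{task}:{coords}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")

def generate(task: str, coords: tuple, n: int) -> list:
    """Generate n test cases at one manifold coordinate.

    The RNG is seeded only by (task, coords), so the sequence is identical
    across runs, and generate(..., 10) is a strict prefix of
    generate(..., 100): smaller samples are perfect subsets of larger ones.
    """
    rng = random.Random(seed_for(task, coords))
    terms, max_operand = coords
    cases = []
    for _ in range(n):
        # Build a nested arithmetic expression whose size is set by the coordinate.
        expr = str(rng.randint(0, max_operand))
        for _ in range(terms - 1):
            expr = f"({expr} {rng.choice('+-*')} {rng.randint(0, max_operand)})"
        cases.append(expr)
    return cases

if __name__ == "__main__":
    small = generate("arithmetic", (4, 99), 10)
    large = generate("arithmetic", (4, 99), 100)
    assert large[:10] == small   # count-invariant: downsampling is free
    print(small[0])              # identical on every run
```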
Token-Frequency Domain Analysis¶
- Spectral Analysis: FFT analysis of tokenized reasoning problems reveals cognitive architecture patterns (see the sketch below)
- Population Validation: Verify test populations don't differ in unexpected ways
- Quality Control: Detect systematic biases in test generation or model responses
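
A minimal sketch of the spectral idea: treat each tokenized prompt as a signal, average the power spectra over a test population, and compare populations to catch unintended differences. Byte values stand in for real token IDs, and the window length is an arbitrary assumption:

```python
import numpy as np

def mean_power_spectrum(prompts: list, n_fft: int = 128) -> np.ndarray:
    """Average power spectrum of a population of prompts."""
    spectra = []
    for text in prompts:
        # In practice the target model's own tokenizer would produce the IDs;
        # raw byte values are used here to keep the sketch self-contained.
        ids = np.frombuffer(text.encode("utf-8"), dtype=np.uint8).astype(float)
        ids -= ids.mean()                              # remove DC offset
        if len(ids) < n_fft:                           # zero-pad or truncate
            ids = np.pad(ids, (0, n_fft - len(ids)))
        else:
            ids = ids[:n_fft]
        spectra.append(np.abs(np.fft.rfft(ids)) ** 2)  # power spectrum
    return np.mean(spectra, axis=0)

if __name__ == "__main__":
    # Two prompt populations that look equally "hard" can still differ
    # spectrally; comparing their mean spectra flags systematic generation biases.
    pop_a = [f"({i} + {i + 1}) * {i + 2} = ?" for i in range(50)]
    pop_b = [f"{i} + {i + 1} * {i + 2} = ?" for i in range(50)]
    delta = np.abs(mean_power_spectrum(pop_a) - mean_power_spectrum(pop_b))
    print(f"max spectral difference: {delta.max():.1f}")
```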
Multi-Domain Reasoning Assessment¶
Six cognitive domains with thousands of difficulty combinations:
| Domain | Focus | Key Challenges |
|---|---|---|
| Arithmetic | Mathematical reasoning | Operator precedence, nested parentheses, working memory |
| Boolean | Logical evaluation | Multiple notations, negation chains, operator precedence |
| Objects | Selective attention | Semantic categorization, distractor resistance, quantity aggregation |
| Shuffle | State tracking | Sequential transformations, confounding information filtering |
| Dates | Temporal reasoning | Calendar arithmetic, format recognition, multi-step inference |
| Movies | Pattern recognition | Thematic similarity, cultural knowledge, preference modeling |
Which ReasonScape Suite?¶
```mermaid
flowchart TD
    F{How much time/compute?}
    F -->|2-3 hours| G[M6 Degree 0<br/>Quick comparison]
    F -->|8-12 hours| H[M6 Degree 1<br/>Standard evaluation]
    F -->|20+ hours| I[M6 Degree 2<br/>Research grade]
```
→ Complete Suite Comparison Guide
QuickStart: 5-Minute Evaluation¶
- Setup:

  ```bash
  git clone https://github.com/the-crypt-keeper/reasonscape.git
  cd reasonscape && pip install -r requirements.txt
  ```

- Start your LLM (any OpenAI-compatible API):

  ```bash
  # Example with llama.cpp
  ./llama-server --model your-model.gguf --port 3333
  ```

- Run a quick evaluation (M6 easy mode, ~18M tokens):

  ```bash
  python runner.py --config configs/m6.yaml --degree 0 --precision low \
    --model your-model --apibase http://localhost:3333
  ```

- Generate analysis and launch leaderboard:

  ```bash
  python evaluate.py --interview 'results/*/*.ndjson' --output results.json
  python leaderboard.py results.json
  # Open http://localhost:8050
  ```
Progressive Evaluation Workflow¶
ReasonScape enables hierarchical evaluation, so you can start small and scale up:
Stage 1: Rapid Model Comparison (2-3 hours)¶
```bash
python runner.py --config configs/m6.yaml --degree 0 --precision low --density normal
```
Stage 2: Standard Research Evaluation (8-12 hours)¶
```bash
python runner.py --config configs/m6.yaml --degree 1 --precision medium --density normal
```
- Publication-ready statistical rigor
Stage 3: Deep Cognitive Analysis (20+ hours)¶
```bash
python runner.py --config configs/m6.yaml --degree 2 --precision high --density normal
```
Next steps: See M6 Documentation for comprehensive evaluation workflows.
Analysis Tools¶
Interactive Leaderboard¶
- ReasonScore rankings across multiple reasoning domains
- Token efficiency analysis for cost/performance optimization
- Embedded difficulty manifold visualization showing exactly where models break down
- Statistical confidence indicators with 95% confidence intervals
3D Difficulty Manifold Explorer¶
- Navigate reasoning landscapes as interactive 3D surfaces (see the sketch after this list)
- Multi-panel analysis: FFT spectral analysis, accuracy plots, token distributions
- Line projection analysis for systematic parameter studies
- Cross-model comparison of cognitive architecture patterns
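
A rough sketch of how a difficulty manifold can be rendered as a 3D surface (accuracy over a 2D grid of difficulty parameters), using plotly and synthetic data; the axis names and decay model are invented for illustration, while the bundled Explorer renders real results interactively:

```python
import numpy as np
import plotly.graph_objects as go

# Hypothetical difficulty axes for an arithmetic-style task.
depths = np.arange(1, 9)           # nesting depth
lengths = np.arange(2, 17, 2)      # number of terms

# Synthetic accuracy surface: performance decays as both axes grow.
D, L = np.meshgrid(depths, lengths)
accuracy = np.clip(1.05 - 0.06 * D - 0.035 * L, 0.0, 1.0)

fig = go.Figure(go.Surface(x=depths, y=lengths, z=accuracy, colorscale="Viridis"))
fig.update_layout(
    title="Illustrative difficulty manifold",
    scene=dict(xaxis_title="nesting depth",
               yaxis_title="term count",
               zaxis_title="accuracy"),
)
fig.show()
```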
Comparison Tools¶
- Surface comparison: Side-by-side 3D manifold analysis across models
- Projection comparison: Multi-model performance across parameter sweeps
- Spectral analysis: Token-frequency domain patterns reveal architectural differences
Documentation¶
Start with the basics:
- Methodology: Statistical corrections, progressive evaluation, spectral analysis
- Configuration: Templates, samplers, experiment configs, dataset formats
- Tasks: Parametric test generators, difficulty manifolds, task API
- Tools: Leaderboard, explorer, comparison utilities
Then use the navigation bar on the left to explore tasks, experiments, and tools in depth!
Citation¶
If you use ReasonScape in your research, please cite:
```bibtex
@software{reasonscape2025,
  title={ReasonScape: Information Processing Evaluation for Large Language Models},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape}
}
```
License¶
MIT
Acknowledgments¶
ReasonScape builds upon insights from BIG-Bench Hard, lm-evaluation-harness, and the broader AI evaluation community.