
ReasonScape: Information Processing Evaluation for Large Language Models

ReasonScape is a next-generation evaluation methodology that treats language models as information processing systems rather than text generation black boxes.

ReasonScape 4-image Collage

ReasonScape reveals cognitive architecture patterns invisible to traditional benchmarks: 3D reasoning landscapes (left), token-frequency spectral analysis (bottom right), and interactive exploration tools (top and middle right) enable systematic comparison of information processing capabilities across models and tasks.

🌐 Homepage: https://reasonscape.com/

🛠️ GitHub: the-crypt-keeper/reasonscape

📊 Live Visualization Tools:

📁 Raw Experiment Result Data:

Keywords: Large language models, AI evaluation, cognitive architectures, spectral analysis, statistical methodology, parametric testing, difficulty manifolds, information processing

What Makes ReasonScape Different?

Statistical Rigor

  • Excess Accuracy Correction: Remove guessing inflation and enable fair comparison across question formats (see the sketch after this list)
  • 95% Confidence Intervals: Wilson confidence intervals with truncation awareness
  • Dynamic Sample Sizing: Continue sampling until statistical significance or safety limits
  • Bias Correction: Handle multiple choice vs binary vs write-in tasks uniformly
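
The exact corrections and their derivations are covered in the Methodology documentation. As a rough illustration only (the function names and the k_choices parameter are invented here, not ReasonScape's API), the standard chance-correction formula and Wilson score interval can be sketched as:

    import math

    def excess_accuracy(raw_accuracy: float, k_choices: int) -> float:
        """Remove guessing inflation: rescale accuracy so that random guessing
        on a k-way task scores 0 and perfect performance scores 1."""
        chance = 1.0 / k_choices
        return (raw_accuracy - chance) / (1.0 - chance)

    def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
        """95% Wilson score interval for a binomial proportion."""
        if trials == 0:
            return (0.0, 1.0)
        p = successes / trials
        denom = 1.0 + z * z / trials
        centre = (p + z * z / (2 * trials)) / denom
        margin = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
        return (max(0.0, centre - margin), min(1.0, centre + margin))

    # Example: 70/100 correct on a 4-way multiple-choice task
    print(excess_accuracy(0.70, k_choices=4))  # 0.60 once the 25% guessing floor is removed
    print(wilson_interval(70, 100))            # roughly (0.60, 0.78)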

Infinite Parametric Testcases

  • Deterministic Manifolds: Identical test sequences across runs via coordinate-based seeding (sketched after this list)
  • Response Caching: Identical requests are never re-executed, dramatically reducing cost
  • Count-Invariant Generation: Smaller samples are perfect subsets of larger ones
  • Hierarchical Sampling: Downsample existing results or expand sample sizes seamlessly
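
A minimal sketch of how coordinate-based seeding yields count-invariant generation (the hashing scheme and testcase payload below are illustrative assumptions, not the project's actual generator):

    import hashlib
    import random

    def testcase_at(coords: tuple, index: int) -> dict:
        """Derive a deterministic seed from the difficulty coordinates and the
        sample index, so the same (coords, index) always yields the same case."""
        key = f"{coords}:{index}".encode()
        seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        rng = random.Random(seed)
        return {"coords": coords, "index": index, "operands": [rng.randint(1, 9) for _ in range(3)]}

    def generate(coords: tuple, count: int) -> list:
        """Count-invariant: the first N cases are identical no matter how many
        are requested, so smaller samples are perfect subsets of larger ones."""
        return [testcase_at(coords, i) for i in range(count)]

    # Downsampling or expanding a run never changes the cases already drawn.
    assert generate((3, 2), 10) == generate((3, 2), 50)[:10]

Because each request is fully determined by its coordinates and index, identical requests can also be cached and never re-executed.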

Token-Frequency Domain Analysis

  • Spectral Analysis: FFT analysis of tokenized reasoning problems reveals cognitive architecture patterns (see the sketch after this list)
  • Population Validation: Verify test populations don't differ in unexpected ways
  • Quality Control: Detect systematic biases in test generation or model responses
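
The full spectral pipeline is described in the Methodology documentation; a loose sketch of the idea (the fixed FFT length, mean-centering, and averaging below are illustrative assumptions, not ReasonScape's settings):

    import numpy as np

    def mean_power_spectrum(token_sequences, n_fft: int = 256) -> np.ndarray:
        """Average FFT power spectrum over a population of tokenized prompts.

        Each token-ID sequence is mean-centered, truncated or zero-padded to a
        fixed length, and transformed; comparing the averaged spectra of two
        test populations flags unexpected structural differences between them."""
        spectra = []
        for ids in token_sequences:
            x = np.asarray(ids[:n_fft], dtype=float)
            x = np.pad(x - x.mean(), (0, n_fft - len(x)))
            spectra.append(np.abs(np.fft.rfft(x)) ** 2)
        return np.mean(spectra, axis=0)

    # Usage: tokenize two populations of generated prompts with the same tokenizer,
    # then compare mean_power_spectrum(pop_a) against mean_power_spectrum(pop_b).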

Multi-Domain Reasoning Assessment

Six cognitive domains with thousands of difficulty combinations (an illustrative generator sketch follows the table):

Domain      Focus                   Key Challenges
Arithmetic  Mathematical reasoning  Operator precedence, nested parentheses, working memory
Boolean     Logical evaluation      Multiple notations, negation chains, operator precedence
Objects     Selective attention     Semantic categorization, distractor resistance, quantity aggregation
Shuffle     State tracking          Sequential transformations, confounding information filtering
Dates       Temporal reasoning      Calendar arithmetic, format recognition, multi-step inference
Movies      Pattern recognition     Thematic similarity, cultural knowledge, preference modeling
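
As a hypothetical illustration of how a parametric domain can map difficulty coordinates onto concrete testcases (the parameter names and expression grammar here are invented, not the Arithmetic task's real API):

    import random

    def make_arithmetic_case(depth: int, n_operands: int, seed: int) -> str:
        """Build a nested arithmetic expression whose difficulty is controlled by
        two coordinates: parenthesis nesting depth and operands per level."""
        rng = random.Random(seed)

        def expr(d: int) -> str:
            if d == 0:
                return str(rng.randint(1, 9))
            parts = [expr(d - 1) for _ in range(n_operands)]
            ops = [rng.choice("+-*") for _ in range(n_operands - 1)]
            body = parts[0]
            for op, part in zip(ops, parts[1:]):
                body += f" {op} {part}"
            return f"({body})"

        return f"Evaluate: {expr(depth)}"

    # Prints a deterministic depth-2 expression with three operands per level.
    print(make_arithmetic_case(depth=2, n_operands=3, seed=42))

Sweeping such coordinates produces the difficulty manifolds that the 3D explorer visualizes.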

Which ReasonScape Suite?

flowchart TD
    F{How much time/compute?}
    F -->|2-3 hours| G[M6 Degree 0<br/>Quick comparison]
    F -->|8-12 hours| H[M6 Degree 1<br/>Standard evaluation]  
    F -->|20+ hours| I[M6 Degree 2<br/>Research grade]

Complete Suite Comparison Guide

QuickStart: 5-Minute Evaluation

  1. Setup:

    git clone https://github.com/the-crypt-keeper/reasonscape.git
    cd reasonscape && pip install -r requirements.txt
    

  2. Start your LLM (any OpenAI-compatible API):

    # Example with llama.cpp
    ./llama-server --model your-model.gguf --port 3333
    

  3. Run a quick evaluation (M6 easy mode, ~18M tokens):

    python runner.py --config configs/m6.yaml --degree 0 --precision low \
      --model your-model --apibase http://localhost:3333
    

  4. Generate analysis and launch leaderboard:

    python evaluate.py --interview 'results/*/*.ndjson' --output results.json
    python leaderboard.py results.json  # Open http://localhost:8050
    

Progressive Evaluation Workflow

ReasonScape enables hierarchical evaluation, letting you start small and scale up:

Stage 1: Rapid Model Comparison (2-3 hours)

python runner.py --config configs/m6.yaml --degree 0 --precision low --density normal
  • 3-20M tokens for quick model comparison
  • 6 reasoning domains at easy difficulty
  • Statistical confidence with truncation handling

Stage 2: Standard Research Evaluation (8-12 hours)

python runner.py --config configs/m6.yaml --degree 1 --precision medium --density normal
  • 10-40M tokens for detailed capability analysis
  • Medium difficulty across all domains
  • Publication-ready statistical rigor

Stage 3: Deep Cognitive Analysis (20+ hours)

python runner.py --config configs/m6.yaml --degree 2 --precision high --density normal
  • 20-80M tokens for comprehensive reasoning landscapes
  • Hard difficulty revealing failure modes
  • Research-grade confidence intervals and spectral analysis

Next steps: See M6 Documentation for comprehensive evaluation workflows.

Analysis Tools

Interactive Leaderboard

ReasonScape Leaderboard

  • ReasonScore rankings across multiple reasoning domains
  • Token efficiency analysis for cost/performance optimization
  • Embedded difficulty manifold visualization showing exactly where models break down
  • Statistical confidence indicators with 95% confidence intervals

3D Difficulty Manifold Explorer

ReasonScape Explorer Screenshot

  • Navigate reasoning landscapes as interactive 3D surfaces
  • Multi-panel analysis: FFT spectral analysis, accuracy plots, token distributions
  • Line projection analysis for systematic parameter studies
  • Cross-model comparison of cognitive architecture patterns

Comparison Tools

  • Surface comparison: Side-by-side 3D manifold analysis across models
  • Projection comparison: Multi-model performance across parameter sweeps
  • Spectral analysis: Token-frequency domain patterns reveal architectural differences

Documentation

Start with the basics:

  • Methodology: Statistical corrections, progressive evaluation, spectral analysis
  • Configuration: Templates, samplers, experiment configs, dataset formats
  • Tasks: Parametric test generators, difficulty manifolds, task API
  • Tools: Leaderboard, explorer, comparison utilities

Then use the navigation bar on the left side to explore tasks, experiments and tools in-depth!

Citation

If you use ReasonScape in your research, please cite:

@software{reasonscape2025,
  title={ReasonScape: Information Processing Evaluation for Large Language Models},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape}
}

License

MIT

Acknowledgments

ReasonScape builds upon insights from BIG-Bench Hard, lm-evaluation-harness, and the broader AI evaluation community.