
ReasonScape: Information Processing Evaluation for Large Language Models

ReasonScape is a next-generation evaluation methodology that treats language models as information processing systems rather than text generation black boxes.

ReasonScape 4-image Collage

ReasonScape reveals cognitive architecture patterns invisible to traditional benchmarks: 3D reasoning landscapes (left), token-frequency spectral analysis (bottom right), and interactive exploration tools (top and middle right) enable systematic comparison of information processing capabilities across models and tasks.

🌐 Homepage: https://reasonscape.com/

🛠️ GitHub: the-crypt-keeper/reasonscape

📊 Live Visualization Tools:

📁 Raw Experiment Result Data:

Keywords: Large language models, AI evaluation, cognitive architectures, spectral analysis, statistical methodology, parametric testing, difficulty manifolds, information processing

What Makes ReasonScape Different?

Statistical Rigor

  • Excess Accuracy Correction: Remove guessing inflation, enable fair comparison across question formats
  • 95% Confidence Intervals: Wilson confidence intervals with truncation awareness
  • Dynamic Sample Sizing: Continue sampling until statistical significance or safety limits
  • Bias Correction: Handle multiple choice vs binary vs write-in tasks uniformly
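The corrections above can be sketched in a few lines of Python. This is a minimal illustration of the two ideas (chance-level rescaling and Wilson score intervals), not ReasonScape's actual implementation; `k`, `n`, and `n_choices` are illustrative parameter names.

```python
import math

def excess_accuracy(k: int, n: int, n_choices: int) -> float:
    """Remove guessing inflation: rescale raw accuracy so that
    chance-level performance maps to 0 and perfect accuracy to 1."""
    chance = 1.0 / n_choices          # e.g. 0.25 for 4-way multiple choice
    raw = k / n
    return max(0.0, (raw - chance) / (1.0 - chance))

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half
```

For example, 70/100 correct on a 4-choice task rescales to an excess accuracy of 0.6, and the Wilson interval on the raw proportion is roughly (0.60, 0.78) — this is why binary and multiple-choice tasks can be compared on one scale.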

Infinite Parametric Testcases

  • Deterministic Manifolds: Identical test sequences across runs via coordinate-based seeding
  • Response Caching: Never re-execute identical requests, dramatic cost reduction
  • Count-Invariant Generation: Smaller samples are perfect subsets of larger ones
  • Hierarchical Sampling: Downsample existing results or expand sample sizes seamlessly

Token-Frequency Domain Analysis

  • Spectral Analysis: FFT analysis of tokenized reasoning problems reveals cognitive architecture patterns
  • Population Validation: Verify test populations don't differ in unexpected ways
  • Quality Control: Detect systematic biases in test generation or model responses

See Methodology for additional information.

Multi-Domain Reasoning Assessment

Twelve cognitive domains with thousands of difficulty combinations:

| Domain | Focus | Primary Capabilities | Key Testing Conditions |
|---|---|---|---|
| Arithmetic | Mathematical reasoning | Math, Symbolic Parsing, Structural Analysis | Length scaling, depth nesting, whitespace randomization |
| Boolean | Logical evaluation | Logic, Symbolic Parsing, Structural Analysis | Length scaling, depth nesting, format variation, whitespace randomization |
| Objects | Selective attention | Selective Attention, Semantic Categorization, Language | Length scaling, distraction details, cross-category distractors, multi-category |
| Shuffle | State tracking | State Tracking, Selective Attention, Language | Length scaling, depth nesting, distraction instructions, multi-domain |
| Dates | Temporal reasoning | Math, Pattern Recognition, Temporal Reasoning, Language | Multi-step operations, format variation, multi-domain |
| Movies | Pattern recognition | Pattern Recognition, Semantic Categorization, Language | Length scaling, format variation, multi-domain, multi-category |
| Brackets | Structural parsing | Symbolic Parsing, Pattern Recognition, Structural Analysis | Length scaling, depth nesting, format variation |
| Letters | Character analysis | Math, Selective Attention, Symbolic Parsing, Language | Length scaling, case mutations, cross-category distractors |
| Shapes | Spatial reasoning | Symbolic Parsing, Pattern Recognition, Spatial Reasoning | Format variation, transformations |
| Cars | Logistics planning | State Tracking, Selective Attention, Spatial Reasoning, Language | Length scaling, multi-step operations, distraction instructions |
| Sort | Algorithmic thinking | Symbolic Parsing, Pattern Recognition, Language | Length scaling, case mutations |
| Sequence | Rule-based generation | Math, Logic, Symbolic Parsing, Language | Length scaling, multi-step operations |

See Tasks for further details.

M12X Evaluation Configuration

M12X provides flexible evaluation through three independent parameters, each controlling a different aspect of manifold testing:

Difficulty Level (--degree): Controls the complexity and range of values along the difficulty planes. Higher degrees increase the challenge by expanding parameter ranges and introducing more complex scenarios.

  • Degree 0: Easy difficulty across all 12 domains
  • Degree 1: Medium difficulty with increased complexity
  • Degree 2: Hard difficulty revealing model limitations

Sampling Strategy (--density): Determines which specific points within the difficulty value ranges are actually tested.

  • corner: Tests extreme parameter combinations at the edges of the difficulty space
  • lowdef: Sparse sampling for quick coverage with minimal computational cost
  • normal: Comprehensive sampling providing balanced coverage across the parameter space

Test Generation (--precision): Controls how many individual tests are generated at each sampled point in the parameter space. Higher precision provides better statistical confidence through larger sample sizes.

  • Low: Fast evaluation with basic statistical confidence
  • Medium: Balanced evaluation with good statistical rigor
  • High: Comprehensive evaluation with research-grade precision

Complete Configuration Guide

QuickStart: 5-Minute Evaluation

  1. Setup:

    git clone https://github.com/the-crypt-keeper/reasonscape.git
    cd reasonscape && pip install -r requirements.txt
    

  2. Start your LLM (any OpenAI-compatible API):

    # Example with llama.cpp
    ./llama-server --model your-model.gguf --port 3333
    

  3. Run a quick evaluation (M12X easy mode):

    python runner.py --config configs/m12x.yaml --degree 0 --density normal --precision low \
      --model your-model --apibase http://localhost:3333
    

  4. Generate analysis and view results:

    python evaluate.py --interview 'results/*/*.ndjson' --output results.json
    python leaderboard.py results.json  # Open http://localhost:8050
    python report.py results.json --output report.md  # Optional: generate markdown report
    

Progressive Evaluation Workflow

ReasonScape enables hierarchical evaluation: start small and scale up.

Stage 1: Rapid Model Comparison (2-3 hours)

python runner.py --config configs/m12x.yaml --degree 0 --density normal --precision low
  • Quick model comparison across all 12 reasoning domains
  • Easy difficulty with comprehensive sampling for rapid assessment
  • Statistical confidence with truncation handling

Stage 2: Standard Research Evaluation (8-12 hours)

python runner.py --config configs/m12x.yaml --degree 1 --density normal --precision medium
  • Detailed capability analysis across 12 cognitive domains
  • Medium difficulty with balanced parameter space coverage
  • Publication-ready statistical rigor

Stage 3: Deep Cognitive Analysis (20+ hours)

python runner.py --config configs/m12x.yaml --degree 2 --density normal --precision high
  • Comprehensive reasoning landscapes across all 12 domains
  • Hard difficulty with maximum parameter space exploration
  • Research-grade confidence intervals and spectral analysis

Next steps: See M12X Documentation for comprehensive evaluation workflows.

Analysis Tools

Interactive Leaderboard

ReasonScape Leaderboard

  • ReasonScore rankings across multiple reasoning domains with pagination
  • Token efficiency analysis for cost/performance optimization
  • Heatmap visualization with color-coded performance cells showing exactly where models break down
  • Truncation indicators displayed as rising darkness from the bottom of each cell
  • Statistical confidence indicators with 95% confidence intervals
  • Group and manifold filtering for focused analysis

3D Difficulty Manifold Explorer

ReasonScape Explorer Screenshot

  • Navigate reasoning landscapes as interactive 3D surfaces
  • Multi-panel analysis: FFT spectral analysis, accuracy plots, token distributions
  • Line projection analysis for systematic parameter studies
  • Cross-model comparison of cognitive architecture patterns

Comparison Tools

  • Surface comparison: Side-by-side 3D manifold analysis across models
  • Projection comparison: Multi-model performance across parameter sweeps
  • Spectral analysis: Token-frequency domain patterns reveal architectural differences

Documentation

Start with the basics:

  • Methodology: Statistical corrections, progressive evaluation, spectral analysis
  • Configuration: Templates, samplers, experiment configs, dataset formats
  • Tasks: Parametric test generators, difficulty manifolds, task API
  • Tools: Leaderboard, explorer, comparison utilities

Then use the navigation bar on the left side to explore tasks, experiments, and tools in depth!

Citation

If you use ReasonScape in your research, please cite:

@software{reasonscape2025,
  title={ReasonScape: Information Processing Evaluation for Large Language Models},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape}
}

License

MIT

Acknowledgments

ReasonScape builds upon insights from BIG-Bench Hard, lm-evaluation-harness, and the broader AI evaluation community.