Why ReasonScape?
Traditional benchmarks treat models as black boxes, measuring only final outputs and reporting a single aggregate score. ReasonScape instead treats LLMs as information processing systems, combining parametric test generation, spectral analysis, and interactive 3D visualization. Because tests are generated on demand rather than drawn from a fixed dataset, this approach eliminates contamination, provides scalable difficulty, and enables large-scale analysis of how models actually reason.

Revolutionary Methodology

Treating language models as information processing systems

3D Difficulty Manifolds

Navigate reasoning landscapes as interactive 3D terrain. Explore how model performance varies across multiple difficulty dimensions simultaneously with enhanced surface analysis.

Token-Frequency Analysis

Apply FFT to tokenized reasoning problems, revealing spectral signatures and validating difficulty parameters through frequency domain analysis of cognitive architectures.
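
As a rough illustration of frequency-domain analysis on a token sequence, the sketch below treats token ids as a real-valued signal and computes a magnitude spectrum. How ReasonScape actually maps tokens to a signal is not described here, so that mapping, the function names, and the sample token ids are all assumptions. A naive DFT stands in for an FFT to keep the example dependency-free; the two produce the same spectrum.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude spectrum (bins 0..n/2) of a real signal via a naive O(n^2) DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]  # remove the DC offset so bin 0 does not dominate
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(
            centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        mags.append(abs(acc) / n)
    return mags


def dominant_bin(token_ids):
    """Index of the strongest non-DC frequency bin."""
    mags = dft_magnitudes([float(t) for t in token_ids])
    return max(range(1, len(mags)), key=lambda i: mags[i])


# Hypothetical token ids: a period-4 pattern repeated 8 times (32 tokens).
# A period of 4 over 32 samples concentrates energy in bin 32 / 4 = 8.
token_ids = [10, 20, 10, 0] * 8
peak = dominant_bin(token_ids)
```

In this toy case the repeating structure of the prompt shows up as a single spectral peak; real prompts would give a broader spectrum.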

Parametric Test Generation

Generate an effectively unlimited supply of unique test instances within controlled difficulty manifolds. Eliminate contamination through deterministic coordinate-based seeding and hierarchical sampling.
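
One way deterministic coordinate-based seeding could work is to hash the task name, the difficulty coordinates, and the sample index into a seed for a private random generator. The sketch below assumes that scheme; the function names, the `length` and `max_value` coordinates, and the toy arithmetic task are hypothetical and not taken from ReasonScape itself.

```python
import hashlib
import json
import random


def seed_for(task, coords, index):
    """Derive a stable 64-bit seed from a task name, a coordinate dict, and a sample index."""
    key = json.dumps({"task": task, "coords": coords, "index": index}, sort_keys=True)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def generate_instance(task, coords, index):
    """Build one test instance; the same inputs always yield the same instance."""
    rng = random.Random(seed_for(task, coords, index))
    # Hypothetical task: sum `length` operands, each between 1 and `max_value`.
    operands = [rng.randint(1, coords["max_value"]) for _ in range(coords["length"])]
    return {"prompt": " + ".join(map(str, operands)), "answer": sum(operands)}


point = {"length": 4, "max_value": 99}
first = generate_instance("arithmetic", point, 0)
repeat = generate_instance("arithmetic", point, 0)
```

Because the seed depends only on the coordinates and the index, a run can be reproduced or extended with more samples at the same point without storing any test data.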

Statistical Rigor

Excess accuracy correction, truncation handling, and dynamic confidence intervals based on the Wilson score method ensure meaningful model and task comparisons.
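
A common form of chance correction rescales raw accuracy so that the guessing rate maps to 0 and a perfect score maps to 1. The sketch below assumes that this is what excess accuracy correction means; the exact formula used by ReasonScape, the function name, and the clamp at zero are assumptions.

```python
def excess_accuracy(correct, total, guess_rate):
    """Rescale raw accuracy so that `guess_rate` maps to 0.0 and a perfect score to 1.0."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 <= guess_rate < 1.0:
        raise ValueError("guess_rate must be in [0, 1)")
    raw = correct / total
    # Assumption: scores below chance are clamped to zero rather than reported as negative.
    return max(0.0, (raw - guess_rate) / (1.0 - guess_rate))


# 70 of 100 correct on a 4-way multiple-choice task (25% guessing rate).
score = excess_accuracy(70, 100, 0.25)
```

With this correction, a task with a high guessing rate (such as true/false) is no longer scored more generously than one where guessing almost never succeeds.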

Progressive Evaluation

Three orthogonal controls (difficulty, sampling, precision) enable flexible evaluation from rapid 5-minute scans to comprehensive research-grade analysis.

Cognitive Architecture Insights

Reveal patterns invisible to traditional benchmarks through spectral analysis, parametric testing, and interactive visualization of information processing capabilities.


Three Orthogonal Evaluation Controls

The M12X suite is configured through three independent parameters

🎚️
Difficulty Level
--degree controls the complexity and range of values along difficulty planes. Higher degrees expand/shift the difficulty parameter ranges and introduce more complex scenarios.
  • Degree 0: Easy tasks with low/no interference to understand baseline capabilities
  • Degree 1: Medium complexity with expanded ranges, introduction of interference
  • Degree 2: Hard problems with maximum interference revealing model limitations
📐
Sampling Strategy
--density determines which specific points within the difficulty value ranges are actually tested.
  • corner: Extreme parameter combinations at edges
  • lowdef: Sparse sampling for quick broad coverage
  • normal: Comprehensive balanced parameter space coverage
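
To make the three density options concrete, the sketch below picks points from a small grid of difficulty values. The selection rules (first and last value for corner, every other value for lowdef, every value for normal), the function name, and the `length` and `depth` axes are illustrative assumptions; the real sampling rules may differ.

```python
from itertools import product


def sample_points(axes, density):
    """Return the list of coordinate dicts tested for a given density setting."""

    def pick(values):
        if density == "corner":
            # Only the extremes of each axis.
            return values[:1] + values[-1:] if len(values) > 1 else list(values)
        if density == "lowdef":
            # Sparse coverage: every other value.
            return values[::2]
        if density == "normal":
            # Full coverage of the axis.
            return list(values)
        raise ValueError(f"unknown density: {density}")

    names = list(axes)
    combos = product(*(pick(axes[name]) for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Hypothetical difficulty axes for one task.
axes = {"length": [2, 4, 8, 16, 32], "depth": [1, 2, 3]}
corner = sample_points(axes, "corner")
lowdef = sample_points(axes, "lowdef")
normal = sample_points(axes, "normal")
```

On this 5 by 3 grid, the three settings test 4, 6, and 15 points respectively, which shows how density trades coverage for run time.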
🎯
Test Precision
--precision controls how many individual tests are generated at each sampled point for better statistical confidence.
  • flash: 32 tests per point (instant overview)
  • low: 32-192 tests with 9% CI target, standard evaluation
  • medium: 64-512 tests with 6% CI target, smoother surfaces
  • high: 128-1280 tests with 4% CI target, head-to-head comparisons
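
The precision levels above suggest an adaptive loop: run at least the minimum number of tests, then stop once the confidence interval is tight enough or the maximum is reached. The sketch below assumes a 95% Wilson score interval and reads the CI target as a half-width; both are assumptions, as are the batch size of 32 and the names `wilson_interval` and `run_point`.

```python
import math

# (minimum tests, maximum tests, CI half-width target) per precision level.
PRECISION = {
    "flash": (32, 32, None),
    "low": (32, 192, 0.09),
    "medium": (64, 512, 0.06),
    "high": (128, 1280, 0.04),
}


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def run_point(run_test, precision, batch=32):
    """Run tests in batches until the CI target is met or the maximum is reached."""
    min_tests, max_tests, target = PRECISION[precision]
    correct = total = 0
    while total < max_tests:
        step = min(batch, max_tests - total)
        correct += sum(1 for _ in range(step) if run_test())
        total += step
        if target is not None and total >= min_tests:
            low, high = wilson_interval(correct, total)
            if (high - low) / 2 <= target:
                break
    return correct, total


# A stand-in "model" that always answers correctly converges at the minimum.
always_right = run_point(lambda: True, "high")
```

Points where a model is near 0% or 100% accuracy converge quickly, while points near 50% need the most tests, so the test budget is spent where the uncertainty is.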
Progressive Evaluation Workflow

These three parameters are independent and orthogonal. Mix them to match your needs:

  • Quick model comparison: --degree 0 --density normal --precision low — Easy difficulty, comprehensive sampling, basic confidence (2-3 hours)
  • Standard research evaluation: --degree 1 --density normal --precision medium — Medium difficulty, balanced coverage, good statistical rigor (8-12 hours)
  • Deep cognitive analysis: --degree 2 --density normal --precision high — Hard difficulty, full parameter exploration, research-grade precision (20+ hours)
  • Edge case stress testing: --degree 2 --density corner --precision flash — Hardest extremes only, instant feedback
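
Since the three flags are independent, each workflow above is just a different combination of the same arguments. The sketch below mirrors the flags with `argparse`; the program name `evaluate`, the preset names, and the default values are hypothetical, and only the flag names and their values come from the lists above.

```python
import argparse

# The four workflows listed above, expressed as argument lists.
PRESETS = {
    "quick": ["--degree", "0", "--density", "normal", "--precision", "low"],
    "standard": ["--degree", "1", "--density", "normal", "--precision", "medium"],
    "deep": ["--degree", "2", "--density", "normal", "--precision", "high"],
    "stress": ["--degree", "2", "--density", "corner", "--precision", "flash"],
}


def build_parser():
    """Parser with one flag per control; each flag is validated independently."""
    parser = argparse.ArgumentParser(prog="evaluate")  # hypothetical program name
    parser.add_argument("--degree", type=int, choices=[0, 1, 2], default=0)
    parser.add_argument(
        "--density", choices=["corner", "lowdef", "normal"], default="normal"
    )
    parser.add_argument(
        "--precision", choices=["flash", "low", "medium", "high"], default="low"
    )
    return parser


stress = build_parser().parse_args(PRESETS["stress"])
custom = build_parser().parse_args(["--degree", "1", "--density", "lowdef"])
```

Any combination outside the four presets is equally valid, for example medium difficulty with sparse sampling for a faster first look.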

Twelve Cognitive Domains

Comprehensive assessment across diverse reasoning capabilities.
Select a task domain below to see its detailed documentation!

Analysis Tools

Raw data is valuable, but evaluations that produce billions of tokens are hard to analyze by hand. ReasonScape includes several visualization tools for exploring and comparing model reasoning capabilities.

ReasonScape Leaderboard

Interactive Leaderboard

  • ReasonScore rankings across multiple reasoning domains with pagination
  • Token efficiency analysis for cost/performance optimization
  • Bar visualization with color-coded performance levels for each task
  • Truncation indicators displayed as crosshatch overlay from the left of each cell
  • Statistical confidence indicators with 95% confidence intervals
  • Model family and size filtering for focused analysis and peer comparison
ReasonScape Explorer

3D Difficulty Manifold Explorer

  • Navigate reasoning landscapes as interactive 3D surfaces
  • Multi-panel analysis: FFT spectral analysis, accuracy plots, token distributions
  • Line projection analysis for systematic parameter studies
  • Cross-model comparison of cognitive architecture patterns
Surface Comparison

Comparison Tools

  • Surface comparison: side-by-side 3D manifold analysis across models
  • Projection comparison: multi-model performance across parameter sweeps
  • Spectral analysis: token-frequency domain patterns reveal architectural differences

ReasonScape: Information Processing Evaluation for Large Language Models

Mikhail Ravkine • 2025

ReasonScape introduces a next-generation evaluation methodology that treats language models as analyzable information processing systems. Through parametric test generation, spectral analysis, and interactive visualization, ReasonScape reveals cognitive architecture patterns invisible to traditional benchmarks. The M12X suite provides comprehensive assessment across twelve cognitive domains with progressive difficulty degrees, enabling both rapid model comparison and research-grade analysis.

@software{reasonscape2025,
  title={ReasonScape: Information Processing Evaluation for Large Language Models},
  author={Mikhail Ravkine},
  year={2025},
  url={https://github.com/the-crypt-keeper/reasonscape}
}