ReasonScape treats LLMs as the Information Processing Systems they are - addressing the blind spots of static benchmarks with parametric difficulty, truncation-aware scoring/clustering, and forensic analysis (FFT, compression, hazard). Proven on 6.5B tokens across 75+ models—ready to use today.
Eight systemic problems in current LLM evaluation—solved with an information-processing pipeline.
Coordinate-based generation produces infinite, contamination-proof tests with controllable length, depth, interference, and format.
Wilson CIs, excess accuracy, and tiering keep aggregate scores honest and make head-to-head comparisons statistically meaningful.
FFT, compression, and hazard analyses expose reasoning quality, loops, and thinking budgets—not just final answers.
Truncations are first-class failures; score/token tracks efficiency so "expensive correctness" stops hiding behind averages.
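The compression signal mentioned above can be illustrated with a minimal sketch. This is not ReasonScape's actual analysis pipeline, just the underlying idea using `zlib` as a stand-in: a reasoning trace stuck in a loop is highly repetitive, so it compresses far better than genuinely varied reasoning.

```python
import random
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size / original size. Repetitive traces (reasoning
    loops) compress far more than genuinely varied reasoning."""
    raw = text.encode()
    return len(zlib.compress(raw, 9)) / max(len(raw), 1)

# Hypothetical traces: one stuck in a loop, one with varied content.
rng = random.Random(0)
looping = "Wait, let me reconsider the previous step. " * 100
varied = "".join(rng.choice("abcdefghijklmnopqrstuvwxyz .") for _ in range(4300))
# The looping trace compresses dramatically better than the varied one,
# which is the signature a compression-based loop detector keys on.
```

A low ratio on a long trace is a cheap, model-agnostic hint that generation has entered a loop rather than made progress.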
Explore the data, review the code, compare models, and inspect failure boundaries - all directly from your browser.
The 6.5B-token m12x dataset is ready to query. Pull it locally and start exploring reasoning surfaces.
Breadth of coverage matters. m12x spans the core reasoning workloads required for practical Reasoning LLM applications. Click on any card below for detailed information!
Length × depth manifolds stress symbolic computation with varying whitespace.
Nested expressions expose logical consistency; five notations test format sensitivity.
Stack discipline and pattern tracking, out-of-domain inputs and outputs.
Categorization and counting under load with distractors.
Swap sequences test working memory across length and depth with distractors.
Ordering and language reasoning, output formatting.
Calendar math and pattern recognition, date format variation.
Symbolic parsing with distractors.
Categorization with semantic cues.
Instruction following with complex constraints.
SVG shape recognition under rotation, translation, and transformation.
Absolute and relative spatial operations with interference.
ReasonScape treats LLM evaluation as an information-processing pipeline—from parametric test generation through statistical scoring to forensic root-cause analysis.
Parametric task manifolds; deterministic coordinates.
Adaptive sampling, caching, precision targeting.
Excess accuracy, truncation penalties, tier mapping.
Leaderboard, spider plots, surfaces for pattern finding.
FFT, compression, hazard to explain root causes.
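The first stage above — parametric manifolds addressed by deterministic coordinates — can be sketched in a few lines. This is a hypothetical toy generator, not ReasonScape's actual one: the point is only that the coordinate fully determines the test, so tests are reproducible on demand and never need to be stored or leaked into training data.

```python
import hashlib
import random

def generate_case(length: int, depth: int, salt: str = "demo") -> str:
    """Deterministically generate a nested-arithmetic test case from a
    (length, depth) difficulty coordinate. Toy sketch: the coordinate
    is hashed into a seed, so the same coordinate always yields the
    same test, while new coordinates yield unlimited fresh tests."""
    key = f"{salt}:{length}:{depth}".encode()
    rng = random.Random(int.from_bytes(hashlib.sha256(key).digest()[:8], "big"))

    def expr(d: int) -> str:
        # `length` operands per level, `depth` levels of nesting.
        terms = [str(rng.randint(1, 9)) if d == 0 else expr(d - 1)
                 for _ in range(length)]
        op = rng.choice(["+", "-"])
        return "(" + f" {op} ".join(terms) + ")"

    return expr(depth)
```

Because generation is a pure function of the coordinate, difficulty can be swept smoothly along each axis while contamination-proofing comes for free.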
Each tool addresses a specific evaluation question—from aggregate ranking to forensic analysis of temporal reasoning behavior.
Unified metric with Wilson CIs, truncation penalties, geometric mean for balance, and score/token efficiency.
Statistical grouping using confidence interval overlap to identify models that are truly indistinguishable.
3D visualization of accuracy across parameter grids to identify capability cliffs and performance boundaries.
Frequency domain analysis to distinguish tokenizer effects from model capabilities and output patterns.
Information-theoretic analysis revealing underthink/overthink patterns and reasoning loop detection.
Temporal failure analysis showing when and how models fail during token generation.
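Two of the statistical pieces above — Wilson confidence intervals and CI-overlap tiering — are compact enough to sketch. The `wilson_ci` function is the standard Wilson score interval; `tier_models` is a hypothetical greedy grouping for illustration, and ReasonScape's actual clustering may differ in detail.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z=1.96).
    Unlike the naive normal interval, it stays inside [0, 1] and
    behaves sensibly at small n or extreme accuracies."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

def tier_models(scores: dict[str, tuple[int, int]]) -> list[list[str]]:
    """Greedy CI-overlap tiering (illustrative sketch). `scores` maps
    model name -> (correct, total). Models whose intervals overlap the
    current tier leader's are statistically indistinguishable from it
    and land in the same tier."""
    cis = {m: wilson_ci(c, n) for m, (c, n) in scores.items()}
    ordered = sorted(cis, key=lambda m: cis[m][0], reverse=True)
    tiers: list[list[str]] = []
    for m in ordered:
        _, hi = cis[m]
        if tiers and cis[tiers[-1][0]][0] <= hi:
            tiers[-1].append(m)  # CI reaches the tier leader's lower bound
        else:
            tiers.append([m])    # clear separation: start a new tier
    return tiers
```

For example, 90/100 and 88/100 fall into one tier (their 95% intervals overlap), while 50/100 is cleanly separated — exactly the distinction that keeps head-to-head comparisons honest.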
ReasonScape supports four distinct research activities - each with different tools, questions, and outcomes.
"What's the best model overall?" — Aggregate rankings with ReasonScore to identify 3-5 candidates for deeper investigation.
"Which models are truly different?" — Statistical clustering with CI overlap to separate signal from measurement noise.
"What are the trade-offs?" — Profile cognitive fingerprints, capability zones, and cost/performance characteristics.
"Why/how/when did it fail?" — Root-cause analysis across input, reasoning, output, and temporal dimensions.