ReasonScape treats LLMs as the Information Processing Systems they are, addressing the blind spots of static benchmarks with parametric difficulty, truncation-aware scoring/clustering, and forensic analysis (FFT, compression, hazard). Proven on 3B+ tokens across 30+ models and ready to use today.
Systemic problems in current LLM evaluation—solved with an information-processing pipeline.
Coordinate-based generation produces infinite, contamination-proof tests with controllable length, depth, interference, and format.
Wilson CIs, excess accuracy, and tiering keep aggregate scores honest and make head-to-head comparisons statistically meaningful.
FFT, compression, and hazard analyses expose reasoning quality, loops, and thinking budgets—not just final answers.
Truncations are first-class failures; score/token tracks efficiency so "expensive correctness" stops hiding behind averages.
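The statistical core above can be sketched in a few lines. This is an illustrative implementation of the standard Wilson score interval plus an excess-accuracy rescaling over a guess-rate baseline; the function names are hypothetical, not ReasonScape's actual API.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

def excess_accuracy(p: float, guess_rate: float) -> float:
    """Accuracy above the random-guess baseline, rescaled to [0, 1]."""
    return max(0.0, (p - guess_rate) / (1 - guess_rate))
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters when sampling adaptively.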
Explore the data, review the code, compare models, and inspect failure boundaries - all directly from your browser.
The r12 dataset is ready to query. Pull it locally and start exploring reasoning surfaces.
Breadth of coverage matters. r12 spans the core reasoning workloads required for practical Reasoning LLM applications. Click on any card below for detailed information!
Length × depth manifolds stress symbolic computation with varying whitespace.
Nested expressions expose logical consistency; five notations test format sensitivity.
Stack discipline and pattern tracking, out-of-domain inputs and outputs.
Categorization and counting under load with distractors.
Swap sequences test working memory across length and depth with distractors.
Ordering and language reasoning, output formatting.
Calendar math and pattern recognition, date format variation.
Symbolic parsing with distractors.
Structured data lookup and aggregation under row and column interference.
Instruction following with complex constraints.
SVG shape recognition under rotation, translation, and transformation.
Absolute and Relative spatial operations with interference.
ReasonScape treats LLM evaluation as an information-processing pipeline—from parametric test generation through statistical scoring to forensic root-cause analysis.
Parametric task manifolds; deterministic coordinates.
Adaptive sampling, caching, precision targeting.
Excess accuracy, truncation penalties, tier mapping.
Leaderboard, spider plots, surfaces for pattern finding.
FFT, compression, hazard to explain root causes.
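The "parametric task manifolds; deterministic coordinates" stage can be sketched in miniature: hash a difficulty coordinate into a stable RNG seed, so the same coordinate always regenerates the same test item. All names here are hypothetical and the real generator is far richer; this only illustrates the determinism property.

```python
import hashlib
import random

def coord_seed(*coord) -> int:
    """Stable 64-bit seed derived from a difficulty coordinate, so the same
    coordinate yields the same item on every machine and every run."""
    digest = hashlib.sha256(repr(coord).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def make_arithmetic_item(digits: int, depth: int, index: int) -> str:
    """Generate one nested arithmetic expression at coordinate (digits, depth, index).
    Deterministic regeneration makes the test set infinite yet contamination-proof."""
    rng = random.Random(coord_seed("arithmetic", digits, depth, index))

    def expr(d: int) -> str:
        if d == 0:
            return str(rng.randint(0, 10 ** digits - 1))
        op = rng.choice(["+", "-", "*"])
        return f"({expr(d - 1)} {op} {expr(d - 1)})"

    return expr(depth)
```

Because items are derived rather than stored, there is no fixed answer key to leak into training data.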
Each tool addresses a specific evaluation question—from aggregate ranking to temporal reasoning behavior forensics.
Unified metric with Wilson CIs, truncation penalties, geometric mean for balance, and score/token efficiency.
Statistical grouping using confidence interval overlap to identify models that are truly indistinguishable.
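A simplified sketch of CI-overlap tiering (hypothetical helper, not the actual clustering code; assumes models arrive sorted by score, best first):

```python
def tier_models(models: list[tuple[str, float, float]]) -> list[list[tuple[str, float, float]]]:
    """Group (name, ci_low, ci_high) tuples, sorted by score descending, into tiers.
    A model joins the current tier if its CI overlaps any member's CI;
    otherwise it starts a new tier. Models in one tier are statistically
    indistinguishable at the chosen confidence level."""
    tiers: list[list[tuple[str, float, float]]] = []
    for name, lo, hi in models:
        if tiers and any(hi >= member_lo for _, member_lo, _ in tiers[-1]):
            tiers[-1].append((name, lo, hi))
        else:
            tiers.append([(name, lo, hi)])
    return tiers
```

This is why a leaderboard gap of a point or two often means nothing: if the intervals overlap, the models land in the same tier.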
3D visualization of accuracy across parameter grids to identify capability cliffs and performance boundaries.
Frequency domain analysis to distinguish tokenizer effects from model capabilities and output patterns.
Information-theoretic analysis revealing underthink/overthink patterns and reasoning loop detection.
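One standard information-theoretic signal for reasoning loops, shown here as a hedged sketch rather than ReasonScape's actual method, is a compression ratio: repetitive traces compress far better than genuinely novel thinking.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size. A trace stuck in a loop (restated steps,
    repeated 'wait, let me reconsider' cycles) compresses to a small fraction
    of its length, so an unusually low ratio flags probable looping."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Underthinking shows up the other way: a short trace with a high ratio suggests the model never elaborated at all.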
Temporal failure analysis showing when and how models fail during token generation.
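Hazard analysis in the survival-analysis sense can be sketched as a discrete empirical hazard over token positions (hypothetical helper; the actual failure and censoring definitions are ReasonScape's own):

```python
from collections import Counter

def empirical_hazard(failure_pos: list[int], completed_len: list[int]) -> dict[int, float]:
    """Discrete hazard over token position: among traces still generating at
    position t, what fraction fail at t? failure_pos gives positions where a
    failure began; completed_len gives lengths of traces that finished cleanly
    (treated as censored observations)."""
    failures = Counter(failure_pos)
    exits = failures + Counter(completed_len)  # traces leaving the risk set at t
    at_risk = len(failure_pos) + len(completed_len)
    hazard: dict[int, float] = {}
    for t in sorted(exits):
        hazard[t] = failures[t] / at_risk
        at_risk -= exits[t]
    return hazard
```

A hazard curve that spikes at a particular position answers "when do models break", not just "how often".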
Shows how accuracy and completion-token distributions (faceted by correct vs. incorrect) change along a difficulty axis.
Radial cognitive profile with dual-axis (accuracy, token usage) across 12 tasks × 3 difficulties.
Bradley-Terry cohort comparisons for head-to-head model rankings.
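A minimal Bradley-Terry fit via the standard minorize-maximize (Zermelo) iteration, assuming a simple win-count matrix; this is an illustrative sketch, not ReasonScape's implementation.

```python
def bradley_terry(wins: list[list[int]], n_models: int, iters: int = 200) -> list[float]:
    """Fit Bradley-Terry strengths p, where P(i beats j) = p[i] / (p[i] + p[j]).
    wins[i][j] = number of head-to-head comparisons model i won over model j."""
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new.append(w_i / denom if denom else p[i])
        total = sum(new)
        p = [x * n_models / total for x in new]  # normalize; only ratios matter
    return p
```

The fitted ratios turn pairwise win counts into a single transitive ranking with a probabilistic interpretation.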
ReasonScape supports three distinct research activities - each with different tools, questions, and outcomes.
"Which model is better?" — Aggregate rankings to identify 3-5 candidates for deeper investigation.
"What can this model do?" — Profile cognitive fingerprints, capability zones, and cost/performance characteristics.
"How did this fail?" — Examine raw thinking traces with loop detection and classification.