ReasonScape treats LLMs as the Information Processing Systems they are, addressing the blind spots of static benchmarks with parametric difficulty, truncation-aware scoring/clustering, and forensic analysis (FFT, compression, hazard). Proven on 3B+ tokens across 30+ models and ready to use today.
Systemic problems in current LLM evaluation—solved with an information-processing pipeline.
Coordinate-based generation produces infinite, contamination-proof tests with controllable length, depth, interference, and format.
Wilson CIs, excess accuracy, and tiering keep aggregate scores honest and make head-to-head comparisons statistically meaningful.
FFT, compression, and hazard analyses expose reasoning quality, loops, and thinking budgets—not just final answers.
Truncations are first-class failures; score/token tracks efficiency so "expensive correctness" stops hiding behind averages.
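The statistical core above can be sketched in a few lines. This is an illustrative implementation of the standard Wilson score interval plus an excess-accuracy rescaling over a guess-rate baseline; the function names are hypothetical, not ReasonScape's actual API.

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% CI by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

def excess_accuracy(p: float, guess_rate: float) -> float:
    """Accuracy above the random-guess baseline, rescaled to [0, 1]."""
    return max(0.0, (p - guess_rate) / (1 - guess_rate))
```

Unlike the naive normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly at small sample sizes, which matters when sampling adaptively.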
Explore the data, review the code, compare models, and inspect failure boundaries - all directly from your browser.
The r12 dataset is ready to query. Pull it locally and start exploring reasoning surfaces.
Breadth of coverage matters. r12 spans the core reasoning workloads required for practical Reasoning LLM applications. Click on any card below for detailed information!
Length × depth manifolds stress symbolic computation with varying whitespace.
Nested expressions expose logical consistency; five notations test format sensitivity.
Stack discipline and pattern tracking, out-of-domain inputs and outputs.
Categorization and counting under load with distractors.
Swap sequences test working memory across length and depth with distractors.
Ordering and language reasoning, output formatting.
Calendar math and pattern recognition, date format variation.
Symbolic parsing with distractors.
Structured data lookup and aggregation under row and column interference.
Instruction following with complex constraints.
SVG shape recognition under rotation, translation, and transformation.
Absolute and Relative spatial operations with interference.
ReasonScape treats LLM evaluation as an information-processing pipeline—from parametric test generation through statistical scoring to forensic root-cause analysis.
Parametric task manifolds; deterministic coordinates.
Adaptive sampling, caching, precision targeting.
Excess accuracy, truncation penalties, tier mapping.
Leaderboard, spider plots, surfaces for pattern finding.
FFT, compression, hazard to explain root causes.
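The "parametric task manifolds; deterministic coordinates" stage can be sketched in miniature: hash a difficulty coordinate into a stable RNG seed, so the same coordinate always regenerates the same test item. All names here are hypothetical and the real generator is far richer; this only illustrates the determinism property.

```python
import hashlib
import random

def coord_seed(*coord) -> int:
    """Stable 64-bit seed derived from a difficulty coordinate, so the same
    coordinate yields the same item on every machine and every run."""
    digest = hashlib.sha256(repr(coord).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def make_arithmetic_item(digits: int, depth: int, index: int) -> str:
    """Generate one nested arithmetic expression at coordinate (digits, depth, index).
    Deterministic regeneration makes the test set infinite yet contamination-proof."""
    rng = random.Random(coord_seed("arithmetic", digits, depth, index))

    def expr(d: int) -> str:
        if d == 0:
            return str(rng.randint(0, 10 ** digits - 1))
        op = rng.choice(["+", "-", "*"])
        return f"({expr(d - 1)} {op} {expr(d - 1)})"

    return expr(depth)
```

Because items are derived rather than stored, there is no fixed answer key to leak into training data.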
Each tool addresses a specific evaluation question—from aggregate ranking to temporal reasoning behavior forensics.
Unified metric with Wilson CIs, truncation penalties, geometric mean for balance, and score/token efficiency.
Statistical grouping using confidence interval overlap to identify models that are truly indistinguishable.
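A simplified sketch of CI-overlap tiering (hypothetical helper, not the actual clustering code; assumes models arrive sorted by score, best first):

```python
def tier_models(models: list[tuple[str, float, float]]) -> list[list[tuple[str, float, float]]]:
    """Group (name, ci_low, ci_high) tuples, sorted by score descending, into tiers.
    A model joins the current tier if its CI overlaps any member's CI;
    otherwise it starts a new tier. Models in one tier are statistically
    indistinguishable at the chosen confidence level."""
    tiers: list[list[tuple[str, float, float]]] = []
    for name, lo, hi in models:
        if tiers and any(hi >= member_lo for _, member_lo, _ in tiers[-1]):
            tiers[-1].append((name, lo, hi))
        else:
            tiers.append([(name, lo, hi)])
    return tiers
```

This is why a leaderboard gap of a point or two often means nothing: if the intervals overlap, the models land in the same tier.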
3D visualization of accuracy across parameter grids to identify capability cliffs and performance boundaries.
Frequency domain analysis to distinguish tokenizer effects from model capabilities and output patterns.
Information-theoretic analysis revealing underthink/overthink patterns and reasoning loop detection.
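One standard information-theoretic signal for reasoning loops, shown here as a hedged sketch rather than ReasonScape's actual method, is a compression ratio: repetitive traces compress far better than genuinely novel thinking.

```python
import zlib

def compression_ratio(text: str) -> float:
    """Compressed size over raw size. A trace stuck in a loop (restated steps,
    repeated 'wait, let me reconsider' cycles) compresses to a small fraction
    of its length, so an unusually low ratio flags probable looping."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)
```

Underthinking shows up the other way: a short trace with a high ratio suggests the model never elaborated at all.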
Temporal failure analysis showing when and how models fail during token generation.
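Hazard analysis in the survival-analysis sense can be sketched as a discrete empirical hazard over token positions (hypothetical helper; the actual failure and censoring definitions are ReasonScape's own):

```python
from collections import Counter

def empirical_hazard(failure_pos: list[int], completed_len: list[int]) -> dict[int, float]:
    """Discrete hazard over token position: among traces still generating at
    position t, what fraction fail at t? failure_pos gives positions where a
    failure began; completed_len gives lengths of traces that finished cleanly
    (treated as censored observations)."""
    failures = Counter(failure_pos)
    exits = failures + Counter(completed_len)  # traces leaving the risk set at t
    at_risk = len(failure_pos) + len(completed_len)
    hazard: dict[int, float] = {}
    for t in sorted(exits):
        hazard[t] = failures[t] / at_risk
        at_risk -= exits[t]
    return hazard
```

A hazard curve that spikes at a particular position answers "when do models break", not just "how often".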
Shows how accuracy and completion-token distributions (faceted by correct vs. incorrect) change along a difficulty axis.
Radial cognitive profile with dual-axis (accuracy, token usage) across 12 tasks × 3 difficulties.
Bradley-Terry cohort comparisons for head-to-head model rankings.
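A minimal Bradley-Terry fit via the standard minorize-maximize (Zermelo) iteration, assuming a simple win-count matrix; this is an illustrative sketch, not ReasonScape's implementation.

```python
def bradley_terry(wins: list[list[int]], n_models: int, iters: int = 200) -> list[float]:
    """Fit Bradley-Terry strengths p, where P(i beats j) = p[i] / (p[i] + p[j]).
    wins[i][j] = number of head-to-head comparisons model i won over model j."""
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            w_i = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(n_models) if j != i)
            new.append(w_i / denom if denom else p[i])
        total = sum(new)
        p = [x * n_models / total for x in new]  # normalize; only ratios matter
    return p
```

The fitted ratios turn pairwise win counts into a single transitive ranking with a probabilistic interpretation.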
ReasonScape supports three distinct research activities - each with different tools, questions, and outcomes.
"Which model is better?" — Aggregate rankings to identify 3-5 candidates for deeper investigation.
"What can this model do?" — Profile cognitive fingerprints, capability zones, and cost/performance characteristics.
"How did this fail?" — Examine raw thinking traces with loop detection and classification.