🗺️ ReasonScape

LLM evaluation that considers how models think.

ReasonScape treats LLMs as the Information Processing Systems they are - addressing the blind spots of static benchmarks with parametric difficulty, truncation-aware scoring/clustering, and forensic analysis (FFT, compression, hazard). Proven on 6B+ tokens across 65+ models, and ready to use today.

At a glance
Tokens Analyzed 6B+
Models Benchmarked 65+
Reasoning Tasks 12
Total Prompts 27K

We fix the biggest evaluation blind spots

Systemic problems in current LLM evaluation—solved with an information-processing pipeline.

Difficulty control

Parametric manifolds

Coordinate-based generation produces infinite, contamination-proof tests with controllable length, depth, interference, and format.

3D surface visualization showing performance across difficulty parameters
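To make the idea concrete, here is a minimal sketch of coordinate-based generation, not ReasonScape's actual generator: the function name `make_problem` and its parameters (`depth`, `n_distractors`, `seed`) are hypothetical. Each coordinate tuple deterministically yields a fresh test item, so difficulty is controllable and there is nothing fixed to leak into training data.

```python
import random

def make_problem(depth, n_distractors, seed):
    """Hypothetical coordinate-based generator: the tuple (depth,
    n_distractors, seed) is one point on a difficulty manifold.
    The same point always yields the same item; new points are infinite."""
    rng = random.Random(hash((depth, n_distractors, seed)))
    start = rng.randint(1, 9)
    deltas = [rng.randint(1, 9) for _ in range(depth)]  # controls reasoning depth
    lines = [f"add {d}" for d in deltas]
    for _ in range(n_distractors):  # controls interference
        lines.insert(rng.randrange(len(lines) + 1),
                     f"note that the sky is {rng.choice(['red', 'blue'])}")
    prompt = f"Start with {start}, then: " + "; ".join(lines) + ". What is the result?"
    answer = start + sum(deltas)
    return prompt, answer
```

Because generation is a pure function of the coordinate, regenerating any region of the difficulty surface is cheap and contamination-proof by construction.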
Truthful scoring

Per-point + confidence

Wilson CIs, excess accuracy, and tiering keep aggregate scores honest and make head-to-head comparisons statistically meaningful.

Statistical clustering with confidence intervals
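The Wilson interval mentioned above is a standard formula; a minimal sketch (not ReasonScape's code) shows why it matters for honest comparisons: the same point accuracy carries very different certainty at different sample sizes.

```python
import math

def wilson_ci(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (95% by default)."""
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2))
    return (center - half, center + half)

# Same 80% accuracy, very different evidence:
lo_a, hi_a = wilson_ci(80, 100)  # 80/100 correct -> tight interval
lo_b, hi_b = wilson_ci(8, 10)    # 8/10 correct  -> wide interval
```

Two models whose intervals overlap cannot honestly be ranked against each other, which is exactly what the tiering step exploits.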
Process visibility

Forensic signals

FFT, compression, and hazard analyses expose reasoning quality, loops, and thinking budgets—not just final answers.

Hazard analysis showing temporal failure patterns
Deployment realism

Truncation + token cost

Truncations are first-class failures; score/token tracks efficiency so "expensive correctness" stops hiding behind averages.

Spiderweb plot with truncation and token efficiency metrics

Live tools you can use now

Explore the data, review the code, compare models, and inspect failure boundaries - all directly from your browser.

Start in 60 seconds

Analyze without running inference

The r12 dataset is ready to query. Pull it locally and start exploring reasoning surfaces.

git clone \
  https://github.com/the-crypt-keeper/reasonscape
cd reasonscape
python data.py pull dataset data/r12-leaderboard.json
python analyze.py evals data/r12-leaderboard.json
python explorer.py data/r12-leaderboard.json

Twelve cognitive domains

Breadth of coverage matters. r12 spans the core reasoning workloads required for practical Reasoning LLM applications. Click on any card below for detailed information!

Methodology: Five-stage pipeline

ReasonScape treats LLM evaluation as an information-processing pipeline—from parametric test generation through statistical scoring to forensic root-cause analysis.

Analysis Tools

Each tool addresses a specific evaluation question—from aggregate ranking to temporal reasoning behavior forensics.

ReasonScore

Unified metric with Wilson CIs, truncation penalties, geometric mean for balance, and score/token efficiency.

ReasonScore leaderboard
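As an illustration of how those ingredients combine, here is a toy aggregate, explicitly not ReasonScape's actual formula: a geometric mean (which punishes any collapsed task far more than an arithmetic mean would), a truncation penalty, and a per-token efficiency figure.

```python
import math

def score_sketch(task_accuracies, truncation_rate, tokens_per_answer):
    """Toy aggregate (illustrative only): geometric mean over tasks,
    scaled down by the truncation rate, plus score per 1K tokens."""
    eps = 1e-9  # avoid log(0) when a task collapses entirely
    gm = math.exp(sum(math.log(max(a, eps)) for a in task_accuracies)
                  / len(task_accuracies))
    penalized = gm * (1 - truncation_rate)
    return penalized, penalized / tokens_per_answer * 1000
```

A model scoring (1.0, 1.0, 0.4) across three tasks lands below a balanced (0.8, 0.8, 0.8) model, even though their arithmetic means are identical; that is the "balance" the geometric mean buys.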
Cluster

Statistical grouping using confidence interval overlap to identify models that are truly indistinguishable.

Cluster analysis
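A minimal sketch of interval-overlap tiering (the greedy rule here is an illustrative assumption, not ReasonScape's exact algorithm): walk models in score order and open a new tier only when a model's upper bound falls below the current tier leader's lower bound, i.e. when the two are statistically distinguishable.

```python
def cluster_by_overlap(models):
    """Greedy tiering over dicts with 'name', 'score', 'lo', 'hi' keys:
    a new tier starts only when a model's CI is disjoint from the
    current tier leader's CI."""
    ordered = sorted(models, key=lambda m: -m["score"])
    tiers, leader = [], None
    for m in ordered:
        if leader is None or m["hi"] < leader["lo"]:
            tiers.append([m])  # statistically below the leader: new tier
            leader = m
        else:
            tiers[-1].append(m)  # indistinguishable from the leader
    return tiers
```

Models inside one tier should be reported as tied; ranking within a tier would be reading noise.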
Surface

3D visualization of accuracy across parameter grids to identify capability cliffs and performance boundaries.

Surface analysis
FFT

Frequency domain analysis to distinguish tokenizer effects from model capabilities and output patterns.

FFT analysis
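The intuition behind frequency-domain analysis can be shown with a naive DFT over a hypothetical per-token signal (the `trace` below is synthetic, not real model output): a periodic pattern in the stream shows up as a sharp spectral peak that a simple average would never reveal.

```python
import cmath

def dft_magnitudes(series):
    """Naive DFT magnitudes (mean-removed). A strong peak at bin k means
    the signal repeats roughly every len(series)/k steps."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, x in enumerate(centered)))
            for k in range(n // 2)]

# Synthetic per-token trace with a spike every 8 tokens (period-8 structure):
trace = [5 + 2 * (i % 8 == 0) for i in range(64)]
spectrum = dft_magnitudes(trace)
```

The period-8 structure produces peaks at bins 8, 16, and 24, while aperiodic bins stay near zero; in practice an FFT replaces this O(n²) loop.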
Compression

Information-theoretic analysis revealing underthink/overthink patterns and reasoning loop detection.

Compression analysis
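The core trick is simple enough to sketch with the standard library (illustrative, not ReasonScape's pipeline): a reasoning trace stuck in a loop is highly redundant, so a general-purpose compressor shrinks it dramatically, while varied reasoning barely compresses.

```python
import hashlib
import zlib

def compression_ratio(text):
    """Bytes in / bytes out. Repetitive (looping) text compresses far more
    than varied text of the same length."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

# A stuck reasoning loop vs. a high-entropy string of the same length:
looping = "Wait, let me re-check that step. " * 36
varied = "".join(hashlib.sha256(str(i).encode()).hexdigest() for i in range(18))
```

A high ratio flags overthink/loop behavior; an unusually low token count with low redundancy can flag underthink. Either way the signal comes from the trace itself, not the final answer.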
Hazard

Temporal failure analysis showing when and how models fail during token generation.

Hazard analysis
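A minimal sketch of a discrete hazard curve (the data layout is an assumption for illustration): at each token position, the hazard is the fraction of runs that fail there among the runs still alive, with completed runs treated as censored.

```python
def hazard_curve(failure_positions, horizon):
    """Discrete hazard: P(fail at step t | survived to step t).
    failure_positions: token index where each run failed, or None if the
    run completed (censored at the horizon)."""
    at_risk = len(failure_positions)
    curve = []
    for t in range(horizon):
        fails = sum(1 for p in failure_positions if p == t)
        curve.append(fails / at_risk if at_risk else 0.0)
        at_risk -= fails  # failed runs leave the risk set
    return curve
```

Plotting this over token position shows *when* a model tends to fall apart during generation, not just how often it does.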
Accuracy-Projection

Shows how accuracy and completion-token distributions (faceted by correct vs. incorrect answers) change along a difficulty axis.

Accuracy projection analysis
Spiderweb

Radial cognitive profile with dual-axis (accuracy, token usage) along 12 tasks × 3 difficulties.

Spiderweb analysis
Pairwise

Bradley-Terry cohort comparisons for head-to-head model rankings.

Pairwise analysis
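Bradley-Terry is a standard model for inferring strengths from head-to-head outcomes; a minimal pure-Python sketch using the classic MM fixed-point iteration (the win-matrix layout here is illustrative, not ReasonScape's data format):

```python
def bradley_terry(wins, n_models, iters=200):
    """Bradley-Terry strengths from wins[i][j] = times model i beat model j,
    via the standard MM fixed-point iteration. Strengths are normalized to
    sum to n_models; only ratios are meaningful."""
    p = [1.0] * n_models
    for _ in range(iters):
        new = []
        for i in range(n_models):
            num = sum(wins[i][j] for j in range(n_models) if j != i)
            den = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                      for j in range(n_models) if j != i)
            new.append(num / den if den else p[i])
        s = sum(new)
        p = [x * n_models / s for x in new]
    return p
```

With two models at an 8-2 head-to-head record, the fitted strength ratio converges to 4:1, matching the observed odds.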

Three research workflows, one unified platform

ReasonScape supports three distinct research activities - each with different tools, questions, and outcomes.