🗺️ ReasonScape

LLM evaluation that considers how models think.

ReasonScape treats LLMs as the Information Processing Systems they are - addressing the blind spots of static benchmarks with parametric difficulty, truncation-aware scoring/clustering, and forensic analysis (FFT, compression, hazard). Proven on 6.5B tokens across 75+ models, and ready to use today.

At a glance
Tokens Analyzed: 6.5B
Models Benchmarked: 75+
Reasoning Tasks: 12
Evaluation Points: 30K+

We fix the biggest evaluation blind spots

Eight systemic problems in current LLM evaluation—solved with an information-processing pipeline.

Difficulty control

Parametric manifolds

Coordinate-based generation produces infinite, contamination-proof tests with controllable length, depth, interference, and format.

3D surface visualization showing performance across difficulty parameters
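For the curious, here is a minimal sketch of the idea behind coordinate-based generation. The task (simple arithmetic) and the way the coordinates shape it are purely illustrative, not ReasonScape's actual generators: the point is that a difficulty coordinate plus a seed deterministically yields a fresh test item, so there is no fixed question bank to contaminate training data.

import random

def make_case(length, depth, interference, seed):
    # Hypothetical generator: one difficulty coordinate -> one fresh test item.
    rng = random.Random(hash((length, depth, interference, seed)))
    # 'length' controls operand count, 'depth' controls nesting,
    # 'interference' controls how much irrelevant text surrounds the problem.
    terms = [rng.randint(0, 9) for _ in range(length)]
    expr = " + ".join(str(t) for t in terms)
    for _ in range(depth):
        expr = f"({expr}) + {rng.randint(0, 9)}"
    distractors = " ".join(rng.choice(["note:", "ignore", "filler"]) for _ in range(interference))
    prompt = f"{distractors}\nCompute: {expr}"
    answer = str(eval(expr))  # ground truth is known by construction
    return {"prompt": prompt, "answer": answer,
            "coords": {"length": length, "depth": depth, "interference": interference}}

case = make_case(length=4, depth=2, interference=3, seed=7)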
Truthful scoring

Per-point + confidence

Wilson CIs, excess accuracy, and tiering keep aggregate scores honest and make head-to-head comparisons statistically meaningful.

Statistical clustering with confidence intervals
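For reference, this is the standard Wilson score interval used for per-point accuracy (a textbook formula, not ReasonScape-specific code). It stays sensible even when accuracy sits near 0 or 1 or the sample at a difficulty point is small:

import math

def wilson_interval(successes, trials, z=1.96):
    # Wilson score interval for a binomial proportion.
    if trials == 0:
        return (0.0, 1.0)
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (max(0.0, center - half), min(1.0, center + half))

print(wilson_interval(27, 32))  # e.g. 27/32 correct at one difficulty point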
Process visibility

Forensic signals

FFT, compression, and hazard analyses expose reasoning quality, loops, and thinking budgets—not just final answers.

Hazard analysis showing temporal failure patterns
Deployment realism

Truncation + token cost

Truncations are first-class failures; score/token tracks efficiency so "expensive correctness" stops hiding behind averages.

Spiderweb plot with truncation and token efficiency metrics
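A hedged sketch of how such an efficiency metric can be computed (the field names and exact score/token definition are illustrative): truncated responses are scored as failures rather than dropped, and accuracy is divided by the mean tokens spent.

def efficiency(results):
    # Each result is assumed to look like:
    #   {"correct": bool, "truncated": bool, "tokens": int}
    # A truncated response counts as wrong, never excluded, so models that
    # only "solve" problems by exhausting the context don't get a pass.
    scored = [(r["correct"] and not r["truncated"]) for r in results]
    accuracy = sum(scored) / len(results)
    mean_tokens = sum(r["tokens"] for r in results) / len(results)
    return accuracy, accuracy / mean_tokens  # (accuracy, score per token)

acc, score_per_token = efficiency([
    {"correct": True, "truncated": False, "tokens": 420},
    {"correct": True, "truncated": True,  "tokens": 4096},  # hit the limit: counts as wrong
    {"correct": False, "truncated": False, "tokens": 180},
])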

Live tools you can use now

Explore the data, review the code, compare models, and inspect failure boundaries - all directly from your browser.

Start in 60 seconds

Analyze without running inference

The 6.5B-token m12x dataset is ready to query. Pull it locally and start exploring reasoning surfaces.

git clone https://github.com/the-crypt-keeper/reasonscape
cd reasonscape
curl https://reasonscape.com/data/m12x/m12x.db -o data/m12x.db
python analyze.py evals data/dataset-m12x.json
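The database schema isn't documented in this quick start, but assuming the downloaded .db file is a standard SQLite database, you can take a first look at what's inside using plain sqlite3 introspection (nothing ReasonScape-specific is assumed here):

import sqlite3

# Open the database pulled by the curl command above and list its tables.
con = sqlite3.connect("data/m12x.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
for t in tables:
    # Peek at each table's column names without assuming any schema.
    cols = [c[1] for c in con.execute(f'PRAGMA table_info("{t}")')]
    print(t, cols)
con.close()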

Twelve cognitive domains

Breadth of coverage matters. m12x spans the core reasoning workloads required for practical Reasoning LLM applications.

Methodology: Five-stage pipeline

ReasonScape treats LLM evaluation as an information-processing pipeline—from parametric test generation through statistical scoring to forensic root-cause analysis.

Analysis Tools

Each tool addresses a specific evaluation question, from aggregate ranking to forensic analysis of temporal reasoning behavior.

Analysis Tool

ReasonScore

Unified metric with Wilson CIs, truncation penalties, geometric mean for balance, and score/token efficiency.

ReasonScore leaderboard
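A hedged sketch of the aggregation idea (the exact ReasonScore formula is not reproduced here): take a conservative per-task value such as the Wilson lower bound, then combine tasks with a geometric mean so a collapse on any single task drags the overall score down.

import math

def reason_score_sketch(task_scores):
    # task_scores: one conservative accuracy-like value per task, in (0, 1].
    # The geometric mean rewards balance: excelling at a few tasks cannot
    # compensate for failing another.
    eps = 1e-6  # avoid log(0) when a task is scored at exactly zero
    logs = [math.log(max(s, eps)) for s in task_scores]
    return math.exp(sum(logs) / len(logs))

print(reason_score_sketch([0.82, 0.74, 0.91, 0.05]))  # one weak task dominates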
Analysis Tool

Cluster

Statistical grouping using confidence interval overlap to identify models that are truly indistinguishable.

Cluster analysis
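A minimal sketch of confidence-interval-overlap grouping (an illustrative greedy pass, not necessarily the exact clustering rule used): models whose intervals overlap the current tier leader are treated as statistically indistinguishable from it.

def cluster_by_overlap(models):
    # models: list of (name, ci_low, ci_high), e.g. Wilson intervals on overall score.
    ordered = sorted(models, key=lambda m: m[2], reverse=True)  # best upper bound first
    clusters, current = [], [ordered[0]]
    for m in ordered[1:]:
        leader = current[0]
        if m[2] >= leader[1]:      # m's upper bound reaches the leader's lower bound
            current.append(m)      # overlapping CIs: cannot be separated statistically
        else:
            clusters.append(current)
            current = [m]          # clear gap with the leader: start a new tier
    clusters.append(current)
    return clusters

tiers = cluster_by_overlap([("A", 0.71, 0.79), ("B", 0.69, 0.77), ("C", 0.52, 0.60)])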
Analysis Tool

Surface

3D visualization of accuracy across parameter grids to identify capability cliffs and performance boundaries.

Surface analysis
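A minimal plotting sketch, assuming you already have a 2-D accuracy grid over two difficulty parameters (the grid below is synthetic, built to show a capability cliff):

import numpy as np
import matplotlib.pyplot as plt

# Synthetic accuracy grid: accuracy falls off as length and interference grow,
# with a sharp cliff once combined difficulty crosses a threshold.
length = np.arange(1, 11)
interference = np.arange(0, 10)
L, I = np.meshgrid(length, interference)
accuracy = 1.0 / (1.0 + np.exp(0.8 * (L + I - 12)))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(L, I, accuracy, cmap="viridis")
ax.set_xlabel("length")
ax.set_ylabel("interference")
ax.set_zlabel("accuracy")
plt.show()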
Analysis Tool

FFT

Frequency domain analysis to distinguish tokenizer effects from model capabilities and output patterns.

FFT analysis
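A generic sketch of the frequency-domain idea, assuming the signal being analyzed is the sequence of per-token character lengths of a response (the actual ReasonScape signal may differ): strong periodic peaks suggest repetitive output patterns or tokenizer artifacts rather than varied reasoning.

import numpy as np

def spectrum(token_lengths):
    # Treat per-token character lengths as a 1-D signal and inspect its
    # frequency content.
    x = np.asarray(token_lengths, dtype=float)
    x = x - x.mean()                        # remove DC so the spectrum shows structure
    mag = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0)  # cycles per token
    return freqs, mag

freqs, mag = spectrum([3, 1, 4, 1, 5, 9, 2, 6] * 16)  # toy period-8 sequence
peak = freqs[mag.argmax()]                 # expect a peak near 1/8 cycles per token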
Analysis Tool

Compression

Information-theoretic analysis that reveals underthink/overthink patterns and detects reasoning loops.

Compression analysis
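A hedged sketch of the compression signal (an illustrative use of zlib, not the exact ReasonScape measure): highly repetitive reasoning traces compress far better than varied ones, so an unusually low compression ratio over a long trace is a cheap loop detector.

import zlib

def compression_ratio(text):
    # Ratio of compressed to raw size: values near typical prose indicate
    # varied content; values far below it suggest looping or padding.
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, level=9)) / len(raw)

varied = "Let me check each case separately and compare the intermediate results."
looping = "Wait, let me reconsider. " * 40
print(compression_ratio(varied), compression_ratio(looping))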
Analysis Tool

Hazard

Temporal failure analysis showing when and how models fail during token generation.

Hazard analysis
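A minimal sketch of an empirical hazard curve (the binning and field names are illustrative): for each token-position bin, the hazard is the fraction of still-running generations that fail within that bin, which shows where in the generation process failures concentrate.

from collections import Counter

def hazard_curve(runs, bin_size=256):
    # runs: list of (end_position, failed) pairs; end_position is the token index
    # where a generation ended (answer emitted, loop detected, or truncation),
    # and failed marks whether that ending counts as a failure.
    ends = Counter(pos // bin_size for pos, _ in runs)
    fails = Counter(pos // bin_size for pos, failed in runs if failed)
    at_risk = len(runs)
    curve = []
    for b in range(max(ends) + 1):
        h = fails.get(b, 0) / at_risk if at_risk else 0.0
        curve.append((b * bin_size, h))
        at_risk -= ends.get(b, 0)  # everything that ended here leaves the risk set
    return curve

print(hazard_curve([(180, False), (240, False), (300, True), (2047, True), (2047, True)]))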

Four research workflows, one unified platform

ReasonScape supports four distinct research activities - each with different tools, questions, and outcomes.