
m12x: The Reference

What is m12x?

m12x is ReasonScape's reference configuration—a complete evaluation of 75+ frontier models across 12 reasoning tasks, generating 6.5B tokens of analysis-ready data.

The extraordinary claim: LLMs are information processors and we can diagnose their failures like engineers diagnose signal systems.

The extraordinary evidence:

  • FFT reveals spectral signatures that differ by tokenizer and architecture
  • Compression shows underthink, overthink, and broken-loop patterns
  • Hazard analysis proves models have measurable "thinking budgets"
  • Surface plots reveal capability boundaries nobody knew existed
  • Statistical rigor confirms these patterns are signal, not noise
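
To see why compression is such a useful signal, consider a toy zlib check (this is only the intuition, not ReasonScape's pre-computed compression metric): a response stuck in a repetition loop compresses far better than genuinely varied reasoning.

import zlib

def compression_ratio(text: str) -> float:
    """Compressed size / raw size; lower means more repetitive text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

varied = "First compute 3*(4+2)=18, then subtract 5 to get 13."
looping = "Wait, let me reconsider the previous step. " * 50  # broken loop

print(compression_ratio(varied))   # relatively high: little redundancy
print(compression_ratio(looping))  # much lower: the loop collapses under compression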

Eating Our Own Dogfood

m12x is the ReasonScape reference implementation that proves:

  1. The five-stage architecture is practical, not theoretical — 6.5B tokens processed through the full pipeline
  2. The discovery-investigation loop enables real research — Documented workflows showing ping-pong between Stages 4 and 5
  3. The manifold/tier/surface abstractions scale to production — 12 tasks × 75 models × 3 tiers without architectural strain
  4. The forensic quartet isolates root causes — Case studies showing INPUT→REASONING→OUTPUT analysis chains
  5. You can build serious reasoning research on this foundation — m12x itself produced novel findings about model failure modes

Live Results Access

m12x Configuration

Task Coverage

m12x evaluates across 12 reasoning domains:

| Task | Focus | Primary Capabilities |
| --- | --- | --- |
| Arithmetic | Mathematical reasoning | Math, Symbolic Parsing, Structural Analysis |
| Boolean | Logical evaluation | Logic, Symbolic Parsing, Structural Analysis |
| Brackets | Structural parsing | Symbolic Parsing, Pattern Recognition, Structural Analysis |
| Objects | Selective attention | Selective Attention, Semantic Categorization, Language |
| Shuffle | State tracking | State Tracking, Selective Attention, Language |
| Sort | Algorithmic thinking | Symbolic Parsing, Pattern Recognition, Language |
| Dates | Temporal reasoning | Math, Pattern Recognition, Temporal Reasoning, Language |
| Letters | Character analysis | Math, Selective Attention, Symbolic Parsing, Language |
| Movies | Pattern recognition | Pattern Recognition, Semantic Categorization, Language |
| Sequence | Rule-based generation | Math, Logic, Symbolic Parsing, Language |
| Shapes | Spatial reasoning | Symbolic Parsing, Pattern Recognition, Spatial Reasoning |
| Cars | Logistics planning | State Tracking, Selective Attention, Spatial Reasoning, Language |

For complete task specifications and difficulty manifolds: See tasks.md for the abstract task API and tasks/*.md for implementation details.

Tier Definitions

m12x uses three difficulty tiers that scale with model capabilities:

| Tier | Degree | Density | Purpose |
| --- | --- | --- | --- |
| Easy | 0 | normal | Baseline difficulty for quick model comparison |
| Medium | 1 | normal | Standard evaluation difficulty |
| Hard | 2 | normal | Comprehensive research difficulty |

Tier mapping lives in: data/dataset-m12x.json

Design rationale: As models improve, new tiers can be added (e.g., "ultra" at degree=3) without changing manifold definitions. This ensures long-term reproducibility while enabling adaptive difficulty scaling.
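
The mapping itself is small; a hypothetical sketch of its shape (the actual schema inside data/dataset-m12x.json may differ) shows why adding a tier is an additive change that never touches the manifold definitions:

# Hypothetical tier -> manifold-coordinate mapping (illustrative schema only;
# the real structure in data/dataset-m12x.json may differ).
TIERS = {
    "easy":   {"degree": 0, "density": "normal"},
    "medium": {"degree": 1, "density": "normal"},
    "hard":   {"degree": 2, "density": "normal"},
    # A future "ultra" tier would only extend this table:
    # "ultra": {"degree": 3, "density": "normal"},
}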

Surface Definitions

Surfaces are 2D slices through difficulty manifolds, organized by task. Each surface explores the interaction between two difficulty parameters.

Example surfaces:

  • arithmetic_length_x_depth - How length and nesting depth interact
  • boolean_length_x_depth - Logical complexity vs expression length
  • objects_length_x_distractors - Selective attention under increasing load

Discovery: Use python analyze.py tasks data/dataset-m12x.json to see all available surfaces per task.

For surface visualization: See explorer.py and analyze.py surface.
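
Conceptually, a surface such as arithmetic_length_x_depth is just the Cartesian grid over two difficulty parameters with everything else pinned. A minimal sketch, using illustrative parameter names and ranges rather than the project's actual manifold schema:

from itertools import product

# Hypothetical axes for a length-x-depth surface.
lengths = [4, 8, 16, 32]   # expression length
depths = [1, 2, 3, 4]      # nesting depth

surface_points = [
    {"length": length, "depth": depth, "density": "normal"}  # other params fixed
    for length, depth in product(lengths, depths)
]
print(len(surface_points))  # 16 grid points, each sampled many times during evaluation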

Projection Definitions

Projections are 1D sweeps through difficulty manifolds, holding other parameters fixed.

Example projections:

  • arithmetic_length_sweep - Performance vs input length
  • boolean_depth_sweep - Performance vs nesting depth
  • objects_distractor_sweep - Performance vs distraction load

Discovery: Use python analyze.py tasks data/dataset-m12x.json to see all available projections per task.

For projection analysis: See analyze.py surface with projection filters.
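
A projection is the same grid collapsed to a single axis: sweep one parameter and pin everything else. Continuing the hypothetical example above:

# Hypothetical arithmetic_length_sweep: vary length, hold depth fixed.
lengths = [4, 8, 16, 32, 64]

projection_points = [
    {"length": length, "depth": 2, "density": "normal"}  # depth pinned
    for length in lengths
]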


Model Coverage

75+ frontier models (Nov 2024 - Jan 2025):

  • All major LLM families: DeepSeek, Qwen, Llama, Mistral, Phi, Gemma, etc.
  • Dense and sparse MoE architectures
  • 1B to 355B parameter ranges
  • Multiple quantization schemes (FP16, 4-bit, 8-bit)
  • 8k context

Evaluation Coverage:

  • 3 difficulty tiers per task (easy/medium/hard)
  • Multiple templates (zero-shot, chain-of-thought, system prompts)
  • Multiple samplers (temperature 0, temperature 0.7, top-p variations)
  • Adaptive precision (32-128 samples per point, CI-targeted stopping; see the sketch below)
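
The stopping rule behind that last bullet can be sketched in a few lines (illustrative only; the actual rule used by the m12x runner may differ in its interval construction and thresholds):

import math
import random

def run_point(sample_once, min_n=32, max_n=128, target_halfwidth=0.05):
    """Sample a pass/fail test point until the 95% CI on accuracy is tight
    enough or the cap is reached. Hypothetical parameters and interval."""
    successes, n = 0, 0
    while n < max_n:
        successes += sample_once()  # returns 1 (correct) or 0 (incorrect)
        n += 1
        if n >= min_n:
            p = successes / n
            halfwidth = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation
            if halfwidth <= target_halfwidth:
                break
    return successes / n, n

# Example: a simulated model that answers correctly 80% of the time.
accuracy, samples = run_point(lambda: int(random.random() < 0.8))
print(f"accuracy ~ {accuracy:.2f} after {samples} samples")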

Data Volume:

  • 6.5B tokens total (prompts + completions)
  • 30K+ evaluation points in PointsDB
  • Millions of test instances generated deterministically
  • Complete response logs with compression and FFT pre-computed

Group Taxonomy

Groups are classification tags that enable peer comparison and cross-cutting analysis. m12x uses three primary dimensions:

1. Architecture Type (arch:)

  • arch:dense - Standard transformer models: All parameters active per token, predictable compute

  • arch:moe - Mixture of Experts with sparse activation: Reduced compute via expert routing

  • arch:ssm - State-space model architectures: Recurrent processing with linear complexity

  • arch:hybrid - Models combining multiple mechanisms: Multiple types of layers mixed together

2. Size Category (size:)

Size categories are based on active parameters (for MoE models, use the activated count rather than the total); a sketch of the thresholds follows this list:

  • size:tiny - <3B active parameters (mobile deployment)
  • size:small - 3B-10B active parameters (desktop deployment)
  • size:mid - 10B-30B active parameters (server deployment)
  • size:large - 30B+ active parameters (enterprise deployment)
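
A hypothetical helper showing how those thresholds might be applied (boundary handling here is an assumption; in practice the tags are declared per eval in data/dataset-m12x.json):

def size_group(active_params_b: float) -> str:
    """Map active parameters (billions) to an m12x size tag.
    Boundary conventions are an illustrative assumption."""
    if active_params_b < 3:
        return "size:tiny"
    if active_params_b < 10:
        return "size:small"
    if active_params_b < 30:
        return "size:mid"
    return "size:large"

print(size_group(8))   # size:small, e.g. an 8B dense model
print(size_group(17))  # size:mid, e.g. a MoE with ~17B active parameters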

3. Model Family (family:)

  • family:llama - Meta's Llama family
  • family:phi4 - Microsoft's Phi-4 series
  • family:gemma3 - Google's Gemma 3 series
  • family:qwen3 - Alibaba's Qwen 3 series
  • family:granite - IBM's Granite series
  • family:gpt-oss - OpenAI's GPT-OSS series
  • (and more)
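
Because groups are plain string tags on each eval entry, peer comparison is just tag filtering. A hedged convenience snippet (assuming the top-level "evals" list shown in the configuration examples below):

import json

with open("data/dataset-m12x.json") as f:
    dataset = json.load(f)

# Select mid-sized MoE models for a peer comparison.
peers = [
    entry["label"] for entry in dataset["evals"]
    if {"arch:moe", "size:mid"} <= set(entry.get("groups", []))
]
print(peers)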

Extending m12x

Adding a New Model to m12x

Step 1: Run the Evaluation

# Runs all three m12x tiers (degree 0,1,2 = easy/medium/hard) at normal density.
# --precision controls the adaptive sampling budget (see Adaptive precision above).
python runner.py --config configs/m12x.yaml \
  --degree 0,1,2 --density normal --precision low \
  --model new-model-70b \
  --apibase http://localhost:8000 \
  --template zerocot-nosys \
  --sampler greedy-max

This writes interview NDJSON files to results/<timestamp>/new-model-70b+zerocot-nosys+greedy-max/.

Step 2: Organize Interview Data

# Create experiment directory
mkdir -p data/m12x/new-model-70b-exp001

# Move interview files
mv results/<timestamp>/new-model-70b+zerocot-nosys+greedy-max/* \
   data/m12x/new-model-70b-exp001/

Step 3: Add to Dataset Config

Edit data/dataset-m12x.json and add a new eval entry:

{
  "evals": [
    {
      "evaluate": {
        "glob": "data/m12x/new-model-70b-exp001/*"
      },
      "filters": {
        "model": "new-model-70b",
        "template": "zerocot-nosys",
        "sampler": "greedy-max"
      },
      "label": "New Model 70B",
      "groups": [
        "family:new",
        "arch:dense",
        "size:large"
      ],
      "hf_id": "organization/new-model-70b",
      "hf_quant_id": null
    }
  ]
}
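
Before moving on to Step 4, it can be worth a quick sanity check that the glob in the new entry actually matches the files moved in Step 2. This is a convenience snippet, not part of the toolchain:

import glob
import json

with open("data/dataset-m12x.json") as f:
    dataset = json.load(f)

for entry in dataset["evals"]:
    pattern = entry["evaluate"]["glob"]
    matches = glob.glob(pattern)
    # Zero matches usually means a typo in the glob or a misplaced directory.
    print(f"{entry['label']}: {len(matches)} interview files for {pattern}")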

Step 4: Process Evaluation

# Process all missing evaluations (idempotent)
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

Step 5: Verify Integration

# Verify eval appears in dataset
python analyze.py evals data/dataset-m12x.json --search "new model"

# Check database size
ls -lh data/m12x.db

# Quick leaderboard check
python analyze.py scores data/dataset-m12x.json \
  --filters '{"model": "new-model-70b"}' --format markdown

Context Simulation

You can simulate lower context limits from high-context evaluations:

{
  "evals": [
    {
      "evaluate": {
        "glob": "data/m12x/*phi-4-fp16*/*",
        "context": 8192
      },
      "filters": {
        "model": "phi-4-fp16",
        "template": "zerocot-nosys",
        "sampler": "greedy-max-ctx8192"
      },
      "label": "Microsoft Phi-4 (8k ctx)",
      "groups": ["family:phi4", "arch:dense", "size:mid"],
      "hf_id": "microsoft/Phi-4"
    }
  ]
}

How it works:

  1. Uses the same raw interview data as the base eval
  2. evaluate.py clips responses to fit in the simulated context
  3. The sampler is auto-suffixed: greedy-max → greedy-max-ctx8192
  4. Stored as a separate eval with a unique eval_id
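
A rough sketch of the clipping idea (the real token accounting in evaluate.py will differ; this only shows the principle of deriving an 8k-context eval from longer raw interviews):

def clip_to_context(prompt_tokens: int, completion_tokens: list[str],
                    context: int = 8192) -> list[str]:
    """Keep only as many completion tokens as fit alongside the prompt.
    Illustrative only; evaluate.py's actual accounting may differ."""
    budget = max(context - prompt_tokens, 0)
    return completion_tokens[:budget]

# A response that used 9000 tokens of thinking gets truncated, and is then
# scored on whatever answer (if any) survives the cut.
clipped = clip_to_context(prompt_tokens=500, completion_tokens=["tok"] * 9000)
print(len(clipped))  # 7692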

Building Your Own Evaluation

Use m12x as a reference:

  • Study manifold definitions in tasks/*.json to understand difficulty parameterization
  • Examine tier mappings in data/dataset-m12x.json to see how easy/medium/hard scale
  • Review surface/projection naming conventions for organizing analysis
  • Analyze precision configurations to calibrate your own cost/confidence tradeoffs

Fork and adapt:

  • Start with m12x manifolds and adjust difficulty ranges for your models
  • Modify tier thresholds as model capabilities improve (add "ultra" when "hard" becomes easy)
  • Add new surfaces based on your research questions (any two parameters can form a surface)
  • Use m12x's dataset structure as a template for your own evaluations

Validate your changes:

  • Compare your results against m12x baselines
  • Use m12x's cognitive archetypes as reference patterns
  • Leverage m12x's forensic case studies as investigation templates


Researching with m12x

Beyond the reference evaluation: worked examples of the ReasonScape methodology in action.

The m12x reference dataset proves the methodology works at scale. But how do you actually use it to answer research questions?

6 research subdatasets demonstrate the complete workflow:

  • Seed-OSS thinking budget optimization (7 configs, overthinking loop diagnosis)
  • Qwen3-Next RC budget allocation (budget distribution matters more than size)
  • Qwen3 family deep dive (19 variants across sizes and capabilities)
  • Plus Ministral3, GPT-OSS, and K2Think investigations

Each subdataset is a template showing:

  • Hypothesis formation and experimental design
  • Discovery-investigation ping-pong with analyze.py
  • Evidence chain building with compression/hazard/surface plots
  • Statistical validation and graduation decisions
  • Complete SUMMARY.md documenting the story

Learn by example: See datasets.md for complete documentation including:

  • Dataset taxonomy (main vs research)
  • The graduation pattern (research → main for winners)
  • Dataset lifecycle (hypothesis → experiments → evidence → graduation)
  • Case studies with worked examples
  • Best practices from real investigations

See Also