m12x: The Reference¶
What is m12x?¶
m12x is ReasonScape's reference configuration—a complete evaluation of 75+ frontier models across 12 reasoning tasks, generating 6.5B tokens of analysis-ready data.
The extraordinary claim: LLMs are information processors and we can diagnose their failures like engineers diagnose signal systems.
The extraordinary evidence:
- FFT reveals spectral signatures that differ by tokenizer/architecture
- Compression shows underthink, overthink, and broken-loop patterns
- Hazard analysis proves models have measurable "thinking budgets"
- Surface plots reveal capability boundaries nobody knew existed
- Statistical rigor confirms these patterns are signal, not noise
Eating Our Own Dogfood¶
m12x is the ReasonScape reference implementation that proves:
- ✅ The five-stage architecture is practical, not theoretical — 6.5B tokens processed through the full pipeline
- ✅ The discovery-investigation loop enables real research — Documented workflows showing ping-pong between Stages 4 and 5
- ✅ The manifold/tier/surface abstractions scale to production — 12 tasks × 75 models × 3 tiers without architectural strain
- ✅ The forensic quartet isolates root causes — Case studies showing INPUT→REASONING→OUTPUT analysis chains
- ✅ You can build serious reasoning research on this foundation — m12x itself produced novel findings about model failure modes
Live Results Access¶
- M12X Leaderboard: https://reasonscape.com/m12x/leaderboard
- M12X Explorer: https://reasonscape.com/m12x/explorer (PC required)
- M12X Dataset: https://reasonscape.com/data/m12x
m12x Configuration¶
Task Coverage¶
m12x evaluates across 12 reasoning domains:
| Task | Focus | Primary Capabilities |
|---|---|---|
| Arithmetic | Mathematical reasoning | Math, Symbolic Parsing, Structural Analysis |
| Boolean | Logical evaluation | Logic, Symbolic Parsing, Structural Analysis |
| Brackets | Structural parsing | Symbolic Parsing, Pattern Recognition, Structural Analysis |
| Objects | Selective attention | Selective Attention, Semantic Categorization, Language |
| Shuffle | State tracking | State Tracking, Selective Attention, Language |
| Sort | Algorithmic thinking | Symbolic Parsing, Pattern Recognition, Language |
| Dates | Temporal reasoning | Math, Pattern Recognition, Temporal Reasoning, Language |
| Letters | Character analysis | Math, Selective Attention, Symbolic Parsing, Language |
| Movies | Pattern recognition | Pattern Recognition, Semantic Categorization, Language |
| Sequence | Rule-based generation | Math, Logic, Symbolic Parsing, Language |
| Shapes | Spatial reasoning | Symbolic Parsing, Pattern Recognition, Spatial Reasoning |
| Cars | Logistics planning | State Tracking, Selective Attention, Spatial Reasoning, Language |
For complete task specifications and difficulty manifolds: See tasks.md for the abstract task API and tasks/*.md for implementation details.
Tier Definitions¶
m12x uses three difficulty tiers that scale with model capabilities:
| Tier | Degree | Density | Purpose |
|---|---|---|---|
| Easy | 0 | normal | Baseline difficulty for quick model comparison |
| Medium | 1 | normal | Standard evaluation difficulty |
| Hard | 2 | normal | Comprehensive research difficulty |
Tier mapping lives in: data/dataset-m12x.json
Design rationale: As models improve, new tiers can be added (e.g., "ultra" at degree=3) without changing manifold definitions. This ensures long-term reproducibility while enabling adaptive difficulty scaling.
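Rendered as a Python dict purely for illustration (the actual schema inside data/dataset-m12x.json may differ), the tier table above amounts to:

```python
# Illustrative only: the authoritative mapping lives in data/dataset-m12x.json.
TIERS = {
    "easy":   {"degree": 0, "density": "normal"},
    "medium": {"degree": 1, "density": "normal"},
    "hard":   {"degree": 2, "density": "normal"},
    # A future tier can be appended without touching any manifold definition:
    # "ultra": {"degree": 3, "density": "normal"},
}
```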
Surface Definitions¶
Surfaces are 2D slices through difficulty manifolds, organized by task. Each surface explores the interaction between two difficulty parameters.
Example surfaces:
- arithmetic_length_x_depth - How length and nesting depth interact
- boolean_length_x_depth - Logical complexity vs expression length
- objects_length_x_distractors - Selective attention under increasing load
Discovery: Use python analyze.py tasks data/dataset-m12x.json to see all available surfaces per task.
For surface visualization: See explorer.py and analyze.py surface.
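To make the idea concrete, here is a minimal, self-contained sketch of a surface as a 2D accuracy grid. The function evaluate_point, the parameter names, and the synthetic scoring are invented for illustration; they are not part of the ReasonScape pipeline:

```python
import numpy as np

def evaluate_point(length: int, depth: int) -> float:
    """Stand-in for the real pipeline, which generates test cases at this
    manifold point and scores model responses; here it is a synthetic decay."""
    return float(np.exp(-0.05 * length - 0.2 * depth))

# A surface samples accuracy over a grid of two difficulty parameters
# (here "length" x "depth", names chosen for illustration).
lengths = np.arange(4, 33, 4)   # first difficulty axis
depths = np.arange(1, 9)        # second difficulty axis
surface = np.array([[evaluate_point(length, depth) for depth in depths]
                    for length in lengths])
print(surface.shape)            # (8, 8): one accuracy estimate per grid cell
```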
Projection Definitions¶
Projections are 1D sweeps through difficulty manifolds, holding other parameters fixed.
Example projections:
- arithmetic_length_sweep - Performance vs input length
- boolean_depth_sweep - Performance vs nesting depth
- objects_distractor_sweep - Performance vs distraction load
Discovery: Use python analyze.py tasks data/dataset-m12x.json to see all available projections per task.
For projection analysis: See analyze.py surface with projection filters.
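Continuing the illustrative grid from the surface sketch above (still hypothetical, not ReasonScape code), a projection is simply one row or column of such a grid:

```python
import numpy as np

surface = np.zeros((8, 8))      # placeholder for the surface grid built above

# A projection holds one difficulty parameter fixed and sweeps the other:
length_sweep = surface[:, 2]    # accuracy vs length, depth fixed at index 2
depth_sweep = surface[0, :]     # accuracy vs depth, length fixed at index 0
```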
Model Coverage¶
75+ frontier models (Nov 2024 - Jan 2025):
- All major LLM families: DeepSeek, Qwen, Llama, Mistral, Phi, Gemma, etc.
- Dense and sparse MoE architectures
- 1B to 355B parameter ranges
- Multiple quantization schemes (FP16, 4-bit, 8-bit)
- 8k context
Evaluation Coverage:
- 3 difficulty tiers per task (easy/medium/hard)
- Multiple templates (zero-shot, chain-of-thought, system prompts)
- Multiple samplers (temperature 0, temperature 0.7, top-p variations)
- Adaptive precision (32-128 samples per point, CI-targeted stopping)
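The adaptive-precision idea above can be sketched as follows. The stopping rule, thresholds, and normal-approximation confidence interval are illustrative assumptions, not m12x's actual precision configuration:

```python
import math
import random

def adaptive_accuracy(run_test, min_samples=32, max_samples=128, ci_target=0.08):
    """Sample until the 95% confidence half-width on accuracy drops below
    ci_target, bounded by min/max sample counts. Illustrative only."""
    successes = n = 0
    while n < max_samples:
        successes += bool(run_test())
        n += 1
        if n >= min_samples:
            p = successes / n
            half_width = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation
            if half_width <= ci_target:
                break
    return successes / n, n

# Example: a simulated model that answers correctly 70% of the time.
accuracy, used = adaptive_accuracy(lambda: random.random() < 0.7)
print(f"accuracy ~ {accuracy:.2f} after {used} samples")
```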
Data Volume:
- 6.5B tokens total (prompts + completions)
- 30K+ evaluation points in PointsDB
- Millions of test instances generated deterministically
- Complete response logs with compression and FFT pre-computed
Group Taxonomy¶
Groups are classification tags that enable peer comparison and cross-cutting analysis. m12x uses three primary dimensions:
1. Architecture Type (arch:)¶
- arch:dense - Standard transformer models: all parameters active per token, predictable compute
- arch:moe - Mixture of Experts with sparse activation: reduced compute via expert routing
- arch:ssm - State-space model architectures: recurrent processing with linear complexity
- arch:hybrid - Models combining multiple mechanisms: multiple layer types mixed together
2. Size Category (size:)¶
Sizes are based on active parameter counts (MoE models are categorized by active, not total, parameters):
- size:tiny - <3B active parameters (mobile deployment)
- size:small - 3B-10B active parameters (desktop deployment)
- size:mid - 10B-30B active parameters (server deployment)
- size:large - 30B+ active parameters (enterprise deployment)
3. Model Family (family:)¶
- family:llama - Meta's Llama family
- family:phi4 - Microsoft's Phi-4 series
- family:gemma3 - Google's Gemma 3 series
- family:qwen3 - Alibaba's Qwen 3 series
- family:granite - IBM's Granite series
- family:gpt-oss - OpenAI's GPT-OSS series
- (and more)
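As a toy illustration of how these tags enable peer comparison, the records and helper below are hypothetical, shaped like the dataset-config entries shown later on this page:

```python
# Hypothetical eval records carrying group tags:
evals = [
    {"label": "New Model 70B",   "groups": ["family:new",  "arch:dense", "size:large"]},
    {"label": "Microsoft Phi-4", "groups": ["family:phi4", "arch:dense", "size:mid"]},
]

def peers(evals, tag):
    """Return the labels of all evals carrying a given group tag."""
    return [e["label"] for e in evals if tag in e["groups"]]

print(peers(evals, "arch:dense"))   # ['New Model 70B', 'Microsoft Phi-4']
print(peers(evals, "size:large"))   # ['New Model 70B']
```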
Extending m12x¶
Adding a New Model to m12x¶
Step 1: Run the Evaluation
python runner.py --config configs/m12x.yaml \
--degree 0,1,2 --density normal --precision low \
--model new-model-70b \
--apibase http://localhost:8000 \
--template zerocot-nosys \
--sampler greedy-max
This creates interview NDJSON files in results/timestamp/new-model-70b+zerocot-nosys+greedy-max/.
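If you want to sanity-check the raw output before organizing it, a minimal sketch along these lines can help; the NDJSON record schema is not documented here, so this only counts records and lists the keys of the first one:

```python
import glob
import json

# Peek at the interview NDJSON files produced by runner.py.
for path in glob.glob("results/*/new-model-70b+zerocot-nosys+greedy-max/*"):
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    keys = sorted(records[0].keys()) if records else []
    print(f"{path}: {len(records)} records, keys={keys}")
```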
Step 2: Organize Interview Data
# Create experiment directory
mkdir -p data/m12x/new-model-70b-exp001
# Move interview files
mv results/timestamp/new-model-70b+zerocot-nosys+greedy-max/* \
data/m12x/new-model-70b-exp001/
Step 3: Add to Dataset Config
Edit data/dataset-m12x.json and add a new eval entry:
{
"evals": [
{
"evaluate": {
"glob": "data/m12x/new-model-70b-exp001/*"
},
"filters": {
"model": "new-model-70b",
"template": "zerocot-nosys",
"sampler": "greedy-max"
},
"label": "New Model 70B",
"groups": [
"family:new",
"arch:dense",
"size:large"
],
"hf_id": "organization/new-model-70b",
"hf_quant_id": null
}
]
}
Step 4: Process Evaluation
# Process all missing evaluations (idempotent)
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
Step 5: Verify Integration
# Verify eval appears in dataset
python analyze.py evals data/dataset-m12x.json --search "new model"
# Check database size
ls -lh data/m12x.db
# Quick leaderboard check
python analyze.py scores data/dataset-m12x.json \
--filters '{"model": "new-model-70b"}' --format markdown
Context Simulation¶
You can simulate lower context limits from high-context evaluations:
{
"evals": [
{
"evaluate": {
"glob": "data/m12x/*phi-4-fp16*/*",
"context": 8192
},
"filters": {
"model": "phi-4-fp16",
"template": "zerocot-nosys",
"sampler": "greedy-max-ctx8192"
},
"label": "Microsoft Phi-4 (8k ctx)",
"groups": ["family:phi4", "arch:dense", "size:mid"],
"hf_id": "microsoft/Phi-4"
}
]
}
How it works:
1. Uses the same raw interview data as the base eval
2. evaluate.py clips responses to fit within the simulated context (see the sketch below)
3. The sampler is auto-suffixed: greedy-max → greedy-max-ctx8192
4. Results are stored as a separate eval with a unique eval_id
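A minimal sketch of the clipping idea follows; the token-level policy and helper names are assumptions for illustration, not evaluate.py's actual implementation:

```python
def clip_to_context(prompt_tokens, response_tokens, context_limit=8192):
    """Keep the prompt intact and truncate the response so that
    prompt + response fits the simulated window. Illustrative only:
    evaluate.py's real clipping policy may differ."""
    budget = context_limit - len(prompt_tokens)
    return response_tokens[:max(budget, 0)]

def simulated_sampler_name(sampler, context_limit):
    """Mirror the auto-suffixing described above."""
    return f"{sampler}-ctx{context_limit}"

clipped = clip_to_context(list(range(8000)), list(range(500)))
print(len(clipped))                                # 192 tokens fit the window
print(simulated_sampler_name("greedy-max", 8192))  # greedy-max-ctx8192
```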
Building Your Own Evaluation¶
Use m12x as a reference:
- Study manifold definitions in tasks/*.json to understand difficulty parameterization
- Examine tier mappings in data/dataset-m12x.json to see how easy/medium/hard scale
- Review surface/projection naming conventions for organizing analysis
- Analyze precision configurations to calibrate your own cost/confidence tradeoffs
Fork and adapt:
- Start with m12x manifolds and adjust difficulty ranges for your models
- Modify tier thresholds as model capabilities improve (add "ultra" when "hard" becomes easy)
- Add new surfaces based on your research questions (any two parameters can form a surface)
- Use m12x's dataset structure as a template for your own evaluations
Validate your changes:
- Compare your results against m12x baselines
- Use m12x's cognitive archetypes as reference patterns
- Leverage m12x's forensic case studies as investigation templates
Researching with m12x¶
Beyond the reference evaluation: worked examples of the ReasonScape methodology in action.
The m12x reference dataset proves the methodology works at scale. But how do you actually use it to answer research questions?
Six research subdatasets demonstrate the complete workflow:
- Seed-OSS thinking budget optimization (7 configs, overthinking loop diagnosis)
- Qwen3-Next RC budget allocation (budget distribution matters more than size)
- Qwen3 family deep dive (19 variants across sizes and capabilities)
- Plus Ministral3, GPT-OSS, and K2Think investigations
Each subdataset is a template showing:
- Hypothesis formation and experimental design
- Discovery-investigation ping-pong with analyze.py
- Evidence chain building with compression/hazard/surface plots
- Statistical validation and graduation decisions
- Complete SUMMARY.md documenting the story
Learn by example: See datasets.md for complete documentation including:
- Dataset taxonomy (main vs research)
- The graduation pattern (research → main for winners)
- Dataset lifecycle (hypothesis → experiments → evidence → graduation)
- Case studies with worked examples
- Best practices from real investigations
See Also¶
- Datasets - Research subdatasets and worked examples (Pillar 6)
- Architecture Overview - The extraordinary claim and five-stage philosophy
- Workflow Guide - Research workflows and examples
- Tools - Complete tool reference
- Tasks - Abstract task API specifications
- Tasks/*.md - Individual task implementation details