
m12x: The Reference

What is m12x?

m12x is ReasonScape's reference configuration—a complete evaluation of 75+ frontier models across 12 reasoning tasks, generating 6.5B tokens of analysis-ready data.

The extraordinary claim: LLMs are information processors and we can diagnose their failures like engineers diagnose signal systems.

The extraordinary evidence:

  • FFT reveals spectral signatures that differ by tokenizer and architecture
  • Compression shows underthink, overthink, and broken-loop patterns
  • Hazard analysis proves models have measurable "thinking budgets"
  • Surface plots reveal capability boundaries nobody knew existed
  • Statistical rigor confirms these patterns are signal, not noise
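
To see why compression is such a useful signal, consider a toy zlib check (this is only the intuition, not ReasonScape's pre-computed compression metric): a response stuck in a repetition loop compresses far better than genuinely varied reasoning.

import zlib

def compression_ratio(text: str) -> float:
    """Compressed size / raw size; lower means more repetitive text."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw)) / len(raw)

varied = "First compute 3*(4+2)=18, then subtract 5 to get 13."
looping = "Wait, let me reconsider the previous step. " * 50  # broken loop

print(compression_ratio(varied))   # relatively high: little redundancy
print(compression_ratio(looping))  # much lower: the loop collapses under compression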

Eating Our Own Dogfood

m12x is the ReasonScape reference implementation that proves:

  1. The five-stage architecture is practical, not theoretical — 6.5B tokens processed through the full pipeline
  2. The discovery-investigation loop enables real research — Documented workflows showing ping-pong between Stages 4 and 5
  3. The manifold/tier/surface abstractions scale to production — 12 tasks × 75 models × 3 tiers without architectural strain
  4. The forensic quartet isolates root causes — Case studies showing INPUT→REASONING→OUTPUT analysis chains
  5. You can build serious reasoning research on this foundation — m12x itself produced novel findings about model failure modes

Live Results Access

m12x Configuration

Task Coverage

m12x evaluates across 12 reasoning domains:

| Task | Focus | Primary Capabilities |
| --- | --- | --- |
| Arithmetic | Mathematical reasoning | Math, Symbolic Parsing, Structural Analysis |
| Boolean | Logical evaluation | Logic, Symbolic Parsing, Structural Analysis |
| Brackets | Structural parsing | Symbolic Parsing, Pattern Recognition, Structural Analysis |
| Objects | Selective attention | Selective Attention, Semantic Categorization, Language |
| Shuffle | State tracking | State Tracking, Selective Attention, Language |
| Sort | Algorithmic thinking | Symbolic Parsing, Pattern Recognition, Language |
| Dates | Temporal reasoning | Math, Pattern Recognition, Temporal Reasoning, Language |
| Letters | Character analysis | Math, Selective Attention, Symbolic Parsing, Language |
| Movies | Pattern recognition | Pattern Recognition, Semantic Categorization, Language |
| Sequence | Rule-based generation | Math, Logic, Symbolic Parsing, Language |
| Shapes | Spatial reasoning | Symbolic Parsing, Pattern Recognition, Spatial Reasoning |
| Cars | Logistics planning | State Tracking, Selective Attention, Spatial Reasoning, Language |

For complete task specifications and difficulty manifolds: See tasks.md for the abstract task API and tasks/*.md for implementation details.

Tier Definitions

m12x uses three difficulty tiers that scale with model capabilities:

| Tier | Degree | Density | Purpose |
| --- | --- | --- | --- |
| Easy | 0 | normal | Baseline difficulty for quick model comparison |
| Medium | 1 | normal | Standard evaluation difficulty |
| Hard | 2 | normal | Comprehensive research difficulty |

Tier mapping lives in: data/dataset-m12x.json

Design rationale: As models improve, new tiers can be added (e.g., "ultra" at degree=3) without changing manifold definitions. This ensures long-term reproducibility while enabling adaptive difficulty scaling.
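
The mapping itself is small; a hypothetical sketch of its shape (the actual schema inside data/dataset-m12x.json may differ) shows why adding a tier is an additive change that never touches the manifold definitions:

# Hypothetical tier -> manifold-coordinate mapping (illustrative schema only;
# the real structure in data/dataset-m12x.json may differ).
TIERS = {
    "easy":   {"degree": 0, "density": "normal"},
    "medium": {"degree": 1, "density": "normal"},
    "hard":   {"degree": 2, "density": "normal"},
    # A future "ultra" tier would only extend this table:
    # "ultra": {"degree": 3, "density": "normal"},
}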

Surface Definitions

Surfaces are 2D slices through difficulty manifolds, organized by task. Each surface explores the interaction between two difficulty parameters.

Example surfaces:

  • arithmetic_length_x_depth - How length and nesting depth interact
  • boolean_length_x_depth - Logical complexity vs expression length
  • objects_length_x_distractors - Selective attention under increasing load

Discovery: Use python analyze.py tasks data/dataset-m12x.json to see all available surfaces per task.

For surface visualization: See explorer.py and analyze.py surface.
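
Conceptually, a surface such as arithmetic_length_x_depth is just the Cartesian grid over two difficulty parameters with everything else pinned. A minimal sketch, using illustrative parameter names and ranges rather than the project's actual manifold schema:

from itertools import product

# Hypothetical axes for a length-x-depth surface.
lengths = [4, 8, 16, 32]   # expression length
depths = [1, 2, 3, 4]      # nesting depth

surface_points = [
    {"length": length, "depth": depth, "density": "normal"}  # other params fixed
    for length, depth in product(lengths, depths)
]
print(len(surface_points))  # 16 grid points, each sampled many times during evaluation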

Projection Definitions

Projections are 1D sweeps through difficulty manifolds, holding other parameters fixed.

Example projections:

  • arithmetic_length_sweep - Performance vs input length
  • boolean_depth_sweep - Performance vs nesting depth
  • objects_distractor_sweep - Performance vs distraction load

Discovery: Use python analyze.py tasks data/dataset-m12x.json to see all available projections per task.

For projection analysis: See analyze.py surface with projection filters.
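
A projection is the same grid collapsed to a single axis: sweep one parameter and pin everything else. Continuing the hypothetical example above:

# Hypothetical arithmetic_length_sweep: vary length, hold depth fixed.
lengths = [4, 8, 16, 32, 64]

projection_points = [
    {"length": length, "depth": 2, "density": "normal"}  # depth pinned
    for length in lengths
]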


Model Coverage

75+ frontier models (Nov 2024 - Jan 2025):

  • All major LLM families: DeepSeek, Qwen, Llama, Mistral, Phi, Gemma, etc.
  • Dense and sparse MoE architectures
  • 1B to 355B parameter ranges
  • Multiple quantization schemes (FP16, 4-bit, 8-bit)
  • 8k context

Evaluation Coverage:

  • 3 difficulty tiers per task (easy/medium/hard)
  • Multiple templates (zero-shot, chain-of-thought, system prompts)
  • Multiple samplers (temperature 0, temperature 0.7, top-p variations)
  • Adaptive precision (32-128 samples per point, CI-targeted stopping; see the sketch below)
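
The stopping rule behind that last bullet can be sketched in a few lines (illustrative only; the actual rule used by the m12x runner may differ in its interval construction and thresholds):

import math
import random

def run_point(sample_once, min_n=32, max_n=128, target_halfwidth=0.05):
    """Sample a pass/fail test point until the 95% CI on accuracy is tight
    enough or the cap is reached. Hypothetical parameters and interval."""
    successes, n = 0, 0
    while n < max_n:
        successes += sample_once()  # returns 1 (correct) or 0 (incorrect)
        n += 1
        if n >= min_n:
            p = successes / n
            halfwidth = 1.96 * math.sqrt(p * (1 - p) / n)  # normal approximation
            if halfwidth <= target_halfwidth:
                break
    return successes / n, n

# Example: a simulated model that answers correctly 80% of the time.
accuracy, samples = run_point(lambda: int(random.random() < 0.8))
print(f"accuracy ~ {accuracy:.2f} after {samples} samples")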

Data Volume:

  • 6.5B tokens total (prompts + completions)
  • 30K+ evaluation points in PointsDB
  • Millions of test instances generated deterministically
  • Complete response logs with compression and FFT pre-computed

Group Taxonomy

Groups are classification tags that enable peer comparison and cross-cutting analysis. m12x uses three primary dimensions:

1. Architecture Type (arch:)

  • arch:dense - Standard transformer models: All parameters active per token, predictable compute

  • arch:moe - Mixture of Experts with sparse activation: Reduced compute via expert routing

  • arch:ssm - State-space model architectures: Recurrent processing with linear complexity

  • arch:hybrid - Models combining multiple mechanisms: Multiple types of layers mixed together

2. Size Category (size:)

Size categories are based on active parameters (for MoE models, use the activated count rather than the total); a sketch of the thresholds follows this list:

  • size:tiny - <3B active parameters (mobile deployment)
  • size:small - 3B-10B active parameters (desktop deployment)
  • size:mid - 10B-30B active parameters (server deployment)
  • size:large - 30B+ active parameters (enterprise deployment)
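
A hypothetical helper showing how those thresholds might be applied (boundary handling here is an assumption; in practice the tags are declared per eval in data/dataset-m12x.json):

def size_group(active_params_b: float) -> str:
    """Map active parameters (billions) to an m12x size tag.
    Boundary conventions are an illustrative assumption."""
    if active_params_b < 3:
        return "size:tiny"
    if active_params_b < 10:
        return "size:small"
    if active_params_b < 30:
        return "size:mid"
    return "size:large"

print(size_group(8))   # size:small, e.g. an 8B dense model
print(size_group(17))  # size:mid, e.g. a MoE with ~17B active parameters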

3. Model Family (family:)

  • family:llama - Meta's Llama family
  • family:phi4 - Microsoft's Phi-4 series
  • family:gemma3 - Google's Gemma 3 series
  • family:qwen3 - Alibaba's Qwen 3 series
  • family:granite - IBM's Granite series
  • family:gpt-oss - OpenAI's GPT-OSS series
  • (and more)
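
Because groups are plain string tags on each eval entry, peer comparison is just tag filtering. A hedged convenience snippet (assuming the top-level "evals" list shown in the configuration examples below):

import json

with open("data/dataset-m12x.json") as f:
    dataset = json.load(f)

# Select mid-sized MoE models for a peer comparison.
peers = [
    entry["label"] for entry in dataset["evals"]
    if {"arch:moe", "size:mid"} <= set(entry.get("groups", []))
]
print(peers)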

Extending m12x

Adding a New Model to m12x

Step 1: Run the Evaluation

# Runs all three m12x tiers (degree 0,1,2 = easy/medium/hard) at normal density.
# --precision controls the adaptive sampling budget (see Adaptive precision above).
python runner.py --config configs/m12x.yaml \
  --degree 0,1,2 --density normal --precision low \
  --model new-model-70b \
  --apibase http://localhost:8000 \
  --template zerocot-nosys \
  --sampler greedy-max

This writes interview NDJSON files to results/<timestamp>/new-model-70b+zerocot-nosys+greedy-max/.

Step 2: Organize Interview Data

# Create experiment directory
mkdir -p data/m12x/new-model-70b-exp001

# Move interview files
mv results/<timestamp>/new-model-70b+zerocot-nosys+greedy-max/* \
   data/m12x/new-model-70b-exp001/

Step 3: Add to Dataset Config

Edit data/dataset-m12x.json and add a new eval entry:

{
  "evals": [
    {
      "evaluate": {
        "glob": "data/m12x/new-model-70b-exp001/*"
      },
      "filters": {
        "model": "new-model-70b",
        "template": "zerocot-nosys",
        "sampler": "greedy-max"
      },
      "label": "New Model 70B",
      "groups": [
        "family:new",
        "arch:dense",
        "size:large"
      ],
      "hf_id": "organization/new-model-70b",
      "hf_quant_id": null
    }
  ]
}
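
Before moving on to Step 4, it can be worth a quick sanity check that the glob in the new entry actually matches the files moved in Step 2. This is a convenience snippet, not part of the toolchain:

import glob
import json

with open("data/dataset-m12x.json") as f:
    dataset = json.load(f)

for entry in dataset["evals"]:
    pattern = entry["evaluate"]["glob"]
    matches = glob.glob(pattern)
    # Zero matches usually means a typo in the glob or a misplaced directory.
    print(f"{entry['label']}: {len(matches)} interview files for {pattern}")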

Step 4: Process Evaluation

# Process all missing evaluations (idempotent)
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

Step 5: Verify Integration

# Verify eval appears in dataset
python analyze.py evals data/dataset-m12x.json --search "new model"

# Check database size
ls -lh data/m12x.db

# Quick leaderboard check
python analyze.py scores data/dataset-m12x.json \
  --filters '{"model": "new-model-70b"}' --format markdown

Context Simulation

You can simulate lower context limits from high-context evaluations:

{
  "evals": [
    {
      "evaluate": {
        "glob": "data/m12x/*phi-4-fp16*/*",
        "context": 8192
      },
      "filters": {
        "model": "phi-4-fp16",
        "template": "zerocot-nosys",
        "sampler": "greedy-max-ctx8192"
      },
      "label": "Microsoft Phi-4 (8k ctx)",
      "groups": ["family:phi4", "arch:dense", "size:mid"],
      "hf_id": "microsoft/Phi-4"
    }
  ]
}

How it works:

  1. Uses the same raw interview data as the base eval
  2. evaluate.py clips responses to fit in the simulated context
  3. The sampler is auto-suffixed: greedy-max → greedy-max-ctx8192
  4. Stored as a separate eval with a unique eval_id
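
A rough sketch of the clipping idea (the real token accounting in evaluate.py will differ; this only shows the principle of deriving an 8k-context eval from longer raw interviews):

def clip_to_context(prompt_tokens: int, completion_tokens: list[str],
                    context: int = 8192) -> list[str]:
    """Keep only as many completion tokens as fit alongside the prompt.
    Illustrative only; evaluate.py's actual accounting may differ."""
    budget = max(context - prompt_tokens, 0)
    return completion_tokens[:budget]

# A response that used 9000 tokens of thinking gets truncated, and is then
# scored on whatever answer (if any) survives the cut.
clipped = clip_to_context(prompt_tokens=500, completion_tokens=["tok"] * 9000)
print(len(clipped))  # 7692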

Building Your Own Evaluation

Use m12x as a reference:

  • Study manifold definitions in tasks/*.json to understand difficulty parameterization
  • Examine tier mappings in data/dataset-m12x.json to see how easy/medium/hard scale
  • Review surface/projection naming conventions for organizing analysis
  • Analyze precision configurations to calibrate your own cost/confidence tradeoffs

Fork and adapt:

  • Start with m12x manifolds and adjust difficulty ranges for your models
  • Modify tier thresholds as model capabilities improve (add "ultra" when "hard" becomes easy)
  • Add new surfaces based on your research questions (any two parameters can form a surface)
  • Use m12x's dataset structure as a template for your own evaluations

Validate your changes:

  • Compare your results against m12x baselines
  • Use m12x's cognitive archetypes as reference patterns
  • Leverage m12x's forensic case studies as investigation templates


Researching with m12x

Beyond the reference evaluation: worked examples of the ReasonScape methodology in action.

The m12x reference dataset proves the methodology works at scale. But how do you actually use it to answer research questions?

6 research subdatasets demonstrate the complete workflow:

  • Seed-OSS thinking budget optimization (7 configs, overthinking loop diagnosis)
  • Qwen3-Next RC budget allocation (budget distribution matters more than size)
  • Qwen3 family deep dive (19 variants across sizes and capabilities)
  • Plus Ministral3, GPT-OSS, and K2Think investigations

Each subdataset is a template showing:

  • Hypothesis formation and experimental design
  • Discovery-investigation ping-pong with analyze.py
  • Evidence chain building with compression/hazard/surface plots
  • Statistical validation and graduation decisions
  • Complete SUMMARY.md documenting the story

Learn by example: See datasets.md for complete documentation including:

  • Dataset taxonomy (main vs research)
  • The graduation pattern (research → main for winners)
  • Dataset lifecycle (hypothesis → experiments → evidence → graduation)
  • Case studies with worked examples
  • Best practices from real investigations

See Also