
r12: The Current Reference Evaluation

What is r12?

r12 is ReasonScape's current reference evaluation. It evaluates language model reasoning across 12 tasks using comprehensive parametric coverage and a single difficulty tier.

  • 12 tasks - Covering 8 distinct cognitive domains
  • 95% accuracy ceiling - Calibrated to allow capable models to "pass"
  • 1,070 points per evaluation - Denser coverage of difficulty points across all 12 tasks to map capability boundaries
  • 16k context window - Eliminates most truncation artifacts
  • ReasonScore v2 - Simplified 2-layer scoring with probability-space truncation handling and bootstrap confidence intervals
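The bootstrap confidence intervals mentioned above can be sketched as a percentile bootstrap over per-sample scores. This is an illustrative sketch only, not the actual ReasonScore v2 implementation:

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score.

    Illustrative sketch -- not the actual ReasonScore v2 code.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original set.
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci([0.9, 0.8, 1.0, 0.7, 0.95, 0.85])
```

The returned interval conveys how much a model's mean score could shift under resampling of the test points.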

The r12 dataset is actively growing: over 60 models have been evaluated, with new models added regularly.


The 12 Tasks

Each task tests distinct reasoning capabilities through parametric variation:

| Task | Domain | Tests | Key Parameters |
|---|---|---|---|
| Arithmetic | Mathematical reasoning | Symbolic parsing, numeric computation, nesting | Length, depth, whitespace, number range |
| Boolean | Logical evaluation | Logic operations, symbolic parsing | Length, depth, boolean format |
| Dates | Temporal reasoning | Calendar math, date parsing, temporal offsets | Tier, date format |
| Objects | Selective attention | Filtering, semantic categorization | Length, target groups, distractors |
| Shuffle | State tracking | Sequence manipulation, working memory | Depth, confounders, anchors, list length |
| Brackets | Structural parsing | Nesting, matching, constraint satisfaction | Length, depth, bracket types |
| Letters | Character analysis | Counting, pattern matching, case sensitivity | Target words, letter frequency, confounders |
| Tables | Structured data reasoning | Table parsing, filtering, aggregation | Rows, columns, format, operation type |
| Shapes | Spatial reasoning | Visual parsing, rotation/scaling, transformation tracking | Shape type, rotation, scale, offset |
| Cars | Logistics planning | State tracking, spatial relationships, operation sequences | Moves, operation complexity, entities |
| Sort | Algorithmic thinking | Sorting, run-length grouping, sequence manipulation | Length, run-length, mechanical mode |
| Sequence | Rule-based generation | Pattern identification, rule application | Rule count, sequence length, rule complexity |

Instead of predefined difficulty levels, r12 achieves comprehensive coverage through parametric grids that span each task's full difficulty space. Difficulty emerges empirically from model performance rather than being imposed by the evaluator.
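A parametric grid can be pictured as the Cartesian product of each task's parameter ranges, where every combination becomes one test point in the difficulty space. The sketch below uses illustrative parameter names borrowed from the Arithmetic row above; the actual r12 grid values differ:

```python
from itertools import product

# Illustrative parameter ranges -- not r12's actual grid values.
grid_spec = {
    "length": [4, 8, 16, 32],
    "max_depth": [0, 1, 2],
    "prob_dewhitespace": [0.0, 0.5, 1.0],
}

# Every combination of parameter values is one test point.
points = [dict(zip(grid_spec, combo)) for combo in product(*grid_spec.values())]
print(len(points))  # 4 * 3 * 3 = 36 points
```

Plotting model accuracy over such a grid is what lets difficulty boundaries emerge empirically rather than being imposed in advance.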

Task Reference

Obtain Data

First, pull the r12 results database:

python data.py pull dataset data/r12.json

See data.md for further information.

Quick Discovery

List all available analysis dimensions:

python analyze.py tasks data/r12.json

This shows all views defined for each task.

Similarly, to see all models for which evaluation data is available:

python analyze.py evals data/r12.json

Views

Each task defines views -- named analysis recipes that slice results by one or two parameters. Views are defined in the experiment config (configs/r12.yaml) and consumed by analyze.py to produce surfaces and projections.

A view specifies:

  • group_by: The primary axes (1 or 2 manifold.* or params.* columns)
  • facet_by (optional): Columns that split results into separate panels or series
  • filters (optional): Fixed values that restrict the view to a specific slice

Example views from the Arithmetic task:

views:
- view: arith_depth_length_single
  label: "Depth x Length (Small, {prob_dewhitespace} whitespace)"
  group_by: ["params.max_depth", "params.length"]
  facet_by: ["params.prob_dewhitespace"]
  filters: { "manifold.id": "single_digit" }
- view: arith_length
  label: Length
  group_by: ["params.length"]
  filters: { "manifold.id": "single_digit", "params.prob_dewhitespace": 0.5, "params.max_depth": 0 }

Views replace the old surfaces/projections distinction. Use the discovery command above to list all views currently defined for each task.
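Conceptually, applying a view is a filter followed by a group-by over result rows. The sketch below uses plain Python with row fields taken from the arith_length example above; the aggregation shown is a simple mean, which may differ from the statistics analyze.py actually computes:

```python
from collections import defaultdict

# Toy result rows; real rows come from the r12 results database.
rows = [
    {"manifold.id": "single_digit", "params.length": 4, "correct": 1},
    {"manifold.id": "single_digit", "params.length": 4, "correct": 0},
    {"manifold.id": "single_digit", "params.length": 8, "correct": 1},
    {"manifold.id": "multi_digit",  "params.length": 4, "correct": 1},
]

filters = {"manifold.id": "single_digit"}
group_by = "params.length"

# Keep rows matching the view's filters, then bucket by the group_by axis.
buckets = defaultdict(list)
for row in rows:
    if all(row.get(k) == v for k, v in filters.items()):
        buckets[row[group_by]].append(row["correct"])

accuracy = {k: sum(v) / len(v) for k, v in buckets.items()}
print(accuracy)  # {4: 0.5, 8: 1.0}
```

A two-axis group_by works the same way with a tuple key, producing the surface grids that analyze.py renders.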

Group Taxonomy

Groups are classification tags on model evaluations that enable peer comparison and cross-cutting analysis. They are set in evals.json for each model and are system-independent.

Note: the facet_by field in views (see above) is a separate concept -- it splits chart panels by parameter values, not by model groups.

Architecture (arch:)

  • arch:dense - Standard dense transformer: all parameters active per token
  • arch:moe - Mixture of Experts: sparse activation via routing
  • arch:ssm - State-space models: recurrent/linear complexity architectures
  • arch:hybrid - Multiple mechanisms: mixed dense and sparse layers

Size (size:)

Use active parameters (not total parameters for MoE models):

  • size:tiny - <3B active parameters
  • size:small - 3B-10B active parameters
  • size:mid - 10B-30B active parameters
  • size:large - 30B-70B active parameters
  • size:xlarge - 70B+ active parameters
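The size tiers above can be encoded as a simple threshold lookup. This is a convenience sketch, not part of the r12 tooling; boundary values are assigned to the larger tier here, which the taxonomy leaves unspecified:

```python
def size_tag(active_params: float) -> str:
    """Map an active-parameter count to its size: group tag.

    Boundary values (e.g. exactly 3B) go to the larger tier -- an
    assumption, since the taxonomy does not pin this down.
    """
    if active_params < 3e9:
        return "size:tiny"
    if active_params < 10e9:
        return "size:small"
    if active_params < 30e9:
        return "size:mid"
    if active_params < 70e9:
        return "size:large"
    return "size:xlarge"

print(size_tag(7e9))    # size:small
print(size_tag(120e9))  # size:xlarge
```

Remember to pass active parameters: a MoE model with 120B total but 12B active parameters tags as size:mid.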

Family (family:)

Model family/organization (one per model):

  • family:phi4, family:qwen3, family:llama, family:gemini
  • family:granite, family:mistral, family:glm, family:deepseek
  • family:ministral, family:hermes, family:nemotron
  • And others for new models
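Because groups are plain tags on each evals.json entry, peer sets for comparison can be selected with a simple containment check. A sketch, with toy entries (the real entry shape appears in the evals.json example later in this document):

```python
import json

# Toy evals.json entries; real entries live in each model's cohort directory.
evals = json.loads("""[
  {"label": "Model A", "groups": ["family:qwen3", "arch:moe", "size:mid"]},
  {"label": "Model B", "groups": ["family:llama", "arch:dense", "size:mid"]},
  {"label": "Model C", "groups": ["family:phi4", "arch:dense", "size:small"]}
]""")

def peers(entries, *tags):
    """Return labels of entries carrying all of the given group tags."""
    return [e["label"] for e in entries if set(tags) <= set(e["groups"])]

print(peers(evals, "size:mid"))                # ['Model A', 'Model B']
print(peers(evals, "size:mid", "arch:dense"))  # ['Model B']
```

Combining tags this way is what enables cross-cutting questions like "how do mid-size MoE models compare to mid-size dense models?"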

See config.md for further information and usage examples.


Adding Models to r12

Dataset Structure

Refer to datasets.md to understand the r12 cohort dataset structure.

Primary Path: /import Skill

The recommended way to add a new model is the /import skill:

/import ModelName

This handles directory creation, evals.json generation, and database rebuild automatically.

Manual Process (Reference)

Step 1: Run Evaluation

python runner.py --config configs/r12.yaml \
  --model my-new-model \
  --apibase http://localhost:8000 \
  --template zerocot-nosys \
  --sampler greedy-max

This generates NDJSON interview files in results/timestamp/my-new-model+zerocot-nosys+greedy-max/.
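Each line of an NDJSON interview file is an independent JSON record, so the output can be inspected with a few lines of Python. The record fields below are illustrative, not the actual interview schema:

```python
import io
import json

# Stand-in for open("results/.../interview.ndjson"); one JSON record per line.
# Field names here are illustrative, not the real interview schema.
ndjson = io.StringIO(
    '{"task": "arithmetic", "answer": "42"}\n'
    '{"task": "arithmetic", "answer": "7"}\n'
    '{"task": "boolean", "answer": "True"}\n'
)

records = [json.loads(line) for line in ndjson if line.strip()]
print(len(records))        # 3
print(records[0]["task"])  # arithmetic
```

The line-per-record format means partial files are still readable, which is useful when spot-checking a run in progress.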

Step 2: Create Cohort Directory

mkdir -p data/r12/MyNewModel
mv results/timestamp/my-new-model+zerocot-nosys+greedy-max/* data/r12/MyNewModel/

Step 3: Create evals.json

Create data/r12/MyNewModel/evals.json:

[
  {
    "evaluate": {
      "glob": "data/r12/MyNewModel/*"
    },
    "filters": {
      "model": "my-new-model",
      "template": "zerocot-nosys",
      "sampler": "greedy-max"
    },
    "label": "MyNew Model",
    "groups": [
      "family:mynew",
      "arch:dense",
      "size:large"
    ],
    "hf_id": "org/my-new-model",
    "hf_quant_id": null,
    "tags": []
  }
]
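Before processing, a quick sanity check of evals.json can catch malformed JSON or missing keys. A minimal sketch, with the required keys assumed from the example above:

```python
import json

# Required top-level keys, assumed from the evals.json example above.
REQUIRED_KEYS = {"evaluate", "filters", "label", "groups"}

def check_evals(entries):
    """Return a list of problems found in parsed evals.json entries."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i} missing keys: {sorted(missing)}")
    return problems

# In practice: entries = json.load(open("data/r12/MyNewModel/evals.json"))
entries = json.loads('[{"evaluate": {}, "filters": {}, "label": "X"}]')
print(check_evals(entries))  # ["entry 0 missing keys: ['groups']"]
```

An empty list means the file at least has the expected shape before you spend time on the database rebuild.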

Step 4: Process Evaluation

python evaluate.py --dataset data/r12.json --parallel 16

Step 5: Verify

# Confirm model appears
python analyze.py evals data/r12.json --search "mynew"

# Quick leaderboard check
python analyze.py scores data/r12.json

Using r12 for Research

r12 is both a reference evaluation and a research platform. The analysis workflow follows the Three P's:

Research process:

  1. Position - Use scores to rank models overall, cluster for statistical rigor
  2. Profile - Use spiderweb for fingerprints, surface for capability boundaries, compression/hazard for reasoning and temporal analysis
  3. Probe - Use probe.py failure and probe.py truncation to inspect raw outputs and diagnose loop patterns

Research output goes in:

research/<project>/

See workflow.md for detailed guides and datasets.md for research subdataset documentation.


See Also