
r12: The Current Reference Evaluation

What is r12?

r12 is ReasonScape's current reference evaluation. It evaluates language model reasoning across 12 tasks using comprehensive parametric coverage and a single difficulty tier.

  • 12 tasks - Covering 8 distinct cognitive domains
  • 95% accuracy ceiling - Calibrated to allow capable models to "pass"
  • 1,070 points per evaluation - Denser coverage of difficulty points across all 12 tasks to map capability boundaries
  • 16k context window - Eliminates most truncation artifacts
  • ReasonScore v2 - Simplified 2-layer scoring with probability-space truncation handling and bootstrap confidence intervals
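The bootstrap confidence intervals mentioned above can be sketched as a percentile bootstrap over per-sample scores. This is an illustrative sketch only, not the actual ReasonScore v2 implementation:

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score.

    Illustrative sketch -- not the actual ReasonScore v2 code.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original set.
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci([0.9, 0.8, 1.0, 0.7, 0.95, 0.85])
```

The returned interval conveys how much a model's mean score could shift under resampling of the test points.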

The r12 dataset is actively growing: over 60 models have been evaluated, with new models added regularly.


The 12 Tasks

Each task tests distinct reasoning capabilities through parametric variation:

| Task | Domain | Tests | Key Parameters |
|---|---|---|---|
| Arithmetic | Mathematical reasoning | Symbolic parsing, numeric computation, nesting | Length, depth, whitespace, number range |
| Boolean | Logical evaluation | Logic operations, symbolic parsing | Length, depth, boolean format |
| Dates | Temporal reasoning | Calendar math, date parsing, temporal offsets | Tier, date format |
| Objects | Selective attention | Filtering, semantic categorization | Length, target groups, distractors |
| Shuffle | State tracking | Sequence manipulation, working memory | Depth, confounders, anchors, list length |
| Brackets | Structural parsing | Nesting, matching, constraint satisfaction | Length, depth, bracket types |
| Letters | Character analysis | Counting, pattern matching, case sensitivity | Target words, letter frequency, confounders |
| Tables | Structured data reasoning | Table parsing, filtering, aggregation | Rows, columns, format, operation type |
| Shapes | Spatial reasoning | Visual parsing, rotation/scaling, transformation tracking | Shape type, rotation, scale, offset |
| Cars | Logistics planning | State tracking, spatial relationships, operation sequences | Moves, operation complexity, entities |
| Sort | Algorithmic thinking | Sorting, run-length grouping, sequence manipulation | Length, run-length, mechanical mode |
| Sequence | Rule-based generation | Pattern identification, rule application | Rule count, sequence length, rule complexity |

Instead of predefined difficulty levels, r12 achieves comprehensive coverage through parametric grids that span each task's full difficulty space. Difficulty emerges empirically from model performance rather than being imposed by the evaluator.
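A parametric grid can be pictured as the Cartesian product of each task's parameter ranges, where every combination becomes one test point in the difficulty space. The sketch below uses illustrative parameter names borrowed from the Arithmetic row above; the actual r12 grid values differ:

```python
from itertools import product

# Illustrative parameter ranges -- not r12's actual grid values.
grid_spec = {
    "length": [4, 8, 16, 32],
    "max_depth": [0, 1, 2],
    "prob_dewhitespace": [0.0, 0.5, 1.0],
}

# Every combination of parameter values is one test point.
points = [dict(zip(grid_spec, combo)) for combo in product(*grid_spec.values())]
print(len(points))  # 4 * 3 * 3 = 36 points
```

Plotting model accuracy over such a grid is what lets difficulty boundaries emerge empirically rather than being imposed in advance.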

Task Reference

Obtain Data

First, pull the r12 results database:

python data.py pull dataset data/r12.json

See data.md for further information.

Quick Discovery

List all available analysis dimensions:

python analyze.py tasks data/r12.json

This shows all views defined for each task.

Similarly, to see all models for which evaluation data is available:

python analyze.py evals data/r12.json

Views

Each task defines views -- named analysis recipes that slice results by one or two parameters. Views are defined in the experiment config (configs/r12.yaml) and consumed by analyze.py to produce surfaces and projections.

A view specifies:

  • group_by: The primary axes (1 or 2 manifold.* or params.* columns)
  • facet_by (optional): Columns that split results into separate panels or series
  • filters (optional): Fixed values that restrict the view to a specific slice

Example views from the Arithmetic task:

views:
- view: arith_depth_length_single
  label: "Depth x Length (Small, {prob_dewhitespace} whitespace)"
  group_by: ["params.max_depth", "params.length"]
  facet_by: ["params.prob_dewhitespace"]
  filters: { "manifold.id": "single_digit" }
- view: arith_length
  label: Length
  group_by: ["params.length"]
  filters: { "manifold.id": "single_digit", "params.prob_dewhitespace": 0.5, "params.max_depth": 0 }

Views replace the old surfaces/projections distinction. Use the discovery command above to list all views currently defined for each task.
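Conceptually, applying a view is a filter followed by a group-by over result rows. The sketch below uses plain Python with row fields taken from the arith_length example above; the aggregation shown is a simple mean, which may differ from the statistics analyze.py actually computes:

```python
from collections import defaultdict

# Toy result rows; real rows come from the r12 results database.
rows = [
    {"manifold.id": "single_digit", "params.length": 4, "correct": 1},
    {"manifold.id": "single_digit", "params.length": 4, "correct": 0},
    {"manifold.id": "single_digit", "params.length": 8, "correct": 1},
    {"manifold.id": "multi_digit",  "params.length": 4, "correct": 1},
]

filters = {"manifold.id": "single_digit"}
group_by = "params.length"

# Keep rows matching the view's filters, then bucket by the group_by axis.
buckets = defaultdict(list)
for row in rows:
    if all(row.get(k) == v for k, v in filters.items()):
        buckets[row[group_by]].append(row["correct"])

accuracy = {k: sum(v) / len(v) for k, v in buckets.items()}
print(accuracy)  # {4: 0.5, 8: 1.0}
```

A two-axis group_by works the same way with a tuple key, producing the surface grids that analyze.py renders.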

Group Taxonomy

Groups are classification tags on model evaluations that enable peer comparison and cross-cutting analysis. They are set in evals.json for each model and are system-independent.

Note: the facet_by field in views (see above) is a separate concept -- it splits chart panels by parameter values, not by model groups.

Architecture (arch:)

  • arch:dense - Standard dense transformer: all parameters active per token
  • arch:moe - Mixture of Experts: sparse activation via routing
  • arch:ssm - State-space models: recurrent/linear complexity architectures
  • arch:hybrid - Multiple mechanisms: mixed dense and sparse layers

Size (size:)

Use active parameters (not total parameters for MoE models):

  • size:tiny - <3B active parameters
  • size:small - 3B-10B active parameters
  • size:mid - 10B-30B active parameters
  • size:large - 30B-70B active parameters
  • size:xlarge - 70B+ active parameters
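The size tiers above can be encoded as a simple threshold lookup. This is a convenience sketch, not part of the r12 tooling; boundary values are assigned to the larger tier here, which the taxonomy leaves unspecified:

```python
def size_tag(active_params: float) -> str:
    """Map an active-parameter count to its size: group tag.

    Boundary values (e.g. exactly 3B) go to the larger tier -- an
    assumption, since the taxonomy does not pin this down.
    """
    if active_params < 3e9:
        return "size:tiny"
    if active_params < 10e9:
        return "size:small"
    if active_params < 30e9:
        return "size:mid"
    if active_params < 70e9:
        return "size:large"
    return "size:xlarge"

print(size_tag(7e9))    # size:small
print(size_tag(120e9))  # size:xlarge
```

Remember to pass active parameters: a MoE model with 120B total but 12B active parameters tags as size:mid.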

Family (family:)

Model family/organization (one per model):

  • family:phi4, family:qwen3, family:llama, family:gemini
  • family:granite, family:mistral, family:glm, family:deepseek
  • family:ministral, family:hermes, family:nemotron
  • And others for new models
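Because groups are plain tags on each evals.json entry, peer sets for comparison can be selected with a simple containment check. A sketch, with toy entries (the real entry shape appears in the evals.json example later in this document):

```python
import json

# Toy evals.json entries; real entries live in each model's cohort directory.
evals = json.loads("""[
  {"label": "Model A", "groups": ["family:qwen3", "arch:moe", "size:mid"]},
  {"label": "Model B", "groups": ["family:llama", "arch:dense", "size:mid"]},
  {"label": "Model C", "groups": ["family:phi4", "arch:dense", "size:small"]}
]""")

def peers(entries, *tags):
    """Return labels of entries carrying all of the given group tags."""
    return [e["label"] for e in entries if set(tags) <= set(e["groups"])]

print(peers(evals, "size:mid"))                # ['Model A', 'Model B']
print(peers(evals, "size:mid", "arch:dense"))  # ['Model B']
```

Combining tags this way is what enables cross-cutting questions like "how do mid-size MoE models compare to mid-size dense models?"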

See config.md for further information and usage examples.


Adding Models to r12

Dataset Structure

Refer to datasets.md to understand the r12 cohort dataset structure.

Primary Path: /import Skill

The recommended way to add a new model is the /import skill:

/import ModelName

This handles directory creation, evals.json generation, and database rebuild automatically.

Manual Process (Reference)

Step 1: Run Evaluation

python runner.py --config configs/r12.yaml \
  --model my-new-model \
  --apibase http://localhost:8000 \
  --template zerocot-nosys \
  --sampler greedy-max

This generates NDJSON interview files in results/timestamp/my-new-model+zerocot-nosys+greedy-max/.
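Each line of an NDJSON interview file is an independent JSON record, so the output can be inspected with a few lines of Python. The record fields below are illustrative, not the actual interview schema:

```python
import io
import json

# Stand-in for open("results/.../interview.ndjson"); one JSON record per line.
# Field names here are illustrative, not the real interview schema.
ndjson = io.StringIO(
    '{"task": "arithmetic", "answer": "42"}\n'
    '{"task": "arithmetic", "answer": "7"}\n'
    '{"task": "boolean", "answer": "True"}\n'
)

records = [json.loads(line) for line in ndjson if line.strip()]
print(len(records))        # 3
print(records[0]["task"])  # arithmetic
```

The line-per-record format means partial files are still readable, which is useful when spot-checking a run in progress.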

Step 2: Create Cohort Directory

mkdir -p data/r12/MyNewModel
mv results/timestamp/my-new-model+zerocot-nosys+greedy-max/* data/r12/MyNewModel/

Step 3: Create evals.json

Create data/r12/MyNewModel/evals.json:

[
  {
    "evaluate": {
      "glob": "data/r12/MyNewModel/*"
    },
    "filters": {
      "model": "my-new-model",
      "template": "zerocot-nosys",
      "sampler": "greedy-max"
    },
    "label": "MyNew Model",
    "groups": [
      "family:mynew",
      "arch:dense",
      "size:large"
    ],
    "hf_id": "org/my-new-model",
    "hf_quant_id": null,
    "tags": []
  }
]
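Before processing, a quick sanity check of evals.json can catch malformed JSON or missing keys. A minimal sketch, with the required keys assumed from the example above:

```python
import json

# Required top-level keys, assumed from the evals.json example above.
REQUIRED_KEYS = {"evaluate", "filters", "label", "groups"}

def check_evals(entries):
    """Return a list of problems found in parsed evals.json entries."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i} missing keys: {sorted(missing)}")
    return problems

# In practice: entries = json.load(open("data/r12/MyNewModel/evals.json"))
entries = json.loads('[{"evaluate": {}, "filters": {}, "label": "X"}]')
print(check_evals(entries))  # ["entry 0 missing keys: ['groups']"]
```

An empty list means the file at least has the expected shape before you spend time on the database rebuild.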

Step 4: Process Evaluation

python evaluate.py --dataset data/r12.json --parallel 16

Step 5: Verify

# Confirm model appears
python analyze.py evals data/r12.json --search "mynew"

# Quick leaderboard check
python analyze.py scores data/r12.json

Using r12 for Research

r12 is both a reference evaluation and a research platform. The analysis workflow follows the Three P's:

Research process:

  1. Position - Use scores to rank models overall, cluster for statistical rigor
  2. Profile - Use spiderweb for fingerprints, surface for capability boundaries, compression/hazard for reasoning and temporal analysis
  3. Probe - Use probe.py failure and probe.py truncation to inspect raw outputs and diagnose loop patterns

Research output goes in:

research/<project>/

See workflow.md for detailed guides and datasets.md for research subdataset documentation.


See Also