r12: The Current Reference Evaluation¶
What is r12?¶
r12 is ReasonScape's current reference evaluation. It evaluates language model reasoning across 12 tasks using comprehensive parametric coverage and a single difficulty tier.
- 12 tasks - Covering 8 distinct cognitive domains
- 95% accuracy ceiling - Calibrated to allow capable models to "pass"
- 1,070 points per evaluation - A dense grid of difficulty points across all 12 tasks to map capability boundaries
- 16k context window - Eliminates most truncation artifacts
- ReasonScore v2 - Simplified 2-layer scoring with probability-space truncation handling and bootstrap confidence intervals
The r12 dataset is actively growing: over 60 models have been evaluated, with new models added regularly.
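The "bootstrap confidence intervals" mentioned above can be illustrated with a percentile bootstrap over per-sample correctness. This is a toy sketch of the general technique, not the actual ReasonScore v2 implementation:

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_resamples)
    )
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical run: 80% accuracy over 200 samples
outcomes = [1] * 160 + [0] * 40
lo, hi = bootstrap_ci(outcomes)
print(f"95% CI: [{lo:.2f}, {hi:.2f}]")
```

Resampling the per-sample outcomes with replacement yields an empirical distribution of the accuracy estimate; the reported interval is simply its central 95% slice.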
The 12 Tasks¶
Each task tests distinct reasoning capabilities through parametric variation:
| Task | Domain | Tests | Key Parameters |
|---|---|---|---|
| Arithmetic | Mathematical reasoning | Symbolic parsing, numeric computation, nesting | Length, depth, whitespace, number range |
| Boolean | Logical evaluation | Logic operations, symbolic parsing | Length, depth, boolean format |
| Dates | Temporal reasoning | Calendar math, date parsing, temporal offsets | Tier, date format |
| Objects | Selective attention | Filtering, semantic categorization | Length, target groups, distractors |
| Shuffle | State tracking | Sequence manipulation, working memory | Depth, confounders, anchors, list length |
| Brackets | Structural parsing | Nesting, matching, constraint satisfaction | Length, depth, bracket types |
| Letters | Character analysis | Counting, pattern matching, case sensitivity | Target words, letter frequency, confounders |
| Tables | Structured data reasoning | Table parsing, filtering, aggregation | Rows, columns, format, operation type |
| Shapes | Spatial reasoning | Visual parsing, rotation/scaling, transformation tracking | Shape type, rotation, scale, offset |
| Cars | Logistics planning | State tracking, spatial relationships, operation sequences | Moves, operation complexity, entities |
| Sort | Algorithmic thinking | Sorting, run-length grouping, sequence manipulation | Length, run-length, mechanical mode |
| Sequence | Rule-based generation | Pattern identification, rule application | Rule count, sequence length, rule complexity |
Instead of predefined difficulty levels, r12 achieves comprehensive coverage through parametric grids that span each task's full difficulty space. Difficulty emerges empirically from model performance rather than being imposed by the evaluator.
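As an illustrative sketch of what "parametric grid" means here, the grid is the Cartesian product of each parameter's values. The parameter names are borrowed from the Arithmetic row above; the actual values live in the experiment config, not in this snippet:

```python
from itertools import product

# Hypothetical parameter ranges for an Arithmetic-style task;
# the real grid is defined in configs/r12.yaml.
grid_params = {
    "length": [2, 4, 8, 16],
    "max_depth": [0, 1, 2],
    "prob_dewhitespace": [0.0, 0.5, 1.0],
}

# Every combination becomes one cell of the task's difficulty space.
grid = [
    dict(zip(grid_params, values))
    for values in product(*grid_params.values())
]

print(len(grid))  # 4 * 3 * 3 = 36 cells
```

No cell is labeled "easy" or "hard" up front; difficulty is read off afterwards from where model accuracy degrades across the grid.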
Task Reference¶
Obtain Data¶
First, pull the r12 results database:
python data.py pull dataset data/r12.json
See data.md for further information.
Quick Discovery¶
List all available analysis dimensions:
python analyze.py tasks data/r12.json
This shows all views defined for each task.
Similarly, to see all models for which evaluation data is available:
python analyze.py evals data/r12.json
Views¶
Each task defines views -- named analysis recipes that slice results by one or two parameters. Views are defined in the experiment config (configs/r12.yaml) and consumed by analyze.py to produce surfaces and projections.
A view specifies:
- group_by: The primary axes (1 or 2 manifold.* or params.* columns)
- facet_by (optional): Columns that split results into separate panels or series
- filters (optional): Fixed values that restrict the view to a specific slice
Example views from the Arithmetic task:
views:
- view: arith_depth_length_single
label: "Depth x Length (Small, {prob_dewhitespace} whitespace)"
group_by: ["params.max_depth", "params.length"]
facet_by: ["params.prob_dewhitespace"]
filters: { "manifold.id": "single_digit" }
- view: arith_length
label: Length
group_by: ["params.length"]
filters: { "manifold.id": "single_digit", "params.prob_dewhitespace": 0.5, "params.max_depth": 0 }
Views replace the old surfaces/projections distinction. Use the discovery command above to list all views currently defined for each task.
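Conceptually, applying a view is filter-then-group: drop records outside the filters slice, then bucket the survivors by the facet_by and group_by columns. A minimal sketch in plain Python (the record fields are illustrative, not the actual database schema):

```python
from collections import defaultdict

def apply_view(records, group_by, facet_by=None, filters=None):
    """Filter records to the view's slice, then bucket by (facet, group)."""
    facet_by = facet_by or []
    filters = filters or {}
    buckets = defaultdict(list)
    for rec in records:
        if any(rec.get(col) != val for col, val in filters.items()):
            continue  # outside this view's slice
        facet = tuple(rec.get(col) for col in facet_by)
        group = tuple(rec.get(col) for col in group_by)
        buckets[(facet, group)].append(rec)
    return buckets

# Illustrative records using the arith_length view's columns
records = [
    {"manifold.id": "single_digit", "params.length": 4, "correct": 1},
    {"manifold.id": "single_digit", "params.length": 8, "correct": 0},
    {"manifold.id": "double_digit", "params.length": 4, "correct": 1},
]
buckets = apply_view(records, group_by=["params.length"],
                     filters={"manifold.id": "single_digit"})
print(sorted(group for _facet, group in buckets))  # [(4,), (8,)]
```

With one group_by column the buckets form a 1D projection; with two they form a surface, and each distinct facet_by value becomes its own panel or series.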
Group Taxonomy¶
Groups are classification tags on model evaluations that enable peer comparison and cross-cutting analysis. They are set in evals.json for each model and are system-independent.
Note: the facet_by field in views (see above) is a separate concept -- it splits chart panels by parameter values, not by model groups.
Architecture (arch:)¶
- `arch:dense` - Standard dense transformer: all parameters active per token
- `arch:moe` - Mixture of Experts: sparse activation via routing
- `arch:ssm` - State-space models: recurrent/linear complexity architectures
- `arch:hybrid` - Multiple mechanisms: mixed dense and sparse layers
Size (size:)¶
Use active parameters (not total parameters for MoE models):
- `size:tiny` - <3B active parameters
- `size:small` - 3B-10B active parameters
- `size:mid` - 10B-30B active parameters
- `size:large` - 30B-70B active parameters
- `size:xlarge` - 70B+ active parameters
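The size boundaries above can be expressed as a small lookup. This is a sketch; the assignment of an exact-boundary count (3B, 10B, 30B, 70B) to the larger bucket is an assumption, not documented behavior:

```python
def size_tag(active_params_b: float) -> str:
    """Map active parameter count (in billions) to a size: group tag."""
    if active_params_b < 3:
        return "size:tiny"
    if active_params_b < 10:
        return "size:small"
    if active_params_b < 30:
        return "size:mid"
    if active_params_b < 70:
        return "size:large"
    return "size:xlarge"

# For MoE models, pass active (not total) parameters:
print(size_tag(3))    # size:small
print(size_tag(235))  # size:xlarge
```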
Family (family:)¶
Model family/organization (one per model):
- `family:phi4`, `family:qwen3`, `family:llama`, `family:gemini`
- `family:granite`, `family:mistral`, `family:glm`, `family:deepseek`
- `family:ministral`, `family:hermes`, `family:nemotron`
- And others for new models
See config.md for further information and usage examples.
Adding Models to r12¶
Dataset Structure¶
Refer to datasets.md to understand the r12 cohort dataset structure.
Primary Path: /import Skill¶
The recommended way to add a new model is the /import skill:
/import ModelName
This handles directory creation, evals.json generation, and database rebuild automatically.
Manual Process (Reference)¶
Step 1: Run Evaluation
python runner.py --config configs/r12.yaml \
--model my-new-model \
--apibase http://localhost:8000 \
--template zerocot-nosys \
--sampler greedy-max
This generates NDJSON interview files in results/timestamp/my-new-model+zerocot-nosys+greedy-max/.
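NDJSON means one JSON object per line, so interview files can be inspected with a few lines of Python. The field layout of each record is not shown here; check a real file for the actual schema, and adjust the glob to match the actual interview filenames (the `.ndjson` extension is an assumption):

```python
import json
from pathlib import Path

def read_ndjson(path):
    """Yield one parsed JSON record per non-empty line of an NDJSON file."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

# Count records across a cohort directory
cohort = Path("data/r12/MyNewModel")
if cohort.is_dir():
    total = sum(1 for p in cohort.glob("*.ndjson") for _ in read_ndjson(p))
    print(total)
```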
Step 2: Create Cohort Directory
mkdir -p data/r12/MyNewModel
mv results/timestamp/my-new-model+zerocot-nosys+greedy-max/* data/r12/MyNewModel/
Step 3: Create evals.json
Create data/r12/MyNewModel/evals.json:
[
{
"evaluate": {
"glob": "data/r12/MyNewModel/*"
},
"filters": {
"model": "my-new-model",
"template": "zerocot-nosys",
"sampler": "greedy-max"
},
"label": "MyNew Model",
"groups": [
"family:mynew",
"arch:dense",
"size:large"
],
"hf_id": "org/my-new-model",
"hf_quant_id": null,
"tags": []
}
]
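Before rebuilding the database, it can help to sanity-check the file. The sketch below is not an official validator: the required-key set and group-tag prefixes are inferred from the example above, and real datasets may use additional keys or prefixes.

```python
REQUIRED_KEYS = {"evaluate", "filters", "label", "groups"}
GROUP_PREFIXES = ("family:", "arch:", "size:")

def check_evals(entries):
    """Return a list of problems found in an evals.json-style list."""
    problems = []
    for i, entry in enumerate(entries):
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"entry {i}: missing keys {sorted(missing)}")
        for tag in entry.get("groups", []):
            if not tag.startswith(GROUP_PREFIXES):
                problems.append(f"entry {i}: unexpected group tag {tag!r}")
    return problems

entry = {
    "evaluate": {"glob": "data/r12/MyNewModel/*"},
    "filters": {"model": "my-new-model",
                "template": "zerocot-nosys",
                "sampler": "greedy-max"},
    "label": "MyNew Model",
    "groups": ["family:mynew", "arch:dense", "size:large"],
}
print(check_evals([entry]))  # [] -> no problems found
```

To check a real file, load it with `json.load` and pass the resulting list to `check_evals`.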
Step 4: Process Evaluation
python evaluate.py --dataset data/r12.json --parallel 16
Step 5: Verify
# Confirm model appears
python analyze.py evals data/r12.json --search "mynew"
# Quick leaderboard check
python analyze.py scores data/r12.json
Using r12 for Research¶
r12 is both a reference evaluation and a research platform. The analysis workflow follows the Three P's:
Research process:
1. Position - Use scores to rank models overall, cluster for statistical rigor
2. Profile - Use spiderweb for fingerprints, surface for capability boundaries, compression/hazard for reasoning and temporal analysis
3. Probe - Use probe.py failure and probe.py truncation to inspect raw outputs and diagnose loop patterns
Research output goes in:
research/<project>/
See workflow.md for detailed guides and datasets.md for research subdataset documentation.
See Also¶
- Architecture Overview - The five-stage philosophy
- Config Reference - Templates, samplers, views and sampling modes
- ReasonScore v2 - Statistical scoring methodology for r12
- Tasks - Abstract task API and analysis dimensions
- Workflow Guide - The Three P's research methodology
- Datasets - Research subdatasets and case studies