The ReasonScape Manifold¶
ReasonScape stores evaluation results using a two-plane, three-layer structure. Each point exists simultaneously in an Evaluation Plane and a Task Complexity Plane.
The planes are symmetric: each has a definition layer, a generation layer, and an analysis layer.
```
                LAYER 1             LAYER 2                LAYER 3
                (definition)        (generation)           (analysis)
                ──────────────      ──────────────         ──────────────
TASK PLANE      manifold{}          base_task              views: (config)
                which grid entry    params{}               query recipes

EVAL PLANE      (none)              model    \             facets{}
                                    template  | eval_id    how to cut models
                                    sampler  /
```
This document explains the design motivations behind this architecture.
Why Two Planes?¶
Traditional benchmarks are flat: (model, task) → score
But this can't answer:
- WHERE in complexity space does the model fail?
- HOW does performance change as difficulty increases?
- WHAT architectural patterns emerge across difficulty levels?
ReasonScape's two-plane structure enables these questions by:
Separating concerns:
- Evaluation plane = System Under Test (who/how)
- Complexity plane = Test Harness (what/where)
Enabling orthogonal variation:
- Same model tested at many difficulty levels
- Many models tested at same difficulty level
- Cross-product: all models × all difficulties
Providing analysis layers per plane:
- Evaluation Layer 3: group models by architecture, size, and runtime (`facets{}`)
- Task Layer 3: named query recipes defined in config (`views:`)
Why Three Layers?¶
To enable semantically rich grouping and filtering.
Layer 1 — Definition space: How was this point defined? The task plane has `manifold{}` because calibration is a lossy mapping: `num_rows=440` does not tell you it came from `target_tokens=8000`. Layer 1 preserves the information that the Layer 1→2 mapping destroys. On the evaluation plane, Layer 1 would semantically be the dataset definition, but since in practice all evaluations come from a single dataset, this column is omitted.
Layer 2 — Generation space: How was this point generated? For the task plane: {base_task, params} — the concrete values passed to the generator. For the eval plane: {model, template, sampler} — the identity triple.
Layer 3 — Analysis space: How does the researcher want to view and slice the data? For the task plane: views: defined in the config per task — each view is a named query recipe (group_by, optional filters, optional facet_by). Nothing pre-computed onto points; manifold.* and params.* already carry every dimension a view needs. For the eval plane: facets{} — scalar JSON decomposed from groups[] at construction time (["arch:moe", "size:large"] → {"arch": "moe", "size": "large"}). groups[] is kept in parallel as a backwards-compatible read path until deprecated.
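Putting the layers together, a single stored point carries both planes at once. A minimal sketch of what such a record might look like, using the example values from this document (the field values are illustrative; task-plane Layer 3 lives in config, not on the point):

```python
# One point as a Python dict. Values are illustrative only.
point = {
    # Task plane, Layer 1 (definition): which grid entry produced this point
    "manifold": {"id": "csv_tall", "target_tokens": 8000},
    # Task plane, Layer 2 (generation): concrete generator inputs
    "base_task": "arithmetic",
    "params": {"length": 54, "max_depth": 3},
    # Eval plane, Layer 2 (generation): the identity triple
    "model": "GLM-4.5-Air-AWQ",
    "template": "zeroshot-nosys",
    "sampler": "qwen3-think-max",
    # Eval plane, Layer 3 (analysis): scalar facets decomposed from groups[]
    "facets": {"arch": "moe", "size": "large"},
    # Task plane, Layer 3 (views:) is defined in config, not stored here.
}
```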
The Evaluation Plane¶
The Evaluation Plane defines which model configuration was tested.
Layer 1¶
(None; the evaluation plane's dataset-definition layer is omitted because, in practice, all evaluations come from a single dataset. See Why Three Layers? above.)
Layer 2 — Identity Dimensions¶
model (VARCHAR) - Which language model
- Example: "Seed-OSS-36B", "MiniMax-M2", "GPT-OSS-20B"
- Identifies the base model being evaluated
template (VARCHAR) - Which prompt template
- Example: "zerocot-nosys", "cot-high", "direct"
- Defines how problems are presented to the model
sampler (VARCHAR) - Which sampling configuration
- Example: "seedoss-4k", "default", "extended-8k"
- Defines generation parameters (token budget, temperature, etc.)
Together: (model, template, sampler) uniquely identifies an evaluation configuration — a specific way of testing a model.
eval_id Coordinate System¶
The evaluation plane is 3-dimensional: every point belongs to exactly one (model, template, sampler) combination. While this uniquely identifies a point's position in evaluation space, filtering by three dimensions is verbose:
{"model": "GLM-4.5-Air-AWQ", "template": "zeroshot-nosys", "sampler": "qwen3-think-max"}
In practice, datasets have low cardinality in this space—typically 5-50 distinct combinations. This makes it convenient to reference each combination with a compact identifier rather than spelling out all three dimensions.
eval_id is a stable 6-character hash computed from the identity triple:
```python
import hashlib

eval_id = hashlib.sha256(f"{model}|{template}|{sampler}".encode()).hexdigest()[:6]
# "GLM-4.5-Air-AWQ|zeroshot-nosys|qwen3-think-max" → "5e0718"
```
This provides:
- Compact filtering: `{"eval_id": ["5e0718"]}` instead of three separate filters
- Stability and caching: the same triple always produces the same eval_id, even across different datasets
- Label independence: changing display labels doesn't affect eval_id
- Join key: enables cross-dataset comparisons on the same model configuration
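As a sketch of how the compact form gets used in practice (the point records and filter code below are illustrative, not a ReasonScape API):

```python
import hashlib

def eval_id(model: str, template: str, sampler: str) -> str:
    """Stable 6-character hash of the identity triple."""
    return hashlib.sha256(f"{model}|{template}|{sampler}".encode()).hexdigest()[:6]

# Illustrative point records, not a real ReasonScape schema.
points = [
    {"model": "GLM-4.5-Air-AWQ", "template": "zeroshot-nosys", "sampler": "qwen3-think-max"},
    {"model": "Seed-OSS-36B", "template": "zerocot-nosys", "sampler": "seedoss-4k"},
]
for p in points:
    p["eval_id"] = eval_id(p["model"], p["template"], p["sampler"])

# One short, label-independent key replaces three separate filters.
target = eval_id("GLM-4.5-Air-AWQ", "zeroshot-nosys", "qwen3-think-max")
selected = [p for p in points if p["eval_id"] == target]
```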
Layer 3 — Facets¶
Facets are classification tags enabling peer comparison and cross-cutting analysis. ReasonScape uses three orthogonal dimensions that can be combined for complex filtering.
Architecture (arch:)¶
- `dense` - Standard dense transformer: all parameters active per token
- `moe` - Mixture of Experts: sparse activation via routing
- `ssm` - State-space models: recurrent/linear-complexity architectures
- `hybrid` - Multiple mechanisms: mixed dense and sparse layers
Size (size:)¶
Use active parameters (not total parameters for MoE models):
- `tiny` - <3B active parameters
- `small` - 3B-10B active parameters
- `mid` - 10B-30B active parameters
- `large` - 30B-70B active parameters
- `xlarge` - 70B+ active parameters
Family (family:)¶
Model family/organization (one per model):
- `phi4`, `qwen3`, `llama`, `granite`, ... (new families are added as new models appear)
`facets{}` (JSON) — Layer 3 analysis dimension
- Example: `{"arch": "moe", "size": "large", "runtime": "vllm"}`
- Decomposed from `groups[]` at point construction time: `["arch:moe", "size:large"]` → `{"arch": "moe", "size": "large"}`
- Purpose: architecture/size/runtime comparisons via scalar JSON path queries
- `groups[]` (VARCHAR[]) is kept in parallel as a backwards-compatible read path until deprecated
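The `groups[]` → `facets{}` decomposition is a small, mechanical transformation. A minimal sketch (the function name is ours, not a ReasonScape API):

```python
def decompose_groups(groups: list[str]) -> dict[str, str]:
    """Split "key:value" tags into a scalar facets dict.

    >>> decompose_groups(["arch:moe", "size:large"])
    {'arch': 'moe', 'size': 'large'}
    """
    return dict(tag.split(":", 1) for tag in groups)
```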
The Task Complexity Plane¶
The Task Complexity Plane defines which problem was tested and where in difficulty space.
Layer 1 — Definition¶
manifold{} (JSON) — Which named grid entry produced this point
- Fields: `id` (named grid entry), `target_tokens` (calibrate-mode coordinate)
- Example: `{"id": "csv_tall", "target_tokens": 8000}`
- Written by the runner onto every challenge; all samples in a bucket share the same manifold identity
- Preserves information destroyed by the Layer 1→2 mapping: `num_rows=440` does not tell you it came from `target_tokens=8000`
- Old NDJSON (no `manifold` key) defaults to `{}`
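To make the lossiness concrete, here is a hypothetical calibration step. The function and its formula are invented for illustration; only the `num_rows=440` / `target_tokens=8000` pairing comes from the example above:

```python
# Hypothetical calibration: map a token budget to concrete generator
# params. The formula is invented; the point is that many different
# targets could land on the same num_rows, so the mapping cannot be
# inverted after the fact.
def calibrate(target_tokens: int) -> dict:
    return {"num_rows": round(target_tokens * 0.055)}

params = calibrate(8000)  # {"num_rows": 440}
# Without the manifold record, 8000 cannot be recovered from 440,
# so the runner stores both on the point:
point = {"manifold": {"id": "csv_tall", "target_tokens": 8000}, "params": params}
```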
Layer 2 — Generation¶
base_task (VARCHAR) - Which problem domain
- Example: "arithmetic", "objects", "shuffle", "boolean"
- Defines the type of reasoning required
params (JSON) - Where in the complexity manifold
- Example: `{"length": 54, "max_depth": 3, "prob_dewhitespace": 0.8}`
- Task-specific difficulty knobs
- An N-dimensional continuous coordinate space
- Defines the exact difficulty configuration
Together: (base_task, params) uniquely identifies a problem instance at a specific difficulty level.
Layer 3 — Analysis¶
views: (config) — Named query recipes per task
- Defined in the experiment config's `views:` list per task; not stored in the database
- Each view has a `view` name, a `group_by` list, optional `filters`, and optional `facet_by`
- Views operate directly on `manifold.*` and `params.*`; nothing is pre-computed onto points
- Example: `{view: depth_length, group_by: ["params.max_depth", "params.length"], filters: {"manifold.id": "single_digit"}}`
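Since a view is nothing more than a named query recipe over these columns, applying one reduces to filter-then-group. A minimal sketch (the helpers are ours, not part of ReasonScape):

```python
from collections import defaultdict

def get_path(point: dict, path: str):
    """Resolve a dotted key such as "params.max_depth" against a point."""
    value = point
    for part in path.split("."):
        value = value[part]
    return value

def apply_view(points: list[dict], view: dict) -> dict:
    """Apply a view's filters, then bucket points by its group_by keys."""
    filters = view.get("filters", {})
    kept = [p for p in points
            if all(get_path(p, k) == v for k, v in filters.items())]
    buckets = defaultdict(list)
    for p in kept:
        buckets[tuple(get_path(p, k) for k in view["group_by"])].append(p)
    return dict(buckets)

# The example view from above, written as a plain dict:
view = {"view": "depth_length",
        "group_by": ["params.max_depth", "params.length"],
        "filters": {"manifold.id": "single_digit"}}
```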
Point Identity¶
A point is uniquely identified by its position in both planes:
```
UNIQUE(model, template, sampler, base_task, params)
       └───── Evaluation ─────┘  └─ Complexity ─┘
```
Two points with the same (evaluation identity, complexity identity) are considered the same measurement.
Example:
```
Point A: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
Point B: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Same point (deduplicated)

Point C: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})
# → Different point (complexity changed)

Point D: ("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Different point (evaluation changed)
```
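Deduplication follows directly from this identity. A sketch of a canonical key (serializing `params` with sorted keys is our assumption about how dict-valued coordinates are compared, not documented behavior):

```python
import json

def point_key(p: dict) -> tuple:
    """Canonical identity: evaluation triple + task + difficulty coordinates."""
    # params is a dict, so serialize it with sorted keys to make the
    # comparison order-independent (an assumption, not documented behavior).
    return (p["model"], p["template"], p["sampler"], p["base_task"],
            json.dumps(p["params"], sort_keys=True))

a = {"model": "Seed-OSS", "template": "zerocot", "sampler": "4k",
     "base_task": "arithmetic", "params": {"length": 10, "depth": 2}}
b = {**a, "params": {"depth": 2, "length": 10}}  # same coordinates, reordered
assert point_key(a) == point_key(b)              # → same point
```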
Summary¶
ReasonScape's two-plane structure provides:
- Clear separation: Evaluation (who/how) vs Complexity (what/where) are orthogonal
- Three-layer symmetry: definition → generation → analysis in both planes
- Flexible navigation: `facets.*` (eval), `manifold.*` and `params.*` (task) dimensions provide researcher-defined views
- Dynamic aggregation: filter, group, or slice by any key of any layer; `views:` in config drives the task plane, `facets.*` drives the eval plane
This structure transforms evaluation:
- From flat benchmarks: `(model, task) → score`
- To spatial analysis: `model → function over complexity manifold`
The result: We can ask not just "who wins?" but "WHERE do they win? HOW do they fail? WHEN does catastrophe occur?"