The ReasonScape Manifold

ReasonScape stores evaluation results using a two-plane, three-layer structure. Each point exists simultaneously in an Evaluation Plane and a Task Complexity Plane.

The planes are symmetric: each has a definition layer, a generation layer, and an analysis layer.

                  LAYER 1              LAYER 2                LAYER 3
                 (definition)         (generation)           (analysis)
                ──────────────       ──────────────         ──────────────
TASK PLANE      manifold{}           base_task              views: (config)
                which grid entry     params{}               query recipes

EVAL PLANE      (none)               model    \              facets{}
                                     template | eval_id      how to cut models
                                     sampler  /

This document explains the design motivations behind this architecture.

Why Two Planes?

Traditional benchmarks are flat: (model, task) → score

But this can't answer:

  • WHERE in complexity space does the model fail?
  • HOW does performance change as difficulty increases?
  • WHAT architectural patterns emerge across difficulty levels?

ReasonScape's two-plane structure enables these questions by:

Separating concerns:

  • Evaluation plane = System Under Test (who/how)
  • Complexity plane = Test Harness (what/where)

Enabling orthogonal variation:

  • Same model tested at many difficulty levels
  • Many models tested at same difficulty level
  • Cross-product: all models × all difficulties

Providing analysis layers per plane:

  • Evaluation Layer 3: group models by architecture, size, runtime (facets{})
  • Task Layer 3: named query recipes defined in config (views:)

Why Three Layers?

To enable semantically rich grouping and filtering.

Layer 1 — Definition space: How was this point defined? The task plane has manifold{} because calibration is a lossy mapping: num_rows=440 does not tell you it came from target_tokens=8000. Layer 1 preserves the information that the Layer 1→2 mapping destroys. For the evaluation plane, the Layer 1 definition is semantically the dataset definition, but since in practice all evaluations come from a single dataset, this column is omitted.

Layer 2 — Generation space: How was this point generated? For the task plane: {base_task, params} — the concrete values passed to the generator. For the eval plane: {model, template, sampler} — the identity triple.

Layer 3 — Analysis space: How does the researcher want to view and slice the data? For the task plane: views: defined in the config per task — each view is a named query recipe (group_by, optional filters, optional facet_by). Nothing is pre-computed onto points; manifold.* and params.* already carry every dimension a view needs. For the eval plane: facets{} — scalar JSON decomposed from groups[] at construction time (["arch:moe", "size:large"] → {"arch": "moe", "size": "large"}). groups[] is kept in parallel as a backwards-compatible read path until deprecated.
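
Putting the layers together, a single point might look like the sketch below. The field names are the ones defined in this document; the values are illustrative, not real measurements:

# One measurement point, both planes, all three layers (illustrative values).
point = {
    # Task plane
    "manifold":  {"id": "csv_tall", "target_tokens": 8000},  # Layer 1: definition
    "base_task": "objects",                                   # Layer 2: generation
    "params":    {"length": 54, "max_depth": 3},              # Layer 2: generation
    # Eval plane
    "model":    "GLM-4.5-Air-AWQ",                            # Layer 2: identity triple
    "template": "zeroshot-nosys",
    "sampler":  "qwen3-think-max",
    "eval_id":  "5e0718",                                     # derived from the triple
    "facets":   {"arch": "moe", "size": "large"},             # Layer 3: analysis
    "groups":   ["arch:moe", "size:large"],                   # legacy read path
}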

The Evaluation Plane

The Evaluation Plane defines which model configuration was tested.

Layer 1

(none - see above)

Layer 2 — Identity Dimensions

model (VARCHAR) - Which language model

  • Example: "Seed-OSS-36B", "MiniMax-M2", "GPT-OSS-20B"
  • Identifies the base model being evaluated

template (VARCHAR) - Which prompt template

  • Example: "zerocot-nosys", "cot-high", "direct"
  • Defines how problems are presented to the model

sampler (VARCHAR) - Which sampling configuration

  • Example: "seedoss-4k", "default", "extended-8k"
  • Defines generation parameters (token budget, temperature, etc.)

Together: (model, template, sampler) uniquely identifies an evaluation configuration — a specific way of testing a model.

eval_id Coordinate System

The evaluation plane is 3-dimensional: every point belongs to exactly one (model, template, sampler) combination. While this uniquely identifies a point's position in evaluation space, filtering by three dimensions is verbose:

{"model": "GLM-4.5-Air-AWQ", "template": "zeroshot-nosys", "sampler": "qwen3-think-max"}

In practice, datasets have low cardinality in this space—typically 5-50 distinct combinations. This makes it convenient to reference each combination with a compact identifier rather than spelling out all three dimensions.

eval_id is a stable 6-character hash computed from the identity triple:

import hashlib

eval_id = hashlib.sha256(f"{model}|{template}|{sampler}".encode()).hexdigest()[:6]
# "GLM-4.5-Air-AWQ|zeroshot-nosys|qwen3-think-max" → "5e0718"

This provides:

  1. Compact filtering: {"eval_id": ["5e0718"]} instead of three separate filters
  2. Stability and caching: Same triple always produces same eval_id, even across different datasets (see the sketch after this list)
  3. Label independence: Changing display labels doesn't affect eval_id
  4. Join key: Enables cross-dataset comparisons on the same model configuration
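
As a sketch, the mapping can be wrapped in a helper. UTF-8 encoding and truncating the hex digest are assumptions consistent with the formula and the 6-character example above:

import hashlib

def compute_eval_id(model: str, template: str, sampler: str) -> str:
    # Only the identity triple enters the hash, so display labels never affect it.
    return hashlib.sha256(f"{model}|{template}|{sampler}".encode()).hexdigest()[:6]

# Same triple in any dataset or process yields the same id (stability, join key).
eid = compute_eval_id("GLM-4.5-Air-AWQ", "zeroshot-nosys", "qwen3-think-max")
# per the example above: "5e0718"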

Layer 3 — Facets

Facets are classification tags enabling peer comparison and cross-cutting analysis. ReasonScape defines three core orthogonal dimensions (architecture, size, and family) that can be combined for complex filtering; additional keys such as runtime may also appear.

Architecture (arch:)

  • dense - Standard dense transformer: all parameters active per token
  • moe - Mixture of Experts: sparse activation via routing
  • ssm - State-space models: recurrent/linear complexity architectures
  • hybrid - Multiple mechanisms: mixed dense and sparse layers

Size (size:)

Sizes are binned by active parameters (for MoE models, active rather than total); a binning sketch follows the list:

  • tiny - <3B active parameters
  • small - 3B-10B active parameters
  • mid - 10B-30B active parameters
  • large - 30B-70B active parameters
  • xlarge - 70B+ active parameters
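
A minimal binning sketch; how the boundaries at exactly 3B/10B/30B/70B are resolved is an assumption, since the document only gives the ranges:

def size_facet(active_params_b: float) -> str:
    """Map active parameter count (in billions) to a size facet."""
    if active_params_b < 3:
        return "tiny"
    if active_params_b < 10:
        return "small"
    if active_params_b < 30:
        return "mid"
    if active_params_b < 70:
        return "large"
    return "xlarge"

size_facet(36)  # e.g. a 36B-active model -> "large"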

Family (family:)

Model family/organization (one per model):

  • phi4, qwen3, llama, granite, ...
  • Additional families are added as new models appear

facets{} (JSON) — Layer 3 analysis dimension

  • Example: {"arch": "moe", "size": "large", "runtime": "vllm"}
  • Decomposed from groups[] at point construction time: ["arch:moe", "size:large"] → {"arch": "moe", "size": "large"} (see the sketch below)
  • Purpose: Architectural/size/runtime comparisons via scalar JSON path queries
  • groups[] (VARCHAR[]) kept in parallel as a backwards-compatible read path until deprecated
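
A minimal sketch of that decomposition, assuming every group tag has the key:value form (tags without a colon are skipped here):

def decompose_groups(groups: list[str]) -> dict[str, str]:
    # ["arch:moe", "size:large"] -> {"arch": "moe", "size": "large"}
    facets = {}
    for tag in groups:
        if ":" in tag:
            key, value = tag.split(":", 1)
            facets[key] = value
    return facets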

The Task Complexity Plane

The Task Complexity Plane defines which problem was tested and where in difficulty space.

Layer 1 — Definition

manifold{} (JSON) — Which named grid entry produced this point

  • Fields: id (named grid entry), target_tokens (calibrate mode coordinate)
  • Example: {"id": "csv_tall", "target_tokens": 8000}
  • Written by the runner onto every challenge; all samples in a bucket share the same manifold identity
  • Preserves information destroyed by the Layer 1→2 mapping: num_rows=440 does not tell you it came from target_tokens=8000
  • Old NDJSON (no manifold key) defaults to {} (see the loader sketch below)
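
A loader sketch for that default; the NDJSON path and per-line record layout are assumptions:

import json

def load_points(path: str):
    # Yield challenge records, defaulting manifold for pre-manifold NDJSON.
    with open(path) as f:
        for line in f:
            point = json.loads(line)
            point.setdefault("manifold", {})  # old NDJSON: no manifold key
            yield point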

Layer 2 — Generation

base_task (VARCHAR) - Which problem domain

  • Example: "arithmetic", "objects", "shuffle", "boolean"
  • Defines the type of reasoning required

params (JSON) - Where in complexity manifold

  • Example: {"length": 54, "max_depth": 3, "prob_dewhitespace": 0.8}
  • Task-specific difficulty knobs
  • N-dimensional continuous coordinate space
  • Defines the exact difficulty configuration

Together: (base_task, params) uniquely identifies a task configuration: a specific problem type at a specific point in difficulty space.

Layer 3 — Analysis

views: (config) — Named query recipes per task

  • Defined in the experiment config's views: list per task; not stored in the database
  • Each view has a view name, group_by list, optional filters, optional facet_by
  • Operate directly on manifold.* and params.* — nothing is pre-computed onto points (a sketch of executing a view follows this list)
  • Example: {view: depth_length, group_by: ["params.max_depth", "params.length"], filters: {"manifold.id": "single_digit"}}
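
A minimal in-memory sketch of executing a view recipe; the real pipeline queries the database, and facet_by and per-bucket aggregation are omitted:

from collections import defaultdict

def dotted_get(point: dict, path: str):
    # Resolve a dotted key like "params.max_depth" against a point record.
    value = point
    for part in path.split("."):
        value = value[part]
    return value

def run_view(points, group_by, filters=None):
    # Filter points, then bucket them by their group_by coordinate tuple.
    buckets = defaultdict(list)
    for p in points:
        if filters and any(dotted_get(p, k) != v for k, v in filters.items()):
            continue
        buckets[tuple(dotted_get(p, k) for k in group_by)].append(p)
    return buckets

# e.g.: run_view(points, group_by=["params.max_depth", "params.length"],
#                filters={"manifold.id": "single_digit"})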

Point Identity

A point is uniquely identified by its position in both planes:

UNIQUE(model, template, sampler, base_task, params)
       └───── Evaluation ─────┘ └── Complexity ──┘

Two points with the same (evaluation identity, complexity identity) are considered the same measurement.

Example:

Point A: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
Point B: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Same point (deduplicated)

Point C: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})
# → Different point (complexity changed)

Point D: ("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Different point (evaluation changed)
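
A dedup-key sketch matching the UNIQUE constraint above; serializing params with sorted keys is an assumption, used to make the JSON dimension hashable and order-independent:

import json

def point_key(p: dict) -> tuple:
    # The Layer 2 coordinates of both planes identify the measurement.
    return (p["model"], p["template"], p["sampler"],
            p["base_task"], json.dumps(p["params"], sort_keys=True))

# Points A and B above collapse to the same key; C and D each get their own.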

Summary

ReasonScape's two-plane structure provides:

  1. Clear separation: Evaluation (who/how) vs Complexity (what/where) are orthogonal
  2. Three-layer symmetry: definition → generation → analysis in both planes
  3. Flexible navigation: facets.* (eval) plus manifold.* and params.* (task) dimensions provide researcher-defined views
  4. Dynamic aggregation: Filter, group, or slice by any key of any layer — views: in the config drive the task plane; facets.* drives the eval plane

This structure transforms evaluation:

  • From flat benchmarks: (model, task) → score
  • To spatial analysis: model → function over the complexity manifold

The result: We can ask not just "who wins?" but "WHERE do they win? HOW do they fail? WHEN does catastrophe occur?"