The ReasonScape Manifold

ReasonScape stores evaluation results using a two-plane, three-layer structure. Each point exists simultaneously in an Evaluation Plane and a Task Complexity Plane.

The planes are symmetric: each has a definition layer, a generation layer, and an analysis layer.

                  LAYER 1              LAYER 2                LAYER 3
                 (definition)         (generation)           (analysis)
                ──────────────       ──────────────         ──────────────
TASK PLANE      manifold{}           base_task              views: (config)
                which grid entry     params{}               query recipes

EVAL PLANE      (none)               model    \              facets{}
                                     template | eval_id      how to cut models
                                     sampler  /

This document explains the design motivations behind this architecture.

Why Two Planes?

Traditional benchmarks are flat: (model, task) → score

But this can't answer:

  • WHERE in complexity space does the model fail?
  • HOW does performance change as difficulty increases?
  • WHAT architectural patterns emerge across difficulty levels?

ReasonScape's two-plane structure enables these questions by:

Separating concerns:

  • Evaluation plane = System Under Test (who/how)
  • Complexity plane = Test Harness (what/where)

Enabling orthogonal variation:

  • Same model tested at many difficulty levels
  • Many models tested at same difficulty level
  • Cross-product: all models × all difficulties

Providing analysis layers per plane:

  • Evaluation Layer 3: group models by architecture, size, runtime (facets{})
  • Task Layer 3: named query recipes defined in config (views:)

Why Three Layers?

To enable semantically rich grouping and filtering.

Layer 1 — Definition space: How was this point defined? The task plane has manifold{} because calibration is a lossy mapping: num_rows=440 does not tell you it came from target_tokens=8000. Layer 1 preserves the information that the Layer 1→2 mapping destroys. For the evaluation plane, the Layer 1 definition is semantically the dataset definition, but since in practice all evaluations come from a single dataset, this column is omitted.

Layer 2 — Generation space: How was this point generated? For the task plane: {base_task, params} — the concrete values passed to the generator. For the eval plane: {model, template, sampler} — the identity triple.

Layer 3 — Analysis space: How does the researcher want to view and slice the data? For the task plane: views: defined in the config per task — each view is a named query recipe (group_by, optional filters, optional facet_by). Nothing is pre-computed onto points; manifold.* and params.* already carry every dimension a view needs. For the eval plane: facets{} — scalar JSON decomposed from groups[] at construction time (["arch:moe", "size:large"] → {"arch": "moe", "size": "large"}). groups[] is kept in parallel as a backwards-compatible read path until deprecated.
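
Putting the layers together, a single point might look like the sketch below. The field names are the ones defined in this document; the values are illustrative, not real measurements:

# One measurement point, both planes, all three layers (illustrative values).
point = {
    # Task plane
    "manifold":  {"id": "csv_tall", "target_tokens": 8000},  # Layer 1: definition
    "base_task": "objects",                                   # Layer 2: generation
    "params":    {"length": 54, "max_depth": 3},              # Layer 2: generation
    # Eval plane
    "model":    "GLM-4.5-Air-AWQ",                            # Layer 2: identity triple
    "template": "zeroshot-nosys",
    "sampler":  "qwen3-think-max",
    "eval_id":  "5e0718",                                     # derived from the triple
    "facets":   {"arch": "moe", "size": "large"},             # Layer 3: analysis
    "groups":   ["arch:moe", "size:large"],                   # legacy read path
}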

The Evaluation Plane

The Evaluation Plane defines which model configuration was tested.

Layer 1

(none - see above)

Layer 2 — Identity Dimensions

model (VARCHAR) - Which language model

  • Example: "Seed-OSS-36B", "MiniMax-M2", "GPT-OSS-20B"
  • Identifies the base model being evaluated

template (VARCHAR) - Which prompt template

  • Example: "zerocot-nosys", "cot-high", "direct"
  • Defines how problems are presented to the model

sampler (VARCHAR) - Which sampling configuration

  • Example: "seedoss-4k", "default", "extended-8k"
  • Defines generation parameters (token budget, temperature, etc.)

Together: (model, template, sampler) uniquely identifies an evaluation configuration — a specific way of testing a model.

eval_id Coordinate System

The evaluation plane is 3-dimensional: every point belongs to exactly one (model, template, sampler) combination. While this uniquely identifies a point's position in evaluation space, filtering by three dimensions is verbose:

{"model": "GLM-4.5-Air-AWQ", "template": "zeroshot-nosys", "sampler": "qwen3-think-max"}

In practice, datasets have low cardinality in this space—typically 5-50 distinct combinations. This makes it convenient to reference each combination with a compact identifier rather than spelling out all three dimensions.

eval_id is a stable 6-character hash computed from the identity triple:

import hashlib

eval_id = hashlib.sha256(f"{model}|{template}|{sampler}".encode()).hexdigest()[:6]
# "GLM-4.5-Air-AWQ|zeroshot-nosys|qwen3-think-max" → "5e0718"

This provides:

  1. Compact filtering: {"eval_id": ["5e0718"]} instead of three separate filters
  2. Stability and caching: Same triple always produces same eval_id, even across different datasets (see the sketch after this list)
  3. Label independence: Changing display labels doesn't affect eval_id
  4. Join key: Enables cross-dataset comparisons on the same model configuration
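
As a sketch, the mapping can be wrapped in a helper. UTF-8 encoding and truncating the hex digest are assumptions consistent with the formula and the 6-character example above:

import hashlib

def compute_eval_id(model: str, template: str, sampler: str) -> str:
    # Only the identity triple enters the hash, so display labels never affect it.
    return hashlib.sha256(f"{model}|{template}|{sampler}".encode()).hexdigest()[:6]

# Same triple in any dataset or process yields the same id (stability, join key).
eid = compute_eval_id("GLM-4.5-Air-AWQ", "zeroshot-nosys", "qwen3-think-max")
# per the example above: "5e0718"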

Layer 3 — Facets

Facets are classification tags enabling peer comparison and cross-cutting analysis. ReasonScape defines three core orthogonal dimensions (architecture, size, and family) that can be combined for complex filtering; additional keys such as runtime may also appear.

Architecture (arch:)

  • dense - Standard dense transformer: all parameters active per token
  • moe - Mixture of Experts: sparse activation via routing
  • ssm - State-space models: recurrent/linear complexity architectures
  • hybrid - Multiple mechanisms: mixed dense and sparse layers

Size (size:)

Sizes are binned by active parameters (for MoE models, active rather than total); a binning sketch follows the list:

  • tiny - <3B active parameters
  • small - 3B-10B active parameters
  • mid - 10B-30B active parameters
  • large - 30B-70B active parameters
  • xlarge - 70B+ active parameters
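
A minimal binning sketch; how the boundaries at exactly 3B/10B/30B/70B are resolved is an assumption, since the document only gives the ranges:

def size_facet(active_params_b: float) -> str:
    """Map active parameter count (in billions) to a size facet."""
    if active_params_b < 3:
        return "tiny"
    if active_params_b < 10:
        return "small"
    if active_params_b < 30:
        return "mid"
    if active_params_b < 70:
        return "large"
    return "xlarge"

size_facet(36)  # e.g. a 36B-active model -> "large"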

Family (family:)

Model family/organization (one per model):

  • phi4, qwen3, llama, granite, ...
  • Additional families are added as new models appear

facets{} (JSON) — Layer 3 analysis dimension

  • Example: {"arch": "moe", "size": "large", "runtime": "vllm"}
  • Decomposed from groups[] at point construction time: ["arch:moe", "size:large"] → {"arch": "moe", "size": "large"} (see the sketch below)
  • Purpose: Architectural/size/runtime comparisons via scalar JSON path queries
  • groups[] (VARCHAR[]) kept in parallel as a backwards-compatible read path until deprecated
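
A minimal sketch of that decomposition, assuming every group tag has the key:value form (tags without a colon are skipped here):

def decompose_groups(groups: list[str]) -> dict[str, str]:
    # ["arch:moe", "size:large"] -> {"arch": "moe", "size": "large"}
    facets = {}
    for tag in groups:
        if ":" in tag:
            key, value = tag.split(":", 1)
            facets[key] = value
    return facets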

The Task Complexity Plane

The Task Complexity Plane defines which problem was tested and where in difficulty space.

Layer 1 — Definition

manifold{} (JSON) — Which named grid entry produced this point

  • Fields: id (named grid entry), target_tokens (calibrate mode coordinate)
  • Example: {"id": "csv_tall", "target_tokens": 8000}
  • Written by the runner onto every challenge; all samples in a bucket share the same manifold identity
  • Preserves information destroyed by the Layer 1→2 mapping: num_rows=440 does not tell you it came from target_tokens=8000
  • Old NDJSON (no manifold key) defaults to {} (see the loader sketch below)
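
A loader sketch for that default; the NDJSON path and per-line record layout are assumptions:

import json

def load_points(path: str):
    # Yield challenge records, defaulting manifold for pre-manifold NDJSON.
    with open(path) as f:
        for line in f:
            point = json.loads(line)
            point.setdefault("manifold", {})  # old NDJSON: no manifold key
            yield point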

Layer 2 — Generation

base_task (VARCHAR) - Which problem domain

  • Example: "arithmetic", "objects", "shuffle", "boolean"
  • Defines the type of reasoning required

params (JSON) - Where in complexity manifold

  • Example: {"length": 54, "max_depth": 3, "prob_dewhitespace": 0.8}
  • Task-specific difficulty knobs
  • N-dimensional continuous coordinate space
  • Defines the exact difficulty configuration

Together: (base_task, params) uniquely identifies a task configuration: a specific problem type at a specific point in difficulty space.

Layer 3 — Analysis

views: (config) — Named query recipes per task

  • Defined in the experiment config's views: list per task; not stored in the database
  • Each view has a view name, group_by list, optional filters, optional facet_by
  • Operate directly on manifold.* and params.* — nothing is pre-computed onto points (a sketch of executing a view follows this list)
  • Example: {view: depth_length, group_by: ["params.max_depth", "params.length"], filters: {"manifold.id": "single_digit"}}
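
A minimal in-memory sketch of executing a view recipe; the real pipeline queries the database, and facet_by and per-bucket aggregation are omitted:

from collections import defaultdict

def dotted_get(point: dict, path: str):
    # Resolve a dotted key like "params.max_depth" against a point record.
    value = point
    for part in path.split("."):
        value = value[part]
    return value

def run_view(points, group_by, filters=None):
    # Filter points, then bucket them by their group_by coordinate tuple.
    buckets = defaultdict(list)
    for p in points:
        if filters and any(dotted_get(p, k) != v for k, v in filters.items()):
            continue
        buckets[tuple(dotted_get(p, k) for k in group_by)].append(p)
    return buckets

# e.g.: run_view(points, group_by=["params.max_depth", "params.length"],
#                filters={"manifold.id": "single_digit"})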

Point Identity

A point is uniquely identified by its position in both planes:

UNIQUE(model, template, sampler, base_task, params)
       └───── Evaluation ─────┘ └── Complexity ──┘

Two points with the same (evaluation identity, complexity identity) are considered the same measurement.

Example:

Point A: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
Point B: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Same point (deduplicated)

Point C: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})
# → Different point (complexity changed)

Point D: ("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Different point (evaluation changed)
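
A dedup-key sketch matching the UNIQUE constraint above; serializing params with sorted keys is an assumption, used to make the JSON dimension hashable and order-independent:

import json

def point_key(p: dict) -> tuple:
    # The Layer 2 coordinates of both planes identify the measurement.
    return (p["model"], p["template"], p["sampler"],
            p["base_task"], json.dumps(p["params"], sort_keys=True))

# Points A and B above collapse to the same key; C and D each get their own.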

Summary

ReasonScape's two-plane structure provides:

  1. Clear separation: Evaluation (who/how) vs Complexity (what/where) are orthogonal
  2. Three-layer symmetry: definition → generation → analysis in both planes
  3. Flexible navigation: facets.* (eval) plus manifold.* and params.* (task) dimensions provide researcher-defined views
  4. Dynamic aggregation: Filter, group, or slice by any key of any layer — views: in the config drive the task plane; facets.* drives the eval plane

This structure transforms evaluation:

  • From flat benchmarks: (model, task) → score
  • To spatial analysis: model → function over the complexity manifold

The result: We can ask not just "who wins?" but "WHERE do they win? HOW do they fail? WHEN does catastrophe occur?"