The ReasonScape Manifold: Design Intent

Overview

ReasonScape stores evaluation results in a two-plane structure, where each point exists simultaneously in both an Evaluation Plane and a Task Complexity Plane.

Each plane has its own identity dimensions (what defines points in that plane) and facet dimensions (how to organize points in that plane).

                EVALUATION      TASK-COMPLEXITY
IDENTITY (5D)   - model         - base_task
                - template      - params
                - sampler
FACETS          - eval_id       - tiers[]
                - groups[]      - surfaces[]
                                - projections[]

This document explains the design motivations behind this architecture.

Why Two Planes?

Traditional benchmarks are flat: (model, task) → score

But this can't answer:

  • WHERE in complexity space does the model fail?
  • HOW does performance change as difficulty increases?
  • WHAT architectural patterns emerge across difficulty levels?

ReasonScape's two-plane structure enables these questions by:

Separating concerns:

  • Evaluation plane = System Under Test (who/how)
  • Complexity plane = Test Harness (what/where)

Enabling orthogonal variation:

  • Same model tested at many difficulty levels
  • Many models tested at same difficulty level
  • Cross-product: all models × all difficulties

Providing organizational facets per plane:

  • Evaluation facets group models (eval_id, groups[])
  • Complexity facets organize difficulty (tiers[], surfaces[], projections[])

The Evaluation Plane

Purpose: Define which model configuration was tested

Identity Dimensions

model (VARCHAR) - Which language model

  • Example: "Seed-OSS-36B", "MiniMax-M2", "GPT-OSS-20B"
  • Identifies the base model being evaluated

template (VARCHAR) - Which prompt template

  • Example: "zerocot-nosys", "cot-high", "direct"
  • Defines how problems are presented to the model

sampler (VARCHAR) - Which sampling configuration

  • Example: "seedoss-4k", "default", "extended-8k"
  • Defines generation parameters (token budget, temperature, etc.)

Together: (model, template, sampler) uniquely identifies an evaluation configuration - a specific way of testing a model.

Facet Dimensions

eval_id (INTEGER) - Which evaluation scenario

  • Computed from config.evals[] matching
  • Maps (model, template, sampler) → integer ID
  • Purpose: Shorthand for filtering, display labels
  • Cardinality: Singular (one eval per point)

groups[] (VARCHAR[]) - Which model categories

  • Example: ["arch:moe", "size:36B", "runtime:vllm"]
  • Assigned from config.evals[].groups metadata
  • Purpose: Architectural/size/runtime comparisons
  • Cardinality: Multi-valued (overlapping tags)

Why These Facets?

eval_id provides convenient shorthand:

  • Instead of {"model": "X", "template": "Y", "sampler": "Z"}
  • Just use {"eval_id": 0}
  • Connects directly to the analyze.py evals subcommand, enabling forward and reverse metadata discovery and search.

groups[] enables cross-cutting analysis:

  • Compare all MoE models: {"groups": [["arch:moe"]]}
  • Compare large models: {"groups": [["size:large"]]}
  • Compare across multiple attributes: {"groups": [["arch:moe", "size:large"]]}
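
To make the filter syntax concrete, here is a minimal sketch of how a groups[] filter could be matched against a point's tags. It assumes, based on the examples above, that the outer list is OR across alternatives and each inner list is AND of required tags; the matches_groups helper is hypothetical, not part of ReasonScape.

# Hypothetical sketch of groups[] filter matching. Assumes outer list = OR
# across alternatives, inner list = AND of required tags.
def matches_groups(point_groups: list[str], group_filter: list[list[str]]) -> bool:
    """Return True if the point satisfies at least one tag combination."""
    return any(all(tag in point_groups for tag in combo) for combo in group_filter)

# A point tagged as a large MoE model served by vLLM:
point_groups = ["arch:moe", "size:large", "runtime:vllm"]

print(matches_groups(point_groups, [["arch:moe"]]))                # True
print(matches_groups(point_groups, [["arch:moe", "size:large"]]))  # True (both tags present)
print(matches_groups(point_groups, [["arch:dense"]]))              # False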

The Task Complexity Plane

Purpose: Define which problem was tested and where in difficulty space

Identity Dimensions

base_task (VARCHAR) - Which problem domain

  • Example: "arithmetic", "objects", "shuffle", "boolean"
  • Defines the type of reasoning required
  • Each task has different complexity dimensions

params (JSON) - Where in complexity manifold

  • Example: {"length": 54, "max_depth": 3, "prob_dewhitespace": 0.8}
  • Task-specific difficulty knobs
  • N-dimensional continuous coordinate space
  • Defines the exact difficulty configuration

Together: (base_task, params) uniquely identifies a problem instance - a specific test case at a specific difficulty level.

Facet Dimensions

tiers[] (VARCHAR[]) - Which difficulty levels

  • Example: ["easy", "medium"]
  • Computed from config.tiers[] filter matching during evaluation
  • Purpose: Coarse-grained difficulty groupings
  • Cardinality: Multi-valued (overlapping levels)

surfaces[] (VARCHAR[]) - Which 2D visualization slices

  • Example: ["arithmetic_easy", "arithmetic_length_x_depth"]
  • Computed from param filter matching during post-processing
  • Purpose: 3D surface plots (z=accuracy over x,y grid)
  • Cardinality: Multi-valued (point appears in multiple surfaces)

projections[] (VARCHAR[]) - Which 1D analysis sweeps

  • Example: ["arithmetic_length_sweep", "arithmetic_depth_sweep"]
  • Computed from param filter matching during post-processing
  • Purpose: FFT analysis, parameter sensitivity
  • Cardinality: Multi-valued (point in multiple projections)

Why These Facets?

tiers[] provide semantic difficulty groupings:

  • Instead of a complex, task-specific set of filters like {"base_task": "arithmetic", "params": {"max_depth": [1, 2]}}, just use {"tiers": ["easy"]}
  • Enables fair ranking (compare models at same difficulty)

surfaces[] enable spatial visualization:

  • Group points into 2D slices for heatmap/surface plots
  • Identify capability cliffs (where performance drops)
  • Map "green zones" (model succeeds) vs "red cliffs" (model fails)

projections[] enable parameter analysis:

  • Sweep one parameter while holding others fixed
  • FFT analysis reveals periodic patterns
  • Sensitivity analysis shows which parameters matter most
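
To make the FFT idea concrete, here is an illustrative numpy sketch (synthetic data, not actual ReasonScape output or analysis code) of running an FFT over accuracy values collected along a single projection sweep:

# Illustrative only: FFT over a 1D projection sweep to look for periodic
# structure in accuracy. The data below is synthetic.
import numpy as np

# Accuracy along a hypothetical "arithmetic_length_sweep" projection,
# sampled at length = 4, 6, ..., 62 with all other params held fixed.
lengths = np.arange(4, 64, 2)
accuracy = 0.9 - 0.005 * lengths + 0.05 * np.sin(lengths / 3.0)

# Subtract the mean so the DC component doesn't dominate, then take the
# magnitude spectrum of the real-valued signal.
spectrum = np.abs(np.fft.rfft(accuracy - accuracy.mean()))
freqs = np.fft.rfftfreq(len(accuracy), d=2)  # d = spacing between length samples

# Peaks away from freq=0 indicate periodic sensitivity to the swept parameter.
print(list(zip(freqs.round(3), spectrum.round(3))))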

The Identity vs Facet Model

Each plane has identity dimensions and facet dimensions:

Evaluation Plane:

Identity (discrete)    Facets (computed)
model                  eval_id
template               groups[]
sampler

Task Complexity Plane:

Identity (hybrid)      Facets (computed)
base_task              tiers[]
params (continuous)    surfaces[]
                       projections[]

Key property: Facets organize points within their own plane

  • Evaluation facets don't organize complexity (eval_id doesn't care about params)
  • Complexity facets don't organize evaluations (surfaces don't care about model)

Point Identity

A point is uniquely identified by its position in both planes:

UNIQUE(model, template, sampler, base_task, params)
       └───── Evaluation ─────┘ └── Complexity ──┘

Two points with the same (evaluation identity, complexity identity) are considered the same measurement.

Example:

Point A: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
Point B: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Same point (deduplicated)

Point C: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})
# → Different point (complexity changed)

Point D: ("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Different point (evaluation changed)
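
A minimal sketch of how such an identity key could be built, assuming params can be canonicalised as sorted-key JSON (the point_key helper is hypothetical, not ReasonScape's actual implementation):

import json

def point_key(model, template, sampler, base_task, params):
    """Hashable identity used to deduplicate raw runner data."""
    return (model, template, sampler, base_task,
            json.dumps(params, sort_keys=True))

a = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
b = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"depth": 2, "length": 10})
c = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})

print(a == b)  # True  -> same measurement, deduplicated
print(a == c)  # False -> complexity changed, different point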

Orthogonality

The planes are orthogonal - they vary independently:

Same evaluation, different complexity

# Test Seed-OSS on easy arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})

# Test Seed-OSS on hard arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 100})

# Evaluation plane unchanged ✓
# Complexity plane changed ✓

Same complexity, different evaluation

# Test Seed-OSS on arithmetic length=10
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})

# Test MiniMax on arithmetic length=10
("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10})

# Complexity plane unchanged ✓
# Evaluation plane changed ✓

Different in both planes

# Test Seed-OSS on easy arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})

# Test MiniMax on hard boolean
("MiniMax", "cot-high", "default", "boolean", {"depth": 5})

# Both planes changed ✓

This orthogonality enables:

  • Model comparison: same complexity, multiple models
  • Difficulty analysis: same model, multiple complexities
  • Full cross-product: all models × all complexities

Facet Computation

Facets are computed, not stored in raw evaluation data.

Evaluation Facets (from evaluation identity)

eval_id - Lookup in config.evals[]:

# config.evals[0]
{
    "label": "Seed-OSS 36B (4k budget)",
    "filters": {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"}
}

# Point with (Seed-OSS-36B, zerocot, 4k) gets: eval_id = 0

groups[] - Extract from matched eval:

# config.evals[0]
{
    "label": "Seed-OSS 36B",
    "filters": {...},
    "groups": ["arch:dense", "size:36B", "runtime:vllm"]
}

# Point matching eval 0 gets: groups = ["arch:dense", "size:36B", "runtime:vllm"]
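
Putting both together, here is an illustrative sketch of evaluation-facet computation (the exact matching rules in ReasonScape may differ): each point is matched against config.evals[] filters, and the first match supplies eval_id and groups[].

def assign_eval_facets(point: dict, evals: list[dict]) -> dict:
    """Return eval_id and groups[] for a point's evaluation identity."""
    for eval_id, entry in enumerate(evals):
        if all(point.get(k) == v for k, v in entry["filters"].items()):
            return {"eval_id": eval_id, "groups": entry.get("groups", [])}
    return {"eval_id": None, "groups": []}

evals = [{
    "label": "Seed-OSS 36B (4k budget)",
    "filters": {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"},
    "groups": ["arch:dense", "size:36B", "runtime:vllm"],
}]

point = {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"}
print(assign_eval_facets(point, evals))
# {'eval_id': 0, 'groups': ['arch:dense', 'size:36B', 'runtime:vllm']}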

Complexity Facets (from complexity identity)

tiers[] - Mapped from (degree, density) at evaluation time:

# config.tiers - defines how degrees/densities map to tier labels
[
    {"label": "easy", "filters": {"degrees": ["0"], "densities": ["normal"]}},
    {"label": "medium", "filters": {"degrees": ["1"], "densities": ["normal"]}},
    {"label": "hard", "filters": {"degrees": ["2"], "densities": ["normal"]}}
]

# During evaluation, (degree=0, density=normal) → tiers = ["easy"]
# Points are stored with semantic tier labels, not raw degrees/densities
# A point spanning multiple difficulty levels: tiers = ["easy", "medium"]
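
An illustrative sketch of that mapping (the real evaluation-time logic may differ); because tier filters are allowed to overlap, a point can collect more than one label:

def assign_tiers(degree: str, density: str, tiers: list[dict]) -> list[str]:
    """Collect every tier whose degree/density filters include the point."""
    return [t["label"] for t in tiers
            if degree in t["filters"]["degrees"]
            and density in t["filters"]["densities"]]

tiers_config = [
    {"label": "easy",   "filters": {"degrees": ["0"], "densities": ["normal"]}},
    {"label": "medium", "filters": {"degrees": ["1"], "densities": ["normal"]}},
    {"label": "hard",   "filters": {"degrees": ["2"], "densities": ["normal"]}},
]

print(assign_tiers("0", "normal", tiers_config))  # ['easy']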

surfaces[] - Param filter intersection per task:

# config.basetasks['arithmetic'].surfaces[0]
{
    "id": "arithmetic_easy",
    "filter": {"max_depth": [1, 2]}
}

# Point with base_task="arithmetic" AND max_depth=1:
# → surfaces = ["arithmetic_easy", ...other matching surfaces]

projections[] - Param filter intersection per task:

# config.basetasks['arithmetic'].projections[0]
{
    "id": "arithmetic_length_sweep",
    "axis": "length",
    "filter": {"max_depth": 2}
}

# Point with base_task="arithmetic" AND max_depth=2:
# → projections = ["arithmetic_length_sweep"]
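
Both surfaces[] and projections[] rely on the same kind of param-filter intersection. A hypothetical sketch, inferring from the examples above that a list filter value means "param must be one of these" and a scalar means "param must equal this":

def matches_filter(params: dict, flt: dict) -> bool:
    """True if every filter key is satisfied by the point's params."""
    for key, allowed in flt.items():
        value = params.get(key)
        if isinstance(allowed, list):
            if value not in allowed:
                return False
        elif value != allowed:
            return False
    return True

def assign_facet(params: dict, entries: list[dict]) -> list[str]:
    """Collect the ids of every surface/projection whose filter matches."""
    return [e["id"] for e in entries if matches_filter(params, e.get("filter", {}))]

surfaces = [{"id": "arithmetic_easy", "filter": {"max_depth": [1, 2]}}]
projections = [{"id": "arithmetic_length_sweep", "axis": "length",
                "filter": {"max_depth": 2}}]

params = {"length": 54, "max_depth": 2}
print(assign_facet(params, surfaces))     # ['arithmetic_easy']
print(assign_facet(params, projections))  # ['arithmetic_length_sweep']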

Motivations For This Structure

1. Enables Spatial Reasoning

Traditional benchmarks: model → scalar score
ReasonScape: model → function over complexity manifold

You can ask:

  • WHERE does this model fail? (surface visualization)
  • HOW does performance degrade? (projection analysis)
  • WHEN does catastrophic failure occur? (hazard analysis)

2. Supports Multiple Analysis Levels

The two-plane structure enables hierarchical aggregation:

Point-level: Individual (evaluation, complexity) result

Surface-level: Aggregate over complexity regions (2D slices)

Task-level: Aggregate over all complexity for one task

Tier-level: Aggregate over tasks at same difficulty

Overall: Single score across everything

Each level answers different questions:

  • Point: What happened here?
  • Surface: Where are the cliffs?
  • Task: How good at this reasoning type?
  • Tier: How good at this difficulty?
  • Overall: How good overall?
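
A minimal pure-Python sketch of this roll-up, using synthetic points and illustrative field names (the stored schema may differ):

from collections import defaultdict
from statistics import mean

points = [
    {"base_task": "arithmetic", "tiers": ["easy"],   "accuracy": 0.95},
    {"base_task": "arithmetic", "tiers": ["medium"], "accuracy": 0.70},
    {"base_task": "boolean",    "tiers": ["easy"],   "accuracy": 0.90},
]

# Task-level: aggregate over all complexity points of one task.
by_task = defaultdict(list)
for p in points:
    by_task[p["base_task"]].append(p["accuracy"])
print({task: mean(vals) for task, vals in by_task.items()})

# Tier-level: aggregate over tasks at the same difficulty (multi-valued facet).
by_tier = defaultdict(list)
for p in points:
    for tier in p["tiers"]:
        by_tier[tier].append(p["accuracy"])
print({tier: mean(vals) for tier, vals in by_tier.items()})

# Overall: single score across everything.
print(mean(p["accuracy"] for p in points))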

3. Provides Flexible Navigation

Facets enable semantic filtering without complex identity queries:

Without facets:

# Must know exact identity coordinates
filters = {
    "model": "Seed-OSS-36B",
    "template": "zerocot-nosys",
    "sampler": "seedoss-4k",
    "base_task": "arithmetic",
    "params": {"max_depth": [1, 2]}  # Must know param ranges
}

With facets:

# Use semantic labels
filters = {
    "eval_id": 0,        # Which model config
    "tiers": ["easy"]    # Which difficulty
}

4. Separates Concerns Cleanly

Evaluation plane: System under test

  • Research question: Which model/prompting works best?
  • Facets: Group by architecture, size, runtime

Complexity plane: Test harness

  • Research question: Where do models struggle?
  • Facets: Group by difficulty, visualize by surfaces, analyze by projections

This separation means:

  • Can design test harness (tasks/params) independently of models
  • Can compare new models on existing complexity manifolds
  • Can analyze single model across complexity space

5. Enables Rich Comparative Analysis

Compare within evaluation plane (same complexity, different models):

# Which model is best on easy arithmetic?
filters = {
    "base_task": "arithmetic",
    "tiers": ["easy"]
}
# Group by eval_id → ranking

Compare within complexity plane (same model, different difficulties):

# How does Seed-OSS perform across difficulty levels?
filters = {
    "eval_id": 0,  # Seed-OSS
    "base_task": "arithmetic"
}
# Group by tiers → difficulty curve

Compare across both planes (multiple models, multiple difficulties):

# Which architectural pattern handles complexity best?
filters = {
    "groups": [["arch:moe"], ["arch:dense"]],
    "base_task": "arithmetic"
}
# Group by groups + tiers → architectural comparison across difficulty

Design Principles

1. Identity vs Organization

Identity dimensions define what a point IS (immutable after evaluation).
Facet dimensions organize how points are VIEWED (recomputable from config).

This separation means:

  • Raw evaluation data is minimal (identity + results)
  • Organizational views are flexible (change config, recompute facets)
  • No need to re-evaluate to change organization

2. Planes Are Orthogonal

Evaluation and complexity are independent:

  • Can test any model on any complexity
  • Facets don't cross planes (eval facets ≠ complexity facets)
  • Analysis can focus on one plane or explore joint space

3. Facets Are Multi-Valued

Points can belong to multiple facet values:

  • A point can be in multiple tiers (e.g., "easy" AND "shallow")
  • A point can be in multiple surfaces (overlapping 2D slices)
  • A point can be in multiple projections (same point swept by multiple axes)

This enables:

  • Overlapping analysis views
  • Cross-cutting comparisons
  • Flexible aggregation

4. Continuous Complexity

params is not a discrete grid - it's a continuous manifold:

  • length, depth, prob_dewhitespace are real-valued
  • Runner samples this space (corner/lowdef/normal/highdef strategies)
  • Analysis can treat it spatially (interpolate, visualize surfaces, FFT)

This enables:

  • Spatial reasoning about difficulty
  • Identification of complexity cliffs
  • Mapping parameter interactions

Summary

ReasonScape's two-plane structure provides:

  1. Clear separation: Evaluation (who/how) vs Complexity (what/where) are orthogonal!
  2. Unique point identity: Automatic de-duplication of raw runner data for efficient storage and analysis
  3. Spatial reasoning: Dimensions inside each plane form an intrinsic data hierarchy
  4. Flexible navigation: Facets provide semantic organization
  5. Dynamic aggregation: Surface → Task or Model → Tier → Task - it's up to you!

This structure transforms evaluation from:

  • Flat benchmarks: (model, task) → score
  • To spatial analysis: model → function over complexity manifold

The result: We can ask not just "who wins?" but "WHERE do they win? HOW do they fail? WHEN does catastrophe occur?"