The ReasonScape Manifold: Design Intent¶
Overview¶
ReasonScape stores evaluation results in a two-plane structure, where each point exists simultaneously in both an Evaluation Plane and a Task Complexity Plane.
Each plane has its own identity dimensions (what defines points in that plane) and facet dimensions (how to organize points in that plane).
| | EVALUATION | TASK-COMPLEXITY |
|---|---|---|
| IDENTITY (5D) | model, template, sampler | base_task, params |
| FACETS | eval_id, groups[] | tiers[], surfaces[], projections[] |
This document explains the design motivations behind this architecture.
Why Two Planes?¶
Traditional benchmarks are flat: (model, task) → score
But this can't answer:
- WHERE in complexity space does the model fail?
- HOW does performance change as difficulty increases?
- WHAT architectural patterns emerge across difficulty levels?
ReasonScape's two-plane structure enables these questions by:
Separating concerns:
- Evaluation plane = System Under Test (who/how)
- Complexity plane = Test Harness (what/where)
Enabling orthogonal variation:
- Same model tested at many difficulty levels
- Many models tested at same difficulty level
- Cross-product: all models × all difficulties
Providing organizational facets per plane:
- Evaluation facets group models (eval_id, groups[])
- Complexity facets organize difficulty (tiers[], surfaces[], projections[])
The Evaluation Plane¶
Purpose: Define which model configuration was tested
Identity Dimensions¶
model (VARCHAR) - Which language model
- Example: "Seed-OSS-36B", "MiniMax-M2", "GPT-OSS-20B"
- Identifies the base model being evaluated
template (VARCHAR) - Which prompt template
- Example: "zerocot-nosys", "cot-high", "direct"
- Defines how problems are presented to the model
sampler (VARCHAR) - Which sampling configuration
- Example: "seedoss-4k", "default", "extended-8k"
- Defines generation parameters (token budget, temperature, etc.)
Together: (model, template, sampler) uniquely identifies an evaluation configuration - a specific way of testing a model.
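As a minimal sketch (the field names mirror the description above, but the structure itself is illustrative rather than the actual ReasonScape schema), the evaluation identity can be treated as an immutable 3-field key:

```python
from typing import NamedTuple

class EvalIdentity(NamedTuple):
    """Illustrative evaluation-plane identity: (model, template, sampler)."""
    model: str      # which language model
    template: str   # which prompt template
    sampler: str    # which sampling configuration

# Two configurations that differ in any field are distinct evaluations
a = EvalIdentity("Seed-OSS-36B", "zerocot-nosys", "seedoss-4k")
b = EvalIdentity("Seed-OSS-36B", "cot-high", "seedoss-4k")
assert a != b  # changing the template changes the evaluation identity
```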
Facet Dimensions¶
eval_id (INTEGER) - Which evaluation scenario
- Computed from config.evals[] matching
- Maps (model, template, sampler) → integer ID
- Purpose: Shorthand for filtering, display labels
- Cardinality: Singular (one eval per point)
groups[] (VARCHAR[]) - Which model categories
- Example: ["arch:moe", "size:36B", "runtime:vllm"]
- Assigned from config.evals[].groups metadata
- Purpose: Architectural/size/runtime comparisons
- Cardinality: Multi-valued (overlapping tags)
Why These Facets?¶
eval_id provides convenient shorthand:
- Instead of {"model": "X", "template": "Y", "sampler": "Z"}
- Just use {"eval_id": 0}
- Directly connects with the analyze.py evals subcommand to enable forward and reverse metadata discovery and search
groups[] enables cross-cutting analysis:
- Compare all MoE models: {"groups": [["arch:moe"]]}
- Compare large models: {"groups": [["size:large"]]}
- Compare across multiple attributes: {"groups": [["arch:moe", "size:large"]]}
The Task Complexity Plane¶
Purpose: Define which problem was tested and where in difficulty space
Identity Dimensions¶
base_task (VARCHAR) - Which problem domain
- Example: "arithmetic", "objects", "shuffle", "boolean"
- Defines the type of reasoning required
- Each task has different complexity dimensions
params (JSON) - Where in complexity manifold
- Example: {"length": 54, "max_depth": 3, "prob_dewhitespace": 0.8}
- Task-specific difficulty knobs
- N-dimensional continuous coordinate space
- Defines the exact difficulty configuration
Together: (base_task, params) uniquely identifies a problem instance - a specific test case at a specific difficulty level.
Facet Dimensions¶
tiers[] (VARCHAR[]) - Which difficulty levels
- Example: ["easy", "medium"]
- Computed from config.tiers[] filter matching during evaluation
- Purpose: Coarse-grained difficulty groupings
- Cardinality: Multi-valued (overlapping levels)
surfaces[] (VARCHAR[]) - Which 2D visualization slices
- Example: ["arithmetic_easy", "arithmetic_length_x_depth"]
- Computed from param filter matching during post-processing
- Purpose: 3D surface plots (z=accuracy over x,y grid)
- Cardinality: Multi-valued (point appears in multiple surfaces)
projections[] (VARCHAR[]) - Which 1D analysis sweeps
- Example: ["arithmetic_length_sweep", "arithmetic_depth_sweep"]
- Computed from param filter matching during post-processing
- Purpose: FFT analysis, parameter sensitivity
- Cardinality: Multi-valued (point in multiple projections)
Why These Facets?¶
tiers[] provide semantic difficulty groupings:
- Instead of a complex task-specific set of filters like {"base_task": "arithmetic", "params": {"max_depth": [1, 2]}}, just use {"tiers": ["easy"]}
- Enables fair ranking (compare models at same difficulty)
surfaces[] enable spatial visualization:
- Group points into 2D slices for heatmap/surface plots
- Identify capability cliffs (where performance drops)
- Map "green zones" (model succeeds) vs "red cliffs" (model fails)
projections[] enable parameter analysis:
- Sweep one parameter while holding others fixed
- FFT analysis reveals periodic patterns
- Sensitivity analysis shows which parameters matter most
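As an illustrative sketch (synthetic values, not the actual analysis pipeline), a projection reduces to a 1D accuracy-vs-parameter series, which can then be fed to an FFT to look for periodic structure along the sweep:

```python
import numpy as np

# Hypothetical projection: accuracy sampled along a "length" sweep,
# all other params held fixed (values are made up for illustration).
lengths = np.arange(10, 110, 10)
accuracy = np.array([0.98, 0.95, 0.90, 0.88, 0.70, 0.65, 0.40, 0.35, 0.20, 0.15])

# Remove the mean so the FFT highlights oscillation rather than the overall level
spectrum = np.abs(np.fft.rfft(accuracy - accuracy.mean()))
print(spectrum)  # peaks would indicate periodic accuracy patterns along the sweep
```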
The Identity vs Facet Model¶
Each plane has identity dimensions and facet dimensions:
Evaluation Plane:
| Identity (discrete) | Facets (computed) |
|---|---|
| model | eval_id |
| template | groups[] |
| sampler | |
Task Complexity Plane:
| Identity (hybrid) | Facets (computed) |
|---|---|
| base_task | tiers[] |
| params (continuous) | surfaces[] |
| | projections[] |
Key property: Facets organize points within their own plane
- Evaluation facets don't organize complexity (eval_id doesn't care about params)
- Complexity facets don't organize evaluations (surfaces don't care about model)
Point Identity¶
A point is uniquely identified by its position in both planes:
UNIQUE(model, template, sampler, base_task, params)
       └───── Evaluation ─────┘ └── Complexity ──┘
Two points with the same (evaluation identity, complexity identity) are considered the same measurement.
Example:
Point A: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
Point B: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Same point (deduplicated)
Point C: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})
# → Different point (complexity changed)
Point D: ("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Different point (evaluation changed)
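A minimal sketch of how this deduplication might be keyed (illustrative only; canonicalizing params as sorted-key JSON is an assumption, not necessarily what ReasonScape does internally):

```python
import json

def point_key(model, template, sampler, base_task, params: dict) -> tuple:
    """Build a hashable identity key spanning both planes."""
    # Canonicalize the params dict so key ordering does not matter
    return (model, template, sampler, base_task, json.dumps(params, sort_keys=True))

a = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
b = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"depth": 2, "length": 10})
c = point_key("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})

assert a == b   # same measurement -> deduplicated
assert a != c   # evaluation identity changed -> different point
```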
Orthogonality¶
The planes are orthogonal - they vary independently:
Same evaluation, different complexity¶
# Test Seed-OSS on easy arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})
# Test Seed-OSS on hard arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 100})
# Evaluation plane unchanged ✓
# Complexity plane changed ✓
Same complexity, different evaluation¶
# Test Seed-OSS on arithmetic length=10
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})
# Test MiniMax on arithmetic length=10
("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10})
# Complexity plane unchanged ✓
# Evaluation plane changed ✓
Different in both planes¶
# Test Seed-OSS on easy arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})
# Test MiniMax on hard boolean
("MiniMax", "cot-high", "default", "boolean", {"depth": 5})
# Both planes changed ✓
This orthogonality enables:
- Model comparison: Same complexity, multiple models
- Difficulty analysis: Same model, multiple complexities
- Full cross-product: All models × all complexities
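Because the planes vary independently, building a full test matrix is just a cross-product. A small illustrative sketch (hypothetical configurations, not the actual runner):

```python
from itertools import product

evaluations = [
    ("Seed-OSS", "zerocot", "4k"),
    ("MiniMax", "cot-high", "default"),
]
complexities = [
    ("arithmetic", {"length": 10}),
    ("arithmetic", {"length": 100}),
    ("boolean", {"depth": 5}),
]

# All models x all difficulties: 2 evaluations * 3 complexity points = 6 points
for (model, template, sampler), (base_task, params) in product(evaluations, complexities):
    print(model, template, sampler, base_task, params)
```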
Facet Computation¶
Facets are computed, not stored in raw evaluation data.
Evaluation Facets (from evaluation identity)¶
eval_id - Lookup in config.evals[]:
# config.evals[0]
{
"label": "Seed-OSS 36B (4k budget)",
"filters": {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"}
}
# Point with (Seed-OSS-36B, zerocot, 4k) gets: eval_id = 0
groups[] - Extract from matched eval:
# config.evals[0]
{
"label": "Seed-OSS 36B",
"filters": {...},
"groups": ["arch:dense", "size:36B", "runtime:vllm"]
}
# Point matching eval 0 gets: groups = ["arch:dense", "size:36B", "runtime:vllm"]
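A minimal sketch of this facet computation, assuming config.evals[] entries carry label/filters/groups fields as shown above (the matching helper itself is illustrative):

```python
def compute_eval_facets(point: dict, evals: list[dict]) -> dict:
    """Assign eval_id and groups[] by matching the point against config.evals[]."""
    for eval_id, entry in enumerate(evals):
        filters = entry.get("filters", {})
        if all(point.get(key) == value for key, value in filters.items()):
            return {"eval_id": eval_id, "groups": entry.get("groups", [])}
    return {"eval_id": None, "groups": []}  # no matching eval scenario

evals = [{
    "label": "Seed-OSS 36B (4k budget)",
    "filters": {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"},
    "groups": ["arch:dense", "size:36B", "runtime:vllm"],
}]
point = {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"}
print(compute_eval_facets(point, evals))
# {'eval_id': 0, 'groups': ['arch:dense', 'size:36B', 'runtime:vllm']}
```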
Complexity Facets (from complexity identity)¶
tiers[] - Mapped from (degree, density) at evaluation time:
# config.tiers - defines how degrees/densities map to tier labels
[
{"label": "easy", "filters": {"degrees": ["0"], "densities": ["normal"]}},
{"label": "medium", "filters": {"degrees": ["1"], "densities": ["normal"]}},
{"label": "hard", "filters": {"degrees": ["2"], "densities": ["normal"]}}
]
# During evaluation, (degree=0, density=normal) → tiers = ["easy"]
# Points are stored with semantic tier labels, not raw degrees/densities
# A point spanning multiple difficulty levels: tiers = ["easy", "medium"]
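An illustrative sketch of the tier mapping, assuming the config.tiers structure above and that a point's raw degree/density labels are available at evaluation time:

```python
def compute_tiers(degree: str, density: str, tiers_config: list[dict]) -> list[str]:
    """Collect every tier label whose degree/density filters match this point."""
    labels = []
    for tier in tiers_config:
        filters = tier["filters"]
        if degree in filters["degrees"] and density in filters["densities"]:
            labels.append(tier["label"])
    return labels

tiers_config = [
    {"label": "easy", "filters": {"degrees": ["0"], "densities": ["normal"]}},
    {"label": "medium", "filters": {"degrees": ["1"], "densities": ["normal"]}},
    {"label": "hard", "filters": {"degrees": ["2"], "densities": ["normal"]}},
]
print(compute_tiers("0", "normal", tiers_config))  # ['easy']
```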
surfaces[] - Param filter intersection per task:
# config.basetasks['arithmetic'].surfaces[0]
{
"id": "arithmetic_easy",
"filter": {"max_depth": [1, 2]}
}
# Point with base_task="arithmetic" AND max_depth=1:
# → surfaces = ["arithmetic_easy", ...other matching surfaces]
projections[] - Param filter intersection per task:
# config.basetasks['arithmetic'].projections[0]
{
"id": "arithmetic_length_sweep",
"axis": "length",
"filter": {"max_depth": 2}
}
# Point with base_task="arithmetic" AND max_depth=2:
# → projections = ["arithmetic_length_sweep"]
Motivations For This Structure¶
1. Enables Spatial Reasoning¶
Traditional benchmarks: model → scalar score
ReasonScape: model → function over complexity manifold
You can ask:
- WHERE does this model fail? (surface visualization)
- HOW does performance degrade? (projection analysis)
- WHEN does catastrophic failure occur? (hazard analysis)
2. Supports Multiple Analysis Levels¶
The two-plane structure enables hierarchical aggregation:
- Point-level: Individual (evaluation, complexity) result
- Surface-level: Aggregate over complexity regions (2D slices)
- Task-level: Aggregate over all complexity for one task
- Tier-level: Aggregate over tasks at same difficulty
- Overall: Single score across everything
Each level answers different questions:
- Point: What happened here?
- Surface: Where are the cliffs?
- Task: How good at this reasoning type?
- Tier: How good at this difficulty?
- Overall: How good overall?
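An illustrative sketch of one such roll-up (toy records and plain Python; in practice the aggregation levels are driven by the facets described above):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical point-level results already tagged with facets
points = [
    {"eval_id": 0, "base_task": "arithmetic", "tiers": ["easy"], "accuracy": 0.95},
    {"eval_id": 0, "base_task": "arithmetic", "tiers": ["easy"], "accuracy": 0.90},
    {"eval_id": 0, "base_task": "arithmetic", "tiers": ["hard"], "accuracy": 0.40},
]

# Tier-level: aggregate point accuracies per tier label
by_tier = defaultdict(list)
for p in points:
    for tier in p["tiers"]:
        by_tier[tier].append(p["accuracy"])

print({tier: mean(vals) for tier, vals in by_tier.items()})
# {'easy': 0.925, 'hard': 0.4}
```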
3. Provides Flexible Navigation¶
Facets enable semantic filtering without complex identity queries:
Without facets:
# Must know exact identity coordinates
filters = {
"model": "Seed-OSS-36B",
"template": "zerocot-nosys",
"sampler": "seedoss-4k",
"base_task": "arithmetic",
"params": {"max_depth": [1, 2]} # Must know param ranges
}
With facets:
# Use semantic labels
filters = {
"eval_id": 0, # Which model config
"tiers": ["easy"] # Which difficulty
}
4. Separates Concerns Cleanly¶
Evaluation plane: System under test
- Research question: Which model/prompting works best?
- Facets: Group by architecture, size, runtime
Complexity plane: Test harness
- Research question: Where do models struggle?
- Facets: Group by difficulty, visualize by surfaces, analyze by projections
This separation means:
- Can design test harness (tasks/params) independently of models
- Can compare new models on existing complexity manifolds
- Can analyze single model across complexity space
5. Enables Rich Comparative Analysis¶
Compare within evaluation plane (same complexity, different models):
# Which model is best on easy arithmetic?
filters = {
"base_task": "arithmetic",
"tiers": ["easy"]
}
# Group by eval_id → ranking
Compare within complexity plane (same model, different difficulties):
# How does Seed-OSS perform across difficulty levels?
filters = {
"eval_id": 0, # Seed-OSS
"base_task": "arithmetic"
}
# Group by tiers → difficulty curve
Compare across both planes (multiple models, multiple difficulties):
# Which architectural pattern handles complexity best?
filters = {
"groups": [["arch:moe"], ["arch:dense"]],
"base_task": "arithmetic"
}
# Group by groups + tiers → architectural comparison across difficulty
Design Principles¶
1. Identity vs Organization¶
Identity dimensions define what a point IS (immutable after evaluation).
Facet dimensions organize how points are VIEWED (recomputable from config).
This separation means:
- Raw evaluation data is minimal (identity + results)
- Organizational views are flexible (change config, recompute facets)
- No need to re-evaluate to change organization
2. Planes Are Orthogonal¶
Evaluation and complexity are independent:
- Can test any model on any complexity
- Facets don't cross planes (eval facets ≠ complexity facets)
- Analysis can focus on one plane or explore joint space
3. Facets Are Multi-Valued¶
Points can belong to multiple facet values:
- A point can be in multiple tiers (e.g., "easy" AND "shallow")
- A point can be in multiple surfaces (overlapping 2D slices)
- A point can be in multiple projections (same point swept by multiple axes)
This enables:
- Overlapping analysis views
- Cross-cutting comparisons
- Flexible aggregation
4. Continuous Complexity¶
params is not a discrete grid - it's a continuous manifold:
- length, depth, prob_dewhitespace are real-valued
- Runner samples this space (corner/lowdef/normal/highdef strategies)
- Analysis can treat it spatially (interpolate, visualize surfaces, FFT)
This enables:
- Spatial reasoning about difficulty
- Identification of complexity cliffs
- Mapping parameter interactions
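As a sketch of what treating params spatially can look like (synthetic values; scipy interpolation is chosen here purely for illustration, not because ReasonScape uses it):

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical sampled points in a 2D slice of the complexity manifold: (length, depth)
samples = np.array([[10, 1], [10, 3], [50, 1], [50, 3], [100, 1], [100, 3]])
accuracy = np.array([0.95, 0.85, 0.80, 0.55, 0.45, 0.20])

# Interpolate accuracy onto a regular grid for surface plotting / cliff detection
grid_length, grid_depth = np.meshgrid(np.linspace(10, 100, 50), np.linspace(1, 3, 20))
surface = griddata(samples, accuracy, (grid_length, grid_depth), method="linear")
print(surface.shape)  # (20, 50) grid of interpolated accuracies
```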
Summary¶
ReasonScape's two-plane structure provides:
- Clear separation: Evaluation (who/how) vs Complexity (what/where) are orthogonal!
- Unique point identity: Automatic de-duplication of raw runner data for efficient storage and analysis
- Spatial reasoning: Dimensions inside each plane form an intrinsic data hierarchy
- Flexible navigation: Facets provide semantic organization
- Dynamic aggregation: Surface → Task or Model → Tier → Task - it's up to you!
This structure transforms evaluation from:
- Flat benchmarks: (model, task) → score
- To spatial analysis: model → function over complexity manifold
The result: We can ask not just "who wins?" but "WHERE do they win? HOW do they fail? WHEN does catastrophe occur?"