The ReasonScape Manifold: Design Intent¶
Overview¶
ReasonScape stores evaluation results in a two-plane structure, where each point exists simultaneously in both an Evaluation Plane and a Task Complexity Plane.
Each plane has its own identity dimensions (what defines points in that plane) and facet dimensions (how to organize points in that plane).
| | EVALUATION | TASK-COMPLEXITY |
|---|---|---|
| IDENTITY (5D) | model, template, sampler | base_task, params |
| FACETS | eval_id, groups[] | tiers[], surfaces[], projections[] |
This document explains the design motivations behind this architecture.
Why Two Planes?¶
Traditional benchmarks are flat: (model, task) → score
But this can't answer:
- WHERE in complexity space does the model fail?
- HOW does performance change as difficulty increases?
- WHAT architectural patterns emerge across difficulty levels?
ReasonScape's two-plane structure enables these questions by:
Separating concerns:
- Evaluation plane = System Under Test (who/how)
- Complexity plane = Test Harness (what/where)
Enabling orthogonal variation:
- Same model tested at many difficulty levels
- Many models tested at same difficulty level
- Cross-product: all models × all difficulties
Providing organizational facets per plane:
- Evaluation facets group models (eval_id, groups[])
- Complexity facets organize difficulty (tiers[], surfaces[], projections[])
The Evaluation Plane¶
Purpose: Define which model configuration was tested
Identity Dimensions¶
model (VARCHAR) - Which language model
- Example: "Seed-OSS-36B", "MiniMax-M2", "GPT-OSS-20B"
- Identifies the base model being evaluated
template (VARCHAR) - Which prompt template
- Example: "zerocot-nosys", "cot-high", "direct"
- Defines how problems are presented to the model
sampler (VARCHAR) - Which sampling configuration
- Example: "seedoss-4k", "default", "extended-8k"
- Defines generation parameters (token budget, temperature, etc.)
Together: (model, template, sampler) uniquely identifies an evaluation configuration - a specific way of testing a model.
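As a minimal sketch (the field names mirror the description above, but the structure itself is illustrative rather than the actual ReasonScape schema), the evaluation identity can be treated as an immutable 3-field key:

```python
from typing import NamedTuple

class EvalIdentity(NamedTuple):
    """Illustrative evaluation-plane identity: (model, template, sampler)."""
    model: str      # which language model
    template: str   # which prompt template
    sampler: str    # which sampling configuration

# Two configurations that differ in any field are distinct evaluations
a = EvalIdentity("Seed-OSS-36B", "zerocot-nosys", "seedoss-4k")
b = EvalIdentity("Seed-OSS-36B", "cot-high", "seedoss-4k")
assert a != b  # changing the template changes the evaluation identity
```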
Facet Dimensions¶
eval_id (INTEGER) - Which evaluation scenario
- Computed from config.evals[] matching
- Maps (model, template, sampler) → integer ID
- Purpose: Shorthand for filtering, display labels
- Cardinality: Singular (one eval per point)
groups[] (VARCHAR[]) - Which model categories
- Example: ["arch:moe", "size:36B", "runtime:vllm"]
- Assigned from config.evals[].groups metadata
- Purpose: Architectural/size/runtime comparisons
- Cardinality: Multi-valued (overlapping tags)
Why These Facets?¶
eval_id provides convenient shorthand:
- Instead of {"model": "X", "template": "Y", "sampler": "Z"}
- Just use {"eval_id": 0}
- Directly connects with the analyze.py evals subcommand to enable forward and reverse metadata discovery and search
groups[] enables cross-cutting analysis:
- Compare all MoE models: {"groups": [["arch:moe"]]}
- Compare large models: {"groups": [["size:large"]]}
- Compare across multiple attributes: {"groups": [["arch:moe", "size:large"]]}
The Task Complexity Plane¶
Purpose: Define which problem was tested and where in difficulty space
Identity Dimensions¶
base_task (VARCHAR) - Which problem domain
- Example: "arithmetic", "objects", "shuffle", "boolean"
- Defines the type of reasoning required
- Each task has different complexity dimensions
params (JSON) - Where in complexity manifold
- Example: {"length": 54, "max_depth": 3, "prob_dewhitespace": 0.8}
- Task-specific difficulty knobs
- N-dimensional continuous coordinate space
- Defines the exact difficulty configuration
Together: (base_task, params) uniquely identifies a problem instance - a specific test case at a specific difficulty level.
Facet Dimensions¶
tiers[] (VARCHAR[]) - Which difficulty levels
- Example: ["easy", "medium"]
- Computed from config.tiers[] filter matching during evaluation
- Purpose: Coarse-grained difficulty groupings
- Cardinality: Multi-valued (overlapping levels)
surfaces[] (VARCHAR[]) - Which 2D visualization slices
- Example: ["arithmetic_easy", "arithmetic_length_x_depth"]
- Computed from param filter matching during post-processing
- Purpose: 3D surface plots (z=accuracy over x,y grid)
- Cardinality: Multi-valued (point appears in multiple surfaces)
projections[] (VARCHAR[]) - Which 1D analysis sweeps
- Example: ["arithmetic_length_sweep", "arithmetic_depth_sweep"]
- Computed from param filter matching during post-processing
- Purpose: FFT analysis, parameter sensitivity
- Cardinality: Multi-valued (point in multiple projections)
Why These Facets?¶
tiers[] provide semantic difficulty groupings:
- Instead of a complex task-specific set of filters like {"base_task": "arithmetic", "params": {"max_depth": [1, 2]}}, just use {"tiers": ["easy"]}
- Enables fair ranking (compare models at same difficulty)
surfaces[] enable spatial visualization:
- Group points into 2D slices for heatmap/surface plots
- Identify capability cliffs (where performance drops)
- Map "green zones" (model succeeds) vs "red cliffs" (model fails)
projections[] enable parameter analysis:
- Sweep one parameter while holding others fixed
- FFT analysis reveals periodic patterns
- Sensitivity analysis shows which parameters matter most
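As an illustrative sketch (synthetic values, not the actual analysis pipeline), a projection reduces to a 1D accuracy-vs-parameter series, which can then be fed to an FFT to look for periodic structure along the sweep:

```python
import numpy as np

# Hypothetical projection: accuracy sampled along a "length" sweep,
# all other params held fixed (values are made up for illustration).
lengths = np.arange(10, 110, 10)
accuracy = np.array([0.98, 0.95, 0.90, 0.88, 0.70, 0.65, 0.40, 0.35, 0.20, 0.15])

# Remove the mean so the FFT highlights oscillation rather than the overall level
spectrum = np.abs(np.fft.rfft(accuracy - accuracy.mean()))
print(spectrum)  # peaks would indicate periodic accuracy patterns along the sweep
```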
The Identity vs Facet Model¶
Each plane has identity dimensions and facet dimensions:
Evaluation Plane:
| Identity (discrete) | Facets (computed) |
|---|---|
| model | eval_id |
| template | groups[] |
| sampler | |
Task Complexity Plane:
| Identity (hybrid) | Facets (computed) |
|---|---|
| base_task | tiers[] |
| params (continuous) | surfaces[] |
| | projections[] |
Key property: Facets organize points within their own plane
- Evaluation facets don't organize complexity (eval_id doesn't care about params)
- Complexity facets don't organize evaluations (surfaces don't care about model)
Point Identity¶
A point is uniquely identified by its position in both planes:
UNIQUE(model, template, sampler, base_task, params)
       └───── Evaluation ─────┘ └── Complexity ──┘
Two points with the same (evaluation identity, complexity identity) are considered the same measurement.
Example:
Point A: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
Point B: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Same point (deduplicated)
Point C: ("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 20, "depth": 2})
# → Different point (complexity changed)
Point D: ("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
# → Different point (evaluation changed)
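A minimal sketch of how this deduplication might be keyed (illustrative only; canonicalizing params as sorted-key JSON is an assumption, not necessarily what ReasonScape does internally):

```python
import json

def point_key(model, template, sampler, base_task, params: dict) -> tuple:
    """Build a hashable identity key spanning both planes."""
    # Canonicalize the params dict so key ordering does not matter
    return (model, template, sampler, base_task, json.dumps(params, sort_keys=True))

a = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})
b = point_key("Seed-OSS", "zerocot", "4k", "arithmetic", {"depth": 2, "length": 10})
c = point_key("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10, "depth": 2})

assert a == b   # same measurement -> deduplicated
assert a != c   # evaluation identity changed -> different point
```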
Orthogonality¶
The planes are orthogonal - they vary independently:
Same evaluation, different complexity¶
# Test Seed-OSS on easy arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})
# Test Seed-OSS on hard arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 100})
# Evaluation plane unchanged ✓
# Complexity plane changed ✓
Same complexity, different evaluation¶
# Test Seed-OSS on arithmetic length=10
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})
# Test MiniMax on arithmetic length=10
("MiniMax", "zerocot", "4k", "arithmetic", {"length": 10})
# Complexity plane unchanged ✓
# Evaluation plane changed ✓
Different in both planes¶
# Test Seed-OSS on easy arithmetic
("Seed-OSS", "zerocot", "4k", "arithmetic", {"length": 10})
# Test MiniMax on hard boolean
("MiniMax", "cot-high", "default", "boolean", {"depth": 5})
# Both planes changed ✓
This orthogonality enables:
- Model comparison: Same complexity, multiple models
- Difficulty analysis: Same model, multiple complexities
- Full cross-product: All models × all complexities
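Because the planes vary independently, building a full test matrix is just a cross-product. A small illustrative sketch (hypothetical configurations, not the actual runner):

```python
from itertools import product

evaluations = [
    ("Seed-OSS", "zerocot", "4k"),
    ("MiniMax", "cot-high", "default"),
]
complexities = [
    ("arithmetic", {"length": 10}),
    ("arithmetic", {"length": 100}),
    ("boolean", {"depth": 5}),
]

# All models x all difficulties: 2 evaluations * 3 complexity points = 6 points
for (model, template, sampler), (base_task, params) in product(evaluations, complexities):
    print(model, template, sampler, base_task, params)
```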
Facet Computation¶
Facets are computed, not stored in raw evaluation data.
Evaluation Facets (from evaluation identity)¶
eval_id - Lookup in config.evals[]:
# config.evals[0]
{
"label": "Seed-OSS 36B (4k budget)",
"filters": {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"}
}
# Point with (Seed-OSS-36B, zerocot, 4k) gets: eval_id = 0
groups[] - Extract from matched eval:
# config.evals[0]
{
"label": "Seed-OSS 36B",
"filters": {...},
"groups": ["arch:dense", "size:36B", "runtime:vllm"]
}
# Point matching eval 0 gets: groups = ["arch:dense", "size:36B", "runtime:vllm"]
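A minimal sketch of this facet computation, assuming config.evals[] entries carry label/filters/groups fields as shown above (the matching helper itself is illustrative):

```python
def compute_eval_facets(point: dict, evals: list[dict]) -> dict:
    """Assign eval_id and groups[] by matching the point against config.evals[]."""
    for eval_id, entry in enumerate(evals):
        filters = entry.get("filters", {})
        if all(point.get(key) == value for key, value in filters.items()):
            return {"eval_id": eval_id, "groups": entry.get("groups", [])}
    return {"eval_id": None, "groups": []}  # no matching eval scenario

evals = [{
    "label": "Seed-OSS 36B (4k budget)",
    "filters": {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"},
    "groups": ["arch:dense", "size:36B", "runtime:vllm"],
}]
point = {"model": "Seed-OSS-36B", "template": "zerocot", "sampler": "4k"}
print(compute_eval_facets(point, evals))
# {'eval_id': 0, 'groups': ['arch:dense', 'size:36B', 'runtime:vllm']}
```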
Complexity Facets (from complexity identity)¶
tiers[] - Mapped from (degree, density) at evaluation time:
# config.tiers - defines how degrees/densities map to tier labels
[
{"label": "easy", "filters": {"degrees": ["0"], "densities": ["normal"]}},
{"label": "medium", "filters": {"degrees": ["1"], "densities": ["normal"]}},
{"label": "hard", "filters": {"degrees": ["2"], "densities": ["normal"]}}
]
# During evaluation, (degree=0, density=normal) → tiers = ["easy"]
# Points are stored with semantic tier labels, not raw degrees/densities
# A point spanning multiple difficulty levels: tiers = ["easy", "medium"]
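An illustrative sketch of the tier mapping, assuming the config.tiers structure above and that a point's raw degree/density labels are available at evaluation time:

```python
def compute_tiers(degree: str, density: str, tiers_config: list[dict]) -> list[str]:
    """Collect every tier label whose degree/density filters match this point."""
    labels = []
    for tier in tiers_config:
        filters = tier["filters"]
        if degree in filters["degrees"] and density in filters["densities"]:
            labels.append(tier["label"])
    return labels

tiers_config = [
    {"label": "easy", "filters": {"degrees": ["0"], "densities": ["normal"]}},
    {"label": "medium", "filters": {"degrees": ["1"], "densities": ["normal"]}},
    {"label": "hard", "filters": {"degrees": ["2"], "densities": ["normal"]}},
]
print(compute_tiers("0", "normal", tiers_config))  # ['easy']
```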
surfaces[] - Param filter intersection per task:
# config.basetasks['arithmetic'].surfaces[0]
{
"id": "arithmetic_easy",
"filter": {"max_depth": [1, 2]}
}
# Point with base_task="arithmetic" AND max_depth=1:
# → surfaces = ["arithmetic_easy", ...other matching surfaces]
projections[] - Param filter intersection per task:
# config.basetasks['arithmetic'].projections[0]
{
"id": "arithmetic_length_sweep",
"axis": "length",
"filter": {"max_depth": 2}
}
# Point with base_task="arithmetic" AND max_depth=2:
# → projections = ["arithmetic_length_sweep"]
Motivations For This Structure¶
1. Enables Spatial Reasoning¶
Traditional benchmarks: model → scalar score
ReasonScape: model → function over complexity manifold
You can ask:
- WHERE does this model fail? (surface visualization)
- HOW does performance degrade? (projection analysis)
- WHEN does catastrophic failure occur? (hazard analysis)
2. Supports Multiple Analysis Levels¶
The two-plane structure enables hierarchical aggregation:
- Point-level: Individual (evaluation, complexity) result
- Surface-level: Aggregate over complexity regions (2D slices)
- Task-level: Aggregate over all complexity for one task
- Tier-level: Aggregate over tasks at same difficulty
- Overall: Single score across everything
Each level answers different questions:
- Point: What happened here?
- Surface: Where are the cliffs?
- Task: How good at this reasoning type?
- Tier: How good at this difficulty?
- Overall: How good overall?
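An illustrative sketch of one such roll-up (toy records and plain Python; in practice the aggregation levels are driven by the facets described above):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical point-level results already tagged with facets
points = [
    {"eval_id": 0, "base_task": "arithmetic", "tiers": ["easy"], "accuracy": 0.95},
    {"eval_id": 0, "base_task": "arithmetic", "tiers": ["easy"], "accuracy": 0.90},
    {"eval_id": 0, "base_task": "arithmetic", "tiers": ["hard"], "accuracy": 0.40},
]

# Tier-level: aggregate point accuracies per tier label
by_tier = defaultdict(list)
for p in points:
    for tier in p["tiers"]:
        by_tier[tier].append(p["accuracy"])

print({tier: mean(vals) for tier, vals in by_tier.items()})
# {'easy': 0.925, 'hard': 0.4}
```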
3. Provides Flexible Navigation¶
Facets enable semantic filtering without complex identity queries:
Without facets:
# Must know exact identity coordinates
filters = {
"model": "Seed-OSS-36B",
"template": "zerocot-nosys",
"sampler": "seedoss-4k",
"base_task": "arithmetic",
"params": {"max_depth": [1, 2]} # Must know param ranges
}
With facets:
# Use semantic labels
filters = {
"eval_id": 0, # Which model config
"tiers": ["easy"] # Which difficulty
}
4. Separates Concerns Cleanly¶
Evaluation plane: System under test
- Research question: Which model/prompting works best?
- Facets: Group by architecture, size, runtime
Complexity plane: Test harness
- Research question: Where do models struggle?
- Facets: Group by difficulty, visualize by surfaces, analyze by projections
This separation means:
- Can design test harness (tasks/params) independently of models
- Can compare new models on existing complexity manifolds
- Can analyze single model across complexity space
5. Enables Rich Comparative Analysis¶
Compare within evaluation plane (same complexity, different models):
# Which model is best on easy arithmetic?
filters = {
"base_task": "arithmetic",
"tiers": ["easy"]
}
# Group by eval_id → ranking
Compare within complexity plane (same model, different difficulties):
# How does Seed-OSS perform across difficulty levels?
filters = {
"eval_id": 0, # Seed-OSS
"base_task": "arithmetic"
}
# Group by tiers → difficulty curve
Compare across both planes (multiple models, multiple difficulties):
# Which architectural pattern handles complexity best?
filters = {
"groups": [["arch:moe"], ["arch:dense"]],
"base_task": "arithmetic"
}
# Group by groups + tiers → architectural comparison across difficulty
Design Principles¶
1. Identity vs Organization¶
Identity dimensions define what a point IS (immutable after evaluation).
Facet dimensions organize how points are VIEWED (recomputable from config).
This separation means:
- Raw evaluation data is minimal (identity + results)
- Organizational views are flexible (change config, recompute facets)
- No need to re-evaluate to change organization
2. Planes Are Orthogonal¶
Evaluation and complexity are independent:
- Can test any model on any complexity
- Facets don't cross planes (eval facets ≠ complexity facets)
- Analysis can focus on one plane or explore joint space
3. Facets Are Multi-Valued¶
Points can belong to multiple facet values:
- A point can be in multiple tiers (e.g., "easy" AND "shallow")
- A point can be in multiple surfaces (overlapping 2D slices)
- A point can be in multiple projections (same point swept by multiple axes)
This enables:
- Overlapping analysis views
- Cross-cutting comparisons
- Flexible aggregation
4. Continuous Complexity¶
params is not a discrete grid - it's a continuous manifold:
- length, depth, prob_dewhitespace are real-valued
- Runner samples this space (corner/lowdef/normal/highdef strategies)
- Analysis can treat it spatially (interpolate, visualize surfaces, FFT)
This enables:
- Spatial reasoning about difficulty
- Identification of complexity cliffs
- Mapping parameter interactions
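As a sketch of what treating params spatially can look like (synthetic values; scipy interpolation is chosen here purely for illustration, not because ReasonScape uses it):

```python
import numpy as np
from scipy.interpolate import griddata

# Hypothetical sampled points in a 2D slice of the complexity manifold: (length, depth)
samples = np.array([[10, 1], [10, 3], [50, 1], [50, 3], [100, 1], [100, 3]])
accuracy = np.array([0.95, 0.85, 0.80, 0.55, 0.45, 0.20])

# Interpolate accuracy onto a regular grid for surface plotting / cliff detection
grid_length, grid_depth = np.meshgrid(np.linspace(10, 100, 50), np.linspace(1, 3, 20))
surface = griddata(samples, accuracy, (grid_length, grid_depth), method="linear")
print(surface.shape)  # (20, 50) grid of interpolated accuracies
```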
Summary¶
ReasonScape's two-plane structure provides:
- Clear separation: Evaluation (who/how) vs Complexity (what/where) are orthogonal!
- Unique point identity: Automatic de-duplication of raw runner data for efficient storage and analysis
- Spatial reasoning: Dimensions inside each plane form an intrinsic data hierarchy
- Flexible navigation: Facets provide semantic organization
- Dynamic aggregation: Surface → Task or Model → Tier → Task - it's up to you!
This structure transforms evaluation from:
- Flat benchmarks: (model, task) → score
- To spatial analysis: model → function over complexity manifold
The result: We can ask not just "who wins?" but "WHERE do they win? HOW do they fail? WHEN does catastrophe occur?"