ReasonScape Research Workflows

"Everything is a comparison"

Introduction

The High-Dimensional Challenge

ReasonScape stores evaluation results in a high-dimensional manifold (docs/manifold.md): each point exists in a 2-plane structure (Evaluation × Task-Complexity) and carries multiple outcome KPIs (accuracy, token counts, entropy, truncation, etc.).

The dimensionality explosion:

  • Evaluation plane: 3 identity dimensions (model, template, sampler) + eval_id + facets{}
  • Complexity plane: 2 identity dimensions (base_task, params{}) + manifold{} + views{} (defined in config)
  • Per-point KPIs: ~10+ measurements (outcome, tokens, entropy, accuracy, margin, etc.)

Each dataset contains thousands of points spanning 15+ dimensions with 10+ KPIs each. Analyzing this raw data directly is cognitively infeasible.
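
To make the two-plane structure concrete, here is the rough shape of a single point. The field names are illustrative only and do not claim to be the literal PointsDB schema:

    # Illustrative shape of one manifold point: an Evaluation-plane
    # coordinate, a Complexity-plane coordinate, and its outcome KPIs.
    point = {
        "evaluation": {"model": "m1", "template": "chatml", "sampler": "greedy",
                       "eval_id": "run-001", "facets": {}},
        "complexity": {"base_task": "sort", "params": {"length": 12, "depth": 3}},
        "kpis": {"accuracy": 0.78, "tokens": 512, "entropy": 1.41,
                 "truncation": 0.05, "margin": 0.21},
    }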

What are Workflows?

Workflows are illuminated paths through this high-dimensional space. They slice and reduce the data to bring complexity down to 2-3 dimensions where human analysis becomes feasible again.

Every workflow follows the same three-step process:

  1. Filter - Select which subset of the manifold to examine
  2. Group - Aggregate along specific dimensions to reduce dimensionality
  3. Post-process - Transform, visualize, or rank the reduced data

What distinguishes workflows is which dimensions you preserve and which you aggregate away. This choice determines what questions you can answer.
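
As a toy illustration of the pattern, here is the three-step skeleton in pandas. The column names are invented for the example, not the actual PointsDB schema:

    import pandas as pd

    points = pd.DataFrame({
        "model":     ["a", "a", "b", "b"],
        "base_task": ["arith", "sort", "arith", "sort"],
        "accuracy":  [0.91, 0.62, 0.85, 0.70],
        "tokens":    [410, 980, 520, 760],
    })

    # 1. Filter: select the subset of the manifold to examine
    subset = points[points["base_task"] == "arith"]

    # 2. Group: aggregate away dimensions (here, everything except model)
    grouped = subset.groupby("model")[["accuracy", "tokens"]].mean()

    # 3. Post-process: rank the reduced table
    print(grouped.sort_values("accuracy", ascending=False))

Every workflow below is a variation on this skeleton; only the filter predicate, the grouping keys, and the post-processing step change.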

The Three P's: Hierarchical Aggregation Workflows

ReasonScape uses hierarchical aggregation to progressively reduce dimensionality. Each workflow operates at a different aggregation level, trading detail for tractability:

P         Data source          Question                         Operation                                  Guide
Position  PointsDB (ranked)    "Which model is better?"         Rank across aggregated groups              Position Workflow Guide
Profile   PointsDB (unranked)  "What can this model do?"        Characterize or compare capability shape   Profile Workflow Guide
Probe     Raw NDJSON traces    "What does failure look like?"   Inspect inside individual outputs          Probe Workflow Guide

Workflows differ by data source and question type. Position and Profile both use the aggregated PointsDB; what separates them is whether you want a ranking or a characterization. Probe drops to raw traces when you need to see what the model actually produced.


Position

Multiple evaluations, ranked output

Tools: scores, rank, cluster, pairwise

Use cases:

  • Leaderboard creation
  • Initial triage to identify top candidates for a task
  • Statistical comparisons of model cohorts

Position Workflow Guide

Profile

One or more evaluations, one or more tasks, unranked

The 'scape in ReasonScape. Profile projects high-dimensional model characterization into interpretable low-dimensional views through three geometric spaces:

  • Characterization-Space (spiderweb): Merged 2D view of accuracy + reliability + scaling + resources across all tasks
  • Output-Space (surface): 3D heatmaps showing how accuracy scales with 2D problem-complexity manifolds
  • Reasoning-Space (compression, hazard): Token/entropy scatterplots and temporal failure curves
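
The compression view, for example, is at heart a token/entropy scatterplot. A minimal sketch with invented toy values (the real view comes from the compression tool):

    import matplotlib.pyplot as plt

    # (output tokens, mean token entropy, truncated?) per point -- toy values
    pts = [(300, 1.8, False), (850, 1.2, False), (2048, 0.4, True), (2048, 0.3, True)]

    for tokens, entropy, truncated in pts:
        plt.scatter(tokens, entropy, c="red" if truncated else "blue")
    plt.xlabel("output tokens")
    plt.ylabel("mean token entropy")
    plt.show()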

Tools: spiderweb, cluster, surface, compression, hazard

Use cases:

  • Characterize single models (fingerprints, capability zones, efficiency)
  • Compare multiple models (failure boundaries, tokenization differences, reasoning patterns)
  • Deployment decision making (identify safe operating ranges)
  • Cost/performance trade-off analysis

Profile Workflow Guide

Probe

Raw evaluation outputs, processed at the trace level (0th-level, i.e. unaggregated, step analysis)

Through compression and hazard analysis, we observe entropic degradation in truncations—but what does this degradation actually look like? Is it degenerate junk, counter loops that appear to make progress, or metacognitive failures where thinking about the problem gets the model stuck?
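
One cheap way to screen for the looping case is to test whether the tail of a trace is a short token window repeated verbatim. A toy sketch, not probe.py's actual classifier:

    # Flag traces whose tail is one short window repeated min_repeats times.
    def looks_like_loop(tokens: list[str], window: int = 8, min_repeats: int = 4) -> bool:
        tail = tokens[-window * min_repeats:]
        if len(tail) < window * min_repeats:
            return False
        chunk = tail[:window]
        return all(tail[i:i + window] == chunk for i in range(0, len(tail), window))

    print(looks_like_loop(["step", "1,", "step", "2,"] * 16))  # True: degenerate repetition
    print(looks_like_loop("the final answer is forty two".split()))  # False: too short, no loop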

Tools: probe.py truncation and probe.py failure

Use cases:

  • Validate "broken loops" patterns from compression analysis
  • Classify failure modes across models (metacognitive blindness vs clear failure)
  • Extract concrete failure samples for reports and fine-tuning
  • Diagnose unexpected truncations on "easy" tasks

Probe Workflow Guide


Typical Research Progression

Most research follows this sequence:

1) Position - Rank models

  • Use rank for relative leaderboard or scores for absolute distance-from-ideal
  • Use cluster to validate statistical significance within a single group
  • Identify 3-5 candidates

2) Profile - Characterize selected models

  • Single model: Use spiderweb for fingerprint, surface for capability zones
  • Multiple models: Use surface/compression/hazard to compare failure modes
  • For diagnostic intent (chasing a specific anomaly), filter tighter: single eval_id + single task gives higher resolution from the same tools
  • Map capability zones and estimate deployment costs

3) Probe - Inspect raw traces when you need to see what the model actually produced

  • failure — read raw answer text and genresult facets for a specific point
  • truncation — classify loop types in truncated outputs
  • fft — frequency-domain analysis of input token sequences
  • Validate "broken loops" hypothesis from compression analysis
  • Extract concrete failure examples for documentation or fine-tuning

How Tools Map to Workflows

Position Tools (analyze.py)

Always produce ranked outputs:

  • scores - Absolute distance-from-ideal ranking
  • rank - Relative leaderboard by cluster penalty aggregation
  • pairwise - Head-to-head Bradley-Terry ranking (see the sketch after this list)
  • cluster - Statistical significance within a single group
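
For intuition about the pairwise approach, here is the classic minorization-maximization update for Bradley-Terry strengths, fitted to toy head-to-head win counts. A generic sketch; the actual pairwise tool may handle ties, priors, and convergence differently:

    # Bradley-Terry via the standard MM fixed-point update:
    # p_i <- wins_i / sum_j games_ij / (p_i + p_j), then renormalize.
    wins = {("a", "b"): 7, ("b", "a"): 3,
            ("a", "c"): 6, ("c", "a"): 4,
            ("b", "c"): 5, ("c", "b"): 5}
    models = ["a", "b", "c"]
    p = {m: 1.0 for m in models}

    for _ in range(200):
        new = {}
        for i in models:
            w_i = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum((wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                        for j in models if j != i)
            new[i] = w_i / denom
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}

    print(sorted(p.items(), key=lambda kv: -kv[1]))  # strengths, best first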

Profile Tools (analyze.py)

Work on aggregated PointsDB data. Filtering granularity controls resolution, not workflow identity — tighter filters (single eval_id + single task) give diagnostic resolution; broader filters give characterization views:

  • spiderweb - Cross-task fingerprint (requires multiple groups)
  • surface - Accuracy across 2D parameter grids
  • compression - Token vs entropy reasoning efficiency
  • hazard - Temporal failure dynamics during generation (see the sketch after this list)
  • points - Per-point outcomes; also the source of point IDs for Probe
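
As an illustration of temporal failure dynamics, here is a minimal empirical hazard computation: within each token-position bucket, the fraction of still-running generations that fail there. Positions and bucket edges are invented for the example:

    # None = generation completed without failing; numbers = failure position.
    failure_positions = [120, 450, 450, 900, None, None, None]
    buckets = [(0, 256), (256, 512), (512, 1024)]

    at_risk = len(failure_positions)
    for lo, hi in buckets:
        failed = sum(1 for f in failure_positions if f is not None and lo <= f < hi)
        print(f"[{lo:4d},{hi:4d}): hazard = {failed}/{at_risk} = {failed / at_risk:.2f}")
        at_risk -= failed  # survivors carry into the next bucket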

Probe Tools (probe.py)

Operate on raw NDJSON traces — require data.py pull --full:

  • failure - Raw answer text and genresult facets for a specific point
  • truncation - Loop detection and classification in truncated outputs
  • fft - Frequency-domain analysis of input token sequences
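
For intuition about the fft tool's framing, a generic sketch: a periodic pattern in a token-ID sequence shows up as a peak in its spectrum. Toy data, not probe.py's actual pipeline:

    import numpy as np

    token_ids = np.array([5, 9, 2, 7] * 64, dtype=float)  # toy period-4 repetition
    spectrum = np.abs(np.fft.rfft(token_ids - token_ids.mean()))  # drop the DC term
    freqs = np.fft.rfftfreq(len(token_ids))

    peak = freqs[spectrum.argmax()]
    print(f"dominant frequency {peak:.3f} cycles/token -> period {1 / peak:.1f} tokens")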

Next Steps

  1. New users: Start with Position to understand model rankings
  2. Researchers: Jump to Profile for characterization and diagnostics
  3. Tool reference: See docs/tools.md for complete command documentation
  4. Methodology: Read docs/technical-details.md for statistical foundations