# ReasonScape Research Workflows

> "Everything is a comparison"

## Introduction

### The High-Dimensional Challenge
ReasonScape stores evaluation results in a high-dimensional manifold (`docs/manifold.md`): each point exists in a 2-plane structure (Evaluation × Task-Complexity) and carries multiple outcome KPIs (accuracy, token counts, entropy, truncation, etc.).
The dimensionality explosion:

- Evaluation plane: 3 identity dimensions (model, template, sampler) + `eval_id` + `facets{}`
- Complexity plane: 2 identity dimensions (`base_task`, `params{}`) + `manifold{}` + `views:` in config
- Per-point KPIs: ~10+ measurements (outcome, tokens, entropy, accuracy, margin, etc.)
Each dataset contains thousands of points spanning 15+ dimensions with 10+ KPIs each. Analyzing this raw data directly is cognitively infeasible.
### What are Workflows?
Workflows are illuminated paths through this high-dimensional space. They slice and reduce the data to bring complexity down to 2-3 dimensions where human analysis becomes feasible again.
Every workflow follows the same three-step process:
- Filter - Select which subset of the manifold to examine
- Group - Aggregate along specific dimensions to reduce dimensionality
- Post-process - Transform, visualize, or rank the reduced data
What distinguishes workflows is which dimensions you preserve and which you aggregate away. This choice determines what questions you can answer.
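The filter → group → post-process pattern can be sketched on a toy set of points. Note that the field names and values below are hypothetical illustrations, not the actual PointsDB schema:

```python
from statistics import mean

# Hypothetical manifold points: identity dimensions plus a couple of KPIs.
points = [
    {"model": "a", "task": "arithmetic", "difficulty": 1, "accuracy": 0.95, "tokens": 120},
    {"model": "a", "task": "arithmetic", "difficulty": 2, "accuracy": 0.70, "tokens": 340},
    {"model": "b", "task": "arithmetic", "difficulty": 1, "accuracy": 0.90, "tokens": 95},
    {"model": "b", "task": "arithmetic", "difficulty": 2, "accuracy": 0.55, "tokens": 410},
]

# 1) Filter: select which subset of the manifold to examine (one task).
subset = [p for p in points if p["task"] == "arithmetic"]

# 2) Group: aggregate away the difficulty dimension, preserving model identity.
groups = {}
for p in subset:
    groups.setdefault(p["model"], []).append(p)

# 3) Post-process: reduce each group to a summary KPI, then rank.
summary = {m: mean(p["accuracy"] for p in ps) for m, ps in groups.items()}
ranking = sorted(summary, key=summary.get, reverse=True)
```

Here the difficulty dimension is aggregated away and the model dimension preserved, which is exactly the choice that turns raw points into a "which model is better?" answer.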
## The Three P's: Hierarchical Aggregation Workflows
ReasonScape uses hierarchical aggregation to progressively reduce dimensionality. Each workflow operates at a different aggregation level, trading detail for tractability:
| P | Data source | Question | Operation | Guide |
|---|---|---|---|---|
| Position | PointsDB (ranked) | "Which model is better?" | Rank across aggregated groups | → |
| Profile | PointsDB (unranked) | "What can this model do?" | Characterize or compare capability shape | → |
| Probe | Raw NDJSON traces | "What does failure look like?" | Inspect inside individual outputs | → |
Workflows differ by data source and question type. Position and Profile both use the aggregated PointsDB; what separates them is whether you want a ranking or a characterization. Probe drops to raw traces when you need to see what the model actually produced.
### Position
Multiple evaluations, ranked output
Tools: `scores`, `rank`, `cluster`, `pairwise`
Use cases:
- Leaderboard creation
- Initial triage to identify top candidates for a task
- Statistical comparisons of model cohorts
### Profile
One or more evaluations, one or more tasks, unranked
The 'scape in ReasonScape. Profile projects high-dimensional model characterization into interpretable low-dimensional views through three geometric spaces:
- Characterization-Space (`spiderweb`): Merged 2D view of accuracy + reliability + scaling + resources across all tasks
- Output-Space (`surface`): 3D heatmaps showing how accuracy scales with 2D problem-complexity manifolds
- Reasoning-Space (`compression`, `hazard`): Token/entropy scatterplots and temporal failure curves
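The core idea behind an Output-Space surface is aggregating per-point accuracy over each cell of a 2D complexity grid. A minimal sketch, using hypothetical parameter names (the real manifold parameters come from the task config):

```python
from statistics import mean

# Hypothetical per-point records on a 2D complexity manifold (two task params).
points = [
    {"length": 4, "depth": 1, "accuracy": 1.0},
    {"length": 4, "depth": 2, "accuracy": 0.8},
    {"length": 8, "depth": 1, "accuracy": 0.9},
    {"length": 8, "depth": 2, "accuracy": 0.4},
    {"length": 8, "depth": 2, "accuracy": 0.6},
]

# Collect accuracies per (length, depth) cell, then reduce each cell to a mean.
cells = {}
for p in points:
    cells.setdefault((p["length"], p["depth"]), []).append(p["accuracy"])
surface = {cell: mean(vals) for cell, vals in cells.items()}
```

Plotting `surface` as a heatmap over the two parameters gives the 3D view: the failure boundary appears where the mean accuracy drops off.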
Tools: `spiderweb`, `cluster`, `surface`, `compression`, `hazard`
Use cases:
- Characterize single models (fingerprints, capability zones, efficiency)
- Compare multiple models (failure boundaries, tokenization differences, reasoning patterns)
- Deployment decision making (identify safe operating ranges)
- Cost/performance trade-off analysis
### Probe
Raw evaluation outputs, trace-level processing (0th-level step analysis)
Through compression and hazard analysis, we observe entropic degradation in truncations—but what does this degradation actually look like? Is it degenerate junk, counter loops that appear to make progress, or metacognitive failures where thinking about the problem gets the model stuck?
Tools: `probe.py truncation` and `probe.py failure`
Use cases:
- Validate "broken loops" patterns from compression analysis
- Classify failure modes across models (metacognitive blindness vs clear failure)
- Extract concrete failure samples for reports and fine-tuning
- Diagnose unexpected truncations on "easy" tasks
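A minimal sketch of the kind of trailing-loop detection that loop classification involves. This only finds exact trailing cycles and is purely illustrative; the actual `truncation` classifier is not shown here:

```python
def detect_loop(tokens, min_len=1, max_len=8, min_repeats=3):
    """Detect a repeated cycle at the end of a token sequence.

    Returns (cycle, repeats) for the shortest trailing cycle that repeats
    at least min_repeats times, or None if no such cycle exists.
    """
    for length in range(min_len, max_len + 1):
        cycle = tokens[-length:]
        repeats = 0
        i = len(tokens)
        # Walk backwards from the end, counting exact copies of the cycle.
        while i >= length and tokens[i - length:i] == cycle:
            repeats += 1
            i -= length
        if repeats >= min_repeats:
            return cycle, repeats
    return None
```

A degenerate-junk truncation would show a short cycle with many repeats; a counter loop that "appears to make progress" would defeat this exact-match check, which is one reason failure classification needs human inspection of the raw traces.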
## Typical Research Progression
Most research follows this sequence:
1) Position - Rank models

    - Use `rank` for a relative leaderboard or `scores` for absolute distance-from-ideal
    - Use `cluster` to validate statistical significance within a single group
    - Identify 3-5 candidates

2) Profile - Characterize selected models

    - Single model: use `spiderweb` for a fingerprint, `surface` for capability zones
    - Multiple models: use `surface`/`compression`/`hazard` to compare failure modes
    - For diagnostic intent (chasing a specific anomaly), filter tighter: a single `eval_id` + a single task gives higher resolution from the same tools
    - Map capability zones and estimate deployment costs

3) Probe - Inspect raw traces when you need to see what the model actually produced

    - `failure` - read raw answer text and genresult facets for a specific point
    - `truncation` - classify loop types in truncated outputs
    - `fft` - frequency-domain analysis of input token sequences
    - Validate the "broken loops" hypothesis from compression analysis
    - Extract concrete failure examples for documentation or fine-tuning
- Extract concrete failure examples for documentation or fine-tuning
## How Tools Map to Workflows

### Position Tools (`analyze.py`)
Always produce ranked outputs:
- `scores` - Absolute distance-from-ideal ranking
- `rank` - Relative leaderboard by cluster penalty aggregation
- `pairwise` - Head-to-head Bradley-Terry ranking
- `cluster` - Statistical significance within a single group
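For intuition on the Bradley-Terry model behind head-to-head ranking, the standard minorization-maximization update can be sketched as follows. This is a minimal illustration of the general algorithm, not the `pairwise` tool's actual implementation:

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths from a head-to-head win matrix.

    wins[i][j] = number of times item i beat item j.
    Returns strengths normalized to sum to 1 (higher = stronger).
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for item i
            # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p
```

For example, `bradley_terry([[0, 8], [2, 0]])` (model 0 beat model 1 eight times out of ten) converges to strengths of roughly 0.8 and 0.2.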
### Profile Tools (`analyze.py`)
Work on aggregated PointsDB data. Filtering granularity controls resolution, not workflow identity: tighter filters (a single `eval_id` + a single task) give diagnostic resolution, while broader filters give characterization views:
- `spiderweb` - Cross-task fingerprint (requires multiple groups)
- `surface` - Accuracy across 2D parameter grids
- `compression` - Token vs entropy reasoning efficiency
- `hazard` - Temporal failure dynamics during generation
- `points` - Per-point outcomes; also the source of point IDs for Probe
### Probe Tools (`probe.py`)

Operate on raw NDJSON traces; these require `data.py pull --full`:
- `failure` - Raw answer text and genresult facets for a specific point
- `truncation` - Loop detection and classification in truncated outputs
- `fft` - Frequency-domain analysis of input token sequences
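For intuition on why frequency-domain analysis exposes repetition: a periodic token-ID signal produces a sharp spectral peak at its repetition frequency. A naive DFT sketch (illustrative only; the signal construction here is hypothetical and real tooling would use an FFT):

```python
import cmath

def dft_magnitudes(signal):
    """Naive O(n^2) DFT magnitude spectrum of a real-valued signal."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * i / n)
                for i, x in enumerate(signal)))
        for k in range(n)
    ]

# A strongly periodic sequence (period 4, length 32) peaks at frequency
# index n / period = 8 in the magnitude spectrum.
signal = [1, 0, -1, 0] * 8
mags = dft_magnitudes(signal)
```

A looping generation shows up the same way: the token-ID sequence becomes periodic, and energy concentrates at the loop's fundamental frequency instead of spreading across the spectrum.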
## Next Steps
- New users: Start with Position to understand model rankings
- Researchers: Jump to Profile for characterization and diagnostics
- Tool reference: See `docs/tools.md` for complete command documentation
- Methodology: Read `docs/technical-details.md` for statistical foundations