ReasonScape Research Workflows

"Everything is a comparison"

Introduction

The High-Dimensional Challenge

ReasonScape stores evaluation results in a high-dimensional manifold (docs/manifold.md): each point exists in a 2-plane structure (Evaluation × Task-Complexity) and carries multiple outcome KPIs (accuracy, token counts, entropy, truncation, etc.).

The dimensionality explosion:

  • Evaluation plane: 3 identity dimensions (model, template, sampler) + eval_id + facets{}
  • Complexity plane: 2 identity dimensions (base_task, params{}) + manifold{} + views{} (defined in config)
  • Per-point KPIs: ~10+ measurements (outcome, tokens, entropy, accuracy, margin, etc.)

Each dataset contains thousands of points spanning 15+ dimensions with 10+ KPIs each. Analyzing this raw data directly is cognitively infeasible.
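
To make the two-plane structure concrete, here is the rough shape of a single point. The field names are illustrative only and do not claim to be the literal PointsDB schema:

    # Illustrative shape of one manifold point: an Evaluation-plane
    # coordinate, a Complexity-plane coordinate, and its outcome KPIs.
    point = {
        "evaluation": {"model": "m1", "template": "chatml", "sampler": "greedy",
                       "eval_id": "run-001", "facets": {}},
        "complexity": {"base_task": "sort", "params": {"length": 12, "depth": 3}},
        "kpis": {"accuracy": 0.78, "tokens": 512, "entropy": 1.41,
                 "truncation": 0.05, "margin": 0.21},
    }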

What are Workflows?

Workflows are illuminated paths through this high-dimensional space. They slice and reduce the data to bring complexity down to 2-3 dimensions where human analysis becomes feasible again.

Every workflow follows the same three-step process:

  1. Filter - Select which subset of the manifold to examine
  2. Group - Aggregate along specific dimensions to reduce dimensionality
  3. Post-process - Transform, visualize, or rank the reduced data

What distinguishes workflows is which dimensions you preserve and which you aggregate away. This choice determines what questions you can answer.
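
As a toy illustration of the pattern, here is the three-step skeleton in pandas. The column names are invented for the example, not the actual PointsDB schema:

    import pandas as pd

    points = pd.DataFrame({
        "model":     ["a", "a", "b", "b"],
        "base_task": ["arith", "sort", "arith", "sort"],
        "accuracy":  [0.91, 0.62, 0.85, 0.70],
        "tokens":    [410, 980, 520, 760],
    })

    # 1. Filter: select the subset of the manifold to examine
    subset = points[points["base_task"] == "arith"]

    # 2. Group: aggregate away dimensions (here, everything except model)
    grouped = subset.groupby("model")[["accuracy", "tokens"]].mean()

    # 3. Post-process: rank the reduced table
    print(grouped.sort_values("accuracy", ascending=False))

Every workflow below is a variation on this skeleton; only the filter predicate, the grouping keys, and the post-processing step change.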

The Three P's: Hierarchical Aggregation Workflows

ReasonScape uses hierarchical aggregation to progressively reduce dimensionality. Each workflow operates at a different aggregation level, trading detail for tractability:

P         Data source          Question                         Operation                                  Guide
Position  PointsDB (ranked)    "Which model is better?"         Rank across aggregated groups              Position Workflow Guide
Profile   PointsDB (unranked)  "What can this model do?"        Characterize or compare capability shape   Profile Workflow Guide
Probe     Raw NDJSON traces    "What does failure look like?"   Inspect inside individual outputs          Probe Workflow Guide

Workflows differ by data source and question type. Position and Profile both use the aggregated PointsDB; what separates them is whether you want a ranking or a characterization. Probe drops to raw traces when you need to see what the model actually produced.


Position

Multiple evaluations, ranked output

Tools: scores, rank, cluster, pairwise

Use cases:

  • Leaderboard creation
  • Initial triage to identify top candidates for a task
  • Statistical comparisons of model cohorts

Position Workflow Guide

Profile

One or more evaluations, one or more tasks, unranked

The 'scape in ReasonScape. Profile projects high-dimensional model characterization into interpretable low-dimensional views through three geometric spaces:

  • Characterization-Space (spiderweb): Merged 2D view of accuracy + reliability + scaling + resources across all tasks
  • Output-Space (surface): 3D heatmaps showing how accuracy scales with 2D problem-complexity manifolds
  • Reasoning-Space (compression, hazard): Token/entropy scatterplots and temporal failure curves
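
The compression view, for example, is at heart a token/entropy scatterplot. A minimal sketch with invented toy values (the real view comes from the compression tool):

    import matplotlib.pyplot as plt

    # (output tokens, mean token entropy, truncated?) per point -- toy values
    pts = [(300, 1.8, False), (850, 1.2, False), (2048, 0.4, True), (2048, 0.3, True)]

    for tokens, entropy, truncated in pts:
        plt.scatter(tokens, entropy, c="red" if truncated else "blue")
    plt.xlabel("output tokens")
    plt.ylabel("mean token entropy")
    plt.show()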

Tools: spiderweb, cluster, surface, compression, hazard

Use cases:

  • Characterize single models (fingerprints, capability zones, efficiency)
  • Compare multiple models (failure boundaries, tokenization differences, reasoning patterns)
  • Deployment decision making (identify safe operating ranges)
  • Cost/performance trade-off analysis

Profile Workflow Guide

Probe

Raw evaluation outputs, processed at the trace level (0th-level, i.e. unaggregated, step analysis)

Through compression and hazard analysis, we observe entropic degradation in truncations—but what does this degradation actually look like? Is it degenerate junk, counter loops that appear to make progress, or metacognitive failures where thinking about the problem gets the model stuck?
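
One cheap way to screen for the looping case is to test whether the tail of a trace is a short token window repeated verbatim. A toy sketch, not probe.py's actual classifier:

    # Flag traces whose tail is one short window repeated min_repeats times.
    def looks_like_loop(tokens: list[str], window: int = 8, min_repeats: int = 4) -> bool:
        tail = tokens[-window * min_repeats:]
        if len(tail) < window * min_repeats:
            return False
        chunk = tail[:window]
        return all(tail[i:i + window] == chunk for i in range(0, len(tail), window))

    print(looks_like_loop(["step", "1,", "step", "2,"] * 16))  # True: degenerate repetition
    print(looks_like_loop("the final answer is forty two".split()))  # False: too short, no loop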

Tools: probe.py truncation and probe.py failure

Use cases:

  • Validate "broken loops" patterns from compression analysis
  • Classify failure modes across models (metacognitive blindness vs clear failure)
  • Extract concrete failure samples for reports and fine-tuning
  • Diagnose unexpected truncations on "easy" tasks

Probe Workflow Guide


Typical Research Progression

Most research follows this sequence:

1) Position - Rank models

  • Use rank for relative leaderboard or scores for absolute distance-from-ideal
  • Use cluster to validate statistical significance within a single group
  • Identify 3-5 candidates

2) Profile - Characterize selected models

  • Single model: Use spiderweb for fingerprint, surface for capability zones
  • Multiple models: Use surface/compression/hazard to compare failure modes
  • For diagnostic intent (chasing a specific anomaly), filter tighter: single eval_id + single task gives higher resolution from the same tools
  • Map capability zones and estimate deployment costs

3) Probe - Inspect raw traces when you need to see what the model actually produced

  • failure — read raw answer text and genresult facets for a specific point
  • truncation — classify loop types in truncated outputs
  • fft — frequency-domain analysis of input token sequences
  • Validate "broken loops" hypothesis from compression analysis
  • Extract concrete failure examples for documentation or fine-tuning

How Tools Map to Workflows

Position Tools (analyze.py)

Always produce ranked outputs:

  • scores - Absolute distance-from-ideal ranking
  • rank - Relative leaderboard by cluster penalty aggregation
  • pairwise - Head-to-head Bradley-Terry ranking (see the sketch after this list)
  • cluster - Statistical significance within a single group
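
For intuition about the pairwise approach, here is the classic minorization-maximization update for Bradley-Terry strengths, fitted to toy head-to-head win counts. A generic sketch; the actual pairwise tool may handle ties, priors, and convergence differently:

    # Bradley-Terry via the standard MM fixed-point update:
    # p_i <- wins_i / sum_j games_ij / (p_i + p_j), then renormalize.
    wins = {("a", "b"): 7, ("b", "a"): 3,
            ("a", "c"): 6, ("c", "a"): 4,
            ("b", "c"): 5, ("c", "b"): 5}
    models = ["a", "b", "c"]
    p = {m: 1.0 for m in models}

    for _ in range(200):
        new = {}
        for i in models:
            w_i = sum(wins.get((i, j), 0) for j in models if j != i)
            denom = sum((wins.get((i, j), 0) + wins.get((j, i), 0)) / (p[i] + p[j])
                        for j in models if j != i)
            new[i] = w_i / denom
        total = sum(new.values())
        p = {m: v / total for m, v in new.items()}

    print(sorted(p.items(), key=lambda kv: -kv[1]))  # strengths, best first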

Profile Tools (analyze.py)

Work on aggregated PointsDB data. Filtering granularity controls resolution, not workflow identity — tighter filters (single eval_id + single task) give diagnostic resolution; broader filters give characterization views:

  • spiderweb - Cross-task fingerprint (requires multiple groups)
  • surface - Accuracy across 2D parameter grids
  • compression - Token vs entropy reasoning efficiency
  • hazard - Temporal failure dynamics during generation (see the sketch after this list)
  • points - Per-point outcomes; also the source of point IDs for Probe
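
As an illustration of temporal failure dynamics, here is a minimal empirical hazard computation: within each token-position bucket, the fraction of still-running generations that fail there. Positions and bucket edges are invented for the example:

    # None = generation completed without failing; numbers = failure position.
    failure_positions = [120, 450, 450, 900, None, None, None]
    buckets = [(0, 256), (256, 512), (512, 1024)]

    at_risk = len(failure_positions)
    for lo, hi in buckets:
        failed = sum(1 for f in failure_positions if f is not None and lo <= f < hi)
        print(f"[{lo:4d},{hi:4d}): hazard = {failed}/{at_risk} = {failed / at_risk:.2f}")
        at_risk -= failed  # survivors carry into the next bucket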

Probe Tools (probe.py)

Operate on raw NDJSON traces — require data.py pull --full:

  • failure - Raw answer text and genresult facets for a specific point
  • truncation - Loop detection and classification in truncated outputs
  • fft - Frequency-domain analysis of input token sequences
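
For intuition about the fft tool's framing, a generic sketch: a periodic pattern in a token-ID sequence shows up as a peak in its spectrum. Toy data, not probe.py's actual pipeline:

    import numpy as np

    token_ids = np.array([5, 9, 2, 7] * 64, dtype=float)  # toy period-4 repetition
    spectrum = np.abs(np.fft.rfft(token_ids - token_ids.mean()))  # drop the DC term
    freqs = np.fft.rfftfreq(len(token_ids))

    peak = freqs[spectrum.argmax()]
    print(f"dominant frequency {peak:.3f} cycles/token -> period {1 / peak:.1f} tokens")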

Next Steps

  1. New users: Start with Position to understand model rankings
  2. Researchers: Jump to Profile for characterization and diagnostics
  3. Tool reference: See docs/tools.md for complete command documentation
  4. Methodology: Read docs/technical-details.md for statistical foundations