ReasonScape Research Workflows¶
"Four questions, four workflows, one unified platform"
Introduction¶
ReasonScape supports four distinct research activities in LLM evaluation. Each uses different cross-sections of the toolset and data, targeting fundamentally different research questions:
- Ranking & Benchmarking - "What's the best overall?"
- Comparative Evaluation - "Which models are truly different?"
- Model Characterization - "What are the trade-offs?"
- Failure Diagnosis - "Why/how/when did it fail?"
The Four Research Activities¶
1. Ranking & Benchmarking¶
Research Question: "What's the best model overall?"
What it does: Aggregates diverse metrics (accuracy, truncation, resource utilization) into unified rankings across the entire test suite.
Primary tool: scores
Output: Model rankings with a single score/token metric
Use case: Initial triage - identify a handful of candidates for deeper investigation
Duration: 2-3 minutes
2. Comparative Evaluation¶
Research Question: "Which models offer statistically significant performance at a specific task?"
What it does: Statistical comparison of model sets accounting for confidence intervals. Groups models into clusters where members are statistically indistinguishable.
Primary tool: cluster
Output: Overlapping cluster sets showing true performance differences vs. measurement noise
Use case: Deep-dive comparison of candidate models from ranking phase
Duration: 5-10 minutes
3. Model Characterization¶
Research Question: "What are this model's strengths, weaknesses, and costs?"
What it does: Profiles a single model's capability landscape across tasks and difficulty dimensions.
Primary tools: spider (high-level fingerprint), surface (task-specific capability zones)
Output: Cognitive archetypes, capability maps, cost/performance profiles
Use case: Understanding a specific model for deployment decisions
Duration: 5-10 minutes
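For example, a single model can be profiled in one pass (a sketch only: gpt-4o is an illustrative model name, and the model_name-only filter for surface is an assumption based on the --filters syntax shown in the Quick Start examples below):
source venv/bin/activate
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}' --format webpng
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}'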
4. Failure Diagnosis¶
Research Question: "Why did this model fail on this task?"
What it does: Root-cause analysis across four information spaces:
- INPUT: How is the problem represented? (fft)
- REASONING: How is information being processed? (compression)
- OUTPUT: Where does performance break down? (surface)
- TEMPORAL: When does thinking degrade? (hazard)
Primary tools: surface, fft, compression, hazard
Output: Failure boundaries, tokenization artifacts, entropy patterns, temporal degradation curves
Use case: Deep forensics for model improvement or research
Duration: 10-20 minutes
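A diagnostic pass over one model/task pair might look like the following (a sketch: the phi-4/arithmetic filter is illustrative, and it assumes fft, compression, and hazard accept the same dataset and --filters arguments shown for surface in the Quick Start examples below):
source venv/bin/activate
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
python analyze.py fft data/dataset-m12x.json \
    --filters '{"base_task": "arithmetic"}'
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'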
Choosing the Right Workflow¶
Use this decision tree to select your workflow:
Do you need a quick overview of available models?
    YES → Workflow 1: Ranking & Benchmarking
    NO ↓
Do you need to compare multiple models statistically?
    YES → Workflow 2: Comparative Evaluation
    NO ↓
Do you need to understand one model's capabilities?
    YES → Workflow 3: Model Characterization
    NO ↓
Do you need to diagnose why a model fails?
    YES → Workflow 4: Failure Diagnosis
Typical Research Progression¶
Most research follows this sequence:
1. Start with Ranking (scores)
   - Identify 3-5 candidate models
   - Note which tasks show performance spread
2. Compare Candidates (cluster)
   - Determine which differences are statistically significant
   - Identify which models are actually equivalent
3. Characterize Selected Models (spider, surface)
   - Map capability zones
   - Understand trade-offs
   - Estimate deployment costs
4. Diagnose Unexpected Failures (surface, fft, compression, hazard)
   - Root-cause specific failure modes
   - Form hypotheses about reasoning mechanisms
   - Design targeted follow-up evaluations
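Put end to end, one pass through this progression might look like the following (a sketch that reuses only the commands and flags from the Quick Start examples below; the model and family names are illustrative):
source venv/bin/activate
# 1. Rank everything, shortlist candidates
python analyze.py scores data/dataset-m12x.json --output scores.md
# 2. Check which candidate differences are statistically real
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"family": ["llama", "qwen"]}' --stack base_task --format png
# 3. Profile the shortlisted model
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}' --format webpng
# 4. Dig into an unexpected weak spot
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'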
Prerequisites¶
All workflows assume:
- You have a processed dataset (see docs/tools/evaluate.md)
- You're working from the /home/mike/ai/reasonscape directory
- The virtual environment is activated (source venv/bin/activate)
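A quick preflight check for these prerequisites (the ls line simply confirms that the example dataset used below is in place):
cd /home/mike/ai/reasonscape
source venv/bin/activate
ls data/dataset-m12x.json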
Quick Start Examples¶
I want to rank 10 models¶
source venv/bin/activate
python analyze.py scores data/dataset-m12x.json --output scores.md
I want to compare Llama vs Qwen¶
source venv/bin/activate
python analyze.py cluster data/dataset-m12x.json \
--filters '{"family": ["llama", "qwen"]}' \
--stack base_task --format png
I want to profile GPT-4o¶
source venv/bin/activate
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "gpt-4o"}' --format webpng
I want to know why Phi-4 fails at arithmetic¶
source venv/bin/activate
python analyze.py surface data/dataset-m12x.json \
--filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
Next Steps¶
- New users: Start with Workflow 1: Ranking to get oriented
- Researchers: Jump to Workflow 4: Diagnosis for the full diagnostic toolkit
- Tool reference: See docs/tools/analyze.md for complete command documentation
- Methodology: Read docs/technical-details.md for statistical foundations