ReasonScape Research Workflows¶
"Four questions, four workflows, one unified platform"
Introduction¶
ReasonScape supports four distinct research activities in LLM evaluation. Each uses different cross-sections of the toolset and data, targeting fundamentally different research questions:
- Ranking & Benchmarking - "What's the best overall?"
- Comparative Evaluation - "Which models are truly different?"
- Model Characterization - "What are the trade-offs?"
- Failure Diagnosis - "Why/how/when did it fail?"
The Four Research Activities¶
1. Ranking & Benchmarking¶
Research Question: "What's the best model overall?"
What it does: Aggregates diverse metrics (accuracy, truncation, resource utilization) into unified rankings across the entire test suite.
Primary tool: scores
Output: Model rankings with a single score/token metric
Use case: Initial triage - identify a handful of candidates for deeper investigation
Duration: 2-3 minutes
2. Comparative Evaluation¶
Research Question: "Which models offer statistically significant performance at a specific task?"
What it does: Statistical comparison of model sets accounting for confidence intervals. Groups models into clusters where members are statistically indistinguishable.
Primary tool: cluster
Output: Overlapping cluster sets showing true performance differences vs. measurement noise
Use case: Deep-dive comparison of candidate models from ranking phase
Duration: 5-10 minutes
3. Model Characterization¶
Research Question: "What are this model's strengths, weaknesses, and costs?"
What it does: Profiles a single model's capability landscape across tasks and difficulty dimensions.
Primary tools: spider (high-level fingerprint), surface (task-specific capability zones)
Output: Cognitive archetypes, capability maps, cost/performance profiles
Use case: Understanding a specific model for deployment decisions
Duration: 5-10 minutes
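For example, a single model can be profiled in one pass (a sketch only: gpt-4o is an illustrative model name, and the model_name-only filter for surface is an assumption based on the --filters syntax shown in the Quick Start examples below):
source venv/bin/activate
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}' --format webpng
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}'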
4. Failure Diagnosis¶
Research Question: "Why did this model fail on this task?"
What it does: Root-cause analysis across four information spaces:
- INPUT: How is the problem represented? (fft)
- REASONING: How is information being processed? (compression)
- OUTPUT: Where does performance break down? (surface)
- TEMPORAL: When does thinking degrade? (hazard)
Primary tools: surface, fft, compression, hazard
Output: Failure boundaries, tokenization artifacts, entropy patterns, temporal degradation curves
Use case: Deep forensics for model improvement or research
Duration: 10-20 minutes
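A diagnostic pass over one model/task pair might look like the following (a sketch: the phi-4/arithmetic filter is illustrative, and it assumes fft, compression, and hazard accept the same dataset and --filters arguments shown for surface in the Quick Start examples below):
source venv/bin/activate
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
python analyze.py fft data/dataset-m12x.json \
    --filters '{"base_task": "arithmetic"}'
python analyze.py compression data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
python analyze.py hazard data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'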
Choosing the Right Workflow¶
Use this decision tree to select your workflow:
Do you need a quick overview of available models?
    YES → Workflow 1: Ranking & Benchmarking
    NO ↓
Do you need to compare multiple models statistically?
    YES → Workflow 2: Comparative Evaluation
    NO ↓
Do you need to understand one model's capabilities?
    YES → Workflow 3: Model Characterization
    NO ↓
Do you need to diagnose why a model fails?
    YES → Workflow 4: Failure Diagnosis
Typical Research Progression¶
Most research follows this sequence:
1. Start with Ranking (scores)
   - Identify 3-5 candidate models
   - Note which tasks show performance spread
2. Compare Candidates (cluster)
   - Determine which differences are statistically significant
   - Identify which models are actually equivalent
3. Characterize Selected Models (spider, surface)
   - Map capability zones
   - Understand trade-offs
   - Estimate deployment costs
4. Diagnose Unexpected Failures (surface, fft, compression, hazard)
   - Root-cause specific failure modes
   - Form hypotheses about reasoning mechanisms
   - Design targeted follow-up evaluations
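Put end to end, one pass through this progression might look like the following (a sketch that reuses only the commands and flags from the Quick Start examples below; the model and family names are illustrative):
source venv/bin/activate
# 1. Rank everything, shortlist candidates
python analyze.py scores data/dataset-m12x.json --output scores.md
# 2. Check which candidate differences are statistically real
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"family": ["llama", "qwen"]}' --stack base_task --format png
# 3. Profile the shortlisted model
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}' --format webpng
# 4. Dig into an unexpected weak spot
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'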
Prerequisites¶
All workflows assume:
- You have a processed dataset (see docs/tools/evaluate.md)
- You're working from the /home/mike/ai/reasonscape directory
- The virtual environment is activated (source venv/bin/activate)
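A quick preflight check for these prerequisites (the ls line simply confirms that the example dataset used below is in place):
cd /home/mike/ai/reasonscape
source venv/bin/activate
ls data/dataset-m12x.json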
Quick Start Examples¶
I want to rank 10 models¶
source venv/bin/activate
python analyze.py scores data/dataset-m12x.json --output scores.md
I want to compare Llama vs Qwen¶
source venv/bin/activate
python analyze.py cluster data/dataset-m12x.json \
--filters '{"family": ["llama", "qwen"]}' \
--stack base_task --format png
I want to profile GPT-4o¶
source venv/bin/activate
python analyze.py spiderweb data/dataset-m12x.json \
--filters '{"model_name": "gpt-4o"}' --format webpng
I want to know why Phi-4 fails at arithmetic¶
source venv/bin/activate
python analyze.py surface data/dataset-m12x.json \
--filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
Next Steps¶
- New users: Start with Workflow 1: Ranking to get oriented
- Researchers: Jump to Workflow 4: Diagnosis for the full diagnostic toolkit
- Tool reference: See docs/tools/analyze.md for complete command documentation
- Methodology: Read docs/technical-details.md for statistical foundations