
ReasonScape Research Workflows

"Four questions, four workflows, one unified platform"

Introduction

ReasonScape supports four distinct research activities in LLM evaluation. Each uses a different cross-section of the toolset and data, targeting a fundamentally different research question:

  1. Ranking & Benchmarking - "What's the best overall?"
  2. Comparative Evaluation - "Which models are truly different?"
  3. Model Characterization - "What are the trade-offs?"
  4. Failure Diagnosis - "Why/how/when did it fail?"

The Four Research Activities

1. Ranking & Benchmarking

Research Question: "What's the best model overall?"

What it does: Aggregates diverse metrics (accuracy, truncation, resource utilization) into unified rankings across the entire test suite.

Primary tool: scores

Output: Model rankings with a single score/token metric

Use case: Initial triage - identify a handful of candidates for deeper investigation

Duration: 2-3 minutes

Read the full workflow
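
For intuition, here is a minimal Python sketch of this kind of aggregation. It is illustrative only: the field names and the simple penalty-weighted formula are assumptions, not the actual logic of the scores tool.

# Illustrative only: fold per-model metrics into one ranking score.
# Field names and weights are assumed, not ReasonScape's actual formula.
models = [
    {"name": "model-a", "accuracy": 0.82, "truncation_rate": 0.03, "avg_tokens": 1400},
    {"name": "model-b", "accuracy": 0.79, "truncation_rate": 0.01, "avg_tokens": 600},
]

def overall_score(m, token_budget=2000):
    # Reward accuracy; penalize truncation and heavy token usage.
    return m["accuracy"] - 0.5 * m["truncation_rate"] - 0.1 * (m["avg_tokens"] / token_budget)

for m in sorted(models, key=overall_score, reverse=True):
    print(f'{m["name"]}: {overall_score(m):.3f}')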


2. Comparative Evaluation

Research Question: "Which models offer statistically significant performance at a specific task?"

What it does: Compares sets of models statistically, accounting for confidence intervals, and groups them into clusters whose members are statistically indistinguishable.

Primary tool: cluster

Output: Overlapping cluster sets showing true performance differences vs. measurement noise

Use case: Deep-dive comparison of candidate models from ranking phase

Duration: 5-10 minutes

Read the full workflow
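
To make the clustering idea concrete, here is a conceptual Python sketch under simple assumptions (normal-approximation intervals, greedy disjoint grouping); the cluster tool's actual statistics and its overlapping cluster sets may differ.

import math

# Group models whose 95% accuracy confidence intervals overlap.
results = {"model-a": (410, 500), "model-b": (395, 500), "model-c": (330, 500)}  # (correct, total)

def interval(correct, total, z=1.96):
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return p - half, p + half

intervals = {name: interval(c, n) for name, (c, n) in results.items()}

clusters = []
for name, (lo, hi) in sorted(intervals.items(), key=lambda kv: -kv[1][0]):
    for cluster in clusters:
        # Join an existing cluster if this interval overlaps every member's interval.
        if all(lo <= intervals[m][1] and intervals[m][0] <= hi for m in cluster):
            cluster.append(name)
            break
    else:
        clusters.append([name])

print(clusters)  # [['model-a', 'model-b'], ['model-c']]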


3. Model Characterization

Research Question: "What are this model's strengths, weaknesses, and costs?"

What it does: Profiles a single model's capability landscape across tasks and difficulty dimensions.

Primary tools: spider (high-level fingerprint), surface (task-specific capability zones)

Output: Cognitive archetypes, capability maps, cost/performance profiles

Use case: Understanding a specific model for deployment decisions

Duration: 5-10 minutes

Read the full workflow
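
The following Python sketch shows the kind of per-task profile that underlies a spider fingerprint. The task names, record fields, and tokens-per-correct cost proxy are hypothetical, not ReasonScape's actual schema.

# Hypothetical per-task records for a single model; field names are illustrative.
records = [
    {"task": "arithmetic", "correct": 420, "total": 500, "avg_tokens": 900},
    {"task": "boolean",    "correct": 470, "total": 500, "avg_tokens": 350},
    {"task": "sorting",    "correct": 310, "total": 500, "avg_tokens": 1600},
]

profile = {
    r["task"]: {
        "accuracy": r["correct"] / r["total"],
        # Rough cost proxy: average tokens spent per correct answer.
        "tokens_per_correct": r["avg_tokens"] * r["total"] / r["correct"],
    }
    for r in records
}

for task, stats in profile.items():
    print(f'{task}: acc={stats["accuracy"]:.2f}, tokens/correct={stats["tokens_per_correct"]:.0f}')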


4. Failure Diagnosis

Research Question: "Why did this model fail on this task?"

What it does: Root-cause analysis across four information spaces:

  - INPUT: How is the problem represented? (fft)
  - REASONING: How is information being processed? (compression)
  - OUTPUT: Where does performance break down? (surface)
  - TEMPORAL: When does thinking degrade? (hazard)

Primary tools: surface, fft, compression, hazard

Output: Failure boundaries, tokenization artifacts, entropy patterns, temporal degradation curves

Use case: Deep forensics for model improvement or research

Duration: 10-20 minutes

Read the full workflow
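
As a conceptual illustration of the TEMPORAL view only, the Python sketch below bins samples by reasoning length and tracks how the failure rate climbs; the sample fields and binning are assumptions, not the hazard tool's method.

# Conceptual sketch: failure rate as a function of reasoning length.
# The 'reasoning_tokens' / 'correct' fields are assumed, not ReasonScape's schema.
samples = [
    {"reasoning_tokens": 300,  "correct": True},
    {"reasoning_tokens": 800,  "correct": True},
    {"reasoning_tokens": 1500, "correct": False},
    {"reasoning_tokens": 2600, "correct": False},
]

for lo, hi in [(0, 1000), (1000, 2000), (2000, 4000)]:
    group = [s for s in samples if lo <= s["reasoning_tokens"] < hi]
    if group:
        failure_rate = sum(not s["correct"] for s in group) / len(group)
        print(f"{lo}-{hi} tokens: {failure_rate:.0%} failures over {len(group)} samples")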


Choosing the Right Workflow

Use this decision tree to select your workflow:

Do you need a quick overview of available models?
    YES → Workflow 1: Ranking & Benchmarking
    NO  ↓

Do you need to compare multiple models statistically?
    YES → Workflow 2: Comparative Evaluation
    NO  ↓

Do you need to understand one model's capabilities?
    YES → Workflow 3: Model Characterization
    NO  ↓

Do you need to diagnose why a model fails?
    YES → Workflow 4: Failure Diagnosis

Typical Research Progression

Most research follows this sequence:

  1. Start with Ranking (scores)
     - Identify 3-5 candidate models
     - Note which tasks show performance spread

  2. Compare Candidates (cluster)
     - Determine which differences are statistically significant
     - Identify which models are actually equivalent

  3. Characterize Selected Models (spider, surface)
     - Map capability zones
     - Understand trade-offs
     - Estimate deployment costs

  4. Diagnose Unexpected Failures (surface, fft, compression, hazard)
     - Root-cause specific failure modes
     - Form hypotheses about reasoning mechanisms
     - Design targeted follow-up evaluations

Prerequisites

All workflows assume:

  - You have a processed dataset (see docs/tools/evaluate.md)
  - You're working from the /home/mike/ai/reasonscape directory
  - The virtual environment is activated (source venv/bin/activate)

Quick Start Examples

I want to rank 10 models

source venv/bin/activate
python analyze.py scores data/dataset-m12x.json --output scores.md
→ See Workflow 1: Ranking

I want to compare Llama vs Qwen

source venv/bin/activate
python analyze.py cluster data/dataset-m12x.json \
    --filters '{"family": ["llama", "qwen"]}' \
    --stack base_task --format png
→ See Workflow 2: Comparative

I want to profile GPT-4o

source venv/bin/activate
python analyze.py spiderweb data/dataset-m12x.json \
    --filters '{"model_name": "gpt-4o"}' --format webpng
→ See Workflow 3: Characterization

I want to know why Phi-4 fails at arithmetic

source venv/bin/activate
python analyze.py surface data/dataset-m12x.json \
    --filters '{"model_name": "phi-4", "base_task": "arithmetic"}'
→ See Workflow 4: Diagnosis


Next Steps

  1. New users: Start with Workflow 1: Ranking to get oriented
  2. Researchers: Jump to Workflow 4: Diagnosis for the full diagnostic toolkit
  3. Tool reference: See docs/tools/analyze.md for complete command documentation
  4. Methodology: Read docs/technical-details.md for statistical foundations