Datasets: Main Reference and Research Collections

Overview

ReasonScape's dataset structure reflects the research lifecycle: a stable main reference dataset provides baselines, while focused research subdatasets enable deep investigation of specific questions. Successful findings graduate back to the main dataset, creating a continuous improvement loop.

Research subdatasets aren't just data storage; they're worked examples of the ReasonScape methodology, demonstrating how to build evidence-based stories that answer practical research questions.

Dataset Taxonomy

Main Reference Dataset: m12x

Location: data/m12x/, data/m12x.db, data/dataset-m12x.json

Purpose: Stable reference evaluation proving ReasonScape methodology works at scale

Characteristics:

  • 75+ frontier models across architectures (dense, MoE, hybrid, SSM)
  • 12 reasoning tasks with validated manifolds
  • 6.5B tokens of processed data
  • 150K+ evaluation points with pre-computed compression, FFT, and statistics
  • Stable baseline for peer comparison and benchmarking

Use Cases:

  • Baseline performance comparison
  • Cross-model capability analysis
  • Cognitive archetype identification
  • Leaderboard and explorer visualization

Documentation: m12x.md
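
The main dataset is queried with the same analyze.py commands used for the research subdatasets (see Analyzing Research Datasets below), for example:

# List evaluations and tasks in the main reference dataset
python analyze.py evals data/dataset-m12x.json
python analyze.py tasks data/dataset-m12x.json

# Leaderboard-style score table
python analyze.py scores data/dataset-m12x.json --format markdown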

Research Subdatasets: m12x-*

Location: data/m12x-*/, data/m12x-*.db, data/dataset-m12x-*.json

Purpose: Focused investigations of specific models, families, or research questions

Characteristics:

  • Targeted scope - specific model families or research hypotheses
  • Active investigation - iterative experimentation and analysis
  • Complete workflow examples - demonstrate ReasonScape methodology end-to-end
  • Documented findings - corresponding research/*/SUMMARY.md with evidence chains
  • Graduation path - successful configurations can be promoted to main m12x

Use Cases:

  • Deep-dive model characterization
  • Hypothesis testing (truncation causes, sampler effects, budget optimization)
  • Comparative family analysis (Qwen3 variants, thinking models)
  • Failure mode diagnosis (Seed-OSS overthinking loops)

The Graduation Pattern

Research subdatasets exist to answer specific questions. When answers emerge, the winning configurations graduate to the main dataset.

Example: Seed-OSS Think Budget Investigation

Research Question: Why does Seed-OSS-36B show 42% truncation on Sequence tasks?

Dataset: m12x-seed/ (21 directories, 7 configurations)

  • Tested quantization (AWQ vs FP16)
  • Tested samplers (greedy vs stochastic)
  • Tested thinking budgets (0K, 2K, 4K, 6K, unlimited)

Finding: 4K thinking budget is optimal (879 ReasonScore, 3% truncation vs 42% for unlimited)

Graduation: Seed-OSS-36B with seedoss-4k sampler added to main m12x dataset

Evidence: research/seedoss-think-budget/SUMMARY.md
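
For illustration, graduating a configuration means adding one more evals entry to the main dataset config, using the same format shown under Step 3 below. In this sketch the glob path and group tags are placeholders, not the actual entry:

{
  "evaluate": {"glob": "data/m12x/*Seed-OSS-36B*seedoss-4k*/*"},
  "filters": {"model": "Seed-OSS-36B", "sampler": "seedoss-4k"},
  "label": "Seed-OSS-36B (seedoss-4k)",
  "groups": ["family:...", "arch:...", "size:..."]
}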

Example: Qwen3-Next RC Budget Optimization

Research Question: What's the optimal reasoning control budget allocation for Qwen3-Next?

Dataset: m12x-qwen-next/ (45 directories, multiple RC variants)

  • Tested budget ratios (think/soft/hard allocations)
  • Tested penalty escalation (presence/repeat penalties)
  • Tested total budget sizes (4K, 6K, 7.6K)

Finding: RC-ultra-balanced (4K+1.8K+1.8K) beats original ultra, especially on Boolean tasks

Graduation: Optimal configurations added to main dataset with documented rationale

Evidence: research/qwen3-next-ultra-sweep/SUMMARY.md

Graduation Criteria

Not all research results graduate to main m12x:

✅ Graduate when:

  • Configuration solves a problem (truncation, accuracy, efficiency)
  • Finding generalizes beyond the specific research question
  • Model represents a practical deployment scenario
  • Results add value to reference baselines

❌ Don't graduate when:

  • Finding is purely diagnostic (explains failure but doesn't fix it)
  • Configuration is intermediate/experimental only
  • Issue is architectural (can't be fixed with sampling/prompting)
  • Data primarily serves as a cautionary example

Current Research Subdatasets

m12x-qwen3 (Qwen3 Family Deep Dive)

Size: 62 directories, 104MB database, 19 model variants

Focus: Comprehensive Qwen3 family characterization across sizes and variants

Models:

  • Base Qwen3 (1.7B, 4B, 8B, 14B, 32B)
  • Thinking variants (4B-Thinking, 30B-A3B-Thinking)
  • Instruct variants (4B-Instruct, 30B-A3B-Instruct)
  • VL variants (2B, 4B, 8B, 32B across Instruct/Thinking)
  • MoE variants (30B-A3B, 235B-A22B)

Key Insights:

  • Thinking vs Instruct performance comparison
  • VL model reasoning capabilities
  • MoE vs dense architecture tradeoffs
  • Size scaling patterns within family

Status: Active reference collection for Qwen3 ecosystem


m12x-qwen-next (Qwen3-Next RC Optimization)

Size: 45 directories, 72MB database

Focus: Reasoning Control (RC) budget optimization for Qwen3-Next thinking models

Research: Ultra budget sweep testing think/soft/hard budget allocations

Key Findings:

  • RC-ultra-balanced (4K+1.8K+1.8K) outperforms original ultra
  • Budget distribution matters more than total budget size
  • Higher sampling penalties actively harm performance (RP1.2 breaks Movies task)
  • Boolean tasks benefit from balanced budget allocation

Graduation Status: Optimal configurations being integrated into main m12x

Documentation: research/qwen3-next-ultra-sweep/


m12x-seed (Seed-OSS Think Budget Investigation)

Size: 21 directories, 35MB database, 7 configurations

Focus: Diagnosing and fixing Seed-OSS-36B truncation issues through thinking budget control

Configurations Tested:

  • Quantization: AWQ vs FP16
  • Samplers: Greedy (temp=0) vs stochastic (temp=1.1, top_p=0.95)
  • Think budgets: 0K, 2K, 4K, 6K, unlimited

Key Findings:

  • Unlimited thinking budget causes overthinking loops (42% Sequence truncation)
  • 4K budget optimal (879 ReasonScore, 3% truncation)
  • Quantization irrelevant (0.2% difference)
  • Sampling parameters are noise at scale (<1% difference)
  • 0K budget catastrophic (-59.8% performance)

Graduation Status: ✅ Seed-OSS-36B with seedoss-4k sampler promoted to main m12x

Documentation: research/seedoss-think-budget/


m12x-gpt-oss (GPT-OSS Reasoning Control Investigation)

Size: 27 directories, 65MB database, 13 configurations

Focus: Testing documented reasoning control mechanisms for GPT-OSS 120B and 20B models

Research Question: Do reasoning control parameters ("high", "medium", "low") produce measurable performance differences?

Configurations Tested:

  • System prompt injection: {"role": "system", "content": "Reasoning: <level>"}
  • Chat template kwargs: {"reasoning_effort": "<level>"}
  • Both 120B and 20B model variants
  • Combined with zerocot and baseline samplers
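
For reference, the two injection mechanisms combined into a single OpenAI-style chat request might look like the sketch below. Only the system-prompt content and the reasoning_effort kwarg come from the configurations above; the surrounding request fields (model name, chat_template_kwargs wrapper, user message) are assumptions about the serving setup:

{
  "model": "gpt-oss-120b",
  "messages": [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "..."}
  ],
  "chat_template_kwargs": {"reasoning_effort": "high"}
}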

Critical Finding: 🚨 Reasoning control mechanisms are non-functional

All reasoning level configurations (high/medium/low) show zero statistically significant differentiation:

  • 120B Hard tier: All 7 configs cluster together at 75.2%–76.3% (±0.6%)
  • 20B Hard tier: zerocot-high and zerocot-low both score exactly 66.8% ± 0.7%
  • Token waste: "High" reasoning costs 137 extra tokens for +0.8% gain (not significant)
  • Pattern holds: Across all 12 tasks and all 3 difficulty tiers

Graduation Status: ❌ No graduation - feature doesn't work, kept as diagnostic case study

Working with Research Subdatasets

Creating a New Research Dataset

Step 1: Define Research Question

Be specific about what you're testing:

  • ✅ "Does thinking budget affect truncation rates in Seed-OSS?"
  • ✅ "Which RC budget allocation optimizes Qwen3-Next performance?"
  • ❌ "How good is model X?" (too broad, use main m12x)

Step 2: Design Experiments

Identify variables to test:

  • Models (variants, families, architectures)
  • Templates (zero-shot, CoT, system prompts)
  • Samplers (temperature, penalties, budget limits)
  • Context limits (simulated lower contexts)
  • Tasks (focus on relevant subset)
  • Difficulty tiers (easy for quick tests, all for comprehensive)

Step 3: Create Dataset Configuration

# Create subdirectory for interview data
mkdir -p data/m12x-myresearch

# Create dataset config
cat > data/dataset-m12x-myresearch.json <<EOF
{
  "name": "m12x-myresearch",
  "base": "base-m12x.json",
  "evals": [
    {
      "evaluate": {"glob": "data/m12x-myresearch/*config1*/*"},
      "filters": {
        "model": "model-name",
        "template": "zeroshot-nosys",
        "sampler": "greedy-max"
      },
      "label": "Config 1 Description",
      "groups": ["family:...", "arch:...", "size:..."]
    }
  ],
  "db": "data/m12x-myresearch.db"
}
EOF

Step 4: Run Evaluations

Start with quick tests (degree 0, low precision):

python runner.py --config configs/m12x.yaml \
  --degree 0 --density normal --precision low \
  --model your-model \
  --template zeroshot-nosys \
  --sampler greedy-max

Move results to research subdataset:

mv results/timestamp/your-model+template+sampler/* data/m12x-myresearch/

Process evaluations:

python evaluate.py --dataset data/dataset-m12x-myresearch.json --parallel 16

Step 5: Create Research Project

mkdir -p research/myresearch
cd research/myresearch

# Initialize tracking files
touch PROGRESS.md      # Track what you're doing
touch NOTES-topic.md   # Capture observations
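
A lightweight skeleton for PROGRESS.md works well as a starting point; the headings below are only a suggestion, not a required format:

cat > PROGRESS.md <<EOF
# myresearch: Progress Log

## Research Question
Does <variable> affect <metric> for <model>?

## Experiments
- [ ] degree-0 quick test: baseline configuration
- [ ] degree-0 quick test: candidate configuration

## Observations
- (date) ...

## Next Steps
- ...
EOF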

Analyzing Research Datasets

Follow the discovery-investigation loop:

# 1. Get overview
python analyze.py evals data/dataset-m12x-myresearch.json
python analyze.py tasks data/dataset-m12x-myresearch.json

# 2. Visual discovery (save to research/myresearch/)
python analyze.py scores data/dataset-m12x-myresearch.json --format markdown > research/myresearch/leaderboard.md
python analyze.py cluster data/dataset-m12x-myresearch.json --group-by base_task --output research/myresearch/

# 3. Forensic investigation (for interesting patterns)
python analyze.py surface data/dataset-m12x-myresearch.json --filters '{"model": "target-model"}' --output research/myresearch/
python analyze.py compression data/dataset-m12x-myresearch.json --filters '{"base_task": "problematic-task"}' --output research/myresearch/
python analyze.py hazard data/dataset-m12x-myresearch.json --filters '{"model": "target-model"}' --output research/myresearch/

# 4. Visual inspection (CRITICAL: look at the PNGs!)
ls research/myresearch/*.png

Build evidence chains: Use compression → hazard → surface plots to tell the complete story

Example: Investigating a Truncation Problem

Hypothesis: Model X shows high truncation on task Y. Is it budget, sampler, or architectural?

Step 1: Quick test at degree 0

# Test 3 configurations
python runner.py --degree 0 --model X --sampler greedy-max --template zerocot-nosys
python runner.py --degree 0 --model X --sampler high-budget --template zeroshot-nosys
python runner.py --degree 0 --model X --sampler low-temp --template zerocot-nosys

Step 2: Process and compare

python evaluate.py --dataset data/dataset-m12x-mytest.json
python analyze.py cluster data/dataset-m12x-mytest.json --group-by sampler

Step 3: Visual inspection

python analyze.py compression data/dataset-m12x-mytest.json --output research/mytest/
# Look for: tight clustering at context limit? high CV ratios? bimodal distributions?

Step 4: Follow the evidence

  • If CV ratio >50x: Chaotic behavior
  • If tight clustering at limit: Hard truncation
  • If budget changes help: Overthinking loops
  • If nothing helps: Architectural issue

Step 5: Document findings

# Summarize in research/mytest/SUMMARY.md
# Include: hypothesis, experiments, findings, visualizations, conclusion
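
A minimal SUMMARY.md skeleton, assuming the hypothesis / experiments / findings / conclusion structure described above:

cat > research/mytest/SUMMARY.md <<EOF
# Model X Truncation Investigation

## Hypothesis
High truncation on task Y is caused by budget, sampler, or architecture.

## Experiments
- greedy-max + zerocot-nosys (baseline)
- high-budget + zeroshot-nosys
- low-temp + zerocot-nosys

## Findings
- ...

## Visualizations
- (link the compression / hazard / surface PNGs here)

## Conclusion
- Graduate or do not graduate, and why.
EOF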

Best Practices

Dos

  • ✅ Start small - Test at degree 0 first, expand if promising
  • ✅ Visual inspection is mandatory - Look at every PNG before drawing conclusions
  • ✅ Document as you go - PROGRESS.md prevents forgetting what you tried
  • ✅ Follow evidence chains - Compression → hazard → surface tells complete stories
  • ✅ Use existing research as templates - Study worked examples before starting
  • ✅ Graduate winners - Add successful configs to main m12x
  • ✅ Document failures - Negative results are valuable case studies

Don'ts

  • ❌ Don't skip statistical validation - Use cluster analysis, not eyeballing
  • ❌ Don't test everything - Focus on specific hypotheses
  • ❌ Don't ignore compression analysis - It reveals root causes
  • ❌ Don't assume more budget helps - Test systematically (see Seed-OSS, Qwen3-Next)
  • ❌ Don't trust sampling penalties alone - They can actively harm performance (see Qwen3-Next RP1.2)
  • ❌ Don't graduate incomplete investigations - Finish the evidence story first

The research subdatasets are living proof that ReasonScape works. Each one represents hundreds of hours of investigation, producing actionable insights that improved model deployments or revealed architectural issues. Use them as templates for your own investigations.