Datasets: Main Reference and Research Collections

Overview

ReasonScape's dataset structure reflects the research lifecycle: a stable main reference dataset provides baselines, while focused research subdatasets enable deep investigation of specific questions. Successful findings graduate back to the main dataset, creating a continuous improvement loop.

Research subdatasets aren't just data storage; they're worked examples of the ReasonScape methodology, demonstrating how to build evidence-based stories that answer practical research questions.

Dataset Taxonomy

Main Reference Dataset: m12x

Location: data/m12x/, data/m12x.db, data/dataset-m12x.json

Purpose: Stable reference evaluation proving ReasonScape methodology works at scale

Characteristics:

  • 75+ frontier models across architectures (dense, MoE, hybrid, SSM)
  • 12 reasoning tasks with validated manifolds
  • 6.5B tokens of processed data
  • 150K+ evaluation points with pre-computed compression, FFT, and statistics
  • Stable baseline for peer comparison and benchmarking

Use Cases:

  • Baseline performance comparison
  • Cross-model capability analysis
  • Cognitive archetype identification
  • Leaderboard and explorer visualization

Documentation: m12x.md
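
The main dataset is queried with the same analyze.py commands used for the research subdatasets (see Analyzing Research Datasets below), for example:

# List evaluations and tasks in the main reference dataset
python analyze.py evals data/dataset-m12x.json
python analyze.py tasks data/dataset-m12x.json

# Leaderboard-style score table
python analyze.py scores data/dataset-m12x.json --format markdown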

Research Subdatasets: m12x-*

Location: data/m12x-*/, data/m12x-*.db, data/dataset-m12x-*.json

Purpose: Focused investigations of specific models, families, or research questions

Characteristics:

  • Targeted scope - specific model families or research hypotheses
  • Active investigation - iterative experimentation and analysis
  • Complete workflow examples - demonstrate ReasonScape methodology end-to-end
  • Documented findings - corresponding research/*/SUMMARY.md with evidence chains
  • Graduation path - successful configurations can be promoted to main m12x

Use Cases:

  • Deep-dive model characterization
  • Hypothesis testing (truncation causes, sampler effects, budget optimization)
  • Comparative family analysis (Qwen3 variants, thinking models)
  • Failure mode diagnosis (Seed-OSS overthinking loops)

The Graduation Pattern

Research subdatasets exist to answer specific questions. When answers emerge, the winning configurations graduate to the main dataset.

Example: Seed-OSS Think Budget Investigation

Research Question: Why does Seed-OSS-36B show 42% truncation on Sequence tasks?

Dataset: m12x-seed/ (21 directories, 7 configurations)

  • Tested quantization (AWQ vs FP16)
  • Tested samplers (greedy vs stochastic)
  • Tested thinking budgets (0K, 2K, 4K, 6K, unlimited)

Finding: 4K thinking budget is optimal (879 ReasonScore, 3% truncation vs 42% for unlimited)

Graduation: Seed-OSS-36B with seedoss-4k sampler added to main m12x dataset

Evidence: research/seedoss-think-budget/SUMMARY.md
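
For illustration, graduating a configuration means adding one more evals entry to the main dataset config, using the same format shown under Step 3 below. In this sketch the glob path and group tags are placeholders, not the actual entry:

{
  "evaluate": {"glob": "data/m12x/*Seed-OSS-36B*seedoss-4k*/*"},
  "filters": {"model": "Seed-OSS-36B", "sampler": "seedoss-4k"},
  "label": "Seed-OSS-36B (seedoss-4k)",
  "groups": ["family:...", "arch:...", "size:..."]
}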

Example: Qwen3-Next RC Budget Optimization

Research Question: What's the optimal reasoning control budget allocation for Qwen3-Next?

Dataset: m12x-qwen-next/ (45 directories, multiple RC variants)

  • Tested budget ratios (think/soft/hard allocations)
  • Tested penalty escalation (presence/repeat penalties)
  • Tested total budget sizes (4K, 6K, 7.6K)

Finding: RC-ultra-balanced (4K+1.8K+1.8K) beats original ultra, especially on Boolean tasks

Graduation: Optimal configurations added to main dataset with documented rationale

Evidence: research/qwen3-next-ultra-sweep/SUMMARY.md

Graduation Criteria

Not all research results graduate to main m12x:

✅ Graduate when:

  • Configuration solves a problem (truncation, accuracy, efficiency)
  • Finding generalizes beyond the specific research question
  • Model represents a practical deployment scenario
  • Results add value to reference baselines

❌ Don't graduate when:

  • Finding is purely diagnostic (explains failure but doesn't fix it)
  • Configuration is intermediate/experimental only
  • Issue is architectural (can't be fixed with sampling/prompting)
  • Data primarily serves as a cautionary example

Current Research Subdatasets

m12x-qwen3 (Qwen3 Family Deep Dive)

Size: 62 directories, 104MB database, 19 model variants

Focus: Comprehensive Qwen3 family characterization across sizes and variants

Models:

  • Base Qwen3 (1.7B, 4B, 8B, 14B, 32B)
  • Thinking variants (4B-Thinking, 30B-A3B-Thinking)
  • Instruct variants (4B-Instruct, 30B-A3B-Instruct)
  • VL variants (2B, 4B, 8B, 32B across Instruct/Thinking)
  • MoE variants (30B-A3B, 235B-A22B)

Key Insights:

  • Thinking vs Instruct performance comparison
  • VL model reasoning capabilities
  • MoE vs dense architecture tradeoffs
  • Size scaling patterns within family

Status: Active reference collection for Qwen3 ecosystem


m12x-qwen-next (Qwen3-Next RC Optimization)

Size: 45 directories, 72MB database

Focus: Reasoning Control (RC) budget optimization for Qwen3-Next thinking models

Research: Ultra budget sweep testing think/soft/hard budget allocations

Key Findings:

  • RC-ultra-balanced (4K+1.8K+1.8K) outperforms original ultra
  • Budget distribution matters more than total budget size
  • Higher sampling penalties actively harm performance (RP1.2 breaks Movies task)
  • Boolean tasks benefit from balanced budget allocation

Graduation Status: Optimal configurations being integrated into main m12x

Documentation: research/qwen3-next-ultra-sweep/


m12x-seed (Seed-OSS Think Budget Investigation)

Size: 21 directories, 35MB database, 7 configurations

Focus: Diagnosing and fixing Seed-OSS-36B truncation issues through thinking budget control

Configurations Tested:

  • Quantization: AWQ vs FP16
  • Samplers: Greedy (temp=0) vs stochastic (temp=1.1, top_p=0.95)
  • Think budgets: 0K, 2K, 4K, 6K, unlimited

Key Findings:

  • Unlimited thinking budget causes overthinking loops (42% Sequence truncation)
  • 4K budget optimal (879 ReasonScore, 3% truncation)
  • Quantization irrelevant (0.2% difference)
  • Sampling parameters are noise at scale (<1% difference)
  • 0K budget catastrophic (-59.8% performance)

Graduation Status: ✅ Seed-OSS-36B with seedoss-4k sampler promoted to main m12x

Documentation: research/seedoss-think-budget/


m12x-gpt-oss (GPT-OSS Reasoning Control Investigation)

Size: 27 directories, 65MB database, 13 configurations

Focus: Testing documented reasoning control mechanisms for GPT-OSS 120B and 20B models

Research Question: Do reasoning control parameters ("high", "medium", "low") produce measurable performance differences?

Configurations Tested:

  • System prompt injection: {"role": "system", "content": "Reasoning: <level>"}
  • Chat template kwargs: {"reasoning_effort": "<level>"}
  • Both 120B and 20B model variants
  • Combined with zerocot and baseline samplers
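
For reference, the two injection mechanisms combined into a single OpenAI-style chat request might look like the sketch below. Only the system-prompt content and the reasoning_effort kwarg come from the configurations above; the surrounding request fields (model name, chat_template_kwargs wrapper, user message) are assumptions about the serving setup:

{
  "model": "gpt-oss-120b",
  "messages": [
    {"role": "system", "content": "Reasoning: high"},
    {"role": "user", "content": "..."}
  ],
  "chat_template_kwargs": {"reasoning_effort": "high"}
}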

Critical Finding: 🚨 Reasoning control mechanisms are non-functional

All reasoning level configurations (high/medium/low) show zero statistically significant differentiation:

  • 120B Hard tier: All 7 configs cluster together at 75.2%–76.3% (±0.6%)
  • 20B Hard tier: zerocot-high and zerocot-low both score exactly 66.8% ± 0.7%
  • Token waste: "High" reasoning costs 137 extra tokens for +0.8% gain (not significant)
  • Pattern holds: Across all 12 tasks and all 3 difficulty tiers

Graduation Status: ❌ No graduation - feature doesn't work, kept as diagnostic case study

Working with Research Subdatasets

Creating a New Research Dataset

Step 1: Define Research Question

Be specific about what you're testing:

  • ✅ "Does thinking budget affect truncation rates in Seed-OSS?"
  • ✅ "Which RC budget allocation optimizes Qwen3-Next performance?"
  • ❌ "How good is model X?" (too broad, use main m12x)

Step 2: Design Experiments

Identify variables to test:

  • Models (variants, families, architectures)
  • Templates (zero-shot, CoT, system prompts)
  • Samplers (temperature, penalties, budget limits)
  • Context limits (simulated lower contexts)
  • Tasks (focus on relevant subset)
  • Difficulty tiers (easy for quick tests, all for comprehensive)

Step 3: Create Dataset Configuration

# Create subdirectory for interview data
mkdir -p data/m12x-myresearch

# Create dataset config
cat > data/dataset-m12x-myresearch.json <<EOF
{
  "name": "m12x-myresearch",
  "base": "base-m12x.json",
  "evals": [
    {
      "evaluate": {"glob": "data/m12x-myresearch/*config1*/*"},
      "filters": {
        "model": "model-name",
        "template": "zeroshot-nosys",
        "sampler": "greedy-max"
      },
      "label": "Config 1 Description",
      "groups": ["family:...", "arch:...", "size:..."]
    }
  ],
  "db": "data/m12x-myresearch.db"
}
EOF

Step 4: Run Evaluations

Start with quick tests (degree 0, low precision):

python runner.py --config configs/m12x.yaml \
  --degree 0 --density normal --precision low \
  --model your-model \
  --template zeroshot-nosys \
  --sampler greedy-max

Move results to research subdataset:

mv results/timestamp/your-model+template+sampler/* data/m12x-myresearch/

Process evaluations:

python evaluate.py --dataset data/dataset-m12x-myresearch.json --parallel 16

Step 5: Create Research Project

mkdir -p research/myresearch
cd research/myresearch

# Initialize tracking files
touch PROGRESS.md      # Track what you're doing
touch NOTES-topic.md   # Capture observations
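
A lightweight skeleton for PROGRESS.md works well as a starting point; the headings below are only a suggestion, not a required format:

cat > PROGRESS.md <<EOF
# myresearch: Progress Log

## Research Question
Does <variable> affect <metric> for <model>?

## Experiments
- [ ] degree-0 quick test: baseline configuration
- [ ] degree-0 quick test: candidate configuration

## Observations
- (date) ...

## Next Steps
- ...
EOF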

Analyzing Research Datasets

Follow the discovery-investigation loop:

# 1. Get overview
python analyze.py evals data/dataset-m12x-myresearch.json
python analyze.py tasks data/dataset-m12x-myresearch.json

# 2. Visual discovery (save to research/myresearch/)
python analyze.py scores data/dataset-m12x-myresearch.json --format markdown > research/myresearch/leaderboard.md
python analyze.py cluster data/dataset-m12x-myresearch.json --group-by base_task --output research/myresearch/

# 3. Forensic investigation (for interesting patterns)
python analyze.py surface data/dataset-m12x-myresearch.json --filters '{"model": "target-model"}' --output research/myresearch/
python analyze.py compression data/dataset-m12x-myresearch.json --filters '{"base_task": "problematic-task"}' --output research/myresearch/
python analyze.py hazard data/dataset-m12x-myresearch.json --filters '{"model": "target-model"}' --output research/myresearch/

# 4. Visual inspection (CRITICAL: look at the PNGs!)
ls research/myresearch/*.png

Build evidence chains: Use compression → hazard → surface plots to tell the complete story

Example: Investigating a Truncation Problem

Hypothesis: Model X shows high truncation on task Y. Is it budget, sampler, or architectural?

Step 1: Quick test at degree 0

# Test 3 configurations
python runner.py --degree 0 --model X --sampler greedy-max --template zerocot-nosys
python runner.py --degree 0 --model X --sampler high-budget --template zeroshot-nosys
python runner.py --degree 0 --model X --sampler low-temp --template zerocot-nosys

Step 2: Process and compare

python evaluate.py --dataset data/dataset-m12x-mytest.json
python analyze.py cluster data/dataset-m12x-mytest.json --group-by sampler

Step 3: Visual inspection

python analyze.py compression data/dataset-m12x-mytest.json --output research/mytest/
# Look for: tight clustering at context limit? high CV ratios? bimodal distributions?

Step 4: Follow the evidence

  • If CV ratio >50x: Chaotic behavior
  • If tight clustering at limit: Hard truncation
  • If budget changes help: Overthinking loops
  • If nothing helps: Architectural issue

Step 5: Document findings

# Summarize in research/mytest/SUMMARY.md
# Include: hypothesis, experiments, findings, visualizations, conclusion
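
A minimal SUMMARY.md skeleton, assuming the hypothesis / experiments / findings / conclusion structure described above:

cat > research/mytest/SUMMARY.md <<EOF
# Model X Truncation Investigation

## Hypothesis
High truncation on task Y is caused by budget, sampler, or architecture.

## Experiments
- greedy-max + zerocot-nosys (baseline)
- high-budget + zeroshot-nosys
- low-temp + zerocot-nosys

## Findings
- ...

## Visualizations
- (link the compression / hazard / surface PNGs here)

## Conclusion
- Graduate or do not graduate, and why.
EOF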

Best Practices

Dos

  • ✅ Start small - Test at degree 0 first, expand if promising
  • ✅ Visual inspection is mandatory - Look at every PNG before drawing conclusions
  • ✅ Document as you go - PROGRESS.md prevents forgetting what you tried
  • ✅ Follow evidence chains - Compression → hazard → surface tells complete stories
  • ✅ Use existing research as templates - Study worked examples before starting
  • ✅ Graduate winners - Add successful configs to main m12x
  • ✅ Document failures - Negative results are valuable case studies

Don'ts

  • ❌ Don't skip statistical validation - Use cluster analysis, not eyeballing
  • ❌ Don't test everything - Focus on specific hypotheses
  • ❌ Don't ignore compression analysis - It reveals root causes
  • ❌ Don't assume more budget helps - Test systematically (see Seed-OSS, Qwen3-Next)
  • ❌ Don't trust sampling penalties alone - They can actively harm performance (see Qwen3-Next RP1.2)
  • ❌ Don't graduate incomplete investigations - Finish the evidence story first

The research subdatasets are living proof that ReasonScape works. Each one represents hundreds of hours of investigation, producing actionable insights that improved model deployments or revealed architectural issues. Use them as templates for your own investigations.