M6 Evaluation Suite

M6 is ReasonScape's current-generation reasoning evaluation suite. It implements a manifold-based evaluation system with orthogonal difficulty and precision controls, enabling efficient, scalable cognitive assessment across 6 reasoning domains.

Evaluation Architecture

Manifold-Based Difficulty Progression

M6 uses parametric manifolds with degree-controlled sampling for adaptive evaluation workflows:

  • Degree 0 (Easy): 78 strategic difficulty points for rapid model comparison
  • Degree 1 (Medium): 167 comprehensive points for standard evaluation
  • Degree 2 (Hard): 215 maximum coverage points for research analysis

Key Advantage: Start with degree 0 for efficient model screening, then scale to higher degrees for detailed analysis of promising configurations.

Orthogonal Precision Control

M6 separates difficulty coverage from statistical precision through independent precision levels:

Precision   Target CI   Base Samples    Max Samples       Use Case
Low         9%          32 per point    192 per point     Rapid model exploration
Medium      6%          64 per point    512 per point     Standard comparison
High        4%          128 per point   1,280 per point   Publication precision

Density Resampling

M6 enables resource-adaptive evaluation through density controls:

  • Normal: Full manifold coverage for comprehensive analysis
  • Lowdef: Strategic 3-point sampling (first, middle, last) for 15% efficiency gain
  • Corner: Boundary sampling (first, last) for 30% efficiency gain with preserved boundary coverage
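The density modes above amount to index selection along an ordered list of difficulty points. A minimal sketch of that selection logic (the function name and point representation are illustrative, not ReasonScape's actual API):

```python
def apply_density(points, density="normal"):
    """Subsample an ordered list of difficulty points.

    normal: keep every point (full manifold coverage)
    lowdef: keep the first, middle, and last points
    corner: keep only the boundary points (first, last)
    """
    if density == "normal" or len(points) <= 2:
        return list(points)
    if density == "lowdef":
        return [points[0], points[len(points) // 2], points[-1]]
    if density == "corner":
        return [points[0], points[-1]]
    raise ValueError(f"unknown density: {density}")
```

Because the first and last points of each manifold axis are always retained, boundary behavior (the easiest and hardest settings) stays observable even in corner mode.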

Task Portfolio

M6 evaluates 6 reasoning domains with manifold-based difficulty scaling:

Task Coverage by Degree

Task        Degree 0   Degree 1   Degree 2   Key Difficulty Dimensions
Objects     24 pts     24 pts     24 pts     Target groups × Length × Distractors
Arithmetic  26 pts     39 pts     39 pts     Length × Depth across number ranges
Dates       8 pts      12 pts     16 pts     Question tier × Date format
Boolean     8 pts      20 pts     40 pts     Length × Depth × Format variations
Movies      4 pts      24 pts     32 pts     Hints × Reference count
Shuffle     24 pts     48 pts     64 pts     Length × Depth × Confounders

Boolean Reasoning: Shows the most aggressive scaling (8→20→40 points); growing logical complexity reveals significant model differentiation across degrees.

Objects Processing: Maintains consistent coverage (24 points) across degrees but demonstrates dramatic performance degradation from degree 1→2, indicating selective attention limits.

Movies Analysis: Redesigned from C2's cultural knowledge approach to focus on pure pattern recognition through hint×reference information processing.

Configuration Matrix

Coverage and Resource Usage

Degree × Precision × Density   Total Points   Target CI   Approx. Token Usage*   Use Case
D0 × Low × Normal              78             9%          ~3M tokens             Quick model comparison
D0 × Medium × Normal           78             6%          ~9M tokens             Standard rapid evaluation
D1 × Low × Normal              167            9%          ~10M tokens            Efficient comprehensive assessment
D1 × Medium × Normal           167            6%          ~30M tokens            Standard evaluation
D1 × High × Normal             167            4%          ~60M tokens            Publication precision
D2 × Low × Normal              215            9%          ~20M tokens            Research analysis
D2 × Medium × Normal           215            6%          ~60M tokens            Research analysis
D2 × High × Normal             215            4%          ~200M tokens           Maximum precision research

*Token usage varies significantly by model (Qwen3 thinking models use 3-10× more tokens than standard models)
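The budgets in the table can be sanity-checked as points × samples × tokens per completion. The tokens-per-completion figure below is a back-of-envelope assumption chosen to reproduce the table's D1 × Medium row, not a measured M6 constant:

```python
def estimate_tokens(points, samples_per_point, tokens_per_sample):
    """Rough token budget: difficulty points x samples x tokens per completion."""
    return points * samples_per_point * tokens_per_sample

# D1 x Medium: 167 points x 64 base samples x ~2,800 tokens/completion
budget = estimate_tokens(167, 64, 2800)  # ~30M tokens, matching the table
```

Adaptive expansion beyond the base sample count, and the 3-10× multiplier for thinking models, push real usage well above this floor.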

Quick Model Screening: D0 × Low × Normal

  • 78 difficulty points across 6 tasks
  • ~20M tokens for most models
  • 2-3 hours evaluation time

Standard Comparison: D1 × Medium × Normal

  • 167 difficulty points for complete assessment
  • ~60M tokens for most models
  • 8-12 hours evaluation time

Research Analysis: D2 × High × Normal

  • 215 difficulty points with maximum coverage
  • ~150M tokens for most models
  • 20+ hours evaluation time

Usage Examples

Quick Model Comparison

python runner.py \
    --config configs/m6.json \
    --degree 0 \
    --precision low \
    --density normal \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

Standard Evaluation

python runner.py \
    --config configs/m6.json \
    --degree 1 \
    --precision medium \
    --density normal \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

Research-Grade Analysis

python runner.py \
    --config configs/m6.json \
    --degree 2 \
    --precision high \
    --density normal \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-max.json

Resource-Optimized Evaluation

# Use corner density for 30% efficiency gain
python runner.py \
    --config configs/m6.json \
    --degree 1 \
    --precision medium \
    --density corner \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

Technical Implementation

Statistical Methodology

Adaptive Sampling: Each difficulty point uses base samples (32-128) with automatic expansion up to maximum samples (192-1,280) based on confidence interval targeting.
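The expansion step can be sketched as a doubling loop that stops once the expected interval width meets the precision target or the cap is reached. This sketch uses a plain normal-approximation width for brevity (M6 itself targets Wilson intervals), and the function name is illustrative:

```python
import math

def samples_for_target(p_hat, target_width, base, max_samples, z=1.96):
    """Double the per-point sample count until the normal-approximation
    CI width drops to the target, or the sample cap is hit."""
    def width(n):
        # Full CI width: 2 * z * standard error of a proportion
        return 2 * z * math.sqrt(p_hat * (1 - p_hat) / n)

    n = base
    while width(n) > target_width and n < max_samples:
        n = min(2 * n, max_samples)
    return n
```

Near-chance accuracy (p ≈ 0.5) maximizes the variance term, so mid-difficulty points tend to exhaust the sample cap while very easy or very hard points converge at the base count.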

Excess Accuracy Correction: Statistical adjustment for models achieving near-perfect performance on easier difficulty points.

Confidence Intervals: Wilson score intervals with continuity correction, targeting 4-9% width depending on precision level.
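A generic implementation of the Wilson score interval with continuity correction (Newcombe's closed form — this is a textbook sketch, not ReasonScape's code):

```python
import math

def wilson_cc(k, n, z=1.96):
    """Wilson score interval with continuity correction for k successes
    in n trials. Returns (lower, upper) bounds clamped to [0, 1]."""
    if n == 0:
        return 0.0, 1.0
    p = k / n
    denom = 2 * (n + z * z)
    if k == 0:
        lo = 0.0
    else:
        lo = (2 * n * p + z * z
              - (z * math.sqrt(z * z - 1 / n + 4 * n * p * (1 - p) + (4 * p - 2)) + 1)) / denom
    if k == n:
        hi = 1.0
    else:
        hi = (2 * n * p + z * z
              + (z * math.sqrt(z * z - 1 / n + 4 * n * p * (1 - p) - (4 * p - 2)) + 1)) / denom
    return max(0.0, lo), min(1.0, hi)
```

Unlike the normal approximation, these bounds stay inside [0, 1] and remain well-behaved at extreme accuracies, which matters when strong models saturate easy difficulty points.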

Manifold Parameter Space

M6 explores reasoning difficulty through parametric manifolds rather than fixed grids:

Objects Manifold: target_groups(1-8) × length(4-56) × distractors(0-4)

  • Selective attention assessment across semantic categories
  • 11 domains: Animals, Colors, Foods, etc.
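A manifold like this is the Cartesian product of its axes. The sketch below enumerates one: the axis ranges come from the text, but the specific sampled values are hypothetical (chosen so the grid size matches the 24-point Objects coverage in the table above), and degree-controlled sampling would select from such a grid rather than use it verbatim:

```python
from itertools import product

# Hypothetical axis samples within the documented ranges:
# target_groups(1-8) x length(4-56) x distractors(0-4)
OBJECTS_AXES = {
    "target_groups": [1, 8],
    "length": [4, 20, 40, 56],
    "distractors": [0, 2, 4],
}

def manifold_points(axes):
    """Enumerate every combination of axis values as one dict per point."""
    names = list(axes)
    return [dict(zip(names, values)) for values in product(*axes.values())]

points = manifold_points(OBJECTS_AXES)  # 2 * 4 * 3 = 24 combinations
```

The same enumeration applies to every M6 manifold; only the axis names and ranges change.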

Boolean Manifold: length(4-32) × depth(0-6) × format(3 types)

  • Logical reasoning with nested operators
  • Precedence rules and format variations

Arithmetic Manifolds (3 separate):

  • Small numbers (-9 to 9) with whitespace variations
  • Large numbers (-99 to 99) with depth scaling
  • Pure whitespace robustness testing

Movies Manifold: hints(0-4) × reference_count(3-16)

  • Pattern recognition through hint-based information filtering
  • Reduced cultural knowledge dependency vs C2

Shuffle Manifolds (2 separate):

  • No anchor: Pure sequential tracking
  • With anchors: Numeric/alphabetic organizational markers

Dates Manifold: tier(0-3) × format(4 types)

  • Temporal reasoning complexity scaling
  • Format variations: USA, Natural, Ordinal, Offset

Viewing M6 Results

M6 results integrate with ReasonScape's visualization tools:

# Generate M6 dataset from evaluation results
python evaluate.py --interview 'results/*m6*/*.ndjson' --output data/m6-results.json

# View multi-task rankings
python leaderboard.py data/m6-results.json

# Explore 6D cognitive assessment space  
python explorer.py data/m6-results.json

The visualization system automatically adapts to M6's manifold structure, enabling intuitive exploration across all 6 reasoning domains with degree-aware difficulty rendering.