M6 Evaluation Suite¶
M6 is ReasonScape's current-generation reasoning evaluation suite. It implements a manifold-based evaluation system with orthogonal difficulty and precision controls, enabling efficient, scalable cognitive assessment across 6 reasoning domains.
Evaluation Architecture¶
Manifold-Based Difficulty Progression¶
M6 uses parametric manifolds with degree-controlled sampling for adaptive evaluation workflows:
- Degree 0 (Easy): 78 strategic difficulty points for rapid model comparison
- Degree 1 (Medium): 167 comprehensive points for standard evaluation
- Degree 2 (Hard): 215 maximum coverage points for research analysis
Key Advantage: Start with degree 0 for efficient model screening, then scale to higher degrees for detailed analysis of promising configurations.
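To make "degree-controlled sampling" concrete, the sketch below shows one plausible reading: the same manifold axis is covered with more points as the degree increases. The function name, per-degree densities, and evenly-spaced rule are illustrative assumptions, not the actual M6 implementation; only the idea that higher degrees mean denser coverage comes from this page.

```python
# Minimal sketch of degree-controlled sampling along one manifold axis.
# The densities and even spacing are assumptions for illustration.
def sample_axis(lo: int, hi: int, degree: int) -> list[int]:
    """Pick evenly spaced integer values along [lo, hi]; more at higher degree."""
    n_points = {0: 3, 1: 5, 2: 7}[degree]  # assumed per-degree densities
    step = (hi - lo) / (n_points - 1)
    return sorted({round(lo + i * step) for i in range(n_points)})

print(sample_axis(4, 56, 0))  # [4, 30, 56]
print(sample_axis(4, 56, 2))  # [4, 13, 21, 30, 39, 47, 56]
```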
Orthogonal Precision Control¶
M6 separates difficulty coverage from statistical precision through independent precision levels:
| Precision | Target CI | Base Samples | Max Samples | Use Case |
|---|---|---|---|---|
| Low | 9% | 32 per point | 192 per point | Rapid model exploration |
| Medium | 6% | 64 per point | 512 per point | Standard comparison |
| High | 4% | 128 per point | 1,280 per point | Publication precision |
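The table translates directly into a per-run request budget: every difficulty point starts at the base sample count and can grow to the maximum. The snippet below restates the presets and computes the resulting completion counts; the dictionary layout and helper name are illustrative, not runner.py internals.

```python
# Precision presets transcribed from the table above; the budget helper is an
# illustrative sketch (names are hypothetical, not the runner's API).
PRECISION = {
    "low":    {"target_ci": 0.09, "base": 32,  "max": 192},
    "medium": {"target_ci": 0.06, "base": 64,  "max": 512},
    "high":   {"target_ci": 0.04, "base": 128, "max": 1280},
}

def request_budget(points: int, precision: str) -> tuple[int, int]:
    """Return (minimum, worst-case) completion requests for a run."""
    p = PRECISION[precision]
    return points * p["base"], points * p["max"]

print(request_budget(167, "medium"))  # (10688, 85504) for a degree-1 run
```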
Density Resampling¶
M6 enables resource-adaptive evaluation through density controls:
- Normal: Full manifold coverage for comprehensive analysis
- Lowdef: Strategic 3-point sampling (first, middle, last) for 15% efficiency gain
- Corner: Boundary sampling (first, last) for 30% efficiency gain with preserved boundary coverage
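The selection rule behind these presets can be sketched as follows: "lowdef" keeps first/middle/last and "corner" keeps only the boundaries. The page does not specify exactly which ordered list the rule is applied to (the modest 15-30% overall gains suggest it prunes only part of the grid), so the function and its inputs below are illustrative assumptions.

```python
# Illustrative sketch of the density presets' selection rule. What exactly
# the rule is applied to inside M6 is an assumption here.
def apply_density(values: list, density: str) -> list:
    if density == "normal" or len(values) <= 3:
        return values
    if density == "lowdef":
        return [values[0], values[len(values) // 2], values[-1]]
    if density == "corner":
        return [values[0], values[-1]]
    raise ValueError(f"unknown density: {density}")

print(apply_density(list(range(8)), "lowdef"))  # [0, 4, 7]
print(apply_density(list(range(8)), "corner"))  # [0, 7]
```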
Task Portfolio¶
M6 evaluates 6 reasoning domains with manifold-based difficulty scaling:
Task Coverage by Degree¶
| Task | Degree 0 | Degree 1 | Degree 2 | Key Difficulty Dimensions |
|---|---|---|---|---|
| Objects | 24 pts | 24 pts | 24 pts | Target groups × Length × Distractors |
| Arithmetic | 26 pts | 39 pts | 39 pts | Length × Depth across number ranges |
| Dates | 8 pts | 12 pts | 16 pts | Question tier × Date format |
| Boolean | 8 pts | 20 pts | 40 pts | Length × Depth × Format variations |
| Movies | 4 pts | 24 pts | 32 pts | Hints × Reference count |
| Shuffle | 24 pts | 48 pts | 64 pts | Length × Depth × Confounders |
Boolean Reasoning: Shows the most aggressive scaling (8→20→40 points), since growth in logical complexity reveals significant model differentiation across degrees.
Objects Processing: Maintains consistent coverage (24 points) across degrees, but models show dramatic performance degradation from degree 1→2, indicating selective attention limits.
Movies Analysis: Redesigned from C2's cultural knowledge approach to focus on pure pattern recognition through hint×reference information processing.
Configuration Matrix¶
Coverage and Resource Usage¶
| Degree × Precision × Density | Total Points | Target CI | Approx. Token Usage* | Use Case |
|---|---|---|---|---|
| D0 × Low × Normal | 78 | 9% | ~3M tokens | Quick model comparison |
| D0 × Medium × Normal | 78 | 6% | ~9M tokens | Standard rapid evaluation |
| D1 × Low × Normal | 167 | 9% | ~10M tokens | Efficient comprehensive assessment |
| D1 × Medium × Normal | 167 | 6% | ~30M tokens | Standard evaluation |
| D1 × High × Normal | 167 | 4% | ~60M tokens | Publication precision |
| D2 × Low × Normal | 215 | 9% | ~20M tokens | Research analysis |
| D2 × Medium × Normal | 215 | 6% | ~60M tokens | Research analysis |
| D2 × High × Normal | 215 | 4% | ~200M tokens | Maximum precision research |
*Token usage varies significantly by model (Qwen3 thinking models use 3-10× more tokens than standard models)
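For planning purposes, these figures are roughly points × samples × average tokens per completion. The sketch below makes that arithmetic explicit; the average-tokens figure is an assumed, model-dependent number, and plugging in only the base sample counts gives a lower bound (adaptive expansion, prompt tokens, and thinking traces push real usage toward the figures above).

```python
# Back-of-the-envelope completion-token estimate for one cell of the matrix
# above. avg_tokens_per_sample is an assumption; real usage also depends on
# prompt length, adaptive sample expansion, and thinking-trace length.
def rough_token_budget(points: int, samples_per_point: int,
                       avg_tokens_per_sample: int) -> float:
    """Approximate completion tokens, in millions."""
    return points * samples_per_point * avg_tokens_per_sample / 1e6

# Lower bound for D1 x Medium using base samples and ~1k-token answers:
print(f"~{rough_token_budget(167, 64, 1000):.0f}M tokens")  # ~11M
```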
Recommended Configurations¶
Quick Model Screening: D0 × Low × Normal
- 78 difficulty points across 6 tasks
- ~20M tokens for most models
- 2-3 hours evaluation time
Standard Comparison: D1 × Medium × Normal
- 167 difficulty points for complete assessment
- ~60M tokens for most models
- 8-12 hours evaluation time
Research Analysis: D2 × High × Normal
- 215 difficulty points with maximum coverage
- ~150M tokens for most models
- 20+ hours evaluation time
Usage Examples¶
Quick Model Comparison¶
```bash
python runner.py \
  --config configs/m6.json \
  --degree 0 \
  --precision low \
  --density normal \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json
```
Standard Evaluation¶
```bash
python runner.py \
  --config configs/m6.json \
  --degree 1 \
  --precision medium \
  --density normal \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json
```
Research-Grade Analysis¶
```bash
python runner.py \
  --config configs/m6.json \
  --degree 2 \
  --precision high \
  --density normal \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-max.json
```
Resource-Optimized Evaluation¶
```bash
# Use corner density for 30% efficiency gain
python runner.py \
  --config configs/m6.json \
  --degree 1 \
  --precision medium \
  --density corner \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json
```
Technical Implementation¶
Statistical Methodology¶
Adaptive Sampling: Each difficulty point uses base samples (32-128) with automatic expansion up to maximum samples (192-1,280) based on confidence interval targeting.
Excess Accuracy Correction: Statistical adjustment for models achieving near-perfect performance on easier difficulty points.
Confidence Intervals: Wilson score intervals with continuity correction, targeting 4-9% width depending on precision level.
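A minimal sketch of this methodology is shown below: a Wilson score interval with continuity correction, plus the stop/expand decision that grows the sample pool until the interval is narrower than the precision target or the per-point maximum is reached. It assumes the target refers to the full interval width and z ≈ 1.96; the function names are illustrative, not the evaluator's API.

```python
import math

Z = 1.96  # assumed ~95% confidence level

def wilson_cc(successes: int, n: int) -> tuple[float, float]:
    """Wilson score interval with continuity correction."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    z2 = Z * Z
    denom = 2 * (n + z2)
    lo = (2*n*p + z2 - 1 - Z * math.sqrt(z2 - 2 - 1/n + 4*p*(n*(1 - p) + 1))) / denom
    hi = (2*n*p + z2 + 1 + Z * math.sqrt(z2 + 2 - 1/n + 4*p*(n*(1 - p) - 1))) / denom
    return max(0.0, lo), min(1.0, hi)

def should_expand(successes: int, n: int, target_width: float, max_samples: int) -> bool:
    """Keep sampling while the interval is wider than the target and budget remains."""
    lo, hi = wilson_cc(successes, n)
    return (hi - lo) > target_width and n < max_samples

# Example: 48/64 correct at medium precision (6% target, 512-sample cap)
print(wilson_cc(48, 64), should_expand(48, 64, 0.06, 512))
```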
Manifold Parameter Space¶
M6 explores reasoning difficulty through parametric manifolds rather than fixed grids:
Objects Manifold: target_groups(1-8) × length(4-56) × distractors(0-4)
- Selective attention assessment across semantic categories
- 11 domains: Animals, Colors, Foods, etc.
Boolean Manifold: length(4-32) × depth(0-6) × format(3 types)
- Logical reasoning with nested operators
- Precedence rules and format variations
Arithmetic Manifolds (3 separate):
- Small numbers (-9 to 9) with whitespace variations
- Large numbers (-99 to 99) with depth scaling
- Pure whitespace robustness testing
Movies Manifold: hints(0-4) × reference_count(3-16)
- Pattern recognition through hint-based information filtering
- Reduced cultural knowledge dependency vs C2
Shuffle Manifolds (2 separate):
- No anchor: Pure sequential tracking
- With anchors: Numeric/alphabetic organizational markers
Dates Manifold: tier(0-3) × format(4 types)
- Temporal reasoning complexity scaling
- Format variations: USA, Natural, Ordinal, Offset
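As a concrete example of how these parameter spaces turn into the counts in the task coverage table, the sketch below enumerates the Dates grid: the full tier(0-3) × format(4) product yields the 16 degree-2 points, and truncating the tier axis reproduces the 8- and 12-point counts at lower degrees. The truncation rule is an assumption for illustration.

```python
from itertools import product

# Illustrative enumeration of the Dates manifold grid. Parameter values mirror
# the description above; how lower degrees subsample the grid is an assumption
# (here: truncating the tier axis).
TIERS = [0, 1, 2, 3]
FORMATS = ["USA", "Natural", "Ordinal", "Offset"]

def dates_points(degree: int) -> list[tuple[int, str]]:
    tiers_by_degree = {0: TIERS[:2], 1: TIERS[:3], 2: TIERS}  # 8 / 12 / 16 points
    return list(product(tiers_by_degree[degree], FORMATS))

print([len(dates_points(d)) for d in (0, 1, 2)])  # [8, 12, 16]
```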
Viewing M6 Results¶
M6 results integrate with ReasonScape's visualization tools:
```bash
# Generate M6 dataset from evaluation results
python evaluate.py --interview 'results/*m6*/*.ndjson' --output data/m6-results.json

# View multi-task rankings
python leaderboard.py data/m6-results.json

# Explore 6D cognitive assessment space
python explorer.py data/m6-results.json
```
The visualization system automatically adapts to M6's manifold structure, enabling intuitive exploration across all 6 reasoning domains with degree-aware difficulty rendering.