M6 Evaluation Suite¶
M6 is ReasonScape's current-generation reasoning evaluation suite. It implements a manifold-based evaluation system with orthogonal difficulty and precision controls, enabling efficient, scalable cognitive assessment across 6 reasoning domains.
Evaluation Architecture¶
Manifold-Based Difficulty Progression¶
M6 uses parametric manifolds with degree-controlled sampling for adaptive evaluation workflows:
- Degree 0 (Easy): 78 strategic difficulty points for rapid model comparison
- Degree 1 (Medium): 167 comprehensive points for standard evaluation
- Degree 2 (Hard): 215 maximum coverage points for research analysis
Key Advantage: Start with degree 0 for efficient model screening, then scale to higher degrees for detailed analysis of promising configurations.
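To make "degree-controlled sampling" concrete, the sketch below shows one plausible reading: the same manifold axis is covered with more points as the degree increases. The function name, per-degree densities, and evenly-spaced rule are illustrative assumptions, not the actual M6 implementation; only the idea that higher degrees mean denser coverage comes from this page.

```python
# Minimal sketch of degree-controlled sampling along one manifold axis.
# The densities and even spacing are assumptions for illustration.
def sample_axis(lo: int, hi: int, degree: int) -> list[int]:
    """Pick evenly spaced integer values along [lo, hi]; more at higher degree."""
    n_points = {0: 3, 1: 5, 2: 7}[degree]  # assumed per-degree densities
    step = (hi - lo) / (n_points - 1)
    return sorted({round(lo + i * step) for i in range(n_points)})

print(sample_axis(4, 56, 0))  # [4, 30, 56]
print(sample_axis(4, 56, 2))  # [4, 13, 21, 30, 39, 47, 56]
```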
Orthogonal Precision Control¶
M6 separates difficulty coverage from statistical precision through independent precision levels:
| Precision | Target CI | Base Samples | Max Samples | Use Case |
|---|---|---|---|---|
| Low | 9% | 32 per point | 192 per point | Rapid model exploration |
| Medium | 6% | 64 per point | 512 per point | Standard comparison |
| High | 4% | 128 per point | 1,280 per point | Publication precision |
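The table translates directly into a per-run request budget: every difficulty point starts at the base sample count and can grow to the maximum. The snippet below restates the presets and computes the resulting completion counts; the dictionary layout and helper name are illustrative, not runner.py internals.

```python
# Precision presets transcribed from the table above; the budget helper is an
# illustrative sketch (names are hypothetical, not the runner's API).
PRECISION = {
    "low":    {"target_ci": 0.09, "base": 32,  "max": 192},
    "medium": {"target_ci": 0.06, "base": 64,  "max": 512},
    "high":   {"target_ci": 0.04, "base": 128, "max": 1280},
}

def request_budget(points: int, precision: str) -> tuple[int, int]:
    """Return (minimum, worst-case) completion requests for a run."""
    p = PRECISION[precision]
    return points * p["base"], points * p["max"]

print(request_budget(167, "medium"))  # (10688, 85504) for a degree-1 run
```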
Density Resampling¶
M6 enables resource-adaptive evaluation through density controls:
- Normal: Full manifold coverage for comprehensive analysis
- Lowdef: Strategic 3-point sampling (first, middle, last) for 15% efficiency gain
- Corner: Boundary sampling (first, last) for 30% efficiency gain with preserved boundary coverage
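The selection rule behind these presets can be sketched as follows: "lowdef" keeps first/middle/last and "corner" keeps only the boundaries. The page does not specify exactly which ordered list the rule is applied to (the modest 15-30% overall gains suggest it prunes only part of the grid), so the function and its inputs below are illustrative assumptions.

```python
# Illustrative sketch of the density presets' selection rule. What exactly
# the rule is applied to inside M6 is an assumption here.
def apply_density(values: list, density: str) -> list:
    if density == "normal" or len(values) <= 3:
        return values
    if density == "lowdef":
        return [values[0], values[len(values) // 2], values[-1]]
    if density == "corner":
        return [values[0], values[-1]]
    raise ValueError(f"unknown density: {density}")

print(apply_density(list(range(8)), "lowdef"))  # [0, 4, 7]
print(apply_density(list(range(8)), "corner"))  # [0, 7]
```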
Task Portfolio¶
M6 evaluates 6 reasoning domains with manifold-based difficulty scaling:
Task Coverage by Degree¶
| Task | Degree 0 | Degree 1 | Degree 2 | Key Difficulty Dimensions |
|---|---|---|---|---|
| Objects | 24 pts | 24 pts | 24 pts | Target groups × Length × Distractors |
| Arithmetic | 26 pts | 39 pts | 39 pts | Length × Depth across number ranges |
| Dates | 8 pts | 12 pts | 16 pts | Question tier × Date format |
| Boolean | 8 pts | 20 pts | 40 pts | Length × Depth × Format variations |
| Movies | 4 pts | 24 pts | 32 pts | Hints × Reference count |
| Shuffle | 24 pts | 48 pts | 64 pts | Length × Depth × Confounders |
Boolean Reasoning: Shows the most aggressive scaling (8→20→40 points), since growth in logical complexity reveals significant model differentiation across degrees.
Objects Processing: Maintains consistent coverage (24 points) across degrees, but models show dramatic performance degradation from degree 1→2, indicating selective attention limits.
Movies Analysis: Redesigned from C2's cultural knowledge approach to focus on pure pattern recognition through hint×reference information processing.
Configuration Matrix¶
Coverage and Resource Usage¶
| Degree × Precision × Density | Total Points | Target CI | Approx. Token Usage* | Use Case |
|---|---|---|---|---|
| D0 × Low × Normal | 78 | 9% | ~3M tokens | Quick model comparison |
| D0 × Medium × Normal | 78 | 6% | ~9M tokens | Standard rapid evaluation |
| D1 × Low × Normal | 167 | 9% | ~10M tokens | Efficient comprehensive assessment |
| D1 × Medium × Normal | 167 | 6% | ~30M tokens | Standard evaluation |
| D1 × High × Normal | 167 | 4% | ~60M tokens | Publication precision |
| D2 × Low × Normal | 215 | 9% | ~20M tokens | Research analysis |
| D2 × Medium × Normal | 215 | 6% | ~60M tokens | Research analysis |
| D2 × High × Normal | 215 | 4% | ~200M tokens | Maximum precision research |
*Token usage varies significantly by model (Qwen3 thinking models use 3-10× more tokens than standard models)
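For planning purposes, these figures are roughly points × samples × average tokens per completion. The sketch below makes that arithmetic explicit; the average-tokens figure is an assumed, model-dependent number, and plugging in only the base sample counts gives a lower bound (adaptive expansion, prompt tokens, and thinking traces push real usage toward the figures above).

```python
# Back-of-the-envelope completion-token estimate for one cell of the matrix
# above. avg_tokens_per_sample is an assumption; real usage also depends on
# prompt length, adaptive sample expansion, and thinking-trace length.
def rough_token_budget(points: int, samples_per_point: int,
                       avg_tokens_per_sample: int) -> float:
    """Approximate completion tokens, in millions."""
    return points * samples_per_point * avg_tokens_per_sample / 1e6

# Lower bound for D1 x Medium using base samples and ~1k-token answers:
print(f"~{rough_token_budget(167, 64, 1000):.0f}M tokens")  # ~11M
```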
Recommended Configurations¶
Quick Model Screening: D0 × Low × Normal
- 78 difficulty points across 6 tasks
- ~20M tokens for most models
- 2-3 hours evaluation time
Standard Comparison: D1 × Medium × Normal
- 167 difficulty points for complete assessment
- ~60M tokens for most models
- 8-12 hours evaluation time
Research Analysis: D2 × High × Normal
- 215 difficulty points with maximum coverage
- ~150M tokens for most models
- 20+ hours evaluation time
Usage Examples¶
Quick Model Comparison¶
```bash
python runner.py \
  --config configs/m6.json \
  --degree 0 \
  --precision low \
  --density normal \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json
```
Standard Evaluation¶
```bash
python runner.py \
  --config configs/m6.json \
  --degree 1 \
  --precision medium \
  --density normal \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json
```
Research-Grade Analysis¶
```bash
python runner.py \
  --config configs/m6.json \
  --degree 2 \
  --precision high \
  --density normal \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-max.json
```
Resource-Optimized Evaluation¶
```bash
# Use corner density for 30% efficiency gain
python runner.py \
  --config configs/m6.json \
  --degree 1 \
  --precision medium \
  --density corner \
  --template templates/zerocot-nosys.json \
  --sampler samplers/greedy-4k.json
```
Technical Implementation¶
Statistical Methodology¶
Adaptive Sampling: Each difficulty point uses base samples (32-128) with automatic expansion up to maximum samples (192-1,280) based on confidence interval targeting.
Excess Accuracy Correction: Statistical adjustment for models achieving near-perfect performance on easier difficulty points.
Confidence Intervals: Wilson score intervals with continuity correction, targeting 4-9% width depending on precision level.
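A minimal sketch of this methodology is shown below: a Wilson score interval with continuity correction, plus the stop/expand decision that grows the sample pool until the interval is narrower than the precision target or the per-point maximum is reached. It assumes the target refers to the full interval width and z ≈ 1.96; the function names are illustrative, not the evaluator's API.

```python
import math

Z = 1.96  # assumed ~95% confidence level

def wilson_cc(successes: int, n: int) -> tuple[float, float]:
    """Wilson score interval with continuity correction."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    z2 = Z * Z
    denom = 2 * (n + z2)
    lo = (2*n*p + z2 - 1 - Z * math.sqrt(z2 - 2 - 1/n + 4*p*(n*(1 - p) + 1))) / denom
    hi = (2*n*p + z2 + 1 + Z * math.sqrt(z2 + 2 - 1/n + 4*p*(n*(1 - p) - 1))) / denom
    return max(0.0, lo), min(1.0, hi)

def should_expand(successes: int, n: int, target_width: float, max_samples: int) -> bool:
    """Keep sampling while the interval is wider than the target and budget remains."""
    lo, hi = wilson_cc(successes, n)
    return (hi - lo) > target_width and n < max_samples

# Example: 48/64 correct at medium precision (6% target, 512-sample cap)
print(wilson_cc(48, 64), should_expand(48, 64, 0.06, 512))
```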
Manifold Parameter Space¶
M6 explores reasoning difficulty through parametric manifolds rather than fixed grids:
Objects Manifold: target_groups(1-8) × length(4-56) × distractors(0-4)
- Selective attention assessment across semantic categories
- 11 domains: Animals, Colors, Foods, etc.
Boolean Manifold: length(4-32) × depth(0-6) × format(3 types)
- Logical reasoning with nested operators
- Precedence rules and format variations
Arithmetic Manifolds (3 separate):
- Small numbers (-9 to 9) with whitespace variations
- Large numbers (-99 to 99) with depth scaling
- Pure whitespace robustness testing
Movies Manifold: hints(0-4) × reference_count(3-16)
- Pattern recognition through hint-based information filtering
- Reduced cultural knowledge dependency vs C2
Shuffle Manifolds (2 separate):
- No anchor: Pure sequential tracking
- With anchors: Numeric/alphabetic organizational markers
Dates Manifold: tier(0-3) × format(4 types)
- Temporal reasoning complexity scaling
- Format variations: USA, Natural, Ordinal, Offset
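As a concrete example of how these parameter spaces turn into the counts in the task coverage table, the sketch below enumerates the Dates grid: the full tier(0-3) × format(4) product yields the 16 degree-2 points, and truncating the tier axis reproduces the 8- and 12-point counts at lower degrees. The truncation rule is an assumption for illustration.

```python
from itertools import product

# Illustrative enumeration of the Dates manifold grid. Parameter values mirror
# the description above; how lower degrees subsample the grid is an assumption
# (here: truncating the tier axis).
TIERS = [0, 1, 2, 3]
FORMATS = ["USA", "Natural", "Ordinal", "Offset"]

def dates_points(degree: int) -> list[tuple[int, str]]:
    tiers_by_degree = {0: TIERS[:2], 1: TIERS[:3], 2: TIERS}  # 8 / 12 / 16 points
    return list(product(tiers_by_degree[degree], FORMATS))

print([len(dates_points(d)) for d in (0, 1, 2)])  # [8, 12, 16]
```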
Viewing M6 Results¶
M6 results integrate with ReasonScape's visualization tools:
```bash
# Generate M6 dataset from evaluation results
python evaluate.py --interview 'results/*m6*/*.ndjson' --output data/m6-results.json

# View multi-task rankings
python leaderboard.py data/m6-results.json

# Explore 6D cognitive assessment space
python explorer.py data/m6-results.json
```
The visualization system automatically adapts to M6's manifold structure, enabling intuitive exploration across all 6 reasoning domains with degree-aware difficulty rendering.