ReasonScape Evaluation Suites¶
ReasonScape provides two complementary evaluation methodologies for systematic assessment of large language model reasoning capabilities across multiple cognitive domains.
→ M6 Experiment Documentation (Current)
→ C2 Experiment Documentation (Legacy)
Suite Selection Guide¶
Recommendation: New research should prioritize M6 for its improved efficiency, expanded task coverage, and progressive evaluation capabilities, while referencing C2 results for historical comparison and validation.
Choose M6 if you:¶
- Are running new evaluations on current models
- Want efficient resource utilization (1.6-4.3× more coverage per token)
- Need progressive evaluation (easy→medium→hard difficulty scaling)
- Want expanded task coverage (6 reasoning domains including Objects and Boolean)
Choose C2 if you:¶
- Are comparing against established baselines with extensive model coverage
- Need legacy compatibility with existing datasets
Current Status Summary¶
Suite | Tasks | Models Evaluated | Levels | Status |
---|---|---|---|---|
M6 | 6 domains | 18+ configurations | 3 difficulty degrees (easy/medium/hard) | Active Development |
C2 | 4 domains | 20+ configurations | 2 precision levels (mini/full) | Legacy Support |
Resource Efficiency (Real Data)¶
Based on Qwen3-4B baseline across comparable configurations:
Configuration | Token Usage | Efficiency vs C2 |
---|---|---|
C2-mini | 29M tokens | 1× (baseline) |
M6 Degree 0 | 18M tokens | 1.6× more efficient |
C2-full | 223M tokens | 1× (baseline) |
M6 Degree 1 | 52M tokens | 4.3× more efficient |
M6 Degree 2 | 80M tokens | 2.8× more efficient |
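Each ratio in the table is the baseline token count divided by the M6 token count. A minimal sketch of that arithmetic, using only the figures above (the `TOKENS_M`, `BASELINE`, and `efficiency_ratio` names are illustrative and not part of ReasonScape):

```python
# Token usage in millions, copied from the table above (Qwen3-4B baseline).
TOKENS_M = {
    "C2-mini": 29,
    "C2-full": 223,
    "M6 Degree 0": 18,
    "M6 Degree 1": 52,
    "M6 Degree 2": 80,
}

# The C2 configuration each M6 degree is compared against in the table.
BASELINE = {
    "M6 Degree 0": "C2-mini",
    "M6 Degree 1": "C2-full",
    "M6 Degree 2": "C2-full",
}


def efficiency_ratio(config: str) -> float:
    """Return baseline tokens divided by this configuration's tokens, rounded to 0.1."""
    return round(TOKENS_M[BASELINE[config]] / TOKENS_M[config], 1)


if __name__ == "__main__":
    for name, base in BASELINE.items():
        print(f"{name}: {efficiency_ratio(name)}x vs {base}")
```

For example, M6 Degree 1 gives 223 / 52 ≈ 4.3, which matches the table.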
Quick Start¶
M6 Progressive Workflow¶
# Stage 1: Rapid model comparison (2-3 hours)
python runner.py --config configs/m6.json --degree 0 --precision low --density normal
# Stage 2: Standard evaluation (8-12 hours)
python runner.py --config configs/m6.json --degree 1 --precision medium --density normal
# Stage 3: Research-grade analysis (20+ hours)
python runner.py --config configs/m6.json --degree 2 --precision high --density normal
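The three stages differ only in `--degree` and `--precision`, so they can be scripted. The sketch below rebuilds the commands shown above and, if asked, runs them in order. `STAGES`, `build_command`, and `run_all` are illustrative names, and the script assumes `runner.py` and `configs/m6.json` are reachable from the working directory.

```python
import subprocess
import sys

# (degree, precision) pairs taken from the three stages above.
STAGES = [(0, "low"), (1, "medium"), (2, "high")]


def build_command(degree: int, precision: str, config: str = "configs/m6.json") -> list[str]:
    """Assemble the runner invocation for one stage as an argv list."""
    return [
        sys.executable, "runner.py",
        "--config", config,
        "--degree", str(degree),
        "--precision", precision,
        "--density", "normal",
    ]


def run_all(dry_run: bool = True) -> None:
    """Print each stage's command; execute it only when dry_run is False."""
    for degree, precision in STAGES:
        cmd = build_command(degree, precision)
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one stage fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check the flags before committing to a multi-hour run.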
C2 Traditional Workflow¶
# Stage 1: Model exploration
python runner.py --config configs/c2-mini.json
# Stage 2: Publication precision
python runner.py --config configs/c2.json
Live Results Access¶
M6 Results (Current)¶
- M6 Leaderboard: reasonscape.com/m6/leaderboard
- M6 Explorer: reasonscape.com/m6/explorer (PC required)
- M6 Dataset: reasonscape.com/data/m6
C2 Unified Results (Legacy)¶
- C2 Leaderboard: reasonscape.com/c2/leaderboard
- C2 Explorer: reasonscape.com/c2/explorer (PC required)
- C2 Dataset: reasonscape.com/data/c2
Architecture Overview¶
M6: Manifold-Based Progressive Evaluation¶
- 6 reasoning tasks: Objects, Arithmetic, Dates, Boolean, Movies, Shuffle
- 3 difficulty degrees: Adaptive complexity scaling from easy→hard
- Orthogonal controls: Independent difficulty and precision optimization
- Efficient sampling: Manifold-based parameter space exploration
C2: Grid-Based Hierarchical Evaluation¶
- 4 reasoning tasks: Arithmetic, Shuffle, Dates, Movies
- 2 precision levels: C2-mini (rapid) and C2-full (publication-quality)
- Fixed coverage: Comprehensive grid-based parameter sweeps
- Established baselines: Extensive published model comparisons
Migration and Compatibility¶
Data Compatibility:
- Both suites use identical statistical methodology (confidence intervals, excess accuracy correction)
- Results are cross-comparable through ReasonScore normalization
- Visualization tools support both formats
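This page does not give the formulas behind "confidence intervals" and "excess accuracy correction". As a rough illustration only, the sketch below uses a Wilson score interval and the common chance correction `(accuracy - guess_rate) / (1 - guess_rate)`. Both choices are assumptions about what such a methodology could look like, not ReasonScape's actual implementation, and the function names are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def excess_accuracy(accuracy: float, guess_rate: float) -> float:
    """Rescale accuracy so chance performance maps to 0 and perfect accuracy to 1.

    ASSUMPTION: a generic chance correction, clipped at 0; not taken from ReasonScape.
    """
    if not 0 <= guess_rate < 1:
        raise ValueError("guess_rate must be in [0, 1)")
    return max(0.0, (accuracy - guess_rate) / (1 - guess_rate))


if __name__ == "__main__":
    low, high = wilson_interval(correct=80, total=100)
    print(f"accuracy 0.80, interval [{low:.3f}, {high:.3f}]")
    # A true/false question can be guessed correctly half the time.
    print(f"excess accuracy: {excess_accuracy(0.80, guess_rate=0.5):.2f}")
```

Under this correction, a model that scores 75% on a task where guessing yields 50% gets an excess accuracy of 0.5, so tasks with different guess rates land on a common scale.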
Equivalent Configurations:
- C2-mini ≈ M6 degree 0 + medium precision (similar coverage, higher efficiency)
- C2-full ≈ M6 degree 1 + high precision (similar precision, enhanced task portfolio)
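When migrating scripts, the correspondence above can be kept as a small lookup. In this sketch `M6_EQUIVALENT` and `m6_args_for` are hypothetical names. The flag names come from the Quick Start commands, and `--density normal` is assumed because every Quick Start example uses it.

```python
# Approximate C2 -> M6 equivalents, as listed above.
M6_EQUIVALENT = {
    "C2-mini": {"degree": 0, "precision": "medium"},
    "C2-full": {"degree": 1, "precision": "high"},
}


def m6_args_for(c2_name: str) -> list[str]:
    """Return the M6 runner flags that roughly correspond to a C2 configuration."""
    try:
        settings = M6_EQUIVALENT[c2_name]
    except KeyError:
        raise ValueError(f"no M6 equivalent recorded for {c2_name!r}") from None
    return [
        "--config", "configs/m6.json",
        "--degree", str(settings["degree"]),
        "--precision", settings["precision"],
        "--density", "normal",  # ASSUMPTION: same density as the Quick Start examples
    ]


if __name__ == "__main__":
    for name in M6_EQUIVALENT:
        print(name, "->", "python runner.py " + " ".join(m6_args_for(name)))
```

The equivalence is approximate, so results from the mapped M6 run are comparable to the C2 run rather than identical to it.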