ReasonScape Evaluation Suites

ReasonScape provides two complementary evaluation methodologies for systematic assessment of large language model reasoning capabilities across multiple cognitive domains.

M6 Experiment Documentation (Current)

C2 Experiment Documentation (Legacy)

Suite Selection Guide

Recommendation: New research should prioritize M6 for its improved efficiency, expanded task coverage, and progressive evaluation capabilities, while referencing C2 results for historical comparison and validation.

Choose M6 if you:

  • Are running new evaluations on current models
  • Want efficient resource utilization (1.6–4.3× more coverage per token)
  • Need progressive evaluation (easy→medium→hard difficulty scaling)
  • Want expanded task coverage (6 reasoning domains, including Objects and Boolean)

Choose C2 if you:

  • Are comparing against established baselines with extensive model coverage
  • Need legacy compatibility with existing datasets

Current Status Summary

Suite  Tasks      Models Evaluated    Difficulty Levels             Status
M6     6 domains  18+ configurations  3 degrees (easy/medium/hard)  Active Development
C2     4 domains  20+ configurations  2 levels (mini/full)          Legacy Support

Resource Efficiency (Real Data)

Based on a Qwen3-4B baseline across comparable configurations (M6 Degree 0 is compared against C2-mini; M6 Degrees 1 and 2 against C2-full):

Configuration  Token Usage  Efficiency vs C2
C2-mini        29M tokens   1× (baseline)
M6 Degree 0    18M tokens   1.6× more efficient
C2-full        223M tokens  1× (baseline)
M6 Degree 1    52M tokens   4.3× more efficient
M6 Degree 2    80M tokens   2.8× more efficient
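
The multipliers are plain token ratios against the comparable C2 baseline. A quick sanity check in Python, using the figures from the table above:

# Efficiency = tokens used by the comparable C2 run / tokens used by the M6 run.
# All figures are from the Qwen3-4B table above.
comparisons = [
    ("M6 Degree 0 vs C2-mini", 29e6, 18e6),
    ("M6 Degree 1 vs C2-full", 223e6, 52e6),
    ("M6 Degree 2 vs C2-full", 223e6, 80e6),
]
for label, c2_tokens, m6_tokens in comparisons:
    print(f"{label}: {c2_tokens / m6_tokens:.1f}x more efficient")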

Quick Start

M6 Progressive Workflow

# Stage 1: Rapid model comparison (2-3 hours)
python runner.py --config configs/m6.json --degree 0 --precision low --density normal

# Stage 2: Standard evaluation (8-12 hours)  
python runner.py --config configs/m6.json --degree 1 --precision medium --density normal

# Stage 3: Research-grade analysis (20+ hours)
python runner.py --config configs/m6.json --degree 2 --precision high --density normal

C2 Traditional Workflow

# Stage 1: Model exploration
python runner.py --config configs/c2-mini.json

# Stage 2: Publication precision
python runner.py --config configs/c2.json

Live Results Access

M6 Results (Current)

C2 Unified Results (Legacy)

Architecture Overview

M6: Manifold-Based Progressive Evaluation

  • 6 reasoning tasks: Objects, Arithmetic, Dates, Boolean, Movies, Shuffle
  • 3 difficulty degrees: Adaptive complexity scaling from easy→hard
  • Orthogonal controls: Independent difficulty and precision optimization
  • Efficient sampling: Manifold-based parameter space exploration

C2: Grid-Based Hierarchical Evaluation

  • 4 reasoning tasks: Arithmetic, Shuffle, Dates, Movies
  • 2 precision levels: C2-mini (rapid) and C2-full (publication-quality)
  • Fixed coverage: Comprehensive grid-based parameter sweeps
  • Established baselines: Extensive published model comparisons
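
To illustrate the sampling difference between the two suites: a grid sweep enumerates every combination of parameter values up front, so cost grows multiplicatively with each added axis. A minimal sketch (the parameter names here are illustrative, not C2's actual configuration keys):

from itertools import product

# Hypothetical difficulty axes for an arithmetic task; the real C2
# parameters live in configs/c2.json and configs/c2-mini.json.
operand_counts = [2, 4, 8]
digit_lengths = [1, 2, 3]

# Grid-based coverage: every (operands, digits) cell is evaluated.
grid = list(product(operand_counts, digit_lengths))
print(f"{len(grid)} cells: {grid}")

By contrast, M6's manifold-based sampling explores this space without exhaustively covering every cell, which is consistent with the efficiency figures above.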

Migration and Compatibility

Data Compatibility:

  • Both suites use identical statistical methodology (confidence intervals, excess accuracy correction)
  • Results are cross-comparable through ReasonScore normalization
  • Visualization tools support both formats
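
As a point of reference, here is a minimal sketch of one common form of chance-level correction, assuming "excess accuracy" means accuracy above the random-guessing baseline (the exact ReasonScape formula may differ):

def excess_accuracy(accuracy: float, chance: float) -> float:
    # Rescale so random guessing scores 0.0 and perfect accuracy scores 1.0.
    return (accuracy - chance) / (1.0 - chance)

# Example: 70% raw accuracy on a 4-way multiple-choice task
print(excess_accuracy(0.70, 0.25))  # 0.6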

Equivalent Configurations:

  • C2-mini ≈ M6 degree 0 + medium precision (similar coverage, higher efficiency)
  • C2-full ≈ M6 degree 1 + high precision (similar precision, enhanced task portfolio)
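
For example, to approximate C2-mini coverage with the M6 suite, combine the flags shown in the Quick Start above:

# Approximate C2-mini coverage using M6 (degree 0, medium precision)
python runner.py --config configs/m6.json --degree 0 --precision medium --density normal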