C2 Evaluation Suite (Legacy)

C2 is ReasonScape's legacy reasoning evaluation suite. It is organized as a two-tier hierarchy (C2-mini and C2-full) that covers 4 reasoning domains at increasing statistical precision.

Evaluation Architecture

Hierarchical Evaluation Design

C2 implements a two-stage hierarchy optimized for efficient research workflows:

  • C2-mini: 92 strategic difficulty points for rapid model exploration (~29M tokens)
  • C2-full: 174 comprehensive points for publication-quality precision (~200M tokens)

Research Workflow: Use C2-mini (2-3 hours) for initial model comparison and template/sampler optimization, then scale to C2-full (12+ hours) for detailed analysis of promising configurations.

Grid-Based Parameter Sweeps

C2 uses comprehensive grid coverage across fixed parameter combinations:

| Configuration | Target Use | Points | Confidence Intervals | Resource Usage |
|---|---|---|---|---|
| C2-mini | Model exploration | 92 | ~10% per point | ~100M tokens max |
| C2-full | Publication precision | 174 | 3.3-4% per point | ~200M tokens typical |

Task Coverage

C2 evaluates 4 core reasoning domains through exhaustive parameter grids:

Task Breakdown

| Task | C2-full Points | C2-mini Points | Grid Dimensions | Coverage Strategy |
|---|---|---|---|---|
| Arithmetic | 58 | 20 | Length × Depth × Numbers × Whitespace | Multiple number ranges and formatting |
| Shuffle | 84 | 40 | People × Depth × Confounders × Anchors | Entity tracking with organizational aids |
| Movies | 16 | 16 | Reference Count × Choice Count | Cultural pattern recognition |
| Dates | 16 | 16 | Question Tier × Date Format | Temporal reasoning complexity |
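
As a quick consistency check, the per-task counts in the table can be summed to confirm the suite totals quoted earlier (174 points for C2-full, 92 for C2-mini). This is only a small sketch: the counts are copied from the table, and the names `TASK_POINTS` and `suite_totals` are illustrative, not part of ReasonScape.

```python
# Per-task point counts copied from the Task Breakdown table: (C2-full, C2-mini).
TASK_POINTS = {
    "Arithmetic": (58, 20),
    "Shuffle": (84, 40),
    "Movies": (16, 16),
    "Dates": (16, 16),
}


def suite_totals(task_points):
    """Return (full_total, mini_total) summed over all tasks."""
    full_total = sum(full for full, _ in task_points.values())
    mini_total = sum(mini for _, mini in task_points.values())
    return full_total, mini_total


full_total, mini_total = suite_totals(TASK_POINTS)
print(f"C2-full: {full_total} points, C2-mini: {mini_total} points")
# prints: C2-full: 174 points, C2-mini: 92 points
```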

Grid Coverage Details

Arithmetic Grids:

  • Length: 8-48 terms across multiple configurations
  • Depth: 0-8 nested expression levels
  • Numbers: -9 to 9 (mini), -99 to 99 (full)
  • Whitespace: 0-100% removal variations

Shuffle Grids:

  • People: 3-11 entities for tracking complexity
  • Depth: 3-24 sequential swaps
  • Confounders: 0-8 distractor statements
  • Anchors: None, Numeric, Alphabetic organizational markers

Movies Grid:

  • Reference Count: 3-16 example movies for pattern establishment
  • Choice Count: 3-12 options for selection difficulty

Dates Grid:

  • Question Tier: 0-3 complexity levels from "today" to multi-step reasoning
  • Date Format: USA, Natural, Ordinal, Offset variations

Statistical Methodology

Progressive Confidence Targeting

C2's hierarchical design uses different statistical precision levels:

| Configuration | Base Samples | Max Samples | Target CI | Statistical Focus |
|---|---|---|---|---|
| C2-mini | 32 per point | 320 per point | 10% width | Rapid model ranking |
| C2-full | 128 per point | 1,280 per point | 3.3-4% width | Publication precision |

Coverage Analysis Strategy

C2-mini Applications:

  • Rapid comparison of new model releases
  • Template and sampler configuration exploration
  • Resource-efficient model ranking for research prioritization

C2-full Applications:

  • Publication-quality statistical power for cognitive analysis
  • Definitive model comparisons with high precision
  • Complete coverage for comprehensive research datasets

Data Compatibility: All C2-mini results are valid subsets of C2-full, enabling seamless scaling and progressive insight development.

Usage Examples

C2-mini Rapid Exploration

# Quick model assessment (2-3 hours)
python runner.py --config configs/c2-mini.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

C2-full Publication Analysis

# Comprehensive evaluation (12+ hours)
python runner.py --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-max.json

Progressive Workflow

# Stage 1: Configuration exploration with C2-mini
python runner.py --config configs/c2-mini.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

# Stage 2: Scale promising configurations to C2-full  
python runner.py --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json
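
The two stages above could also be driven from a small Python script. The sketch below only assembles the command lines, using the `--config`, `--template`, and `--sampler` flags shown in the examples; the helper name `runner_command` and the `STAGES` list are hypothetical and not part of ReasonScape.

```python
import shlex


def runner_command(config, template, sampler):
    """Build the argv list for one runner.py invocation."""
    return [
        "python", "runner.py",
        "--config", config,
        "--template", template,
        "--sampler", sampler,
    ]


# Stage 1 explores with C2-mini; stage 2 reruns the same template and sampler on C2-full.
STAGES = [
    runner_command("configs/c2-mini.json", "templates/zerocot-nosys.json", "samplers/greedy-4k.json"),
    runner_command("configs/c2.json", "templates/zerocot-nosys.json", "samplers/greedy-4k.json"),
]

for cmd in STAGES:
    print(shlex.join(cmd))
```

To execute the stages rather than print them, each list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains `runner.py`.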

Research Applications

When to Use C2

Legacy Dataset Analysis: Complete C2 datasets available for comparative research and validation studies.

For all other purposes, use M6 instead.

C2 vs M6 Migration

Equivalent Configurations:

  • C2-mini corresponds to M6 degree 0 + medium precision (similar coverage, M6 is 1.6× more efficient)
  • C2-full corresponds to M6 degree 1 + high precision (similar precision, M6 is 4.3× more efficient)

Task Evolution:

  • Arithmetic, Shuffle, Dates: Enhanced in M6 with manifold-based sampling
  • Movies: Redesigned in M6 to reduce cultural knowledge dependency
  • Objects, Boolean: New reasoning domains added in M6

Cross-Suite Compatibility

Statistical Methodology: Both suites use identical confidence interval calculation and excess accuracy correction, enabling direct comparison.

Visualization Tools: C2 results integrate seamlessly with ReasonScape's leaderboard and explorer tools.

Data Format: Results use compatible JSON schema for cross-suite analysis and comparison.

Technical Implementation

Grid-Based Sampling

C2 uses exhaustive parameter combinations rather than manifold sampling:

  • Complete coverage: All parameter intersections evaluated for comprehensive assessment
  • Fixed resource allocation: Predetermined sample distribution across difficulty space
  • Hierarchical precision: Two-tier system optimizes for different research needs
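
Exhaustive grid coverage amounts to taking the Cartesian product of the parameter axes. The sketch below illustrates the idea for a Shuffle-style grid. The axis values are assumptions chosen from within the ranges listed under Grid Coverage Details; they are not the actual C2 grid definition, and `enumerate_grid` is a hypothetical helper.

```python
from itertools import product

# Illustrative axis values only: they fall inside the ranges listed under
# "Shuffle Grids" but are NOT the actual C2 grid definition.
SHUFFLE_AXES = {
    "people": [3, 5, 7],
    "depth": [3, 6],
    "confounders": [0, 4],
    "anchors": ["none", "numeric", "alphabetic"],
}


def enumerate_grid(axes):
    """Return one dict per difficulty point: the Cartesian product of all axes."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


points = enumerate_grid(SHUFFLE_AXES)
print(len(points))  # 3 * 2 * 2 * 3 = 36
print(points[0])    # {'people': 3, 'depth': 3, 'confounders': 0, 'anchors': 'none'}
```

Because every intersection is evaluated, the number of points grows multiplicatively with each added axis value, which is why the sample budget per point is fixed in advance.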

Statistical Precision

High-Truncation Handling: C2-full relaxes the CI target to 5% (from the usual 3.3-4%) for difficulty points where the truncation rate exceeds 6.6% due to model performance ceilings.

Task-Level Aggregation:

  • C2-mini: ~2.5% overall task confidence intervals
  • C2-full: ~0.5-1.0% overall task confidence intervals
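
One way to read these task-level figures, assuming the difficulty points are independent and equally precise, is that pooling n points narrows the interval by roughly a factor of √n. This is a back-of-the-envelope approximation, not necessarily how ReasonScape computes the aggregate; `pooled_ci_width` is a hypothetical helper.

```python
import math


def pooled_ci_width(point_width, n_points):
    """Approximate task-level CI width from n_points independent points
    that each have the same per-point CI width (same units in and out)."""
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    return point_width / math.sqrt(n_points)


# C2-mini: 10% per point over a 16-point task (Movies or Dates).
print(pooled_ci_width(10.0, 16))            # 2.5
# C2-full: 4% per point over 16 points, and 3.3% per point over 58 points.
print(round(pooled_ci_width(4.0, 16), 2))   # 1.0
print(round(pooled_ci_width(3.3, 58), 2))   # 0.43
```

Under this approximation the 16-point tasks land at about 2.5% for C2-mini and about 1% for C2-full, while the larger Arithmetic grid falls near the lower end of the C2-full range.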

Sample Size Adaptation: Automatic expansion from base samples to maximum samples based on confidence targeting.
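
The expansion from base to maximum samples can be sketched as a simple loop: draw a batch, recompute the confidence interval, and stop once it is narrow enough or the budget is spent. The sketch below makes several assumptions that the source does not confirm: it uses a Wilson score interval, treats the target as the full interval width, and grows in batches equal to the base sample count. The function names are hypothetical.

```python
import math
import random


def wilson_width(successes, n, z=1.96):
    """Full width of the Wilson score interval for a binomial proportion."""
    if n == 0:
        return 1.0
    p = successes / n
    denom = 1 + z * z / n
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return 2 * half


def adaptive_sample(draw, base, maximum, target_width):
    """Draw samples in batches of `base` until the CI width reaches
    `target_width` or `maximum` samples are used. Returns (successes, n)."""
    successes = 0
    n = 0
    while n < maximum:
        batch = min(base, maximum - n)
        successes += sum(1 for _ in range(batch) if draw())
        n += batch
        if wilson_width(successes, n) <= target_width:
            break
    return successes, n


# Example with C2-mini-style settings and a simulated model that is right 90% of the time.
rng = random.Random(0)
successes, n = adaptive_sample(lambda: rng.random() < 0.9, base=32, maximum=320, target_width=0.10)
print(f"{successes}/{n} correct, CI width {wilson_width(successes, n):.3f}")
```

Points where accuracy sits near 0% or 100% reach the target quickly, while points near 50% accuracy tend to consume the full sample budget.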

Viewing C2 Results

C2 results integrate with ReasonScape's visualization system:

# Generate C2 dataset from evaluation results
python evaluate.py --interview 'results/*c2*/*.ndjson' --output data/c2-results.json

# View hierarchical results with precision-aware rendering
python leaderboard.py data/c2-results.json
python explorer.py data/c2-results.json

The visualization system automatically adjusts statistical displays based on confidence interval width, clearly distinguishing between C2-mini exploration and C2-full precision results.


Legacy Status: C2 remains available for replication studies and historical comparison but is not actively developed. New research should prioritize M6 for improved efficiency, expanded task coverage, and enhanced methodology.