C2 Evaluation Suite (Legacy)

C2 is ReasonScape's legacy reasoning evaluation suite. It is a hierarchical evaluation system with two complementary configurations, C2-mini and C2-full, for progressive assessment across four reasoning domains.

Evaluation Architecture

Hierarchical Evaluation Design

C2 implements a two-stage hierarchy optimized for efficient research workflows:

C2-mini: 92 strategic difficulty points for rapid model exploration (~29M tokens)
C2-full: 174 comprehensive points for publication-quality precision (~200M tokens)

Research Workflow: Use C2-mini (2-3 hours) for initial model comparison and template/sampler optimization, then scale to C2-full (12+ hours) for detailed analysis of promising configurations.

Grid-Based Parameter Sweeps

C2 uses comprehensive grid coverage across fixed parameter combinations:

Configuration	Target Use	Points	Confidence Intervals	Resource Usage
C2-mini	Model exploration	92	~10% per point	~100M tokens max
C2-full	Publication precision	174	3.3-4% per point	~200M tokens typical

Task Coverage

C2 evaluates 4 core reasoning domains through exhaustive parameter grids:

Task Breakdown

Task	C2-full Points	C2-mini Points	Grid Dimensions	Coverage Strategy
Arithmetic	58	20	Length × Depth × Numbers × Whitespace	Multiple number ranges and formatting
Shuffle	84	40	People × Depth × Confounders × Anchors	Entity tracking with organizational aids
Movies	16	16	Reference Count × Choice Count	Cultural pattern recognition
Dates	16	16	Question Tier × Date Format	Temporal reasoning complexity
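As a quick consistency check, the per-task counts in the table can be summed and compared with the 174 and 92 totals quoted for the two configurations. The dictionary layout and the `total_points` helper are illustrative only and are not part of ReasonScape.

```python
# Per-task difficulty-point counts, copied from the task breakdown table.
POINTS = {
    "Arithmetic": {"full": 58, "mini": 20},
    "Shuffle": {"full": 84, "mini": 40},
    "Movies": {"full": 16, "mini": 16},
    "Dates": {"full": 16, "mini": 16},
}


def total_points(config: str) -> int:
    """Sum the difficulty points of all four tasks for one configuration."""
    return sum(task[config] for task in POINTS.values())


print(total_points("full"))  # 174
print(total_points("mini"))  # 92
```

Movies and Dates contribute the same 16 points to both configurations, so the reduction from 174 to 92 comes entirely from the Arithmetic and Shuffle grids.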

Grid Coverage Details

Arithmetic Grids:

Length: 8-48 terms across multiple configurations
Depth: 0-8 nested expression levels
Numbers: -9 to 9 (mini), -99 to 99 (full)
Whitespace: 0-100% removal variations

Shuffle Grids:

People: 3-11 entities for tracking complexity
Depth: 3-24 sequential swaps
Confounders: 0-8 distractor statements
Anchors: None, Numeric, Alphabetic organizational markers

Movies Grid:

Reference Count: 3-16 example movies for pattern establishment
Choice Count: 3-12 options for selection difficulty

Dates Grid:

Question Tier: 0-3 complexity levels from "today" to multi-step reasoning
Date Format: USA, Natural, Ordinal, Offset variations

Statistical Methodology

Progressive Confidence Targeting

C2's hierarchical design uses different statistical precision levels:

Configuration	Base Samples	Max Samples	Target CI	Statistical Focus
C2-mini	32 per point	320 per point	10% width	Rapid model ranking
C2-full	128 per point	1,280 per point	3.3-4% width	Publication precision
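To get a feel for how sample count drives interval width, the sketch below evaluates a textbook normal-approximation interval at the base and maximum sample counts from the table. This is an assumption for illustration: the source does not state which interval method C2 uses, at what confidence level, or whether "width" means the full interval or the half-width. The `ci_width` helper is hypothetical.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile (assumed confidence level)


def ci_width(n: int, accuracy: float = 0.5) -> float:
    """Full width of a normal-approximation CI for a proportion.

    accuracy=0.5 is the worst case: it maximises the variance p * (1 - p).
    """
    return 2 * Z_95 * math.sqrt(accuracy * (1 - accuracy) / n)


for label, n in [
    ("C2-mini base", 32),
    ("C2-mini max", 320),
    ("C2-full base", 128),
    ("C2-full max", 1280),
]:
    print(f"{label:13s} n={n:5d}  worst-case width={ci_width(n):.1%}")
```

Under this approximation the width scales with 1/√n, so the tenfold step from base to maximum samples narrows the interval by a factor of about 3.2. Points whose accuracy is far from 50% reach a given width with fewer samples.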

Coverage Analysis Strategy

C2-mini Applications:

Rapid comparison of new model releases
Template and sampler configuration exploration
Resource-efficient model ranking for research prioritization

C2-full Applications:

Publication-quality statistical power for cognitive analysis
Definitive model comparisons with high precision
Complete coverage for comprehensive research datasets

Data Compatibility: All C2-mini results are valid subsets of C2-full, enabling seamless scaling and progressive insight development.

Usage Examples

C2-mini Rapid Exploration

# Quick model assessment (2-3 hours)
python runner.py --config configs/c2-mini.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

C2-full Publication Analysis

# Comprehensive evaluation (12+ hours)
python runner.py --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-max.json

Progressive Workflow

# Stage 1: Configuration exploration with C2-mini
python runner.py --config configs/c2-mini.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json

# Stage 2: Scale promising configurations to C2-full  
python runner.py --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json
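If the two stages are to be scripted, a small Python driver can assemble the same command lines. The sketch below uses only the `runner.py` flags shown above; the `runner_command` helper is hypothetical, and the actual launch is left commented out so the snippet only prints what it would run.

```python
import shlex


def runner_command(config: str, template: str, sampler: str) -> list[str]:
    """Build the argv list for one runner.py invocation (hypothetical helper)."""
    return [
        "python", "runner.py",
        "--config", config,
        "--template", template,
        "--sampler", sampler,
    ]


TEMPLATE = "templates/zerocot-nosys.json"
SAMPLER = "samplers/greedy-4k.json"

# Stage 1 explores with C2-mini; stage 2 reuses the same template and sampler on C2-full.
stages = [
    runner_command("configs/c2-mini.json", TEMPLATE, SAMPLER),
    runner_command("configs/c2.json", TEMPLATE, SAMPLER),
]

for argv in stages:
    print(shlex.join(argv))
    # To launch for real: import subprocess; subprocess.run(argv, check=True)
```

Keeping the template and sampler in shared constants makes it harder to change one stage without the other, so stage 2 measures the same configuration that stage 1 selected.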

Research Applications

When to Use C2

Legacy Dataset Analysis: Complete C2 datasets are available for comparative research and validation studies.

For all other purposes, M6 is recommended instead.

C2 vs M6 Migration

Equivalent Configurations:

C2-mini ≈ M6 degree 0 + medium precision (similar coverage, M6 is 1.6× more efficient)
C2-full ≈ M6 degree 1 + high precision (similar precision, M6 is 4.3× more efficient)

Task Evolution:

Arithmetic, Shuffle, Dates: Enhanced in M6 with manifold-based sampling
Movies: Redesigned in M6 to reduce cultural knowledge dependency
Objects, Boolean: New reasoning domains added in M6

Cross-Suite Compatibility

Statistical Methodology: Both suites use identical confidence interval calculation and excess accuracy correction, enabling direct comparison.

Visualization Tools: C2 results integrate seamlessly with ReasonScape's leaderboard and explorer tools.

Data Format: Results use compatible JSON schema for cross-suite analysis and comparison.

Technical Implementation

Grid-Based Sampling

C2 uses exhaustive parameter combinations rather than manifold sampling:

Complete coverage: All parameter intersections evaluated for comprehensive assessment
Fixed resource allocation: Predetermined sample distribution across difficulty space
Hierarchical precision: Two-tier system optimizes for different research needs
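Exhaustive coverage of this kind can be sketched with a Cartesian product over the axis values. The example uses the Dates grid, whose four question tiers and four date formats are listed earlier in this document. The `build_grid` helper and the `tier` / `date_format` key names are hypothetical and do not reflect the actual C2 config schema.

```python
from itertools import product

# Axis values taken from the Dates grid description.
QUESTION_TIERS = [0, 1, 2, 3]
DATE_FORMATS = ["USA", "Natural", "Ordinal", "Offset"]


def build_grid(**axes):
    """Return one dict per parameter intersection (full Cartesian product)."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


dates_grid = build_grid(tier=QUESTION_TIERS, date_format=DATE_FORMATS)

print(len(dates_grid))  # 16
print(dates_grid[0])    # {'tier': 0, 'date_format': 'USA'}
```

The 4 × 4 product yields 16 points, which matches the Dates row of the task breakdown. Grid size grows multiplicatively with every added axis, which is the cost of exhaustive coverage.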

Statistical Precision

High-Truncation Handling: For difficulty points where the truncation rate exceeds 6.6% due to model performance ceilings, C2-full relaxes the CI target to 5%.

Task-Level Aggregation:

C2-mini: ~2.5% overall task confidence intervals
C2-full: ~0.5-1.0% overall task confidence intervals
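One way to see how per-point intervals of about 10% could shrink to about 2.5% at task level: if the task score were an unweighted mean of independent difficulty points with equal per-point intervals, the interval would narrow by the square root of the number of points. That averaging model is an assumption made here for illustration; the source does not describe the actual aggregation, and `aggregate_ci` is a hypothetical helper.

```python
import math


def aggregate_ci(per_point_ci: float, n_points: int) -> float:
    """CI of an unweighted mean of n_points independent points sharing one per-point CI."""
    return per_point_ci / math.sqrt(n_points)


# A 16-point task (Movies or Dates) at the C2-mini per-point target.
print(f"{aggregate_ci(0.10, 16):.2%}")  # 2.50%

# The same 16-point task at the upper C2-full per-point target.
print(f"{aggregate_ci(0.04, 16):.2%}")  # 1.00%
```

Under the same assumption, the larger Arithmetic and Shuffle grids would give tighter task-level intervals than the 16-point tasks.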

Sample Size Adaptation: Automatic expansion from base samples to maximum samples based on confidence targeting.
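The expansion from base to maximum samples can be pictured as a stop-when-precise loop. The sketch below is a minimal illustration, not ReasonScape's implementation: the Wilson score interval, the 95% level, the batch size equal to the base sample count, and the names `wilson_width` and `adaptive_sample` are all assumptions.

```python
import math
import random


def wilson_width(correct: int, n: int, z: float = 1.96) -> float:
    """Full width of the Wilson score interval for correct/n successes."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return 2 * half


def adaptive_sample(run_one, base: int, maximum: int, target_width: float):
    """Sample in batches until the CI is narrow enough or the cap is reached.

    run_one() evaluates one test case and returns True when it is answered correctly.
    Returns (correct, n).
    """
    correct = n = 0
    while n < maximum:
        batch = min(base, maximum - n)  # never exceed the sample cap
        correct += sum(run_one() for _ in range(batch))
        n += batch
        if wilson_width(correct, n) <= target_width:
            break
    return correct, n


# Simulated model that answers correctly about 90% of the time (seeded for repeatability).
rng = random.Random(0)
correct, n = adaptive_sample(
    lambda: rng.random() < 0.9, base=32, maximum=320, target_width=0.10
)
print(f"stopped after {n} samples, accuracy {correct / n:.1%}, "
      f"CI width {wilson_width(correct, n):.1%}")
```

With this scheme, points near 0% or 100% accuracy stop early, while points near 50% accuracy tend to run to the cap.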

Viewing C2 Results

C2 results integrate with ReasonScape's visualization system:

# Generate C2 dataset from evaluation results
python evaluate.py --interview 'results/*c2*/*.ndjson' --output data/c2-results.json

# View hierarchical results with precision-aware rendering
python leaderboard.py data/c2-results.json
python explorer.py data/c2-results.json

The visualization system automatically adjusts statistical displays based on confidence interval width, clearly distinguishing between C2-mini exploration and C2-full precision results.

Legacy Status: C2 remains available for replication studies and historical comparison but is not actively developed. New research should prioritize M6 for improved efficiency, expanded task coverage, and enhanced methodology.