C2 Evaluation Suite (Legacy)
C2 is ReasonScape's legacy reasoning evaluation suite. It evaluates four reasoning domains using two complementary configurations, C2-mini and C2-full, that support progressive assessment from quick exploration to high-precision analysis.
Evaluation Architecture
Hierarchical Evaluation Design
C2 implements a two-stage hierarchy optimized for efficient research workflows:
- C2-mini: 92 strategic difficulty points for rapid model exploration (~29M tokens)
- C2-full: 174 comprehensive points for publication-quality precision (~200M tokens)
Research Workflow: Use C2-mini (2-3 hours) for initial model comparison and template/sampler optimization, then scale to C2-full (12+ hours) for detailed analysis of promising configurations.
Grid-Based Parameter Sweeps
C2 uses comprehensive grid coverage across fixed parameter combinations:
Configuration | Target Use | Points | Confidence Intervals | Resource Usage |
---|---|---|---|---|
C2-mini | Model exploration | 92 | ~10% per point | ~100M tokens max |
C2-full | Publication precision | 174 | 3.3-4% per point | ~200M tokens typical |
Task Coverage
C2 evaluates 4 core reasoning domains through exhaustive parameter grids:
Task Breakdown
Task | C2-full Points | C2-mini Points | Grid Dimensions | Coverage Strategy |
---|---|---|---|---|
Arithmetic | 58 | 20 | Length × Depth × Numbers × Whitespace | Multiple number ranges and formatting |
Shuffle | 84 | 40 | People × Depth × Confounders × Anchors | Entity tracking with organizational aids |
Movies | 16 | 16 | Reference Count × Choice Count | Cultural pattern recognition |
Dates | 16 | 16 | Question Tier × Date Format | Temporal reasoning complexity |
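As a quick consistency check, the per-task counts in the table can be summed to confirm the suite totals quoted earlier (174 for C2-full, 92 for C2-mini). The dictionary below is transcribed from the table; the variable and function names are illustrative and are not part of ReasonScape.

```python
# Per-task difficulty point counts, transcribed from the table above.
# Names are illustrative only and are not part of the ReasonScape codebase.
POINTS = {
    "arithmetic": {"full": 58, "mini": 20},
    "shuffle": {"full": 84, "mini": 40},
    "movies": {"full": 16, "mini": 16},
    "dates": {"full": 16, "mini": 16},
}


def total_points(config: str) -> int:
    """Sum the difficulty points across all tasks for 'full' or 'mini'."""
    return sum(task[config] for task in POINTS.values())


if __name__ == "__main__":
    print("C2-full:", total_points("full"))
    print("C2-mini:", total_points("mini"))
```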
Grid Coverage Details
Arithmetic Grids:
- Length: 8-48 terms across multiple configurations
- Depth: 0-8 nested expression levels
- Numbers: -9 to 9 (mini), -99 to 99 (full)
- Whitespace: 0-100% removal variations
Shuffle Grids:
- People: 3-11 entities for tracking complexity
- Depth: 3-24 sequential swaps
- Confounders: 0-8 distractor statements
- Anchors: None, Numeric, Alphabetic organizational markers
Movies Grid:
- Reference Count: 3-16 example movies for pattern establishment
- Choice Count: 3-12 options for selection difficulty
Dates Grid:
- Question Tier: 0-3 complexity levels from "today" to multi-step reasoning
- Date Format: USA, Natural, Ordinal, Offset variations
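The Dates grid is small enough to enumerate in full: four question tiers crossed with four date formats gives the 16 points listed in the task table. Below is a minimal sketch of that cross product, assuming each difficulty point is simply one (tier, format) pair. The lowercase format labels and the function name are placeholders, not the identifiers used in the C2 configuration files.

```python
from itertools import product

# Axis values taken from the Dates grid description above.
# The string labels are placeholders, not the suite's actual config keys.
QUESTION_TIERS = [0, 1, 2, 3]
DATE_FORMATS = ["usa", "natural", "ordinal", "offset"]


def dates_grid() -> list[tuple[int, str]]:
    """Return every (tier, format) combination as one difficulty point."""
    return list(product(QUESTION_TIERS, DATE_FORMATS))


if __name__ == "__main__":
    grid = dates_grid()
    print(len(grid), "points")
    for tier, fmt in grid[:4]:
        print(tier, fmt)
```

The other tasks presumably follow the same exhaustive pattern, but their exact per-axis values are not listed here, so they are left out of the sketch.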
Statistical Methodology
Progressive Confidence Targeting
C2's hierarchical design uses different statistical precision levels:
Configuration | Base Samples | Max Samples | Target CI | Statistical Focus |
---|---|---|---|---|
C2-mini | 32 per point | 320 per point | 10% width | Rapid model ranking |
C2-full | 128 per point | 1,280 per point | 3.3-4% width | Publication precision |
Coverage Analysis Strategy
C2-mini Applications:
- Rapid comparison of new model releases
- Template and sampler configuration exploration
- Resource-efficient model ranking for research prioritization
C2-full Applications:
- Publication-quality statistical power for cognitive analysis
- Definitive model comparisons with high precision
- Complete coverage for comprehensive research datasets
Data Compatibility: All C2-mini results are valid subsets of C2-full, enabling seamless scaling and progressive insight development.
Usage Examples
C2-mini Rapid Exploration
# Quick model assessment (2-3 hours)
python runner.py --config configs/c2-mini.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json
C2-full Publication Analysis
# Comprehensive evaluation (12+ hours)
python runner.py --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-max.json
Progressive Workflow
# Stage 1: Configuration exploration with C2-mini
python runner.py --config configs/c2-mini.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json
# Stage 2: Scale promising configurations to C2-full
python runner.py --config configs/c2.json \
    --template templates/zerocot-nosys.json \
    --sampler samplers/greedy-4k.json
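When several template or sampler combinations are being compared, it can help to generate the two stage commands rather than type them by hand. The sketch below only assembles and prints the command lines, using the three `runner.py` flags shown above. The helper name `runner_command` is hypothetical, and nothing here is part of ReasonScape itself.

```python
import shlex


def runner_command(config: str, template: str, sampler: str) -> list[str]:
    """Build the argument list for one runner.py invocation."""
    return [
        "python", "runner.py",
        "--config", config,
        "--template", template,
        "--sampler", sampler,
    ]


# Both stages share the template and sampler; only the config changes.
TEMPLATE = "templates/zerocot-nosys.json"
SAMPLER = "samplers/greedy-4k.json"
STAGES = ["configs/c2-mini.json", "configs/c2.json"]

if __name__ == "__main__":
    for config in STAGES:
        # Printed rather than executed; to run a stage, pass the list
        # to subprocess.run(cmd, check=True) instead.
        print(shlex.join(runner_command(config, TEMPLATE, SAMPLER)))
```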
Research Applications
When to Use C2
Legacy Dataset Analysis: Complete C2 datasets remain available for comparative research and validation studies.
For all other purposes, migrate to M6 instead.
C2 vs M6 Migration
Equivalent Configurations:
- C2-mini ≈ M6 degree 0 + medium precision (similar coverage, M6 is 1.6× more efficient)
- C2-full ≈ M6 degree 1 + high precision (similar precision, M6 is 4.3× more efficient)
Task Evolution:
- Arithmetic, Shuffle, Dates: Enhanced in M6 with manifold-based sampling
- Movies: Redesigned in M6 to reduce cultural knowledge dependency
- Objects, Boolean: New reasoning domains added in M6
Cross-Suite Compatibility
Statistical Methodology: Both suites use identical confidence interval calculation and excess accuracy correction, enabling direct comparison.
Visualization Tools: C2 results integrate seamlessly with ReasonScape's leaderboard and explorer tools.
Data Format: Results use compatible JSON schema for cross-suite analysis and comparison.
Technical Implementation
Grid-Based Sampling
C2 uses exhaustive parameter combinations rather than manifold sampling:
- Complete coverage: All parameter intersections evaluated for comprehensive assessment
- Fixed resource allocation: Predetermined sample distribution across difficulty space
- Hierarchical precision: Two-tier system optimizes for different research needs
Statistical Precision
High-Truncation Handling: C2-full uses a 5% CI target, rather than its usual 3.3-4%, for difficulty points with a truncation rate above 6.6%, due to model performance ceilings.
Task-Level Aggregation:
- C2-mini: ~2.5% overall task confidence intervals
- C2-full: ~0.5-1.0% overall task confidence intervals
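One way to see why task-level intervals are so much tighter than per-point ones: if a task score is the mean of k independent points measured to similar precision, the interval width shrinks by roughly the square root of k. The sketch below is a back-of-the-envelope calculation under that assumption, not the aggregation ReasonScape actually performs, and the function name is hypothetical.

```python
import math


def pooled_ci_width(per_point_width: float, n_points: int) -> float:
    """Approximate CI width for the mean of n_points independent,
    equally precise points (standard 1/sqrt(n) scaling)."""
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    return per_point_width / math.sqrt(n_points)


if __name__ == "__main__":
    # A 16-point task at the C2-mini per-point target of 10%.
    print(f"{pooled_ci_width(0.10, 16):.2%}")
    # The same task at the C2-full per-point targets of 3.3% and 4%.
    print(f"{pooled_ci_width(0.033, 16):.2%}")
    print(f"{pooled_ci_width(0.04, 16):.2%}")
```

For a 16-point task such as Movies or Dates, 10% per point works out to about 2.5%, which lines up with the C2-mini figure above. Tasks with more points would come out narrower under the same assumption.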
Sample Size Adaptation: Automatic expansion from base samples to maximum samples based on confidence targeting.
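The expansion from base to maximum samples can be pictured as a loop that keeps sampling until the interval is narrow enough or the budget runs out. The sketch below makes several assumptions that the text does not confirm: it uses a Wilson score interval as a stand-in for C2's CI calculation, treats the target as the full interval width, and grows the sample in batches equal to the base size. It does not model truncation handling or excess accuracy correction, and all names are hypothetical.

```python
import math
import random


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% when z=1.96)."""
    if n == 0:
        return 0.0, 1.0
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def adaptive_sample(draw, base: int, maximum: int, target_width: float):
    """Draw `base` samples, then keep adding batches of up to `base` until
    the interval width is at most `target_width` or `maximum` is reached.

    `draw` is any zero-argument callable returning True for a correct answer.
    Returns (successes, samples_used, (low, high)).
    """
    successes = n = 0
    low, high = 0.0, 1.0
    while n < maximum:
        batch = min(base, maximum - n)
        successes += sum(1 for _ in range(batch) if draw())
        n += batch
        low, high = wilson_interval(successes, n)
        if high - low <= target_width:
            break
    return successes, n, (low, high)


if __name__ == "__main__":
    rng = random.Random(0)
    # Simulated model that answers correctly about 90% of the time,
    # sampled with the C2-mini settings (32 base, 320 max, 10% width).
    print(adaptive_sample(lambda: rng.random() < 0.9,
                          base=32, maximum=320, target_width=0.10))
```

Points near 0% or 100% accuracy stop early, while points near 50% accuracy tend to use the full budget, since binomial variance peaks there.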
Viewing C2 Results
C2 results integrate with ReasonScape's visualization system:
# Generate C2 dataset from evaluation results
python evaluate.py --interview 'results/*c2*/*.ndjson' --output data/c2-results.json
# View hierarchical results with precision-aware rendering
python leaderboard.py data/c2-results.json
python explorer.py data/c2-results.json
The visualization system automatically adjusts statistical displays based on confidence interval width, clearly distinguishing between C2-mini exploration and C2-full precision results.
Legacy Status: C2 remains available for replication studies and historical comparison but is not actively developed. New research should prioritize M6 for improved efficiency, expanded task coverage, and enhanced methodology.