Tasks: Parametric Test Generators
ReasonScape includes reasoning tasks across multiple cognitive domains, focusing on six primary dimensions of difficulty:
- Multi-Hop Reasoning - Difficult problems that require a series of sequential logical steps or iteration.
- Working Memory Efficiency - Tasks that measure how efficiently working memory is used along multiple dimensions.
- Tokenization Challenges - Exploit LLM tokenization weaknesses by counting, stacking and transforming text.
- Novel Input Patterns/Repetition - Scenarios with inputs or outputs falling outside typical training distributions.
- Distractions - Extra information that appears relevant but is not.
- Interference - Patterns that are broken by exceptions.
Primary Task Manifolds
ReasonScape includes twelve comprehensive reasoning tasks with parametric difficulty manifolds, each testing distinct cognitive capabilities through systematic complexity control:
Task Capability Matrix
| Task | Primary Focus | Math | Logic | State Tracking | Selective Attention | Symbolic Parsing | Pattern Recognition | Semantic Categorization | Spatial Reasoning | Structural Analysis | Temporal Reasoning | Language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Arithmetic | Mathematical reasoning with nested structures | ✓ | ✓ | ✓ | ||||||||
| Boolean | Logical evaluation with multiple notations | ✓ | ✓ | ✓ | ||||||||
| Brackets | Structural parsing and bracket matching | ✓ | ✓ | ✓ | ||||||||
| Objects | Selective counting with semantic categorization | ✓ | ✓ | ✓ | ✓ | |||||||
| Shuffle | State tracking through sequential transformations | ✓ | ✓ | ✓ | ||||||||
| Sort | Algorithmic ordering with format variations | ✓ | ✓ | ✓ | ||||||||
| Dates | Temporal reasoning and calendar arithmetic | ✓ | ✓ | ✓ | ✓ | |||||||
| Letters | Character analysis and frequency counting | ✓ | ✓ | ✓ | ✓ | |||||||
| Movies | Pattern recognition and thematic similarity | ✓ | ✓ | ✓ | ||||||||
| Sequence | Rule-based sequence generation | ✓ | ✓ | ✓ | ✓ | |||||||
| Shapes | Geometric pattern recognition | ✓ | ✓ | ✓ | ||||||||
| Cars | Multi-step spatial transformations | ✓ | ✓ | ✓ | ✓ |
Testing Conditions Applied
| Task | Length/Scale | Depth/Nesting | Multi-Step | Format Variation | Anchoring | Whitespace Randomization | Case Mutations | Distractions | Domain Variation |
|---|---|---|---|---|---|---|---|---|---|
| Arithmetic | ✓ | ✓ | ✓ | ||||||
| Boolean | ✓ | ✓ | ✓ | ✓ | |||||
| Brackets | ✓ | ✓ | ✓ | ||||||
| Objects | ✓ | ✓ | Details, Cross-Category | Multi-Category | |||||
| Shuffle | ✓ | ✓ | ✓ | Instructions | Multi-Domain | ||||
| Sort | ✓ | ✓ | |||||||
| Dates | ✓ | ✓ | Multi-Domain | ||||||
| Letters | ✓ | ✓ | Cross-Category | ||||||
| Movies | ✓ | ✓ | Multi-Domain, Multi-Category | ||||||
| Sequence | ✓ | ✓ | |||||||
| Shapes | ✓ | Transformations | |||||||
| Cars | ✓ | ✓ | Instructions |
Difficulty Parameter Dimensions
Each task family implements multiple difficulty dimensions that scale cognitive load systematically:
- Length/Scale: Information processing load (3 terms → 50+ terms, 3 entities → 12+ entities)
- Depth: Sequential/hierarchical complexity (1 level → 8+ levels of nesting, 1 swap → 4+ sequential operations)
- Interference: Attention competition (distractors, confounders, exceptions, irrelevant information streams)
- Semantic Complexity: Distribution shift (familiar procedures in unfamiliar formats, novel input patterns)
- Format Variation: Tokenization challenges (whitespace removal, case mutations, anchor formats)
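As a rough illustration of how two of these dimensions could be swept together, the sketch below builds a small grid over length and depth. The specific sweep values and the `difficulty_grid` helper are hypothetical; only the endpoints (3 to 50 terms, 1 to 8 levels) echo the ranges quoted above, and the key names are borrowed from the arithmetic parameters shown later in this document.

```python
from itertools import product

# Hypothetical sweep values; only the endpoints come from the ranges above.
LENGTHS = [3, 10, 25, 50]
DEPTHS = [1, 2, 4, 8]


def difficulty_grid(lengths, depths, count=10):
    """Return one parameter dict per (length, depth) grid point."""
    return [
        {"count": count, "length": length, "max_depth": depth}
        for length, depth in product(lengths, depths)
    ]


grid = difficulty_grid(LENGTHS, DEPTHS)
print(len(grid))  # one entry per (length, depth) pair
print(grid[0])    # the easiest point on the grid
```

Each dictionary in `grid` could then be passed as keyword arguments to a task generator, giving one batch of tests per difficulty point.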
Task API
All ReasonScape task generators implement a standardized Task Manifold API that enables parametric test generation with type safety and schema validation.
Core Interface
Every task manifold class must implement these methods:
from typing import Any, Dict, List


class TaskManifold:
    def __init__(self):
        """Initialize the task generator with any required resources."""
        # RNG is defined elsewhere in ReasonScape; its import is not shown here.
        self.rng = RNG()  # Random number generator

    def generate_random(self, **params) -> List[Dict[str, Any]]:
        """Generate a list of test cases with the specified parameters.

        Args:
            **params: Task-specific generation parameters

        Returns:
            List of test case dictionaries with 'input', 'target', and metadata
        """
        pass

    def get_generation_schema(self) -> type:
        """Return the Pydantic model class defining valid generation parameters."""
        pass

    def get_result_schema(self) -> type:
        """Return the Pydantic model class defining the test case result structure."""
        pass
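To make the three-method contract concrete, here is a minimal sketch of a toy task that satisfies it, together with a structural check. Everything in it is hypothetical: `TaskManifoldLike`, `ToyAdditionTask`, `ToyParams`, and `ToyResult` are illustration names, `random.Random` stands in for the project's `RNG` class, and the two placeholder classes stand in for the Pydantic models a real task would return.

```python
import random
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class TaskManifoldLike(Protocol):
    """Structural type: any object exposing the three required methods."""

    def generate_random(self, **params: Any) -> List[Dict[str, Any]]: ...
    def get_generation_schema(self) -> type: ...
    def get_result_schema(self) -> type: ...


class ToyParams:
    """Placeholder for a Pydantic generation-parameter model."""


class ToyResult:
    """Placeholder for a Pydantic test-case result model."""


class ToyAdditionTask:
    """Hypothetical task: add two single-digit numbers."""

    def __init__(self, seed: int = 0):
        # random.Random is a stand-in for the project's RNG class.
        self.rng = random.Random(seed)

    def generate_random(self, count: int = 1, **params: Any) -> List[Dict[str, Any]]:
        cases = []
        for _ in range(count):
            a, b = self.rng.randint(0, 9), self.rng.randint(0, 9)
            cases.append({"input": f"{a} + {b}", "target": str(a + b)})
        return cases

    def get_generation_schema(self) -> type:
        return ToyParams

    def get_result_schema(self) -> type:
        return ToyResult


task = ToyAdditionTask(seed=42)
print(isinstance(task, TaskManifoldLike))  # checks method presence only
cases = task.generate_random(count=3)
```

Note that a `runtime_checkable` protocol only verifies that the methods exist, not that their signatures or return values are correct, so it complements schema validation and does not replace it.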
Parameter Validation with Pydantic
Task manifolds use Pydantic models for parameter validation and documentation:
from pydantic import BaseModel, Field


class ArithmeticGenerationParams(BaseModel):
    """Schema for generate_random() arguments"""
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=3, description="Number of terms")
    max_depth: int = Field(ge=1, description="Maximum bracket nesting depth")
    min_number: int = Field(default=-9, description="Minimum number value")
    max_number: int = Field(default=9, description="Maximum number value")
    prob_open: float = Field(default=0.4, ge=0, le=1, description="Probability of adding open parenthesis")


class ArithmeticTestCaseResult(BaseModel):
    """Schema for a single generated test case"""
    input: str = Field(description="The formatted problem text")
    target: str = Field(description="The correct answer")
    depth: int = Field(description="Bracket nesting depth reached")
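The following sketch shows how such a schema might reject out-of-range parameters before any tests are generated. The model repeats the constraints listed above (descriptions omitted for brevity); the `validate_params` helper is a hypothetical name introduced for illustration and is not part of ReasonScape.

```python
from pydantic import BaseModel, Field, ValidationError


class ArithmeticGenerationParams(BaseModel):
    count: int = Field(gt=0)
    length: int = Field(ge=3)
    max_depth: int = Field(ge=1)
    min_number: int = Field(default=-9)
    max_number: int = Field(default=9)
    prob_open: float = Field(default=0.4, ge=0, le=1)


def validate_params(raw: dict):
    """Return (params, None) on success or (None, bad_field_names) on failure."""
    try:
        return ArithmeticGenerationParams(**raw), None
    except ValidationError as exc:
        # Each error entry carries the offending field name in its 'loc' tuple.
        bad_fields = sorted({str(err["loc"][0]) for err in exc.errors()})
        return None, bad_fields


ok, _ = validate_params({"count": 5, "length": 4, "max_depth": 2})
_, bad = validate_params({"count": 0, "length": 2, "max_depth": 2})
print(ok.min_number, ok.max_number)  # defaults are filled in
print(bad)                           # fields that violated a constraint
```

Here `count=0` violates `gt=0` and `length=2` violates `ge=3`, so both names are reported, while omitted optional fields fall back to their defaults.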
Standard Test Case Format
All generate_random() methods return lists of dictionaries with these required fields:
- input: The complete problem text presented to the LLM
- target: The correct answer for evaluation
- Additional metadata: Task-specific fields for analysis and filtering
Example test case:
{
    'input': '(3 + 4) * 2 - 1',
    'target': '13',
    'depth': 1
}
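A consumer of these dictionaries might separate the required fields from the task-specific metadata along the following lines. This is a sketch only: `split_test_case` is a hypothetical helper, and it assumes `input` and `target` are always strings, as they are in the arithmetic schema.

```python
from typing import Any, Dict, Tuple

REQUIRED_FIELDS = ("input", "target")


def split_test_case(case: Dict[str, Any]) -> Tuple[str, str, Dict[str, Any]]:
    """Check the required fields and return (input, target, metadata)."""
    for name in REQUIRED_FIELDS:
        # Assumption: both required fields are strings.
        if not isinstance(case.get(name), str):
            raise ValueError(f"test case needs a string {name!r} field")
    metadata = {k: v for k, v in case.items() if k not in REQUIRED_FIELDS}
    return case["input"], case["target"], metadata


prompt, answer, meta = split_test_case(
    {"input": "(3 + 4) * 2 - 1", "target": "13", "depth": 1}
)
print(prompt, "->", answer, meta)
```

Everything that is not `input` or `target` ends up in `meta`, which keeps analysis and filtering code independent of any one task's extra fields.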
Random Number Generation
Task manifolds use a seeded RNG for reproducible test generation:
def __init__(self):
    self.rng = RNG()  # Automatically seeded by runner.py

This ensures identical test cases across runs with the same seed, enabling:
- Reproducible experiments across different models
- Consistent difficulty manifolds for fair comparison
- Deterministic debugging of specific test cases
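The reproducibility property can be demonstrated with Python's standard library. In this sketch `random.Random` stands in for the project's `RNG` class, whose actual interface is not shown in this document, and `sample_terms` is a hypothetical helper.

```python
import random


def sample_terms(seed: int, length: int, lo: int = -9, hi: int = 9):
    """Draw `length` integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # stand-in for the project's RNG class
    return [rng.randint(lo, hi) for _ in range(length)]


first = sample_terms(seed=1234, length=5)
second = sample_terms(seed=1234, length=5)
print(first == second)  # same seed, same sequence
```

Because each call builds its own generator from the seed, the result does not depend on any global random state or on the order in which tasks are run.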
Integration with Runner
The runner.py system automatically:
- Seeds RNG for reproducibility
- Calls generate_random() with experiment configuration
- Removes duplicate tests across batches
- Handles errors gracefully with fallback generation
The standardized API enables the ReasonScape system to treat all reasoning tasks uniformly while preserving task-specific complexity and metadata.
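The runner responsibilities listed above could be approximated by a loop like the one below. It is a sketch under stated assumptions and not the actual runner.py logic: `collect_unique` and `toy_generate` are hypothetical names, duplicates are detected by comparing `input` text, and "fallback generation" is modeled simply as skipping a failed batch and trying again.

```python
import random
from typing import Any, Callable, Dict, List, Set

Generator = Callable[..., List[Dict[str, Any]]]


def collect_unique(generate: Generator, params: Dict[str, Any],
                   wanted: int, max_batches: int = 10) -> List[Dict[str, Any]]:
    """Call `generate` in batches, dropping repeated inputs and skipping failed batches."""
    seen: Set[str] = set()
    unique: List[Dict[str, Any]] = []
    for _ in range(max_batches):
        if len(unique) >= wanted:
            break
        try:
            batch = generate(**params)
        except Exception:
            continue  # assumed fallback: skip the failed batch and retry
        for case in batch:
            if case["input"] not in seen:
                seen.add(case["input"])
                unique.append(case)
    return unique[:wanted]


rng = random.Random(7)  # seeded once, as the runner is described as doing


def toy_generate(count: int) -> List[Dict[str, Any]]:
    """Hypothetical generator with a small input space, so repeats occur."""
    out = []
    for _ in range(count):
        a, b = rng.randint(0, 3), rng.randint(0, 3)
        out.append({"input": f"{a} + {b}", "target": str(a + b)})
    return out


tests = collect_unique(toy_generate, {"count": 8}, wanted=10)
print(len(tests), "unique tests collected")
```

The `max_batches` cap keeps the loop from running forever when a task's input space is too small to supply the requested number of distinct tests.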