Tasks: Parametric Test Generators¶
ReasonScape includes reasoning tasks across multiple cognitive domains, focusing on six primary dimensions of difficulty:
- Multi-Hop Reasoning - Difficult problems requiring a series of sequential logical steps or iterations.
- Working Memory Efficiency - Problems that measure how efficiently working memory is used along multiple dimensions.
- Tokenization Challenges - Exploit LLM tokenization weaknesses by counting, stacking and transforming text.
- Novel Input Patterns/Repetition - Scenarios with inputs or outputs falling outside typical training distributions.
- Distractions - Extra information that appears relevant but is not.
- Interference - Patterns that are broken by exceptions.
Primary Task Manifolds¶
ReasonScape includes six core reasoning tasks with comprehensive difficulty manifolds, each testing distinct cognitive capabilities through parametric complexity control:
Task | Description | Difficulty Dimensions | Key Challenges |
---|---|---|---|
Arithmetic | Parse and evaluate mathematical expressions with parentheses, operator precedence, and nested structures | Length (3-50 terms): Expression complexity by term count; Max Depth (0-8 levels): Parenthetical nesting depth; Number Range (-9 to 9): Integer bounds for calculation difficulty; Whitespace (0-100%): Tokenization challenge through space removal | Mathematical reasoning, symbolic parsing, precedence rules (PEMDAS/BODMAS), structural analysis, working memory for nested calculations, robust parsing despite formatting variations |
Boolean | Evaluate complex Boolean expressions with nested logic and operator precedence | Length (3+ terms): Number of boolean values in expression; Max Depth (0-8 levels): Parenthetical nesting depth; Negation Probability: Frequency of NOT operators and chaining; Format Variation (5 types): TRUE_FALSE, T_F, ON_OFF, BINARY, YES_NO; Whitespace (0-100%): Tokenization challenge through space removal | Logical reasoning, boolean operator precedence (not > and > or), symbolic parsing across multiple notations, negation handling, XOR operations, working memory for nested evaluations |
Objects | Count specific object categories while filtering distractors and parsing natural language quantities | Length (1+ items): Number of target items to include; Max Count (0+ per item): Maximum quantity per individual item; Distractor Count (0+): Number of irrelevant items from other categories; Target Groups (1+ categories): Number of semantic categories to count; Adjective Probability (0-100%): Extraneous descriptive information stream; Anchor Format (6 types): Organizational markers (numeric, Roman, hex, etc.) | Selective attention, semantic categorization across 11 domains, quantity extraction from natural language, arithmetic aggregation, distractor resistance, information filtering, working memory for running totals |
Shuffle | Maintain state through sequential transformations while filtering confounding information | Length (3-12 people): Number of entities to track; Max Depth (1+ swaps): Sequential transformation complexity; Confounding Count (0+): Irrelevant interpersonal statements; Anchor Format (9 types): Organizational markers (numeric, Roman, hex, etc.); Domain Variation (5 contexts): Dancing, books, soccer, gifts, balls | Sequential state tracking, working memory efficiency, confounding resistance, multi-domain generalization, attention control, format-independent comprehension |
Dates | Comprehend temporal information and perform calendar arithmetic within narrative contexts | Question Tier (0-3): Basic identification → Simple arithmetic → Complex calendar logic → Multi-step calculations; Date Format (4 types): MM/DD/YYYY, natural language, ordinal day-of-year, relative offset; Scenario Diversity (36 types): Realistic contexts from anniversaries to scheduling | Temporal context extraction, calendar arithmetic across month/year boundaries, leap year logic, format recognition, multi-step inference, contradiction resolution, pattern recognition for recurring events |
Movies | Identify thematic and stylistic similarities between films through pattern recognition | Reference Count (4-6 movies): Size of user preference set; Choice Count (3-5 options): Multiple choice complexity; Genre/Theme Weights: Configurable selection probability for clustering difficulty; Template Variation: Diverse question phrasings to test robust understanding | Cultural knowledge of movie genres and themes, similarity analysis across multiple dimensions, pattern recognition for thematic connections, preference modeling from limited examples, analogical reasoning between different content types |
Planned Future Tasks¶
Additional reasoning domains under development:
Task | Description | Focus Areas |
---|---|---|
Bracket Stack | Complete bracket sequences using stack-based reasoning | Working memory, sequential processing, tokenization |
Word Sorting | Sort word lists alphabetically with case handling | String processing, sorting algorithms, repetition control |
SVG Shapes | Identify geometric shapes from SVG path coordinates | Spatial reasoning, coordinate systems, pattern recognition |
Pattern Interference | Generate sequences following base patterns plus interference rules | Cognitive flexibility, rule application, inhibitory control |
Letter Counting | Count target letters across word lists with confounding words | Character-level processing, selective attention, tokenization |
But-for Causation | Determine necessary conditions in causal scenarios | Causal reasoning, counterfactual thinking, logical inference |
Difficulty Parameter Dimensions¶
Each task family implements multiple difficulty dimensions:
- Length/Scale: Information load (3 words → 24 words, 8 objects → 24 objects)
- Depth: Sequential complexity (2 levels → 32 levels of nesting)
- Interference: Attention competition (distractors, confounders, exceptions)
- Semantic Complexity: Distribution shift (familiar procedures in unfamiliar formats)
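As a rough illustration of how these dimensions combine into a difficulty manifold, the sketch below enumerates a small grid over two of them. The parameter names `count`, `length`, and `max_depth` come from the Arithmetic generation schema; the grid values and the `build_grid` helper are hypothetical and not part of ReasonScape.

```python
from itertools import product

# Hypothetical grid values; real experiment configurations may differ.
LENGTHS = [3, 8, 16]
MAX_DEPTHS = [1, 2, 4]


def build_grid(lengths, max_depths, count=10):
    """Return one keyword dict per (length, max_depth) grid point."""
    return [
        {"count": count, "length": length, "max_depth": depth}
        for length, depth in product(lengths, max_depths)
    ]


grid = build_grid(LENGTHS, MAX_DEPTHS)
print(len(grid))  # 9 grid points (3 lengths x 3 depths)
print(grid[0])    # {'count': 10, 'length': 3, 'max_depth': 1}
```

Each dictionary in `grid` could then be passed as keyword arguments to a task's `generate_random()`, giving one batch of test cases per point on the manifold.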
Task API¶
All ReasonScape task generators implement a standardized Task Manifold API that enables parametric test generation with type safety and schema validation.
Core Interface¶
Every task manifold class must implement these methods:

    from typing import Any, Dict, List

    class TaskManifold:
        def __init__(self):
            """Initialize the task generator with any required resources"""
            self.rng = RNG()  # Random number generator (ReasonScape's seeded RNG class)

        def generate_random(self, **params) -> List[Dict[str, Any]]:
            """Generate a list of test cases with specified parameters

            Args:
                **params: Task-specific generation parameters

            Returns:
                List of test case dictionaries with 'input', 'target', and metadata
            """
            pass

        def get_generation_schema(self) -> type:
            """Return Pydantic model class defining valid generation parameters"""
            pass

        def get_result_schema(self) -> type:
            """Return Pydantic model class defining test case result structure"""
            pass

Parameter Validation with Pydantic¶
Task manifolds use Pydantic models for parameter validation and documentation:

    from pydantic import BaseModel, Field

    class ArithmeticGenerationParams(BaseModel):
        """Schema for generate_random() arguments"""
        count: int = Field(gt=0, description="Number of test cases to generate")
        length: int = Field(ge=3, description="Number of terms")
        max_depth: int = Field(ge=1, description="Maximum bracket nesting depth")
        min_number: int = Field(default=-9, description="Minimum number value")
        max_number: int = Field(default=9, description="Maximum number value")
        prob_open: float = Field(default=0.4, ge=0, le=1, description="Probability of adding open parenthesis")

    class ArithmeticTestCaseResult(BaseModel):
        """Schema for generate_test_case() return value"""
        input: str = Field(description="The formatted problem text")
        target: str = Field(description="The correct answer")
        depth: int = Field(description="Bracket nesting depth reached")

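To show how the three interface methods and the two schemas fit together, here is a minimal sketch of a toy manifold. Everything named `ToySum*` is hypothetical and exists only for illustration, and `random.Random` stands in for ReasonScape's `RNG` class, whose interface is not shown here.

```python
import random
from typing import Any, Dict, List

from pydantic import BaseModel, Field, ValidationError


class ToySumParams(BaseModel):
    """Hypothetical generation schema for the toy task."""
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=2, description="Number of terms to add")


class ToySumResult(BaseModel):
    """Hypothetical result schema for the toy task."""
    input: str = Field(description="The formatted problem text")
    target: str = Field(description="The correct answer")


class ToySumManifold:
    def __init__(self, seed: int = 0):
        # Assumption: random.Random replaces the real RNG class.
        self.rng = random.Random(seed)

    def generate_random(self, **params) -> List[Dict[str, Any]]:
        p = ToySumParams(**params)  # raises ValidationError on bad input
        cases = []
        for _ in range(p.count):
            terms = [self.rng.randint(-9, 9) for _ in range(p.length)]
            # Wrap negative terms in parentheses so the text stays readable.
            text = " + ".join(f"({t})" if t < 0 else str(t) for t in terms)
            cases.append({"input": text, "target": str(sum(terms))})
        return cases

    def get_generation_schema(self) -> type:
        return ToySumParams

    def get_result_schema(self) -> type:
        return ToySumResult


manifold = ToySumManifold(seed=42)
cases = manifold.generate_random(count=3, length=4)

# Every generated case should satisfy the result schema.
for case in cases:
    manifold.get_result_schema()(**case)

try:
    manifold.generate_random(count=0, length=4)  # violates count > 0
    rejected = False
except ValidationError:
    rejected = True
```

Validating the parameters at the top of `generate_random()` is one possible design; a runner could equally validate the experiment configuration against `get_generation_schema()` before calling the generator.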
Standard Test Case Format¶
All generate_random() methods return lists of dictionaries with these required fields:
- input: The complete problem text presented to the LLM
- target: The correct answer for evaluation
- Additional metadata: Task-specific fields for analysis and filtering
Example test case:
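
    {
        'input': '(3 + 4) * 2 - 1',
        'target': '13',
        'depth': 1,  # Arithmetic metadata
        # Metadata is task-specific; these two fields illustrate what a Movies test case might carry
        'reference_movies': ['Movie A', 'Movie B'],
        'selected_genres': ['Action', 'Drama']
    }
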
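A consumer of test cases can rely on the two required fields and treat everything else as metadata. The sketch below shows one way to check and split a case; `missing_fields` and `split_metadata` are hypothetical helpers, not ReasonScape functions.

```python
from typing import Any, Dict, List, Tuple

REQUIRED_FIELDS = ("input", "target")


def missing_fields(case: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent from a test case."""
    return [name for name in REQUIRED_FIELDS if name not in case]


def split_metadata(case: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate the required fields from task-specific metadata."""
    core = {name: case[name] for name in REQUIRED_FIELDS}
    meta = {k: v for k, v in case.items() if k not in REQUIRED_FIELDS}
    return core, meta


example = {"input": "(3 + 4) * 2 - 1", "target": "13", "depth": 1}
core, meta = split_metadata(example)
print(missing_fields(example))  # []
print(meta)                     # {'depth': 1}
```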
Random Number Generation¶
Task manifolds use a seeded RNG for reproducible test generation:

    def __init__(self):
        self.rng = RNG()  # Automatically seeded by runner.py

This ensures identical test cases across runs with the same seed, enabling:
- Reproducible experiments across different models
- Consistent difficulty manifolds for fair comparison
- Deterministic debugging of specific test cases
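The reproducibility property can be seen in a few lines. This sketch uses Python's standard `random.Random` as a stand-in for the runner-seeded `RNG`; the `sample_terms` helper and the seed value are hypothetical.

```python
import random
from typing import List


def sample_terms(seed: int, length: int = 5) -> List[int]:
    """Draw `length` integers in [-9, 9] from a generator seeded with `seed`."""
    rng = random.Random(seed)  # assumption: stand-in for the seeded RNG
    return [rng.randint(-9, 9) for _ in range(length)]


first_run = sample_terms(seed=123)
second_run = sample_terms(seed=123)
print(first_run == second_run)  # True: same seed, same sequence
```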
Integration with Runner¶
The runner.py system automatically:
- Seeds RNG for reproducibility
- Calls generate_random() with experiment configuration
- Removes duplicate tests across batches
- Handles errors gracefully with fallback generation
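The four responsibilities listed above can be sketched as a single loop. This is not runner.py's actual implementation: `collect_unique` and `toy_generate` are hypothetical, the generator receives the RNG as an argument purely to keep the example self-contained, duplicates are detected by `input` text, and skipping a failed batch is just one assumed form of fallback.

```python
import random
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]


def collect_unique(generate: Callable[..., List[Case]], seed: int,
                   batches: int, **params) -> List[Case]:
    """Run `generate` for several batches and keep only unseen inputs."""
    rng = random.Random(seed)  # seed once for reproducibility
    seen = set()
    unique: List[Case] = []
    for _ in range(batches):
        try:
            batch = generate(rng, **params)  # call the generator with config
        except Exception:
            continue  # assumed fallback: skip the failed batch and move on
        for case in batch:
            if case["input"] not in seen:  # drop duplicates across batches
                seen.add(case["input"])
                unique.append(case)
    return unique


def toy_generate(rng: random.Random, count: int) -> List[Case]:
    """Hypothetical generator with a small input space, so duplicates occur."""
    cases = []
    for _ in range(count):
        a, b = rng.randint(0, 3), rng.randint(0, 3)
        cases.append({"input": f"{a} + {b}", "target": str(a + b)})
    return cases


tests = collect_unique(toy_generate, seed=7, batches=5, count=10)
```

Because the toy generator can only produce 16 distinct inputs, `tests` never exceeds 16 cases no matter how many batches are requested.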
The standardized API enables the ReasonScape system to treat all reasoning tasks uniformly while preserving task-specific complexity and metadata.