
Tasks: Parametric Test Generators

ReasonScape includes reasoning tasks across multiple cognitive domains, focusing on six primary dimensions of difficulty:

  1. Multi-Hop Reasoning - Difficult problems that require a series of sequential logical steps or iterations.
  2. Working Memory Efficiency - Tasks that load working memory along multiple dimensions at once.
  3. Tokenization Challenges - Exploit LLM tokenization weaknesses by counting, stacking and transforming text.
  4. Novel Input Patterns/Repetition - Scenarios with inputs or outputs falling outside typical training distributions.
  5. Distractions - Extra information that appears relevant but is not.
  6. Interference - Patterns that are broken by exceptions.

Primary Task Manifolds

ReasonScape includes twelve comprehensive reasoning tasks with parametric difficulty manifolds, each testing distinct cognitive capabilities through systematic complexity control:

Task Capability Matrix

The matrix evaluates each task across eleven capability dimensions: Math, Logic, State Tracking, Selective Attention, Symbolic Parsing, Pattern Recognition, Semantic Categorization, Spatial Reasoning, Structural Analysis, Temporal Reasoning, and Language. The primary focus of each task:

  • Arithmetic - Mathematical reasoning with nested structures
  • Boolean - Logical evaluation with multiple notations
  • Brackets - Structural parsing and bracket matching
  • Objects - Selective counting with semantic categorization
  • Shuffle - State tracking through sequential transformations
  • Sort - Algorithmic ordering with format variations
  • Dates - Temporal reasoning and calendar arithmetic
  • Letters - Character analysis and frequency counting
  • Movies - Pattern recognition and thematic similarity
  • Sequence - Rule-based sequence generation
  • Shapes - Geometric pattern recognition
  • Cars - Multi-step spatial transformations

Testing Conditions Applied

Tasks are exercised under a shared set of testing conditions: Length/Scale, Depth/Nesting, Multi-Step, Format Variation, Anchoring, Whitespace Randomization, Case Mutations, Distractions, and Domain Variation. Several tasks additionally apply their own condition variants:

  • Objects - Details, Cross-Category, Multi-Category
  • Shuffle - Instructions, Multi-Domain
  • Dates - Multi-Domain
  • Letters - Cross-Category
  • Movies - Multi-Domain, Multi-Category
  • Shapes - Transformations
  • Cars - Instructions

Difficulty Parameter Dimensions

Each task family implements multiple difficulty dimensions that scale cognitive load systematically:

  • Length/Scale: Information processing load (3 terms → 50+ terms, 3 entities → 12+ entities)
  • Depth: Sequential/hierarchical complexity (1 level → 8+ levels of nesting, 1 swap → 4+ sequential operations)
  • Interference: Attention competition (distractors, confounders, exceptions, irrelevant information streams)
  • Semantic Complexity: Distribution shift (familiar procedures in unfamiliar formats, novel input patterns)
  • Format Variation: Tokenization challenges (whitespace removal, case mutations, anchor formats)
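Combining two or more of these dimensions yields a difficulty manifold: a grid of parameter points that a generator can be swept across. A minimal sketch, assuming the length and max_depth parameter names of the Arithmetic schema below (the axis values here are illustrative, not the project's actual sweep):

```python
# Build a 2-D difficulty grid over the Length/Scale and Depth axes.
from itertools import product

lengths = [3, 5, 10, 20, 50]   # Length/Scale axis (number of terms)
depths = [1, 2, 4, 8]          # Depth axis (bracket nesting levels)

grid = [{"length": n, "max_depth": d} for n, d in product(lengths, depths)]
print(len(grid))  # 20 parameter points covering the 2-D manifold
```

Each grid point can then be passed as keyword arguments to a task's generate_random() call.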

Task API

All ReasonScape task generators implement a standardized Task Manifold API that enables parametric test generation with type safety and schema validation.

Core Interface

Every task manifold class must implement these methods:

from typing import Any, Dict, List

class TaskManifold:
    """Base interface for ReasonScape task generators."""

    def __init__(self):
        """Initialize the task generator with any required resources."""
        self.rng = RNG()  # Random number generator, seeded by runner.py

    def generate_random(self, **params) -> List[Dict[str, Any]]:
        """Generate a list of test cases with the specified parameters.

        Args:
            **params: Task-specific generation parameters.

        Returns:
            List of test case dictionaries with 'input', 'target', and metadata.
        """
        raise NotImplementedError

    def get_generation_schema(self) -> type:
        """Return the Pydantic model class defining valid generation parameters."""
        raise NotImplementedError

    def get_result_schema(self) -> type:
        """Return the Pydantic model class defining the test case result structure."""
        raise NotImplementedError
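A minimal, hypothetical manifold showing the interface in use. The AdditionManifold class and its schemas are illustrative, not part of ReasonScape; random.Random stands in for the project's runner-seeded RNG class so the sketch is self-contained:

```python
import random
from typing import Any, Dict, List
from pydantic import BaseModel, Field

class AdditionGenerationParams(BaseModel):
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=2, description="Number of terms to sum")

class AdditionTestCaseResult(BaseModel):
    input: str
    target: str
    length: int

class AdditionManifold:
    def __init__(self):
        self.rng = random.Random(0)  # stand-in for the runner-seeded RNG()

    def generate_random(self, **params) -> List[Dict[str, Any]]:
        p = AdditionGenerationParams(**params)  # validate against the schema
        cases = []
        for _ in range(p.count):
            terms = [self.rng.randint(-9, 9) for _ in range(p.length)]
            cases.append({
                "input": " + ".join(str(t) for t in terms),
                "target": str(sum(terms)),
                "length": p.length,  # task-specific metadata
            })
        return cases

    def get_generation_schema(self) -> type:
        return AdditionGenerationParams

    def get_result_schema(self) -> type:
        return AdditionTestCaseResult
```

Validating **params through the generation schema at the top of generate_random() keeps bad experiment configurations from silently producing malformed test cases.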

Parameter Validation with Pydantic

Task manifolds use Pydantic models for parameter validation and documentation:

from pydantic import BaseModel, Field

class ArithmeticGenerationParams(BaseModel):
    """Schema for generate_random() arguments"""
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=3, description="Number of terms")
    max_depth: int = Field(ge=1, description="Maximum bracket nesting depth")
    min_number: int = Field(default=-9, description="Minimum number value")
    max_number: int = Field(default=9, description="Maximum number value")
    prob_open: float = Field(default=0.4, ge=0, le=1, description="Probability of adding open parenthesis")

class ArithmeticTestCaseResult(BaseModel):
    """Schema for each test case dictionary returned by generate_random()"""
    input: str = Field(description="The formatted problem text")
    target: str = Field(description="The correct answer")
    depth: int = Field(description="Bracket nesting depth reached")
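A short usage sketch of the validation in action: valid parameters construct cleanly with defaults applied, while out-of-range values are rejected with a ValidationError before any test cases are generated.

```python
from pydantic import BaseModel, Field, ValidationError

class ArithmeticGenerationParams(BaseModel):
    count: int = Field(gt=0)
    length: int = Field(ge=3)
    max_depth: int = Field(ge=1)
    prob_open: float = Field(default=0.4, ge=0, le=1)

params = ArithmeticGenerationParams(count=10, length=5, max_depth=2)
print(params.prob_open)  # 0.4 (default applied)

try:
    ArithmeticGenerationParams(count=0, length=5, max_depth=2)  # violates gt=0
except ValidationError as e:
    print("rejected:", len(e.errors()), "error(s)")
```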

Standard Test Case Format

All generate_random() methods return lists of dictionaries with these required fields:

  • input: The complete problem text presented to the LLM
  • target: The correct answer for evaluation
  • Additional metadata: Task-specific fields for analysis and filtering

Example test case:

{
    'input': '(3 + 4) * 2 - 1',                  # problem text shown to the LLM
    'target': '13',                              # correct answer for evaluation
    'depth': 1,                                  # Arithmetic-specific metadata
    'reference_movies': ['Movie A', 'Movie B'],  # Movies-specific metadata
    'selected_genres': ['Action', 'Drama']
}

Random Number Generation

Task manifolds use a seeded RNG for reproducible test generation:

def __init__(self):
    self.rng = RNG()  # Automatically seeded by runner.py

This ensures identical test cases across runs with the same seed, enabling:

  • Reproducible experiments across different models
  • Consistent difficulty manifolds for fair comparison
  • Deterministic debugging of specific test cases
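The reproducibility property can be illustrated with the standard library's random.Random, standing in for the project's RNG class: the same seed always yields the same sequence of draws, hence the same test cases.

```python
import random

def draws(seed, n=5):
    """Draw n integers in [-9, 9] from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(-9, 9) for _ in range(n)]

assert draws(42) == draws(42)  # same seed -> identical test cases
print(draws(42))
```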

Integration with Runner

The runner.py system automatically:

  1. Seeds RNG for reproducibility
  2. Calls generate_random() with experiment configuration
  3. Removes duplicate tests across batches
  4. Handles errors gracefully with fallback generation
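The four steps above can be sketched as a single generation loop. This is a hypothetical illustration, not the actual runner.py API: the function name, the de-duplication key (the 'input' text), and the blanket error fallback are all assumptions.

```python
import random

def run_batches(manifold, batches, seed=1234):
    """Generate test cases for a list of parameter dicts, de-duplicated."""
    random.seed(seed)                   # 1. seed for reproducibility
    seen, cases = set(), []
    for params in batches:
        try:
            generated = manifold.generate_random(**params)  # 2. generate
        except Exception:
            generated = []              # 4. fall back gracefully on errors
        for case in generated:
            if case["input"] not in seen:  # 3. drop duplicate tests
                seen.add(case["input"])
                cases.append(case)
    return cases
```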

The standardized API enables the ReasonScape system to treat all reasoning tasks uniformly while preserving task-specific complexity and metadata.