Tasks

Tasks: Parametric Test Generators

ReasonScape includes reasoning tasks across multiple cognitive domains, focusing on six primary dimensions of difficulty:

  1. Multi-Hop Reasoning - Difficult problems requiring a series of sequential logical steps or iteration.
  2. Working Memory Efficiency - Problems that measure how efficiently working memory is used along multiple dimensions.
  3. Tokenization Challenges - Exploit LLM tokenization weaknesses by counting, stacking and transforming text.
  4. Novel Input Patterns/Repetition - Scenarios with inputs or outputs falling outside typical training distributions.
  5. Distractions - Extra information that appears relevant but is not.
  6. Interference - Patterns that are broken by exceptions.

Primary Task Manifolds

ReasonScape includes six core reasoning tasks with comprehensive difficulty manifolds, each testing distinct cognitive capabilities through parametric complexity control:

Arithmetic
  Description: Parse and evaluate mathematical expressions with parentheses, operator precedence, and nested structures
  Difficulty Dimensions:
    • Length (3-50 terms): Expression complexity by term count
    • Max Depth (0-8 levels): Parenthetical nesting depth
    • Number Range (-9 to 9): Integer bounds for calculation difficulty
    • Whitespace (0-100%): Tokenization challenge through space removal
  Key Challenges: Mathematical reasoning, symbolic parsing, precedence rules (PEMDAS/BODMAS), structural analysis, working memory for nested calculations, robust parsing despite formatting variations

Boolean
  Description: Evaluate complex Boolean expressions with nested logic and operator precedence
  Difficulty Dimensions:
    • Length (3+ terms): Number of Boolean values in expression
    • Max Depth (0-8 levels): Parenthetical nesting depth
    • Negation Probability: Frequency of NOT operators and chaining
    • Format Variation (5 types): TRUE_FALSE, T_F, ON_OFF, BINARY, YES_NO
    • Whitespace (0-100%): Tokenization challenge through space removal
  Key Challenges: Logical reasoning, Boolean operator precedence (not > and > or), symbolic parsing across multiple notations, negation handling, XOR operations, working memory for nested evaluations

Objects
  Description: Count specific object categories while filtering distractors and parsing natural language quantities
  Difficulty Dimensions:
    • Length (1+ items): Number of target items to include
    • Max Count (0+ per item): Maximum quantity per individual item
    • Distractor Count (0+): Number of irrelevant items from other categories
    • Target Groups (1+ categories): Number of semantic categories to count
    • Adjective Probability (0-100%): Extraneous descriptive information stream
    • Anchor Format (6 types): Organizational markers (numeric, Roman, hex, etc.)
  Key Challenges: Selective attention, semantic categorization across 11 domains, quantity extraction from natural language, arithmetic aggregation, distractor resistance, information filtering, working memory for running totals

Shuffle
  Description: Maintain state through sequential transformations while filtering confounding information
  Difficulty Dimensions:
    • Length (3-12 people): Number of entities to track
    • Max Depth (1+ swaps): Sequential transformation complexity
    • Confounding Count (0+): Irrelevant interpersonal statements
    • Anchor Format (9 types): Organizational markers (numeric, Roman, hex, etc.)
    • Domain Variation (5 contexts): Dancing, books, soccer, gifts, balls
  Key Challenges: Sequential state tracking, working memory efficiency, confounding resistance, multi-domain generalization, attention control, format-independent comprehension

Dates
  Description: Comprehend temporal information and perform calendar arithmetic within narrative contexts
  Difficulty Dimensions:
    • Question Tier (0-3): Basic identification → Simple arithmetic → Complex calendar logic → Multi-step calculations
    • Date Format (4 types): MM/DD/YYYY, natural language, ordinal day-of-year, relative offset
    • Scenario Diversity (36 types): Realistic contexts from anniversaries to scheduling
  Key Challenges: Temporal context extraction, calendar arithmetic across month/year boundaries, leap year logic, format recognition, multi-step inference, contradiction resolution, pattern recognition for recurring events

Movies
  Description: Identify thematic and stylistic similarities between films through pattern recognition
  Difficulty Dimensions:
    • Reference Count (4-6 movies): Size of user preference set
    • Choice Count (3-5 options): Multiple choice complexity
    • Genre/Theme Weights: Configurable selection probability for clustering difficulty
    • Template Variation: Diverse question phrasings to test robust understanding
  Key Challenges: Cultural knowledge of movie genres and themes, similarity analysis across multiple dimensions, pattern recognition for thematic connections, preference modeling from limited examples, analogical reasoning between different content types
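
To make the Arithmetic dimensions above concrete, here is a minimal sketch of how length, maximum nesting depth, number range, and whitespace removal could drive a generator. It is not the ReasonScape implementation: the function names, the 50% wrapping chance, and the use of Python's random.Random and eval are all assumptions made for illustration.

```python
import random


def build_expression(rng, length, max_depth, lo=-9, hi=9):
    """Return an expression with `length` integer terms and nesting <= max_depth."""
    if length == 1:
        return str(rng.randint(lo, hi))
    split = rng.randint(1, length - 1)  # number of terms given to the left operand
    parts = []
    for n_terms in (split, length - split):
        # Only multi-term operands are wrapped, and only while depth budget remains.
        wrap = n_terms > 1 and max_depth > 0 and rng.random() < 0.5
        sub = build_expression(rng, n_terms, max_depth - 1 if wrap else max_depth, lo, hi)
        parts.append(f"({sub})" if wrap else sub)
    return f"{parts[0]} {rng.choice('+-*')} {parts[1]}"


def strip_whitespace(rng, text, probability):
    """Drop each space independently with the given probability (0.0 to 1.0)."""
    return "".join(ch for ch in text if ch != " " or rng.random() >= probability)


def nesting_depth(text):
    """Deepest level of parenthesis nesting in `text`."""
    depth = deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest


rng = random.Random(0)  # fixed seed so the sketch is repeatable
expression = build_expression(rng, length=6, max_depth=2)
compact = strip_whitespace(rng, expression, probability=1.0)
test_case = {
    "input": compact,
    "target": str(eval(expression)),  # acceptable here: the string is generated above
    "depth": nesting_depth(expression),
}
print(test_case)
```

The target is computed from the generated string itself, so the answer follows normal operator precedence whether or not the spaces survive.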

Planned Future Tasks

Additional reasoning domains under development:

  • Bracket Stack: Complete bracket sequences using stack-based reasoning. Focus areas: working memory, sequential processing, tokenization.
  • Word Sorting: Sort word lists alphabetically with case handling. Focus areas: string processing, sorting algorithms, repetition control.
  • SVG Shapes: Identify geometric shapes from SVG path coordinates. Focus areas: spatial reasoning, coordinate systems, pattern recognition.
  • Pattern Interference: Generate sequences following base patterns plus interference rules. Focus areas: cognitive flexibility, rule application, inhibitory control.
  • Letter Counting: Count target letters across word lists with confounding words. Focus areas: character-level processing, selective attention, tokenization.
  • But-for Causation: Determine necessary conditions in causal scenarios. Focus areas: causal reasoning, counterfactual thinking, logical inference.

Difficulty Parameter Dimensions

Each task family implements multiple difficulty dimensions:

  • Length/Scale: Information load (3 words → 24 words, 8 objects → 24 objects)
  • Depth: Sequential complexity (2 levels → 32 levels of nesting)
  • Interference: Attention competition (distractors, confounders, exceptions)
  • Semantic Complexity: Distribution shift (familiar procedures in unfamiliar formats)
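
One way to picture a difficulty manifold is as a grid over these dimensions, with one parameter set per grid point. The sketch below is illustrative only: the sweep values and the dictionary keys are hypothetical, not ReasonScape configuration.

```python
from itertools import product

# Hypothetical sweep values, chosen only for illustration.
lengths = [8, 16, 24]
depths = [2, 8, 32]
distractor_counts = [0, 4]

# One parameter set per combination of the three dimensions.
grid = [
    {"length": length, "max_depth": depth, "distractors": distractors}
    for length, depth, distractors in product(lengths, depths, distractor_counts)
]

print(len(grid))  # 18 parameter sets: 3 lengths x 3 depths x 2 distractor levels
print(grid[0])    # {'length': 8, 'max_depth': 2, 'distractors': 0}
```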

Task API

All ReasonScape task generators implement a standardized Task Manifold API that enables parametric test generation with type safety and schema validation.

Core Interface

Every task manifold class must implement these methods:

from typing import Any, Dict, List


class TaskManifold:
    def __init__(self):
        """Initialize the task generator with any required resources"""
        # RNG is the project's seeded random number generator (import omitted here)
        self.rng = RNG()

    def generate_random(self, **params) -> List[Dict[str, Any]]:
        """Generate a list of test cases with specified parameters

        Args:
            **params: Task-specific generation parameters

        Returns:
            List of test case dictionaries with 'input', 'target', and metadata
        """
        pass

    def get_generation_schema(self) -> type:
        """Return Pydantic model class defining valid generation parameters"""
        pass

    def get_result_schema(self) -> type:
        """Return Pydantic model class defining test case result structure"""
        pass
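
As a rough illustration of the interface, the sketch below implements all three methods for a toy task. Everything in it is hypothetical: the ReverseManifold task, its two schema classes, and the use of random.Random in place of the project's RNG are assumptions, not part of ReasonScape.

```python
import random
from typing import Any, Dict, List

from pydantic import BaseModel, Field


class ReverseGenerationParams(BaseModel):
    """Hypothetical schema for the toy task's generate_random() arguments."""
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=1, description="Number of digits in the sequence")


class ReverseTestCaseResult(BaseModel):
    """Hypothetical schema for one generated test case."""
    input: str = Field(description="The formatted problem text")
    target: str = Field(description="The correct answer")
    length: int = Field(description="Number of digits in the sequence")


class ReverseManifold:
    """Toy task: ask the model to repeat a digit sequence in reverse order."""

    def __init__(self, seed: int = 0):
        # Stand-in for the project's RNG(), which the runner seeds.
        self.rng = random.Random(seed)

    def generate_random(self, **params) -> List[Dict[str, Any]]:
        p = ReverseGenerationParams(**params)  # raises ValidationError on bad input
        cases = []
        for _ in range(p.count):
            digits = [str(self.rng.randint(0, 9)) for _ in range(p.length)]
            cases.append({
                "input": "Reverse this sequence: " + " ".join(digits),
                "target": " ".join(reversed(digits)),
                "length": p.length,
            })
        return cases

    def get_generation_schema(self) -> type:
        return ReverseGenerationParams

    def get_result_schema(self) -> type:
        return ReverseTestCaseResult


manifold = ReverseManifold(seed=42)
cases = manifold.generate_random(count=2, length=4)
# Each generated dict should satisfy the declared result schema.
for case in cases:
    manifold.get_result_schema()(**case)
```

Validating parameters inside generate_random() is a design choice of this sketch; a caller could equally validate against get_generation_schema() before calling.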

Parameter Validation with Pydantic

Task manifolds use Pydantic models for parameter validation and documentation:

from pydantic import BaseModel, Field


class ArithmeticGenerationParams(BaseModel):
    """Schema for generate_random() arguments"""
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=3, description="Number of terms")
    max_depth: int = Field(ge=1, description="Maximum bracket nesting depth")
    min_number: int = Field(default=-9, description="Minimum number value")
    max_number: int = Field(default=9, description="Maximum number value")
    prob_open: float = Field(default=0.4, ge=0, le=1, description="Probability of adding open parenthesis")

class ArithmeticTestCaseResult(BaseModel):
    """Schema for generate_test_case() return value"""
    input: str = Field(description="The formatted problem text")
    target: str = Field(description="The correct answer")
    depth: int = Field(description="Bracket nesting depth reached")
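
The sketch below shows what this validation buys in practice: defaults are filled in, and out-of-range values are rejected before any generation happens. The generation schema is repeated from above so the example runs on its own; the specific parameter values are arbitrary.

```python
from pydantic import BaseModel, Field, ValidationError


class ArithmeticGenerationParams(BaseModel):
    """Copy of the generation schema above, repeated so the sketch is self-contained."""
    count: int = Field(gt=0, description="Number of test cases to generate")
    length: int = Field(ge=3, description="Number of terms")
    max_depth: int = Field(ge=1, description="Maximum bracket nesting depth")
    min_number: int = Field(default=-9, description="Minimum number value")
    max_number: int = Field(default=9, description="Maximum number value")
    prob_open: float = Field(default=0.4, ge=0, le=1, description="Probability of adding open parenthesis")


# Valid parameters: omitted fields fall back to their declared defaults.
good = ArithmeticGenerationParams(count=10, length=5, max_depth=2)
print(good.min_number, good.max_number, good.prob_open)  # -9 9 0.4

# Invalid parameters: count must be > 0 and length must be >= 3.
try:
    ArithmeticGenerationParams(count=0, length=2, max_depth=2)
    rejected = []
except ValidationError as exc:
    rejected = sorted(str(err["loc"][0]) for err in exc.errors())
print(rejected)  # ['count', 'length']
```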

Standard Test Case Format

All generate_random() methods return lists of dictionaries with these required fields:

  • input: The complete problem text presented to the LLM
  • target: The correct answer for evaluation
  • Additional metadata: Task-specific fields for analysis and filtering

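Example test case (Arithmetic):

{
    'input': '(3 + 4) * 2 - 1',
    'target': '13',
    'depth': 1,  # task-specific metadata
}

Metadata fields vary by task. A Movies test case, for example, would carry fields such as 'reference_movies': ['Movie A', 'Movie B'] and 'selected_genres': ['Action', 'Drama'] in place of 'depth'.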
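
A consumer of these dictionaries can rely on the two required fields and treat everything else as metadata. The helper names below are hypothetical; this is only a sketch of that split.

```python
from typing import Any, Dict, List

REQUIRED_FIELDS = ("input", "target")


def missing_fields(test_case: Dict[str, Any]) -> List[str]:
    """Return the required field names that are absent from a test case."""
    return [name for name in REQUIRED_FIELDS if name not in test_case]


def metadata_of(test_case: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the task-specific fields, leaving out the required ones."""
    return {k: v for k, v in test_case.items() if k not in REQUIRED_FIELDS}


case = {"input": "(3 + 4) * 2 - 1", "target": "13", "depth": 1}
print(missing_fields(case))                # []
print(metadata_of(case))                   # {'depth': 1}
print(missing_fields({"input": "1 + 1"}))  # ['target']
```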

Random Number Generation

Task manifolds use a seeded RNG for reproducible test generation:

def __init__(self):
    self.rng = RNG()  # Automatically seeded by runner.py

This ensures identical test cases across runs with the same seed, enabling:

  • Reproducible experiments across different models
  • Consistent difficulty manifolds for fair comparison
  • Deterministic debugging of specific test cases
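
The reproducibility property can be illustrated with Python's standard random.Random, used here as a stand-in because the interface of the project's RNG class is not shown. The sample_terms helper is hypothetical.

```python
import random
from typing import List


def sample_terms(seed: int, count: int = 5) -> List[int]:
    """Draw `count` integers in [-9, 9] from a generator seeded with `seed`."""
    rng = random.Random(seed)  # stand-in for the project's seeded RNG
    return [rng.randint(-9, 9) for _ in range(count)]


first = sample_terms(seed=1234)
second = sample_terms(seed=1234)
print(first == second)  # True: the same seed yields the same sequence
```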

Integration with Runner

The runner.py system automatically:

  1. Seeds RNG for reproducibility
  2. Calls generate_random() with experiment configuration
  3. Removes duplicate tests across batches
  4. Handles errors gracefully with fallback generation
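
The steps above can be sketched as a small loop. This is not the runner.py code: run_batches and toy_generate are hypothetical, duplicates are detected by comparing the 'input' text, and "fallback" is interpreted here as skipping a failed batch, which is an assumption about behavior the source does not spell out.

```python
import random
from typing import Any, Callable, Dict, List


def run_batches(generate: Callable[..., List[Dict[str, Any]]],
                params: Dict[str, Any], batches: int) -> List[Dict[str, Any]]:
    """Collect test cases over several batches, skipping repeated inputs."""
    seen = set()
    collected = []
    for _ in range(batches):
        try:
            batch = generate(**params)  # step 2: call with the experiment configuration
        except Exception as exc:  # step 4 (assumed policy): skip the failed batch
            print(f"generation failed, skipping batch: {exc}")
            continue
        for case in batch:
            if case["input"] not in seen:  # step 3: dedupe on the problem text
                seen.add(case["input"])
                collected.append(case)
    return collected


rng = random.Random(7)  # step 1: seed once, before any generation


def toy_generate(count: int) -> List[Dict[str, Any]]:
    """Toy generator with a tiny input space, so duplicates are likely."""
    cases = []
    for _ in range(count):
        a, b = rng.randint(0, 2), rng.randint(0, 2)
        cases.append({"input": f"{a} + {b}", "target": str(a + b)})
    return cases


unique_cases = run_batches(toy_generate, {"count": 20}, batches=3)
inputs = [case["input"] for case in unique_cases]
print(len(inputs), "unique test cases")
```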

The standardized API enables the ReasonScape system to treat all reasoning tasks uniformly while preserving task-specific complexity and metadata.