Arithmetic Expression Evaluation Test Manifold¶

Overview¶

The Arithmetic Expression Evaluation test manifold assesses a model's ability to parse and evaluate mathematical expressions of varying complexity, including parentheses, operator precedence, and nested structures. The task tests fundamental mathematical reasoning, symbolic manipulation, and the ability to follow the order of operations in computational contexts.

Task Description¶

Models are presented with arithmetic expressions containing integers, basic operators (+, -, *), and parentheses with varying levels of nesting. The task requires accurate evaluation following standard mathematical precedence rules (PEMDAS/BODMAS), where parentheses override default operator precedence and multiplication takes priority over addition and subtraction.

Key Features:

Variable complexity: Expressions range from simple chains to deeply nested parenthetical structures
Operator precedence: Tests understanding of mathematical order of operations
Parenthetical grouping: Evaluates ability to handle nested bracket structures
Integer arithmetic: Uses positive and negative integers, single-digit by default
Whitespace variation: Expressions may have inconsistent spacing to test robust parsing
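
The precedence and grouping rules described above can be illustrated with a small recursive-descent evaluator. This is a minimal sketch, not the manifold's own code: the names `tokenize` and `evaluate` are hypothetical, and the grammar (sums of products of factors, with unary minus and parentheses) is an assumption about the expression format.

```python
import re

# One token is an unsigned integer, an operator, or a parenthesis.
_TOKEN = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(expr: str) -> list[str]:
    """Split an expression into tokens, ignoring whitespace (hypothetical helper)."""
    expr = expr.rstrip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = _TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr: str) -> int:
    """Evaluate +, -, * and parentheses with standard precedence."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_sum():
        # Lowest precedence: addition and subtraction, left to right.
        value = parse_product()
        while peek() in ("+", "-"):
            if advance() == "+":
                value += parse_product()
            else:
                value -= parse_product()
        return value

    def parse_product():
        # Multiplication binds tighter than addition and subtraction.
        value = parse_factor()
        while peek() == "*":
            advance()
            value *= parse_factor()
        return value

    def parse_factor():
        # A factor is a number, a negated factor, or a parenthesized sum.
        if peek() is None:
            raise ValueError("unexpected end of expression")
        tok = advance()
        if tok == "-":
            return -parse_factor()
        if tok == "(":
            value = parse_sum()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if tok.isdigit():
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    result = parse_sum()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

Because the tokenizer skips whitespace, `evaluate("2*3+(4-1)")` and `evaluate("2 * 3 + (4 - 1)")` take the same path through the parser.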

Difficulty Progression and Complexity Factors¶

Number of terms (length):

3-10 terms: Basic precedence and simple parentheses
11-20 terms: Moderate complexity with multiple operations
More than 20 terms: Extended expressions requiring sustained attention

Nesting Depth (max_depth):

Depth 0: Linear expressions testing operator precedence only
Depth 1: Single parenthetical level overriding precedence
Depth 2+: Nested structures requiring recursive evaluation
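
The depth levels above can be measured directly from an expression string. The following is a sketch under the assumption that depth means the largest number of simultaneously open parentheses; `max_nesting_depth` is a hypothetical helper name.

```python
def max_nesting_depth(expr: str) -> int:
    """Return the deepest level of parenthesis nesting in an expression."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without a matching opener")
    if depth != 0:
        raise ValueError("unclosed opening parenthesis")
    return deepest
```

A linear expression such as `1 + 2 * 3` has depth 0, `(1 + 2) * 3` has depth 1, and `((1 + 2) * 3) - 4` has depth 2.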

Number Range (min_number/max_number):

Upper and lower bounds are defined separately
Defaults to the domain of single digits (-9 to 9)
Negative numbers add parsing challenges without excessive calculation burden

Whitespace Randomization (prob_dewhitespace):

Probability of removing whitespace, from 0.0 to 1.0; removing spaces creates tokenization challenges
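
One way the whitespace setting might be applied is sketched below. It assumes the probability is drawn once per expression and that all spaces are removed together; the real generator may instead decide per space. The name `maybe_dewhitespace` is hypothetical.

```python
import random


def maybe_dewhitespace(expr: str, prob: float, rng: random.Random) -> str:
    """With probability `prob`, strip every space from the expression."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be between 0.0 and 1.0")
    # rng.random() lies in [0, 1), so prob=0.0 never strips and prob=1.0 always does.
    if rng.random() < prob:
        return expr.replace(" ", "")
    return expr
```

Passing a seeded `random.Random` instance keeps generated test sets reproducible.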

Test Case Generation¶

Algorithm Overview¶

The generator uses a state machine approach to create well-formed arithmetic expressions:

State-Driven Construction: Uses a finite state machine with four states (START, AFTER_NUMBER, AFTER_OPERATOR, AFTER_OPEN_PAREN) to ensure syntactic validity
Probabilistic Parentheses: Randomly adds opening and closing parentheses based on configurable probabilities
Depth Tracking: Monitors nesting depth to prevent excessive complexity while recording maximum depth reached
Balanced Structures: Ensures all opening parentheses are properly closed
Whitespace Randomization: Optionally removes spaces to test parsing robustness
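
The steps above can be sketched as follows. This is an illustrative reconstruction, not the actual generator: the function name `generate_expression`, the exact transition rules, and the uniform choice of operators are all assumptions. In this sketch START, AFTER_OPERATOR, and AFTER_OPEN_PAREN share one branch because each allows either a number or an opening parenthesis next.

```python
import random
from enum import Enum, auto


class State(Enum):
    START = auto()
    AFTER_NUMBER = auto()
    AFTER_OPERATOR = auto()
    AFTER_OPEN_PAREN = auto()


def generate_expression(
    length: int,
    max_depth: int,
    rng: random.Random,
    min_number: int = -9,
    max_number: int = 9,
    prob_open: float = 0.4,
    prob_close: float = 0.3,
) -> tuple[str, int]:
    """Return (expression, deepest nesting reached) with `length` numbers."""
    tokens: list[str] = []
    state = State.START
    depth = deepest = 0
    numbers = 0

    while numbers < length:
        if state is State.AFTER_NUMBER:
            # After a number: either close an open group or continue with an operator.
            if depth > 0 and rng.random() < prob_close:
                tokens.append(")")
                depth -= 1  # a closed group still behaves like a completed number
            else:
                tokens.append(rng.choice("+-*"))
                state = State.AFTER_OPERATOR
        else:
            # START, AFTER_OPERATOR, AFTER_OPEN_PAREN: open a group or emit a number.
            if depth < max_depth and rng.random() < prob_open:
                tokens.append("(")
                depth += 1
                deepest = max(deepest, depth)
                state = State.AFTER_OPEN_PAREN
            else:
                tokens.append(str(rng.randint(min_number, max_number)))
                numbers += 1
                state = State.AFTER_NUMBER

    # Balance the expression by closing every group that is still open.
    tokens.extend([")"] * depth)

    # Join with spaces, but keep parentheses tight against their contents.
    expr = " ".join(tokens).replace("( ", "(").replace(" )", ")")
    return expr, deepest
```

The loop always ends right after a number is emitted, so the final balancing step never leaves a dangling operator or an empty pair of parentheses.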

Expression Components¶

Numbers:

Integer range: -9 to 9 by default (configurable via min_number/max_number parameters)
Single digits by default, keeping the focus on structural complexity rather than computational difficulty

Operators:

Addition (+)
Subtraction (-)
Multiplication (*)
Division intentionally excluded to avoid floating-point complications and division-by-zero errors

Parentheses:

Configurable nesting depth (max_depth parameter)
Probabilistic insertion based on current state and depth limits
Automatic balancing ensures syntactic correctness

Configuration Parameters¶

Generation Schema (`ArithmeticGenerationParams`)¶

from pydantic import BaseModel  # assumption: BaseModel is Pydantic's base class

class ArithmeticGenerationParams(BaseModel):
    count: int                           # Number of test cases to generate (> 0)
    length: int                          # Number of terms/numbers in expression (≥ 3)
    max_depth: int                       # Maximum bracket nesting depth (≥ 1)
    min_number: int = -9                 # Minimum number value (default: -9)
    max_number: int = 9                  # Maximum number value (default: 9)
    prob_open: float = 0.4               # Probability of adding open parenthesis (0-1, default: 0.4)
    prob_close: float = 0.3              # Probability of adding close parenthesis (0-1, default: 0.3)
    prob_dewhitespace: float = 0.0       # Probability of removing whitespace (0-1, default: 0)

Result Schema (`ArithmeticTestCaseResult`)¶

class ArithmeticTestCaseResult(BaseModel):
    input: str                          # The arithmetic expression to evaluate
    target: str                         # Correct numerical result
    depth: int                          # Maximum nesting depth achieved

Example Test Cases¶

Simple Linear Expression (length=3, max_depth=0)¶

Input: "5 + 3 * 2"
Target: "11"
Depth: 0

Analysis: Tests basic operator precedence (multiplication before addition).

Single-Level Parentheses (length=4, max_depth=1)¶

Input: "(8 - 3) * 2 + 1"
Target: "11"
Depth: 1

Analysis: Parentheses override default precedence, forcing subtraction before multiplication.

Nested Parentheses (length=5, max_depth=2)¶

Input: "((2 + 3) * (4 - 1)) + 5"
Target: "20"
Depth: 2

Analysis: Multiple levels of nesting require careful parsing and evaluation order.

Negative Numbers (length=3, max_depth=1)¶

Input: "-5 + (3 * -2)"
Target: "-11"
Depth: 1

Analysis: Tests handling of negative integers within parenthetical expressions.

Whitespace Variation (length=4, max_depth=1, prob_dewhitespace=1.0)¶

Input: "2*3+(4-1)"
Target: "9"
Depth: 1

Analysis: Removing all spaces between tokens tests robust parsing capabilities.

Complex Nested Structure (length=6, max_depth=3)¶

Input: "((-2 + 5) * (3 + (1 * 2))) - 4"
Target: "11"
Depth: 3

Analysis: Combines negative numbers, multiple nesting levels, and mixed operators.
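
For reference, the expression above can be reduced innermost-first, one group at a time:

```latex
\begin{aligned}
((-2 + 5) \times (3 + (1 \times 2))) - 4
  &= (3 \times (3 + 2)) - 4 \\
  &= (3 \times 5) - 4 \\
  &= 15 - 4 \\
  &= 11
\end{aligned}
```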

Cognitive Skills Tested¶

Core Competencies¶

Mathematical Reasoning: Understanding arithmetic operations and their properties
Symbolic Parsing: Correctly interpreting mathematical notation and operator symbols
Precedence Rules: Applying order of operations (PEMDAS/BODMAS) consistently
Structural Analysis: Managing nested parenthetical groupings
Sequential Processing: Evaluating expressions step-by-step in correct order

Advanced Skills¶

Error Handling: Robust parsing despite whitespace variations
Working Memory: Tracking intermediate results through complex nested calculations
Pattern Recognition: Identifying common mathematical structures and shortcuts

Applications¶

This test manifold evaluates capabilities essential for:

Mathematical Computing: Foundation for more complex mathematical reasoning tasks
Code Parsing: Understanding operator precedence in programming languages
Formula Evaluation: Processing mathematical expressions in spreadsheets or scientific applications
Symbolic Reasoning: Manipulating abstract mathematical representations