Arithmetic Expression Evaluation Test Manifold
Overview
The Arithmetic Expression Evaluation test manifold assesses a model's ability to correctly parse and evaluate mathematical expressions with varying complexity, including parentheses, operator precedence, and nested structures. This task tests fundamental mathematical reasoning, symbolic manipulation, and the ability to follow order of operations in computational contexts.
Task Description
Models are presented with arithmetic expressions containing integers, basic operators (+, -, *), and parentheses with varying levels of nesting. The task requires accurate evaluation following standard mathematical precedence rules (PEMDAS/BODMAS), where parentheses override default operator precedence and multiplication takes priority over addition and subtraction.
Key Features:
- Variable complexity: Expressions range from simple chains to deeply nested parenthetical structures
- Operator precedence: Tests understanding of mathematical order of operations
- Parenthetical grouping: Evaluates ability to handle nested bracket structures
- Integer arithmetic: Uses positive and negative single-digit integers
- Whitespace variation: Expressions may have inconsistent spacing to test robust parsing
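To make the precedence rules above concrete, here is a minimal recursive-descent evaluator sketch. It is not the manifold's actual scoring code: the names `tokenize` and `evaluate` are hypothetical, and it assumes that a minus sign appearing where an operand is expected is a unary negation.

```python
import re

# One token is a run of digits or a single operator/parenthesis.
TOKEN = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(expr: str) -> list[str]:
    """Split an expression into tokens, ignoring any whitespace."""
    expr = expr.rstrip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr: str) -> int:
    """Evaluate +, -, * and parentheses with standard precedence."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        if tok is None:
            raise ValueError("unexpected end of expression")
        pos += 1
        return tok

    def factor() -> int:
        # A factor is a number, a negated factor, or a parenthesised sum.
        tok = take()
        if tok == "-":
            return -factor()
        if tok == "(":
            value = total()
            if take() != ")":
                raise ValueError("expected ')'")
            return value
        if tok.isdigit():
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    def product() -> int:
        # Multiplication binds tighter than addition and subtraction.
        value = factor()
        while peek() == "*":
            take()
            value *= factor()
        return value

    def total() -> int:
        # Addition and subtraction are applied left to right.
        value = product()
        while peek() in ("+", "-"):
            op = take()
            rhs = product()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = total()
    if peek() is not None:
        raise ValueError(f"unexpected trailing token {peek()!r}")
    return result


print(evaluate("5 + 3 * 2"))      # 11
print(evaluate("-5 + (3 * -2)"))  # -11
```

Each grammar level (`total`, `product`, `factor`) corresponds to one precedence tier, so parentheses override precedence simply by restarting at the lowest tier.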
Difficulty Progression and Complexity Factors
Number of terms (length):
- 3-10 terms: Basic precedence and simple parentheses
- 10-20 terms: Moderate complexity with multiple operations
- 20+ terms: Extended expressions requiring sustained attention
Nesting Depth (max_depth):
- Depth 0: Linear expressions testing operator precedence only
- Depth 1: Single parenthetical level overriding precedence
- Depth 2+: Nested structures requiring recursive evaluation
Number Range (min_number/max_number):
- Upper and lower bounds are defined separately
- Defaults to the domain of single digits (-9 to 9)
- Negative numbers add parsing challenges without excessive calculation burden
Whitespace Randomization (prob_dewhitespace):
- Probability of removing whitespace, from 0.0 to 1.0; removed whitespace creates tokenization challenges
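The sketch below shows one possible reading of `prob_dewhitespace`. It assumes the probability is applied independently to each space character; the source does not say whether removal is per space or per expression, and the function name `dewhitespace` is hypothetical.

```python
import random


def dewhitespace(expr: str, prob_dewhitespace: float, rng: random.Random) -> str:
    """Drop each space independently with probability prob_dewhitespace."""
    kept = []
    for ch in expr:
        # rng.random() is in [0.0, 1.0), so 0.0 never removes and 1.0 always removes.
        if ch == " " and rng.random() < prob_dewhitespace:
            continue
        kept.append(ch)
    return "".join(kept)


rng = random.Random(42)  # seeded so the sketch is reproducible
print(dewhitespace("2 * 3 + (4 - 1)", 1.0, rng))  # 2*3+(4-1)
print(dewhitespace("2 * 3 + (4 - 1)", 0.5, rng))  # some spaces removed
```

Under this reading, intermediate probabilities produce unevenly spaced expressions such as a mix of `2*3` and `+ (4 - 1)` styles within one input.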
Test Case Generation
Algorithm Overview
The generator uses a state machine approach to create well-formed arithmetic expressions:
- State-Driven Construction: Uses a finite state machine with four states (START, AFTER_NUMBER, AFTER_OPERATOR, AFTER_OPEN_PAREN) to ensure syntactic validity
- Probabilistic Parentheses: Randomly adds opening and closing parentheses based on configurable probabilities
- Depth Tracking: Monitors nesting depth to prevent excessive complexity while recording maximum depth reached
- Balanced Structures: Ensures all opening parentheses are properly closed
- Whitespace Randomization: Optionally removes spaces to test parsing robustness
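The steps above can be sketched roughly as follows. This is an illustrative reconstruction, not the generator's actual code: the function name `generate_expression`, the exact transition rules, and the choice to close any remaining parentheses at the end are assumptions. Only the four state names and the parameter names come from this document.

```python
import random
from enum import Enum, auto


class State(Enum):
    START = auto()
    AFTER_NUMBER = auto()
    AFTER_OPERATOR = auto()
    AFTER_OPEN_PAREN = auto()


def generate_expression(length, max_depth, rng, min_number=-9, max_number=9,
                        prob_open=0.4, prob_close=0.3):
    """Return (expression, deepest nesting level reached)."""
    tokens = []
    state = State.START
    depth = deepest = numbers = 0

    while True:
        if state is State.AFTER_NUMBER:
            # After a number: maybe close a group, stop, or emit an operator.
            if depth > 0 and rng.random() < prob_close:
                tokens.append(")")
                depth -= 1
                continue  # still in AFTER_NUMBER
            if numbers >= length:
                break
            tokens.append(rng.choice("+-*"))
            state = State.AFTER_OPERATOR
        else:
            # START, AFTER_OPERATOR and AFTER_OPEN_PAREN all expect an operand.
            if depth < max_depth and rng.random() < prob_open:
                tokens.append("(")
                depth += 1
                deepest = max(deepest, depth)
                state = State.AFTER_OPEN_PAREN
            else:
                tokens.append(str(rng.randint(min_number, max_number)))
                numbers += 1
                state = State.AFTER_NUMBER

    # Balance the expression by closing whatever is still open.
    tokens.extend(")" * depth)

    # Join with spaces, then attach parentheses to their neighbours.
    text = " ".join(tokens).replace("( ", "(").replace(" )", ")")
    return text, deepest


expr, deepest = generate_expression(5, 2, random.Random(0))
print(expr, deepest)
```

Because an operator is only ever emitted after a number and an operand only after START, an operator, or an open parenthesis, every output is syntactically valid by construction. This sketch can wrap a single number in parentheses, which a real generator might choose to avoid.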
Expression Components
Numbers:
- Integer range: -9 to 9 (configurable via min_number/max_number parameters)
- Single digits by default, keeping the focus on structural complexity rather than computational difficulty
Operators:
- Addition (+)
- Subtraction (-)
- Multiplication (*)
- Division intentionally excluded to avoid floating-point complications and division-by-zero errors
Parentheses:
- Configurable nesting depth (max_depth parameter)
- Probabilistic insertion based on current state and depth limits
- Automatic balancing ensures syntactic correctness
Configuration Parameters
Generation Schema (ArithmeticGenerationParams)
from pydantic import BaseModel  # assumption: BaseModel is Pydantic's

class ArithmeticGenerationParams(BaseModel):
    count: int                # Number of test cases to generate (> 0)
    length: int               # Number of terms/numbers in expression (≥ 3)
    max_depth: int            # Maximum bracket nesting depth (≥ 1)
    min_number: int           # Minimum number value (default: -9)
    max_number: int           # Maximum number value (default: 9)
    prob_open: float          # Probability of adding open parenthesis (0-1, default: 0.4)
    prob_close: float         # Probability of adding close parenthesis (0-1, default: 0.3)
    prob_dewhitespace: float  # Probability of removing whitespace (0-1, default: 0)
Result Schema (ArithmeticTestCaseResult)
class ArithmeticTestCaseResult(BaseModel):
    input: str   # The arithmetic expression to evaluate
    target: str  # Correct numerical result
    depth: int   # Maximum nesting depth achieved
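The `depth` field can also be recomputed from an expression string, which gives a simple consistency check on generated cases. The helper below is a hypothetical sketch, not part of the documented schema; it assumes depth means the largest number of parentheses open at the same time.

```python
def max_nesting_depth(expr: str) -> int:
    """Return the deepest parenthesis nesting level in expr."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without a matching opener")
    if depth != 0:
        raise ValueError("unclosed parenthesis")
    return deepest


print(max_nesting_depth("5 + 3 * 2"))                # 0
print(max_nesting_depth("((2 + 3) * (4 - 1)) + 5"))  # 2
```

Sibling groups such as `(2 + 3)` and `(4 - 1)` do not add to each other; only groups nested inside one another increase the depth.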
Example Test Cases
Simple Linear Expression (length=3, max_depth=0)
Input: "5 + 3 * 2"
Target: "11"
Depth: 0
Single-Level Parentheses (length=4, max_depth=1)
Input: "(8 - 3) * 2 + 1"
Target: "11"
Depth: 1
Nested Parentheses (length=5, max_depth=2)
Input: "((2 + 3) * (4 - 1)) + 5"
Target: "20"
Depth: 2
Negative Numbers (length=3, max_depth=1)
Input: "-5 + (3 * -2)"
Target: "-11"
Depth: 1
Whitespace Variation (length=4, max_depth=1, prob_dewhitespace=1.0)
Input: "2*3+(4-1)"
Target: "9"
Depth: 1
Complex Nested Structure (length=6, max_depth=3)
Input: "((-2 + 5) * (3 + (1 * 2))) - 4"
Target: "11"
Depth: 3
Cognitive Skills Tested
Core Competencies
- Mathematical Reasoning: Understanding arithmetic operations and their properties
- Symbolic Parsing: Correctly interpreting mathematical notation and operator symbols
- Precedence Rules: Applying order of operations (PEMDAS/BODMAS) consistently
- Structural Analysis: Managing nested parenthetical groupings
- Sequential Processing: Evaluating expressions step-by-step in correct order
Advanced Skills
- Error Handling: Robust parsing despite whitespace variations
- Working Memory: Tracking intermediate results through complex nested calculations
- Pattern Recognition: Identifying common mathematical structures and shortcuts
Applications
This test manifold evaluates capabilities essential for:
- Mathematical Computing: Foundation for more complex mathematical reasoning tasks
- Code Parsing: Understanding operator precedence in programming languages
- Formula Evaluation: Processing mathematical expressions in spreadsheets or scientific applications
- Symbolic Reasoning: Manipulating abstract mathematical representations