Arithmetic Expression Evaluation Test Manifold

Overview

The Arithmetic Expression Evaluation test manifold assesses a model's ability to correctly parse and evaluate mathematical expressions with varying complexity, including parentheses, operator precedence, and nested structures. This task tests fundamental mathematical reasoning, symbolic manipulation, and the ability to follow order of operations in computational contexts.

Task Description

Models are presented with arithmetic expressions containing integers, basic operators (+, -, *), and parentheses with varying levels of nesting. The task requires accurate evaluation following standard mathematical precedence rules (PEMDAS/BODMAS), where parentheses override default operator precedence and multiplication takes priority over addition and subtraction.
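
The precedence rules above can be made concrete with a small reference evaluator. The following is a minimal sketch, not the manifold's own scoring code: the names `tokenize` and `evaluate` are hypothetical, and it assumes a leading `-` before a number or group is a unary minus. It uses recursive descent with one function per precedence level, so parentheses bind tightest, then multiplication, then addition and subtraction.

```python
import re

# One token is either a run of digits or a single operator/parenthesis.
TOKEN_RE = re.compile(r"\s*(\d+|[-+*()])")


def tokenize(expr: str) -> list[str]:
    """Split an expression into tokens, tolerating missing or extra spaces."""
    expr = expr.rstrip()
    tokens, pos = [], 0
    while pos < len(expr):
        match = TOKEN_RE.match(expr, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate(expr: str) -> int:
    """Evaluate +, -, * and parentheses with standard precedence."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_factor() -> int:
        # factor := NUMBER | '-' factor | '(' sum ')'
        if peek() is None:
            raise ValueError("unexpected end of expression")
        tok = advance()
        if tok == "-":
            return -parse_factor()
        if tok == "(":
            value = parse_sum()
            if peek() != ")":
                raise ValueError("expected ')'")
            advance()
            return value
        if tok.isdigit():
            return int(tok)
        raise ValueError(f"unexpected token {tok!r}")

    def parse_product() -> int:
        # product := factor ('*' factor)*
        value = parse_factor()
        while peek() == "*":
            advance()
            value *= parse_factor()
        return value

    def parse_sum() -> int:
        # sum := product (('+' | '-') product)*
        value = parse_product()
        while peek() in ("+", "-"):
            op = advance()
            rhs = parse_product()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = parse_sum()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return result
```

Because the tokenizer skips optional whitespace, the same function handles both spaced and de-whitespaced inputs, for example `evaluate("5 + 3 * 2")` and `evaluate("2*3+(4-1)")`.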

Key Features:

  • Variable complexity: Expressions range from simple chains to deeply nested parenthetical structures
  • Operator precedence: Tests understanding of mathematical order of operations
  • Parenthetical grouping: Evaluates ability to handle nested bracket structures
  • Integer arithmetic: Uses positive and negative single-digit integers
  • Whitespace variation: Expressions may have inconsistent spacing to test robust parsing

Difficulty Progression and Complexity Factors

Number of terms (length):

  • 3-9 terms: Basic precedence and simple parentheses
  • 10-19 terms: Moderate complexity with multiple operations
  • 20+ terms: Extended expressions requiring sustained attention

Nesting Depth (max_depth):

  • Depth 0: Linear expressions testing operator precedence only
  • Depth 1: Single parenthetical level overriding precedence
  • Depth 2+: Nested structures requiring recursive evaluation
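
To make the depth levels above concrete, nesting depth can be measured by scanning the expression and counting currently open parentheses. This is a minimal sketch; the function name `max_nesting_depth` is hypothetical and not part of the manifold's API.

```python
def max_nesting_depth(expr: str) -> int:
    """Return the deepest parenthesis nesting level reached in expr."""
    depth = 0    # parentheses currently open
    deepest = 0  # maximum value depth has reached so far
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without a matching opener")
    if depth != 0:
        raise ValueError("unclosed parenthesis")
    return deepest
```

A linear expression such as `5 + 3 * 2` yields 0, while `(8 - 3) * 2 + 1` yields 1.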

Number Range (min_number/max_number):

  • Upper and lower bounds are defined separately
  • Defaults to the domain of single digits (-9 to 9)
  • Negative numbers add parsing challenges without excessive calculation burden

Whitespace Randomization (prob_dewhitespace):

  • Probability of removing whitespace, from 0.0 to 1.0; higher values create tokenization challenges
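
One possible reading of this parameter is sketched below. It assumes each space is dropped independently with the given probability, which would produce the inconsistent spacing mentioned earlier; the source does not say whether removal is applied per space or per expression, and the function name `dewhitespace` is hypothetical.

```python
import random


def dewhitespace(expr: str, prob_dewhitespace: float, rng: random.Random) -> str:
    """Drop each space independently with probability prob_dewhitespace (assumed semantics)."""
    kept = []
    for ch in expr:
        if ch == " " and rng.random() < prob_dewhitespace:
            continue  # this space is removed
        kept.append(ch)
    return "".join(kept)
```

With a probability of 0.0 the expression is returned unchanged, and with 1.0 every space is removed. Passing a seeded `random.Random` keeps the output reproducible.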

Test Case Generation

Algorithm Overview

The generator uses a state machine approach to create well-formed arithmetic expressions:

  1. State-Driven Construction: Uses a finite state machine with four states (START, AFTER_NUMBER, AFTER_OPERATOR, AFTER_OPEN_PAREN) to ensure syntactic validity
  2. Probabilistic Parentheses: Randomly adds opening and closing parentheses based on configurable probabilities
  3. Depth Tracking: Monitors nesting depth to prevent excessive complexity while recording maximum depth reached
  4. Balanced Structures: Ensures all opening parentheses are properly closed
  5. Whitespace Randomization: Optionally removes spaces to test parsing robustness
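
The steps above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the manifold's actual generator: the function name `generate_expression`, the exact transition rules, and the choice to close leftover parentheses at the end are all assumptions. Only the state names and parameter names come from the description. Whitespace removal (step 5) is left out to keep the sketch short.

```python
import random
from enum import Enum, auto


class State(Enum):
    START = auto()
    AFTER_NUMBER = auto()
    AFTER_OPERATOR = auto()
    AFTER_OPEN_PAREN = auto()


def generate_expression(length, max_depth, rng, min_number=-9, max_number=9,
                        prob_open=0.4, prob_close=0.3):
    """Return (expression, deepest_nesting) for one randomly built expression."""
    tokens = []
    state = State.START
    depth = deepest = numbers = 0

    while True:
        if state is State.AFTER_NUMBER:
            # After an operand: close a group, stop, or continue with an operator.
            if depth > 0 and rng.random() < prob_close:
                tokens.append(")")
                depth -= 1  # state stays AFTER_NUMBER: a closed group acts as an operand
            elif numbers == length:
                break
            else:
                tokens.append(rng.choice(["+", "-", "*"]))
                state = State.AFTER_OPERATOR
        else:
            # START, AFTER_OPERATOR and AFTER_OPEN_PAREN all require an operand next.
            can_open = depth < max_depth and numbers < length - 1
            if can_open and rng.random() < prob_open:
                tokens.append("(")
                depth += 1
                deepest = max(deepest, depth)  # record maximum depth reached
                state = State.AFTER_OPEN_PAREN
            else:
                tokens.append(str(rng.randint(min_number, max_number)))
                numbers += 1
                state = State.AFTER_NUMBER

    tokens.extend([")"] * depth)  # balance any parentheses still open
    expr = " ".join(tokens).replace("( ", "(").replace(" )", ")")
    return expr, deepest
```

In this simplified version the three operand-expecting states behave identically; they are kept separate only to mirror the four states named above. Calling `generate_expression(length=5, max_depth=2, rng=random.Random(0))` returns an expression with exactly five numbers and nesting no deeper than two levels.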

Expression Components

Numbers:

  • Integer range: -9 to 9 (configurable via min_number/max_number parameters)
  • Single digits to maintain focus on structural complexity rather than computational difficulty

Operators:

  • Addition (+)
  • Subtraction (-)
  • Multiplication (*)
  • Division intentionally excluded to avoid floating-point complications and division-by-zero errors

Parentheses:

  • Configurable nesting depth (max_depth parameter)
  • Probabilistic insertion based on current state and depth limits
  • Automatic balancing ensures syntactic correctness

Configuration Parameters

Generation Schema (ArithmeticGenerationParams)

from pydantic import BaseModel

class ArithmeticGenerationParams(BaseModel):
    count: int                           # Number of test cases to generate (> 0)
    length: int                          # Number of terms/numbers in expression (≥ 3)
    max_depth: int                       # Maximum bracket nesting depth (≥ 1)
    min_number: int = -9                 # Minimum number value
    max_number: int = 9                  # Maximum number value
    prob_open: float = 0.4               # Probability of adding open parenthesis (0-1)
    prob_close: float = 0.3              # Probability of adding close parenthesis (0-1)
    prob_dewhitespace: float = 0.0       # Probability of removing whitespace (0-1)
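
The bounds noted in the comments above could be enforced at construction time. The following standard-library sketch mirrors the schema with a dataclass so that it runs without extra dependencies. The class name `GenerationParamsSketch` is hypothetical, and the check that `min_number` does not exceed `max_number` is an assumption that the schema does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenerationParamsSketch:
    count: int
    length: int
    max_depth: int
    min_number: int = -9
    max_number: int = 9
    prob_open: float = 0.4
    prob_close: float = 0.3
    prob_dewhitespace: float = 0.0

    def __post_init__(self) -> None:
        # Bounds taken from the schema comments.
        if self.count <= 0:
            raise ValueError("count must be > 0")
        if self.length < 3:
            raise ValueError("length must be >= 3")
        if self.max_depth < 1:
            raise ValueError("max_depth must be >= 1")
        # Assumption: an empty number range is rejected.
        if self.min_number > self.max_number:
            raise ValueError("min_number must not exceed max_number")
        for name in ("prob_open", "prob_close", "prob_dewhitespace"):
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must be between 0 and 1")
```

For example, `GenerationParamsSketch(count=10, length=5, max_depth=2)` picks up the documented defaults, while `count=0` raises a `ValueError`.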

Result Schema (ArithmeticTestCaseResult)

class ArithmeticTestCaseResult(BaseModel):
    input: str                          # The arithmetic expression to evaluate
    target: str                         # Correct numerical result
    depth: int                          # Maximum nesting depth achieved
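
As an illustration of how the three result fields relate to one expression, the sketch below derives `target` and `depth` from `input`. It is an assumption-laden example rather than the manifold's code: `safe_eval` and `build_result` are hypothetical names, a plain dict stands in for the result model, and evaluation walks Python's `ast` so that only integers, `+`, `-`, `*`, and unary minus are accepted.

```python
import ast
import operator

# Binary operators the task allows, mapped to their implementations.
_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}


def safe_eval(expr: str) -> int:
    """Evaluate an integer expression limited to +, -, * and unary minus."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return walk(ast.parse(expr.strip(), mode="eval"))


def build_result(expr: str) -> dict:
    """Build a dict with the same three fields as the result schema."""
    depth = deepest = 0
    for ch in expr:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    # target is stored as a string, matching the schema's `target: str`.
    return {"input": expr, "target": str(safe_eval(expr)), "depth": deepest}
```

For instance, `build_result("(8 - 3) * 2 + 1")` produces a record whose `target` is the string `"11"` and whose `depth` is 1.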

Example Test Cases

Simple Linear Expression (length=3, max_depth=0)

Input: "5 + 3 * 2"
Target: "11"
Depth: 0
Analysis: Tests basic operator precedence (multiplication before addition).

Single-Level Parentheses (length=4, max_depth=1)

Input: "(8 - 3) * 2 + 1"
Target: "11"
Depth: 1
Analysis: Parentheses override default precedence, forcing subtraction before multiplication.

Nested Parentheses (length=5, max_depth=2)

Input: "((2 + 3) * (4 - 1)) + 5"
Target: "20"
Depth: 2
Analysis: Multiple levels of nesting require careful parsing and evaluation order.

Negative Numbers (length=3, max_depth=1)

Input: "-5 + (3 * -2)"
Target: "-11"
Depth: 1
Analysis: Tests handling of negative integers within parenthetical expressions.

Whitespace Variation (length=4, max_depth=1, prob_dewhitespace=1.0)

Input: "2*3+(4-1)"
Target: "9"
Depth: 1
Analysis: Removing all spaces between tokens tests robust parsing.

Complex Nested Structure (length=6, max_depth=3)

Input: "((-2 + 5) * (3 + (1 * 2))) - 4"
Target: "11"
Depth: 3
Analysis: Combines negative numbers, multiple nesting levels, and mixed operators.
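
To show the evaluation order this last example demands, here is a worked reduction that resolves the innermost group first and then works outward:

```latex
\begin{align*}
((-2 + 5) \times (3 + (1 \times 2))) - 4
  &= ((-2 + 5) \times (3 + 2)) - 4 \\
  &= (3 \times 5) - 4 \\
  &= 15 - 4 \\
  &= 11
\end{align*}
```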

Cognitive Skills Tested

Core Competencies

  • Mathematical Reasoning: Understanding arithmetic operations and their properties
  • Symbolic Parsing: Correctly interpreting mathematical notation and operator symbols
  • Precedence Rules: Applying order of operations (PEMDAS/BODMAS) consistently
  • Structural Analysis: Managing nested parenthetical groupings
  • Sequential Processing: Evaluating expressions step-by-step in correct order

Advanced Skills

  • Error Handling: Robust parsing despite whitespace variations
  • Working Memory: Tracking intermediate results through complex nested calculations
  • Pattern Recognition: Identifying common mathematical structures and shortcuts

Applications

This test manifold evaluates capabilities essential for:

  • Mathematical Computing: Foundation for more complex mathematical reasoning tasks
  • Code Parsing: Understanding operator precedence in programming languages
  • Formula Evaluation: Processing mathematical expressions in spreadsheets or scientific applications
  • Symbolic Reasoning: Manipulating abstract mathematical representations