Cars Movement Tracking Test Manifold

Overview

The Cars Movement Tracking test manifold evaluates a model's ability to track complex spatial transformations through sequential movement operations, requiring sophisticated spatial reasoning, attribute-based lookup, and multi-dimensional state tracking capabilities. Models must maintain accurate person-to-car mappings through diverse movement commands while filtering out confounding information that resembles operations but doesn't change state.

Task Description

Models are presented with an initial assignment of people to cars in a linear sequence, followed by a series of movement operations with varying complexity. After all transformations are complete, the model must identify which car position a specific person currently occupies. The task tests spatial reasoning, semantic attribute processing, reference resolution, and occupancy state tracking.

Key Features:

  • Multi-Dimensional Spatial Reasoning: Processing absolute positions, relative movements, and directional references
  • Attribute-Based Navigation: Using car properties (color, brand) for movement targeting with filtering
  • Reference Resolution: Handling person-relative positioning and occupancy-based movement
  • Confounding Resistance: Filtering out operation-like statements that don't change state
  • Sequential State Tracking: Maintaining accurate person-car mappings through complex transformations

Cognitive Architecture Challenge: State Space Selection

The Fundamental Mental Model Choice

This task creates a deceptive bifurcation in mental model selection that reveals deep architectural biases in language models. The problem can be represented using two competing state spaces:

Wrong Mental Model: Person-Centric State Tracking

# Natural but incorrect representation
alice = PersonAgent(current_car=1, preferences=["BMW"], relationships=["friends with Bob"])
bob = PersonAgent(current_car=3, preferences=["red cars"], relationships=["talks to Alice"])

Models using this representation focus on people as active agents with intentions, preferences, and social relationships. This feels natural because people are the grammatical subjects of action sentences, but leads to systematic failures in spatial tracking.

Correct Mental Model: Car-Centric State Tracking

# Mechanical but correct representation
cars = [
    Car(position=1, occupants=[alice], color="red", brand="Toyota"),
    Car(position=2, occupants=[], color="blue", brand="BMW"),
    Car(position=3, occupants=[bob, claire], color="green", brand="Honda")
]

Models using this representation treat people as tokens being moved through spatial positions rather than autonomous agents, enabling accurate state tracking through complex transformations.
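
A minimal sketch of how a single command is executed under this representation; the Car dataclass and helper below are illustrative, not the manifold's actual implementation:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Car:
    position: int                                    # 1-indexed slot in the row
    color: str
    brand: str
    occupants: List[str] = field(default_factory=list)

def move_person(cars: List[Car], person: str, target_position: int) -> None:
    # "Alice gets into car #3" reduces to a mechanical state update:
    # remove the person from whichever car currently holds them,
    # then append them to the target car's occupant list.
    for car in cars:
        if person in car.occupants:
            car.occupants.remove(person)
    target = next(car for car in cars if car.position == target_position)
    target.occupants.append(person)

Under this view people carry no state of their own; every movement sentence, however it is phrased, ends up as the same remove-and-append operation on the occupants lists.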

Why Sophisticated Models Fail More

Larger language models exhibit a stronger anthropomorphic bias due to:

  • Rich Semantic Processing: Better at understanding human agency and intentionality
  • Narrative Coherence: Treats scenarios as stories about people rather than spatial puzzles
  • Social Reasoning: Gets distracted by interpersonal dynamics and preferences
  • Natural Language Fluency: Focuses on grammatical subjects (people) rather than spatial objects (cars)

Smaller Models often perform better because they:

  • Process More Mechanically: Treat sentences as state transformation commands
  • Exhibit Less Anthropomorphism: Don't attribute complex intentions to people
  • Focus on Literal Operations: Parse "gets into car #3" as position update, not personal choice
  • Resist Social Distractors: Less susceptible to confounding interpersonal statements

Confounding Statements as Mental Model Reinforcement

The confounding statements are strategically designed to strengthen the wrong mental model:

"Alice really likes the red Toyota"     → Reinforces Alice as agent with preferences
"Bob talks to Claire about the BMW"     → Creates irrelevant social dynamics
"Dave wishes he had the green Honda"    → Suggests emotional investment in cars

Models using person-centric state representation are more vulnerable to these distractors because the confounding statements feel semantically relevant to their chosen mental model.

Counter-Intuitive Scaling Effects Explained

"Bigger Models Perform Worse": Sophisticated language understanding becomes a liability when the task requires ignoring natural narrative structure in favor of mechanical spatial tracking.

"Simple Operations Are Hard": Basic commands like "Alice gets into car #3" appear to be about Alice's choices but actually require car-centric position updates.

"Confounding Helps Some Models": Models already using car-centric representation are unaffected by person-focused distractors, while person-centric models get increasingly confused.

This task functions as a cognitive architecture probe that measures a model's ability to resist anthropomorphic bias and select appropriate state representations for spatial reasoning problems disguised as social scenarios.

Test Case Generation

Algorithm Overview

The generator creates challenging spatial reasoning scenarios through a systematic process (a simplified sketch follows the list):

  1. Car Generation: Create unique cars with color and brand attributes in sequential positions
  2. Person Assignment: Place people in initial car positions (multiple people per car allowed)
  3. Operation Selection: Choose from 8 operation types based on enabled operation mask
  4. Movement Execution: Apply spatial, attribute-based, or person-relative movements
  5. Confounding Injection: Insert operation-like statements that don't change state
  6. Query Generation: Select random person for final position inquiry
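
A self-contained toy version of this loop, restricted to POSITION_ABSOLUTE moves and a single confounder template so the control flow is concrete (the real generator's helpers, templates, and output format differ):

import random

COLORS = ["red", "blue", "green", "black", "white", "silver", "yellow", "orange"]
BRANDS = ["Toyota", "BMW", "Honda", "Ford", "Audi", "Mercedes", "Nissan", "Hyundai"]
NAMES = ["Alice", "Bob", "Claire", "Dave", "Eve", "Frank"]

def generate_case(num_cars=4, num_people=3, max_moves=2, confounding_count=1):
    # 1. Cars with unique color-brand pairs in sequential positions.
    pairs = random.sample([(c, b) for c in COLORS for b in BRANDS], num_cars)
    cars = [{"position": i + 1, "color": c, "brand": b} for i, (c, b) in enumerate(pairs)]
    # 2. Initial person-to-car assignments (several people may share a car).
    people = NAMES[:num_people]
    position = {p: random.randint(1, num_cars) for p in people}
    # 3-4. Operations (POSITION_ABSOLUTE only here), applied as state updates.
    statements = []
    for _ in range(max_moves):
        p = random.choice(people)
        position[p] = random.randint(1, num_cars)
        statements.append(f"{p} gets into car #{position[p]}")
    # 5. Confounders slotted in at random points; operation order is preserved.
    for _ in range(confounding_count):
        p, car = random.choice(people), random.choice(cars)
        statements.insert(random.randint(0, len(statements)),
                          f"{p} really likes the {car['color']} {car['brand']}")
    # 6. Query a random person's final position.
    queried = random.choice(people)
    return cars, statements, queried, position[queried]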

Operation Type System

The system implements 8 distinct operation types using power-of-2 encoding for flexible combination:

| Operation Type     | Bitmask | Description                                   | Example                                                                                        |
|--------------------|---------|-----------------------------------------------|------------------------------------------------------------------------------------------------|
| POSITION_ABSOLUTE  | 1       | Direct car number reference                   | "gets into car #3"                                                                             |
| POSITION_REFERENCE | 2       | Semantic/ordinal car reference                | "gets into the last car", "gets into the second car from the front"                            |
| POSITION_RELATIVE  | 4       | Distance-based movement from current position | "moves two cars ahead", "moves one car back"                                                   |
| ATTRIBUTE_ABSOLUTE | 32      | Direct attribute-based targeting              | "gets into the blue BMW", "gets into the Toyota"                                               |
| ATTRIBUTE_RELATIVE | 64      | Directional movement with attribute filtering | "walks ahead, gets into the first red car", "walks back, gets into the second BMW"             |
| PERSON_DIRECT      | 1024    | Direct person-to-person reference             | "gets into Bob's car"                                                                          |
| PERSON_RELATIVE    | 2048    | Position-offset from another person           | "moves to the car behind Alice", "moves two cars in front of Bob"                              |
| PERSON_EMPTY       | 4096    | Occupancy-based movement with filtering       | "walks ahead, gets into the second empty car", "walks back, gets into the first occupied car"  |
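
Because the values are powers of two, an operation_mask is simply a bitwise OR of the enabled types. A sketch of this encoding using a Python IntFlag that mirrors the table above (the project's actual enum definition may differ):

from enum import IntFlag

class OperationType(IntFlag):
    POSITION_ABSOLUTE = 1
    POSITION_REFERENCE = 2
    POSITION_RELATIVE = 4
    ATTRIBUTE_ABSOLUTE = 32
    ATTRIBUTE_RELATIVE = 64
    PERSON_DIRECT = 1024
    PERSON_RELATIVE = 2048
    PERSON_EMPTY = 4096

# Enable only the two simplest positional operations:
mask = OperationType.POSITION_ABSOLUTE | OperationType.POSITION_RELATIVE  # value 5
assert OperationType.POSITION_RELATIVE in mask
assert OperationType.PERSON_EMPTY not in mask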

Car Attribute System

Cars are generated with unique color-brand combinations to ensure unambiguous attribute references:

Colors: red, blue, green, black, white, silver, yellow, orange (8 options)
Brands: Toyota, BMW, Honda, Ford, Audi, Mercedes, Nissan, Hyundai (8 options)

Each car is guaranteed to have a unique color-brand pairing within a test case to prevent ambiguous attribute references.

Operation Complexity Analysis

Spatial Reasoning Hierarchy

Level 1 - Direct Reference: POSITION_ABSOLUTE
  • Requires: Basic number-to-position mapping
  • Example: "gets into car #3"

Level 2 - Semantic Reference: POSITION_REFERENCE
  • Requires: Understanding semantic position markers (first, last, middle)
  • Example: "gets into the last car", "gets into the third car from the back"

Level 3 - Relative Movement: POSITION_RELATIVE
  • Requires: Current position tracking + directional calculation
  • Example: "moves two cars ahead"
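
Level 3 is the first point where the model must combine a stored position with an offset. A sketch of that resolution step (the clamping behavior at the ends of the row is an assumption; the generator may instead avoid producing out-of-range moves):

def resolve_relative_move(current: int, offset: int, num_cars: int) -> int:
    # "moves two cars ahead" -> offset = +2; "moves one car back" -> offset = -1.
    # Positions are 1-indexed; clamp to the row as a defensive assumption.
    return max(1, min(num_cars, current + offset))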

Attribute Processing Hierarchy

Level 1 - Direct Lookup: ATTRIBUTE_ABSOLUTE
  • Requires: Attribute-to-position mapping with uniqueness handling
  • Example: "gets into the red Toyota"

Level 2 - Filtered Search: ATTRIBUTE_RELATIVE
  • Requires: Directional movement + attribute filtering + ordinal counting
  • Example: "walks ahead, gets into the second BMW he sees"

Reference Resolution Hierarchy

Level 1 - Person Lookup: PERSON_DIRECT
  • Requires: Person-to-position mapping
  • Example: "gets into Bob's car"

Level 2 - Offset Calculation: PERSON_RELATIVE
  • Requires: Person lookup + positional offset + boundary checking
  • Example: "moves two cars behind Alice"

Level 3 - State-Aware Filtering: PERSON_EMPTY
  • Requires: Global occupancy tracking + directional search + filtering
  • Example: "walks ahead, gets into the second empty car"
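
ATTRIBUTE_RELATIVE and PERSON_EMPTY share the same underlying pattern: walk in one direction from the current position and take the n-th car satisfying a predicate. A generic sketch of that search (the function and its signature are illustrative, not taken from the implementation):

from typing import Callable, Optional

def find_nth_matching(num_cars: int, current: int, direction: int, n: int,
                      matches: Callable[[int], bool]) -> Optional[int]:
    # direction = +1 walks ahead (toward higher car numbers), -1 walks back.
    count, pos = 0, current + direction
    while 1 <= pos <= num_cars:
        if matches(pos):
            count += 1
            if count == n:
                return pos
        pos += direction
    return None  # fewer than n matching cars in that direction

# "walks ahead, gets into the second empty car":
#   find_nth_matching(num_cars, current, +1, 2, lambda p: not occupants[p])
# "walks back, gets into the first red car":
#   find_nth_matching(num_cars, current, -1, 1, lambda p: cars[p].color == "red")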

Confounding Statement System

Purpose and Design

Confounding statements mimic the structure and vocabulary of actual movement operations but do not change any person's position. These statements test selective attention and the model's ability to distinguish between state-changing operations and similar-sounding observations.

Template Categories

Observation Statements

Statements that appear to involve cars and movement but do not change anyone's position:

| Category               | Examples                                                                                                              |
|------------------------|-----------------------------------------------------------------------------------------------------------------------|
| Car Inspection         | "Alice notices the red car", "Bob looks at the BMW", "Claire checks out the Toyota nearby"                            |
| Purchase Consideration | "Alice considers buying a blue Honda", "Bob thinks about getting the white car", "Claire evaluates the Ford"          |
| Social Car Discussion  | "Alice talks to Bob about the green car", "Bob shows Claire the Mercedes", "Alice and Bob discuss the red BMW"        |
| Car Preferences        | "Alice really likes the silver Audi", "Bob prefers the Honda over other cars", "Claire wishes she had the black car"  |
| Social Compliments     | "Alice compliments Bob on the Toyota", "Bob admires Claire's taste in the blue car", "Claire praises the Mercedes"    |

Insertion Strategy

Confounding statements are randomly shuffled with actual operations to create natural narrative flow while testing the model's ability to maintain focus on state-changing operations only.

Configuration Parameters

Generation Schema (CarsGenerationParams)

from pydantic import BaseModel

class CarsGenerationParams(BaseModel):
    count: int                   # Number of test cases to generate (> 0)
    num_cars: int                # Number of cars in sequence (3-8)
    num_people: int              # Number of people (2-6)
    max_moves: int               # Number of movement operations (≥ 1)
    operation_mask: int          # Bitmask of allowed operation types (default: 4095)
    multi_attribute_prob: float  # Probability of using color+brand (0.0-1.0)
    max_filter_distance: int     # Maximum ordinal filter distance (2-5)
    confounding_count: int       # Number of confounding statements (≥ 0)

Standard Complexity Progression:

  • num_cars: [3, 4, 5, 6] - Spatial complexity
  • num_people: [2, 3, 4] - State tracking load
  • max_moves: [1, 2, 3, 4] - Sequential reasoning depth
  • operation_mask: Subset of 8 operation types - Reasoning complexity
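
For illustration, a mid-complexity configuration might be constructed like this (field values are arbitrary examples chosen within the documented ranges):

params = CarsGenerationParams(
    count=50,                   # 50 test cases
    num_cars=5,
    num_people=3,
    max_moves=3,
    operation_mask=1 | 4 | 32,  # POSITION_ABSOLUTE, POSITION_RELATIVE, ATTRIBUTE_ABSOLUTE
    multi_attribute_prob=0.5,
    max_filter_distance=2,
    confounding_count=2,
)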

Result Schema (CarsTestCaseResult)

from typing import Dict, List

from pydantic import BaseModel

class CarsTestCaseResult(BaseModel):
    input: str                      # Formatted problem text
    target: str                     # Final car position (1-indexed)
    response_enum: List[str]        # Valid car position choices
    cars: List[Dict[str, str]]      # Car attributes and positions
    people: List[str]               # List of people names
    initial_assignments: List[int]  # Initial car positions (0-indexed)
    final_assignments: List[int]    # Final car positions (0-indexed)
    operations_performed: List[int] # OperationType bitmask values used
    query_person_index: int         # Index of queried person
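
To make the field semantics concrete, the first example in the next section would be recorded roughly as follows. This is an inferred illustration: the key names inside cars are assumptions, and the full problem text is abbreviated.

example_result = CarsTestCaseResult(
    input="Alice, Bob, and Claire are at a car lot with 4 cars in a row. ...",  # abbreviated
    target="2",                      # Alice ends up in car #2
    response_enum=["1", "2", "3", "4"],
    cars=[                           # key names assumed for illustration
        {"position": "1", "color": "red", "brand": "Toyota"},
        {"position": "2", "color": "blue", "brand": "BMW"},
        {"position": "3", "color": "green", "brand": "Honda"},
        {"position": "4", "color": "black", "brand": "Ford"},
    ],
    people=["Alice", "Bob", "Claire"],
    initial_assignments=[0, 2, 3],   # 0-indexed: cars #1, #3, #4
    final_assignments=[1, 3, 3],     # 0-indexed: cars #2, #4, #4
    operations_performed=[1, 4],     # POSITION_ABSOLUTE, POSITION_RELATIVE
    query_person_index=0,            # Alice
)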

Example Test Cases

Basic Position Tracking (num_cars=4, num_people=3, max_moves=2, operations=[POSITION_ABSOLUTE, POSITION_RELATIVE])

Alice, Bob, and Claire are at a car lot with 4 cars in a row.
Car lineup:
  Car #1 (first): red Toyota
  Car #2: blue BMW
  Car #3: green Honda
  Car #4 (last): black Ford
Initial positions: Alice is in car #1, Bob is in car #3, Claire is in car #4.

Alice gets into car #2
Bob moves one car ahead

Which car number is Alice in at the end?

State Evolution:

  • Initial: Alice→car #1, Bob→car #3, Claire→car #4
  • After operation 1: Alice→car #2, Bob→car #3, Claire→car #4
  • After operation 2: Alice→car #2, Bob→car #4, Claire→car #4

Expected Answer: 2

Attribute-Based Navigation (num_cars=5, operations=[ATTRIBUTE_ABSOLUTE, ATTRIBUTE_RELATIVE])

Alice, Bob, and Claire are at a car lot with 5 cars in a row.
Car lineup:
  Car #1 (first): red Toyota
  Car #2: blue BMW  
  Car #3 (middle): green Honda
  Car #4: black Ford
  Car #5 (last): white Audi
Initial positions: Alice is in car #1, Bob is in car #2, Claire is in car #5.

Alice gets into the green Honda
Bob walks ahead and gets into the first white car he sees

Which car number is Bob in at the end?

State Evolution:

  • Initial: Alice→car #1, Bob→car #2, Claire→car #5
  • After operation 1: Alice→car #3 (green Honda), Bob→car #2, Claire→car #5
  • After operation 2: Alice→car #3, Bob→car #5 (first white car ahead from #2), Claire→car #5

Expected Answer: 5

Complex Multi-Operation with Confounding (operations=[PERSON_DIRECT, PERSON_EMPTY], confounding_count=2)

Alice, Bob, and Claire are at a car lot with 4 cars in a row.
Car lineup:
  Car #1 (first): red Toyota
  Car #2: blue BMW
  Car #3: green Honda  
  Car #4 (last): black Ford
Initial positions: Alice is in car #1, Bob is in car #3.

Alice gets into Bob's car
Bob really likes the red Toyota
Claire walks ahead and gets into the first empty car
Alice talks to Bob about the blue BMW

Which car number is Claire in at the end?

State Evolution (ignoring confounding statements):

  • Initial: Alice→car #1, Bob→car #3, Claire→unassigned
  • After operation 1: Alice→car #3 (Bob's car), Bob→car #3, Claire→unassigned
  • After operation 2: Alice→car #3, Bob→car #3, Claire→car #1 (first empty car ahead from start)

Expected Answer: 1

Cognitive Skills Tested

Primary Cognitive Capabilities

  • Spatial Reasoning: Understanding positional relationships, directions, and distances
  • Attribute Processing: Mapping semantic properties to spatial locations
  • Reference Resolution: Resolving person-based and occupancy-based references
  • Sequential State Tracking: Maintaining accurate position mappings through transformations
  • Working Memory: Handling multiple people, cars, and attributes simultaneously

Executive Function Skills

  • Selective Attention: Filtering operational statements from confounding observations
  • Cognitive Flexibility: Switching between different reference frames (absolute, relative, attribute-based)
  • Inhibitory Control: Ignoring irrelevant information while maintaining task focus

Applications

This test manifold evaluates capabilities essential for:

  • Spatial Navigation: Understanding movement in structured environments
  • Multi-Modal Reasoning: Combining positional, semantic, and relational information
  • State Management: Tracking dynamic systems with multiple entities and properties
  • Instruction Following: Executing complex commands with multiple reference types
  • Attention Control: Maintaining focus on relevant operations under cognitive load