Shuffle Tracking Test Manifold¶

Overview¶

The Shuffle Tracking test manifold evaluates a model's ability to maintain state through sequential transformations, requiring precise working memory and logical tracking capabilities. Models must track multiple entities through a series of swaps while filtering out irrelevant confounding information, testing core executive function and attention control.

Task Description¶

Models are presented with an initial assignment of items to people, followed by a sequence of pairwise swaps between individuals. After all transformations are complete, the model must identify which item a specific person currently possesses. The task requires maintaining an accurate mental model while ignoring deliberately inserted distracting statements.

Key Features:
- Sequential State Tracking: Maintaining accurate item-person mappings through multiple swaps
- Confounding Resistance: Filtering out irrelevant interpersonal information that doesn't affect item ownership
- Multi-Domain Generalization: Consistent performance across different contextual scenarios
- Anchor Format Variation: Processing information with different organizational structures
- Working Memory Load: Handling increasing complexity through longer sequences and more entities

Test Case Generation¶

Algorithm Overview¶

The generator creates challenging state-tracking scenarios through a systematic process:

1. Domain Selection: Choose from 5 thematic contexts (dancing, books, soccer, gifts, balls)
2. Entity Generation: Sample people names and domain-appropriate items with optional adjectives
3. Initial State: Create one-to-one mapping between people and items
4. Swap Sequence: Generate random pairwise swaps up to specified depth
5. Confounding Injection: Insert irrelevant statements about interpersonal relationships
6. Anchor Formatting: Apply organizational markers (numeric, alphabetic, Roman, etc.)
7. Query Generation: Select random person for final state inquiry
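
The core of this process (initial state, swap sequence, query) can be sketched in a few lines. This is only an illustration, not the generator's actual code: the function name `generate_case` and its return shape are hypothetical, and it assumes exactly `max_depth` swaps are performed.

```python
import random

def generate_case(people, items, max_depth, rng):
    """Hypothetical sketch: build one shuffle case and its correct answer."""
    # Initial state: a one-to-one mapping from people to distinct items.
    holdings = dict(zip(people, rng.sample(items, len(people))))
    initial = dict(holdings)

    # Swap sequence: each swap picks two distinct people (assumed: exactly max_depth swaps).
    swaps = []
    for _ in range(max_depth):
        a, b = rng.sample(people, 2)
        holdings[a], holdings[b] = holdings[b], holdings[a]
        swaps.append((a, b))

    # Query: pick one person and record what they hold after all swaps.
    query_person = rng.choice(people)
    return initial, swaps, query_person, holdings[query_person]
```

Passing a seeded `random.Random` instance as `rng` keeps generated cases reproducible.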

Domain Database¶

The system includes 5 thematic domains, each with specialized vocabulary and context:

Domain	Context	Items	Adjectives	Action Verb
Dancing	Square dance partner switching	18 partner names (Patrick, Jamie, Lola, Melissa, etc.)	energetic, graceful, skilled, experienced, enthusiastic, talented	"switch partners"
Books	Literary trading among friends	18 classic titles (Catch-22, Frankenstein, The Great Gatsby, etc.)	thick, thin, worn, new, heavy, light, hardcover, paperback	"swap books"
Soccer	Position changes during a match	18 field positions (goalkeeper, striker, midfielder, benchwarmer, etc.)	starting, backup, primary, secondary, key, veteran	"trade positions"
Gifts	White elephant gift exchange	12 present types (ball, box, vase, toy, sculpture, etc.)	orange, pink, black, gold, green, brown, silver, crystal, wooden, metal	"swap presents"
Balls	Game with colored ball trading	18 colors (red, black, blue, yellow, purple, etc.)	round, bouncy, smooth, textured, inflated, heavy, lightweight, shiny	"swap balls"
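
As a rough sketch of how items might be drawn from a domain, the snippet below samples distinct nouns and prepends an adjective with probability `adjective_prob`. The helper name `sample_items` is an assumption for illustration; the word lists are abbreviated from the Gifts row above.

```python
import random

def sample_items(nouns, adjectives, count, adjective_prob, rng):
    """Hypothetical sketch: pick `count` distinct nouns, optionally adding adjectives."""
    items = []
    for noun in rng.sample(nouns, count):  # distinct nouns keep items unique
        if rng.random() < adjective_prob:
            items.append(f"{rng.choice(adjectives)} {noun}")
        else:
            items.append(noun)
    return items

# Abbreviated vocabulary taken from the Gifts domain.
gift_nouns = ["ball", "box", "vase", "toy", "sculpture"]
gift_adjectives = ["orange", "pink", "black", "gold", "green"]
items = sample_items(gift_nouns, gift_adjectives, 3, 0.5, random.Random(42))
```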

Anchor Format System¶

Format Types (AnchorFormat Enum)¶

The generator supports 9 different organizational markers to test format-independent comprehension:

Format	Description	Example Pattern
NONE	No markers - plain sequential text	(no markers)
NUMERIC	Standard numbering	1, 2, 3, ...
ASCII	Consecutive ASCII characters starting at A (continues past Z into punctuation)	A, B, C, ..., Z, [, \, ...
ALPHA	Cycling alphabet (wraps after Z)	A, B, C, ..., Z, A, B, ...
ROMAN	Roman numerals	I, II, III, IV, ...
SKIP_2	Even numbers	2, 4, 6, 8, ...
REVERSE	Countdown from total	5, 4, 3, 2, 1
HEX	Hexadecimal notation	0x01, 0x02, 0x03, ...
ELEMENTS	Periodic table symbols	H, He, Li, Be, ...
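
One way to produce the labels in the table is a single function keyed on the format name. This is a sketch under assumptions: `anchor_label` and `to_roman` are hypothetical helpers, formats are passed as plain strings rather than the real `AnchorFormat` enum, the element list is truncated to ten symbols, and lowercase hex digits are a guess.

```python
ELEMENTS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]  # truncated list

def to_roman(n):
    """Convert a positive integer to a Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = ""
    for value, symbol in pairs:
        while n >= value:
            out += symbol
            n -= value
    return out

def anchor_label(fmt, index, total):
    """Label for the step at 0-based `index` out of `total` steps."""
    if fmt == "NONE":
        return ""
    if fmt == "NUMERIC":
        return str(index + 1)
    if fmt == "ASCII":
        return chr(ord("A") + index)        # runs past "Z" into "[", "\\", ...
    if fmt == "ALPHA":
        return chr(ord("A") + index % 26)   # wraps back to "A" after "Z"
    if fmt == "ROMAN":
        return to_roman(index + 1)
    if fmt == "SKIP_2":
        return str(2 * (index + 1))
    if fmt == "REVERSE":
        return str(total - index)           # counts down from the total
    if fmt == "HEX":
        return f"0x{index + 1:02x}"
    if fmt == "ELEMENTS":
        return ELEMENTS[index]
    raise ValueError(f"unknown anchor format: {fmt}")
```

The only difference between ASCII and ALPHA is the modulo: the 27th step is labeled `[` under ASCII but `A` under ALPHA.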

Anchor Formatting Example¶

NUMERIC Format:

1. First, Alice and Bob swap books
2. Then, Claire and Dave swap books  
3. Finally, Eve and Frank swap books

ROMAN Format:

I. First, Alice and Bob swap books
II. Then, Claire and Dave swap books
III. Finally, Eve and Frank swap books

ELEMENTS Format:

H. First, Alice and Bob swap books
He. Then, Claire and Dave swap books
Li. Finally, Eve and Frank swap books

Confounding Statement System¶

Purpose and Design¶

Confounding statements test the model's ability to maintain task focus by inserting irrelevant interpersonal information that doesn't affect item ownership. These statements are strategically placed throughout the swap sequence to increase cognitive load.

Template Categories¶

Relationship Statements¶

"{person1} really likes {person2}"
"{person1} and {person2} don't get along great"  
"{person1} is friends with {person2}"
"{person1} has known {person2} for years"

Professional Dynamics¶

"{person1} and {person2} work well together"
"{person1} and {person2} are colleagues"
"{person1} trusts {person2}"
"{person1} and {person2} communicate effectively"

Personal Qualities¶

"{person1} thinks {person2} is funny"
"{person1} respects {person2}"
"{person1} admires {person2}"
"{person1} supports {person2}"

Insertion Strategy¶

Confounding statements are randomly inserted between swap descriptions, creating natural-sounding but irrelevant narrative flow that challenges selective attention.
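
The insertion step might look like the sketch below. It assumes confounders are placed only in the gaps between swaps, so the first and last sentences are always swaps when there are at least two. The recorded indices are assumed to be 0-based positions in the merged sentence list; the source does not specify the exact meaning of `confounding_indices`. The function name is hypothetical.

```python
import random

def insert_confounders(swap_sentences, confounders, rng):
    """Hypothetical sketch: place each confounder after a randomly chosen swap."""
    # Gap g sits directly after the g-th swap (1-based); the final swap stays last
    # whenever there are at least two swaps.
    last_gap = max(1, len(swap_sentences) - 1)
    gaps = sorted(rng.randint(1, last_gap) for _ in confounders)
    pending = list(zip(gaps, confounders))

    merged, indices = [], []
    for i, sentence in enumerate(swap_sentences, start=1):
        merged.append(sentence)
        while pending and pending[0][0] == i:
            indices.append(len(merged))       # position the confounder will occupy
            merged.append(pending.pop(0)[1])
    return merged, indices

swaps = ["Alice and Bob trade positions.",
         "Dave and Eve trade positions.",
         "Claire and Alice trade positions."]
confounders = [
    "{person1} really likes {person2}.".format(person1="Alice", person2="Claire"),
    "{person1} and {person2} work well together.".format(person1="Bob", person2="Claire"),
]
merged, indices = insert_confounders(swaps, confounders, random.Random(7))
```

Removing the sentences at `indices` from `merged` recovers the original swap order, which is what a correct solver effectively has to do.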

Configuration Parameters¶

Generation Schema (`ShuffleGenerationParams`)¶

from pydantic import BaseModel

# AnchorFormat is the enum described under "Anchor Format System".
class ShuffleGenerationParams(BaseModel):
    count: int               # Number of test cases to generate (> 0)
    length: int              # Number of people/entities (≥ 3)
    max_depth: int           # Number of swaps to perform (≥ 1)
    confounding_count: int   # Number of confounding statements (≥ 0)
    anchor: AnchorFormat     # Anchor format type (enum)
    adjective_prob: float    # Probability of adding adjectives (0.0-1.0)
    anchor_prefix: str       # Prefix for anchor markers (default: "\n")
    anchor_suffix: str       # Suffix for anchor markers (default: ".")

Standard Grid Configuration:
- length: [4, 5, 6] - Number of participants
- max_depth: [2, 3, 4] - Number of sequential swaps
- confounding_count: [0, 1, 2] - Irrelevant statements inserted
- Generates 27 different complexity combinations (3 × 3 × 3)
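
The 27 combinations are simply the Cartesian product of the three value lists. A minimal sketch, in which the `GRID` dictionary and `configs` list are illustrative names rather than part of the tool:

```python
from itertools import product

GRID = {
    "length": [4, 5, 6],
    "max_depth": [2, 3, 4],
    "confounding_count": [0, 1, 2],
}

# One dict per combination, e.g. {"length": 4, "max_depth": 2, "confounding_count": 0}.
configs = [dict(zip(GRID, values)) for values in product(*GRID.values())]
```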

Result Schema (`ShuffleTestCaseResult`)¶

from typing import List, Tuple

from pydantic import BaseModel

class ShuffleTestCaseResult(BaseModel):
    input: str                       # Formatted problem text
    target: str                      # Correct answer (item name)
    domain: str                      # Domain name used
    people: List[str]                # List of people names
    items: List[str]                 # List of items (with adjectives if applied)
    swaps: List[Tuple[str, str]]     # List of (person1, person2) swap pairs
    query_person: str                # Person being queried about
    response_enum: List[str]         # All possible items, for multiple choice
    confounding_indices: List[int]   # Indices where confounding statements were inserted

Example Test Cases¶

Basic Tracking (domain=BOOKS, anchor=NONE, length=4, max_depth=3, confounding_count=0)¶

Alice, Bob, Claire, and Dave are friends and avid readers who occasionally trade books. At the start of the semester, they each buy one new book: Alice gets Catch-22, Bob gets Frankenstein, Claire gets The Pearl, and Dave gets Moby Dick.

As the semester proceeds, they start trading around the new books. First, Alice and Claire swap books. Then, Bob and Dave swap books. Finally, Claire and Bob swap books.

At the end of the semester, which book does Alice have?

State Evolution:
- Initial: Alice→Catch-22, Bob→Frankenstein, Claire→The Pearl, Dave→Moby Dick
- After swap 1: Alice→The Pearl, Bob→Frankenstein, Claire→Catch-22, Dave→Moby Dick
- After swap 2: Alice→The Pearl, Bob→Moby Dick, Claire→Catch-22, Dave→Frankenstein
- After swap 3: Alice→The Pearl, Bob→Catch-22, Claire→Moby Dick, Dave→Frankenstein

Expected Answer: The Pearl
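
The expected answer can be checked by replaying the swaps over the initial mapping. The helper below is a small illustrative sketch (`apply_swaps` is not a documented function of the generator):

```python
def apply_swaps(initial, swaps):
    """Return the final person-to-item mapping after applying swaps in order."""
    state = dict(initial)  # copy so the initial mapping is left untouched
    for a, b in swaps:
        state[a], state[b] = state[b], state[a]
    return state

initial = {"Alice": "Catch-22", "Bob": "Frankenstein",
           "Claire": "The Pearl", "Dave": "Moby Dick"}
swaps = [("Alice", "Claire"), ("Bob", "Dave"), ("Claire", "Bob")]

final = apply_swaps(initial, swaps)
print(final["Alice"])  # The Pearl
```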

Complex Tracking with Confounding (domain=SOCCER, anchor=NONE, length=5, max_depth=3, confounding_count=2)¶

Alice, Bob, Claire, Dave, and Eve are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing goalkeeper, Bob is playing striker, Claire is playing midfielder, Dave is playing defender, and Eve is playing fullback.

As the game progresses, pairs of players occasionally swap positions. First, Alice and Bob trade positions. Alice really likes Claire. Then, Dave and Eve trade positions. Bob and Claire work well together. Finally, Claire and Alice trade positions.

At the end of the match, what position is Dave playing?

State Evolution (ignoring confounding statements):
- Initial: Alice→goalkeeper, Bob→striker, Claire→midfielder, Dave→defender, Eve→fullback
- After swap 1: Alice→striker, Bob→goalkeeper, Claire→midfielder, Dave→defender, Eve→fullback
- After swap 2: Alice→striker, Bob→goalkeeper, Claire→midfielder, Dave→fullback, Eve→defender
- After swap 3: Alice→midfielder, Bob→goalkeeper, Claire→striker, Dave→fullback, Eve→defender

Expected Answer: fullback
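
To illustrate why the confounding statements do not change the answer, here is a sketch of a reference solver that extracts only the swap sentences from the narrative and replays them. The regular expression is an assumption built from the five action verbs in the domain table; it is not part of the test generator.

```python
import re

# Matches "<Name> and <Name> <action verb phrase>"; confounders such as
# "Bob and Claire work well together" use other verbs and are skipped.
SWAP_PATTERN = re.compile(
    r"\b([A-Z][a-z]+) and ([A-Z][a-z]+) "
    r"(?:switch partners|swap books|trade positions|swap presents|swap balls)\b"
)

def extract_swaps(text):
    """Return (person1, person2) pairs for every swap sentence in `text`."""
    return SWAP_PATTERN.findall(text)

def apply_swaps(initial, swaps):
    state = dict(initial)
    for a, b in swaps:
        state[a], state[b] = state[b], state[a]
    return state

narrative = (
    "First, Alice and Bob trade positions. Alice really likes Claire. "
    "Then, Dave and Eve trade positions. Bob and Claire work well together. "
    "Finally, Claire and Alice trade positions."
)
initial = {"Alice": "goalkeeper", "Bob": "striker", "Claire": "midfielder",
           "Dave": "defender", "Eve": "fullback"}

final = apply_swaps(initial, extract_swaps(narrative))
print(final["Dave"])  # fullback
```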

Anchor Format Variation (domain=GIFTS, anchor=ROMAN, length=3, max_depth=2, confounding_count=0)¶

Alice, Bob, and Claire are holding a white elephant gift exchange. At the start of the event, they are each holding a present: Alice has gold box, Bob has silver vase, and Claire has wooden toy.

As the event progresses, pairs of people swap presents.
I. First, Alice and Bob swap presents
II. Then, Bob and Claire swap presents

At the end of the event, which present is Claire holding?

State Evolution:
- Initial: Alice→gold box, Bob→silver vase, Claire→wooden toy
- After swap I: Alice→silver vase, Bob→gold box, Claire→wooden toy
- After swap II: Alice→silver vase, Bob→wooden toy, Claire→gold box

Expected Answer: gold box
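
The anchored step lines in this example are consistent with joining `anchor_prefix`, the label, and `anchor_suffix` in front of each sentence. The sketch below assumes that composition and a single space after the suffix; `render_steps` is a hypothetical helper.

```python
def render_steps(sentences, labels, anchor_prefix="\n", anchor_suffix="."):
    """Prefix each step sentence with its anchor label (assumed composition)."""
    return "".join(
        f"{anchor_prefix}{label}{anchor_suffix} {sentence}"
        for label, sentence in zip(labels, sentences)
    )

steps = ["First, Alice and Bob swap presents",
         "Then, Bob and Claire swap presents"]
rendered = render_steps(steps, ["I", "II"])
```

With the default prefix of `"\n"`, each step starts on its own line, which matches the layout shown in the example.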

Cognitive Skills Tested¶

Working Memory: Maintaining multiple item-person mappings simultaneously
Memory Churn: Distinguishing the current state from several similar earlier states still held in working memory
Sequential Processing: Applying transformations in correct temporal order
Selective Attention: Filtering relevant swaps from irrelevant social information
State Tracking: Updating mental models through sequential changes
Switching: Transitioning attention between different people and items

Applications¶

This test manifold evaluates capabilities essential for:

Process Monitoring: Tracking system states through sequential operations
Multi-Step Reasoning: Maintaining context through extended logical chains
Attention Control: Focusing on task-relevant information while filtering distractors
Procedural Following: Executing sequences of instructions with precision
Robustness: Filtering unwanted information while performing a task