Shuffle Tracking Test Manifold¶
Overview¶
The Shuffle Tracking test manifold evaluates a model's ability to maintain state through sequential transformations, requiring precise working memory and logical tracking capabilities. Models must track multiple entities through a series of swaps while filtering out irrelevant confounding information, testing core executive function and attention control.
Task Description¶
Models are presented with an initial assignment of items to people, followed by a sequence of pairwise swaps between individuals. After all transformations are complete, the model must identify which item a specific person currently possesses. The task requires maintaining an accurate mental model while ignoring deliberately inserted distracting statements.
Key Features:
- Sequential State Tracking: Maintaining accurate item-person mappings through multiple swaps
- Confounding Resistance: Filtering out irrelevant interpersonal information that doesn't affect item ownership
- Multi-Domain Generalization: Consistent performance across different contextual scenarios
- Anchor Format Variation: Processing information with different organizational structures
- Working Memory Load: Handling increasing complexity through longer sequences and more entities
Test Case Generation¶
Algorithm Overview¶
The generator creates challenging state-tracking scenarios through a systematic process:
- Domain Selection: Choose from 5 thematic contexts (dancing, books, soccer, gifts, balls)
- Entity Generation: Sample people names and domain-appropriate items with optional adjectives
- Initial State: Create one-to-one mapping between people and items
- Swap Sequence: Generate random pairwise swaps up to specified depth
- Confounding Injection: Insert irrelevant statements about interpersonal relationships
- Anchor Formatting: Apply organizational markers (numeric, alphabetic, Roman, etc.)
- Query Generation: Select random person for final state inquiry
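The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the generator's actual code: the function name `generate_case`, the shortened `PEOPLE` and `BOOKS` pools, and the keys of the returned dictionary are all hypothetical, and domain text, confounding statements, and anchors are left out.

```python
import random

# Hypothetical, shortened pools for illustration; the real domain database is larger.
PEOPLE = ["Alice", "Bob", "Claire", "Dave", "Eve", "Frank"]
BOOKS = ["Catch-22", "Frankenstein", "The Great Gatsby",
         "The Pearl", "Moby Dick", "Ulysses"]


def generate_case(length, max_depth, rng):
    """Build one case: initial mapping, swap list, query person, and target item."""
    people = PEOPLE[:length]
    items = rng.sample(BOOKS, length)        # distinct items, one per person
    initial = dict(zip(people, items))       # one-to-one initial state

    # Each swap picks two different people at random.
    swaps = [tuple(rng.sample(people, 2)) for _ in range(max_depth)]

    # Apply the swaps in order to obtain the final state.
    final = dict(initial)
    for a, b in swaps:
        final[a], final[b] = final[b], final[a]

    query_person = rng.choice(people)
    return {
        "initial": initial,
        "swaps": swaps,
        "final": final,
        "query_person": query_person,
        "target": final[query_person],
    }


case = generate_case(length=4, max_depth=3, rng=random.Random(0))
print(case["swaps"], case["query_person"], case["target"])
```

Passing an explicit `random.Random` instance keeps a case reproducible from its seed, which is convenient when comparing models on identical inputs.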
Domain Database¶
The system includes 5 thematic domains, each with specialized vocabulary and context:
Domain | Context | Items | Adjectives | Action Verb |
---|---|---|---|---|
Dancing | Square dance partner switching | 18 partner names (Patrick, Jamie, Lola, Melissa, etc.) | energetic, graceful, skilled, experienced, enthusiastic, talented | "switch partners" |
Books | Literary trading among friends | 18 classic titles (Catch-22, Frankenstein, The Great Gatsby, etc.) | thick, thin, worn, new, heavy, light, hardcover, paperback | "swap books" |
Soccer | Position changes during a match | 18 field positions (goalkeeper, striker, midfielder, benchwarmer, etc.) | starting, backup, primary, secondary, key, veteran | "trade positions" |
Gifts | White elephant gift exchange | 12 present types (ball, box, vase, toy, sculpture, etc.) | orange, pink, black, gold, green, brown, silver, crystal, wooden, metal | "swap presents" |
Balls | Game with colored ball trading | 18 colors (red, black, blue, yellow, purple, etc.) | round, bouncy, smooth, textured, inflated, heavy, lightweight, shiny | "swap balls" |
Anchor Format System¶
Format Types (AnchorFormat Enum)¶
The generator supports 9 different organizational markers to test format-independent comprehension:
Format | Description | Example Pattern |
---|---|---|
NONE | No markers - plain sequential text | (no markers) |
NUMERIC | Standard numbering | 1, 2, 3, ... |
ASCII | Consecutive ASCII characters starting at A (continues into punctuation after Z) | A, B, C, ..., Z, [, \, ... |
ALPHA | Cycling alphabet (wraps after Z) | A, B, C, ..., Z, A, B, ... |
ROMAN | Roman numerals | I, II, III, IV, ... |
SKIP_2 | Even numbers | 2, 4, 6, 8, ... |
REVERSE | Countdown from total | 5, 4, 3, 2, 1 |
HEX | Hexadecimal notation | 0x01, 0x02, 0x03, ... |
ELEMENTS | Periodic table symbols | H, He, Li, Be, ... |
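One way to produce the label sequences in the table is sketched below. The enum here is a stand-in for the real `AnchorFormat` (its member values are assumptions), and `anchor_labels`, `to_roman`, and the truncated `ELEMENT_SYMBOLS` list are hypothetical helpers. Lowercase hex digits past `0x09` are also an assumption.

```python
from enum import Enum


class AnchorFormat(Enum):
    # Stand-in enum; member names follow the table, values are assumed.
    NONE = "none"
    NUMERIC = "numeric"
    ASCII = "ascii"
    ALPHA = "alpha"
    ROMAN = "roman"
    SKIP_2 = "skip_2"
    REVERSE = "reverse"
    HEX = "hex"
    ELEMENTS = "elements"


# First ten element symbols only; enough for short swap sequences.
ELEMENT_SYMBOLS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]


def to_roman(n):
    """Convert a positive integer to a Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
             (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
             (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def anchor_labels(fmt, total):
    """Return `total` labels for `fmt`; NONE yields empty strings."""
    idx = range(total)
    if fmt is AnchorFormat.NONE:
        return [""] * total
    if fmt is AnchorFormat.NUMERIC:
        return [str(i + 1) for i in idx]
    if fmt is AnchorFormat.ASCII:
        return [chr(ord("A") + i) for i in idx]        # runs past Z into '[', '\\', ...
    if fmt is AnchorFormat.ALPHA:
        return [chr(ord("A") + i % 26) for i in idx]   # wraps back to A after Z
    if fmt is AnchorFormat.ROMAN:
        return [to_roman(i + 1) for i in idx]
    if fmt is AnchorFormat.SKIP_2:
        return [str(2 * (i + 1)) for i in idx]
    if fmt is AnchorFormat.REVERSE:
        return [str(total - i) for i in idx]
    if fmt is AnchorFormat.HEX:
        return [f"0x{i + 1:02x}" for i in idx]
    if fmt is AnchorFormat.ELEMENTS:
        if total > len(ELEMENT_SYMBOLS):
            raise ValueError("not enough element symbols in this sketch")
        return ELEMENT_SYMBOLS[:total]
    raise ValueError(f"unknown format: {fmt}")


print(anchor_labels(AnchorFormat.ROMAN, 4))  # ['I', 'II', 'III', 'IV']
```

The difference between ASCII and ALPHA only shows up after the 26th label: ASCII keeps incrementing the character code, while ALPHA takes the index modulo 26.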
Anchor Formatting Example¶
NUMERIC Format:
1. First, Alice and Bob swap books
2. Then, Claire and Dave swap books
3. Finally, Eve and Frank swap books
ROMAN Format:
I. First, Alice and Bob swap books
II. Then, Claire and Dave swap books
III. Finally, Eve and Frank swap books
ELEMENTS Format:
H. First, Alice and Bob swap books
He. Then, Claire and Dave swap books
Li. Finally, Eve and Frank swap books
Confounding Statement System¶
Purpose and Design¶
Confounding statements test the model's ability to maintain task focus by inserting irrelevant interpersonal information that doesn't affect item ownership. These statements are strategically placed throughout the swap sequence to increase cognitive load.
Template Categories¶
Relationship Statements¶
"{person1} really likes {person2}"
"{person1} and {person2} don't get along great"
"{person1} is friends with {person2}"
"{person1} has known {person2} for years"
Professional Dynamics¶
"{person1} and {person2} work well together"
"{person1} and {person2} are colleagues"
"{person1} trusts {person2}"
"{person1} and {person2} communicate effectively"
Personal Qualities¶
"{person1} thinks {person2} is funny"
"{person1} respects {person2}"
"{person1} admires {person2}"
"{person1} supports {person2}"
Insertion Strategy¶
Confounding statements are randomly inserted between swap descriptions, creating natural-sounding but irrelevant narrative flow that challenges selective attention.
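A minimal sketch of such an insertion step is shown below. It assumes confounders are placed strictly between the first and last swap sentence, which matches the examples in this document but may not match the generator exactly. The function name `insert_confounders` and the three-template subset are hypothetical.

```python
import random

# A small subset of the templates listed above.
TEMPLATES = [
    "{person1} really likes {person2}",
    "{person1} and {person2} work well together",
    "{person1} respects {person2}",
]


def insert_confounders(swap_sentences, people, count, rng):
    """Insert `count` distractors between swap sentences.

    Returns (sentences, confounding_indices). Needs at least two swap
    sentences so that an interior position exists.
    """
    if len(swap_sentences) < 2:
        raise ValueError("need at least two swap sentences")

    # Tag each sentence so the distractor positions can be recovered later.
    tagged = [(s, False) for s in swap_sentences]
    for _ in range(count):
        p1, p2 = rng.sample(people, 2)
        text = rng.choice(TEMPLATES).format(person1=p1, person2=p2) + "."
        pos = rng.randint(1, len(tagged) - 1)   # never before the first or after the last
        tagged.insert(pos, (text, True))

    sentences = [s for s, _ in tagged]
    indices = [i for i, (_, is_conf) in enumerate(tagged) if is_conf]
    return sentences, indices


swaps = [
    "First, Alice and Bob trade positions.",
    "Then, Dave and Eve trade positions.",
    "Finally, Claire and Alice trade positions.",
]
people = ["Alice", "Bob", "Claire", "Dave", "Eve"]
sentences, indices = insert_confounders(swaps, people, 2, random.Random(1))
print(" ".join(sentences))
print(indices)
```

Recording the indices after all insertions, rather than during them, avoids stale positions when a later distractor lands before an earlier one.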
Configuration Parameters¶
Generation Schema (ShuffleGenerationParams)¶
from pydantic import BaseModel

class ShuffleGenerationParams(BaseModel):
    count: int               # Number of test cases to generate (> 0)
    length: int              # Number of people/entities (≥ 3)
    max_depth: int           # Number of swaps to perform (≥ 1)
    confounding_count: int   # Number of confounding statements (≥ 0)
    anchor: AnchorFormat     # Anchor format type (enum described above)
    adjective_prob: float    # Probability of adding adjectives (0.0-1.0)
    anchor_prefix: str       # Prefix for anchor markers (default: "\n")
    anchor_suffix: str       # Suffix for anchor markers (default: ".")
Standard Grid Configuration:
- length: [4, 5, 6] - Number of participants
- max_depth: [2, 3, 4] - Number of sequential swaps
- confounding_count: [0, 1, 2] - Irrelevant statements inserted
- Generates 27 different complexity combinations
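The count of 27 is the Cartesian product of the three parameter lists (3 × 3 × 3). A short sketch of how the grid could be expanded is given below; the `GRID` dictionary and the `grid_combinations` helper are illustrative names, not part of the documented API.

```python
from itertools import product

# Values taken from the standard grid configuration above.
GRID = {
    "length": [4, 5, 6],
    "max_depth": [2, 3, 4],
    "confounding_count": [0, 1, 2],
}


def grid_combinations(grid):
    """Expand a dict of parameter lists into a list of parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


combos = grid_combinations(GRID)
print(len(combos))   # 27
print(combos[0])     # {'length': 4, 'max_depth': 2, 'confounding_count': 0}
```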
Result Schema (ShuffleTestCaseResult)¶
from typing import List, Tuple

from pydantic import BaseModel

class ShuffleTestCaseResult(BaseModel):
    input: str                      # Formatted problem text
    target: str                     # Correct answer (item name)
    domain: str                     # Domain name used
    people: List[str]               # List of people names
    items: List[str]                # List of items (with adjectives if applied)
    swaps: List[Tuple[str, str]]    # List of (person1, person2) swap pairs
    query_person: str               # Person being queried about
    response_enum: List[str]        # All possible items, used as the multiple-choice options
    confounding_indices: List[int]  # Indices where confounding statements were inserted
Example Test Cases¶
Basic Tracking (domain=BOOKS, anchor=NONE, length=4, max_depth=3, confounding_count=0)¶
Alice, Bob, Claire, and Dave are friends and avid readers who occasionally trade books. At the start of the semester, they each buy one new book: Alice gets Catch-22, Bob gets Frankenstein, Claire gets The Pearl, and Dave gets Moby Dick.
As the semester proceeds, they start trading around the new books. First, Alice and Claire swap books. Then, Bob and Dave swap books. Finally, Claire and Bob swap books.
At the end of the semester, which book does Alice have?
State Evolution:
- Initial: Alice→Catch-22, Bob→Frankenstein, Claire→The Pearl, Dave→Moby Dick
- After swap 1: Alice→The Pearl, Bob→Frankenstein, Claire→Catch-22, Dave→Moby Dick
- After swap 2: Alice→The Pearl, Bob→Moby Dick, Claire→Catch-22, Dave→Frankenstein
- After swap 3: Alice→The Pearl, Bob→Catch-22, Claire→Moby Dick, Dave→Frankenstein
Expected Answer: The Pearl
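The state evolution above can be reproduced by replaying the swaps over a dictionary. The helper name `trace_swaps` is hypothetical; the snippet is only a way to check a worked example by hand, not part of the test harness.

```python
def trace_swaps(initial, swaps):
    """Return the initial mapping followed by one snapshot per swap."""
    state = dict(initial)
    history = [dict(state)]
    for a, b in swaps:
        # A swap exchanges the items held by the two named people.
        state[a], state[b] = state[b], state[a]
        history.append(dict(state))
    return history


initial = {
    "Alice": "Catch-22",
    "Bob": "Frankenstein",
    "Claire": "The Pearl",
    "Dave": "Moby Dick",
}
swaps = [("Alice", "Claire"), ("Bob", "Dave"), ("Claire", "Bob")]

history = trace_swaps(initial, swaps)
for step, snapshot in enumerate(history):
    print(step, snapshot)

print(history[-1]["Alice"])  # The Pearl
```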
Complex Tracking with Confounding (domain=SOCCER, anchor=NONE, length=5, max_depth=3, confounding_count=2)¶
Alice, Bob, Claire, Dave, and Eve are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing goalkeeper, Bob is playing striker, Claire is playing midfielder, Dave is playing defender, and Eve is playing fullback.
As the game progresses, pairs of players occasionally swap positions. First, Alice and Bob trade positions. Alice really likes Claire. Then, Dave and Eve trade positions. Bob and Claire work well together. Finally, Claire and Alice trade positions.
At the end of the match, what position is Dave playing?
State Evolution (ignoring confounding statements):
- Initial: Alice→goalkeeper, Bob→striker, Claire→midfielder, Dave→defender, Eve→fullback
- After swap 1: Alice→striker, Bob→goalkeeper, Claire→midfielder, Dave→defender, Eve→fullback
- After swap 2: Alice→striker, Bob→goalkeeper, Claire→midfielder, Dave→fullback, Eve→defender
- After swap 3: Alice→midfielder, Bob→goalkeeper, Claire→striker, Dave→fullback, Eve→defender
Expected Answer: fullback
Anchor Format Variation (domain=GIFTS, anchor=ROMAN, length=3, max_depth=2, confounding_count=0)¶
Alice, Bob, and Claire are holding a white elephant gift exchange. At the start of the event, they are each holding a present: Alice has gold box, Bob has silver vase, and Claire has wooden toy.
As the event progresses, pairs of people swap presents.
I. First, Alice and Bob swap presents
II. Then, Bob and Claire swap presents
At the end of the event, which present is Claire holding?
State Evolution:
- Initial: Alice→gold box, Bob→silver vase, Claire→wooden toy
- After swap I: Alice→silver vase, Bob→gold box, Claire→wooden toy
- After swap II: Alice→silver vase, Bob→wooden toy, Claire→gold box
Expected Answer: gold box
Cognitive Skills Tested¶
- Working Memory: Maintaining multiple item-person mappings simultaneously
- Memory Churn: Distinguishing the current state from several similar earlier states held in working memory
- Sequential Processing: Applying transformations in correct temporal order
- Selective Attention: Filtering relevant swaps from irrelevant social information
- State Tracking: Updating mental models through sequential changes
- Switching: Transitioning attention between different people and items
Applications¶
This test manifold evaluates capabilities essential for:
- Process Monitoring: Tracking system states through sequential operations
- Multi-Step Reasoning: Maintaining context through extended logical chains
- Attention Control: Focusing on task-relevant information while filtering distractors
- Procedural Following: Executing sequences of instructions with precision
- Robustness: Filtering unwanted information while performing a task