Shuffle Tracking Test Manifold

Overview

The Shuffle Tracking test manifold evaluates a model's ability to maintain state through sequential transformations, requiring precise working memory and logical tracking capabilities. Models must track multiple entities through a series of swaps while filtering out irrelevant confounding information, testing core executive function and attention control.

Task Description

Models are presented with an initial assignment of items to people, followed by a sequence of pairwise swaps between individuals. After all transformations are complete, the model must identify which item a specific person currently possesses. The task requires maintaining an accurate mental model while ignoring deliberately inserted distracting statements.

Key Features:

  • Sequential State Tracking: Maintaining accurate item-person mappings through multiple swaps
  • Confounding Resistance: Filtering out irrelevant interpersonal information that doesn't affect item ownership
  • Multi-Domain Generalization: Consistent performance across different contextual scenarios
  • Anchor Format Variation: Processing information with different organizational structures
  • Working Memory Load: Handling increasing complexity through longer sequences and more entities

Test Case Generation

Algorithm Overview

The generator creates challenging state-tracking scenarios through a systematic process:

  1. Domain Selection: Choose from 5 thematic contexts (dancing, books, soccer, gifts, balls)
  2. Entity Generation: Sample people names and domain-appropriate items with optional adjectives
  3. Initial State: Create one-to-one mapping between people and items
  4. Swap Sequence: Generate random pairwise swaps up to specified depth
  5. Confounding Injection: Insert irrelevant statements about interpersonal relationships
  6. Anchor Formatting: Apply organizational markers (numeric, alphabetic, Roman, etc.)
  7. Query Generation: Select random person for final state inquiry
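The core of steps 2 to 4 and 7 can be sketched in a few lines. This is a minimal illustration, not the generator's actual code: `generate_case`, `BOOKS`, and `NAMES` are hypothetical names, the item and adjective lists are short stand-ins for the real domain database, and the sketch assumes exactly `max_depth` swaps are produced.

```python
import random

# Hypothetical stand-in for one domain entry; the real database holds 5 domains.
BOOKS = {
    "items": ["Catch-22", "Frankenstein", "The Great Gatsby", "The Pearl", "Moby Dick"],
    "adjectives": ["thick", "thin", "worn", "new"],
}
NAMES = ["Alice", "Bob", "Claire", "Dave", "Eve", "Frank"]


def generate_case(domain, length, max_depth, adjective_prob, rng):
    """Build one scenario: initial mapping, swap list, query person, target."""
    people = rng.sample(NAMES, length)
    items = []
    for item in rng.sample(domain["items"], length):
        # Optionally decorate the item with a domain-appropriate adjective.
        if rng.random() < adjective_prob:
            item = rng.choice(domain["adjectives"]) + " " + item
        items.append(item)

    # Initial state: one-to-one mapping between people and items.
    initial = dict(zip(people, items))
    state = dict(initial)

    swaps = []
    for _ in range(max_depth):
        # Pick two distinct people and exchange whatever they hold.
        a, b = rng.sample(people, 2)
        state[a], state[b] = state[b], state[a]
        swaps.append((a, b))

    query_person = rng.choice(people)
    return {
        "initial": initial,
        "swaps": swaps,
        "query_person": query_person,
        "target": state[query_person],
    }


case = generate_case(BOOKS, length=4, max_depth=3, adjective_prob=0.5,
                     rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps generation reproducible, which is convenient when the same test cases need to be regenerated later.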

Domain Database

The system includes 5 thematic domains, each with specialized vocabulary and context:

| Domain | Context | Items | Adjectives | Action Verb |
|--------|---------|-------|------------|-------------|
| Dancing | Square dance partner switching | 18 partner names (Patrick, Jamie, Lola, Melissa, etc.) | energetic, graceful, skilled, experienced, enthusiastic, talented | "switch partners" |
| Books | Literary trading among friends | 18 classic titles (Catch-22, Frankenstein, The Great Gatsby, etc.) | thick, thin, worn, new, heavy, light, hardcover, paperback | "swap books" |
| Soccer | Position changes during a match | 18 field positions (goalkeeper, striker, midfielder, benchwarmer, etc.) | starting, backup, primary, secondary, key, veteran | "trade positions" |
| Gifts | White elephant gift exchange | 12 present types (ball, box, vase, toy, sculpture, etc.) | orange, pink, black, gold, green, brown, silver, crystal, wooden, metal | "swap presents" |
| Balls | Game with colored ball trading | 18 colors (red, black, blue, yellow, purple, etc.) | round, bouncy, smooth, textured, inflated, heavy, lightweight, shiny | "swap balls" |

Anchor Format System

Format Types (AnchorFormat Enum)

The generator supports 9 different organizational markers to test format-independent comprehension:

| Format | Description | Example Pattern |
|--------|-------------|-----------------|
| NONE | No markers, plain sequential text | (no markers) |
| NUMERIC | Standard numbering | 1, 2, 3, ... |
| ASCII | Consecutive ASCII characters starting at A (continues past Z) | A, B, C, ..., Z, [, \, ... |
| ALPHA | Cycling alphabet (wraps after Z) | A, B, C, ..., Z, A, B, ... |
| ROMAN | Roman numerals | I, II, III, IV, ... |
| SKIP_2 | Even numbers | 2, 4, 6, 8, ... |
| REVERSE | Countdown from total | 5, 4, 3, 2, 1 |
| HEX | Hexadecimal notation | 0x01, 0x02, 0x03, ... |
| ELEMENTS | Periodic table symbols | H, He, Li, Be, ... |
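One way to produce these marker sequences is sketched below. It is an assumption-laden illustration rather than the generator's implementation: `anchor_labels` and `to_roman` are hypothetical helpers, formats are passed as plain strings instead of `AnchorFormat` members, hex digits are lowercase, and only the first ten element symbols are included.

```python
# First ten periodic table symbols; a full implementation would list more.
ELEMENT_SYMBOLS = ["H", "He", "Li", "Be", "B", "C", "N", "O", "F", "Ne"]

ROMAN_PAIRS = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
    (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n):
    """Convert a positive integer to a Roman numeral."""
    out = []
    for value, symbol in ROMAN_PAIRS:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)


def anchor_labels(fmt, total):
    """Return one marker per step, for steps numbered 1..total."""
    steps = range(1, total + 1)
    if fmt == "NONE":
        return [""] * total
    if fmt == "NUMERIC":
        return [str(i) for i in steps]
    if fmt == "ASCII":
        # Keeps counting through the ASCII table, so step 27 is "[".
        return [chr(ord("A") + i - 1) for i in steps]
    if fmt == "ALPHA":
        # Wraps back to "A" after "Z".
        return [chr(ord("A") + (i - 1) % 26) for i in steps]
    if fmt == "ROMAN":
        return [to_roman(i) for i in steps]
    if fmt == "SKIP_2":
        return [str(2 * i) for i in steps]
    if fmt == "REVERSE":
        return [str(total - i + 1) for i in steps]
    if fmt == "HEX":
        return [f"0x{i:02x}" for i in steps]
    if fmt == "ELEMENTS":
        return [ELEMENT_SYMBOLS[i - 1] for i in steps]
    raise ValueError(f"unknown anchor format: {fmt}")


print(anchor_labels("ROMAN", 4))    # ['I', 'II', 'III', 'IV']
print(anchor_labels("REVERSE", 5))  # ['5', '4', '3', '2', '1']
```

The ASCII and ALPHA branches differ only in the modulo, which is the distinction between running past Z and wrapping back to A.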

Anchor Formatting Example

NUMERIC Format:

1. First, Alice and Bob swap books
2. Then, Claire and Dave swap books  
3. Finally, Eve and Frank swap books

ROMAN Format:

I. First, Alice and Bob swap books
II. Then, Claire and Dave swap books
III. Finally, Eve and Frank swap books

ELEMENTS Format:

H. First, Alice and Bob swap books
He. Then, Claire and Dave swap books
Li. Finally, Eve and Frank swap books
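The layouts above amount to wrapping each marker in a prefix and a suffix before the swap sentence. A minimal sketch, assuming the markers have already been computed and that the prefix and suffix default to a newline and a period (`apply_anchors` is a hypothetical helper name):

```python
def apply_anchors(sentences, labels, anchor_prefix="\n", anchor_suffix="."):
    """Prefix each swap sentence with its marker, e.g. '\\nI. First, ...'."""
    return "".join(
        f"{anchor_prefix}{label}{anchor_suffix} {sentence}"
        for sentence, label in zip(sentences, labels)
    )


text = apply_anchors(
    ["First, Alice and Bob swap books", "Then, Claire and Dave swap books"],
    ["I", "II"],
)
print(text)
```

Changing `anchor_prefix` or `anchor_suffix` (for example to `" ("` and `")"`) changes the layout without touching the swap sentences themselves.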

Confounding Statement System

Purpose and Design

Confounding statements test the model's ability to maintain task focus by inserting irrelevant interpersonal information that doesn't affect item ownership. These statements are interleaved with the swap sequence to increase cognitive load.

Template Categories

Relationship Statements

"{person1} really likes {person2}"
"{person1} and {person2} don't get along great"  
"{person1} is friends with {person2}"
"{person1} has known {person2} for years"

Professional Dynamics

"{person1} and {person2} work well together"
"{person1} and {person2} are colleagues"
"{person1} trusts {person2}"
"{person1} and {person2} communicate effectively"

Personal Qualities

"{person1} thinks {person2} is funny"
"{person1} respects {person2}"
"{person1} admires {person2}"
"{person1} supports {person2}"

Insertion Strategy

Confounding statements are randomly inserted between swap descriptions, creating natural-sounding but irrelevant narrative flow that challenges selective attention.
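A minimal sketch of this insertion step follows. The function name `insert_confounders`, the three-template subset, and the meaning of the returned indices (positions in the final sentence list) are assumptions made for illustration. The sketch also assumes distractors land strictly between the first and last swap, as in the soccer example in this document; with a single swap there is no such gap, so the distractor is appended after it.

```python
import random

# A small subset of the templates listed above.
TEMPLATES = [
    "{person1} really likes {person2}",
    "{person1} and {person2} work well together",
    "{person1} respects {person2}",
]


def insert_confounders(swap_sentences, people, confounding_count, rng):
    """Return (sentences, confounding_indices) with distractors mixed in."""
    # Each entry is (text, is_confounder) so positions can be recovered later.
    entries = [(s, False) for s in swap_sentences]
    for _ in range(confounding_count):
        p1, p2 = rng.sample(people, 2)
        text = rng.choice(TEMPLATES).format(person1=p1, person2=p2)
        # Insert before position 1..len-1 so the first and last entries
        # remain swaps whenever there are at least two swaps.
        pos = rng.randint(1, max(1, len(entries) - 1))
        entries.insert(pos, (text, True))
    sentences = [text for text, _ in entries]
    indices = [i for i, (_, flag) in enumerate(entries) if flag]
    return sentences, indices


sentences, indices = insert_confounders(
    ["Alice and Bob trade positions",
     "Dave and Eve trade positions",
     "Claire and Alice trade positions"],
    ["Alice", "Bob", "Claire", "Dave", "Eve"],
    confounding_count=2,
    rng=random.Random(1),
)
```

Because the swap sentences keep their relative order, removing the entries at `indices` recovers the original swap sequence exactly.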

Configuration Parameters

Generation Schema (ShuffleGenerationParams)

from pydantic import BaseModel

# AnchorFormat is the enum described under "Anchor Format System".
class ShuffleGenerationParams(BaseModel):
    count: int                 # Number of test cases to generate (> 0)
    length: int                # Number of people/entities (≥ 3)
    max_depth: int             # Number of swaps to perform (≥ 1)
    confounding_count: int     # Number of confounding statements (≥ 0)
    anchor: AnchorFormat       # Anchor format type (enum)
    adjective_prob: float      # Probability of adding adjectives (0.0-1.0)
    anchor_prefix: str = "\n"  # Prefix for anchor markers
    anchor_suffix: str = "."   # Suffix for anchor markers

Standard Grid Configuration:

  • length: [4, 5, 6] - Number of participants
  • max_depth: [2, 3, 4] - Number of sequential swaps
  • confounding_count: [0, 1, 2] - Irrelevant statements inserted
  • Generates 27 different complexity combinations (3 × 3 × 3)
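The 27 combinations are the Cartesian product of the three value lists. A short sketch of enumerating them, where `GRID` and `grid_points` are illustrative names rather than part of the documented API:

```python
from itertools import product

GRID = {
    "length": [4, 5, 6],
    "max_depth": [2, 3, 4],
    "confounding_count": [0, 1, 2],
}


def grid_points(grid):
    """Expand a dict of value lists into one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


points = grid_points(GRID)
print(len(points))  # 27
print(points[0])    # {'length': 4, 'max_depth': 2, 'confounding_count': 0}
```

Each resulting dict could be merged with the remaining fields (`count`, `anchor`, `adjective_prob`) to build one set of generation parameters per grid point.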

Result Schema (ShuffleTestCaseResult)

from typing import List, Tuple

from pydantic import BaseModel

class ShuffleTestCaseResult(BaseModel):
    input: str                      # Formatted problem text
    target: str                     # Correct answer (item name)
    domain: str                     # Domain name used
    people: List[str]               # List of people names
    items: List[str]                # List of items (with adjectives if applied)
    swaps: List[Tuple[str, str]]    # List of (person1, person2) swap pairs
    query_person: str               # Person being queried about
    response_enum: List[str]        # All possible items (the multiple-choice options)
    confounding_indices: List[int]  # Indices where confounding statements were inserted

Example Test Cases

Basic Tracking (domain=BOOKS, anchor=NONE, length=4, max_depth=3, confounding_count=0)

Alice, Bob, Claire, and Dave are friends and avid readers who occasionally trade books. At the start of the semester, they each buy one new book: Alice gets Catch-22, Bob gets Frankenstein, Claire gets The Pearl, and Dave gets Moby Dick.

As the semester proceeds, they start trading around the new books. First, Alice and Claire swap books. Then, Bob and Dave swap books. Finally, Claire and Bob swap books.

At the end of the semester, which book does Alice have?

State Evolution:

  • Initial: Alice→Catch-22, Bob→Frankenstein, Claire→The Pearl, Dave→Moby Dick
  • After swap 1: Alice→The Pearl, Bob→Frankenstein, Claire→Catch-22, Dave→Moby Dick
  • After swap 2: Alice→The Pearl, Bob→Moby Dick, Claire→Catch-22, Dave→Frankenstein
  • After swap 3: Alice→The Pearl, Bob→Catch-22, Claire→Moby Dick, Dave→Frankenstein

Expected Answer: The Pearl
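The state evolution for this example can be reproduced by replaying the swaps over a dictionary. The helper name `trace` is illustrative only:

```python
def trace(initial, swaps):
    """Return every state: the initial mapping, then one snapshot per swap."""
    state = dict(initial)
    states = [dict(state)]
    for a, b in swaps:
        # A swap exchanges the items held by the two named people.
        state[a], state[b] = state[b], state[a]
        states.append(dict(state))
    return states


initial = {
    "Alice": "Catch-22",
    "Bob": "Frankenstein",
    "Claire": "The Pearl",
    "Dave": "Moby Dick",
}
swaps = [("Alice", "Claire"), ("Bob", "Dave"), ("Claire", "Bob")]

states = trace(initial, swaps)
print(states[-1]["Alice"])  # The Pearl
```

Keeping a snapshot per step makes it possible to see at which swap a model's answer diverges from the true state, rather than only checking the final answer.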

Complex Tracking with Confounding (domain=SOCCER, anchor=NONE, length=5, max_depth=3, confounding_count=2)

Alice, Bob, Claire, Dave, and Eve are on the same team in a soccer match. At the start of the match, they are each assigned to a position: Alice is playing goalkeeper, Bob is playing striker, Claire is playing midfielder, Dave is playing defender, and Eve is playing fullback.

As the game progresses, pairs of players occasionally swap positions. First, Alice and Bob trade positions. Alice really likes Claire. Then, Dave and Eve trade positions. Bob and Claire work well together. Finally, Claire and Alice trade positions.

At the end of the match, what position is Dave playing?

State Evolution (ignoring confounding statements):

  • Initial: Alice→goalkeeper, Bob→striker, Claire→midfielder, Dave→defender, Eve→fullback
  • After swap 1: Alice→striker, Bob→goalkeeper, Claire→midfielder, Dave→defender, Eve→fullback
  • After swap 2: Alice→striker, Bob→goalkeeper, Claire→midfielder, Dave→fullback, Eve→defender
  • After swap 3: Alice→midfielder, Bob→goalkeeper, Claire→striker, Dave→fullback, Eve→defender

Expected Answer: fullback

Anchor Format Variation (domain=GIFTS, anchor=ROMAN, length=3, max_depth=2, confounding_count=0)

Alice, Bob, and Claire are holding a white elephant gift exchange. At the start of the event, they are each holding a present: Alice has gold box, Bob has silver vase, and Claire has wooden toy.

As the event progresses, pairs of people swap presents.
I. First, Alice and Bob swap presents
II. Then, Bob and Claire swap presents

At the end of the event, which present is Claire holding?

State Evolution:

  • Initial: Alice→gold box, Bob→silver vase, Claire→wooden toy
  • After swap I: Alice→silver vase, Bob→gold box, Claire→wooden toy
  • After swap II: Alice→silver vase, Bob→wooden toy, Claire→gold box

Expected Answer: gold box

Cognitive Skills Tested

  • Working Memory: Maintaining multiple item-person mappings simultaneously
  • Memory Churn: Distinguishing the current state from multiple similar earlier working memory states
  • Sequential Processing: Applying transformations in correct temporal order
  • Selective Attention: Filtering relevant swaps from irrelevant social information
  • State Tracking: Updating mental models through sequential changes
  • Switching: Transitioning attention between different people and items

Applications

This test manifold evaluates capabilities essential for:

  • Process Monitoring: Tracking system states through sequential operations
  • Multi-Step Reasoning: Maintaining context through extended logical chains
  • Attention Control: Focusing on task-relevant information while filtering distractors
  • Procedural Following: Executing sequences of instructions with precision
  • Robustness: Filtering unwanted information while performing a task