Object Counting Test Manifold¶

Overview¶

The Object Counting test manifold evaluates a model's ability to perform selective counting and categorization tasks, requiring precise attention to semantic categories while filtering irrelevant information. Models must identify items belonging to specific categories, extract their quantities from natural language expressions, and perform accurate arithmetic aggregation across multiple items.

Task Description¶

Models are presented with a collection of possessions described in natural language, including both target items (belonging to queried categories) and distractor items (from other categories or zero-quantity items). The task requires identifying which items belong to the target category, extracting their quantities, and computing the total count while ignoring irrelevant objects.

Key Features:

- Category Recognition: Distinguishing target items from distractors across semantic boundaries
- Quantity Extraction: Parsing natural language number expressions (words and numerals)
- Selective Counting: Summing only items belonging to specified categories
- Distractor Resistance: Ignoring items from non-target categories and zero-quantity items
- Multi-Category Aggregation: Handling queries spanning multiple related categories
- Format Robustness: Processing information with different organizational structures

Test Case Generation¶

Algorithm Overview¶

The generator creates challenging counting scenarios through a systematic process:

1. Category Selection: Choose one or more target categories (target_groups) from the 11 semantic categories
2. Target Item Generation: Sample items from the target categories with random quantities (0 to max_count)
3. Distractor Injection: Add items from non-target categories to increase cognitive load
4. Natural Language Formatting: Convert quantities to natural language with appropriate articles
5. Adjective Application: Optionally add descriptive adjectives, controlled by prob_adjective
6. Anchor Formatting: Apply organizational markers for structure variation
7. Query Generation: Create category-specific or multi-category questions
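The sampling steps above (category selection, target items, distractors, shuffling) can be sketched as follows. This is a minimal illustration, not the generator's actual code: `sample_case`, the three-category `CATEGORIES` subset, and the tuple layout are all assumptions made for the example.

```python
import random

# Illustrative subset only; the real database has 11 categories.
CATEGORIES = {
    "fruits": ["apple", "banana", "grape", "orange"],
    "tools": ["hammer", "wrench", "saw", "drill"],
    "animals": ["cat", "dog", "frog", "rabbit"],
}

def sample_case(rng, length, max_count, distractor_count, target_groups=1):
    """Return (items, targets, answer); each item is (category, noun, count)."""
    targets = rng.sample(sorted(CATEGORIES), target_groups)
    others = [c for c in sorted(CATEGORIES) if c not in targets]

    def draw(categories, n):
        # Sample nouns without replacement so no item appears twice.
        pool = [(c, noun) for c in categories for noun in CATEGORIES[c]]
        picked = rng.sample(pool, n)
        return [(c, noun, rng.randint(0, max_count)) for c, noun in picked]

    items = draw(targets, length) + draw(others, distractor_count)
    rng.shuffle(items)  # spread distractors through the list
    # Zero-quantity target items contribute nothing to the answer.
    answer = sum(count for c, _, count in items if c in targets)
    return items, targets, answer

items, targets, answer = sample_case(
    random.Random(0), length=3, max_count=4, distractor_count=2
)
```

Passing a seeded `random.Random` keeps a generated case reproducible, which helps when comparing models on identical inputs.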

Category Database¶

The system includes 11 semantic categories, each with a fixed item vocabulary of 9 to 15 nouns and a phrase used when asking the question:

| Category | Item Count | Example Items | Question Text |
| --- | --- | --- | --- |
| Musical Instruments | 9 | accordion, clarinet, flute, piano, violin, guitar | "musical instruments" |
| Fruits | 10 | apple, banana, grape, orange, strawberry, plum | "fruits" |
| Vegetables | 10 | cabbage, carrot, broccoli, lettuce, potato, onion | "vegetables" |
| Animals | 15 | bear, cat, dog, duck, frog, mouse, rabbit, snake | "animals" |
| Clothing | 11 | shirt, pants, dress, jacket, hat, shoe, tie, scarf | "pieces of clothing" |
| Tools | 10 | hammer, screwdriver, wrench, saw, drill, pliers | "tools" |
| Sports Equipment | 9 | bat, racket, helmet, glove, puck, paddle, goal | "pieces of sports equipment" |
| Books and Media | 10 | textbook, magazine, DVD, CD, comic book, journal | "books and media items" |
| Office Supplies | 10 | pen, pencil, stapler, paperclip, folder, calculator | "office supplies" |
| Toys | 10 | doll, puzzle, board game, toy car, yo-yo, kite | "toys" |
| Jewelry | 9 | ring, necklace, bracelet, earring, pendant, chain | "pieces of jewelry" |

Natural Language Quantity System¶

The generator converts numerical quantities into natural language expressions with grammatically correct articles and pluralization:

Quantity Formatting Rules:

- Zero: "zero vegetables" or "no vegetables" (random selection)
- One: "a carrot" or "an apple" (a/an chosen by the initial sound of the word that follows, whether adjective or noun)
- Multiple: "two carrots", "three apples", "ten oranges" (number words for 2-10, numerals above 10)
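The three rules can be sketched as a single function. This is an illustrative version under stated assumptions: `format_quantity` and `NUMBER_WORDS` are hypothetical names, the caller supplies both singular and plural forms, and the article is chosen from the first letter, which only approximates a true vowel-sound check.

```python
import random

NUMBER_WORDS = ["zero", "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]

def format_quantity(count, singular, plural, rng=random):
    """Render a count and noun as a natural language phrase."""
    if count == 0:
        # Zero is expressed as either "zero" or "no", chosen at random.
        return f"{rng.choice(['zero', 'no'])} {plural}"
    if count == 1:
        # First-letter test: a simplification of the vowel-sound rule.
        article = "an" if singular[0].lower() in "aeiou" else "a"
        return f"{article} {singular}"
    # Number words for 2-10, numerals above 10.
    number = NUMBER_WORDS[count] if count <= 10 else str(count)
    return f"{number} {plural}"
```

For instance, `format_quantity(3, "apple", "apples")` yields "three apples", while a count of 12 falls back to the numeral form.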

Adjective Stream Integration:

- Information Stream Control: prob_adjective sets the per-item probability of attaching a descriptive adjective that is irrelevant to the counting task
- Stream Intensity Levels:
    - 0.0: Clean presentation - no adjectives, minimal cognitive load
    - 0.5: Intermittent stream - adjectives appear on some items and not others (described here as the hardest setting, because the pattern is inconsistent)
    - 1.0: Full adjective stream - every item has a descriptive modifier, maximum extraneous information
- Adjective Pool: big, small, large, tiny, green, red, blue, yellow, old, new, shiny, rusty
- Grammatical Consistency: The article is adjusted to the adjective's initial sound ("an old apple", "a big orange")
- Cognitive Challenge: Higher probabilities increase working memory load by adding task-irrelevant descriptive details
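A minimal sketch of how prob_adjective could gate the adjective stream, and why the article must be chosen after the adjective is attached. The function names `maybe_add_adjective` and `with_article` are hypothetical, and the first-letter article test is a simplification.

```python
import random

ADJECTIVES = ["big", "small", "large", "tiny", "green", "red",
              "blue", "yellow", "old", "new", "shiny", "rusty"]

def maybe_add_adjective(noun, prob_adjective, rng):
    """Prefix the noun with a random adjective with probability prob_adjective."""
    # rng.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
    if rng.random() < prob_adjective:
        return f"{rng.choice(ADJECTIVES)} {noun}"
    return noun

def with_article(phrase):
    """Pick a/an from the first word of the phrase, adjective or noun."""
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    return f"{article} {phrase}"

rng = random.Random(42)
phrase = with_article(maybe_add_adjective("apple", 1.0, rng))
```

Choosing the article last is what turns "an apple" into "a big apple" and "a carrot" into "an old carrot".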

Anchor Format System¶

Format Types (AnchorFormat Enum)¶

The generator supports six anchor formats (five marker styles plus an unmarked baseline) to test format-independent comprehension:

| Format | Description | Example Pattern |
| --- | --- | --- |
| NONE | No markers - plain sequential text | (no markers) |
| NUMERIC | Standard numbering | 1, 2, 3, ... |
| ALPHA | Alphabetic letters (cycling) | A, B, C, ..., Z, A, B, ... |
| ROMAN | Roman numerals | I, II, III, IV, ... |
| REVERSE | Countdown from total | 5, 4, 3, 2, 1 |
| HEX | Hexadecimal notation | 0x01, 0x02, 0x03, ... |
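The marker for each format can be derived from an item's 1-based position, as in the sketch below. The enum member values, the helper names `to_roman` and `anchor_marker`, and the lowercase hex digits for positions above 9 are assumptions; only the member names and example patterns come from the table.

```python
from enum import Enum

class AnchorFormat(Enum):
    NONE = "none"
    NUMERIC = "numeric"
    ALPHA = "alpha"
    ROMAN = "roman"
    REVERSE = "reverse"
    HEX = "hex"

ROMAN_PAIRS = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
               (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
               (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    """Convert a positive integer to a Roman numeral."""
    out = ""
    for value, symbol in ROMAN_PAIRS:
        while n >= value:
            out += symbol
            n -= value
    return out

def anchor_marker(fmt, index, total):
    """Return the marker for the item at 1-based position index."""
    if fmt is AnchorFormat.NONE:
        return ""
    if fmt is AnchorFormat.NUMERIC:
        return str(index)
    if fmt is AnchorFormat.ALPHA:
        return chr(ord("A") + (index - 1) % 26)  # cycles back to A after Z
    if fmt is AnchorFormat.ROMAN:
        return to_roman(index)
    if fmt is AnchorFormat.REVERSE:
        return str(total - index + 1)  # counts down from total to 1
    return f"0x{index:02x}"  # HEX
```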

Anchor Formatting Example¶

NUMERIC Format:

I have 
1. two apples,
2. a hammer,
3. three oranges,
and 4. a screwdriver.

How many fruits do I have?

ROMAN Format:

I have 
I. two apples,
II. a hammer,
III. three oranges,
and IV. a screwdriver.

How many fruits do I have?

HEX Format:

I have 
0x01. two apples,
0x02. a hammer,
0x03. three oranges,
and 0x04. a screwdriver.

How many fruits do I have?
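The layout in these examples is consistent with wrapping each marker in anchor_prefix and anchor_suffix and placing "and" before the final marker. The sketch below reproduces the NUMERIC example under that reading; `render_prompt` is a hypothetical helper, and it assumes non-empty markers, so the NONE format is not covered.

```python
def render_prompt(phrases, markers, question,
                  anchor_prefix="\n", anchor_suffix=". "):
    """Join item phrases with anchor markers, then append the question."""
    parts = []
    last = len(phrases) - 1
    for i, (phrase, marker) in enumerate(zip(phrases, markers)):
        # "and" precedes the final marker; the final item ends the sentence.
        joiner = "and " if i == last and last > 0 else ""
        end = "." if i == last else ","
        parts.append(f"{anchor_prefix}{joiner}{marker}{anchor_suffix}{phrase}{end}")
    return "I have " + "".join(parts) + "\n\n" + question

prompt = render_prompt(
    ["two apples", "a hammer", "three oranges", "a screwdriver"],
    ["1", "2", "3", "4"],
    "How many fruits do I have?",
)
```

Swapping the marker list for `["I", "II", "III", "IV"]` or `["0x01", "0x02", "0x03", "0x04"]` gives the ROMAN and HEX variants with no other change.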

Distractor System¶

Primary Distractors: Cross-Category Items¶

Items from non-target categories that should be ignored during counting:

- Semantic Interference: Similar-looking or phonetically similar items from other categories
- Contextual Plausibility: Items that might reasonably appear together in real scenarios
- Quantity Variation: Distractors can have any quantity (0 to max_count) to increase complexity

Secondary Distractors: Zero-Quantity Target Items¶

Target category items with zero quantity that test precise attention:

"I have zero apples, two bananas, no oranges, and three grapes. How many fruits do I have?"

Expected Answer: 5 (ignoring zero apples and no oranges)
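The expected answer can be checked with a small reference solver that treats "zero" and "no" as 0 and skips nouns outside the target category. This is an illustrative sketch: `count_targets`, the `FRUITS` subset, and the strip-a-trailing-"s" singularization are assumptions that only hold for simple regular plurals.

```python
QUANTITY_WORDS = {
    "zero": 0, "no": 0, "a": 1, "an": 1, "two": 2, "three": 3, "four": 4,
    "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
}

FRUITS = {"apple", "banana", "grape", "orange"}  # illustrative subset

def count_targets(phrases, target_nouns):
    """Sum the quantities of phrases whose noun is in target_nouns."""
    total = 0
    for phrase in phrases:
        words = phrase.split()
        quantity_word, noun = words[0], words[-1]
        # Fall back to int() for numerals such as "12".
        if quantity_word in QUANTITY_WORDS:
            quantity = QUANTITY_WORDS[quantity_word]
        else:
            quantity = int(quantity_word)
        singular = noun[:-1] if noun.endswith("s") else noun  # naive
        if singular in target_nouns:
            total += quantity
    return total

phrases = ["zero apples", "two bananas", "no oranges", "three grapes"]
total = count_targets(phrases, FRUITS)  # 0 + 2 + 0 + 3
```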

Strategic Placement¶

Distractors are randomly distributed throughout item lists to prevent positional biases and maintain cognitive load across the entire problem.

Configuration Parameters¶

Generation Schema (`ObjectCountingGenerationParams`)¶

from pydantic import BaseModel

class ObjectCountingGenerationParams(BaseModel):
    count: int                  # Number of test cases to generate (> 0)
    length: int                 # Number of target items to include (≥ 1)
    max_count: int              # Maximum count per item (≥ 0)
    distractor_count: int       # Number of distractor items (≥ 0)
    target_groups: int          # Number of target categories (≥ 1)
    prob_adjective: float       # Probability of attaching an extraneous adjective to an item (0.0-1.0)
    anchor: AnchorFormat        # Anchor format type (the AnchorFormat enum described above)
    anchor_prefix: str = "\n"   # Prefix placed before each anchor marker
    anchor_suffix: str = ". "   # Suffix placed after each anchor marker
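The range constraints noted in the comments can be checked before generation. The sketch below uses a plain dictionary and a hypothetical `validate_params` helper as a stand-in; it does not show how the schema itself enforces these bounds, which the source does not specify.

```python
def validate_params(params):
    """Return a list of constraint violations (empty when params are valid)."""
    checks = [
        ("count", lambda v: v > 0, "must be > 0"),
        ("length", lambda v: v >= 1, "must be >= 1"),
        ("max_count", lambda v: v >= 0, "must be >= 0"),
        ("distractor_count", lambda v: v >= 0, "must be >= 0"),
        ("target_groups", lambda v: v >= 1, "must be >= 1"),
        ("prob_adjective", lambda v: 0.0 <= v <= 1.0, "must be in [0.0, 1.0]"),
    ]
    return [f"{name} {message}" for name, ok, message in checks
            if not ok(params[name])]

good = {"count": 10, "length": 4, "max_count": 3, "distractor_count": 2,
        "target_groups": 1, "prob_adjective": 0.5}
bad = dict(good, prob_adjective=1.5, length=0)
```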

Result Schema (`ObjectCountingTestCaseResult`)¶

from typing import List

from pydantic import BaseModel

class ObjectCountingTestCaseResult(BaseModel):
    input: str                    # Formatted problem text
    target: str                   # Correct answer (total count, as a string)
    target_categories: List[str]  # Categories that were counted
    target_count: int             # Total count of target items
    distractor_count: int         # Number of distractor items included

Example Test Cases¶

Single Category Counting (category=fruits, length=4, max_count=3, distractor_count=2)¶

I have a red apple, two hammers, three bananas, a screwdriver, an orange, and two grapes.

How many fruits do I have?

Target Items Analysis:

- red apple (1) ✓
- bananas (3) ✓
- orange (1) ✓
- grapes (2) ✓

Distractors: hammers, screwdriver (tools category)

Expected Answer: 7

Multi-Category Counting (categories=[fruits, vegetables], length=6, max_count=5, distractor_count=2)¶

I have two apples, a hammer, three carrots, zero oranges, a lettuce head, a wrench, an onion, and four bananas.

How many fruits and vegetables do I have?

Target Items Analysis:

- apples (2) ✓ [fruits]
- carrots (3) ✓ [vegetables]
- oranges (0) ✗ [zero quantity]
- lettuce head (1) ✓ [vegetables]
- onion (1) ✓ [vegetables]
- bananas (4) ✓ [fruits]

Distractors: hammer, wrench (tools category)

Expected Answer: 11

Zero-Quantity Distractor Challenge (category=animals, length=5, max_count=4, distractor_count=2)¶

I have no cats, two dogs, a piano, three rabbits, zero mice, a violin, and four frogs.

How many animals do I have?

Target Items Analysis:

- cats (0) ✗ [zero quantity]
- dogs (2) ✓
- rabbits (3) ✓
- mice (0) ✗ [zero quantity]
- frogs (4) ✓

Distractors: piano, violin (musical instruments category)

Expected Answer: 9

Extraneous Information Stream Challenge (category=clothing, length=4, prob_adjective=0.5, distractor_count=3)¶

I have a big shirt, two pants, an old hat, three jackets, a rusty hammer, two tiny screws, and a wrench.

How many pieces of clothing do I have?

Information Stream Analysis:

- Relevant Information: shirt (1), pants (2), hat (1), jackets (3) = 7 total
- Extraneous Descriptive Stream: "big", "old", "rusty", "tiny"
- Cognitive Load: High - four of the seven items carry task-irrelevant adjectives while the other three do not

Target Items Analysis:

- big shirt (1) ✓
- pants (2) ✓
- old hat (1) ✓
- jackets (3) ✓

Distractors: rusty hammer, tiny screws, wrench (tools category)

Expected Answer: 7