Object Counting Test Manifold

Overview

The Object Counting test manifold evaluates a model's ability to perform selective counting and categorization tasks, requiring precise attention to semantic categories while filtering irrelevant information. Models must identify items belonging to specific categories, extract their quantities from natural language expressions, and perform accurate arithmetic aggregation across multiple items.

Task Description

Models are presented with a collection of possessions described in natural language, including both target items (belonging to queried categories) and distractor items (from other categories or zero-quantity items). The task requires identifying which items belong to the target category, extracting their quantities, and computing the total count while ignoring irrelevant objects.

Key Features:

  • Category Recognition: Distinguishing target items from distractors across semantic boundaries
  • Quantity Extraction: Parsing natural language number expressions (words and numerals)
  • Selective Counting: Summing only items belonging to specified categories
  • Distractor Resistance: Ignoring items from non-target categories and zero-quantity items
  • Multi-Category Aggregation: Handling queries spanning multiple related categories
  • Format Robustness: Processing information with different organizational structures

Test Case Generation

Algorithm Overview

The generator creates challenging counting scenarios through a systematic process:

  1. Category Selection: Choose the target categories (one or more, set by target_groups) from the 11 semantic categories
  2. Target Item Generation: Sample items from target categories with random quantities (0 to max_count)
  3. Distractor Injection: Add items from non-target categories to increase cognitive load
  4. Natural Language Formatting: Convert quantities to word form with appropriate articles
  5. Adjective Application: Optionally add descriptive adjectives based on probability
  6. Anchor Formatting: Apply organizational markers for structure variation
  7. Query Generation: Create category-specific or multi-category questions
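Steps 1 to 3 can be sketched as follows. This is a minimal illustration, not the generator's actual code: the `CATEGORIES` table is a truncated stand-in for the 11-category database, and `sample_case` is a hypothetical helper name.

```python
import random

# Hypothetical, truncated category table; the real database has 11 categories.
CATEGORIES = {
    "fruits": ["apple", "banana", "grape", "orange"],
    "vegetables": ["carrot", "onion", "potato"],
    "tools": ["hammer", "wrench", "drill", "saw"],
    "animals": ["cat", "dog", "frog"],
}

def sample_case(rng, length, max_count, distractor_count, target_groups):
    """Pick target categories, then sample target and distractor items."""
    # Step 1: choose the target categories.
    targets = rng.sample(sorted(CATEGORIES), target_groups)
    target_pool = [item for cat in targets for item in CATEGORIES[cat]]
    other_pool = [item for cat in sorted(CATEGORIES) if cat not in targets
                  for item in CATEGORIES[cat]]
    # Step 2: target items with quantities in 0..max_count (zero is allowed).
    items = [(name, rng.randint(0, max_count), True)
             for name in rng.sample(target_pool, length)]
    # Step 3: distractors from the remaining categories.
    items += [(name, rng.randint(0, max_count), False)
              for name in rng.sample(other_pool, distractor_count)]
    # Shuffle so distractors are spread through the list.
    rng.shuffle(items)
    answer = sum(qty for _, qty, is_target in items if is_target)
    return targets, items, answer

targets, items, answer = sample_case(random.Random(0), length=3, max_count=5,
                                     distractor_count=2, target_groups=1)
```

Each item is a `(name, quantity, is_target)` tuple, so the expected answer is simply the sum of quantities over the target items. The sketch assumes `length` and `distractor_count` do not exceed the sizes of their pools.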

Category Database

The system includes 11 semantic categories, each with comprehensive item vocabularies:

| Category | Item Count | Example Items | Question Text |
|---|---|---|---|
| Musical Instruments | 9 | accordion, clarinet, flute, piano, violin, guitar | "musical instruments" |
| Fruits | 10 | apple, banana, grape, orange, strawberry, plum | "fruits" |
| Vegetables | 10 | cabbage, carrot, broccoli, lettuce, potato, onion | "vegetables" |
| Animals | 15 | bear, cat, dog, duck, frog, mouse, rabbit, snake | "animals" |
| Clothing | 11 | shirt, pants, dress, jacket, hat, shoe, tie, scarf | "pieces of clothing" |
| Tools | 10 | hammer, screwdriver, wrench, saw, drill, pliers | "tools" |
| Sports Equipment | 9 | bat, racket, helmet, glove, puck, paddle, goal | "pieces of sports equipment" |
| Books and Media | 10 | textbook, magazine, DVD, CD, comic book, journal | "books and media items" |
| Office Supplies | 10 | pen, pencil, stapler, paperclip, folder, calculator | "office supplies" |
| Toys | 10 | doll, puzzle, board game, toy car, yo-yo, kite | "toys" |
| Jewelry | 9 | ring, necklace, bracelet, earring, pendant, chain | "pieces of jewelry" |

Natural Language Quantity System

The generator converts numerical quantities into natural language expressions with grammatically correct articles and pluralization:

Quantity Formatting Rules:

  • Zero: "zero vegetables" or "no vegetables" (chosen at random)
  • One: "a carrot" or "an apple" (a/an chosen by the initial sound of the word that follows, whether noun or adjective)
  • Multiple: "two carrots", "three apples", "ten oranges" (number words for 2-10, numerals above 10)
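The formatting rules can be sketched in a few lines. This is an illustrative approximation under stated assumptions: `format_quantity`, `pluralize`, and `article_for` are hypothetical names, the article choice uses a first-letter heuristic rather than true vowel sounds, and the pluralizer only handles simple cases.

```python
import random

NUMBER_WORDS = {2: "two", 3: "three", 4: "four", 5: "five", 6: "six",
                7: "seven", 8: "eight", 9: "nine", 10: "ten"}
# Illustrative irregular forms only; not the generator's actual table.
IRREGULAR_PLURALS = {"mouse": "mice", "strawberry": "strawberries"}

def pluralize(noun):
    """Naive pluralizer: irregular lookup, then -es / -s."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"

def article_for(word):
    """First-letter heuristic for a/an (an assumption, not phonetic)."""
    return "an" if word[0].lower() in "aeiou" else "a"

def format_quantity(count, noun, rng=random):
    if count == 0:
        # Zero is written as either "zero" or "no", chosen at random.
        return f"{rng.choice(['zero', 'no'])} {pluralize(noun)}"
    if count == 1:
        return f"{article_for(noun)} {noun}"
    # Number words for 2-10, numerals above 10.
    number = NUMBER_WORDS.get(count, str(count))
    return f"{number} {pluralize(noun)}"

print(format_quantity(1, "apple"))    # an apple
print(format_quantity(3, "carrot"))   # three carrots
print(format_quantity(12, "orange"))  # 12 oranges
```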

Adjective Stream Integration:

  • Information Stream Control: prob_adjective sets the probability that an item carries a descriptive adjective, which is extraneous to the counting task
  • Stream Intensity Levels:
      ◦ 0.0: Clean presentation; no adjective stream, minimal cognitive load
      ◦ 1.0: Full adjective stream; every item has a descriptive modifier, maximum extraneous information
      ◦ 0.5: Intermittent stream; adjectives are sometimes present and sometimes absent (the hardest setting, because the pattern is inconsistent)
  • Adjective Pool: big, small, large, tiny, green, red, blue, yellow, old, new, shiny, rusty
  • Grammatical Consistency: The article follows the first sound of the adjective ("an old apple", "a big orange")
  • Cognitive Challenge: Higher probabilities increase working memory load by adding task-irrelevant descriptive details
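A small sketch of how prob_adjective might gate the adjective stream. The helper names `maybe_add_adjective` and `with_article` are hypothetical, and the article rule is a first-letter heuristic standing in for the vowel-sound check described above.

```python
import random

# Adjective pool as listed in this document.
ADJECTIVES = ["big", "small", "large", "tiny", "green", "red", "blue",
              "yellow", "old", "new", "shiny", "rusty"]

def maybe_add_adjective(noun, prob_adjective, rng):
    """Prefix a random adjective with probability prob_adjective."""
    # rng.random() is in [0.0, 1.0), so 0.0 never fires and 1.0 always does.
    if rng.random() < prob_adjective:
        return f"{rng.choice(ADJECTIVES)} {noun}"
    return noun

def with_article(phrase):
    """Choose a/an from the first word of the phrase (adjective or noun)."""
    article = "an" if phrase[0].lower() in "aeiou" else "a"
    return f"{article} {phrase}"

rng = random.Random(42)
phrase = maybe_add_adjective("apple", 0.5, rng)
print(with_article(phrase))
```

Because the article is chosen after the adjective is attached, "old apple" becomes "an old apple" while "big orange" becomes "a big orange".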

Anchor Format System

Format Types (AnchorFormat Enum)

The generator supports six anchor formats (five marker styles plus NONE, which emits no markers) to test format-independent comprehension:

| Format | Description | Example Pattern |
|---|---|---|
| NONE | No markers, plain sequential text | (no markers) |
| NUMERIC | Standard numbering | 1, 2, 3, ... |
| ALPHA | Alphabetic letters (cycling) | A, B, C, ..., Z, A, B, ... |
| ROMAN | Roman numerals | I, II, III, IV, ... |
| REVERSE | Countdown from total | 5, 4, 3, 2, 1 |
| HEX | Hexadecimal notation | 0x01, 0x02, 0x03, ... |
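The label for each format can be computed from a 1-based item index. The sketch below is an assumption-laden illustration: the enum member names follow the table, but the enum values, the `anchor_label` and `to_roman` helpers, and the lowercase hex digits are not confirmed by this document.

```python
from enum import Enum

class AnchorFormat(Enum):
    # Member names follow the table; the string values are placeholders.
    NONE = "none"
    NUMERIC = "numeric"
    ALPHA = "alpha"
    ROMAN = "roman"
    REVERSE = "reverse"
    HEX = "hex"

def to_roman(n):
    """Convert a positive integer to a Roman numeral."""
    pairs = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"),
             (90, "XC"), (50, "L"), (40, "XL"), (10, "X"), (9, "IX"),
             (5, "V"), (4, "IV"), (1, "I")]
    out = []
    for value, symbol in pairs:
        while n >= value:
            out.append(symbol)
            n -= value
    return "".join(out)

def anchor_label(fmt, index, total):
    """Return the marker text for the item at 1-based position index."""
    if fmt is AnchorFormat.NONE:
        return ""
    if fmt is AnchorFormat.NUMERIC:
        return str(index)
    if fmt is AnchorFormat.ALPHA:
        return chr(ord("A") + (index - 1) % 26)  # cycles back to A after Z
    if fmt is AnchorFormat.ROMAN:
        return to_roman(index)
    if fmt is AnchorFormat.REVERSE:
        return str(total - index + 1)  # counts down from total to 1
    if fmt is AnchorFormat.HEX:
        return f"0x{index:02x}"
    raise ValueError(f"unknown anchor format: {fmt}")

print([anchor_label(AnchorFormat.REVERSE, i, 5) for i in range(1, 6)])
```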

Anchor Formatting Example

NUMERIC Format:

I have 
1. two apples,
2. a hammer,
3. three oranges,
and 4. a screwdriver.

How many fruits do I have?

ROMAN Format:

I have 
I. two apples,
II. a hammer,
III. three oranges,
and IV. a screwdriver.

How many fruits do I have?

HEX Format:

I have 
0x01. two apples,
0x02. a hammer,
0x03. three oranges,
and 0x04. a screwdriver.

How many fruits do I have?
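The layout in these examples can be reproduced with a short rendering sketch. The function name `render_items` is hypothetical, and the placement of "and" after the anchor prefix is inferred from the examples above rather than from the generator's source. The sketch covers anchored formats only; NONE is out of scope.

```python
def render_items(phrases, labels, anchor_prefix="\n", anchor_suffix=". "):
    """Join item phrases into an 'I have ...' statement with anchors."""
    parts = ["I have "]
    last = len(phrases) - 1
    for i, (phrase, label) in enumerate(zip(phrases, labels)):
        # The final item is introduced by "and" and closed with a period.
        conjunction = "and " if i == last and last > 0 else ""
        ending = "." if i == last else ","
        parts.append(f"{anchor_prefix}{conjunction}{label}{anchor_suffix}{phrase}{ending}")
    return "".join(parts)

text = render_items(
    ["two apples", "a hammer", "three oranges", "a screwdriver"],
    ["1", "2", "3", "4"],
)
print(text)
```

With the default prefix and suffix this prints the same four-line list as the NUMERIC example; passing Roman or hexadecimal labels yields the other two layouts.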

Distractor System

Primary Distractors: Cross-Category Items

Items from non-target categories that should be ignored during counting:

  • Semantic Interference: Similar-looking or phonetically similar items from other categories
  • Contextual Plausibility: Items that might reasonably appear together in real scenarios
  • Quantity Variation: Distractors can have any quantity (0 to max_count) to increase complexity

Secondary Distractors: Zero-Quantity Target Items

Target category items with zero quantity that test precise attention:

"I have zero apples, two bananas, no oranges, and three grapes. How many fruits do I have?"
Expected Answer: 5 (ignoring zero apples and no oranges)
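Expected answers like this one can be checked with a small reference solver. The sketch below is not part of the generator: `count_targets` is a hypothetical helper, the number-word table only covers 0-10, and plurals are reduced by stripping a trailing "s", which is enough for the examples in this document but not for irregular nouns.

```python
import re

WORD_VALUES = {"zero": 0, "no": 0, "a": 1, "an": 1, "one": 1, "two": 2,
               "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7,
               "eight": 8, "nine": 9, "ten": 10}

def count_targets(statement, target_nouns):
    """Sum the quantities of items whose singular noun is in target_nouns."""
    body = statement.split("I have", 1)[1]
    total = 0
    # Items are separated by commas and a final "and".
    for chunk in re.split(r",|\band\b", body):
        words = chunk.strip(" .\n").split()
        if len(words) < 2:
            continue
        quantity_word, noun = words[0].lower(), words[-1].lower()
        if quantity_word.isdigit():
            quantity = int(quantity_word)
        elif quantity_word in WORD_VALUES:
            quantity = WORD_VALUES[quantity_word]
        else:
            continue  # unrecognized quantity expression; skip the item
        # Naive singularization (assumption): drop a trailing "s".
        singular = noun[:-1] if noun.endswith("s") else noun
        if singular in target_nouns:
            total += quantity
    return total

fruits = {"apple", "banana", "orange", "grape"}
statement = "I have zero apples, two bananas, no oranges, and three grapes."
print(count_targets(statement, fruits))  # 5
```

Zero-quantity items are parsed like any other item and contribute 0 to the sum, so the total is 0 + 2 + 0 + 3 = 5.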

Strategic Placement

Distractors are randomly distributed throughout item lists to prevent positional biases and maintain cognitive load across the entire problem.

Configuration Parameters

Generation Schema (ObjectCountingGenerationParams)

from pydantic import BaseModel

# AnchorFormat is the enum described under "Format Types (AnchorFormat Enum)".
class ObjectCountingGenerationParams(BaseModel):
    count: int                    # Number of test cases to generate (> 0)
    length: int                   # Number of target items to include (≥ 1)
    max_count: int                # Maximum count per item (≥ 0)
    distractor_count: int         # Number of distractor items (≥ 0)
    target_groups: int            # Number of target categories (≥ 1)
    prob_adjective: float         # Probability that an item gets an extraneous adjective (0.0-1.0)
    anchor: AnchorFormat          # Anchor format type (enum)
    anchor_prefix: str = "\n"     # Prefix for anchor markers
    anchor_suffix: str = ". "     # Suffix for anchor markers
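The constraints noted in the comments can be made explicit with validation. The sketch below uses a standard-library dataclass as a stand-in for the pydantic model, so it runs without extra dependencies; the class name `GenerationParamsSketch`, the string-typed `anchor` field, and its "NONE" default are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class GenerationParamsSketch:
    count: int
    length: int
    max_count: int
    distractor_count: int
    target_groups: int
    prob_adjective: float
    anchor: str = "NONE"       # stand-in for the AnchorFormat enum (assumed default)
    anchor_prefix: str = "\n"
    anchor_suffix: str = ". "

    def __post_init__(self):
        # Enforce the bounds listed in the schema comments.
        if self.count <= 0:
            raise ValueError("count must be > 0")
        if self.length < 1:
            raise ValueError("length must be >= 1")
        if self.max_count < 0:
            raise ValueError("max_count must be >= 0")
        if self.distractor_count < 0:
            raise ValueError("distractor_count must be >= 0")
        if self.target_groups < 1:
            raise ValueError("target_groups must be >= 1")
        if not 0.0 <= self.prob_adjective <= 1.0:
            raise ValueError("prob_adjective must be in [0.0, 1.0]")

params = GenerationParamsSketch(count=100, length=4, max_count=3,
                                distractor_count=2, target_groups=1,
                                prob_adjective=0.5, anchor="ROMAN")
```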

Result Schema (ObjectCountingTestCaseResult)

from typing import List

from pydantic import BaseModel

class ObjectCountingTestCaseResult(BaseModel):
    input: str                    # Formatted problem text
    target: str                   # Correct answer (total count, as a string)
    target_categories: List[str]  # Categories that were counted
    target_count: int             # Total count of target items
    distractor_count: int         # Number of distractor items included

Example Test Cases

Single Category Counting (category=fruits, length=4, max_count=3, distractor_count=2)

I have a red apple, two hammers, three bananas, a screwdriver, an orange, and two grapes.

How many fruits do I have?

Target Items Analysis:

  • red apple (1) ✓
  • bananas (3) ✓
  • orange (1) ✓
  • grapes (2) ✓

Distractors: hammers, screwdriver (tools category)

Expected Answer: 7

Multi-Category Counting (categories=[fruits, vegetables], length=6, max_count=5, distractor_count=2)

I have two apples, a hammer, three carrots, zero oranges, a lettuce head, a wrench, an onion, and four bananas.

How many fruits and vegetables do I have?

Target Items Analysis:

  • apples (2) ✓ [fruits]
  • carrots (3) ✓ [vegetables]
  • oranges (0) ✗ [zero quantity]
  • lettuce head (1) ✓ [vegetables]
  • onion (1) ✓ [vegetables]
  • bananas (4) ✓ [fruits]

Distractors: hammer, wrench (tools category)

Expected Answer: 11

Zero-Quantity Distractor Challenge (category=animals, length=5, max_count=4, distractor_count=2)

I have no cats, two dogs, a piano, three rabbits, zero mice, a violin, and four frogs.

How many animals do I have?

Target Items Analysis:

  • cats (0) ✗ [zero quantity]
  • dogs (2) ✓
  • rabbits (3) ✓
  • mice (0) ✗ [zero quantity]
  • frogs (4) ✓

Distractors: piano, violin (musical instruments category)

Expected Answer: 9

Extraneous Information Stream Challenge (category=clothing, length=4, prob_adjective=0.5, distractor_count=3)

I have a big shirt, two pants, an old hat, three jackets, a rusty hammer, two tiny screws, and a wrench.

How many pieces of clothing do I have?

Information Stream Analysis:

  • Relevant Information: shirt (1), pants (2), hat (1), jackets (3) = 7 total
  • Extraneous Descriptive Stream: "big", "old", "rusty", "tiny"
  • Cognitive Load: High; four of the seven items carry task-irrelevant adjectives while the other three do not

Target Items Analysis:

  • big shirt (1) ✓
  • pants (2) ✓
  • old hat (1) ✓
  • jackets (3) ✓

Distractors: rusty hammer, tiny screws, wrench (tools category)

Expected Answer: 7

Anchor Format with Multi-Category (anchor=ROMAN, categories=[toys, books and media], length=5, distractor_count=5)

I have 
I. two dolls,
II. a hammer,
III. three textbooks,
IV. a wrench,
V. four puzzles,
VI. a screwdriver,
VII. two comic books,
VIII. a drill,
IX. a board game,
and X. a saw.

How many toys and books and media items do I have?

Target Items Analysis:

  • dolls (2) ✓ [toys]
  • textbooks (3) ✓ [books and media]
  • puzzles (4) ✓ [toys]
  • comic books (2) ✓ [books and media]
  • board game (1) ✓ [toys]

Distractors: hammer, wrench, screwdriver, drill, saw (tools category)

Expected Answer: 12

Cognitive Skills Tested

  • Selective Attention: Focusing on target categories while ignoring distractors
  • Information Filtering: Separating task-relevant data from extraneous descriptive streams
  • Semantic Categorization: Accurate classification of items into conceptual groups
  • Quantity Parsing: Extracting numerical information from natural language
  • Arithmetic Aggregation: Accurate summation across multiple quantities
  • Working Memory: Maintaining running totals while processing item lists and filtering irrelevant adjective streams
  • Language Processing: Handling grammatical variations and adjective modifiers
  • Cognitive Load Management: Processing information efficiently despite varying levels of extraneous descriptive content

Applications

This test manifold evaluates capabilities essential for:

  • Data Analysis: Filtering and aggregating information by categories
  • Inventory Management: Counting items within specific classifications
  • Reading Comprehension: Extracting quantitative information from text
  • Attention to Detail: Precise identification of relevant vs. irrelevant information
  • Mathematical Reasoning: Combining language understanding with arithmetic operations
  • Information Filtering: Focusing on task-relevant data while ignoring distractors