Table Reasoning Test Manifold¶
Tables is part of the r12 evaluation set, replacing Movies from the prior m12x configuration.
Overview¶
The Table Reasoning test manifold evaluates a model's ability to parse structured tabular data in multiple formats and perform operations on tables, requiring systematic data extraction, format interpretation, filtering, and aggregation capabilities. Models must handle diverse table representations while accurately applying filtering conditions, counting operations, and statistical computations over structured data.
Task Description¶
Models are presented with structured data organized in a table format (CSV, Markdown, Fixed-Width, or JSON) containing categorical and numeric columns, followed by a natural language question requiring computation over the table data. The task requires parsing table syntax, extracting relevant rows via filtering, selecting appropriate columns, and computing answers through count, aggregation (sum/mode/median), extremum (min/max), or positional (first/last) operations.
Key Features:
- Format Flexibility: Parsing tables across multiple standard formats with different syntactic structures
- Categorical Filtering: Understanding facet-based filtering with equality and inequality conditions
- Numeric Filtering: Applying comparison operators (>, <, >=, <=) and range constraints
- Aggregation Operations: Computing sum, mode, median, min, max across filtered datasets
- Row Identification: Handling both numeric IDs and string labels for row referencing
- Structured Parsing: Extracting values from tabular layouts with varying spacing and delimiters
- Compositional Reasoning: Combining multiple operations (filter then aggregate) into coherent answers
Cognitive Architecture Challenge: Format Invariance and Systematic Aggregation¶
The Fundamental Recognition Problem¶
This task creates a format invariance challenge that tests whether models can extract identical semantic information from fundamentally different syntactic representations. The same table can be presented as:
CSV:
id,label,facet,age,weight_kg
1,Alice the Lion,mammal,5,180
Markdown:
| id | label | facet | age | weight_kg |
|----|-------|-------|-----|-----------|
| 1 | Alice the Lion | mammal | 5 | 180 |
Fixed-Width:
id label facet age weight_kg
-- ----- ----- --- ---------
1 Alice the Lion mammal 5 180
JSON:
[{"id": 1, "label": "Alice the Lion", "facet": "mammal", "age": 5, "weight_kg": 180}]
Parsing Challenge: Models must normalize these syntactically distinct representations into a unified internal data model before performing operations.
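As a sketch of this normalization step, the snippet below (illustrative only, not the tables.py implementation) parses three of the four formats into the same list of row dicts; fixed-width parsing, which requires positional alignment, is omitted:

```python
import csv, io, json

def parse_csv(text):
    """Parse a CSV table into a list of row dicts (values as strings)."""
    return list(csv.DictReader(io.StringIO(text)))

def parse_markdown(text):
    """Parse a GitHub-style markdown table, skipping the dash separator row."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    rows = [[cell.strip() for cell in ln.strip('|').split('|')] for ln in lines]
    header, body = rows[0], rows[2:]          # rows[1] is the |---| separator
    return [dict(zip(header, r)) for r in body]

def parse_json(text):
    """Parse a JSON array of objects; coerce values to str for uniformity."""
    return [{k: str(v) for k, v in row.items()} for row in json.loads(text)]

csv_text = "id,label,facet,age,weight_kg\n1,Alice the Lion,mammal,5,180\n"
md_text = (
    "| id | label | facet | age | weight_kg |\n"
    "|----|-------|-------|-----|-----------|\n"
    "| 1 | Alice the Lion | mammal | 5 | 180 |\n"
)
json_text = '[{"id": 1, "label": "Alice the Lion", "facet": "mammal", "age": 5, "weight_kg": 180}]'

# All three renderings normalize to the identical row dictionary:
assert parse_csv(csv_text) == parse_markdown(md_text) == parse_json(json_text)
```

Producing the same internal row list from every surface format is precisely the invariance property the task demands of the model.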
Why Format Variation Matters¶
Large Language Models often struggle with format invariance because:
- Training Bias: Most training data emphasizes specific table formats (CSV dominates ML contexts)
- Tokenization Fragility: Different delimiters (comma vs. pipe vs. whitespace) tokenize differently, changing how models perceive row/column boundaries
- Positional Sensitivity: Fixed-width formats exploit spatial reasoning, while CSV relies on delimiter counting
- Syntax Dominance: Models may focus on format-specific syntax rather than semantic content
Smaller Models sometimes perform better because they:
- Process Mechanically: Apply simpler parsing without format bias
- Avoid Over-Specialization: Don't exploit format-specific shortcuts that fail on variants
- Distribute Attention Uniformly: Don't preferentially attend to delimiters or column separators
Operation Complexity as Mental Model Test¶
Beyond format parsing, the task probes whether models implement correct aggregation semantics:
Incorrect Mental Model: Treating aggregation operations independently
"Sum the values" → search for sum-like operation
"Find the mode" → separate logic for finding most common value
"Get minimum" → independent min lookup
Correct Mental Model: Understanding that all operations receive the same filtered dataset
1. Parse table structure
2. Apply filter constraints consistently
3. Extract target column values from filtered rows
4. Apply operation (sum/mode/median/min/max/first/last) to those values
Semantic Consistency: All operations (COUNT, SUM, MIN, MAX, FIRST, LAST) operate uniformly on filtered data. The FIRST_OR_LAST operation extracts values from the first or last row within the filtered result set, maintaining consistency with other set-reduction operations. Row-specific lookups are now modeled as a filter type (SET with cardinality=1) rather than as a special operation behavior.
Models with the correct mental model produce consistent results across operations, while those with incorrect models show operation-specific degradation in performance.
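The correct mental model can be sketched as a single uniform pipeline in which every operation reduces the same filtered value list (function and operation names here are illustrative, not the generator's actual code):

```python
from collections import Counter
from statistics import median

def answer(rows, filter_func, column, op):
    """All operations share one pipeline: filter, extract, then reduce."""
    filtered = [r for r in rows if filter_func(r)]     # 2. apply filter constraints
    values = [r[column] for r in filtered]             # 3. extract target column
    ops = {                                            # 4. reduce the same value list
        "count": len,
        "sum": sum,
        "mode": lambda v: Counter(v).most_common(1)[0][0],
        "median": lambda v: int(median(v)),
        "min": min,
        "max": max,
        "first": lambda v: v[0],
        "last": lambda v: v[-1],
    }
    return ops[op](values)

rows = [
    {"id": 1, "facet": "mammal", "age": 5},
    {"id": 2, "facet": "bird", "age": 8},
    {"id": 3, "facet": "mammal", "age": 12},
]
is_mammal = lambda r: r["facet"] == "mammal"
assert answer(rows, is_mammal, "age", "sum") == 17   # 5 + 12
assert answer(rows, is_mammal, "age", "last") == 12  # last row in the filtered set
```

Because every reduction sees the same `values` list, consistency across operations falls out for free; an implementation that special-cases FIRST/LAST against the unfiltered table would diverge here.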
Randomized Operation Semantics¶
The SUM_MODE_MEDIAN operation (lines 413-457 in tables.py) introduces strategic randomization to prevent shortcut learning:
- All three operations (sum/mode/median) are always valid through data skewing
- Mode skewing: Replaces random values with ordinal values ensuring one appears most frequently (no ties)
- Median skewing: For even-count datasets, adjusts values if needed to ensure integer median
- Random selection: Model cannot assume a single operation; must compute all three and let randomization choose
This prevents models from memorizing operation-specific heuristics and forces them to actually compute the values.
Implementation Details:
- Mode skewing happens for all SUM_MODE_MEDIAN operations before variant selection to ensure the mode is always well-defined
- Median skewing happens only when median is selected and the dataset has an even count with an odd sum of middle values
- Skewing modifies synthetic data values in place, which is acceptable since all values are randomly generated
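A minimal sketch of both skewing steps, under simplifying assumptions (mode skewing is reduced here to a single dominant value; the actual tables.py logic uses a small set of ordinal values and may differ in detail):

```python
import random
from collections import Counter
from statistics import median

def skew_for_mode(values, rng):
    """Overwrite a strict majority of positions with one value so the mode has no ties."""
    out = list(values)
    target = rng.choice(out)
    majority = len(out) // 2 + 1                  # enough copies to beat any rival
    for i in rng.sample(range(len(out)), majority):
        out[i] = target
    return out

def skew_for_median(values):
    """For even counts, bump the lower middle value so the median is an integer."""
    out = sorted(values)
    if len(out) % 2 == 0:
        lo, hi = out[len(out) // 2 - 1], out[len(out) // 2]
        if (lo + hi) % 2 == 1:                    # odd sum would give a .5 median
            out[len(out) // 2 - 1] = lo + 1
    return out

skewed = skew_for_mode([150, 165, 170, 180, 190], random.Random(0))
counts = Counter(skewed).most_common(2)
# The top value strictly outnumbers every rival (unambiguous mode).
assert len(counts) == 1 or counts[0][1] > counts[1][1]

assert skew_for_median([10, 15, 20, 25]) == [10, 16, 20, 25]
assert median([10, 16, 20, 25]) == 18
```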
Inequality Filtering as Negation Reasoning¶
The FACET filter (lines 267-285) uses equality/inequality with 50% probability:
if use_equality:
    # "How many mammal animals are in the table?"
    description = f"{target_facet} animals"
    filter_func = lambda row: row['facet'] == target_facet
else:
    # "How many animals that are not mammal are in the table?"
    description = f"animals that are not {target_facet}"
    filter_func = lambda row: row['facet'] != target_facet
Inequality filtering tests whether models correctly implement negation logic without simply inverting the count or making errors in logical reasoning.
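A tiny demonstration of the property being probed: the two filters partition the table exactly, so a correctly computed negated count must equal total minus matching.

```python
rows = [{"facet": f} for f in ["mammal", "mammal", "bird", "fish", "amphibian"]]

matching     = [r for r in rows if r["facet"] == "mammal"]   # equality filter
not_matching = [r for r in rows if r["facet"] != "mammal"]   # negation filter

# Equality and inequality split the rows with no overlap and no remainder,
# so the negated count is total minus matching.
assert len(matching) == 2 and len(not_matching) == 3
assert len(not_matching) == len(rows) - len(matching)
```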
Test Case Generation¶
Algorithm Overview¶
The generator creates table reasoning scenarios through a systematic process:
- Table Generation: Create table with ID column, string label column, facet (categorical) column, and N numeric metric columns
- Data Initialization: Generate realistic animal names (cross-product of first names and animal types) with random metric values
- Filter Creation: Generate filter conditions (facet equality/inequality, numeric comparison, or numeric range)
- Data Filtering: Apply filter to subset rows, with fallback to unfiltered if no matches
- Operation Selection: Choose target column and execute operation (count, sum/mode/median, min/max, first/last)
- Question Generation: Create natural language question reflecting the operation and filter
- Format Selection: Render table in one of four formats
- Result Compilation: Return complete problem with answer and metadata
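The eight steps above can be condensed into a sketch (reduced to a facet filter with a COUNT operation; all name pools and field names are hypothetical, not the generator's actual constants):

```python
import random

FORMATS = ["csv", "markdown", "fixed_width", "json"]
FIRST_NAMES = ["Alice", "Bob", "Charlie"]
ANIMALS = ["Lion", "Tiger", "Eagle"]
FACETS = ["mammal", "bird", "reptile"]

def generate_case(rng, num_rows=3):
    # 1-2. Build the table: id, label (name cross-product), facet, one metric.
    labels = rng.sample([f"{n} the {a}" for n in FIRST_NAMES for a in ANIMALS], num_rows)
    rows = [{"id": i + 1, "label": lb, "facet": rng.choice(FACETS),
             "age": rng.randint(0, 100)} for i, lb in enumerate(labels)]
    # 3-4. Create a facet filter and apply it, falling back to all rows if empty.
    target = rng.choice([r["facet"] for r in rows])
    filtered = [r for r in rows if r["facet"] == target] or rows
    # 5. Execute the COUNT operation on the filtered rows.
    answer = len(filtered)
    # 6-7. Phrase the question and pick a rendering format.
    question = f"How many {target} animals are in the table?"
    fmt = rng.choice(FORMATS)
    # 8. Return the complete problem with answer and metadata.
    return {"question": question, "answer": str(answer), "format": fmt, "rows": rows}

case = generate_case(random.Random(42))
assert int(case["answer"]) >= 1
```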
Table Data Structure¶
Each table contains the following columns:
| Column | Type | Description | Example |
|---|---|---|---|
| id | Integer (1-indexed) | Numeric row identifier | 1, 2, 3, ... |
| label | String | Unique animal name generated from name cross-product | "Alice the Lion", "Bob the Tiger" |
| facet | String | Categorical dimension from selected scheme | "mammal", "bird", "terrestrial" |
| metric_1 through metric_N | Integer | Numeric columns for aggregation operations | age=15, weight_kg=250, height_cm=180 |
Facet Schemes¶
The system includes three facet schemes for categorical filtering:
| Scheme | Values | Description | Type |
|---|---|---|---|
| taxonomy | mammal, bird, reptile, fish, amphibian | Biological classification | Categorical |
| habitat | terrestrial, aquatic, aerial, amphibious | Primary habitat type | Categorical |
| size | small, medium, large, giant | Relative size category | Categorical |
One scheme is randomly selected per test case and assigned to each row in the table.
Metric Definitions¶
The system includes 8 metric types with domain-realistic value ranges:
| Metric | Unit | Range | Description |
|---|---|---|---|
| age | years | 0-100 | Current age of animal |
| weight_kg | kilograms | 1-10000 | Body weight |
| height_cm | centimeters | 10-500 | Height at shoulder |
| lifespan_years | years | 1-100 | Maximum expected lifespan |
| speed_kmh | kilometers/hour | 1-120 | Maximum movement speed |
| offspring_count | count | 0-20 | Number of offspring per breeding cycle |
| territory_km2 | square kilometers | 1-1000 | Territory size |
| gestation_days | days | 10-700 | Pregnancy duration |
N metrics are randomly selected per table (N = num_columns parameter).
Table Formatting¶
CSV Format¶
Standard comma-separated values with header row:
id,label,facet,age,weight_kg
1,Alice the Lion,mammal,35,180
2,Bob the Tiger,mammal,12,160
3,Charlie the Eagle,bird,8,4
Markdown Format¶
GitHub-style markdown table with pipe delimiters:
| id | label | facet | age | weight_kg |
|---|---|---|---|---|
| 1 | Alice the Lion | mammal | 35 | 180 |
| 2 | Bob the Tiger | mammal | 12 | 160 |
| 3 | Charlie the Eagle | bird | 8 | 4 |
Fixed-Width Format¶
Aligned columns with whitespace padding:
id label facet age weight_kg
-- ----- ----- --- ---------
1 Alice the Lion mammal 35 180
2 Bob the Tiger mammal 12 160
3 Charlie the Eagle bird 8 4
JSON Format¶
Array of objects with explicit key-value pairs:
[
{"id": 1, "label": "Alice the Lion", "facet": "mammal", "age": 35, "weight_kg": 180},
{"id": 2, "label": "Bob the Tiger", "facet": "mammal", "age": 12, "weight_kg": 160},
{"id": 3, "label": "Charlie the Eagle", "facet": "bird", "age": 8, "weight_kg": 4}
]
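For illustration, all four renderings above can be produced from one row list with small formatter functions (a sketch, not the generator's actual formatting code):

```python
import json

def to_csv(rows):
    cols = list(rows[0])
    return "\n".join([",".join(cols)] +
                     [",".join(str(r[c]) for c in cols) for r in rows])

def to_markdown(rows):
    cols = list(rows[0])
    lines = ["| " + " | ".join(cols) + " |",
             "|" + "|".join("---" for _ in cols) + "|"]
    lines += ["| " + " | ".join(str(r[c]) for c in cols) + " |" for r in rows]
    return "\n".join(lines)

def to_fixed_width(rows):
    cols = list(rows[0])
    width = {c: max(len(c), *(len(str(r[c])) for r in rows)) for c in cols}
    lines = ["  ".join(c.ljust(width[c]) for c in cols),
             "  ".join("-" * width[c] for c in cols)]
    lines += ["  ".join(str(r[c]).ljust(width[c]) for c in cols) for r in rows]
    return "\n".join(lines)

def to_json(rows):
    return json.dumps(rows, indent=2)

rows = [{"id": 1, "label": "Alice the Lion", "facet": "mammal", "age": 35}]
assert to_csv(rows).splitlines()[1] == "1,Alice the Lion,mammal,35"
assert to_markdown(rows).splitlines()[2] == "| 1 | Alice the Lion | mammal | 35 |"
```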
Operation Type System¶
The system implements 6 operation types. Operations 1–4 are used in r12. Operations 5–6 were added for tables-16k; note that FIRST (operation 6) serves as a control and is excluded from standard tables-16k runs (see below).
COUNT¶
Counts rows matching filter condition.
Variants:
- Matching (50%): "How many animals are in the table?" or "How many mammal animals are in the table?"
- Not Matching (50%, only when a filter is present): "How many animals are NOT mammal?"
Example: Given 15 total rows with 8 mammal rows and 7 non-mammal rows:
- "How many mammal animals are in the table?" → Answer: 8
- "How many animals are NOT mammal?" → Answer: 7
SUM_MODE_MEDIAN¶
Aggregates numeric column values, randomly choosing between sum, mode, or median.
Semantics:
- Sum: Total of all values in target column for filtered rows
- Mode: Most frequently occurring value in target column for filtered rows
- Median: Middle value when sorted (always an integer, guaranteed by light skewing)
Random Selection Logic: All three operations are always valid. The generator ensures unambiguous integer answers through strategic data skewing.
Data Skewing for Unambiguous Answers:
Since all table data is synthetically generated with random values, the generator applies light-skewing to guarantee clean integer answers:
- Mode Skewing: Replaces random values in the target column with a smaller set of ordinal values, ensuring one value appears more frequently than all others. This guarantees an unambiguous mode (no ties).
- Median Skewing: For even-count filtered datasets, if the sum of the two middle values is odd (which would produce a non-integer median), the generator increments the lower middle value by 1 to make the sum even. This ensures the median is always an integer.
Why Skewing Doesn't Matter: The skewed values are just as synthetic as the original random values. The goal is to create well-defined test cases where the correct answer is unambiguous, preventing spurious failures due to ambiguity in aggregation semantics rather than actual model errors.
Example: Given 5 mammal animals with weight_kg values [150, 165, 150, 180, 150]:
- Sum variant: 150 + 165 + 150 + 180 + 150 = 795
- Mode variant: 150 (appears 3 times, guaranteed by skewing)
- Median variant: 165 (middle of sorted [150, 150, 150, 165, 180])
Example with median skewing: Given 4 animals with age values [10, 15, 20, 25]:
- Without skewing: median = (15 + 20) / 2 = 17.5 (non-integer, ambiguous)
- After skewing: values become [10, 16, 20, 25], median = (16 + 20) / 2 = 18 (integer, unambiguous)
MIN_OR_MAX¶
Finds minimum or maximum value in numeric column for filtered rows.
Variants (50% each):
- Minimum: "What is the minimum height for mammal animals?" → Answer: smallest value
- Maximum: "What is the maximum height for mammal animals?" → Answer: largest value
Example: Given 8 mammal animals with age values [5, 8, 12, 15, 25, 30, 35, 40]:
- Minimum variant: 5
- Maximum variant: 40
FIRST_OR_LAST (operation=4, legacy)¶
Gets value from first or last row in filtered dataset, randomly choosing between the two variants. Used in r12. Not used in tables-16k due to the asymmetric failure profile of first vs last under context pressure (see LAST and FIRST below).
Variants (50% each):
- First: "What is the first age for mammal animals?" → Answer: age from first row after filtering for mammals
- Last: "What is the last weight_kg for animals with age at least 10?" → Answer: weight_kg from last row after filtering
Semantic Consistency: This operation applies to filtered data, just like COUNT, SUM, MIN, and MAX. The first/last position is determined within the filtered result set, not the original table.
LAST (operation=5)¶
Gets value from the last row in filtered dataset. Used in tables-16k as the primary positional capacity probe.
Rationale: Under context pressure, the answer lives at the table suffix — either truncated away or in the lowest-attention zone. This makes last a clean probe of channel capacity: the decay curve tracks directly with how much of the table the model can effectively process. Empirically, all positional failures under capacity pressure are last, never first.
Incompatible with the SET filter: asking for the "last" value over a filter that enumerates named individuals is semantically ambiguous.
FIRST (operation=6)¶
Gets value from the first row in filtered dataset. Capability gate, not a capacity probe.
Rationale: The answer lives at the table prefix — always in the high-attention primacy zone regardless of total context length. Models that fail first have a broken attention mechanism, not a capacity limitation. Accuracy is near-ceiling even at maximum context pressure. Included for completeness and as an explicit control condition; not used in standard tables-16k evaluation runs.
Filter Type System¶
The system implements 5 filter types:
NONE¶
No filtering applied.
Behavior: All rows included in operation. COUNT returns total row count.
Question Example: "How many animals are in the table?"
FACET¶
Categorical filtering on the facet column (chosen from active facet scheme).
Variants (50% each):
- Equality (facet == value): "How many mammal animals are in the table?"
- Inequality (facet != value): "How many animals that are not mammal are in the table?"
Selection Logic:
1. Random facet value selected from values present in current table data
2. 50% chance of equality (==) or inequality (!=)
3. Filter applied to create subset
4. Operation executed on filtered rows
Example: Facet scheme = taxonomy, target_facet = mammal:
- Equality: Include only rows where facet == "mammal"
- Inequality: Include only rows where facet != "mammal"
NUMERIC_COMPARISON¶
Numeric comparison with operators >, <, >=, <= on selected metric column.
Operators:
- > (greater than): "animals with age greater than 15"
- < (less than): "animals with age less than 15"
- >= (at least): "animals with age at least 15"
- <= (at most): "animals with age at most 15"
Selection Logic:
1. Random metric column selected from available metrics
2. Random comparison operator chosen from [>, <, >=, <=]
3. Threshold value selected from existing data values (ensures some matches)
4. Filter applied: rows where column [operator] threshold
Example: metric=weight_kg, operator=>=, threshold=200: - Include rows where weight_kg >= 200
Question Example: "How many animals with weight_kg at least 200 are in the table?"
NUMERIC_RANGE¶
Inclusive range filter on numeric column: min_val <= column <= max_val
Selection Logic:
1. Random metric column selected
2. All values sorted
3. Min bound chosen from lower half of sorted values
4. Max bound chosen from upper half of sorted values
5. Values swapped if min > max
6. Filter applied: rows where min_val <= column <= max_val
Example: metric=age, min_val=10, max_val=35: - Include rows where 10 <= age <= 35
Question Example: "How many animals with age between 10 and 35 are in the table?"
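The bound-selection steps can be sketched as follows (`pick_range_bounds` is a hypothetical name; the real generator may sample differently):

```python
import random

def pick_range_bounds(values, rng):
    """Steps 2-5: sort, min from lower half, max from upper half, swap if needed."""
    ordered = sorted(values)                       # 2. all values sorted
    half = len(ordered) // 2
    lo = rng.choice(ordered[:half] or ordered)     # 3. min bound from lower half
    hi = rng.choice(ordered[half:])                # 4. max bound from upper half
    if lo > hi:                                    # 5. defensive swap
        lo, hi = hi, lo
    return lo, hi

ages = [5, 8, 12, 15, 25, 35]
lo, hi = pick_range_bounds(ages, random.Random(7))
matches = [a for a in ages if lo <= a <= hi]       # 6. inclusive range filter
assert lo <= hi and len(matches) >= 1              # hi is a real value, so it matches
```

Drawing both bounds from existing data values guarantees the filtered subset is non-empty, which is why the generator never needs a fallback for this filter type.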
SET¶
Row-based filtering by specific IDs or labels (cardinality: 1 to max_set_size).
Selection Logic:
1. Random number of rows selected (1 to min(max_set_size, num_rows))
2. 50% chance of using IDs or labels for identification
3. Filter applied: rows where id/label in selected set
4. Operation executed on filtered rows
English Formatting:
- Cardinality 1 (ID): "the animal with ID 5"
- Cardinality 1 (Label): "Alice the Lion"
- Cardinality 2 (IDs): "animals with IDs 3 and 7"
- Cardinality 2 (Labels): "Alice the Lion and Bob the Tiger"
- Cardinality 3+ (IDs): "animals with IDs 1, 2, and 3" (Oxford comma)
- Cardinality 3+ (Labels): "Alice the Lion, Bob the Tiger, and Charlie the Bear" (Oxford comma)
Configuration: The max_set_size parameter (default: 5) controls maximum cardinality:
- Set max_set_size=1 to force single-row lookups
- Set max_set_size=10 to allow larger sets
Example with cardinality=1: "What is the total weight_kg for Alice the Lion?" - Include only row where label == "Alice the Lion"
Example with cardinality=3: "How many animals with IDs 2, 5, and 8 are in the table?" - Include rows where id in {2, 5, 8}
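The English formatting rules can be sketched as a small helper (`describe_set` is a hypothetical name; its output follows the cardinality patterns listed above):

```python
def describe_set(identifiers, use_ids):
    """Render a SET filter in English, with the Oxford comma for 3+ items."""
    items = [str(i) for i in identifiers]
    if len(items) == 1:
        joined = items[0]
    elif len(items) == 2:
        joined = f"{items[0]} and {items[1]}"
    else:
        joined = ", ".join(items[:-1]) + f", and {items[-1]}"
    if use_ids:
        noun = "the animal with ID" if len(items) == 1 else "animals with IDs"
        return f"{noun} {joined}"
    return joined

assert describe_set([5], use_ids=True) == "the animal with ID 5"
assert describe_set([3, 7], use_ids=True) == "animals with IDs 3 and 7"
assert describe_set([1, 2, 3], use_ids=True) == "animals with IDs 1, 2, and 3"
assert describe_set(["Alice the Lion", "Bob the Tiger"], use_ids=False) == \
    "Alice the Lion and Bob the Tiger"
```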
Configuration Parameters¶
Generation Schema (TablesGenerationParams)¶
class TablesGenerationParams(BaseModel):
    count: int = Field(gt=0, description="Number of test cases to generate")
    num_rows: int = Field(ge=3, le=60, description="Number of rows in table")
    num_columns: int = Field(ge=2, le=8, description="Number of numeric data columns")
    format: TableFormat = Field(description="Table format type")
    operation: Operation = Field(description="Operation to perform")
    filter_type: FilterType = Field(default=FilterType.NONE, description="Type of filter to apply")
    domain_name: Optional[str] = Field(default=None, description="Domain name (default: animals)")
    max_set_size: int = Field(default=5, ge=1, description="Maximum number of rows in SET filter")
Parameter Ranges:
- count: [1, 100] - Number of test cases per batch
- num_rows: [3, 60] - Table scale (affects parsing and counting complexity)
- num_columns: [2, 8] - Number of numeric columns (affects aggregation cost)
- format: [CSV, MARKDOWN, FIXED_WIDTH, JSON] - Table format variant (enum values 1-4)
- operation: [COUNT, SUM_MODE_MEDIAN, MIN_OR_MAX, FIRST_OR_LAST, LAST, FIRST] - Aggregation operation (enum values 1-6); r12 uses 1-4, tables-16k uses 1-3, 5
- filter_type: [NONE, FACET, NUMERIC_COMPARISON, NUMERIC_RANGE, SET] - Filtering complexity (enum values 1-5)
- domain_name: ["animals"] - Currently only animals domain supported
- max_set_size: [1, num_rows] - Maximum cardinality for SET filter (default: 5, set to 1 for single-row lookups)
Standard Complexity Progression:
- num_rows: [5, 10, 30, 60] - Parsing/counting load (as used in configs/tables.yaml)
- num_columns: [8] - Fixed at 8 columns in standard config
- format: [1=CSV, 2=MARKDOWN, 3=FIXED_WIDTH, 4=JSON] - Syntactic variation
- operation: [1=COUNT, 2=SUM_MODE_MEDIAN, 3=MIN_OR_MAX, 4=FIRST_OR_LAST] - Semantic complexity
- filter_type: [1=NONE, 2=FACET, 3=NUMERIC_COMPARISON, 4=NUMERIC_RANGE, 5=SET] - Filter reasoning
Result Schema (TablesTestCaseResult)¶
class TablesTestCaseResult(BaseModel):
    input: str = Field(description="The formatted problem text")
    target: str = Field(description="The correct integer answer")
    format: int = Field(description="Table format used (TableFormat enum)")
    operation: int = Field(description="Operation performed (Operation enum)")
    filter_type: int = Field(description="Filter type applied (FilterType enum)")
    num_rows: int = Field(description="Number of rows in table")
    num_columns: int = Field(description="Number of numeric columns")
    domain: str = Field(description="Domain name used")
    row_id_type: int = Field(description="Row identifier type (RowIdType enum)")
    table_data: List[Dict] = Field(description="Raw table data")
    question_metadata: Dict = Field(description="Question generation metadata")
Metadata Fields:
- target_column: Name of metric column used for aggregation (null for COUNT)
- filter_description: Natural language description of filter applied (null for NONE)
- filter_detail: Technical filter specification (e.g., "facet==mammal", "age>=20")
- facet_scheme: Name of facet scheme used
- metrics: List of metric column names in table
- operation_variant: Specific variant executed (e.g., "sum", "mode", "min", "max", "first", "last", "matching", "not_matching")
Example Test Cases¶
Basic COUNT with No Filter (format=CSV, operation=COUNT, filter_type=NONE)¶
Given the following table:
id,label,facet,age,weight_kg
1,Alice the Lion,mammal,35,180
2,Bob the Tiger,mammal,12,160
3,Charlie the Eagle,bird,8,4
4,Diana the Dolphin,fish,15,300
5,Emma the Frog,amphibian,5,1
How many animals are in the table?
Analysis:
- Table Structure: 5 rows, 2 numeric columns (age, weight_kg)
- Filter: None, all rows included
- Operation: COUNT
- Calculation: 5 total rows
- Expected Answer: 5
COUNT with FACET Equality Filter (format=MARKDOWN, operation=COUNT, filter_type=FACET)¶
Given the following table:
| id | label | facet | age | weight_kg |
|---|---|---|---|---|
| 1 | Alice the Lion | mammal | 35 | 180 |
| 2 | Bob the Tiger | mammal | 12 | 160 |
| 3 | Charlie the Eagle | bird | 8 | 4 |
| 4 | Diana the Dolphin | fish | 15 | 300 |
| 5 | Emma the Frog | amphibian | 5 | 1 |
How many mammal animals are in the table?
Analysis:
- Filter: FACET equality (facet == "mammal")
- Filtered Rows: [Alice the Lion, Bob the Tiger] = 2 rows
- Operation: COUNT (matching variant)
- Expected Answer: 2
COUNT with FACET Inequality Filter (format=MARKDOWN, operation=COUNT, filter_type=FACET)¶
Given the following table:
| id | label | facet | age | weight_kg |
|---|---|---|---|---|
| 1 | Alice the Lion | mammal | 35 | 180 |
| 2 | Bob the Tiger | mammal | 12 | 160 |
| 3 | Charlie the Eagle | bird | 8 | 4 |
| 4 | Diana the Dolphin | fish | 15 | 300 |
| 5 | Emma the Frog | amphibian | 5 | 1 |
How many animals that are not mammal are in the table?
Analysis:
- Filter: FACET inequality (facet != "mammal")
- Filtered Rows: [Charlie the Eagle, Diana the Dolphin, Emma the Frog] = 3 rows
- Operation: COUNT (not_matching variant)
- Expected Answer: 3
SUM with NUMERIC_COMPARISON Filter (format=FIXED_WIDTH, operation=SUM_MODE_MEDIAN, filter_type=NUMERIC_COMPARISON)¶
Given the following table:
id label facet age weight_kg
-- ----- ----- --- ----------
1 Alice the Lion mammal 35 180
2 Bob the Tiger mammal 12 160
3 Charlie the Eagle bird 8 4
4 Diana the Dolphin fish 15 300
5 Emma the Frog amphibian 5 1
What is the total weight_kg for animals with age at least 10?
Analysis:
- Filter: NUMERIC_COMPARISON (age >= 10)
- Filtered Rows: [Alice (35, 180), Bob (12, 160), Diana (15, 300)] = 3 rows
- Operation: SUM_MODE_MEDIAN, operation_variant = "sum" (chosen at random)
- Target Column: weight_kg
- Calculation: 180 + 160 + 300 = 640
- Expected Answer: 640
MIN with NUMERIC_RANGE Filter (format=JSON, operation=MIN_OR_MAX, filter_type=NUMERIC_RANGE)¶
Given the following table:
[
{"id": 1, "label": "Alice the Lion", "facet": "mammal", "age": 35, "weight_kg": 180},
{"id": 2, "label": "Bob the Tiger", "facet": "mammal", "age": 12, "weight_kg": 160},
{"id": 3, "label": "Charlie the Eagle", "facet": "bird", "age": 8, "weight_kg": 4},
{"id": 4, "label": "Diana the Dolphin", "facet": "fish", "age": 15, "weight_kg": 300},
{"id": 5, "label": "Emma the Frog", "facet": "amphibian", "age": 5, "weight_kg": 1}
]
What is the minimum weight_kg for animals with age between 8 and 20?
Analysis:
- Filter: NUMERIC_RANGE (8 <= age <= 20)
- Filtered Rows: [Charlie (8, 4), Bob (12, 160), Diana (15, 300)] = 3 rows
- Operation: MIN_OR_MAX, operation_variant = "min" (chosen at random)
- Target Column: weight_kg
- Values: [4, 160, 300]
- Calculation: min(4, 160, 300) = 4
- Expected Answer: 4
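This example can be checked mechanically with the json module:

```python
import json

table = json.loads("""[
  {"id": 1, "label": "Alice the Lion",    "facet": "mammal",    "age": 35, "weight_kg": 180},
  {"id": 2, "label": "Bob the Tiger",     "facet": "mammal",    "age": 12, "weight_kg": 160},
  {"id": 3, "label": "Charlie the Eagle", "facet": "bird",      "age": 8,  "weight_kg": 4},
  {"id": 4, "label": "Diana the Dolphin", "facet": "fish",      "age": 15, "weight_kg": 300},
  {"id": 5, "label": "Emma the Frog",     "facet": "amphibian", "age": 5,  "weight_kg": 1}
]""")

# NUMERIC_RANGE filter (8 <= age <= 20), then MIN over weight_kg.
weights = [row["weight_kg"] for row in table if 8 <= row["age"] <= 20]
assert weights == [160, 4, 300]   # Bob, Charlie, Diana in table order
assert min(weights) == 4          # matches the expected answer
```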
FIRST with FACET Filter (format=CSV, operation=FIRST_OR_LAST, filter_type=FACET)¶
Given the following table:
id,label,facet,age,weight_kg
1,Alice the Lion,mammal,35,180
2,Bob the Tiger,mammal,12,160
3,Charlie the Eagle,bird,8,4
4,Diana the Dolphin,fish,15,300
5,Emma the Frog,amphibian,5,1
What is the first age for mammal animals?
Analysis:
- Filter: FACET equality (facet == "mammal")
- Filtered Rows: [Alice the Lion (35, 180), Bob the Tiger (12, 160)]
- Operation: FIRST_OR_LAST, operation_variant = "first"
- Target Column: age
- First Row After Filtering: Alice the Lion with age=35
- Expected Answer: 35
LAST with SET Filter (format=MARKDOWN, operation=FIRST_OR_LAST, filter_type=SET)¶
Given the following table:
| id | label | facet | age | weight_kg |
|---|---|---|---|---|
| 1 | Alice the Lion | mammal | 35 | 180 |
| 2 | Bob the Tiger | mammal | 12 | 160 |
| 3 | Charlie the Eagle | bird | 8 | 4 |
| 4 | Diana the Dolphin | fish | 15 | 300 |
| 5 | Emma the Frog | amphibian | 5 | 1 |
What is the last weight_kg for Alice the Lion, Bob the Tiger, and Charlie the Eagle?
Analysis:
- Filter: SET (cardinality=3, using labels)
- Filtered Rows: [Alice the Lion (35, 180), Bob the Tiger (12, 160), Charlie the Eagle (8, 4)]
- Operation: FIRST_OR_LAST, operation_variant = "last"
- Target Column: weight_kg
- Last Row After Filtering: Charlie the Eagle with weight_kg=4
- Expected Answer: 4
MODE with FACET Filter (format=MARKDOWN, operation=SUM_MODE_MEDIAN, filter_type=FACET)¶
Given the following table:
| id | label | facet | age |
|---|---|---|---|
| 1 | Alice the Lion | mammal | 15 |
| 2 | Bob the Tiger | mammal | 15 |
| 3 | Charlie the Cat | mammal | 15 |
| 4 | Diana the Dolphin | fish | 20 |
| 5 | Emma the Eagle | bird | 25 |
What is the most common age for mammal animals?
Analysis:
- Filter: FACET equality (facet == "mammal")
- Filtered Rows: [Alice (15), Bob (15), Charlie (15)] = 3 rows
- Operation: SUM_MODE_MEDIAN, operation_variant = "mode" (chosen at random from all three)
- Target Column: age
- Mode Calculation: Counter([15, 15, 15]) → 15 appears 3 times
- Expected Answer: 15
MEDIAN with NUMERIC_COMPARISON Filter (format=FIXED_WIDTH, operation=SUM_MODE_MEDIAN, filter_type=NUMERIC_COMPARISON)¶
Example 1: Odd-count median (no skewing needed)
Given the following table:
id label facet weight_kg
-- ----- ----- ----------
1 Alice the Lion mammal 150
2 Bob the Tiger mammal 165
3 Charlie the Ox mammal 170
4 Diana the Fish fish 100
5 Emma the Eagle bird 3
What is the median weight_kg for animals with weight_kg greater than 100?
Analysis:
- Filter: NUMERIC_COMPARISON (weight_kg > 100)
- Filtered Rows: [Alice (150), Bob (165), Charlie (170)] = 3 rows (odd count)
- Operation: SUM_MODE_MEDIAN, operation_variant = "median"
- Target Column: weight_kg
- Median Calculation: sorted([150, 165, 170]) → middle value = 165
- Expected Answer: 165
Example 2: Even-count median with skewing
Given the following table:
id label facet age
-- ----- ----- ---
1 Alice the Lion mammal 10
2 Bob the Tiger mammal 16
3 Charlie the Ox mammal 20
4 Diana the Deer mammal 25
What is the median age for mammal animals?
Analysis:
- Filter: FACET equality (facet == "mammal")
- Filtered Rows: [Alice (10), Bob (16), Charlie (20), Diana (25)] = 4 mammals (even count; the table shows post-skew values)
- Operation: SUM_MODE_MEDIAN, operation_variant = "median"
- Target Column: age
- Original middle values: ages[1]=15, ages[2]=20, sum=35 (odd)
- Skewing applied: ages[1] incremented to 16, sum now=36 (even)
- Median Calculation: (16 + 20) / 2 = 18
- Expected Answer: 18 (integer, unambiguous)
Operation Complexity Analysis¶
Parsing and Format Recognition¶
Level 1 - CSV Parsing:
- Requires: Delimiter-based column extraction
- Simplest format (delimiter is unambiguous)
- Example: "Split by comma, extract values"
Level 2 - Markdown Parsing:
- Requires: Pipe-delimiter recognition, header/separator row handling
- Moderate complexity (pipes appear in content, need column alignment)
- Example: "Recognize pipes, skip separator row with dashes"
Level 3 - Fixed-Width Parsing:
- Requires: Spatial column alignment, variable whitespace handling
- Higher complexity (no delimiters, must align by position)
- Example: "Identify column boundaries from header alignment"
Level 4 - JSON Parsing:
- Requires: JSON structure understanding, key-value extraction
- Moderate complexity (explicit keys, but JSON syntax overhead)
- Example: "Extract array, iterate objects, access by key"
Filtering Complexity¶
Level 1 - No Filter:
- Requires: Including all rows
- Simplest (all rows participate in operation)
Level 2 - Facet Equality/Inequality:
- Requires: String comparison and set membership logic
- Moderate (simple == or != operation on categorical column)
Level 3 - Numeric Comparison:
- Requires: Numeric value extraction, comparison operator application
- Moderate-High (>, <, >=, <= require different boundary semantics)
Level 4 - Numeric Range:
- Requires: Two comparison conditions, compound filtering logic
- Highest complexity (must maintain bounds correctly, handle edge cases)
Operation Complexity¶
Level 1 - COUNT:
- Requires: Row counting after filtering
- Simplest aggregation (linear scan of filtered rows)
Level 2 - MIN_OR_MAX:
- Requires: Numeric comparison across values, extremum selection
- Moderate (single pass to find extremum, two variants)
Level 3 - FIRST_OR_LAST / LAST / FIRST:
- Requires: Row ordering understanding, boundary access
- Moderate under low context pressure (requires knowing which row is first/last in the filtered result)
- LAST degrades sharply under context pressure; FIRST does not (primacy zone immunity)
Level 4 - SUM_MODE_MEDIAN:
- Requires: Multiple aggregation implementations, correct variant selection
- Highest complexity (sum requires summation, mode requires frequency analysis, median requires sorting)
Cognitive Skills Tested¶
Primary Cognitive Capabilities¶
- Structured Data Parsing: Understanding tabular formats and extracting rows/columns
- Format Invariance: Recognizing equivalent data across different syntactic representations
- Categorical Reasoning: Understanding facet schemes and applying equality/inequality filters
- Numeric Reasoning: Applying comparison operators, handling ranges and bounds
- Aggregation Logic: Computing count, sum, mode, median, min, max correctly
- Column Selection: Identifying target columns for operations from descriptions
- Row Filtering: Understanding and applying complex filtering conditions
Executive Function Skills¶
- Systematic Processing: Applying multi-step operations (filter then aggregate) in order
- Boundary Handling: Correctly handling edge cases (empty filters, single rows, ties in mode)
- Quantitative Reasoning: Understanding aggregation semantics and when variants apply
- Attention to Detail: Handling syntax variations without semantic confusion
Applications¶
This test manifold evaluates capabilities essential for:
- Data Analysis: Extracting insights from tabular datasets
- Business Intelligence: Processing structured data for decision-making
- Scientific Computing: Parsing and analyzing experimental results in tables
- Database Querying: Understanding SQL-like filtering and aggregation operations
- Spreadsheet Operations: Working with Excel/Google Sheets data structures
- Information Extraction: Pulling specific data from structured documents
- Report Generation: Querying and summarizing tabular information
- Data Validation: Checking consistency and properties of structured datasets