Sort Test Manifold¶
Overview¶
The Sort test manifold evaluates a model's ability to sort collections of words alphabetically despite mechanical difficulties in text processing. Models receive space-separated word lists, apply case-insensitive alphabetical ordering, and output the sorted result while coping with obstacles such as case normalization, punctuation handling, word duplication, and varied vocabulary sources.
Key Discovery: Analysis of model failures revealed that sorting difficulty stems primarily from mechanical string-processing errors (case conversion failures, punctuation confusion, character-by-character comparison mistakes) rather than from combinatorial reasoning complexity. The mechanical difficulty system controls these obstacles explicitly so that each aspect of string-processing capability can be isolated and measured.
Task Description¶
Models are presented with collections of unsorted words and must sort them alphabetically in a case-insensitive manner. The task requires lexicographic ordering, case normalization, and format transformation while handling distractors in the form of mixed-case words and potential duplications that increase cognitive complexity.
Key Features:
- Mechanical Difficulty Control: Explicit control over normalization to isolate string processing failures
- Lexicographic Ordering: Sorting words alphabetically regardless of case
- Case Normalization: Converting all output to lowercase for consistency
- Symbol Filtering: Optional removal of non-alphabetic characters (hyphens, apostrophes)
- Format Transformation: Converting space-separated input to newline-separated output
- Dictionary Sampling: Using real dictionary words with realistic punctuation
- Run-Length Control: Selecting consecutive dictionary segments for coherent word groups
- Case Mutation: Random case transformations to test case-insensitive sorting (when enabled)
- Word Duplication: Controlled repetition of words to increase processing complexity
- Uniqueness Validation: Ensuring generated test cases are distinct
Test Case Generation¶
Algorithm Overview¶
The generator creates challenging sorting scenarios through a systematic process:
- Dictionary Loading: Load and pre-sort words from dictionary file for efficient sampling
- Mechanical Normalization: Apply force_lowercase and/or remove_symbols to entire word list
- Dictionary Deduplication: Remove duplicates created by normalization (e.g., "Reader's" and "readers" → "readers")
- Run-Length Sampling: Select consecutive word segments from normalized dictionary
- Word Collection: Accumulate words until target length is reached through multiple runs
- Duplication Application: Apply controlled word duplication based on probability
- Case Mutation: Apply random case transformations (only if not force_lowercase mode)
- Collection Shuffling: Randomize word order to eliminate positional sorting cues
- Target Generation: Create lowercase, alphabetically sorted target output
- Uniqueness Checking: Ensure generated test cases haven't been seen before
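The normalization and deduplication steps above can be sketched as follows. This is a minimal illustration, not the generator's actual code: the function name `normalize_dictionary` and its signature are assumptions, and only the regex `[^a-zA-Z]` and the two flag names come from this document.

```python
import re

def normalize_dictionary(words, force_lowercase, remove_symbols):
    """Hypothetical helper: normalize a word list, then drop duplicates."""
    normalized = []
    for word in words:
        if remove_symbols:
            # Strip hyphens, apostrophes, and any other non-alphabetic characters.
            word = re.sub(r"[^a-zA-Z]", "", word)
        if force_lowercase:
            word = word.lower()
        if word:  # a word made only of symbols would become empty
            normalized.append(word)
    # dict.fromkeys keeps the first occurrence of each word; the result is
    # then sorted case-insensitively so that runs can be sampled from it.
    return sorted(dict.fromkeys(normalized), key=str.lower)

words = ["Reader's", "readers", "hour-long", "lawman"]
print(normalize_dictionary(words, force_lowercase=True, remove_symbols=True))
# ['hourlong', 'lawman', 'readers']
```

With both flags enabled, "Reader's" and "readers" collapse into a single entry, which is why deduplication has to run after normalization rather than before it.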
Dictionary Sampling Strategy¶
Run-Length Approach: Instead of random individual word selection, the system uses consecutive dictionary segments:
- Coherence Benefit: Consecutive words often share semantic or morphological relationships
- Efficiency Gain: Reduces random access patterns in large dictionaries
- Controlled Variety: run_length parameter controls segment size (1 = random, larger = more coherent)
- Multiple Runs: System performs multiple sampling runs until target word count is reached
Word Accumulation Process:
1. Random Start: Select random starting position in sorted dictionary
2. Segment Extraction: Copy up to run_length consecutive words
3. Length Constraint: Respect remaining word count needed
4. Iteration: Repeat until target length is achieved
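The four accumulation steps can be sketched as a short loop. The function name `sample_runs` and the explicit `rng` argument are assumptions made for illustration; the document only specifies the behavior, not the interface.

```python
import random

def sample_runs(dictionary, length, run_length, rng):
    """Hypothetical sketch: collect `length` words from consecutive runs.

    `dictionary` is assumed to be a non-empty, pre-sorted list of words.
    """
    collected = []
    while len(collected) < length:                         # 4. iterate until full
        start = rng.randrange(len(dictionary))             # 1. random start
        remaining = length - len(collected)                # 3. length constraint
        take = min(run_length, remaining)
        collected.extend(dictionary[start:start + take])   # 2. segment extraction
    return collected

dictionary = ["abandon", "ability", "able", "about", "above", "absence"]
sample = sample_runs(dictionary, length=4, run_length=2, rng=random.Random(0))
print(len(sample))  # 4
```

A run that starts near the end of the dictionary yields fewer than `run_length` words; the loop simply starts another run, which matches the "up to run_length" wording above.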
Transformation System¶
Case Mutation: Random case transformations applied with prob_mutation:
- lowercase: Convert entire word to lowercase
- UPPERCASE: Convert entire word to uppercase
- Title Case: Capitalize first letter, lowercase remainder
- Application: Applied to the words in the collection after duplication
Word Duplication: Controlled repetition with prob_duplication:
- Timing: Applied before adding each word to collection
- Effect: Creates duplicate entries that must be sorted correctly
- Cognitive Load: Tests whether models handle repeated elements properly
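A minimal sketch of the two transformations, assuming that duplication adds one extra copy of a word and that a mutated word receives one of the three case forms chosen uniformly at random. Neither detail is stated in this document, and the function name `duplicate_and_mutate` is hypothetical.

```python
import random

# lowercase, UPPERCASE, Title Case (first letter upper, remainder lower)
MUTATIONS = (str.lower, str.upper, str.capitalize)

def duplicate_and_mutate(words, prob_duplication, prob_mutation, rng):
    """Hypothetical sketch: duplicate words first, then mutate case."""
    collection = []
    for word in words:
        if rng.random() < prob_duplication:
            collection.append(word)  # assumed: one extra copy per duplication
        collection.append(word)
    # Case mutation runs over the collection after duplication.
    return [
        rng.choice(MUTATIONS)(w) if rng.random() < prob_mutation else w
        for w in collection
    ]

rng = random.Random(0)
print(duplicate_and_mutate(["cat", "dog"], 1.0, 0.0, rng))
# ['cat', 'cat', 'dog', 'dog']
```

`str.capitalize` is used rather than `str.title` because it lowercases everything after the first character, so "hour-long" becomes "Hour-long" and not "Hour-Long".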
Output Format Transformation¶
Input Format: Space-separated words with "Input: " prefix
Input: apple Banana CHERRY dog
Target Format: Newline-separated lowercase words in alphabetical order
apple
banana
cherry
dog
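The two formats can be produced from a word list in a few lines. This is a sketch under the assumption that the target is a plain code-point sort of the lowercased words; the function name `build_case` is hypothetical.

```python
def build_case(words):
    """Hypothetical sketch: build the input and target strings for one test case."""
    prompt = "Input: " + " ".join(words)
    # Lowercase every word first, then sort, then join with newlines.
    target = "\n".join(sorted(w.lower() for w in words))
    return prompt, target

prompt, target = build_case(["dog", "CHERRY", "apple", "Banana"])
print(prompt)   # Input: dog CHERRY apple Banana
print(target.split("\n"))  # ['apple', 'banana', 'cherry', 'dog']
```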
Configuration Parameters¶
Mechanical Difficulty System¶
The mechanical parameter controls normalization to isolate different types of string processing failures:
from enum import IntEnum

class MechanicalDifficulty(IntEnum):
    NONE = 0            # No normalization (hardest: case mutations + symbols)
    SYMBOLS_ONLY = 1    # Remove symbols only, keep case mutations
    LOWERCASE_ONLY = 2  # Force lowercase, keep symbols
    BOTH = 3            # Both normalizations (easiest: clean lowercase alphas)
Bit Flags:
- Bit 0: remove_symbols (0=no, 1=yes) - Strip non-alphabetic chars (hyphens, apostrophes, etc.)
- Bit 1: force_lowercase (0=no, 1=yes) - Pre-normalize all words to lowercase
Mode Effects:
| Mode | force_lowercase | remove_symbols | Tests | Difficulty |
|---|---|---|---|---|
| 0 (NONE) | No | No | Case handling + Symbol ordering | Hardest |
| 1 (SYMBOLS_ONLY) | No | Yes | Case handling only | Hard |
| 2 (LOWERCASE_ONLY) | Yes | No | Symbol ordering only | Medium |
| 3 (BOTH) | Yes | Yes | Pure alphabetical sorting | Easiest |
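The Mode Effects table can be reproduced by decoding the mode value bitwise. The sketch below follows the table and the enum values (mode 1 removes symbols, mode 2 forces lowercase); the function name `decode_mode` is an assumption, not part of the documented API.

```python
from enum import IntEnum

class MechanicalDifficulty(IntEnum):
    NONE = 0
    SYMBOLS_ONLY = 1
    LOWERCASE_ONLY = 2
    BOTH = 3

def decode_mode(mode):
    """Hypothetical helper: return (force_lowercase, remove_symbols) for a mode."""
    mode = MechanicalDifficulty(mode)   # raises ValueError outside 0-3
    remove_symbols = bool(mode & 1)     # low bit, set for modes 1 and 3
    force_lowercase = bool(mode & 2)    # high bit, set for modes 2 and 3
    return force_lowercase, remove_symbols

for mode in MechanicalDifficulty:
    print(mode.name, decode_mode(mode))
# NONE (False, False)
# SYMBOLS_ONLY (False, True)
# LOWERCASE_ONLY (True, False)
# BOTH (True, True)
```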
Generation Schema (SortGenerationParams)¶
from pydantic import BaseModel

class SortGenerationParams(BaseModel):
    count: int               # Number of test cases to generate (> 0)
    length: int              # Number of words in each test case (> 0)
    run_length: int          # Maximum consecutive words from dictionary (> 0)
    prob_mutation: float     # Probability of case mutation (0.0-1.0, default: 0.3)
    prob_duplication: float  # Probability of word duplication (0.0-1.0, default: 0.2)
    mechanical: int          # Mechanical difficulty mode (0-3, default: 0)
    # Validation: prob_mutation must be 0.0 when force_lowercase is enabled (mechanical in [2, 3])
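The constraints noted in the schema comments can be checked as shown below. To stay independent of any particular pydantic version, this sketch uses a standard-library dataclass as a stand-in; the class name `SortParamsSketch` and the error messages are assumptions, and the real model may enforce these rules differently.

```python
from dataclasses import dataclass

@dataclass
class SortParamsSketch:
    """Hypothetical stand-in for SortGenerationParams with explicit checks."""
    count: int
    length: int
    run_length: int
    prob_mutation: float = 0.3
    prob_duplication: float = 0.2
    mechanical: int = 0

    def __post_init__(self):
        for name in ("count", "length", "run_length"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be > 0")
        for name in ("prob_mutation", "prob_duplication"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0")
        if self.mechanical not in (0, 1, 2, 3):
            raise ValueError("mechanical must be within 0-3")
        # Modes 2 and 3 force lowercase, so case mutation must be disabled.
        if self.mechanical in (2, 3) and self.prob_mutation != 0.0:
            raise ValueError("prob_mutation must be 0.0 when mechanical is 2 or 3")

params = SortParamsSketch(count=10, length=4, run_length=2,
                          prob_mutation=0.0, mechanical=3)
```

Leaving `prob_mutation` at its default of 0.3 while selecting mode 2 or 3 raises `ValueError` in this sketch, so callers have to set it to 0.0 explicitly.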
Result Schema (SortTestCaseResult)¶
class SortTestCaseResult(BaseModel):
    input: str   # Space-separated unsorted words with "Input: " prefix
    target: str  # Newline-separated sorted words (lowercase)
Example Test Cases¶
Mode 0: NONE - Full Mechanical Difficulty (mechanical=0, length=4, prob_mutation=0.3)¶
Challenges: Case mutations + Symbol ordering (hyphens, apostrophes)
Input: Reader's lawman savage HOUR-LONG
Mechanical Obstacles:
- Mixed case: Reader's (title), lawman (lower), savage (lower), HOUR-LONG (upper)
- Symbols: apostrophe in "Reader's", hyphen in "HOUR-LONG"
- Case comparison: 'R' vs 'l' vs 's' vs 'H' must be compared case-insensitively
- Symbol ordering: Where do '-' and '\'' sort relative to letters?
Common Failures:
- Wrong: "lawman, Reader's, savage, HOUR-LONG" (case comparison error: HOUR-LONG misplaced after the lowercase words)
- Wrong: "hour-long, lawman, readers, savage" (apostrophe removed: "Reader's" → "readers")
- Wrong: "HOUR-LONG, lawman, savage, Reader's" (symbol ordering confusion)
Expected Output:
hour-long
lawman
reader's
savage
Mode 1: SYMBOLS_ONLY - Case Difficulty Only (mechanical=1, length=4, prob_mutation=0.3)¶
Challenges: Case mutations, no symbol confusion
Dictionary normalized: "Reader's" → "Readers", "HOUR-LONG" → "HOURLONG"
Input: Readers lawman savage HOURLONG
Mechanical Obstacles:
- Mixed case only (no symbols to confuse ordering)
- Character comparison: 'H' vs 'l' vs 'R' vs 's'
Common Failures:
- Wrong: "lawman, Readers, savage, HOURLONG" (case comparison error)
- Wrong: "Readers, lawman, savage, HOURLONG" (compared 'R' < 'l' incorrectly)
Expected Output:
hourlong
lawman
readers
savage
Mode 2: LOWERCASE_ONLY - Symbol Difficulty Only (mechanical=2, length=4, prob_mutation=0.0)¶
Challenges: Symbol ordering, no case confusion
Dictionary normalized: All lowercase from start, prob_mutation forced to 0.0
Input: reader's lawman savage hour-long
Mechanical Obstacles:
- Symbol ordering: Where do '-' and '\'' sort?
- Character comparison with symbols: for "hour-long" vs "lawman", the first characters 'h' and 'l' already decide the order; a symbol such as '-' only matters when all preceding characters match
Common Failures:
- Wrong: "lawman, reader's, savage, hour-long" (put "lawman" before "hour-long")
- Wrong: "hour-long, lawman, readers, savage" (apostrophe removed/ignored)
Expected Output:
hour-long
lawman
reader's
savage
Mode 3: BOTH - Pure Alphabetical Sorting (mechanical=3, length=4, prob_mutation=0.0)¶
Challenges: None (baseline capability test)
Dictionary normalized: All lowercase, symbols removed, prob_mutation forced to 0.0
Input: readers lawman savage hourlong
Mechanical Obstacles: None - clean lowercase alphabetic words only
Expected Output:
hourlong
lawman
readers
savage
Basic Alphabetical Sorting (mechanical=3, length=5, run_length=3, prob_mutation=0.0, prob_duplication=0.0)¶
Input: elephant dog cat bird apple
Sorting Process:
- Original: [elephant, dog, cat, bird, apple]
- Lowercase: [elephant, dog, cat, bird, apple]
- Alphabetical: [apple, bird, cat, dog, elephant]
Expected Output:
apple
bird
cat
dog
elephant
Case-Insensitive Sorting (length=4, prob_mutation=0.8)¶
Input: ZEBRA apple Banana cherry
Case Analysis:
- ZEBRA: uppercase mutation
- apple: lowercase (original)
- Banana: title case mutation
- cherry: lowercase (original)
Sorting Process:
- Case-insensitive sort: [apple, Banana, cherry, ZEBRA]
- Lowercase output: [apple, banana, cherry, zebra]
Expected Output:
apple
banana
cherry
zebra
Word Duplication Challenge (length=6, prob_duplication=0.5)¶
Input: cat dog cat bird apple dog
Duplication Analysis:
- cat: appears twice
- dog: appears twice
- bird: appears once
- apple: appears once
Sorting Process:
- All instances sorted: [apple, bird, cat, cat, dog, dog]
Expected Output:
apple
bird
cat
cat
dog
dog
Mixed Complexity (length=8, run_length=2, prob_mutation=0.4, prob_duplication=0.3)¶
Input: HOUSE tree HOUSE garden flower Tree mountain river
Transformation Analysis:
- HOUSE: uppercase mutation, duplicated
- tree/Tree: different cases of same word
- garden: lowercase (original)
- flower: lowercase (original)
- mountain: lowercase (original)
- river: lowercase (original)
Sorting Process:
- Case-insensitive grouping: [flower, garden, HOUSE, HOUSE, mountain, river, Tree, tree]
- Lowercase output: [flower, garden, house, house, mountain, river, tree, tree]
Expected Output:
flower
garden
house
house
mountain
river
tree
tree
Run-Length Coherence (length=6, run_length=6)¶
Input: abandon ability able about above absence
Dictionary Coherence: All words from same alphabetical region (consecutive 'ab-' words)
Sorting Process:
- Already near-sorted due to dictionary source
- Final order: [abandon, ability, able, about, above, absence]
Expected Output:
abandon
ability
able
about
above
absence
Mechanical Obstacles System¶
Primary Obstacle: Case Normalization (Controlled by force_lowercase)¶
When Disabled (mechanical=0 or 1): Models must handle mixed-case comparisons
Random case transformations (prob_mutation) test case-insensitive sorting:
- Visual Complexity: Mixed case creates visual noise (e.g., "ZEBRA" vs "apple")
- Character Comparison: Must correctly compare 'A'=65 vs 'a'=97 vs 'Z'=90 vs 'z'=122
- Common Failures:
  - Incomplete lowercase conversion: "deTOURS" instead of "detours"
  - Wrong position comparison: comparing wrong character indices
  - Typos during conversion: "cockeyed" → "cokeyed"
When Enabled (mechanical=2 or 3): Words pre-normalized to lowercase, prob_mutation forced to 0.0
Secondary Obstacle: Symbol Ordering (Controlled by remove_symbols)¶
When Disabled (mechanical=0 or 2): Models must handle punctuation in words
Dictionary contains realistic words with symbols (hyphens, apostrophes):
- Symbol ASCII Ordering: '-' (45) and '\'' (39) sort before letters (97+)
- Common Failures:
  - Symbol removal: "Reader's" → "readers" (apostrophe stripped)
  - Wrong symbol ordering: Confusion about where "hour-long" sorts relative to "house"
  - Verbose reasoning: Models spend 25K+ chars reasoning about ASCII codes and still fail
When Enabled (mechanical=1 or 3): Symbols stripped via regex [^a-zA-Z] before word list creation
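The ASCII ordering described above is easy to confirm with Python's default string comparison, which orders by code point. The word list here is illustrative and assumes lowercase input, as in modes that keep symbols.

```python
# Apostrophe and hyphen both have lower code points than any lowercase letter.
print(ord("'"), ord("-"), ord("a"))  # 39 45 97

words = ["readers", "reader's", "hourly", "hour-long"]
ordered = sorted(words)
print(ordered)
# ['hour-long', 'hourly', "reader's", 'readers']
```

"hour-long" precedes "hourly" because '-' (45) is compared against 'l' (108) at the fifth character, and "reader's" precedes "readers" because '\'' (39) is compared against 's' (115) at the seventh.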
Tertiary Obstacle: Word Duplications¶
Repeated words that test duplicate handling and attention:
- Duplicate Processing: Models must sort all instances of repeated words correctly
- Attention Challenge: Duplicates may cause confusion about unique vs. repeated elements
- Ordering Consistency: All instances of same word must appear together in sorted output
- Controlled by: prob_duplication parameter (independent of mechanical mode)
Strategic Distribution¶
Words are shuffled after all transformations to ensure:
- No Positional Cues: Original dictionary order is completely randomized
- Mixed Complexity: Case mutations (if enabled) and duplications distributed throughout input
- Mechanical Consistency: Processing difficulty maintained across entire word list
Cognitive Skills Tested¶
Mechanical String Processing (Core Capability)¶
- Character-by-Character Comparison: Accurate lexicographic ordering at character level
- Case Normalization: Converting uppercase to lowercase without errors (Mode 0, 1)
- Symbol Handling: Correctly ordering words containing hyphens, apostrophes (Mode 0, 2)
- ASCII Code Reasoning: Understanding character ordering (e.g., '-' < 'a', '\'' < 'a')
Algorithmic Skills¶
- Lexicographic Ordering: Understanding alphabetical sequence relationships
- Pattern Recognition: Identifying alphabetical patterns across varied presentations
- Working Memory: Maintaining sort criteria while processing word sequences
- Attention to Detail: Accurate position-specific character comparison
- Duplicate Handling: Correctly processing repeated elements in collections
- Systematic Processing: Applying consistent sorting rules across entire collections
Format Transformation¶
- Input Parsing: Converting space-separated words to individual tokens
- Output Formatting: Generating newline-separated results correctly
Key Failure Modes Observed¶
Based on empirical analysis of model failures:
- Wrong Alphabetical Comparisons (even on trivial 4-word cases):
  - Comparing wrong character positions (e.g., "annie" after "arimathea")
  - Incorrect character ordering (e.g., "lawman" after "savage" when 'l' < 's')
- Case Conversion Errors:
  - Incomplete conversion: "deTOURS" instead of "detours"
  - Character typos: "cockeyed" → "cokeyed"
- Symbol Processing Errors:
  - Apostrophe removal: "Reader's" → "readers"
  - Symbol ordering confusion despite explicit ASCII reasoning
- Word Omission: Catastrophic failures where words completely dropped from output
Critical Insight: Even models that use explicit <think> reasoning with ASCII code comparisons (e.g., "'e' (101) < 'o' (111)") still make mechanical errors. Verbose chain-of-thought does not guarantee correct character-level string processing.
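One way to bucket a model's answer into the failure modes listed above is a small heuristic classifier. This is purely illustrative: the function name `diagnose`, the category labels, and the order of the checks are assumptions, and nothing here describes how the failures in this document were actually analyzed.

```python
from collections import Counter

def diagnose(words, output_lines):
    """Hypothetical heuristic: label an answer with a coarse failure mode."""
    expected = sorted(w.lower() for w in words)
    if output_lines == expected:
        return "correct"
    if any(line != line.lower() for line in output_lines):
        return "case conversion error"           # e.g. "deTOURS"
    if Counter(output_lines) == Counter(expected):
        return "wrong alphabetical comparison"   # right words, wrong order
    if len(output_lines) < len(expected):
        return "word omission"
    return "altered words (symbol removal or typo)"

words = ["Reader's", "lawman", "savage", "HOUR-LONG"]
print(diagnose(words, ["hour-long", "lawman", "readers", "savage"]))
# altered words (symbol removal or typo)
```

The checks run from most to least specific, so an answer with several problems is reported under the first matching category only.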
Applications¶
This test manifold evaluates capabilities essential for:
- Data Organization: Sorting textual data in databases and applications
- Information Retrieval: Organizing search results and directory listings
- Text Processing: Alphabetizing content in documents and reports
- User Interface: Implementing sort functionality in applications
- Data Validation: Ensuring consistent ordering in data processing pipelines
- Content Management: Organizing textual content for presentation
- Search Optimization: Preparing sorted indexes for efficient searching
- Quality Assurance: Validating sort implementations in software systems
- Document Processing: Organizing references, glossaries, and indexes
- Database Operations: Implementing ORDER BY functionality and data organization