Sort Test Manifold

Overview

The Sort test manifold evaluates a model's ability to perform alphabetical sorting of word collections while handling various mechanical difficulties in text processing. Models must process space-separated word lists, apply case-insensitive alphabetical ordering, and output sorted results while managing mechanical obstacles such as case normalization, punctuation handling, word duplications, and varied vocabulary sources.

Key Discovery: Analysis of model failures revealed that sorting difficulty stems primarily from mechanical string processing errors (case conversion failures, punctuation confusion, character-by-character comparison mistakes) rather than combinatorial reasoning complexity. The mechanical difficulty system explicitly controls these mechanical obstacles to isolate and measure different aspects of string processing capability.

Task Description

Models are presented with collections of unsorted words and must sort them alphabetically in a case-insensitive manner. The task requires lexicographic ordering, case normalization, and format transformation while handling distractors in the form of mixed-case words and potential duplications that increase cognitive complexity.

Key Features:
  • Mechanical Difficulty Control: Explicit control over normalization to isolate string processing failures
  • Lexicographic Ordering: Sorting words alphabetically regardless of case
  • Case Normalization: Converting all output to lowercase for consistency
  • Symbol Filtering: Optional removal of non-alphabetic characters (hyphens, apostrophes)
  • Format Transformation: Converting space-separated input to newline-separated output
  • Dictionary Sampling: Using real dictionary words with realistic punctuation
  • Run-Length Control: Selecting consecutive dictionary segments for coherent word groups
  • Case Mutation: Random case transformations to test case-insensitive sorting (when enabled)
  • Word Duplication: Controlled repetition of words to increase processing complexity
  • Uniqueness Validation: Ensuring generated test cases are distinct

Test Case Generation

Algorithm Overview

The generator creates challenging sorting scenarios through a systematic process:

  1. Dictionary Loading: Load and pre-sort words from dictionary file for efficient sampling
  2. Mechanical Normalization: Apply force_lowercase and/or remove_symbols to entire word list
  3. Dictionary Deduplication: Remove duplicates created by normalization (e.g., "Reader's" and "readers" → "readers")
  4. Run-Length Sampling: Select consecutive word segments from normalized dictionary
  5. Word Collection: Accumulate words until target length is reached through multiple runs
  6. Duplication Application: Apply controlled word duplication based on probability
  7. Case Mutation: Apply random case transformations (only if not force_lowercase mode)
  8. Collection Shuffling: Randomize word order to eliminate positional sorting cues
  9. Target Generation: Create lowercase, alphabetically sorted target output
  10. Uniqueness Checking: Ensure generated test cases haven't been seen before
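
Step 10 can be pictured as a retry loop around whatever builds a single case. The sketch below is illustrative only: `generate_unique`, the `build_case` callable, and the `max_attempts` guard are hypothetical names, and keying uniqueness on the input text is an assumption rather than something the description above specifies.

```python
import random
from typing import Callable, List, Set, Tuple

Case = Tuple[str, str]  # (input text, target text)


def generate_unique(count: int,
                    build_case: Callable[[random.Random], Case],
                    rng: random.Random,
                    max_attempts: int = 10_000) -> List[Case]:
    """Collect `count` distinct cases, discarding any input seen before."""
    seen: Set[str] = set()
    cases: List[Case] = []
    attempts = 0
    while len(cases) < count:
        attempts += 1
        if attempts > max_attempts:
            # Hypothetical guard: stop instead of looping forever when the
            # parameters cannot yield enough distinct cases.
            raise RuntimeError("could not generate enough unique test cases")
        input_text, target = build_case(rng)
        if input_text in seen:
            continue  # step 10: already generated, try again
        seen.add(input_text)
        cases.append((input_text, target))
    return cases
```

The guard matters for small configurations: a short `length` over a small dictionary has only a limited number of distinct shuffles, so an unbounded loop could never reach a large `count`.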

Dictionary Sampling Strategy

Run-Length Approach: Instead of random individual word selection, the system uses consecutive dictionary segments:
  • Coherence Benefit: Consecutive words often share semantic or morphological relationships
  • Efficiency Gain: Reduces random access patterns in large dictionaries
  • Controlled Variety: run_length parameter controls segment size (1 = random, larger = more coherent)
  • Multiple Runs: System performs multiple sampling runs until target word count is reached

Word Accumulation Process:
  1. Random Start: Select random starting position in sorted dictionary
  2. Segment Extraction: Copy up to run_length consecutive words
  3. Length Constraint: Respect remaining word count needed
  4. Iteration: Repeat until target length is achieved
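
The accumulation process above can be sketched in a few lines. This is a minimal illustration, not the generator's actual code: the function name `sample_runs` and the use of `random.Random` are assumptions.

```python
import random
from typing import List, Sequence


def sample_runs(sorted_words: Sequence[str], length: int, run_length: int,
                rng: random.Random) -> List[str]:
    """Accumulate consecutive dictionary segments until `length` words are collected."""
    collected: List[str] = []
    while len(collected) < length:                       # 4. iterate until full
        start = rng.randrange(len(sorted_words))         # 1. random start
        take = min(run_length, length - len(collected))  # 3. length constraint
        # 2. segment extraction; the slice is shorter near the end of the list
        collected.extend(sorted_words[start:start + take])
    return collected
```

In this sketch two runs may overlap, so the same word can be collected twice even before any deliberate duplication is applied; whether the real generator prevents that is not stated here.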

Transformation System

Case Mutation: Random case transformations applied with prob_mutation:
  • lowercase: Convert entire word to lowercase
  • UPPERCASE: Convert entire word to uppercase
  • Title Case: Capitalize first letter, lowercase remainder
  • Application: Applied to each word in the final collection, after duplication

Word Duplication: Controlled repetition with prob_duplication:
  • Timing: Applied before adding each word to collection
  • Effect: Creates duplicate entries that must be sorted correctly
  • Cognitive Load: Tests whether models handle repeated elements properly
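
A rough sketch of both transformations follows. The helper names `duplicate_words` and `mutate_case` are hypothetical, and two details are assumptions: that a duplicate counts toward the target length, and that the three case styles are chosen uniformly.

```python
import random
from typing import List


def duplicate_words(words: List[str], prob_duplication: float,
                    rng: random.Random) -> List[str]:
    """Emit some words twice while keeping the original collection length."""
    target_len = len(words)
    out: List[str] = []
    for word in words:
        if len(out) >= target_len:
            break
        out.append(word)
        # Assumption: a duplicate takes one of the remaining slots.
        if len(out) < target_len and rng.random() < prob_duplication:
            out.append(word)
    return out


def mutate_case(word: str, prob_mutation: float, rng: random.Random) -> str:
    """With probability prob_mutation, rewrite the word in one of three case styles."""
    if rng.random() >= prob_mutation:
        return word
    style = rng.choice(("lower", "upper", "title"))
    if style == "lower":
        return word.lower()
    if style == "upper":
        return word.upper()
    # str.capitalize() uppercases the first character and lowercases the rest,
    # unlike str.title(), which would turn "reader's" into "Reader'S".
    return word.capitalize()
```

Mutating each word independently after duplication means two copies of the same word can end up with different casing, as in the "tree" / "Tree" pair of the Mixed Complexity example.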

Output Format Transformation

Input Format: Space-separated words with "Input: " prefix

Input: apple Banana CHERRY dog

Target Format: Newline-separated lowercase words in alphabetical order

apple
banana
cherry
dog
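
The two formats can be produced and parsed with a few lines of Python. This is a sketch under the assumption that the target is the lowercased words sorted by code point; `format_case` and `parse_input` are illustrative names, not part of the documented API.

```python
from typing import List, Tuple

PREFIX = "Input: "


def format_case(words: List[str]) -> Tuple[str, str]:
    """Return (input_text, target_text) for an already shuffled word list."""
    input_text = PREFIX + " ".join(words)
    target = "\n".join(sorted(word.lower() for word in words))
    return input_text, target


def parse_input(text: str) -> List[str]:
    """Recover the word list from an input line."""
    if not text.startswith(PREFIX):
        raise ValueError("missing 'Input: ' prefix")
    return text[len(PREFIX):].split()


input_text, target = format_case(["apple", "Banana", "CHERRY", "dog"])
print(input_text)  # Input: apple Banana CHERRY dog
print(target)      # apple, banana, cherry, dog on separate lines
```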

Configuration Parameters

Mechanical Difficulty System

The mechanical parameter controls normalization to isolate different types of string processing failures:

from enum import IntEnum


class MechanicalDifficulty(IntEnum):
    NONE = 0           # No normalization (hardest: case mutations + symbols)
    SYMBOLS_ONLY = 1   # Remove symbols only, keep case mutations
    LOWERCASE_ONLY = 2 # Force lowercase, keep symbols
    BOTH = 3           # Both normalizations (easiest: clean lowercase alphas)

Bit Flags:
  • Bit 0 (value 1): remove_symbols (0=no, 1=yes) - Strip non-alphabetic chars (hyphens, apostrophes, etc.)
  • Bit 1 (value 2): force_lowercase (0=no, 1=yes) - Pre-normalize all words to lowercase

Mode Effects:

| Mode | force_lowercase | remove_symbols | Tests | Difficulty |
|---|---|---|---|---|
| 0 (NONE) | No | No | Case handling + Symbol ordering | Hardest |
| 1 (SYMBOLS_ONLY) | No | Yes | Case handling only | Hard |
| 2 (LOWERCASE_ONLY) | Yes | No | Symbol ordering only | Medium |
| 3 (BOTH) | Yes | Yes | Pure alphabetical sorting | Easiest |
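
As an illustration, the mode can be decoded into its two flags and applied to a dictionary as follows. The sketch follows the enum and the Mode Effects table (mode 1 strips symbols, mode 2 forces lowercase); `decode_mechanical` and `normalize_dictionary` are hypothetical helper names, and sorting by code point is an assumption.

```python
import re
from typing import Iterable, List, Tuple


def decode_mechanical(mechanical: int) -> Tuple[bool, bool]:
    """Return (force_lowercase, remove_symbols) for a mode in 0..3."""
    if mechanical not in (0, 1, 2, 3):
        raise ValueError("mechanical must be between 0 and 3")
    remove_symbols = bool(mechanical & 1)   # set in modes 1 and 3
    force_lowercase = bool(mechanical & 2)  # set in modes 2 and 3
    return force_lowercase, remove_symbols


def normalize_dictionary(words: Iterable[str], mechanical: int) -> List[str]:
    """Normalize every word, then drop the duplicates normalization creates."""
    force_lowercase, remove_symbols = decode_mechanical(mechanical)
    normalized = set()
    for word in words:
        if remove_symbols:
            word = re.sub(r"[^a-zA-Z]", "", word)
        if force_lowercase:
            word = word.lower()
        if word:  # skip entries that consisted only of symbols
            normalized.add(word)
    return sorted(normalized)


print(normalize_dictionary(["Reader's", "readers", "HOUR-LONG"], 3))
# ['hourlong', 'readers']
```

With mode 3, "Reader's" and "readers" collapse into a single entry, which is why deduplication has to run after normalization rather than before it.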

Generation Schema (SortGenerationParams)

from pydantic import BaseModel


class SortGenerationParams(BaseModel):
    count: int                      # Number of test cases to generate (> 0)
    length: int                     # Number of words in each test case (> 0)
    run_length: int                 # Maximum consecutive words from dictionary (> 0)
    prob_mutation: float = 0.3      # Probability of case mutation (0.0-1.0)
    prob_duplication: float = 0.2   # Probability of word duplication (0.0-1.0)
    mechanical: int = 0             # Mechanical difficulty mode (0-3)

    # Validation: prob_mutation must be 0.0 when force_lowercase is enabled (mechanical in [2, 3])
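
The constraints noted in the comments could be enforced along the following lines. To stay dependency-free this sketch uses a standard-library dataclass as a stand-in for the Pydantic model; `SortParamsSketch` is a hypothetical name, and rejecting an invalid combination (rather than silently coercing prob_mutation to 0.0) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SortParamsSketch:
    count: int
    length: int
    run_length: int
    prob_mutation: float = 0.3
    prob_duplication: float = 0.2
    mechanical: int = 0

    def __post_init__(self) -> None:
        for name in ("count", "length", "run_length"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be > 0")
        for name in ("prob_mutation", "prob_duplication"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(f"{name} must be within 0.0-1.0")
        if self.mechanical not in (0, 1, 2, 3):
            raise ValueError("mechanical must be between 0 and 3")
        # force_lowercase is active in modes 2 and 3, where case mutation
        # would undo the normalization.
        if self.mechanical in (2, 3) and self.prob_mutation != 0.0:
            raise ValueError("prob_mutation must be 0.0 when force_lowercase is enabled")


params = SortParamsSketch(count=10, length=4, run_length=2,
                          prob_mutation=0.0, mechanical=3)
```

Under this reading, a caller selecting mode 2 or 3 has to pass prob_mutation=0.0 explicitly, because the default of 0.3 would fail the check.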

Result Schema (SortTestCaseResult)

class SortTestCaseResult(BaseModel):
    input: str                                  # Space-separated unsorted words with "Input: " prefix
    target: str                                 # Newline-separated sorted words (lowercase)

Example Test Cases

Mode 0: NONE - Full Mechanical Difficulty (mechanical=0, length=4, prob_mutation=0.3)

Challenges: Case mutations + Symbol ordering (hyphens, apostrophes)

Input: Reader's lawman savage HOUR-LONG

Mechanical Obstacles:
  • Mixed case: Reader's (title), lawman (lower), savage (lower), HOUR-LONG (upper)
  • Symbols: apostrophe in "Reader's", hyphen in "HOUR-LONG"
  • Case comparison: 'L' vs 'l' vs 'R' vs 'r' vs 's' vs 'S' vs 'H'
  • Symbol ordering: Where do '-' and '\'' sort relative to letters?

Common Failures:
  • Wrong: "lawman, Reader's, savage, HOUR-LONG" (case comparison error: uppercase "HOUR-LONG" left at the end instead of sorting first)
  • Wrong: "hour-long, lawman, readers, savage" (apostrophe removed: "Reader's" → "readers")
  • Wrong: "HOUR-LONG, lawman, savage, Reader's" (symbol ordering confusion)

Expected Output:

hour-long
lawman
reader's
savage

Mode 1: SYMBOLS_ONLY - Case Difficulty Only (mechanical=1, length=4, prob_mutation=0.3)

Challenges: Case mutations, no symbol confusion

Dictionary normalized: "Reader's" → "Readers", "HOUR-LONG" → "HOURLONG"

Input: Readers lawman savage HOURLONG

Mechanical Obstacles:
  • Mixed case only (no symbols to confuse ordering)
  • Character comparison: 'H' vs 'l' vs 'R' vs 's'

Common Failures:
  • Wrong: "lawman, Readers, savage, HOURLONG" (case comparison error)
  • Wrong: "Readers, lawman, savage, HOURLONG" (case-sensitive comparison: uppercase 'R' placed before lowercase 'l')

Expected Output:

hourlong
lawman
readers
savage

Mode 2: LOWERCASE_ONLY - Symbol Difficulty Only (mechanical=2, length=4, prob_mutation=0.0)

Challenges: Symbol ordering, no case confusion

Dictionary normalized: All lowercase from start, prob_mutation forced to 0.0

Input: reader's lawman savage hour-long

Mechanical Obstacles:
  • Symbol ordering: Where do '-' and '\'' sort?
  • Character comparison with symbols: "hour-long" vs "lawman" is decided at the first character ('h' < 'l'); the '-' only matters when every earlier character matches

Common Failures:
  • Wrong: "lawman, reader's, savage, hour-long" (put "lawman" before "hour-long")
  • Wrong: "hour-long, lawman, readers, savage" (apostrophe removed/ignored)

Expected Output:

hour-long
lawman
reader's
savage

Mode 3: BOTH - Pure Alphabetical Sorting (mechanical=3, length=4, prob_mutation=0.0)

Challenges: None (baseline capability test)

Dictionary normalized: All lowercase, symbols removed, prob_mutation forced to 0.0

Input: readers lawman savage hourlong

Mechanical Obstacles: None - clean lowercase alphabetic words only

Expected Output:

hourlong
lawman
readers
savage

Basic Alphabetical Sorting (mechanical=3, length=5, run_length=3, prob_mutation=0.0, prob_duplication=0.0)

Input: elephant dog cat bird apple

Sorting Process:
  • Original: [elephant, dog, cat, bird, apple]
  • Lowercase: [elephant, dog, cat, bird, apple]
  • Alphabetical: [apple, bird, cat, dog, elephant]

Expected Output:

apple
bird
cat
dog
elephant

Case-Insensitive Sorting (length=4, prob_mutation=0.8)

Input: ZEBRA apple Banana cherry

Case Analysis:
  • ZEBRA: uppercase mutation
  • apple: lowercase (original)
  • Banana: title case mutation
  • cherry: lowercase (original)

Sorting Process:
  • Case-insensitive sort: [apple, Banana, cherry, ZEBRA]
  • Lowercase output: [apple, banana, cherry, zebra]

Expected Output:

apple
banana
cherry
zebra

Word Duplication Challenge (length=6, prob_duplication=0.5)

Input: cat dog cat bird apple dog

Duplication Analysis:
  • cat: appears twice
  • dog: appears twice
  • bird: appears once
  • apple: appears once

Sorting Process:
  • All instances sorted: [apple, bird, cat, cat, dog, dog]

Expected Output:

apple
bird
cat
cat
dog
dog

Mixed Complexity (length=8, run_length=2, prob_mutation=0.4, prob_duplication=0.3)

Input: HOUSE tree HOUSE garden flower Tree mountain river

Transformation Analysis:
  • HOUSE: uppercase mutation, duplicated
  • tree/Tree: different cases of same word
  • garden: lowercase (original)
  • flower: lowercase (original)
  • mountain: lowercase (original)
  • river: lowercase (original)

Sorting Process:
  • Case-insensitive grouping: [flower, garden, HOUSE, HOUSE, mountain, river, Tree, tree]
  • Lowercase output: [flower, garden, house, house, mountain, river, tree, tree]

Expected Output:

flower
garden
house
house
mountain
river
tree
tree

Run-Length Coherence (length=6, run_length=6)

Input: abandon ability able about above absence

Dictionary Coherence: All words from same alphabetical region (consecutive 'ab-' words)

Sorting Process:
  • Already near-sorted due to dictionary source
  • Final order: [abandon, ability, able, about, above, absence]

Expected Output:

abandon
ability
able
about
above
absence

Mechanical Obstacles System

Primary Obstacle: Case Normalization (Controlled by force_lowercase)

When Disabled (mechanical=0 or 1): Models must handle mixed-case comparisons

Random case transformations (prob_mutation) test case-insensitive sorting:
  • Visual Complexity: Mixed case creates visual noise (e.g., "ZEBRA" vs "apple")
  • Character Comparison: Must correctly compare 'A'=65 vs 'a'=97 vs 'Z'=90 vs 'z'=122
  • Common Failures:
      • Incomplete lowercase conversion: "deTOURS" instead of "detours"
      • Wrong position comparison: comparing wrong character indices
      • Typos during conversion: "cockeyed" → "cokeyed"

When Enabled (mechanical=2 or 3): Words pre-normalized to lowercase, prob_mutation forced to 0.0
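
The gap between code-point order and case-insensitive order is easy to reproduce in Python. The snippet below only uses built-in behaviour; the word list is taken from the Case-Insensitive Sorting example.

```python
words = ["ZEBRA", "apple", "Banana", "cherry"]

# Raw code-point comparison: every uppercase letter (65-90) sorts before
# every lowercase letter (97-122).
case_sensitive = sorted(words)

# Compare lowercased keys but keep the original spelling.
case_insensitive = sorted(words, key=str.lower)

# What the task expects: lowercase first, then sort.
target = sorted(word.lower() for word in words)

print(case_sensitive)    # ['Banana', 'ZEBRA', 'apple', 'cherry']
print(case_insensitive)  # ['apple', 'Banana', 'cherry', 'ZEBRA']
print(target)            # ['apple', 'banana', 'cherry', 'zebra']
```

A response that orders words as in the first line has compared raw character codes without normalizing case first.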

Secondary Obstacle: Symbol Ordering (Controlled by remove_symbols)

When Disabled (mechanical=0 or 2): Models must handle punctuation in words

Dictionary contains realistic words with symbols (hyphens, apostrophes):
  • Symbol ASCII Ordering: '-' (45) and '\'' (39) sort before letters (97+)
  • Common Failures:
      • Symbol removal: "Reader's" → "readers" (apostrophe stripped)
      • Wrong symbol ordering: Confusion about where "hour-long" sorts relative to "house"
      • Verbose reasoning: Models spend 25K+ chars reasoning about ASCII codes and still fail

When Enabled (mechanical=1 or 3): Symbols stripped via regex [^a-zA-Z] before word list creation
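
A short demonstration of how symbols affect ordering, assuming plain code-point comparison of the lowercased words. The word list is illustrative ("hourly" is added here to show a case where the hyphen actually decides the comparison).

```python
import re

words = ["hourly", "hour-long", "readers", "reader's", "house"]

# '-' (45) and "'" (39) have lower codes than any lowercase letter, so a
# symbol wins whenever the two words agree up to that position.
with_symbols = sorted(words)
print(with_symbols)
# ['hour-long', 'hourly', 'house', "reader's", 'readers']

# Same words after the regex used when remove_symbols is enabled.
stripped = sorted(re.sub(r"[^a-zA-Z]", "", word) for word in words)
print(stripped)
# ['hourlong', 'hourly', 'house', 'readers', 'readers']
```

Note that "hour-long" vs "house" is settled at the fourth character ('r' < 's'), before the hyphen is ever reached; the hyphen only decides "hour-long" vs "hourly", and the apostrophe only decides "reader's" vs "readers".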

Tertiary Obstacle: Word Duplications

Repeated words that test duplicate handling and attention:
  • Duplicate Processing: Models must sort all instances of repeated words correctly
  • Attention Challenge: Duplicates may cause confusion about unique vs. repeated elements
  • Ordering Consistency: All instances of same word must appear together in sorted output
  • Controlled by: prob_duplication parameter (independent of mechanical mode)

Strategic Distribution

Words are shuffled after all transformations to ensure:
  • No Positional Cues: Original dictionary order is completely randomized
  • Mixed Complexity: Case mutations (if enabled) and duplications distributed throughout input
  • Mechanical Consistency: Processing difficulty maintained across entire word list

Cognitive Skills Tested

Mechanical String Processing (Core Capability)

  • Character-by-Character Comparison: Accurate lexicographic ordering at character level
  • Case Normalization: Converting uppercase to lowercase without errors (Mode 0, 1)
  • Symbol Handling: Correctly ordering words containing hyphens, apostrophes (Mode 0, 2)
  • ASCII Code Reasoning: Understanding character ordering (e.g., '-' < 'a', '\'' < 'a')

Algorithmic Skills

  • Lexicographic Ordering: Understanding alphabetical sequence relationships
  • Pattern Recognition: Identifying alphabetical patterns across varied presentations
  • Working Memory: Maintaining sort criteria while processing word sequences
  • Attention to Detail: Accurate position-specific character comparison
  • Duplicate Handling: Correctly processing repeated elements in collections
  • Systematic Processing: Applying consistent sorting rules across entire collections

Format Transformation

  • Input Parsing: Converting space-separated words to individual tokens
  • Output Formatting: Generating newline-separated results correctly

Key Failure Modes Observed

Based on empirical analysis of model failures:

  1. Wrong Alphabetical Comparisons (even on trivial 4-word cases):
       • Comparing wrong character positions (e.g., "annie" after "arimathea")
       • Incorrect character ordering (e.g., "lawman" after "savage" when 'l' < 's')

  2. Case Conversion Errors:
       • Incomplete conversion: "deTOURS" instead of "detours"
       • Character typos: "cockeyed" → "cokeyed"

  3. Symbol Processing Errors:
       • Apostrophe removal: "Reader's" → "readers"
       • Symbol ordering confusion despite explicit ASCII reasoning

  4. Word Omission: Catastrophic failures where words completely dropped from output

Critical Insight: Even models that use explicit <think> reasoning with ASCII code comparisons (e.g., "'e' (101) < 'o' (111)") still make mechanical errors. Verbose chain-of-thought does not guarantee correct character-level string processing.
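
For readers who want to bucket responses into the failure modes listed above, a rough heuristic is sketched below. It is not part of the test manifold: `diagnose` and its labels are hypothetical, and the rules are a simplification that will mislabel some mixed failures.

```python
import re
from collections import Counter
from typing import List


def _strip_symbols(word: str) -> str:
    return re.sub(r"[^a-z]", "", word)


def diagnose(target: str, response: str) -> List[str]:
    """Return coarse failure labels; an empty list means an exact match."""
    expected = target.split("\n")
    got = [line.strip() for line in response.splitlines() if line.strip()]
    if got == expected:
        return []
    issues: List[str] = []
    lowered = [word.lower() for word in got]
    if lowered != got:
        issues.append("case conversion")
    if Counter(lowered) != Counter(expected):
        stripped_got = Counter(map(_strip_symbols, lowered))
        stripped_expected = Counter(map(_strip_symbols, expected))
        if stripped_got == stripped_expected:
            issues.append("symbol processing")  # same words, symbols differ
        elif len(lowered) < len(expected):
            issues.append("word omission")
        else:
            issues.append("word altered")       # e.g. a typo during conversion
    if lowered != sorted(lowered):
        issues.append("wrong comparison")
    return issues


target = "hour-long\nlawman\nreader's\nsavage"
print(diagnose(target, "hour-long\nlawman\nreaders\nsavage"))  # ['symbol processing']
```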

Applications

This test manifold evaluates capabilities essential for:

  • Data Organization: Sorting textual data in databases and applications
  • Information Retrieval: Organizing search results and directory listings
  • Text Processing: Alphabetizing content in documents and reports
  • User Interface: Implementing sort functionality in applications
  • Data Validation: Ensuring consistent ordering in data processing pipelines
  • Content Management: Organizing textual content for presentation
  • Search Optimization: Preparing sorted indexes for efficient searching
  • Quality Assurance: Validating sort implementations in software systems
  • Document Processing: Organizing references, glossaries, and indexes
  • Database Operations: Implementing ORDER BY functionality and data organization