Movies: Set Membership Pattern Extraction Under Noise

The Movies task is available in the ReasonScape engine but is not part of the r12 evaluation set. r12 uses the Tables task instead. Movies remains available for custom evaluations.

Overview

The Movies task evaluates a model's ability to extract patterns from noisy categorical data by identifying which entity belongs to a given set. This is a pure information-based reasoning task—all categorical facets (genres, themes) are explicitly provided in the problem, requiring no prior knowledge of actual movies.

Key Insight: The primary difficulty dimension is Signal-to-Noise Ratio (SNR), not computational complexity or domain knowledge. Models must filter signal (pattern-defining facets) from noise (confounding facets) to determine set membership.

Domain Independence: Movies are used as a convenient proxy because they have rich categorical attributes (genres, themes). The exact same task structure works with Pokemon (types, abilities), cars (features, classes), books (genres, topics), or any other entity domain with multiple facets. The task tests pure reasoning capability, not domain knowledge.

What's Actually Being Tested

This task measures pure reasoning capabilities:

  1. Signal Extraction: Identifying which facets define the pattern (target facets) vs which are irrelevant (noise facets)
  2. Pattern Recognition Under Noise: Detecting commonalities when signal is buried in confounding information
  3. Few-Shot Learning: Inferring patterns from minimal examples (reference_count)
  4. Attention Capacity: Maintaining focus on signal as context length increases

All categorical information is provided in the problem. Models do not need any prior knowledge of movies, genres, or themes—they only need to identify which facets are shared across the reference set.

Why Movies?

Movies serve as an effective test domain because:

  1. Rich categorical structure: Each movie has multiple genres and themes (multi-valued facets)
  2. Natural ambiguity: Many movies belong to multiple genres/themes simultaneously
  3. Familiar framing: "Recommend a movie" is intuitive for users
  4. Large feature space: 19 genres × 25 themes creates diverse combinations
  5. Easily interpretable: Humans can verify test case validity

Critical point: Nothing in the task mechanics is specific to movies. The same algorithm generates valid test cases for any domain where:

  • Entities have multiple categorical attributes (facets)
  • Facets are multi-valued (entities can have multiple genres, types, etc.)
  • A database exists mapping entities to their facets

Example domains that work identically:

  • Pokemon: types, abilities, habitats, generations
  • Books: genres, themes, target audience, narrative style
  • Cars: class, features, brand, market segment
  • Restaurants: cuisine, atmosphere, price range, dietary options

Task Mechanics

Set Membership Structure

Each test case consists of:

  1. Reference Set: N movies that all share certain facets (the "signal")
  2. Target: 1 movie that belongs to the reference set (shares signal facets)
  3. Distractors: K-1 movies that do NOT belong (anti-reference set)
  4. Choices: Target + Distractors, shuffled

The model must identify which choice belongs with the reference set.
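The four-part structure above can be sketched in a few lines of Python. All names here are illustrative, not the engine's actual API:

```python
import random

def assemble_test_case(reference_set, target, distractors, seed=0):
    """Combine the target and distractors into shuffled choices.

    reference_set: movies sharing the signal facets.
    target: the one choice that belongs with the reference set.
    distractors: K-1 movies sharing none of the signal facets.
    """
    rng = random.Random(seed)
    choices = [target] + list(distractors)
    rng.shuffle(choices)
    answer_index = choices.index(target)          # position of the correct choice
    answer_letter = chr(ord("A") + answer_index)  # formatted as (A)-(Z)
    return {"references": reference_set, "choices": choices, "answer": answer_letter}

case = assemble_test_case(
    reference_set=["Venom", "Contact", "District 9"],
    target="Arrival",
    distractors=["Moulin Rouge!", "Free Solo", "Boss Level"],
)
print(case["answer"], case["choices"])
```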

Signal and Noise

Signal (Pattern-Defining Facets):

  • num_target_facets: how many facets define set membership
  • Example: "All movies share genres: [Horror, Thriller]"
  • Fewer target facets = subtler pattern = harder

Noise (Confounding Facets):

  • num_noise_facets: how many additional facets vary randomly across the set
  • Example: some references are Drama, some Mystery, some Sci-Fi (all irrelevant to the pattern)
  • More noise facets = buried signal = harder

Context:

  • reference_count: how many examples are provided to learn the pattern
  • More examples generally help, BUT:
    • Strong models saturate around ref=4 (no benefit from more)
    • Weak models can experience attention breakdown with too many examples (an inverse relationship!)

Difficulty Formula

difficulty ∝ num_noise_facets / num_target_facets

# Modulated by reference_count (diminishing returns):
effective_difficulty = noise / (target * log(reference_count + 1))
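The formula can be checked against the SNR profiles in this section. A minimal sketch (function name is illustrative):

```python
from math import log

def effective_difficulty(num_target_facets, num_noise_facets, reference_count):
    """Noise-to-signal ratio, damped by the log of the example count."""
    return num_noise_facets / (num_target_facets * log(reference_count + 1))

# The three profiles described in this section, as (target, noise, refs):
profiles = {
    "high_snr":   (2, 2, 6),
    "medium_snr": (1, 3, 4),
    "low_snr":    (1, 4, 2),
}
for name, (t, n, r) in profiles.items():
    print(f"{name}: {effective_difficulty(t, n, r):.2f}")
```

As expected, the computed difficulty rises monotonically from the high-SNR profile to the low-SNR one.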

Example SNR profiles:

High SNR (Easy):
  target_facets=2, noise_facets=2, reference_count=6
  → Clear pattern, lots of examples
  → Strong models: ~95%, Weak models: ~70%

Medium SNR:
  target_facets=1, noise_facets=3, reference_count=4
  → Subtle pattern, moderate context
  → Strong models: ~87%, Weak models: ~60%

Low SNR (Hard):
  target_facets=1, noise_facets=4, reference_count=2
  → Buried pattern, minimal examples
  → Strong models: ~78%, Weak models: ~55%

Configuration Parameters

Generation Schema (MovieGenerationParams)

class MovieGenerationParams(BaseModel):
    count: int                                    # Number of test cases to generate

    # SNR Configuration (Primary Difficulty Knobs)
    num_target_facets: int                        # Signal strength (1-2)
    num_noise_facets: int                         # Noise level (0-5)

    # Context Configuration
    reference_count: int                          # Examples to learn from (2-6)
    choice_count: int                            # Multiple choice options (3-5)

    # Answer Options Configuration
    num_answer_facets: int                       # Facet dimensionality in choices (3-5)

Facet Display: All movies (both reference set and answer choices) are shown with their categorical facets explicitly labeled in the format: Movie Title (Facet1, Facet2, Facet3, ...). Genres and themes are merged into a single comma-separated list. This makes the task purely information-based with no domain knowledge required.
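The display format described above amounts to a one-line formatter; a sketch (the function name is illustrative):

```python
def format_entity(title, genres, themes):
    """Render an entity as 'Title (Facet1, Facet2, ...)', with genres
    and themes merged into a single comma-separated list."""
    facets = list(genres) + list(themes)
    return f"{title} ({', '.join(facets)})"

line = format_entity("Arrival", ["Sci-Fi"], ["Time-Loop", "Alien-Contact"])
print(line)
```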

Typical Configurations

SNR Gradient Testing (recommended):

# Low SNR (Hard)
dict(num_target_facets=1, num_noise_facets=4, reference_count=2)

# Medium SNR
dict(num_target_facets=1, num_noise_facets=3, reference_count=3)

# High SNR (Easy)
dict(num_target_facets=2, num_noise_facets=2, reference_count=6)

Attention Breakdown Testing (for weak models):

# Test if more references help or hurt
reference_counts = [2, 3, 4, 6]
# Weak models may show: ref=2 > ref=3 > ref=4 > ref=6 (inverse!)
# Strong models show: ref=2 < ref=3 ≈ ref=4 ≈ ref=6 (saturation)
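One way to realize this sweep as concrete generation parameters, using the MovieGenerationParams keys from the schema above (the fixed SNR values are illustrative):

```python
# Hold the SNR fixed and vary only the number of reference examples.
reference_counts = [2, 3, 4, 6]
configs = [
    dict(num_target_facets=1, num_noise_facets=3,
         reference_count=r, choice_count=4, num_answer_facets=4)
    for r in reference_counts
]
# A falling accuracy trend across these configs signals attention breakdown.
for cfg in configs:
    print(cfg)
```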

Result Schema (MovieTestCaseResult)

class MovieTestCaseResult(BaseModel):
    input: str                                   # Formatted problem text
    target: str                                  # Correct answer (A)-(Z)

    # Test Case Structure
    reference_movies: List[Dict[str, Any]]       # Movies in reference set
    choices: List[Dict[str, Any]]                # All answer choices

    # Signal/Noise Configuration
    selected_genres: List[str]                   # Target facets (signal)
    selected_themes: List[str]                   # Target facets (signal)
    num_target_facets: int                       # Signal dimensionality
    num_noise_facets: int                        # Noise level

    # Metadata
    question_template_index: int                 # Template used (for sensitivity analysis)

Question Templates

Questions use natural language to frame the set membership task as a recommendation:

Template Categories

Direct Similarity:

"Which movie is most similar to {reference_movies}?"
"Which movie shares the most characteristics with {reference_movies}?"
"Which movie fits the same pattern as {reference_movies}?"

Recommendation Framing:

"Based on these movies: {reference_movies} - which would you recommend next?"
"If you enjoyed {reference_movies}, which movie would you like most?"
"Which movie would appeal to someone who enjoys {reference_movies}?"

Group Membership (most direct):

"Which movie belongs with this group: {reference_movies}?"
"Complete this movie collection: {reference_movies}."

Example Rendering

Template: "Which movie is most similar to {reference_movies}?"

Example Problem (Medium SNR):

Which movie is most similar to Venom (Alien-Contact, Sci-Fi, Identity-Crisis, Horror); Contact (Sci-Fi, Identity-Crisis, Alien-Contact, Drama); District 9 (Class-Struggle, Action, Alien-Contact, Sci-Fi)?

Options:
(A) Arrival (Time-Loop, Alien-Contact, Sci-Fi)
(B) Moulin Rouge! (Musical, Romance, Drama)
(C) Free Solo (Adventure, Survival, Mental-Illness)
(D) Boss Level (Mystery, Revenge, Thriller)

Analysis:

  • Target facet: [Alien-Contact] (generated as the signal; appears in all 3 reference movies)
  • Noise facets: [Sci-Fi, Identity-Crisis, Horror, Class-Struggle, Action] (drawn per reference; here Sci-Fi happens to co-occur in all three references, but the correct answer contains it as well, so the case remains valid)
  • Correct answer: (A), the only choice containing Alien-Contact
  • Difficulty: the model must identify the common facet amid varying noise facets
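The analysis above amounts to a set intersection over the reference facets followed by an overlap count. A minimal sketch of that reasoning (illustrative, not the engine's scoring code):

```python
references = {
    "Venom":      {"Alien-Contact", "Sci-Fi", "Identity-Crisis", "Horror"},
    "Contact":    {"Sci-Fi", "Identity-Crisis", "Alien-Contact", "Drama"},
    "District 9": {"Class-Struggle", "Action", "Alien-Contact", "Sci-Fi"},
}
choices = {
    "A": {"Time-Loop", "Alien-Contact", "Sci-Fi"},    # Arrival
    "B": {"Musical", "Romance", "Drama"},             # Moulin Rouge!
    "C": {"Adventure", "Survival", "Mental-Illness"}, # Free Solo
    "D": {"Mystery", "Revenge", "Thriller"},          # Boss Level
}

# Facets shared by every reference movie. (In this example Sci-Fi
# happens to co-occur in all three references alongside the generated
# target facet Alien-Contact, so both survive the intersection.)
signal = set.intersection(*references.values())

# The choice with the largest overlap with the shared facets wins.
best = max(choices, key=lambda letter: len(choices[letter] & signal))
print(signal, best)
```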

Example Test Cases by SNR

High SNR (Easy): target=2, noise=2, ref=4

If you liked Guardians of the Galaxy Vol. 2 (Father-Son, Found-Family, Action, Adventure);
The Lion King (Adventure, Coming-of-Age, Drama, Father-Son);
Star Wars: Episode V - The Empire Strikes Back (Adventure, Action, Father-Son, Good-vs-Evil);
Interstellar (Father-Son, Sacrifice, Adventure, Drama),
what should you watch next?

Options:
(A) When Harry Met Sally (Comedy, Romance, Drama)
(B) Captain Phillips (Crime, Survival, Action)
(C) The Hateful Eight (Crime, Betrayal, Mystery)
(D) Finding Nemo (Father-Son, Coming-of-Age, Adventure)  ← TARGET

Expected: (D)

Why Easy:

  • 2 target facets ([Adventure, Father-Son]) = strong signal present in all references
  • 2 noise facets per reference ([Found-Family, Coming-of-Age, Drama, etc.]) provide moderate context
  • 4 examples make the pattern clear
  • Noise-to-signal ratio ≈ 2/2 = 1.0 (balanced)

Medium SNR: target=1, noise=3, ref=3

Which movie is most similar to Venom (Alien-Contact, Sci-Fi, Identity-Crisis, Horror);
Contact (Sci-Fi, Identity-Crisis, Alien-Contact, Drama);
District 9 (Class-Struggle, Action, Alien-Contact, Sci-Fi)?

Options:
(A) Arrival (Time-Loop, Alien-Contact, Sci-Fi)  ← TARGET
(B) Moulin Rouge! (Musical, Romance, Drama)
(C) Free Solo (Adventure, Survival, Mental-Illness)
(D) Boss Level (Mystery, Revenge, Thriller)

Expected: (A)

Why Medium:

  • 1 target facet ([Alien-Contact]) = subtle signal, requires careful attention
  • 3 noise facets ([Sci-Fi, Identity-Crisis, Horror, etc.]) vary across references
  • Only 3 examples provide moderate context
  • Noise-to-signal ratio ≈ 3/1 = 3.0 (noisy)

Low SNR (Hard): target=1, noise=4, ref=2

Based on these movies: Rudy (Biography, Underdog, Drama, Sport, Sacrifice);
The Miracle (Drama, Found-Family, Family, Underdog, Sport) -
which would you recommend next?

Options:
(A) Manchester by the Sea (Drama, Identity-Crisis, Found-Family)
(B) Fargo (Drama, Thriller, Crime, Fish-Out-of-Water)
(C) Spider-Man: Far From Home (Coming-of-Age, Adventure, Sci-Fi, Action)
(D) The Blindside (Found-Family, Drama, Sport, Biography)  ← TARGET
(E) Chappie (Action, Crime, Artificial-Intelligence, Coming-of-Age)

Expected: (D)

Why Hard:

  • 1 target facet ([Sport]) = very subtle signal buried in noise
  • 4 noise facets per movie make pattern extraction difficult
  • Only 2 examples = minimal context to learn from
  • 5 choices increase difficulty
  • Noise-to-signal ratio ≈ 4/1 = 4.0 (very noisy)
  • Note: multiple distractors share some facets with the references (Drama, Found-Family, Biography), requiring precise pattern identification

Attention Test: If this case instead used ref=6 with 6 Sport movies, weak models might actually perform WORSE, as attention breakdown spreads their focus across too many examples.

Movie Database

~400 movies spanning multiple decades, each annotated with:

Genres (19 categories): Action, Adventure, Animation, Biography, Comedy, Crime, Documentary, Drama, Family, Fantasy, History, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western

Themes (25 categories): Coming-of-Age, Redemption, Revenge, Love-Triangle, Fish-Out-of-Water, Good-vs-Evil, Underdog, Betrayal, Sacrifice, Identity-Crisis, Father-Son, Time-Loop, Artificial-Intelligence, Dystopian-Society, Heist, Road-Trip, Survival, Corruption, Mental-Illness, Technology-Dependence, Class-Struggle, Alien-Contact, Superhero-Origin, Found-Family, Moral-Ambiguity

Note: Database could be replaced with any entity set having multi-valued categorical attributes. The generation algorithm is domain-agnostic.

Cognitive Skills Tested

All skills tested are domain-independent reasoning capabilities:

  • Signal Extraction: Identifying relevant features from noise
  • Pattern Recognition Under Noise: Detecting structure when SNR is low
  • Few-Shot Learning: Inferring patterns from limited examples
  • Multi-Dimensional Pattern Matching: Considering multiple facets simultaneously
  • Attention Management: Maintaining focus on signal as context increases

No prior knowledge is required—all categorical information is provided in the problem text.

Model Capability Indicators

Strong Models:

  • Performance saturates at ref=4-6 (more examples don't help)
  • Handle high noise (4-5 noise facets) with minimal degradation
  • Maintain >85% accuracy even at low SNR

Medium Models:

  • Show a clear gradient from ref=2 to ref=4
  • Struggle with noise>3 but remain acceptable with noise=2-3
  • Performance: 75-85% at medium SNR

Weak Models:

  • May show an inverse gradient (ref=2 > ref=6) due to attention breakdown
  • Severely impacted by noise>3
  • Performance: 55-65% at medium SNR, <50% at low SNR

Performance Indicators

From empirical testing across 5 models:

Model Class                  High SNR   Medium SNR   Low SNR   Attention Pattern
Strong (GLM-4.5, 82B+)       90-95%     85-90%       80-85%    Saturates at ref=4
Medium (30-32B reasoning)    85-90%     80-85%       75-82%    Plateau at ref=4-6
Small (8-13B)                70-80%     65-75%       60-70%    Slight improvement with refs
Weak (8B base)               65-70%     58-62%       55-60%    INVERSE: ref=2 > ref=6

Key Finding: Difficulty scales primarily with SNR, not reference count or choice count. The num_noise_facets / num_target_facets ratio is the strongest predictor of task difficulty across all model classes.

Applications

This task evaluates capabilities essential for:

  • Content Recommendation Systems: Finding patterns in user preferences
  • Few-Shot Classification: Learning categories from minimal labeled examples
  • Pattern Recognition Under Noise: Extracting signal from noisy categorical data
  • Multi-Modal Feature Integration: Combining multiple feature dimensions
  • Attention Capacity: Managing information as context scales

The domain-independent nature and complete information provision make this useful for evaluating pure reasoning capability without cultural bias or knowledge requirements.

Implementation Notes

Domain Portability

To adapt this task to another domain:

  1. Create entity database with multi-valued categorical facets
  2. Implement facet filtering (select entities matching target facets)
  3. Implement anti-facet filtering (select entities matching NO target facets)
  4. Update question templates to reference new domain
  5. No changes needed to generation algorithm or difficulty calculation

Example adaptation for Pokemon:

# Database: pokemon.json with {name, types, abilities, habitat}
# Facets: types=["Fire", "Water", ...], abilities=[...], habitat=[...]
# Question: "Which Pokemon belongs with this group: {references}?"
# Everything else identical

Generation Algorithm

  1. Randomly select num_target_facets from available facets (genres + themes)
  2. Filter database to entities matching ALL target facets (reference pool)
  3. Filter database to entities matching NONE of target facets (anti-reference pool)
  4. Sample reference_count movies from reference pool
  5. Sample 1 target from reference pool (correct answer)
  6. Sample choice_count - 1 distractors from anti-reference pool
  7. For each entity, add num_noise_facets random facets from their full facet list
  8. Shuffle choices and format question using random template

This algorithm works identically for any domain with categorical facets.
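The eight steps above can be sketched end to end in Python. Everything here (function name, return keys, the toy database) is illustrative, not the ReasonScape implementation, and the sketch adds one practical detail the step list leaves implicit: retrying the facet selection until both pools are large enough.

```python
import random

def generate_case(db, num_target_facets, num_noise_facets,
                  reference_count, choice_count, rng=None):
    """db maps entity name -> set of all its facets."""
    rng = rng or random.Random(0)
    all_facets = sorted(set().union(*db.values()))
    while True:
        # 1. Randomly select the signal facets; retry until both pools suffice.
        signal = set(rng.sample(all_facets, num_target_facets))
        # 2. Reference pool: entities matching ALL target facets.
        ref_pool = [e for e, f in db.items() if signal <= f]
        # 3. Anti-reference pool: entities matching NONE of them.
        anti_pool = [e for e, f in db.items() if not (signal & f)]
        if len(ref_pool) > reference_count and len(anti_pool) >= choice_count - 1:
            break
    # 4-6. Sample the references, the target, and the distractors.
    picks = rng.sample(ref_pool, reference_count + 1)
    references, target = picks[:-1], picks[-1]
    distractors = rng.sample(anti_pool, choice_count - 1)
    # 7. Displayed facets: the signal plus up to num_noise_facets extras
    #    (a real generator would sample the extras randomly).
    def display(e):
        extras = sorted(db[e] - signal)[:num_noise_facets]
        return (e, sorted(db[e] & signal) + extras)
    # 8. Shuffle the choices and label the answer (A)-(Z).
    choices = [target] + distractors
    rng.shuffle(choices)
    return {
        "signal": sorted(signal),
        "references": [display(e) for e in references],
        "choices": [display(e) for e in choices],
        "answer": chr(ord("A") + choices.index(target)),
    }

# Toy database (hand-made facet sets, not the real ~400-movie database).
db = {
    "Rudy":           {"Sport", "Drama", "Underdog", "Biography"},
    "Miracle":        {"Sport", "Drama", "Family", "Underdog"},
    "The Blind Side": {"Sport", "Biography", "Drama", "Found-Family"},
    "Creed":          {"Sport", "Underdog", "Drama"},
    "Fargo":          {"Crime", "Thriller", "Mystery"},
    "Arrival":        {"Sci-Fi", "Alien-Contact", "Mystery"},
    "Moulin Rouge!":  {"Musical", "Romance"},
    "Free Solo":      {"Documentary", "Survival"},
}
case = generate_case(db, num_target_facets=1, num_noise_facets=2,
                     reference_count=2, choice_count=3)
print(case["answer"], case["choices"])
```

By construction, the correct choice displays every signal facet and each distractor displays none of them, which is exactly the membership property the task tests.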