Movies: Set Membership Pattern Extraction Under Noise
The Movies task is available in the ReasonScape engine but is not part of the r12 evaluation set. r12 uses the Tables task instead. Movies remains available for custom evaluations.
Overview
The Movies task evaluates a model's ability to extract patterns from noisy categorical data by identifying which entity belongs to a given set. This is a pure information-based reasoning task—all categorical facets (genres, themes) are explicitly provided in the problem, requiring no prior knowledge of actual movies.
Key Insight: The primary difficulty dimension is Signal-to-Noise Ratio (SNR), not computational complexity or domain knowledge. Models must filter signal (pattern-defining facets) from noise (confounding facets) to determine set membership.
Domain Independence: Movies are used as a convenient proxy because they have rich categorical attributes (genres, themes). The exact same task structure works with Pokemon (types, abilities), cars (features, classes), books (genres, topics), or any other entity domain with multiple facets. The task tests pure reasoning capability, not domain knowledge.
What's Actually Being Tested
This task measures pure reasoning capabilities:
- Signal Extraction: Identifying which facets define the pattern (target facets) vs which are irrelevant (noise facets)
- Pattern Recognition Under Noise: Detecting commonalities when signal is buried in confounding information
- Few-Shot Learning: Inferring patterns from minimal examples (reference_count)
- Attention Capacity: Maintaining focus on signal as context length increases
All categorical information is provided in the problem. Models do not need any prior knowledge of movies, genres, or themes—they only need to identify which facets are shared across the reference set.
Why Movies?
Movies serve as an effective test domain because:
- Rich categorical structure: Each movie has multiple genres and themes (multi-valued facets)
- Natural ambiguity: Many movies belong to multiple genres/themes simultaneously
- Familiar framing: "Recommend a movie" is intuitive for users
- Large feature space: 19 genres × 25 themes creates diverse combinations
- Easily interpretable: Humans can verify test case validity
Critical point: Nothing in the task mechanics is specific to movies. The same algorithm generates valid test cases for any domain where:
- Entities have multiple categorical attributes (facets)
- Facets are multi-valued (entities can have multiple genres, types, etc.)
- A database exists mapping entities to their facets
Example domains that work identically:
- Pokemon: types, abilities, habitats, generations
- Books: genres, themes, target audience, narrative style
- Cars: class, features, brand, market segment
- Restaurants: cuisine, atmosphere, price range, dietary options
Task Mechanics
Set Membership Structure
Each test case consists of:
- Reference Set: N movies that all share certain facets (the "signal")
- Target: 1 movie that belongs to the reference set (shares signal facets)
- Distractors: K-1 movies that do NOT belong (anti-reference set)
- Choices: Target + Distractors, shuffled
The model must identify which choice belongs with the reference set.
Signal and Noise
Signal (Pattern-Defining Facets):
- num_target_facets: How many facets define set membership
- Example: "All movies share genres: [Horror, Thriller]"
- Low target facets = subtle pattern = harder
Noise (Confounding Facets):
- num_noise_facets: How many additional facets vary randomly across the set
- Example: Some are Drama, some are Mystery, some are Sci-Fi (irrelevant to pattern)
- High noise facets = buried signal = harder
Context:
- reference_count: How many examples provided to learn the pattern
- More examples generally help, BUT:
- Strong models saturate around ref=4 (no benefit from more)
- Weak models can experience attention breakdown with too many examples (inverse relationship!)
Difficulty Formula
difficulty ≈ num_noise_facets / num_target_facets
# Modulated by reference_count (diminishing returns):
effective_difficulty = noise / (target * log(reference_count + 1))
Example SNR profiles:
High SNR (Easy):
target_facets=2, noise_facets=2, reference_count=6
→ Clear pattern, lots of examples
→ Strong models: ~95%, Weak models: ~70%
Medium SNR:
target_facets=1, noise_facets=3, reference_count=4
→ Subtle pattern, moderate context
→ Strong models: ~87%, Weak models: ~60%
Low SNR (Hard):
target_facets=1, noise_facets=4, reference_count=2
→ Buried pattern, minimal examples
→ Strong models: ~78%, Weak models: ~55%
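The heuristic above can be evaluated directly for the three profiles. This is a minimal sketch: the helper name `effective_difficulty` is hypothetical, and the natural logarithm is an assumption, since the formula does not state a base.

```python
import math

def effective_difficulty(num_target_facets: int,
                         num_noise_facets: int,
                         reference_count: int) -> float:
    """noise / (target * log(reference_count + 1)).

    Assumption: natural log; the source formula does not specify the base.
    """
    if num_target_facets < 1 or reference_count < 1:
        raise ValueError("need at least one target facet and one reference")
    return num_noise_facets / (num_target_facets * math.log(reference_count + 1))

# The three example profiles listed above.
profiles = {
    "high_snr": dict(num_target_facets=2, num_noise_facets=2, reference_count=6),
    "medium_snr": dict(num_target_facets=1, num_noise_facets=3, reference_count=4),
    "low_snr": dict(num_target_facets=1, num_noise_facets=4, reference_count=2),
}

scores = {name: effective_difficulty(**params) for name, params in profiles.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Under these assumptions the score rises monotonically from the high-SNR profile to the low-SNR profile, matching the intended easy-to-hard ordering.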
Configuration Parameters
Generation Schema (MovieGenerationParams)
from pydantic import BaseModel  # assumed: BaseModel comes from pydantic

class MovieGenerationParams(BaseModel):
    count: int              # Number of test cases to generate

    # SNR Configuration (Primary Difficulty Knobs)
    num_target_facets: int  # Signal strength (1-2)
    num_noise_facets: int   # Noise level (0-5)

    # Context Configuration
    reference_count: int    # Examples to learn from (2-6)
    choice_count: int       # Multiple choice options (3-5)

    # Answer Options Configuration
    num_answer_facets: int  # Facet dimensionality in choices (3-5)
Facet Display: All movies (both reference set and answer choices) are shown with their categorical facets explicitly labeled in the format: Movie Title (Facet1, Facet2, Facet3, ...). Genres and themes are merged into a single comma-separated list. This makes the task purely information-based with no domain knowledge required.
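The display format can be sketched in a few lines. The function name `format_movie` and the optional shuffle are assumptions for illustration; the source only states that genres and themes are merged into one comma-separated list.

```python
import random
from typing import List, Optional

def format_movie(title: str,
                 genres: List[str],
                 themes: List[str],
                 rng: Optional[random.Random] = None) -> str:
    """Render a movie as 'Title (Facet1, Facet2, ...)'.

    Genres and themes are merged into a single list. Shuffling the merged
    list is an assumption, suggested by the mixed ordering in the examples.
    """
    facets = list(genres) + list(themes)
    if rng is not None:
        rng.shuffle(facets)
    return f"{title} ({', '.join(facets)})"

line = format_movie("Arrival", ["Sci-Fi"], ["Time-Loop", "Alien-Contact"])
print(line)  # Arrival (Sci-Fi, Time-Loop, Alien-Contact)
```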
Typical Configurations
SNR Gradient Testing (recommended):
# Low SNR (Hard)
dict(num_target_facets=1, num_noise_facets=4, reference_count=2)
# Medium SNR
dict(num_target_facets=1, num_noise_facets=3, reference_count=3)
# High SNR (Easy)
dict(num_target_facets=2, num_noise_facets=2, reference_count=6)
Attention Breakdown Testing (for weak models):
# Test if more references help or hurt
reference_counts = [2, 3, 4, 6]
# Weak models may show: ref=2 > ref=3 > ref=4 > ref=6 (inverse!)
# Strong models show: ref=2 < ref=3 ≈ ref=4 ≈ ref=6 (saturation)
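A reference-count sweep like this can be set up by holding the SNR knobs fixed and varying only `reference_count`. The sketch below is illustrative: `build_sweep`, `classify_trend`, the `tolerance` threshold, and the base values (including `count=100`) are hypothetical and not part of the ReasonScape API.

```python
from typing import Dict, List

# Hypothetical base settings; count=100 is an illustrative value only.
BASE = dict(count=100, num_target_facets=1, num_noise_facets=3,
            choice_count=4, num_answer_facets=3)
REFERENCE_COUNTS = [2, 3, 4, 6]

def build_sweep(base: dict, reference_counts: List[int]) -> List[dict]:
    """One config per reference_count, all other knobs held constant."""
    return [dict(base, reference_count=r) for r in reference_counts]

def classify_trend(accuracy_by_ref: Dict[int, float],
                   tolerance: float = 0.02) -> str:
    """Compare accuracy at the smallest and largest reference_count.

    'inverse'   -> more references hurt (attention breakdown pattern)
    'improving' -> more references help
    'flat'      -> difference within tolerance (saturation)
    The tolerance value is an assumption, not a documented threshold.
    """
    refs = sorted(accuracy_by_ref)
    delta = accuracy_by_ref[refs[-1]] - accuracy_by_ref[refs[0]]
    if delta < -tolerance:
        return "inverse"
    if delta > tolerance:
        return "improving"
    return "flat"

sweep = build_sweep(BASE, REFERENCE_COUNTS)
print([cfg["reference_count"] for cfg in sweep])  # [2, 3, 4, 6]
```

Feeding measured per-`reference_count` accuracies into `classify_trend` gives a quick label for whether a model shows the inverse pattern or saturation.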
Result Schema (MovieTestCaseResult)
from typing import Any, Dict, List

from pydantic import BaseModel  # assumed: BaseModel comes from pydantic

class MovieTestCaseResult(BaseModel):
    input: str    # Formatted problem text
    target: str   # Correct answer (A)-(Z)

    # Test Case Structure
    reference_movies: List[Dict[str, Any]]  # Movies in reference set
    choices: List[Dict[str, Any]]           # All answer choices

    # Signal/Noise Configuration
    selected_genres: List[str]  # Target facets (signal)
    selected_themes: List[str]  # Target facets (signal)
    num_target_facets: int      # Signal dimensionality
    num_noise_facets: int       # Noise level

    # Metadata
    question_template_index: int  # Template used (for sensitivity analysis)
Question Templates
Questions use natural language to frame the set membership task as a recommendation:
Template Categories
Direct Similarity:
"Which movie is most similar to {reference_movies}?"
"Which movie shares the most characteristics with {reference_movies}?"
"Which movie fits the same pattern as {reference_movies}?"
Recommendation Framing:
"Based on these movies: {reference_movies} - which would you recommend next?"
"If you enjoyed {reference_movies}, which movie would you like most?"
"Which movie would appeal to someone who enjoys {reference_movies}?"
Group Membership (most direct):
"Which movie belongs with this group: {reference_movies}?"
"Complete this movie collection: {reference_movies}."
Example Rendering
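Rendering amounts to substituting the formatted reference movies into a template and listing lettered options. The following is a minimal sketch; `render_question` is a hypothetical helper, and the "; " separator and "Options:" layout are inferred from the examples in this document rather than taken from the engine's code.

```python
from typing import List

TEMPLATE = "Which movie is most similar to {reference_movies}?"

def render_question(template: str,
                    reference_movies: List[str],
                    choices: List[str]) -> str:
    """Fill the template and append lettered options (A), (B), ..."""
    refs = "; ".join(reference_movies)
    lines = [template.format(reference_movies=refs), "Options:"]
    for index, choice in enumerate(choices):
        label = chr(ord("A") + index)
        lines.append(f"({label}) {choice}")
    return "\n".join(lines)

question = render_question(
    TEMPLATE,
    ["Venom (Alien-Contact, Sci-Fi, Identity-Crisis, Horror)",
     "Contact (Sci-Fi, Identity-Crisis, Alien-Contact, Drama)",
     "District 9 (Class-Struggle, Action, Alien-Contact, Sci-Fi)"],
    ["Arrival (Time-Loop, Alien-Contact, Sci-Fi)",
     "Moulin Rouge! (Musical, Romance, Drama)",
     "Free Solo (Adventure, Survival, Mental-Illness)",
     "Boss Level (Mystery, Revenge, Thriller)"],
)
print(question)
```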
Template: "Which movie is most similar to {reference_movies}?"
Example Problem (Medium SNR):
Which movie is most similar to Venom (Alien-Contact, Sci-Fi, Identity-Crisis, Horror); Contact (Sci-Fi, Identity-Crisis, Alien-Contact, Drama); District 9 (Class-Struggle, Action, Alien-Contact, Sci-Fi)?
Options:
(A) Arrival (Time-Loop, Alien-Contact, Sci-Fi)
(B) Moulin Rouge! (Musical, Romance, Drama)
(C) Free Solo (Adventure, Survival, Mental-Illness)
(D) Boss Level (Mystery, Revenge, Thriller)
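Analysis:
- Target facet: [Alien-Contact] (appears in all 3 reference movies)
- Noise facets: [Sci-Fi, Identity-Crisis, Horror, Drama, Class-Struggle, Action]. Most vary across references, although Sci-Fi happens to appear in all three as well
- Correct answer: (A), the only choice containing Alien-Contact (it is also the only choice containing Sci-Fi, so the answer stays unambiguous)
- Difficulty: Model must identify the facets common to every reference amid varying noise facets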
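The reasoning a solver needs here is a set intersection followed by an overlap check. A minimal sketch using the facets exactly as displayed in the example; the helper names `shared_facets` and `best_choice` are hypothetical:

```python
from typing import Dict, List, Set

references: Dict[str, Set[str]] = {
    "Venom": {"Alien-Contact", "Sci-Fi", "Identity-Crisis", "Horror"},
    "Contact": {"Sci-Fi", "Identity-Crisis", "Alien-Contact", "Drama"},
    "District 9": {"Class-Struggle", "Action", "Alien-Contact", "Sci-Fi"},
}
choices: Dict[str, Set[str]] = {
    "A": {"Time-Loop", "Alien-Contact", "Sci-Fi"},
    "B": {"Musical", "Romance", "Drama"},
    "C": {"Adventure", "Survival", "Mental-Illness"},
    "D": {"Mystery", "Revenge", "Thriller"},
}

def shared_facets(reference_sets: List[Set[str]]) -> Set[str]:
    """Facets present in every reference movie (candidate signal)."""
    return set.intersection(*reference_sets)

def best_choice(shared: Set[str], options: Dict[str, Set[str]]) -> str:
    """Pick the option overlapping most with the shared facets."""
    return max(options, key=lambda label: len(shared & options[label]))

shared = shared_facets(list(references.values()))
answer = best_choice(shared, choices)
print(sorted(shared), answer)  # ['Alien-Contact', 'Sci-Fi'] A
```

The intersection contains both Alien-Contact and Sci-Fi, and (A) is the only option that overlaps with it at all.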
Example Test Cases by SNR
High SNR (Easy): target=2, noise=2, ref=4
If you liked Guardians of the Galaxy Vol. 2 (Father-Son, Found-Family, Action, Adventure);
The Lion King (Adventure, Coming-of-Age, Drama, Father-Son);
Star Wars: Episode V - The Empire Strikes Back (Adventure, Action, Father-Son, Good-vs-Evil);
Interstellar (Father-Son, Sacrifice, Adventure, Drama),
what should you watch next?
Options:
(A) When Harry Met Sally (Comedy, Romance, Drama)
(B) Captain Phillips (Crime, Survival, Action)
(C) The Hateful Eight (Crime, Betrayal, Mystery)
(D) Finding Nemo (Father-Son, Coming-of-Age, Adventure) ← TARGET
Expected: (D)
Why Easy:
- 2 target facets ([Adventure, Father-Son]) = strong signal present in all references
- 2 noise facets per reference ([Found-Family, Coming-of-Age, Drama, etc.]) add only moderate distraction
- 4 examples make pattern clear
- Noise-to-signal ratio (num_noise_facets / num_target_facets) ≈ 2/2 = 1.0 (balanced)
Medium SNR: target=1, noise=3, ref=3
Which movie is most similar to Venom (Alien-Contact, Sci-Fi, Identity-Crisis, Horror);
Contact (Sci-Fi, Identity-Crisis, Alien-Contact, Drama);
District 9 (Class-Struggle, Action, Alien-Contact, Sci-Fi)?
Options:
(A) Arrival (Time-Loop, Alien-Contact, Sci-Fi) ← TARGET
(B) Moulin Rouge! (Musical, Romance, Drama)
(C) Free Solo (Adventure, Survival, Mental-Illness)
(D) Boss Level (Mystery, Revenge, Thriller)
Expected: (A)
Why Medium:
- 1 target facet ([Alien-Contact]) = subtle signal, requires careful attention
- 3 noise facets per reference ([Sci-Fi, Identity-Crisis, Horror, etc.]); most vary across references, though Sci-Fi happens to recur in all three
- Only 3 examples provide moderate context
- Noise-to-signal ratio (num_noise_facets / num_target_facets) ≈ 3/1 = 3.0 (noisy)
Low SNR (Hard): target=1, noise=4, ref=2
Based on these movies: Rudy (Biography, Underdog, Drama, Sport, Sacrifice);
The Miracle (Drama, Found-Family, Family, Underdog, Sport) -
which would you recommend next?
Options:
(A) Manchester by the Sea (Drama, Identity-Crisis, Found-Family)
(B) Fargo (Drama, Thriller, Crime, Fish-Out-of-Water)
(C) Spider-Man: Far From Home (Coming-of-Age, Adventure, Sci-Fi, Action)
(D) The Blindside (Found-Family, Drama, Sport, Biography) ← TARGET
(E) Chappie (Action, Crime, Artificial-Intelligence, Coming-of-Age)
Expected: (D)
Why Hard:
- 1 target facet ([Sport]) = very subtle signal buried in noise
- 4 noise facets per movie make pattern extraction difficult
- Only 2 examples = minimal context to learn from
- 5 choices increase difficulty
- Noise-to-signal ratio (num_noise_facets / num_target_facets) ≈ 4/1 = 4.0 (very noisy)
- Note: Drama and Underdog also appear in both references, and distractors share some facets with the references (Drama in (A) and (B), Found-Family in (A)), so the model must pin down Sport as the facet that separates the target from the distractors
Attention Test: If this were ref=6 with 6 Sport movies, weak models might actually perform WORSE, because attention breakdown spreads focus across too many examples.
Movie Database
~400 movies spanning multiple decades, each annotated with:
Genres (19 categories):
- Action, Adventure, Animation, Biography, Comedy, Crime, Documentary, Drama, Family, Fantasy, History, Horror, Musical, Mystery, Romance, Sci-Fi, Thriller, War, Western
Themes (25 categories):
- Coming-of-Age, Redemption, Revenge, Love-Triangle, Fish-Out-of-Water, Good-vs-Evil, Underdog, Betrayal, Sacrifice, Identity-Crisis, Father-Son, Time-Loop, Artificial-Intelligence, Dystopian-Society, Heist, Road-Trip, Survival, Corruption, Mental-Illness, Technology-Dependence, Class-Struggle, Alien-Contact, Superhero-Origin, Found-Family, Moral-Ambiguity
Note: Database could be replaced with any entity set having multi-valued categorical attributes. The generation algorithm is domain-agnostic.
Cognitive Skills Tested
All skills tested are domain-independent reasoning capabilities:
- Signal Extraction: Identifying relevant features from noise
- Pattern Recognition Under Noise: Detecting structure when SNR is low
- Few-Shot Learning: Inferring patterns from limited examples
- Multi-Dimensional Pattern Matching: Considering multiple facets simultaneously
- Attention Management: Maintaining focus on signal as context increases
No prior knowledge is required—all categorical information is provided in the problem text.
Model Capability Indicators
Strong Models:
- Performance saturates at ref=4-6 (more examples don't help)
- Handle high noise (4-5 noise facets) with minimal degradation
- Maintain >85% accuracy even at low SNR
Medium Models:
- Show clear gradient from ref=2 to ref=4
- Struggle with noise>3 but acceptable with noise=2-3
- Performance: 75-85% at medium SNR
Weak Models:
- May show inverse gradient (ref=2 > ref=6) due to attention breakdown
- Severely impacted by noise>3
- Performance: 55-65% at medium SNR, <50% at low SNR
Performance Indicators
From empirical testing across 5 models:
| Model Class | High SNR | Medium SNR | Low SNR | Attention Pattern |
|---|---|---|---|---|
| Strong (GLM-4.5, 82B+) | 90-95% | 85-90% | 80-85% | Saturates at ref=4 |
| Medium (30-32B reasoning) | 85-90% | 80-85% | 75-82% | Plateau ref=4-6 |
| Small (8-13B) | 70-80% | 65-75% | 60-70% | Slight improvement with refs |
| Weak (8B base) | 65-70% | 58-62% | 55-60% | INVERSE: ref=2 > ref=6 |
Key Finding: Difficulty scales primarily with SNR, not reference count or choice count. The num_noise_facets / num_target_facets ratio is the strongest predictor of task difficulty across all model classes.
Applications
This task evaluates capabilities essential for:
- Content Recommendation Systems: Finding patterns in user preferences
- Few-Shot Classification: Learning categories from minimal labeled examples
- Pattern Recognition Under Noise: Extracting signal from noisy categorical data
- Multi-Modal Feature Integration: Combining multiple feature dimensions
- Attention Capacity: Managing information as context scales
The domain-independent nature and complete information provision make this useful for evaluating pure reasoning capability without cultural bias or knowledge requirements.
Implementation Notes
Domain Portability
To adapt this task to another domain:
- Create entity database with multi-valued categorical facets
- Implement facet filtering (select entities matching target facets)
- Implement anti-facet filtering (select entities matching NO target facets)
- Update question templates to reference new domain
- No changes needed to generation algorithm or difficulty calculation
Example adaptation for Pokemon:
# Database: pokemon.json with {name, types, abilities, habitat}
# Facets: types=["Fire", "Water", ...], abilities=[...], habitat=[...]
# Question: "Which Pokemon belongs with this group: {references}?"
# Everything else identical
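The two filtering steps carry over unchanged to a new domain. A minimal sketch with a tiny in-memory Pokemon table standing in for `pokemon.json`; the record layout and the helper names `facets_of`, `reference_pool`, and `anti_reference_pool` are assumptions for illustration.

```python
from typing import Dict, List, Set

# Toy stand-in for pokemon.json; only types and abilities are modeled here.
POKEMON: List[Dict] = [
    {"name": "Charizard", "types": ["Fire", "Flying"], "abilities": ["Blaze"]},
    {"name": "Vulpix", "types": ["Fire"], "abilities": ["Flash Fire"]},
    {"name": "Squirtle", "types": ["Water"], "abilities": ["Torrent"]},
    {"name": "Bulbasaur", "types": ["Grass", "Poison"], "abilities": ["Overgrow"]},
]

def facets_of(entity: Dict) -> Set[str]:
    """Merge all multi-valued attributes into one facet set."""
    return set(entity["types"]) | set(entity["abilities"])

def reference_pool(db: List[Dict], target: Set[str]) -> List[str]:
    """Facet filtering: entities matching ALL target facets."""
    return [e["name"] for e in db if target <= facets_of(e)]

def anti_reference_pool(db: List[Dict], target: Set[str]) -> List[str]:
    """Anti-facet filtering: entities matching NONE of the target facets."""
    return [e["name"] for e in db if not (target & facets_of(e))]

target_facets = {"Fire"}
print(reference_pool(POKEMON, target_facets))       # ['Charizard', 'Vulpix']
print(anti_reference_pool(POKEMON, target_facets))  # ['Squirtle', 'Bulbasaur']
```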
Generation Algorithm
- Randomly select num_target_facets from available facets (genres + themes)
- Filter database to entities matching ALL target facets (reference pool)
- Filter database to entities matching NONE of target facets (anti-reference pool)
- Sample reference_count movies from reference pool
- Sample 1 target from reference pool (correct answer)
- Sample choice_count - 1 distractors from anti-reference pool
- For each entity, add num_noise_facets random facets from their full facet list
- Shuffle choices and format question using random template
This algorithm works identically for any domain with categorical facets.
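The steps can be sketched end to end as follows. This is an illustrative reading of the algorithm, not the engine's implementation: `generate_case` is a hypothetical function, the retry loop for facet choices with too-small pools is an added assumption, the target is sampled without replacement alongside the references, `num_answer_facets` is ignored, and the toy database uses only the facets displayed in the Medium SNR example.

```python
import random
from typing import Dict, Set

def generate_case(db: Dict[str, Set[str]], num_target_facets: int,
                  num_noise_facets: int, reference_count: int,
                  choice_count: int, rng: random.Random,
                  max_attempts: int = 1000) -> dict:
    all_facets = sorted(set().union(*db.values()))
    for _ in range(max_attempts):
        # Step 1: pick the signal facets at random.
        target_facets = set(rng.sample(all_facets, num_target_facets))
        # Steps 2-3: reference pool (ALL facets) and anti-reference pool (NONE).
        pool = sorted(n for n, f in db.items() if target_facets <= f)
        anti = sorted(n for n, f in db.items() if not (target_facets & f))
        if len(pool) > reference_count and len(anti) >= choice_count - 1:
            break  # assumption: retry until both pools are large enough
    else:
        raise ValueError("no facet selection satisfies the requested sizes")

    # Steps 4-5: references plus one distinct target from the reference pool.
    picked = rng.sample(pool, reference_count + 1)
    references, target = picked[:-1], picked[-1]
    # Step 6: distractors from the anti-reference pool.
    distractors = rng.sample(anti, choice_count - 1)

    def shown(name: str):
        # Step 7: signal facets plus up to num_noise_facets other facets.
        signal = sorted(db[name] & target_facets)
        rest = sorted(db[name] - target_facets)
        facets = signal + rng.sample(rest, min(num_noise_facets, len(rest)))
        rng.shuffle(facets)
        return name, facets

    # Step 8: shuffle choices and record the correct label.
    choices = [target] + distractors
    rng.shuffle(choices)
    label = f"({chr(ord('A') + choices.index(target))})"
    return {"target_facets": target_facets, "target": label,
            "references": [shown(n) for n in references],
            "choices": [shown(n) for n in choices]}

db = {
    "Venom": {"Alien-Contact", "Sci-Fi", "Identity-Crisis", "Horror"},
    "Contact": {"Sci-Fi", "Identity-Crisis", "Alien-Contact", "Drama"},
    "District 9": {"Class-Struggle", "Action", "Alien-Contact", "Sci-Fi"},
    "Arrival": {"Time-Loop", "Alien-Contact", "Sci-Fi"},
    "Moulin Rouge!": {"Musical", "Romance", "Drama"},
    "Free Solo": {"Adventure", "Survival", "Mental-Illness"},
    "Boss Level": {"Mystery", "Revenge", "Thriller"},
}
case = generate_case(db, num_target_facets=1, num_noise_facets=2,
                     reference_count=3, choice_count=4, rng=random.Random(0))
print(case["target_facets"], case["target"])
```

In this toy database only Alien-Contact and Sci-Fi have a reference pool large enough for three references plus a target, so the retry loop always settles on one of them.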