ReasonScape Compare Tools

The ReasonScape Compare tools create systematic 2D grid visualizations that enable multi-dimensional analysis of AI reasoning capabilities across models, tasks, and difficulty parameters. These tools transform the interactive exploration capabilities of the ReasonScape Explorer into comprehensive comparison matrices for research and evaluation.

Overview

ReasonScape provides two complementary comparison tools:

  • compare_surface.py: Creates grids of 3D difficulty manifold surfaces for cross-model performance comparison
  • compare_project.py: Creates grids of token-frequency FFT analysis plots for cognitive architecture comparison

Both tools generate publication-ready PNG exports with embedded statistics and multi-line labeling for complex experimental conditions.

Surface Comparison Tool (compare_surface.py)

The Surface Comparison tool creates systematic grids of 3D difficulty manifold visualizations, enabling researchers to compare model performance across identical reasoning challenges.

(Figure: example surface comparison grid)

Key Features

  • Cross-Model Analysis: Compare multiple models on identical difficulty surfaces
  • Multi-Task Support: Analyze a single task (all of its surfaces) or all tasks (the first surface of each)
  • Statistical Annotations: Embedded min/mean/max accuracy with test counts and token usage
  • Group Filtering: Focus on specific model families or experimental conditions
  • Consistent Scaling: Unified Z-axis (0-1 accuracy) and camera angles across all subplots

Usage

Basic Cross-Model Comparison

Compare all available models on the first surface of each task:

python compare_surface.py data/dataset-m6.json --output all_models_overview.png

Single Task Deep Dive

Analyze all surfaces of a specific task across models:

python compare_surface.py data/dataset-m6.json --task arithmetic --output arithmetic_deep_dive.png

All Surfaces Analysis

Generate comprehensive comparison of all surfaces from all tasks:

python compare_surface.py data/dataset-m6.json --task all --output comprehensive_analysis.png

Group-Filtered Analysis

Focus on specific model families using group filters:

python compare_surface.py data/dataset-m6.json --groups "opensource,meta" --output opensource_comparison.png

Command Line Arguments

  • config: Dataset configuration JSON file (required)
  • --task: Task filter (a specific task name, "all" for all surfaces, or omit to use the first surface of each task)
  • --output: Output PNG filename (default: comparison.png)
  • --groups: Comma-separated list of scenario groups to include
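
For reference, the sketch below approximates how these arguments could be wired up with Python's argparse; it mirrors the flag names and defaults listed above but is not the script's actual source.

import argparse

def parse_args():
    # Hypothetical CLI sketch matching the documented flags; compare_surface.py's
    # real implementation may differ in details.
    parser = argparse.ArgumentParser(description="Grid comparison of 3D difficulty surfaces")
    parser.add_argument("config", help="Dataset configuration JSON file")
    parser.add_argument("--task", default=None,
                        help='Task name, "all" for every surface, or omit for the first surface per task')
    parser.add_argument("--output", default="comparison.png", help="Output PNG filename")
    parser.add_argument("--groups", default=None,
                        help="Comma-separated list of scenario groups to include")
    return parser.parse_args()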

Grid Layout

The tool generates a grid where:

  • Rows: Model scenarios (filtered by groups if specified)
  • Columns: Task surfaces (determined by task parameter)
  • Y-axis labels: Multi-line scenario names with automatic line breaking
  • Column titles: Multi-line task/surface labels with parenthetical information
  • Annotations: Statistical summaries below each subplot
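
The sketch below illustrates this layout with matplotlib standing in for the tool's actual plotting backend; get_grid is a hypothetical callback that returns the accuracy meshgrids for one (scenario, surface) pair.

import matplotlib.pyplot as plt

def plot_surface_grid(scenarios, surfaces, get_grid, out_png="comparison.png"):
    # scenarios: row labels (models); surfaces: column labels (task surfaces)
    # get_grid(scenario, surface) -> (X, Y, Z) meshgrids with Z = accuracy in [0, 1]
    fig = plt.figure(figsize=(4 * len(surfaces), 3.5 * len(scenarios)))
    for r, scen in enumerate(scenarios):
        for c, surf in enumerate(surfaces):
            ax = fig.add_subplot(len(scenarios), len(surfaces),
                                 r * len(surfaces) + c + 1, projection="3d")
            X, Y, Z = get_grid(scen, surf)
            ax.plot_surface(X, Y, Z, cmap="viridis", vmin=0.0, vmax=1.0)
            ax.set_zlim(0.0, 1.0)            # unified accuracy axis across all subplots
            ax.view_init(elev=30, azim=-60)  # identical camera angle everywhere
            if r == 0:
                ax.set_title(surf, fontsize=9)
            if c == 0:
                ax.text2D(-0.15, 0.5, scen, transform=ax.transAxes,
                          rotation=90, fontsize=8)
    fig.savefig(out_png, dpi=200, bbox_inches="tight")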

Statistical Annotations

Each subplot includes embedded statistics:

Min: 0.234 | Mean: 0.567 | Max: 0.891
Tests: 12.3k | Tokens: 45.6M

  • Accuracy Range: Minimum, mean, and maximum accuracy across the surface
  • Test Volume: Total number of test cases (in thousands)
  • Token Usage: Total tokens consumed (in millions)
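
A small sketch of how an annotation string in this format could be assembled; the accuracy, tests, and tokens arrays are assumed per-cell inputs rather than fields the tool necessarily exposes.

import numpy as np

def format_annotation(accuracy, tests, tokens):
    # accuracy: 2D array of per-cell accuracy; tests/tokens: per-cell counts
    line1 = (f"Min: {np.nanmin(accuracy):.3f} | "
             f"Mean: {np.nanmean(accuracy):.3f} | "
             f"Max: {np.nanmax(accuracy):.3f}")
    line2 = f"Tests: {np.nansum(tests) / 1e3:.1f}k | Tokens: {np.nansum(tokens) / 1e6:.1f}M"
    return line1 + "\n" + line2

# Example: format_annotation(acc_grid, tests_grid, tokens_grid) yields the
# two-line annotation shown above for one subplot.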

FFT Comparison Tool (compare_project.py)

The FFT Comparison tool creates systematic grids of token-frequency spectral analysis plots, revealing cognitive architecture patterns across models and reasoning tasks.

(Figure: example FFT comparison grid)

Key Features

  • Spectral Analysis Grids: Compare token-frequency patterns across models and projections
  • Consistent Color Mapping: Fixed color palette ensures parameter values have identical colors across all subplots
  • Integrated Legends: Dedicated legend column with parameter value mappings
  • Multi-Task Support: Analyze all projections from specified tasks or all tasks
  • Statistical Confidence: Mean spectra with standard deviation bands

Usage

All Tasks FFT Analysis

Compare token-frequency patterns across all available projections:

python compare_project.py data/dataset-m6.json --output fft_all_tasks.png

Specific Task Focus

Analyze FFT patterns for specific tasks only:

python compare_project.py data/dataset-m6.json --task "arithmetic,boolean" --output fft_math_tasks.png

Group-Filtered FFT Analysis

Focus on specific model families:

python compare_project.py data/dataset-m6.json --groups "proprietary" --output fft_proprietary_models.png

Command Line Arguments

  • config: Dataset configuration JSON file (required)
  • --task: Comma-separated list of tasks to include (default: all tasks)
  • --output: Output PNG filename (default: fft_comparison.png)
  • --groups: Comma-separated list of scenario groups to include

Grid Layout

The tool generates a grid where:

  • Rows: Task projections (each projection becomes a row)
  • Columns: Legend column + model scenarios
  • Y-axis labels: Multi-line task/projection labels
  • Column titles: Multi-line scenario names
  • Legend column: Parameter value mappings with consistent colors
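
A rough sketch of this row/column arrangement, using matplotlib as a stand-in for the tool's backend; it assumes a hypothetical get_spectra helper plus the plot_fft_panel and palette helpers sketched in the next two subsections.

import matplotlib.pyplot as plt

def plot_fft_grid(projections, scenarios, get_spectra, palette, out_png="fft_comparison.png"):
    # projections: row labels; scenarios: model column labels
    # get_spectra(projection, scenario) -> {parameter value: (n_runs, n_bins) array}
    nrows, ncols = len(projections), len(scenarios) + 1   # extra column for the legend
    fig, axes = plt.subplots(nrows, ncols, figsize=(3 * ncols, 2.5 * nrows), squeeze=False)
    for r, proj in enumerate(projections):
        axes[r][0].axis("off")                             # dedicated legend column
        for c, scen in enumerate(scenarios, start=1):
            plot_fft_panel(axes[r][c], get_spectra(proj, scen), palette)
            if r == 0:
                axes[r][c].set_title(scen, fontsize=9)
        axes[r][1].set_ylabel(proj, fontsize=8)            # row label on the first data column
        handles, labels = axes[r][1].get_legend_handles_labels()
        axes[r][0].legend(handles, labels, loc="center", fontsize=8, title="parameter")
    fig.tight_layout()
    fig.savefig(out_png, dpi=200)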

FFT Analysis Components

Each FFT subplot shows:

  • Mean spectrum lines: Average token-frequency response for each parameter value
  • Confidence bands: Standard deviation regions around mean spectra
  • Consistent colors: Parameter values use identical colors across all models
  • Frequency range: Typically 0-64 Hz covering linguistic complexity patterns
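
A minimal sketch of one such panel, assuming each run has already been reduced to a 1D spectrum and grouped by parameter value; spectra_by_value and palette are assumed inputs, not the tool's actual data structures.

import numpy as np

def plot_fft_panel(ax, spectra_by_value, palette):
    # spectra_by_value: {parameter value: array of shape (n_runs, n_bins)}
    for i, (value, spectra) in enumerate(sorted(spectra_by_value.items())):
        mean = spectra.mean(axis=0)
        std = spectra.std(axis=0)
        bins = np.arange(len(mean))
        color = palette[i % len(palette)]   # i-th value -> i-th color, in every panel
        ax.plot(bins, mean, color=color, label=str(value))
        ax.fill_between(bins, mean - std, mean + std, color=color, alpha=0.2)
    ax.set_xlabel("frequency")
    ax.set_ylabel("magnitude")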

Color Consistency

The tool uses a fixed 10-color palette that cycles consistently:

  • Same parameter values → Same colors across all models
  • Legend integration → Clear parameter-to-color mapping
  • Cross-model comparison → Visual patterns immediately apparent
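
One way to realize this is to build the value-to-color mapping once per projection and hand the same mapping (or palette) to every model's panel; the sketch below uses matplotlib's tab10 palette, which may differ from the tool's actual colors.

import matplotlib.pyplot as plt

def build_color_map(parameter_values):
    # Fixed 10-color palette: the i-th distinct value always gets the i-th color,
    # cycling if there are more than ten values.
    palette = plt.cm.tab10.colors
    return {v: palette[i % len(palette)] for i, v in enumerate(parameter_values)}

# Example: build_color_map([8, 16, 32, 48]) assigns "8" the same color in every
# model's subplot, so cross-model differences stand out immediately.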

Integration with Dataset Configuration

Both tools rely on dataset configuration files that define visualization parameters and data organization. For detailed information about the dataset configuration format, see the Configuration Documentation.

The key components include:

Surface Definitions (for compare_surface.py)

{
  "basetasks": {
    "arithmetic": {
      "surfaces": [
        {
          "label": "Length x Depth (Random Whitespace, -9 to 9)",
          "filter": {"min_number": -9, "max_number": 9, "prob_dewhitespace": 0.5},
          "x_data": "length", "x_values": [8, 16, 24, 32, 40, 48],
          "y_data": "max_depth", "y_values": [0, 1, 4]
        }
      ]
    }
  }
}
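
Read as a recipe, a surface definition fixes the filter parameters and then averages accuracy over every (x, y) cell. A hedged sketch of that aggregation, assuming results are available as flat records with parameter fields and a boolean correct flag (a shape this documentation does not specify):

import numpy as np

def build_accuracy_grid(records, surface):
    # records: list of dicts holding the generation parameters plus "correct"
    filt = surface["filter"]
    xs, ys = surface["x_values"], surface["y_values"]
    xk, yk = surface["x_data"], surface["y_data"]
    Z = np.full((len(ys), len(xs)), np.nan)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            cell = [r for r in records
                    if all(r.get(k) == v for k, v in filt.items())
                    and r.get(xk) == x and r.get(yk) == y]
            if cell:
                Z[i, j] = np.mean([r["correct"] for r in cell])
    return Z  # Z feeds the 3D surface; NaN cells indicate missing coverage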

Projection Definitions (for compare_project.py)

{
  "basetasks": {
    "arithmetic": {
      "projections": [
        {
          "label": "Length (depth=0, whitespace=50%)",
          "axis": "length",
          "filter": {"max_depth": 0, "prob_dewhitespace": 0.5},
          "values": [8, 16, 32, 48],
          "labels": ["8","16","32","48"]
        }
      ]
    }
  }
}
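
A projection definition works the same way but sweeps a single axis: records matching the filter are grouped by their value on that axis, and each group becomes one colored spectrum in the FFT panels. A short sketch under the same assumed record shape as above:

def group_by_projection(records, projection):
    # Returns {axis value: [matching records]} for one projection definition.
    filt, axis = projection["filter"], projection["axis"]
    groups = {v: [] for v in projection["values"]}
    for r in records:
        if all(r.get(k) == v for k, v in filt.items()) and r.get(axis) in groups:
            groups[r.get(axis)].append(r)
    return groups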

Scenario and Group Management

{
  "scenarios": {
    "gpt-4o+zerocot-nosys+greedy-4k": { 
      "label": "OpenAI GPT-4o", 
      "groups": ["openai", "proprietary"] 
    },
    "llama-3.1-8b+zerocot-nosys+greedy-4k": { 
      "label": "Meta Llama-3.1-8B", 
      "groups": ["meta", "opensource"] 
    }
  }
}
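
The --groups flag on both tools then reduces to an intersection test against each scenario's groups list; a minimal sketch:

def filter_scenarios(scenarios, groups_arg):
    # scenarios: the "scenarios" mapping above; groups_arg: e.g. "opensource,meta" or None
    if not groups_arg:
        return dict(scenarios)
    wanted = {g.strip() for g in groups_arg.split(",")}
    return {name: cfg for name, cfg in scenarios.items()
            if wanted & set(cfg.get("groups", []))}

# filter_scenarios(config["scenarios"], "opensource,meta") keeps only the
# Llama entry from the example above.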

Research Applications

Cross-Model Performance Analysis

Use compare_surface.py to:

  • Identify model strengths: Which models excel at specific reasoning challenges?
  • Find failure modes: Where do models consistently break down?
  • Compare architectures: How do different model families handle complexity?
  • Validate improvements: Do model updates improve performance across all difficulty dimensions?

Cognitive Architecture Investigation

Use compare_project.py to:

  • Detect processing patterns: How do models tokenize and process reasoning problems?
  • Compare linguistic strategies: Do models use different frequency patterns for identical tasks?
  • Identify architectural signatures: Can you distinguish model families by their spectral patterns?
  • Analyze reasoning evolution: How do spectral patterns change with problem difficulty?

Publication Workflow

  1. Generate comparison grids using both tools with consistent dataset configurations
  2. Export high-resolution PNGs with embedded statistics and clear labeling
  3. Combine with leaderboard data for comprehensive model evaluation
  4. Use group filtering to create focused comparisons for specific research questions

Best Practices

Dataset Configuration

  • Consistent labeling: Use clear, hierarchical labels for tasks and scenarios
  • Logical groupings: Organize models by architecture, size, or training approach
  • Complete coverage: Ensure all models have data for meaningful comparisons

Analysis Strategy

  • Start broad: Use overview comparisons to identify interesting patterns
  • Focus deep: Use task-specific analysis to investigate specific phenomena
  • Cross-validate: Compare surface performance with FFT patterns for comprehensive understanding
  • Statistical awareness: Pay attention to test counts and confidence intervals

Visualization Optimization

  • Appropriate resolution: Use high DPI settings for publication-quality exports
  • Consistent parameters: Use identical dataset configurations across comparison runs
  • Clear documentation: Include analysis parameters and filtering criteria in research notes

The ReasonScape Compare tools transform individual model evaluation into systematic comparative analysis, enabling researchers to identify patterns, validate hypotheses, and advance understanding of AI reasoning capabilities across multiple dimensions of analysis.