PointsDB Architecture

PointsDB is ReasonScape's DuckDB-based storage layer that replaces the legacy JSON bucket system with a proper database abstraction. It provides simple, composable primitives for storing and querying evaluation results with native support for multi-valued dimensions.

Design Philosophy

Simple, composable primitives over opinionated operations.

  • No high-level abstractions that hide what's happening
  • Explicit filtering at the database level
  • Native DuckDB features (LIST types, JSON queries, SQL macros)
  • Clear separation of read and write operations

Schema

The Two-Plane Structure

PointsDB organizes evaluation data using a two-plane structure where each point exists in both an Evaluation Plane and a Task Complexity Plane. Each plane has identity dimensions (what defines a point) and facet dimensions (how to organize points).

|               | EVALUATION               | TASK-COMPLEXITY                    |
|---------------|--------------------------|------------------------------------|
| IDENTITY (5D) | model, template, sampler | base_task, params                  |
| FACETS        | eval_id, groups[]        | tiers[], surfaces[], projections[] |

Dimension Types

Identity Dimensions (5D) - Uniquely identify a point on the manifold

  • model, template, sampler, base_task, params
  • These define WHAT was evaluated
  • Each point is evaluated once; points that match on all five identity dimensions are interchangeable

Facet Dimensions - Multi-valued organizational views

  • tiers[] - Difficulty levels (e.g., "easy", "medium", "hard")
  • surfaces[] - 2D visualization slices (computed via param filter intersection)
  • projections[] - 1D analysis sweeps (computed via param filter intersection)
  • groups[] - Model categories (e.g., "arch:moe", "size:large")
  • Points can belong to multiple facet values!
  • Useful for aggregation, filtering, and visualization

Task Dimensions - Difficulty knobs (folded into params JSON)

  • length, max_depth, prob_dewhitespace (arithmetic)
  • depth, width (jsonpath)
  • Task-specific and awkward to query directly; prefer facets instead (see the example below)
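
For illustration, an arithmetic point's params blob might look like this (values hypothetical):

{"length": 20, "max_depth": 2, "prob_dewhitespace": 0.5}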

Points Table

CREATE TABLE points (
    -- Primary key
    id INTEGER PRIMARY KEY,

    -- Index of eval that this point came from
    eval_id INTEGER,

    -- Identity dimensions (5D unique space)
    model VARCHAR NOT NULL,
    template VARCHAR NOT NULL,
    sampler VARCHAR NOT NULL,
    base_task VARCHAR NOT NULL,
    params JSON NOT NULL,

    -- Full task name (for reference)
    task VARCHAR NOT NULL,

    -- Facet dimensions (stored as native LISTs)
    tiers VARCHAR[],
    surfaces VARCHAR[],
    projections VARCHAR[],
    groups VARCHAR[],

    -- Evaluation results (immutable per unique point)
    adjusted_successes FLOAT NOT NULL,
    adjusted_trials FLOAT NOT NULL,
    adjusted_center FLOAT NOT NULL,
    adjusted_margin FLOAT NOT NULL,

    -- Raw counts
    correct INTEGER NOT NULL,
    invalid INTEGER NOT NULL,
    total INTEGER NOT NULL,
    truncated INTEGER NOT NULL,
    hard_terminated INTEGER NOT NULL,

    -- Derived ratios
    invalid_ratio FLOAT,
    truncated_ratio FLOAT,

    -- Token statistics (pre-aggregated at point level)
    prompt_tokens_mean FLOAT,
    completion_tokens_mean FLOAT,
    completion_tokens_correct_mean FLOAT,
    completion_tokens_incorrect_mean FLOAT,
    total_tokens BIGINT,

    -- Per-sample arrays (for compression/hazard analysis)
    completion_tokens_list INTEGER[],
    compressed_sizes_list INTEGER[],
    answer_status_list INTEGER[],

    -- FFT data (optional, for spectral analysis)
    fft_mean_list FLOAT[],
    fft_std_list FLOAT[],

    -- Metadata
    evaluated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Uniqueness constraint on identity dimensions
    UNIQUE(model, template, sampler, base_task, params)
);

Key Schema Decisions

  • No 'precision' column - Runner artifact that only affects sample count
  • No 'degrees'/'densities' columns - Runner artifacts; mapped to semantic tiers[] during evaluation
  • Facet dimensions as VARCHAR[] - Native lists enable proper membership queries
  • JSON for params - DuckDB JSON support allows path queries
  • eval_id column - Index into config.evals[] for display metadata
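
Taken together, these decisions keep common queries in plain SQL. A hand-written equivalent of a typical facet-plus-param query (illustrative only):

SELECT model, adjusted_center
FROM points
WHERE list_contains(tiers, 'easy')
  AND json_extract_string(params, '$.max_depth') = '2';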

The PointsDB API (7 Methods)

Write Operations (4 methods)

# Universal counting/existence check (with optional explosion)
count_points(filters: dict, explode: list[str] = None) -> int

# Bulk insert/update points efficiently
bulk_upsert_points(points: List[dict], replace_filters: dict) -> int

# Overwrite columns matching filter
update_points_set(filters: dict, updates: dict) -> int

# Append to list columns matching filter
update_points_append(filters: dict, appends: dict) -> int

Read Operations (2 methods)

# Generic aggregation with flexible grouping
aggregate(filters: dict, group_by: list[str], explode: list[str] = None) -> pd.DataFrame

# Raw point records with flexible filtering and optional list explosion
query_points(filters: dict, columns: list = None, explode: list[str] = None) -> pd.DataFrame

Helper Operations (1 method)

# Discover unique values across columns (for building sub-queries)
# Note: Automatically explodes list-type columns (tiers, surfaces, projections, groups)
unique_values(filters: dict, columns: list[str]) -> pd.DataFrame

Filter Syntax

All methods accept a filters dict with uniform DNF (Disjunctive Normal Form) syntax across all dimensions.

Uniform Filter Semantics

The same syntax applies to both scalar dimensions (model, template, sampler, base_task, task, eval_id) and set dimensions (tiers, surfaces, projections, groups):

# Single value - direct match
filters = {"base_task": "arithmetic"}           # scalar: base_task == 'arithmetic'
filters = {"groups": "arch:moe"}                # set: 'arch:moe' in groups

# Flat list - OR semantics
filters = {"base_task": ["arithmetic", "json"]} # scalar: base_task IN ('arithmetic', 'json')
filters = {"groups": ["arch:moe", "size:mid"]}  # set: 'arch:moe' in groups OR 'size:mid' in groups

# Nested list - AND semantics (inner lists become AND clauses)
filters = {"groups": [["arch:moe", "size:mid"]]}  # set: 'arch:moe' in groups AND 'size:mid' in groups

# Mixed nested - DNF (outer OR, inner AND)
filters = {
    "groups": [["arch:moe", "size:mid"], ["arch:dense"]]
    # ('arch:moe' in groups AND 'size:mid' in groups) OR ('arch:dense' in groups)
}

Examples by Dimension Type

Scalar dimensions (identity filters - equality/IN/DNF):

# Single value
filters = {"model": "Seed-OSS-36B"}

# Multiple values (OR)
filters = {
    "eval_id": [0, 1, 2],                      # eval_id IN (0, 1, 2)
    "model": ["Seed-OSS-36B", "MiniMax-M2"],   # model IN (...)
    "base_task": ["arithmetic", "json"]        # base_task IN (...)
}

# DNF (rarely useful for scalars - AND on scalars is usually nonsensical)
filters = {
    "eval_id": [[0], [1]]  # (eval_id=0) OR (eval_id=1) - equivalent to [0, 1]
}

Set dimensions (facet filters - list membership with OR/AND/DNF):

# Single value
filters = {"tiers": "easy"}                    # 'easy' in tiers

# Multiple values (OR - point has ANY of these)
filters = {
    "tiers": ["easy", "medium"],               # 'easy' in tiers OR 'medium' in tiers
    "groups": ["arch:moe", "size:mid"],        # 'arch:moe' in groups OR 'size:mid' in groups
    "surfaces": ["arithmetic_easy", "arithmetic_hard"]  # membership in either surface
}

# AND semantics (point has ALL of these)
filters = {
    "groups": [["arch:moe", "size:mid"]]       # 'arch:moe' in groups AND 'size:mid' in groups
}

# DNF (complex boolean logic)
filters = {
    "groups": [
        ["arch:moe", "size:mid"],              # ('arch:moe' in groups AND 'size:mid' in groups)
        ["arch:dense"]                         # OR ('arch:dense' in groups)
    ]
}

filters = {
    "surfaces": [
        ["arithmetic_easy", "surface_a"],      # ('easy' in surfaces AND 'a' in surfaces)
        ["arithmetic_hard"]                    # OR ('hard' in surfaces)
    ]
}

Param filters (JSON queries - always AND):

filters = {
    "params": {"max_depth": 2}  # json_extract_string(params, '$.max_depth') = '2'
}

filters = {
    "params": {"max_depth": 2, "length": 20}  # max_depth = 2 AND length = 20
}

Combining Filters Across Dimensions

All top-level filter keys are combined with AND:

filters = {
    "eval_id": [0, 1],                      # (eval 0 OR eval 1)
    "base_task": ["arithmetic", "json"],    # AND (arithmetic OR json)
    "groups": [["arch:moe", "size:mid"]],   # AND (has moe AND has mid)
    "tiers": ["easy"]                       # AND (has easy tier)
}

# Equivalent SQL WHERE clause:
# WHERE (eval_id IN (0, 1))
#   AND (base_task IN ('arithmetic', 'json'))
#   AND (list_contains(groups, 'arch:moe') AND list_contains(groups, 'size:mid'))
#   AND (list_contains(tiers, 'easy'))

Key Design Notes

  • Short and CLI-friendly: [a,b] for OR, [[a,b]] for AND
  • Uniform: Same syntax for scalars and sets
  • No validation: Nonsense queries (e.g., base_task: [['a','b']]) produce empty results, not errors
  • DNF expressiveness: Supports complex (A AND B) OR (C AND D) patterns when needed

Data Flow Architecture

1. Aggregate Processing Layer (evaluate.py)

Purpose: Process raw interview results and write to database

De-duplication:

  • Points are de-duplicated on 5D identity (model, template, sampler, base_task, params)
  • Multiple buckets with same identity but different degree/density are merged
  • Tier mapping: (degree, density) pairs mapped to semantic tier labels (e.g., "easy", "medium", "hard")
  • Uses bulk_upsert_points() with delete-first approach per scenario
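
A minimal sketch of the tier-mapping step, assuming a config-driven lookup (the tier_map structure and bucket fields here are hypothetical):

# Hypothetical (degree, density) -> tier lookup derived from the dataset config
tier_map = {("0", "low"): "easy", ("1", "mid"): "medium", ("2", "high"): "hard"}

tiers = set()
for bucket in buckets_for_identity:  # buckets sharing the same 5D identity
    tier = tier_map.get((bucket["degree"], bucket["density"]))
    if tier:
        tiers.add(tier)
point["tiers"] = sorted(tiers)  # de-duplicated semantic tier labels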

Key Operations:

# Dataset mode (recommended) - processes all evals from config
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

# Interview mode (legacy) - for ad-hoc evaluation without database
python evaluate.py \
    --interview "data/m12x/*Seed-OSS*/*" \
    --tokenizer "seedai/Seed-OSS-36B" \
    --output buckets.json

What gets written:

  • All identity dimensions (model, template, sampler, base_task, params)
  • Computed tiers[] from (degree, density) mapping via dataset config
  • Wilson CI statistics (adjusted_successes, adjusted_trials, center, margin)
  • Token statistics (means and totals)
  • Optional: compression data (completion_tokens_list, compressed_sizes_list, answer_status_list)
  • Optional: FFT data (fft_mean_list, fft_std_list)
  • Initially empty: surfaces[], projections[], groups[] (populated during post-processing)
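
Put together, a single point record handed to bulk_upsert_points() might look like this (values hypothetical; optional array columns omitted):

point = {
    "model": "Seed-OSS-36B", "template": "zerocot-nosys", "sampler": "seedoss-0k",
    "base_task": "arithmetic", "params": {"length": 20, "max_depth": 2},
    "task": "arithmetic-l20-d2",  # hypothetical full task name
    "tiers": ["easy"],
    "surfaces": [], "projections": [], "groups": [],  # populated during post-processing
    "adjusted_successes": 47.5, "adjusted_trials": 52.0,
    "adjusted_center": 0.913, "adjusted_margin": 0.071,
    "correct": 46, "invalid": 2, "total": 50, "truncated": 1, "hard_terminated": 0,
}
db.bulk_upsert_points([point], replace_filters={
    "model": "Seed-OSS-36B", "template": "zerocot-nosys", "sampler": "seedoss-0k",
})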

2. Enrich Layer (evaluate.py --dataset mode)

Purpose: Orchestrate evaluation and update group, surface, and projection memberships

Operations:

  1. Discover tasks - Parse config.evals[] to find missing evaluations
  2. Process evals - Write each eval into the database with parallel bucket processing
  3. Update groups[] - Apply scenario metadata to points matching eval filters
  4. Update surfaces[] and projections[] - Apply surface/projection labels to points matching task param filters

Key Operations:

# Process all missing evals and update tags with 16 parallel workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16

Group Update Pattern:

# Step 1: Clear all groups
db.update_points_set(filters={}, updates={"groups": []})

# Step 2: For each eval in config.evals[]
for eval_id, eval_info in enumerate(config['evals']):
    filters = eval_info['filters']  # e.g., {"model": "X", "template": "Y", "sampler": "Z"}
    groups = eval_info['groups']    # e.g., ["arch:moe", "size:mid"]

    db.update_points_append(filters=filters, appends={"groups": groups})

Surface Update Pattern:

# Step 1: Clear all surfaces
db.update_points_set(filters={}, updates={"surfaces": []})

# Step 2: For each surface in config.basetasks[task]['surfaces']
for surface in task_config['surfaces']:
    filters = {
        "base_task": base_task,
        "params": surface['filter']  # e.g., {"max_depth": 2}
    }
    db.update_points_append(filters=filters, appends={"surfaces": [surface['id']]})

Projection Update Pattern:

# Step 1: Clear all projections
db.update_points_set(filters={}, updates={"projections": []})

# Step 2: For each projection in config.basetasks[task]['projections']
for projection in task_config['projections']:
    projection_id = projection['id']  # e.g., "arithmetic_length"
    filters = {
        "base_task": base_task,
        "params": projection.get('filter', {})  # Optional additional param constraints
    }
    db.update_points_append(filters=filters, appends={"projections": [projection_id]})

3. ReasonScore Layer (src/scores.py)

Purpose: Load evaluation data and aggregate into ranked leaderboard format

Key Function:

def aggregate_leaderboard(
    db_path: str,
    config: Dict,
    filters: Optional[Dict] = None
)

What it does:

  • Uses PointsDB.aggregate() with explode=['tiers'] to get one row per (eval_id, tier, base_task)
  • Adds eval_label from config.evals[]
  • Computes adjusted_score = center + margin - truncated_ratio
  • Aggregates into leaderboard structure with tier and task details
  • Applies fair-sort ranking algorithm and assigns ranks
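
The core of that flow, sketched (output column names are assumed to match what aggregate() returns):

df = db.aggregate(
    filters=filters or {},
    group_by=["eval_id", "tier", "base_task"],
    explode=["tiers"],
)
# Assumed aggregate() output columns: center, margin, truncated_ratio
df["adjusted_score"] = df["center"] + df["margin"] - df["truncated_ratio"]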

Returns:

List of dicts, one per eval (in rank order):
{
    'rank': int,
    'model': str (eval_label),
    'reasonscore': float (overall reasonscore across all tiers),
    'tiers': {
        <tier_label>: {
            'reasonscore': float,
            'any_incomplete': bool,
            'tasks': {
                <base_task>: {
                    'center', 'margin', 'truncated_ratio',
                    'completion_tokens_mean', 'adjusted_score',
                    'expected_points', 'is_incomplete', ...
                }
            }
        }
    }
}
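
For example, rendering the top of the leaderboard from that structure (database path hypothetical):

for entry in aggregate_leaderboard("data/points.duckdb", config)[:10]:
    print(f"{entry['rank']:>3}  {entry['model']:<32}  {entry['reasonscore']:.3f}")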

Usage Patterns

Pattern 1: Check if scenario data exists

# Count unique points (no explosion)
count = db.count_points(
    filters={
        "model": "Seed-OSS-36B",
        "template": "zerocot-nosys",
        "sampler": "seedoss-0k"
    }
)
if count == 0:
    run_evaluate_py()

# Count exploded rows (e.g., total easy tier occurrences)
easy_count = db.count_points(
    filters={"tiers": ["easy"]},
    explode=["tiers"]
)
# This counts "how many times 'easy' tier appears across all points"
# A point with tiers=['easy','medium'] contributes 1 to this count

Pattern 2: Update group memberships

# Step 1: Clear all groups for a base_task
db.update_points_set(
    filters={"base_task": "arithmetic"},
    updates={"groups": []}
)

# Step 2: Append group labels based on scenario identity
db.update_points_append(
    filters={"model": "Seed-OSS-36B", "template": "zerocot-nosys", "sampler": "seedoss-0k"},
    appends={"groups": ["arch:moe", "size:mid"]}
)

Pattern 3: Update surface and projection memberships

# Step 1: Clear all surfaces for arithmetic task
db.update_points_set(
    filters={"base_task": "arithmetic"},
    updates={"surfaces": []}
)

# Step 2: Append surface labels based on param filters
db.update_points_append(
    filters={"base_task": "arithmetic", "params": {"max_depth": 1}},
    appends={"surfaces": ["arithmetic_easy"]}
)

db.update_points_append(
    filters={"base_task": "arithmetic", "params": {"max_depth": 3}},
    appends={"surfaces": ["arithmetic_hard"]}
)

# Step 3: Update projections for FFT analysis
db.update_points_set(
    filters={"base_task": "arithmetic"},
    updates={"projections": []}
)

# Projection 0: Length sweep (no additional filter)
db.update_points_append(
    filters={"base_task": "arithmetic"},
    appends={"projections": ["arithmetic_projection_0"]}
)

# Projection 1: Depth sweep (filter to specific params)
db.update_points_append(
    filters={"base_task": "arithmetic", "params": {"length": 20}},
    appends={"projections": ["arithmetic_projection_1"]}
)

Pattern 4: Leaderboard (multi-tier)

# Explode tiers to get one row per tier
df = db.aggregate(
    filters={},
    group_by=["eval_id", "base_task", "tier"],
    explode=["tiers"]
)
# Returns one row per (eval_id, base_task, tier) combination
# Post-process: Look up eval_id in config.evals[] for labels
# tier column already has semantic labels ("easy", "medium", "hard")

Pattern 5: Leaderboard (single tier filter)

# Filter by tier, no explosion
df = db.aggregate(
    filters={
        "groups": ["arch:moe", "size:mid"],  # Scenario group filters
        "tiers": ["easy"]
    },
    group_by=["eval_id", "base_task"]
)
# Returns one row per (eval_id, base_task) combination
# Post-process: pivot by base_task to create leaderboard table

Pattern 6: Surface visualization (raw points)

points = db.query_points(
    filters={
        "eval_id": 0,
        "base_task": "arithmetic",
        "surfaces": ["arithmetic_easy"]
    },
    columns=["params", "adjusted_center", "completion_tokens_mean"]
)

# Extract x, y, z from params for 3D plot

Pattern 7: Compression/Hazard analysis (per-sample arrays with explosion)

# Get points with exploded tiers for compression analysis
points = db.query_points(
    filters={
        "eval_id": 0,
        "base_task": "arithmetic"
    },
    columns=[
        "completion_tokens_list",
        "compressed_sizes_list",
        "answer_status_list",
        "tiers",
        "tier"  # Scalar column added by explosion
    ],
    explode=["tiers"]  # Explodes tiers list into one row per tier
)

# Result: Each point is duplicated for each tier it belongs to
# Row 1: point_id=1, tiers=['easy','medium'], tier='easy', ...
# Row 2: point_id=1, tiers=['easy','medium'], tier='medium', ...
# Row 3: point_id=2, tiers=['medium','hard'], tier='medium', ...
# Row 4: point_id=2, tiers=['medium','hard'], tier='hard', ...

# Now expand per-sample arrays into individual records in Python
records = []
for _, row in points.iterrows():
    for token, compressed, status in zip(
        row['completion_tokens_list'],
        row['compressed_sizes_list'],
        row['answer_status_list']
    ):
        records.append({
            'tier': row['tier'],  # Use the exploded scalar value
            'completion_tokens': token,
            'compressed_size': compressed,
            'answer_status': status
        })

Pattern 8: Discovery - Building sub-queries

# Discover all unique combinations of eval_id and base_task
combinations = db.unique_values(
    filters={
        "groups": ["arch:moe"],
        "tiers": ["easy"]
    },
    columns=["eval_id", "base_task"]
)

# Returns DataFrame with unique (eval_id, base_task) combinations
# Use these combinations to build sub-queries
for _, row in combinations.iterrows():
    process_scenario(row["eval_id"], row["base_task"])

Pattern 9: FFT Analysis with Projections

# Get points belonging to specific projection for FFT analysis
points = db.query_points(
    filters={
        "eval_id": 0,
        "base_task": "arithmetic",
        "projections": ["arithmetic_projection_0"]  # Filter to specific projection
    },
    columns=[
        "params",
        "fft_mean_list",
        "fft_std_list",
        "adjusted_center"
    ]
)

# Extract parameter value for grouping (e.g., length)
# Use projection config to know which param is the axis
axis_param = "length"  # From projection config
for _, row in points.iterrows():
    param_val = row['params'][axis_param]
    fft_mean = row['fft_mean_list']
    fft_std = row['fft_std_list']
    # Plot FFT spectrum for this parameter value

Implementation Notes

Filter Building

All methods use _build_where_clause(filters) internally:

  • Constructs the WHERE clause dynamically
  • Returns a (where_clause, params) tuple
  • Handles identity (scalar), facet (list), and JSON param filters
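
A minimal sketch of the shape of its output (simplified; not the exact generated SQL):

where_clause, params = db._build_where_clause({
    "model": ["Seed-OSS-36B", "MiniMax-M2"],
    "tiers": "easy",
    "params": {"max_depth": 2},
})
# where_clause ~ "model IN (?, ?) AND list_contains(tiers, ?)
#                 AND json_extract_string(params, '$.max_depth') = ?"
# params       ~ ["Seed-OSS-36B", "MiniMax-M2", "easy", "2"]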

Wilson Scores are SQL Macros

CREATE MACRO wilson_center(successes, trials) AS ...
CREATE MACRO wilson_margin(successes, trials) AS ...
  • Pure SQL, no Python round-trip
  • Used directly in aggregate() queries
  • Composable in complex queries
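
For reference, the standard Wilson score interval (z = 1.96 for 95% confidence) can be written as macros along these lines; a sketch, not necessarily the exact bodies in use:

-- Wilson center: (p + z^2/2n) / (1 + z^2/n), with p = successes/trials
CREATE MACRO wilson_center(successes, trials) AS
    ((successes / trials) + (1.96 * 1.96) / (2 * trials))
    / (1 + (1.96 * 1.96) / trials);

-- Wilson margin: z/(1 + z^2/n) * sqrt(p(1-p)/n + z^2/(4n^2))
CREATE MACRO wilson_margin(successes, trials) AS
    (1.96 / (1 + (1.96 * 1.96) / trials))
    * sqrt((successes / trials) * (1 - successes / trials) / trials
           + (1.96 * 1.96) / (4 * trials * trials));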

List Operations are Native

  • list_contains(tiers, 'easy') - Single membership
  • list_has_any(tiers, ['easy', 'medium']) - Multiple membership (OR)
  • list_append(surfaces, 'new_label') - Append to list
  • list_concat(surfaces, ['label1', 'label2']) - Append multiple

JSON Operations are Native

  • json_extract_string(params, '$.max_depth') - Extract field
  • json_extract_string(params, '$.length') = '20' - Filter condition
  • DuckDB optimizes JSON queries well

Explode for Multi-Value Dimensions

Both aggregate() and query_points() support the explode parameter for handling list-valued facet dimensions.

Supported explosions:

  • explode=["tiers"] → adds scalar tier column
  • explode=["surfaces"] → adds scalar surface column
  • explode=["projections"] → adds scalar projection column
  • explode=["groups"] → adds scalar group column
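
Explosion maps naturally onto DuckDB's unnest; conceptually (a sketch, not necessarily the exact generated query):

SELECT *, unnest(tiers) AS tier
FROM points
WHERE eval_id = 0;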

In aggregate() - Creates grouped statistics per exploded value:

# Without explode: filters force single tier
df = db.aggregate(
    filters={"tiers": ["easy"]},
    group_by=["eval_id", "base_task"]
)
# Returns one row per (eval_id, base_task)

# With explode: get all tiers
df = db.aggregate(
    filters={},
    group_by=["eval_id", "base_task", "tier"],
    explode=["tiers"]
)
# Returns one row per (eval_id, base_task, tier)

In query_points() - Duplicates points for per-sample analysis:

# Without explode: points may belong to multiple tiers
df = db.query_points(
    filters={"eval_id": 0},
    columns=["tiers", "completion_tokens_list"]
)
# Row 1: tiers=['easy','medium'], completion_tokens_list=[45, 52, ...]
# Row 2: tiers=['medium','hard'], completion_tokens_list=[78, 91, ...]

# With explode: one row per tier per point
df = db.query_points(
    filters={"eval_id": 0},
    columns=["tiers", "tier", "completion_tokens_list"],
    explode=["tiers"]
)
# Row 1: tiers=['easy','medium'], tier='easy', completion_tokens_list=[45, 52, ...]
# Row 2: tiers=['easy','medium'], tier='medium', completion_tokens_list=[45, 52, ...]
# Row 3: tiers=['medium','hard'], tier='medium', completion_tokens_list=[78, 91, ...]
# Row 4: tiers=['medium','hard'], tier='hard', completion_tokens_list=[78, 91, ...]

Why explode in query_points()? It is essential for compression/hazard analysis, where you need to:

  1. Explode list dimensions (tiers) at the database level for efficiency
  2. Expand per-sample arrays (completion_tokens_list) at the Python level
  3. Use the exploded scalar value (tier) to group samples correctly

Summary Table

| Method                 | Purpose                      | Explode Support   | Returns         |
|------------------------|------------------------------|-------------------|-----------------|
| count_points()         | Count/existence              | Yes (NEW)         | int             |
| bulk_upsert_points()   | Bulk insert/update           | N/A (write op)    | int (row count) |
| update_points_set()    | Overwrite columns            | N/A (write op)    | int (row count) |
| update_points_append() | Append to lists              | N/A (write op)    | int (row count) |
| aggregate()            | Grouped stats                | Yes               | DataFrame       |
| query_points()         | Raw points                   | Yes (NEW)         | DataFrame       |
| unique_values()        | Discover unique combinations | Always (implicit) | DataFrame       |

Testing

The implementation is validated by the unit tests in tests/test_points_db.py, which cover:

  • Basic CRUD operations
  • Filtering patterns (identity, groups, params)
  • Aggregation with and without explode
  • Bulk operations
  • Context manager support
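
A sketch of the style of these tests (import path, constructor signature, and point fields are assumptions for illustration):

# Assumed import path; ':memory:' keeps the test hermetic
from src.points_db import PointsDB

def test_filter_by_group():
    point = {
        "model": "A", "template": "t", "sampler": "s",
        "base_task": "arithmetic", "params": {"length": 1},
        "task": "arithmetic-l1", "tiers": ["easy"], "groups": ["arch:moe"],
        "surfaces": [], "projections": [],
        "adjusted_successes": 1.0, "adjusted_trials": 1.0,
        "adjusted_center": 1.0, "adjusted_margin": 0.0,
        "correct": 1, "invalid": 0, "total": 1,
        "truncated": 0, "hard_terminated": 0,
    }
    with PointsDB(":memory:") as db:
        db.bulk_upsert_points([point], replace_filters={"model": "A"})
        assert db.count_points(filters={"groups": "arch:moe"}) == 1
        assert db.count_points(filters={"groups": "arch:dense"}) == 0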