PointsDB Architecture

PointsDB is ReasonScape's DuckDB-based storage layer for evaluation results. It provides simple, composable primitives for storing and querying points with native support for scalar JSON dimensions and query-time statistical computation.

Design Philosophy

Simple, composable primitives over opinionated operations.

  • No high-level abstractions that hide what's happening
  • Explicit filtering at the database level
  • Native DuckDB features (LIST types, JSON queries, SQL macros)
  • Separate read vs write operations clearly
  • Statistical adjustments computed at query time from raw counters, not stored

Schema

The Three-Layer Structure

PointsDB organizes evaluation data using a two-plane, three-layer structure. Each plane has a definition layer (how it was designed), a generation layer (actual parameters), and an analysis layer (researcher views).

                  LAYER 1              LAYER 2                LAYER 3
                 (definition)         (generation)           (analysis)
                ──────────────       ──────────────         ──────────────
TASK PLANE      manifold{}           base_task              views: (config)
                which grid entry     params{}               query recipes

EVAL PLANE      (none)               model    \              facets{}
                                     template | eval_id      how to cut models
                                     sampler  /

Dimension Types

Identity Dimensions (5D) — Uniquely identify a point on the manifold

  • model, template, sampler, base_task, params
  • These define WHAT was evaluated
  • A point is evaluated once; points with equal values along these five dimensions are interchangeable

Scalar JSON Dimensions — Researcher-defined analysis views

  • manifold{} — Task Layer 1: which named grid entry produced this point (id, target_tokens)
  • facets{} — Eval Layer 3: model classification tags decomposed from groups[] at construction time; ["arch:moe", "size:large"] becomes {"arch": "moe", "size": "large"}
  • groups[] — Eval Layer 3: original VARCHAR[] tags; kept as backwards-compatible read path until deprecated

Task Dimensions — Difficulty knobs (folded into params JSON)

  • length, max_depth, prob_dewhitespace (arithmetic)
  • depth, width (jsonpath)
  • Task-specific; prefer facets.* or manifold.* for analysis
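
Putting the dimension types together, a single point record might look like the following sketch (all values are illustrative, assembled from examples used elsewhere in this document, not real evaluation data):

```python
# Illustrative point record showing how the dimension types fit together.
point = {
    # Identity dimensions (5D) — WHAT was evaluated
    "model": "Seed-OSS-36B",
    "template": "zerocot-nosys",
    "sampler": "seedoss-0k",
    "base_task": "arithmetic",
    "params": {"length": 20, "max_depth": 2, "prob_dewhitespace": 0.5},
    # Task Layer 1 — which named grid entry produced this point
    "manifold": {"id": "csv_tall", "target_tokens": 4096},
    # Eval Layer 3 — scalar facets decomposed from groups[]
    "facets": {"arch": "moe", "size": "large"},
    "groups": ["arch:moe", "size:large"],  # deprecated read path
}
```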

Points Table

CREATE TABLE points (
    -- Primary key
    id INTEGER PRIMARY KEY,

    -- Hash-based identifier: sha256(model|template|sampler)[:6]
    eval_id VARCHAR(6),

    -- Identity dimensions (5D unique space)
    model VARCHAR NOT NULL,
    template VARCHAR NOT NULL,
    sampler VARCHAR NOT NULL,
    base_task VARCHAR NOT NULL,
    params JSON NOT NULL,

    -- Reference to original .ndjson step file (for direct inspection)
    step VARCHAR,

    -- Task Layer 1: which named grid entry produced this point
    manifold JSON DEFAULT '{}',

    -- Eval Layer 3: model classification tags (scalar JSON, decomposed from groups[] at construction time)
    facets JSON DEFAULT '{}',

    -- Eval Layer 3: original colon-encoded tags (backwards-compatible read path, deprecated)
    groups VARCHAR[],

    -- Guess accumulator: Σ(guess_chance) over completed (non-truncated) trials.
    -- Enables query-time computation of all three statistical exits.
    guess_accum FLOAT NOT NULL DEFAULT 0.0,

    -- Raw counts (truncated excluded from total)
    correct INTEGER NOT NULL,
    invalid INTEGER NOT NULL,
    total INTEGER NOT NULL,       -- non-truncated sample count
    truncated INTEGER NOT NULL,   -- count of length-truncated samples

    -- Token statistics (pre-aggregated at point level)
    prompt_tokens_mean FLOAT,
    prompt_tokens_list INTEGER[],      -- per-sample prompt token counts
    completion_tokens_mean FLOAT,
    completion_tokens_correct_mean FLOAT,
    completion_tokens_incorrect_mean FLOAT,
    total_tokens BIGINT,

    -- Per-sample arrays (for compression/hazard analysis)
    completion_tokens_list INTEGER[],
    compressed_sizes_list INTEGER[],
    outcome_list INTEGER[],  -- 0=correct, 1=incorrect, 2=truncated, 3=invalid

    -- Metadata
    evaluated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    -- Uniqueness constraint on identity dimensions
    UNIQUE(model, template, sampler, base_task, params)
);

Computed Columns (not stored)

The following columns are computed at query time from raw counters. They appear in query_points() and aggregate() output but are not stored in the database.

Column           Formula                                  Notes
adj_succ         mode-dependent (see Statistical Modes)   Statistically adjusted successes
adj_trials       mode-dependent (see Statistical Modes)   Statistically adjusted trials
center           mode-dependent Wilson score              Point estimate
margin           mode-dependent Wilson margin             Half-width of 95% CI
invalid_ratio    invalid / total                          0.0 when total = 0
truncated_ratio  truncated / (total + truncated)          0.0 when both are 0

See Statistical Modes for how formulas differ across modes.
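
As a sketch, the two mode-independent ratios and their zero guards could be computed like this in plain Python (this mirrors the table above, not the library's SQL):

```python
def invalid_ratio(invalid: int, total: int) -> float:
    """invalid / total, defined as 0.0 when total == 0."""
    return invalid / total if total else 0.0

def truncated_ratio(truncated: int, total: int) -> float:
    """truncated / (total + truncated), 0.0 when both are zero."""
    denom = total + truncated
    return truncated / denom if denom else 0.0
```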

Key Schema Decisions

  • No stored adjusted_* — Computed at query time from correct, total, truncated, guess_accum; eliminates staleness
  • guess_accum — Σ(guess_chance) over non-truncated trials; single stored value enables all three statistical exits
  • prompt_tokens_list — Per-sample prompt token counts for input-length hazard analysis (especially relevant for unbounded evaluations where prompt size varies across the manifold)
  • Two-tier storage (arrays + means) — Arrays for point-level workflows; means for aggregate workflows. Not redundant: unnesting a list per row in SQL is expensive and inappropriate in aggregate()
  • No task column — Redundant with base_task + params; use step for the source filename
  • No fft_mean_list/fft_std_list — FFT is computed on demand from completion_tokens_list
  • Scalar JSON for facets/manifold — facets{} and manifold{} are scalar JSON (one value per dimension per point), enabling dotted-key filter and group_by expressions without list explosion. facets{} is decomposed from groups[] at construction time.
  • groups[] as VARCHAR[] — Kept as backwards-compatible read path; use facets.* for new consumers. Post-cleanup: drop once all consumers migrate.
  • JSON for params — DuckDB JSON support allows path queries
  • eval_id VARCHAR(6) — Stable hash of (model, template, sampler) identity triple
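
The eval_id derivation can be sketched as follows. The schema comment specifies sha256(model|template|sampler)[:6]; the exact joining and encoding details here are assumptions, not the library code:

```python
import hashlib

def derive_eval_id(model: str, template: str, sampler: str) -> str:
    """First 6 hex chars of sha256 over the pipe-joined identity
    triple (sketch; actual encoding in PointsDB may differ)."""
    digest = hashlib.sha256(
        f"{model}|{template}|{sampler}".encode("utf-8")
    ).hexdigest()
    return digest[:6]
```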

Statistical Modes

Both query_points() and aggregate() accept a mode parameter that controls how adjusted statistics are computed. Modes are named by two axes: exit (how truncation is handled) and correction (whether guess probability is subtracted).

Mode   Exit          Correction  center formula                                    Default for
'E_I'  Independence  None        Wilson(n_e, n_u)
'E_P'  Pessimism     None        Wilson(n_e, n)
'E_O'  Optimism      None        Wilson(n_e + n_t, n)
'C_I'  Independence  Guess       Wilson(n_e - g, n_u - g)                          query_points()
'C_P'  Pessimism     Guess       Wilson(n_e - g, n_u - g) × Wilson(n_u, n)         aggregate()
'C_O'  Optimism      Guess       1 - Wilson(n_u - n_e, n_u - g) × Wilson(n_u, n)

Notation: n_e = correct, n_u = total (non-truncated), n_t = truncated, n = total + truncated, g = guess_accum

Factored modes (C_P, C_O) compute center as a product of two Wilson intervals evaluated in SQL using the wilson_high/wilson_low macros; margin is (CI_high_product - CI_low_product) / 2.

C_P is the recommended mode for aggregated analysis — it applies pessimistic truncation handling and guess correction together.
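
The formulas above can be sketched in plain Python, assuming the standard 95% Wilson score interval (z = 1.96). Function names mirror the SQL macros, but this is an illustration, not the library code:

```python
import math

Z = 1.96  # 95% confidence

def wilson_center(successes: float, trials: float) -> float:
    """Wilson score point estimate; 0.0 on empty trials."""
    if trials <= 0:
        return 0.0
    p = successes / trials
    return (p + Z * Z / (2 * trials)) / (1 + Z * Z / trials)

def wilson_margin(successes: float, trials: float) -> float:
    """Half-width of the 95% Wilson interval; 0.0 on empty trials."""
    if trials <= 0:
        return 0.0
    p = successes / trials
    return (Z / (1 + Z * Z / trials)) * math.sqrt(
        p * (1 - p) / trials + Z * Z / (4 * trials * trials)
    )

def center_C_I(correct, total, guess_accum):
    """C_I: Wilson(n_e - g, n_u - g), clamped at zero like the SQL."""
    return wilson_center(max(0.0, correct - guess_accum),
                         max(0.0, total - guess_accum))

def center_C_P(correct, total, truncated, guess_accum):
    """C_P: Wilson(n_e - g, n_u - g) x Wilson(n_u, n)."""
    return center_C_I(correct, total, guess_accum) * \
        wilson_center(total, total + truncated)
```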

The PointsDB API

Instantiation

# Preferred: open via dataset config (enables config accessor methods)
db = PointsDB(dataset_path="data/r12.json")

# Direct: open by database path (no config accessor methods)
db = PointsDB(db_path="data/pointsdb-r12.db")

# Write mode (required for evaluate.py; default is read_only=True)
db = PointsDB(dataset_path="data/r12.json", read_only=False)

Read-Only Mode (default):

  • Allows multiple concurrent readers without locking
  • Prevents accidental modifications during analysis
  • Macros are read from the database file (registered at write time)
  • Write operations raise errors if attempted

dataset_path= vs db_path=:

  • dataset_path= loads the full dataset config, derives the .db path from it, and enables all config accessor methods below
  • db_path= connects directly to the database file; config accessor methods raise RuntimeError if called

Config Accessor Methods

These methods require dataset_path= at construction; they raise RuntimeError if db_path= was used instead.

# {eval_id: eval_dict} — full eval config including filters, groups, label
evals() -> dict

# {eval_id: label} — convenience shortcut; falls back to eval_id if no label set
eval_labels() -> dict

# {task_name: task_dict} — base task configs with facet definitions
tasks() -> dict

# Dataset display name (from config)
name() -> str

Write Operations

# Universal counting/existence check
count_points(filters: dict) -> int

# Bulk insert/update points efficiently
bulk_upsert_points(points: List[dict], replace_filters: dict) -> int

# Overwrite columns matching filter
update_points_set(filters: dict, updates: dict) -> int

# Append to list columns matching filter
update_points_append(filters: dict, appends: dict) -> int

# Set a scalar value inside a JSON column (json_merge_patch semantics)
update_points_json_set(filters: dict, column: str, key: str, value: str) -> int

update_points_json_set merges {key: value} into the named JSON column using json_merge_patch, which preserves all existing keys. Use this to write individual facets.* or manifold.* values without overwriting sibling keys:

# Set facets.operations = "sum_mode_median" for matching points
db.update_points_json_set(
    filters={"base_task": "tables", "params": {"op_family": "sum"}},
    column="facets",
    key="operations",
    value="sum_mode_median"
)

Read Operations

# Generic aggregation with flexible grouping and Wilson CI computed in SQL
aggregate(filters: dict, group_by: list[str], mode: str = 'C_P') -> pd.DataFrame

# Raw point records with flexible filtering
query_points(filters: dict, columns: list = None,
             order_by: list[str] = None, mode: str = 'C_I') -> pd.DataFrame

Both methods return the 6 computed columns (adj_succ, adj_trials, center, margin, invalid_ratio, truncated_ratio) in addition to stored columns. For query_points(), computed columns are included when columns=None (default). For aggregate(), they are always included.

Note on round-trips: If you read points with query_points() and re-insert them via bulk_upsert_points(), exclude the computed columns — they are not stored in the schema.
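
A minimal sketch of such a round-trip guard (the column set comes from the list above; the helper name is ours, not part of the API):

```python
# Computed columns returned by query_points()/aggregate() but absent
# from the stored schema — must be dropped before re-upserting.
COMPUTED_COLUMNS = {
    "adj_succ", "adj_trials", "center",
    "margin", "invalid_ratio", "truncated_ratio",
}

def strip_computed(point: dict) -> dict:
    """Return a copy of a point record safe for bulk_upsert_points()."""
    return {k: v for k, v in point.items() if k not in COMPUTED_COLUMNS}
```

For example, `db.bulk_upsert_points([strip_computed(p) for p in rows], replace_filters=...)` would then avoid inserting columns the schema does not store.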

Helper Operations

# Discover unique values across columns (for building sub-queries)
# Note: Automatically explodes list-type columns (groups[]); prefer facets.* for new consumers
unique_values(filters: dict, columns: list[str]) -> pd.DataFrame

Filter Syntax

All methods accept a filters dict with uniform DNF (Disjunctive Normal Form) syntax across all dimensions.

Uniform Filter Semantics

The same syntax applies to scalar dimensions (model, template, sampler, base_task, eval_id), list dimensions (groups[] — deprecated), JSON path dimensions (params.*, facets.*, manifold.*):

# Single value - direct match
filters = {"base_task": "arithmetic"}           # scalar: base_task == 'arithmetic'
filters = {"groups": "arch:moe"}                # list: 'arch:moe' in groups

# Flat list - OR semantics
filters = {"base_task": ["arithmetic", "json"]} # scalar: base_task IN ('arithmetic', 'json')
filters = {"groups": ["arch:moe", "size:mid"]}  # list: 'arch:moe' in groups OR 'size:mid' in groups

# Nested list - AND semantics (inner lists become AND clauses)
filters = {"groups": [["arch:moe", "size:mid"]]}  # list: 'arch:moe' in groups AND 'size:mid' in groups

# Mixed nested - DNF (outer OR, inner AND)
filters = {
    "groups": [["arch:moe", "size:mid"], ["arch:dense"]]
    # ('arch:moe' in groups AND 'size:mid' in groups) OR ('arch:dense' in groups)
}

Examples by Dimension Type

Scalar dimensions (identity filters — equality/IN/DNF):

# Single value
filters = {"model": "Seed-OSS-36B"}

# Multiple values (OR)
filters = {
    "eval_id": ["<eval_id_1>", "<eval_id_2>"],  # eval_id IN (...)
    "model": ["Seed-OSS-36B", "MiniMax-M2"],
    "base_task": ["arithmetic", "json"]
}

JSON path dimensions (dotted-key equality — always AND across keys):

# facets.* — eval-plane grouping (preferred over groups[])
filters = {"facets.arch": "moe"}
filters = {"facets.arch": "moe", "facets.size": "large"}  # both conditions (AND)

# params.*
filters = {"params.max_depth": 2}               # json_extract params.max_depth = '2'
filters = {"params.max_depth": 2, "params.length": 20}  # both conditions (AND)

# manifold.*
filters = {"manifold.id": "csv_tall"}
filters = {"manifold.id": ["csv_tall", "markdown_wide"]}  # OR semantics

List dimensions (groups[] — deprecated, use facets.* for new consumers):

# Legacy membership filter (backwards-compatible)
filters = {"groups": "arch:moe"}               # 'arch:moe' in groups
filters = {"groups": [["arch:moe", "size:mid"]]} # 'arch:moe' in groups AND 'size:mid' in groups

Note on facets.* group_by: Points where facets is {} are excluded from the result automatically — the IS NOT NULL guard is implicit. This means group_by=["eval_id", "facets.arch"] only returns rows for points that have that facet key set.

Combining Filters Across Dimensions

All top-level filter keys are combined with AND:

filters = {
    "eval_id": ["<eval_id_1>", "<eval_id_2>"],  # (eval 1 OR eval 2)
    "base_task": ["arithmetic", "json"],         # AND (arithmetic OR json)
    "facets.arch": "moe",                        # AND facets.arch = 'moe'
    "manifold.id": "csv_tall",                   # AND manifold.id = 'csv_tall'
}

Key Design Notes

  • Short and CLI-friendly: [a,b] for OR, [[a,b]] for AND
  • Uniform: Same syntax for scalars and lists
  • Dotted-key for JSON: params.*, facets.*, manifold.* use the same syntax
  • No validation: Nonsense queries produce empty results, not errors
  • DNF expressiveness: Supports complex (A AND B) OR (C AND D) patterns when needed

Usage Patterns

Pattern 1: Check if scenario data exists

# Read-only access for checking (allows concurrent readers)
with PointsDB(db_path="data/dataset.db", read_only=True) as db:
    count = db.count_points(
        filters={
            "model": "Seed-OSS-36B",
            "template": "zerocot-nosys",
            "sampler": "seedoss-0k"
        }
    )
    if count == 0:
        print("No data found, need to run evaluate.py")

Pattern 2: Leaderboard (standard)

# C_P is preferred at task level — pessimistic truncation handling with guess correction
df = db.aggregate(
    filters={},
    group_by=["eval_id", "base_task"],
    mode='C_P'
)

Pattern 3: Leaderboard sliced by manifold

# Score per named grid entry (e.g., csv_tall vs markdown_wide)
df = db.aggregate(
    filters={},
    group_by=["eval_id", "manifold.id"],
    mode='C_P'
)

Pattern 4: Leaderboard sliced by facet dimension

# Score per operation family
df = db.aggregate(
    filters={},
    group_by=["eval_id", "facets.operations"],
    mode='C_P'
)
# Points without facets.operations set are excluded automatically

Pattern 5: Filter by manifold and facets

# Points in a specific grid entry, specific operation family
points = db.query_points(
    filters={
        "eval_id": ["<eval_id>"],
        "manifold.id": "csv_tall",
        "facets.operations": "sum_mode_median"
    },
    columns=["params", "center", "completion_tokens_mean"],
    order_by=["id"]
)

Pattern 6: Surface visualization (raw points)

# Query points in a named grid entry for surface plotting
points = db.query_points(
    filters={
        "eval_id": ["<eval_id>"],
        "base_task": "arithmetic",
        "manifold.id": "csv_tall"
    },
    columns=["params", "center", "completion_tokens_mean"],
    order_by=["id"]
)
# Extract x, y, z from params for 3D plot

Pattern 7: Compression/Hazard analysis (per-sample arrays)

# Filter by facet dimension for per-sample analysis
points = db.query_points(
    filters={
        "eval_id": ["<eval_id>"],
        "base_task": "arithmetic",
        "facets.difficulty": "hard"
    },
    columns=[
        "completion_tokens_list",
        "compressed_sizes_list",
        "outcome_list"
    ]
)

# Expand per-sample arrays in Python
records = []
for _, row in points.iterrows():
    for token, compressed, status in zip(
        row['completion_tokens_list'],
        row['compressed_sizes_list'],
        row['outcome_list']
    ):
        records.append({
            'completion_tokens': token,
            'compressed_size': compressed,
            'outcome': status
        })

Pattern 8: Input-length hazard analysis (prompt_tokens_list)

# prompt_tokens_list enables detecting where input length causes model degradation.
# Especially useful for unbounded evaluations (inf-tables) where prompt size
# varies substantially across the manifold.
points = db.query_points(
    filters={
        "eval_id": ["<eval_id>"],
        "base_task": "tables"
    },
    columns=["params", "prompt_tokens_list", "outcome_list"]
)

for _, row in points.iterrows():
    for prompt_len, outcome in zip(row['prompt_tokens_list'], row['outcome_list']):
        records.append({'prompt_tokens': prompt_len, 'outcome': outcome})

Pattern 9: Discovery — Building sub-queries

combinations = db.unique_values(
    filters={"facets.arch": "moe"},
    columns=["eval_id", "base_task"]
)

for _, row in combinations.iterrows():
    process_scenario(row["eval_id"], row["base_task"])

Implementation Notes

Filter Building

All methods use _build_where_clause(filters) internally:

  • Constructs the WHERE clause dynamically
  • Returns a (where_clause, params) tuple
  • Handles identity, list (groups), and JSON filters
  • JSON column prefixes: params., facets., manifold. route to json_extract_string queries
  • JSON group_by prefixes: same — facets.* and manifold.* are valid group_by keys
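
A simplified sketch of the DNF expansion for the groups[] dimension (this is an illustration of the semantics, not the actual _build_where_clause, which also covers scalar and JSON dimensions):

```python
def groups_clause(value):
    """Expand a groups[] filter value into (sql, params) per the DNF
    rules: single value = membership, flat list = OR, inner list = AND."""
    if not isinstance(value, list):
        return "list_contains(groups, ?)", [value]
    terms, params = [], []
    for item in value:                        # outer list -> OR terms
        if isinstance(item, list):            # inner list -> AND clause
            conj = " AND ".join("list_contains(groups, ?)" for _ in item)
            terms.append(f"({conj})")
            params.extend(item)
        else:
            terms.append("list_contains(groups, ?)")
            params.append(item)
    return "(" + " OR ".join(terms) + ")", params
```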

Wilson Scores are SQL Macros

CREATE MACRO wilson_center(successes, trials) AS ...
CREATE MACRO wilson_margin(successes, trials) AS ...
CREATE MACRO wilson_low(successes, trials) AS
    wilson_center(successes, trials) - wilson_margin(successes, trials);
CREATE MACRO wilson_high(successes, trials) AS
    wilson_center(successes, trials) + wilson_margin(successes, trials);
  • Pure SQL, no Python round-trip
  • Registered at DB-open time and persisted in the file
  • wilson_low/wilson_high enable the CI product multiplication in joint mode
  • Read-only connections use the macros already registered in the file

Computed Column Injection

query_points() with columns=None runs SELECT *, <computed_exprs> where the expressions are built by _build_select_expressions(mode). aggregate() inlines the same formulas directly in SUM context.

This means:

  • All stored columns are returned
  • The 6 computed columns are appended
  • Consumers that don't know the schema changed see nothing different
  • No staleness risk — values always reflect current raw counters
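
As a sketch of the injection mechanism for the simplest mode, E_I (the expression strings and helper name are our assumptions, not the real _build_select_expressions):

```python
def select_sql_E_I() -> str:
    """Build the SELECT for query_points() in mode E_I: all stored
    columns plus the 6 computed expressions (sketch only)."""
    exprs = [
        "correct AS adj_succ",
        "total AS adj_trials",
        "wilson_center(correct, total) AS center",
        "wilson_margin(correct, total) AS margin",
        "CASE WHEN total > 0 THEN invalid / CAST(total AS FLOAT)"
        " ELSE 0.0 END AS invalid_ratio",
        "CASE WHEN total + truncated > 0 THEN"
        " truncated / CAST(total + truncated AS FLOAT)"
        " ELSE 0.0 END AS truncated_ratio",
    ]
    return "SELECT *, " + ", ".join(exprs) + " FROM points"
```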

List Operations are Native

  • list_contains(groups, 'arch:moe') — Single membership
  • list_has_any(groups, ['arch:moe', 'size:mid']) — Multiple membership (OR)
  • list_append(groups, 'new_label') — Append to list
  • list_concat(groups, ['label1', 'label2']) — Append multiple

JSON Operations are Native

  • json_extract_string(params, '$.max_depth') — Extract field
  • json_extract_string(params, '$.length') = '20' — Filter condition
  • json_merge_patch(facets, '{"operations": "sum"}') — Scalar JSON write (used by update_points_json_set)
  • DuckDB optimizes JSON queries well

Summary Table

Config Accessors (require dataset_path=)

Method Returns
evals() {eval_id: eval_dict}
eval_labels() {eval_id: label}
tasks() {task_name: task_dict}
name() dataset display name

Data Methods

Method Purpose Returns
count_points() Count/existence int
bulk_upsert_points() Bulk insert/update int (row count)
update_points_set() Overwrite columns int (row count)
update_points_append() Append to list columns int (row count)
update_points_json_set() Set scalar key in JSON column int (row count)
aggregate() Grouped stats with Wilson CI DataFrame
query_points() Raw points DataFrame
unique_values() Discover unique combinations DataFrame

Testing

The implementation is validated by tests/test_points_db.py covering:

  • Basic CRUD operations
  • Filtering patterns (identity, groups, params)
  • facets.* and manifold.* filter and group_by
  • update_points_json_set scalar JSON write semantics
  • Aggregation with flexible group_by expressions
  • Bulk operations
  • Context manager support
  • Computed column correctness across the statistical modes
  • GREATEST(0, ...) clamp when guess_accum > correct
  • prompt_tokens_list storage and retrieval