PointsDB Architecture¶
PointsDB is ReasonScape's DuckDB-based storage layer for evaluation results. It provides simple, composable primitives for storing and querying points with native support for scalar JSON dimensions and query-time statistical computation.
Design Philosophy¶
Simple, composable primitives over opinionated operations.
- No high-level abstractions that hide what's happening
- Explicit filtering at the database level
- Native DuckDB features (LIST types, JSON queries, SQL macros)
- Separate read vs write operations clearly
- Statistical adjustments computed at query time from raw counters, not stored
Schema¶
The Three-Layer Structure¶
PointsDB organizes evaluation data using a two-plane, three-layer structure. Each plane has a definition layer (how it was designed), a generation layer (actual parameters), and an analysis layer (researcher views).
                 LAYER 1            LAYER 2              LAYER 3
                 (definition)       (generation)         (analysis)
                 ──────────────     ──────────────       ──────────────
TASK PLANE       manifold{}         base_task            views: (config)
                 which grid entry   params{}             query recipes
EVAL PLANE       (none)             model    \           facets{}
                                    template  | eval_id  how to cut models
                                    sampler  /
Dimension Types¶
Identity Dimensions (5D) — Uniquely identify a point on the manifold
- model, template, sampler, base_task, params — these define WHAT was evaluated
- A point is evaluated once; points equal along these 5D are interchangeable
Scalar JSON Dimensions — Researcher-defined analysis views
- manifold{} — Task Layer 1: which named grid entry produced this point (id, target_tokens)
- facets{} — Eval Layer 3: model classification tags decomposed from groups[] at construction time; ["arch:moe", "size:large"] → {"arch": "moe", "size": "large"}
- groups[] — Eval Layer 3: original VARCHAR[] tags; kept as a backwards-compatible read path until deprecated
Task Dimensions — Difficulty knobs (folded into params JSON)
- length, max_depth, prob_de, whitespace (arithmetic); depth, width (jsonpath)
- Task-specific; prefer facets.* or manifold.* for analysis
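Putting the layers together, a single point record might look like this. All values are invented for illustration (reusing names that appear elsewhere in this document); this is a sketch of the shape, not a real dataset entry:

```python
# A hypothetical point, illustrating the dimension types above.
point = {
    # Identity dimensions (5D): define WHAT was evaluated
    "model": "Seed-OSS-36B",
    "template": "zerocot-nosys",
    "sampler": "seedoss-0k",
    "base_task": "arithmetic",
    "params": {"length": 20, "max_depth": 2},  # difficulty knobs folded into params
    # Task Layer 1: which named grid entry produced this point
    "manifold": {"id": "csv_tall", "target_tokens": 4096},
    # Eval Layer 3: facets decomposed from groups[] at construction time
    "groups": ["arch:moe", "size:large"],
    "facets": {"arch": "moe", "size": "large"},
}
```

Note how `facets` is exactly the colon-split decomposition of `groups` — the two carry the same information in scalar and list form.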
Points Table¶
CREATE TABLE points (
-- Primary key
id INTEGER PRIMARY KEY,
-- Hash-based identifier: sha256(model|template|sampler)[:6]
eval_id VARCHAR(6),
-- Identity dimensions (5D unique space)
model VARCHAR NOT NULL,
template VARCHAR NOT NULL,
sampler VARCHAR NOT NULL,
base_task VARCHAR NOT NULL,
params JSON NOT NULL,
-- Reference to original .ndjson step file (for direct inspection)
step VARCHAR,
-- Task Layer 1: which named grid entry produced this point
manifold JSON DEFAULT '{}',
-- Eval Layer 3: model classification tags (scalar JSON, decomposed from groups[] at construction time)
facets JSON DEFAULT '{}',
-- Eval Layer 3: original colon-encoded tags (backwards-compatible read path, deprecated)
groups VARCHAR[],
-- Guess accumulator: Σ(guess_chance) over completed (non-truncated) trials.
-- Enables query-time computation of all three statistical exits.
guess_accum FLOAT NOT NULL DEFAULT 0.0,
-- Raw counts (truncated excluded from total)
correct INTEGER NOT NULL,
invalid INTEGER NOT NULL,
total INTEGER NOT NULL, -- non-truncated sample count
truncated INTEGER NOT NULL, -- count of length-truncated samples
-- Token statistics (pre-aggregated at point level)
prompt_tokens_mean FLOAT,
prompt_tokens_list INTEGER[], -- per-sample prompt token counts
completion_tokens_mean FLOAT,
completion_tokens_correct_mean FLOAT,
completion_tokens_incorrect_mean FLOAT,
total_tokens BIGINT,
-- Per-sample arrays (for compression/hazard analysis)
completion_tokens_list INTEGER[],
compressed_sizes_list INTEGER[],
outcome_list INTEGER[], -- 0=correct, 1=incorrect, 2=truncated, 3=invalid
-- Metadata
evaluated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Uniqueness constraint on identity dimensions
UNIQUE(model, template, sampler, base_task, params)
);
Computed Columns (not stored)¶
The following columns are computed at query time from raw counters. They appear in query_points() and aggregate() output but are not stored in the database.
| Column | Formula | Notes |
|---|---|---|
| adj_succ | mode-dependent (see Statistical Modes) | Statistically adjusted successes |
| adj_trials | mode-dependent (see Statistical Modes) | Statistically adjusted trials |
| center | mode-dependent Wilson score | Point estimate |
| margin | mode-dependent Wilson margin | Half-width of 95% CI |
| invalid_ratio | invalid / total | 0.0 when total = 0 |
| truncated_ratio | truncated / (total + truncated) | 0.0 when both are 0 |
See Statistical Modes for how formulas differ across modes.
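The two mode-independent ratios are simple enough to state directly. A minimal Python sketch of the query-time formulas, including the zero-denominator guards from the table (illustration only — the real computation happens in SQL):

```python
def invalid_ratio(invalid: int, total: int) -> float:
    # invalid / total, defined as 0.0 when total = 0
    return invalid / total if total else 0.0

def truncated_ratio(truncated: int, total: int) -> float:
    # truncated / (total + truncated), defined as 0.0 when both are 0
    denom = total + truncated
    return truncated / denom if denom else 0.0
```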
Key Schema Decisions¶
- No stored adjusted_* — computed at query time from correct, total, truncated, guess_accum; eliminates staleness
- guess_accum — Σ(guess_chance) over non-truncated trials; a single stored value enables all three statistical exits
- prompt_tokens_list — per-sample prompt token counts for input-length hazard analysis (especially relevant for unbounded evaluations where prompt size varies across the manifold)
- Two-tier storage (arrays + means) — arrays for point-level workflows; means for aggregate workflows. Not redundant: unnesting a list per row in SQL is expensive and inappropriate in aggregate()
- No task column — redundant with base_task + params; use step for the source filename
- No fft_mean_list / fft_std_list — FFT is computed on demand from completion_tokens_list
- Scalar JSON for facets/manifold — facets{} and manifold{} are scalar JSON (one value per dimension per point), enabling dotted-key filter and group_by expressions without list explosion. facets{} is decomposed from groups[] at construction time.
- groups[] as VARCHAR[] — kept as a backwards-compatible read path; use facets.* for new consumers. Post-cleanup: drop once all consumers migrate.
- JSON for params — DuckDB JSON support allows path queries
- eval_id VARCHAR(6) — stable hash of the (model, template, sampler) identity triple
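The schema comment for `eval_id` (`sha256(model|template|sampler)[:6]`) suggests a derivation along these lines. The exact encoding and separator are assumptions here — treat this as a sketch, not the canonical implementation:

```python
import hashlib

def make_eval_id(model: str, template: str, sampler: str) -> str:
    # Assumed: UTF-8 encoding, '|' separator, first 6 hex chars of sha256,
    # per the schema comment sha256(model|template|sampler)[:6].
    digest = hashlib.sha256(f"{model}|{template}|{sampler}".encode("utf-8")).hexdigest()
    return digest[:6]
```

The useful property is stability: the same identity triple always maps to the same 6-character id, so it can serve as a compact join key across runs.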
Statistical Modes¶
Both query_points() and aggregate() accept a mode parameter that controls how adjusted statistics are computed. Modes are named by two axes: exit (how truncation is handled) and correction (whether guess probability is subtracted).
| Mode | Exit | Correction | center formula | Default for |
|---|---|---|---|---|
| 'E_I' | Independence | None | Wilson(n_e, n_u) | — |
| 'E_P' | Pessimism | None | Wilson(n_e, n) | — |
| 'E_O' | Optimism | None | Wilson(n_e + n_t, n) | — |
| 'C_I' | Independence | Guess | Wilson(n_e - g, n_u - g) | query_points() |
| 'C_P' | Pessimism | Guess | Wilson(n_e - g, n_u - g) × Wilson(n_u, n) | aggregate() |
| 'C_O' | Optimism | Guess | 1 - Wilson(n_u - n_e, n_u - g) × Wilson(n_u, n) | — |
Notation: n_e = correct, n_u = total (non-truncated), n_t = truncated, n = total + truncated, g = guess_accum
Factored modes (C_P, C_O) compute center as a product of two Wilson intervals evaluated in SQL using the wilson_high/wilson_low macros; margin is (CI_high_product - CI_low_product) / 2.
C_P is the recommended mode for aggregated analysis — it applies pessimistic truncation handling and guess correction together.
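To make the notation concrete, here is a minimal Python sketch of the standard 95% Wilson score interval and the 'C_I' center formula, including the GREATEST(0, ...) clamp mentioned under Testing. This mirrors what the SQL macros compute; it is an illustration, not the production code path:

```python
import math

Z = 1.96  # 95% confidence

def wilson_center(successes: float, trials: float) -> float:
    # Standard Wilson score center; 0.0 for empty cells.
    if trials <= 0:
        return 0.0
    p = successes / trials
    return (p + Z * Z / (2 * trials)) / (1 + Z * Z / trials)

def wilson_margin(successes: float, trials: float) -> float:
    # Half-width of the Wilson interval.
    if trials <= 0:
        return 0.0
    p = successes / trials
    rad = math.sqrt(p * (1 - p) / trials + Z * Z / (4 * trials * trials))
    return (Z * rad) / (1 + Z * Z / trials)

def center_C_I(correct: int, total: int, guess_accum: float) -> float:
    # 'C_I': Wilson(n_e - g, n_u - g), with both quantities clamped at zero
    # so accumulated guess chance can never produce negative counts.
    adj_succ = max(0.0, correct - guess_accum)
    adj_trials = max(0.0, total - guess_accum)
    return wilson_center(adj_succ, adj_trials)
```

Note the characteristic Wilson behavior: the center is pulled toward 0.5 relative to the raw proportion, which keeps small-sample estimates conservative.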
The PointsDB API¶
Instantiation¶
# Preferred: open via dataset config (enables config accessor methods)
db = PointsDB(dataset_path="data/r12.json")
# Direct: open by database path (no config accessor methods)
db = PointsDB(db_path="data/pointsdb-r12.db")
# Write mode (required for evaluate.py; default is read_only=True)
db = PointsDB(dataset_path="data/r12.json", read_only=False)
Read-Only Mode (default):
- Allows multiple concurrent readers without locking
- Prevents accidental modifications during analysis
- Macros are read from the database file (registered at write time)
- Write operations raise errors if attempted
dataset_path= vs db_path=:
- dataset_path= loads the full dataset config, derives the .db path from it, and enables all config accessor methods below
- db_path= connects directly to the database file; config accessor methods raise RuntimeError if called
Config Accessor Methods¶
These methods require dataset_path= at construction; they raise RuntimeError if db_path= was used instead.
# {eval_id: eval_dict} — full eval config including filters, groups, label
evals() -> dict
# {eval_id: label} — convenience shortcut; falls back to eval_id if no label set
eval_labels() -> dict
# {task_name: task_dict} — base task configs with facet definitions
tasks() -> dict
# Dataset display name (from config)
name() -> str
Write Operations¶
# Universal counting/existence check
count_points(filters: dict) -> int
# Bulk insert/update points efficiently
bulk_upsert_points(points: List[dict], replace_filters: dict) -> int
# Overwrite columns matching filter
update_points_set(filters: dict, updates: dict) -> int
# Append to list columns matching filter
update_points_append(filters: dict, appends: dict) -> int
# Set a scalar value inside a JSON column (json_merge_patch semantics)
update_points_json_set(filters: dict, column: str, key: str, value: str) -> int
update_points_json_set merges {key: value} into the named JSON column using json_merge_patch, which preserves all existing keys. Use this to write individual facets.* or manifold.* values without overwriting sibling keys:
# Set facets.operations = "sum_mode_median" for matching points
db.update_points_json_set(
filters={"base_task": "tables", "params": {"op_family": "sum"}},
column="facets",
key="operations",
value="sum_mode_median"
)
Read Operations¶
# Generic aggregation with flexible grouping and Wilson CI computed in SQL
aggregate(filters: dict, group_by: list[str], mode: str = 'C_P') -> pd.DataFrame
# Raw point records with flexible filtering
query_points(filters: dict, columns: list = None,
order_by: list[str] = None, mode: str = 'C_I') -> pd.DataFrame
Both methods return the 6 computed columns (adj_succ, adj_trials, center, margin, invalid_ratio, truncated_ratio) in addition to stored columns. For query_points(), computed columns are included when columns=None (default). For aggregate(), they are always included.
Note on round-trips: If you read points with query_points() and re-insert them via bulk_upsert_points(), exclude the computed columns — they are not stored in the schema.
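A sketch of that round-trip guard — the column names come straight from the computed-columns table above, and the helper is shown over plain dict records for brevity (query_points() actually returns a DataFrame):

```python
# The six query-time columns that must not be written back to the schema.
COMPUTED_COLUMNS = {
    "adj_succ", "adj_trials", "center",
    "margin", "invalid_ratio", "truncated_ratio",
}

def strip_computed(rows: list) -> list:
    # Drop query-time columns before feeding records back into bulk_upsert_points()
    return [{k: v for k, v in row.items() if k not in COMPUTED_COLUMNS} for row in rows]
```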
Helper Operations¶
# Discover unique values across columns (for building sub-queries)
# Note: Automatically explodes list-type columns (groups[]); prefer facets.* for new consumers
unique_values(filters: dict, columns: list[str]) -> pd.DataFrame
Filter Syntax¶
All methods accept a filters dict with uniform DNF (Disjunctive Normal Form) syntax across all dimensions.
Uniform Filter Semantics¶
The same syntax applies to scalar dimensions (model, template, sampler, base_task, eval_id), list dimensions (groups[] — deprecated), and JSON path dimensions (params.*, facets.*, manifold.*):
# Single value - direct match
filters = {"base_task": "arithmetic"} # scalar: base_task == 'arithmetic'
filters = {"groups": "arch:moe"} # list: 'arch:moe' in groups
# Flat list - OR semantics
filters = {"base_task": ["arithmetic", "json"]} # scalar: base_task IN ('arithmetic', 'json')
filters = {"groups": ["arch:moe", "size:mid"]} # list: 'arch:moe' in groups OR 'size:mid' in groups
# Nested list - AND semantics (inner lists become AND clauses)
filters = {"groups": [["arch:moe", "size:mid"]]} # list: 'arch:moe' in groups AND 'size:mid' in groups
# Mixed nested - DNF (outer OR, inner AND)
filters = {
"groups": [["arch:moe", "size:mid"], ["arch:dense"]]
# ('arch:moe' in groups AND 'size:mid' in groups) OR ('arch:dense' in groups)
}
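The normalization rules above can be sketched as a small clause builder. This is a hypothetical helper for illustration only — the real implementation is `_build_where_clause` (see Implementation Notes) — showing how a DNF spec for a list dimension maps onto `list_contains` conditions:

```python
def build_groups_clause(spec):
    # Normalize the three accepted shapes into DNF (a list of AND-groups):
    #   "a"          -> [["a"]]            (single value)
    #   ["a", "b"]   -> [["a"], ["b"]]     (flat list: OR)
    #   [["a", "b"]] -> [["a", "b"]]       (nested list: AND)
    if isinstance(spec, str):
        dnf = [[spec]]
    else:
        dnf = [[g] if isinstance(g, str) else list(g) for g in spec]
    params = []
    or_terms = []
    for and_group in dnf:
        conds = []
        for tag in and_group:
            conds.append("list_contains(groups, ?)")
            params.append(tag)
        or_terms.append("(" + " AND ".join(conds) + ")")
    return " OR ".join(or_terms), params
```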
Examples by Dimension Type¶
Scalar dimensions (identity filters — equality/IN/DNF):
# Single value
filters = {"model": "Seed-OSS-36B"}
# Multiple values (OR)
filters = {
"eval_id": ["<eval_id_1>", "<eval_id_2>"], # eval_id IN (...)
"model": ["Seed-OSS-36B", "MiniMax-M2"],
"base_task": ["arithmetic", "json"]
}
JSON path dimensions (dotted-key equality — always AND across keys):
# facets.* — eval-plane grouping (preferred over groups[])
filters = {"facets.arch": "moe"}
filters = {"facets.arch": "moe", "facets.size": "large"} # both conditions (AND)
# params.*
filters = {"params.max_depth": 2} # json_extract params.max_depth = '2'
filters = {"params.max_depth": 2, "params.length": 20} # both conditions (AND)
# manifold.*
filters = {"manifold.id": "csv_tall"}
filters = {"manifold.id": ["csv_tall", "markdown_wide"]} # OR semantics
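A hypothetical sketch of how one dotted-key condition could translate to SQL — names and signature are invented for illustration; it mirrors the `json_extract params.max_depth = '2'` comment above, where values are compared as strings:

```python
def json_filter_clause(dotted_key: str, value):
    # "facets.arch" -> ("json_extract_string(facets, '$.arch') = ?", ["moe"])
    # Assumption (illustrative): values are stringified for comparison.
    column, key = dotted_key.split(".", 1)
    return f"json_extract_string({column}, '$.{key}') = ?", [str(value)]
```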
List dimensions (groups[] — deprecated, use facets.* for new consumers):
# Legacy membership filter (backwards-compatible)
filters = {"groups": "arch:moe"} # 'arch:moe' in groups
filters = {"groups": [["arch:moe", "size:mid"]]} # 'arch:moe' in groups AND 'size:mid' in groups
Note on facets.* group_by: Points where facets is {} are excluded from the result automatically — the IS NOT NULL guard is implicit. This means group_by=["eval_id", "facets.arch"] only returns rows for points that have that facet key set.
Combining Filters Across Dimensions¶
All top-level filter keys are combined with AND:
filters = {
"eval_id": ["<eval_id_1>", "<eval_id_2>"], # (eval 1 OR eval 2)
"base_task": ["arithmetic", "json"], # AND (arithmetic OR json)
"facets.arch": "moe", # AND facets.arch = 'moe'
"manifold.id": "csv_tall", # AND manifold.id = 'csv_tall'
}
Key Design Notes¶
- Short and CLI-friendly: [a,b] for OR, [[a,b]] for AND
- Uniform: the same syntax for scalars and lists
- Dotted-key for JSON: params.*, facets.*, manifold.* use the same syntax
- No validation: nonsense queries produce empty results, not errors
- DNF expressiveness: supports complex (A AND B) OR (C AND D) patterns when needed
Usage Patterns¶
Pattern 1: Check if scenario data exists¶
# Read-only access for checking (allows concurrent readers)
with PointsDB(db_path="data/dataset.db", read_only=True) as db:
count = db.count_points(
filters={
"model": "Seed-OSS-36B",
"template": "zerocot-nosys",
"sampler": "seedoss-0k"
}
)
if count == 0:
print("No data found, need to run evaluate.py")
Pattern 2: Leaderboard (standard)¶
# C_P is preferred at task level — pessimistic truncation handling with guess correction
df = db.aggregate(
filters={},
group_by=["eval_id", "base_task"],
mode='C_P'
)
Pattern 3: Leaderboard sliced by manifold¶
# Score per named grid entry (e.g., csv_tall vs markdown_wide)
df = db.aggregate(
filters={},
group_by=["eval_id", "manifold.id"],
mode='C_P'
)
Pattern 4: Leaderboard sliced by facet dimension¶
# Score per operation family
df = db.aggregate(
filters={},
group_by=["eval_id", "facets.operations"],
mode='C_P'
)
# Points without facets.operations set are excluded automatically
Pattern 5: Filter by manifold and facets¶
# Points in a specific grid entry, specific operation family
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"manifold.id": "csv_tall",
"facets.operations": "sum_mode_median"
},
columns=["params", "center", "completion_tokens_mean"],
order_by=["id"]
)
Pattern 6: Surface visualization (raw points)¶
# Query points in a named grid entry for surface plotting
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"base_task": "arithmetic",
"manifold.id": "csv_tall"
},
columns=["params", "center", "completion_tokens_mean"],
order_by=["id"]
)
# Extract x, y, z from params for 3D plot
Pattern 7: Compression/Hazard analysis (per-sample arrays)¶
# Filter by facet dimension for per-sample analysis
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"base_task": "arithmetic",
"facets.difficulty": "hard"
},
columns=[
"completion_tokens_list",
"compressed_sizes_list",
"outcome_list"
]
)
# Expand per-sample arrays in Python
records = []
for _, row in points.iterrows():
for token, compressed, status in zip(
row['completion_tokens_list'],
row['compressed_sizes_list'],
row['outcome_list']
):
records.append({
'completion_tokens': token,
'compressed_size': compressed,
'outcome': status
})
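If you prefer to stay in pandas, the same expansion can be written with DataFrame.explode over the three list columns (multi-column explode requires pandas >= 1.3; this works because the per-sample arrays in a point all have equal length). The data below is invented for illustration:

```python
import pandas as pd

# Hypothetical per-sample arrays for two points (invented values)
points = pd.DataFrame({
    "completion_tokens_list": [[120, 340], [95]],
    "compressed_sizes_list": [[60, 150], [48]],
    "outcome_list": [[0, 1], [0]],
})

# explode() flattens the per-sample arrays into one row per sample,
# keeping the three columns aligned — equivalent to the manual loop.
samples = points.explode(
    ["completion_tokens_list", "compressed_sizes_list", "outcome_list"],
    ignore_index=True,
)
```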
Pattern 8: Input-length hazard analysis (prompt_tokens_list)¶
# prompt_tokens_list enables detecting where input length causes model degradation.
# Especially useful for unbounded evaluations (inf-tables) where prompt size
# varies substantially across the manifold.
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"base_task": "tables"
},
columns=["params", "prompt_tokens_list", "outcome_list"]
)
records = []
for _, row in points.iterrows():
for prompt_len, outcome in zip(row['prompt_tokens_list'], row['outcome_list']):
records.append({'prompt_tokens': prompt_len, 'outcome': outcome})
Pattern 9: Discovery — Building sub-queries¶
combinations = db.unique_values(
filters={"facets.arch": "moe"},
columns=["eval_id", "base_task"]
)
for _, row in combinations.iterrows():
process_scenario(row["eval_id"], row["base_task"])
Implementation Notes¶
Filter Building¶
All methods use _build_where_clause(filters) internally:
- Constructs WHERE clause dynamically
- Returns (where_clause, params) tuple
- Handles identity, list (groups), and JSON filters
- JSON column prefixes: params., facets., manifold. route to json_extract_string queries
- JSON group_by prefixes: same — facets.* and manifold.* are valid group_by keys
Wilson Scores are SQL Macros¶
CREATE MACRO wilson_center(successes, trials) AS ...
CREATE MACRO wilson_margin(successes, trials) AS ...
CREATE MACRO wilson_low(successes, trials) AS
wilson_center(successes, trials) - wilson_margin(successes, trials);
CREATE MACRO wilson_high(successes, trials) AS
wilson_center(successes, trials) + wilson_margin(successes, trials);
- Pure SQL, no Python round-trip
- Registered at DB-open time and persisted in the file
- wilson_low / wilson_high enable the CI product multiplication in the factored modes (C_P, C_O)
- Read-only connections use the macros already registered in the file
Computed Column Injection¶
query_points() with columns=None runs SELECT *, <computed_exprs> where the expressions are built by _build_select_expressions(mode). aggregate() inlines the same formulas directly in SUM context.
This means:
- All stored columns are returned
- The 6 computed columns are appended
- Consumers that don't know the schema changed see nothing different
- No staleness risk — values always reflect current raw counters
List Operations are Native¶
- list_contains(groups, 'arch:moe') — single membership
- list_has_any(groups, ['arch:moe', 'size:mid']) — multiple membership (OR)
- list_append(groups, 'new_label') — append to list
- list_concat(groups, ['label1', 'label2']) — append multiple
JSON Operations are Native¶
- json_extract_string(params, '$.max_depth') — extract field
- json_extract_string(params, '$.length') = '20' — filter condition
- json_merge_patch(facets, '{"operations": "sum"}') — scalar JSON write (used by update_points_json_set)
- DuckDB optimizes JSON queries well
Summary Table¶
Config Accessors (require dataset_path=)¶
| Method | Returns |
|---|---|
| evals() | {eval_id: eval_dict} |
| eval_labels() | {eval_id: label} |
| tasks() | {task_name: task_dict} |
| name() | dataset display name |
Data Methods¶
| Method | Purpose | Returns |
|---|---|---|
| count_points() | Count/existence | int |
| bulk_upsert_points() | Bulk insert/update | int (row count) |
| update_points_set() | Overwrite columns | int (row count) |
| update_points_append() | Append to list columns | int (row count) |
| update_points_json_set() | Set scalar key in JSON column | int (row count) |
| aggregate() | Grouped stats with Wilson CI | DataFrame |
| query_points() | Raw points | DataFrame |
| unique_values() | Discover unique combinations | DataFrame |
Testing¶
The implementation is validated by tests/test_points_db.py covering:
- Basic CRUD operations
- Filtering patterns (identity, groups, params)
- facets.* and manifold.* filter and group_by
- update_points_json_set scalar JSON write semantics
- Aggregation with flexible group_by expressions
- Bulk operations
- Context manager support
- Computed column correctness across all three statistical modes
- GREATEST(0, ...) clamp when guess_accum > correct
- prompt_tokens_list storage and retrieval