PointsDB Architecture¶
PointsDB is ReasonScape's DuckDB-based storage layer for evaluation results. It provides simple, composable primitives for storing and querying points with native support for scalar JSON dimensions and query-time statistical computation.
Design Philosophy¶
Simple, composable primitives over opinionated operations.
- No high-level abstractions that hide what's happening
- Explicit filtering at the database level
- Native DuckDB features (LIST types, JSON queries, SQL macros)
- Separate read vs write operations clearly
- Statistical adjustments computed at query time from raw counters, not stored
Schema¶
The Three-Layer Structure¶
PointsDB organizes evaluation data using a two-plane, three-layer structure. Each plane has a definition layer (how it was designed), a generation layer (actual parameters), and an analysis layer (researcher views).
                 LAYER 1            LAYER 2              LAYER 3
                 (definition)       (generation)         (analysis)
                 ──────────────     ──────────────       ──────────────
TASK PLANE       manifold{}         base_task            views: (config)
                 which grid entry   params{}             query recipes
EVAL PLANE       (none)             model    \           facets{}
                                    template  | eval_id  how to cut models
                                    sampler  /
Dimension Types¶
Identity Dimensions (5D) — Uniquely identify a point on the manifold
- model, template, sampler, base_task, params — these define WHAT was evaluated
- A point is evaluated once; points equal along these 5D are interchangeable
Scalar JSON Dimensions — Researcher-defined analysis views
- manifold{} — Task Layer 1: which named grid entry produced this point (id, target_tokens)
- facets{} — Eval Layer 3: model classification tags decomposed from groups[] at construction time; ["arch:moe", "size:large"] → {"arch": "moe", "size": "large"}
- groups[] — Eval Layer 3: original VARCHAR[] tags; kept as a backwards-compatible read path until deprecated
Task Dimensions — Difficulty knobs (folded into params JSON)
- length, max_depth, prob_de, whitespace (arithmetic); depth, width (jsonpath)
- Task-specific; prefer facets.* or manifold.* for analysis
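Putting the layers together, a single point record might look like this. All values are invented for illustration (reusing names that appear elsewhere in this document); this is a sketch of the shape, not a real dataset entry:

```python
# A hypothetical point, illustrating the dimension types above.
point = {
    # Identity dimensions (5D): define WHAT was evaluated
    "model": "Seed-OSS-36B",
    "template": "zerocot-nosys",
    "sampler": "seedoss-0k",
    "base_task": "arithmetic",
    "params": {"length": 20, "max_depth": 2},  # difficulty knobs folded into params
    # Task Layer 1: which named grid entry produced this point
    "manifold": {"id": "csv_tall", "target_tokens": 4096},
    # Eval Layer 3: facets decomposed from groups[] at construction time
    "groups": ["arch:moe", "size:large"],
    "facets": {"arch": "moe", "size": "large"},
}
```

Note how `facets` is exactly the colon-split decomposition of `groups` — the two carry the same information in scalar and list form.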
Points Table¶
CREATE TABLE points (
-- Primary key
id INTEGER PRIMARY KEY,
-- Hash-based identifier: sha256(model|template|sampler)[:6]
eval_id VARCHAR(6),
-- Identity dimensions (5D unique space)
model VARCHAR NOT NULL,
template VARCHAR NOT NULL,
sampler VARCHAR NOT NULL,
base_task VARCHAR NOT NULL,
params JSON NOT NULL,
-- Reference to original .ndjson step file (for direct inspection)
step VARCHAR,
-- Task Layer 1: which named grid entry produced this point
manifold JSON DEFAULT '{}',
-- Eval Layer 3: model classification tags (scalar JSON, decomposed from groups[] at construction time)
facets JSON DEFAULT '{}',
-- Eval Layer 3: original colon-encoded tags (backwards-compatible read path, deprecated)
groups VARCHAR[],
-- Guess accumulator: Σ(guess_chance) over completed (non-truncated) trials.
-- Enables query-time computation of all three statistical exits.
guess_accum FLOAT NOT NULL DEFAULT 0.0,
-- Raw counts (truncated excluded from total)
correct INTEGER NOT NULL,
invalid INTEGER NOT NULL,
total INTEGER NOT NULL, -- non-truncated sample count
truncated INTEGER NOT NULL, -- count of length-truncated samples
-- Token statistics (pre-aggregated at point level)
prompt_tokens_mean FLOAT,
prompt_tokens_list INTEGER[], -- per-sample prompt token counts
completion_tokens_mean FLOAT,
completion_tokens_correct_mean FLOAT,
completion_tokens_incorrect_mean FLOAT,
total_tokens BIGINT,
-- Per-sample arrays (for compression/hazard analysis)
completion_tokens_list INTEGER[],
compressed_sizes_list INTEGER[],
outcome_list INTEGER[], -- 0=correct, 1=incorrect, 2=truncated, 3=invalid
-- Metadata
evaluated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Uniqueness constraint on identity dimensions
UNIQUE(model, template, sampler, base_task, params)
);
Computed Columns (not stored)¶
The following columns are computed at query time from raw counters. They appear in query_points() and aggregate() output but are not stored in the database.
| Column | Formula | Notes |
|---|---|---|
| adj_succ | mode-dependent (see Statistical Modes) | Statistically adjusted successes |
| adj_trials | mode-dependent (see Statistical Modes) | Statistically adjusted trials |
| center | mode-dependent Wilson score | Point estimate |
| margin | mode-dependent Wilson margin | Half-width of 95% CI |
| invalid_ratio | invalid / total | 0.0 when total = 0 |
| truncated_ratio | truncated / (total + truncated) | 0.0 when both are 0 |
See Statistical Modes for how formulas differ across modes.
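The two mode-independent ratios are simple enough to state directly. A minimal Python sketch of the query-time formulas, including the zero-denominator guards from the table (illustration only — the real computation happens in SQL):

```python
def invalid_ratio(invalid: int, total: int) -> float:
    # invalid / total, defined as 0.0 when total = 0
    return invalid / total if total else 0.0

def truncated_ratio(truncated: int, total: int) -> float:
    # truncated / (total + truncated), defined as 0.0 when both are 0
    denom = total + truncated
    return truncated / denom if denom else 0.0
```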
Key Schema Decisions¶
- No stored adjusted_* — computed at query time from correct, total, truncated, guess_accum; eliminates staleness
- guess_accum — Σ(guess_chance) over non-truncated trials; a single stored value enables all three statistical exits
- prompt_tokens_list — per-sample prompt token counts for input-length hazard analysis (especially relevant for unbounded evaluations where prompt size varies across the manifold)
- Two-tier storage (arrays + means) — arrays for point-level workflows; means for aggregate workflows. Not redundant: unnesting a list per row in SQL is expensive and inappropriate in aggregate()
- No task column — redundant with base_task + params; use step for the source filename
- No fft_mean_list / fft_std_list — FFT is computed on demand from completion_tokens_list
- Scalar JSON for facets/manifold — facets{} and manifold{} are scalar JSON (one value per dimension per point), enabling dotted-key filter and group_by expressions without list explosion. facets{} is decomposed from groups[] at construction time.
- groups[] as VARCHAR[] — kept as a backwards-compatible read path; use facets.* for new consumers. Post-cleanup: drop once all consumers migrate.
- JSON for params — DuckDB JSON support allows path queries
- eval_id VARCHAR(6) — stable hash of the (model, template, sampler) identity triple
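The schema comment for `eval_id` (`sha256(model|template|sampler)[:6]`) suggests a derivation along these lines. The exact encoding and separator are assumptions here — treat this as a sketch, not the canonical implementation:

```python
import hashlib

def make_eval_id(model: str, template: str, sampler: str) -> str:
    # Assumed: UTF-8 encoding, '|' separator, first 6 hex chars of sha256,
    # per the schema comment sha256(model|template|sampler)[:6].
    digest = hashlib.sha256(f"{model}|{template}|{sampler}".encode("utf-8")).hexdigest()
    return digest[:6]
```

The useful property is stability: the same identity triple always maps to the same 6-character id, so it can serve as a compact join key across runs.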
Statistical Modes¶
Both query_points() and aggregate() accept a mode parameter that controls how adjusted statistics are computed. Modes are named by two axes: exit (how truncation is handled) and correction (whether guess probability is subtracted).
| Mode | Exit | Correction | center formula | Default for |
|---|---|---|---|---|
| 'E_I' | Independence | None | Wilson(n_e, n_u) | — |
| 'E_P' | Pessimism | None | Wilson(n_e, n) | — |
| 'E_O' | Optimism | None | Wilson(n_e + n_t, n) | — |
| 'C_I' | Independence | Guess | Wilson(n_e - g, n_u - g) | query_points() |
| 'C_P' | Pessimism | Guess | Wilson(n_e - g, n_u - g) × Wilson(n_u, n) | aggregate() |
| 'C_O' | Optimism | Guess | 1 - Wilson(n_u - n_e, n_u - g) × Wilson(n_u, n) | — |
Notation: n_e = correct, n_u = total (non-truncated), n_t = truncated, n = total + truncated, g = guess_accum
Factored modes (C_P, C_O) compute center as a product of two Wilson intervals evaluated in SQL using the wilson_high/wilson_low macros; margin is (CI_high_product - CI_low_product) / 2.
C_P is the recommended mode for aggregated analysis — it applies pessimistic truncation handling and guess correction together.
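To make the notation concrete, here is a minimal Python sketch of the standard 95% Wilson score interval and the 'C_I' center formula, including the GREATEST(0, ...) clamp mentioned under Testing. This mirrors what the SQL macros compute; it is an illustration, not the production code path:

```python
import math

Z = 1.96  # 95% confidence

def wilson_center(successes: float, trials: float) -> float:
    # Standard Wilson score center; 0.0 for empty cells.
    if trials <= 0:
        return 0.0
    p = successes / trials
    return (p + Z * Z / (2 * trials)) / (1 + Z * Z / trials)

def wilson_margin(successes: float, trials: float) -> float:
    # Half-width of the Wilson interval.
    if trials <= 0:
        return 0.0
    p = successes / trials
    rad = math.sqrt(p * (1 - p) / trials + Z * Z / (4 * trials * trials))
    return (Z * rad) / (1 + Z * Z / trials)

def center_C_I(correct: int, total: int, guess_accum: float) -> float:
    # 'C_I': Wilson(n_e - g, n_u - g), with both quantities clamped at zero
    # so accumulated guess chance can never produce negative counts.
    adj_succ = max(0.0, correct - guess_accum)
    adj_trials = max(0.0, total - guess_accum)
    return wilson_center(adj_succ, adj_trials)
```

Note the characteristic Wilson behavior: the center is pulled toward 0.5 relative to the raw proportion, which keeps small-sample estimates conservative.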
The PointsDB API¶
Instantiation¶
# Preferred: open via dataset config (enables config accessor methods)
db = PointsDB(dataset_path="data/r12.json")
# Direct: open by database path (no config accessor methods)
db = PointsDB(db_path="data/pointsdb-r12.db")
# Write mode (required for evaluate.py; default is read_only=True)
db = PointsDB(dataset_path="data/r12.json", read_only=False)
Read-Only Mode (default):
- Allows multiple concurrent readers without locking
- Prevents accidental modifications during analysis
- Macros are read from the database file (registered at write time)
- Write operations raise errors if attempted
dataset_path= vs db_path=:
- dataset_path= loads the full dataset config, derives the .db path from it, and enables all config accessor methods below
- db_path= connects directly to the database file; config accessor methods raise RuntimeError if called
Config Accessor Methods¶
These methods require dataset_path= at construction; they raise RuntimeError if db_path= was used instead.
# {eval_id: eval_dict} — full eval config including filters, groups, label
evals() -> dict
# {eval_id: label} — convenience shortcut; falls back to eval_id if no label set
eval_labels() -> dict
# {task_name: task_dict} — base task configs with facet definitions
tasks() -> dict
# Dataset display name (from config)
name() -> str
Write Operations¶
# Universal counting/existence check
count_points(filters: dict) -> int
# Bulk insert/update points efficiently
bulk_upsert_points(points: List[dict], replace_filters: dict) -> int
# Overwrite columns matching filter
update_points_set(filters: dict, updates: dict) -> int
# Append to list columns matching filter
update_points_append(filters: dict, appends: dict) -> int
# Set a scalar value inside a JSON column (json_merge_patch semantics)
update_points_json_set(filters: dict, column: str, key: str, value: str) -> int
update_points_json_set merges {key: value} into the named JSON column using json_merge_patch, which preserves all existing keys. Use this to write individual facets.* or manifold.* values without overwriting sibling keys:
# Set facets.operations = "sum_mode_median" for matching points
db.update_points_json_set(
filters={"base_task": "tables", "params": {"op_family": "sum"}},
column="facets",
key="operations",
value="sum_mode_median"
)
Read Operations¶
# Generic aggregation with flexible grouping and Wilson CI computed in SQL
aggregate(filters: dict, group_by: list[str], mode: str = 'C_P') -> pd.DataFrame
# Raw point records with flexible filtering
query_points(filters: dict, columns: list = None,
order_by: list[str] = None, mode: str = 'C_I') -> pd.DataFrame
Both methods return the 6 computed columns (adj_succ, adj_trials, center, margin, invalid_ratio, truncated_ratio) in addition to stored columns. For query_points(), computed columns are included when columns=None (default). For aggregate(), they are always included.
Note on round-trips: If you read points with query_points() and re-insert them via bulk_upsert_points(), exclude the computed columns — they are not stored in the schema.
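A sketch of that round-trip guard — the column names come straight from the computed-columns table above, and the helper is shown over plain dict records for brevity (query_points() actually returns a DataFrame):

```python
# The six query-time columns that must not be written back to the schema.
COMPUTED_COLUMNS = {
    "adj_succ", "adj_trials", "center",
    "margin", "invalid_ratio", "truncated_ratio",
}

def strip_computed(rows: list) -> list:
    # Drop query-time columns before feeding records back into bulk_upsert_points()
    return [{k: v for k, v in row.items() if k not in COMPUTED_COLUMNS} for row in rows]
```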
Helper Operations¶
# Discover unique values across columns (for building sub-queries)
# Note: Automatically explodes list-type columns (groups[]); prefer facets.* for new consumers
unique_values(filters: dict, columns: list[str]) -> pd.DataFrame
Filter Syntax¶
All methods accept a filters dict with uniform DNF (Disjunctive Normal Form) syntax across all dimensions.
Uniform Filter Semantics¶
The same syntax applies to scalar dimensions (model, template, sampler, base_task, eval_id), list dimensions (groups[] — deprecated), and JSON path dimensions (params.*, facets.*, manifold.*):
# Single value - direct match
filters = {"base_task": "arithmetic"} # scalar: base_task == 'arithmetic'
filters = {"groups": "arch:moe"} # list: 'arch:moe' in groups
# Flat list - OR semantics
filters = {"base_task": ["arithmetic", "json"]} # scalar: base_task IN ('arithmetic', 'json')
filters = {"groups": ["arch:moe", "size:mid"]} # list: 'arch:moe' in groups OR 'size:mid' in groups
# Nested list - AND semantics (inner lists become AND clauses)
filters = {"groups": [["arch:moe", "size:mid"]]} # list: 'arch:moe' in groups AND 'size:mid' in groups
# Mixed nested - DNF (outer OR, inner AND)
filters = {
"groups": [["arch:moe", "size:mid"], ["arch:dense"]]
# ('arch:moe' in groups AND 'size:mid' in groups) OR ('arch:dense' in groups)
}
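The normalization rules above can be sketched as a small clause builder. This is a hypothetical helper for illustration only — the real implementation is `_build_where_clause` (see Implementation Notes) — showing how a DNF spec for a list dimension maps onto `list_contains` conditions:

```python
def build_groups_clause(spec):
    # Normalize the three accepted shapes into DNF (a list of AND-groups):
    #   "a"          -> [["a"]]            (single value)
    #   ["a", "b"]   -> [["a"], ["b"]]     (flat list: OR)
    #   [["a", "b"]] -> [["a", "b"]]       (nested list: AND)
    if isinstance(spec, str):
        dnf = [[spec]]
    else:
        dnf = [[g] if isinstance(g, str) else list(g) for g in spec]
    params = []
    or_terms = []
    for and_group in dnf:
        conds = []
        for tag in and_group:
            conds.append("list_contains(groups, ?)")
            params.append(tag)
        or_terms.append("(" + " AND ".join(conds) + ")")
    return " OR ".join(or_terms), params
```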
Examples by Dimension Type¶
Scalar dimensions (identity filters — equality/IN/DNF):
# Single value
filters = {"model": "Seed-OSS-36B"}
# Multiple values (OR)
filters = {
"eval_id": ["<eval_id_1>", "<eval_id_2>"], # eval_id IN (...)
"model": ["Seed-OSS-36B", "MiniMax-M2"],
"base_task": ["arithmetic", "json"]
}
JSON path dimensions (dotted-key equality — always AND across keys):
# facets.* — eval-plane grouping (preferred over groups[])
filters = {"facets.arch": "moe"}
filters = {"facets.arch": "moe", "facets.size": "large"} # both conditions (AND)
# params.*
filters = {"params.max_depth": 2} # json_extract params.max_depth = '2'
filters = {"params.max_depth": 2, "params.length": 20} # both conditions (AND)
# manifold.*
filters = {"manifold.id": "csv_tall"}
filters = {"manifold.id": ["csv_tall", "markdown_wide"]} # OR semantics
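A hypothetical sketch of how one dotted-key condition could translate to SQL — names and signature are invented for illustration; it mirrors the `json_extract params.max_depth = '2'` comment above, where values are compared as strings:

```python
def json_filter_clause(dotted_key: str, value):
    # "facets.arch" -> ("json_extract_string(facets, '$.arch') = ?", ["moe"])
    # Assumption (illustrative): values are stringified for comparison.
    column, key = dotted_key.split(".", 1)
    return f"json_extract_string({column}, '$.{key}') = ?", [str(value)]
```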
List dimensions (groups[] — deprecated, use facets.* for new consumers):
# Legacy membership filter (backwards-compatible)
filters = {"groups": "arch:moe"} # 'arch:moe' in groups
filters = {"groups": [["arch:moe", "size:mid"]]} # 'arch:moe' in groups AND 'size:mid' in groups
Note on facets.* group_by: Points where facets is {} are excluded from the result automatically — the IS NOT NULL guard is implicit. This means group_by=["eval_id", "facets.arch"] only returns rows for points that have that facet key set.
Combining Filters Across Dimensions¶
All top-level filter keys are combined with AND:
filters = {
"eval_id": ["<eval_id_1>", "<eval_id_2>"], # (eval 1 OR eval 2)
"base_task": ["arithmetic", "json"], # AND (arithmetic OR json)
"facets.arch": "moe", # AND facets.arch = 'moe'
"manifold.id": "csv_tall", # AND manifold.id = 'csv_tall'
}
Key Design Notes¶
- Short and CLI-friendly: [a,b] for OR, [[a,b]] for AND
- Uniform: the same syntax for scalars and lists
- Dotted-key for JSON: params.*, facets.*, manifold.* use the same syntax
- No validation: nonsense queries produce empty results, not errors
- DNF expressiveness: supports complex (A AND B) OR (C AND D) patterns when needed
Usage Patterns¶
Pattern 1: Check if scenario data exists¶
# Read-only access for checking (allows concurrent readers)
with PointsDB(db_path="data/dataset.db", read_only=True) as db:
count = db.count_points(
filters={
"model": "Seed-OSS-36B",
"template": "zerocot-nosys",
"sampler": "seedoss-0k"
}
)
if count == 0:
print("No data found, need to run evaluate.py")
Pattern 2: Leaderboard (standard)¶
# C_P is preferred at task level — pessimistic truncation handling with guess correction
df = db.aggregate(
filters={},
group_by=["eval_id", "base_task"],
mode='C_P'
)
Pattern 3: Leaderboard sliced by manifold¶
# Score per named grid entry (e.g., csv_tall vs markdown_wide)
df = db.aggregate(
filters={},
group_by=["eval_id", "manifold.id"],
mode='C_P'
)
Pattern 4: Leaderboard sliced by facet dimension¶
# Score per operation family
df = db.aggregate(
filters={},
group_by=["eval_id", "facets.operations"],
mode='C_P'
)
# Points without facets.operations set are excluded automatically
Pattern 5: Filter by manifold and facets¶
# Points in a specific grid entry, specific operation family
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"manifold.id": "csv_tall",
"facets.operations": "sum_mode_median"
},
columns=["params", "center", "completion_tokens_mean"],
order_by=["id"]
)
Pattern 6: Surface visualization (raw points)¶
# Query points in a named grid entry for surface plotting
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"base_task": "arithmetic",
"manifold.id": "csv_tall"
},
columns=["params", "center", "completion_tokens_mean"],
order_by=["id"]
)
# Extract x, y, z from params for 3D plot
Pattern 7: Compression/Hazard analysis (per-sample arrays)¶
# Filter by facet dimension for per-sample analysis
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"base_task": "arithmetic",
"facets.difficulty": "hard"
},
columns=[
"completion_tokens_list",
"compressed_sizes_list",
"outcome_list"
]
)
# Expand per-sample arrays in Python
records = []
for _, row in points.iterrows():
for token, compressed, status in zip(
row['completion_tokens_list'],
row['compressed_sizes_list'],
row['outcome_list']
):
records.append({
'completion_tokens': token,
'compressed_size': compressed,
'outcome': status
})
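If you prefer to stay in pandas, the same expansion can be written with DataFrame.explode over the three list columns (multi-column explode requires pandas >= 1.3; this works because the per-sample arrays in a point all have equal length). The data below is invented for illustration:

```python
import pandas as pd

# Hypothetical per-sample arrays for two points (invented values)
points = pd.DataFrame({
    "completion_tokens_list": [[120, 340], [95]],
    "compressed_sizes_list": [[60, 150], [48]],
    "outcome_list": [[0, 1], [0]],
})

# explode() flattens the per-sample arrays into one row per sample,
# keeping the three columns aligned — equivalent to the manual loop.
samples = points.explode(
    ["completion_tokens_list", "compressed_sizes_list", "outcome_list"],
    ignore_index=True,
)
```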
Pattern 8: Input-length hazard analysis (prompt_tokens_list)¶
# prompt_tokens_list enables detecting where input length causes model degradation.
# Especially useful for unbounded evaluations (inf-tables) where prompt size
# varies substantially across the manifold.
points = db.query_points(
filters={
"eval_id": ["<eval_id>"],
"base_task": "tables"
},
columns=["params", "prompt_tokens_list", "outcome_list"]
)
records = []
for _, row in points.iterrows():
for prompt_len, outcome in zip(row['prompt_tokens_list'], row['outcome_list']):
records.append({'prompt_tokens': prompt_len, 'outcome': outcome})
Pattern 9: Discovery — Building sub-queries¶
combinations = db.unique_values(
filters={"facets.arch": "moe"},
columns=["eval_id", "base_task"]
)
for _, row in combinations.iterrows():
process_scenario(row["eval_id"], row["base_task"])
Implementation Notes¶
Filter Building¶
All methods use _build_where_clause(filters) internally:
- Constructs WHERE clause dynamically
- Returns (where_clause, params) tuple
- Handles identity, list (groups), and JSON filters
- JSON column prefixes: params., facets., manifold. route to json_extract_string queries
- JSON group_by prefixes: same — facets.* and manifold.* are valid group_by keys
Wilson Scores are SQL Macros¶
CREATE MACRO wilson_center(successes, trials) AS ...
CREATE MACRO wilson_margin(successes, trials) AS ...
CREATE MACRO wilson_low(successes, trials) AS
wilson_center(successes, trials) - wilson_margin(successes, trials);
CREATE MACRO wilson_high(successes, trials) AS
wilson_center(successes, trials) + wilson_margin(successes, trials);
- Pure SQL, no Python round-trip
- Registered at DB-open time and persisted in the file
- wilson_low / wilson_high enable the CI product multiplication in the factored modes (C_P, C_O)
- Read-only connections use the macros already registered in the file
Computed Column Injection¶
query_points() with columns=None runs SELECT *, <computed_exprs> where the expressions are built by _build_select_expressions(mode). aggregate() inlines the same formulas directly in SUM context.
This means:
- All stored columns are returned
- The 6 computed columns are appended
- Consumers that don't know the schema changed see nothing different
- No staleness risk — values always reflect current raw counters
List Operations are Native¶
- list_contains(groups, 'arch:moe') — single membership
- list_has_any(groups, ['arch:moe', 'size:mid']) — multiple membership (OR)
- list_append(groups, 'new_label') — append to list
- list_concat(groups, ['label1', 'label2']) — append multiple
JSON Operations are Native¶
- json_extract_string(params, '$.max_depth') — extract field
- json_extract_string(params, '$.length') = '20' — filter condition
- json_merge_patch(facets, '{"operations": "sum"}') — scalar JSON write (used by update_points_json_set)
- DuckDB optimizes JSON queries well
Summary Table¶
Config Accessors (require dataset_path=)¶
| Method | Returns |
|---|---|
| evals() | {eval_id: eval_dict} |
| eval_labels() | {eval_id: label} |
| tasks() | {task_name: task_dict} |
| name() | dataset display name |
Data Methods¶
| Method | Purpose | Returns |
|---|---|---|
| count_points() | Count/existence | int |
| bulk_upsert_points() | Bulk insert/update | int (row count) |
| update_points_set() | Overwrite columns | int (row count) |
| update_points_append() | Append to list columns | int (row count) |
| update_points_json_set() | Set scalar key in JSON column | int (row count) |
| aggregate() | Grouped stats with Wilson CI | DataFrame |
| query_points() | Raw points | DataFrame |
| unique_values() | Discover unique combinations | DataFrame |
Testing¶
The implementation is validated by tests/test_points_db.py covering:
- Basic CRUD operations
- Filtering patterns (identity, groups, params)
- facets.* and manifold.* filter and group_by
- update_points_json_set scalar JSON write semantics
- Aggregation with flexible group_by expressions
- Bulk operations
- Context manager support
- Computed column correctness across all three statistical modes
- GREATEST(0, ...) clamp when guess_accum > correct
- prompt_tokens_list storage and retrieval