PointsDB Architecture¶
PointsDB is ReasonScape's DuckDB-based storage layer that replaces the legacy JSON bucket system with a proper database abstraction. It provides simple, composable primitives for storing and querying evaluation results with native support for multi-valued dimensions.
Design Philosophy¶
Simple, composable primitives over opinionated operations.
- No high-level abstractions that hide what's happening
- Explicit filtering at the database level
- Native DuckDB features (LIST types, JSON queries, SQL macros)
- Separate read vs write operations clearly
Schema¶
The Two-Plane Structure¶
PointsDB organizes evaluation data using a two-plane structure where each point exists in both an Evaluation Plane and a Task Complexity Plane. Each plane has identity dimensions (what defines a point) and facet dimensions (how to organize points).
| | EVALUATION | TASK-COMPLEXITY |
|---|---|---|
| IDENTITY (5D) | `model`, `template`, `sampler` | `base_task`, `params` |
| FACETS | `eval_id`, `groups[]` | `tiers[]`, `surfaces[]`, `projections[]` |
Dimension Types¶
Identity Dimensions (5D) - Uniquely identify a point on the manifold
- `model`, `template`, `sampler`, `base_task`, `params`
- These define WHAT was evaluated
- A point is evaluated once; points equal along these 5D are interchangeable
Facet Dimensions - Multi-valued organizational views
- `tiers[]` - Difficulty levels (e.g., "easy", "medium", "hard")
- `surfaces[]` - 2D visualization slices (computed via param filter intersection)
- `projections[]` - 1D analysis sweeps (computed via param filter intersection)
- `groups[]` - Model categories (e.g., "arch:moe", "size:large")
- Points can belong to multiple facet values!
- Useful for aggregation, filtering, and visualization
Task Dimensions - Difficulty knobs (folded into params JSON)
- `length`, `max_depth`, `prob_dewhitespace` (arithmetic); `depth`, `width` (jsonpath)
- Task-specific and complex to query; prefer using facets instead.
Points Table¶
CREATE TABLE points (
-- Primary key
id INTEGER PRIMARY KEY,
-- Index of eval that this point came from
eval_id INTEGER,
-- Identity dimensions (5D unique space)
model VARCHAR NOT NULL,
template VARCHAR NOT NULL,
sampler VARCHAR NOT NULL,
base_task VARCHAR NOT NULL,
params JSON NOT NULL,
-- Full task name (for reference)
task VARCHAR NOT NULL,
-- Facet dimensions (stored as native LISTs)
tiers VARCHAR[],
surfaces VARCHAR[],
projections VARCHAR[],
groups VARCHAR[],
-- Evaluation results (immutable per unique point)
adjusted_successes FLOAT NOT NULL,
adjusted_trials FLOAT NOT NULL,
adjusted_center FLOAT NOT NULL,
adjusted_margin FLOAT NOT NULL,
-- Raw counts
correct INTEGER NOT NULL,
invalid INTEGER NOT NULL,
total INTEGER NOT NULL,
truncated INTEGER NOT NULL,
hard_terminated INTEGER NOT NULL,
-- Derived ratios
invalid_ratio FLOAT,
truncated_ratio FLOAT,
-- Token statistics (pre-aggregated at point level)
prompt_tokens_mean FLOAT,
completion_tokens_mean FLOAT,
completion_tokens_correct_mean FLOAT,
completion_tokens_incorrect_mean FLOAT,
total_tokens BIGINT,
-- Per-sample arrays (for compression/hazard analysis)
completion_tokens_list INTEGER[],
compressed_sizes_list INTEGER[],
answer_status_list INTEGER[],
-- FFT data (optional, for spectral analysis)
fft_mean_list FLOAT[],
fft_std_list FLOAT[],
-- Metadata
evaluated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
-- Uniqueness constraint on identity dimensions
UNIQUE(model, template, sampler, base_task, params)
);
Key Schema Decisions¶
- No 'precision' column - Runner artifact that only affects sample count
- No 'degrees'/'densities' columns - Runner artifacts; mapped to semantic `tiers[]` during evaluation
- Facet dimensions as VARCHAR[] - Native lists enable proper membership queries
- JSON for params - DuckDB JSON support allows path queries
- eval_id column - Index into config.evals[] for display metadata
The PointsDB API (7 Methods)¶
Write Operations (4 methods)¶
# Universal counting/existence check (with optional explosion)
count_points(filters: dict, explode: list[str] = None) -> int
# Bulk insert/update points efficiently
bulk_upsert_points(points: List[dict], replace_filters: dict) -> int
# Overwrite columns matching filter
update_points_set(filters: dict, updates: dict) -> int
# Append to list columns matching filter
update_points_append(filters: dict, appends: dict) -> int
Read Operations (2 methods)¶
# Generic aggregation with flexible grouping
aggregate(filters: dict, group_by: list[str], explode: list[str] = None) -> pd.DataFrame
# Raw point records with flexible filtering and optional list explosion
query_points(filters: dict, columns: list = None, explode: list[str] = None) -> pd.DataFrame
Helper Operations (1 method)¶
# Discover unique values across columns (for building sub-queries)
# Note: Automatically explodes list-type columns (tiers, surfaces, projections, groups)
unique_values(filters: dict, columns: list[str]) -> pd.DataFrame
Filter Syntax¶
All methods accept a filters dict with uniform DNF (Disjunctive Normal Form) syntax across all dimensions.
Uniform Filter Semantics¶
The same syntax applies to both scalar dimensions (model, template, sampler, base_task, task, eval_id) and set dimensions (tiers, surfaces, projections, groups):
# Single value - direct match
filters = {"base_task": "arithmetic"} # scalar: base_task == 'arithmetic'
filters = {"groups": "arch:moe"} # set: 'arch:moe' in groups
# Flat list - OR semantics
filters = {"base_task": ["arithmetic", "json"]} # scalar: base_task IN ('arithmetic', 'json')
filters = {"groups": ["arch:moe", "size:mid"]} # set: 'arch:moe' in groups OR 'size:mid' in groups
# Nested list - AND semantics (inner lists become AND clauses)
filters = {"groups": [["arch:moe", "size:mid"]]} # set: 'arch:moe' in groups AND 'size:mid' in groups
# Mixed nested - DNF (outer OR, inner AND)
filters = {
"groups": [["arch:moe", "size:mid"], ["arch:dense"]]
# ('arch:moe' in groups AND 'size:mid' in groups) OR ('arch:dense' in groups)
}
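The three shapes above can be made concrete with a small normalization helper (hypothetical; shown only to illustrate the syntax rules):

```python
def to_dnf(value):
    """Normalize a filter value into DNF: a list of AND-clauses (inner
    lists) combined with OR. Purely illustrative of the shape rules."""
    if not isinstance(value, list):
        return [[value]]  # single value -> one one-term clause
    return [v if isinstance(v, list) else [v] for v in value]

# Single value - direct match
assert to_dnf("arch:moe") == [["arch:moe"]]
# Flat list - OR of single-term clauses
assert to_dnf(["arch:moe", "size:mid"]) == [["arch:moe"], ["size:mid"]]
# Nested list - one AND clause
assert to_dnf([["arch:moe", "size:mid"]]) == [["arch:moe", "size:mid"]]
```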
Examples by Dimension Type¶
Scalar dimensions (identity filters - equality/IN/DNF):
# Single value
filters = {"model": "Seed-OSS-36B"}
# Multiple values (OR)
filters = {
"eval_id": [0, 1, 2], # eval_id IN (0, 1, 2)
"model": ["Seed-OSS-36B", "MiniMax-M2"], # model IN (...)
"base_task": ["arithmetic", "json"] # base_task IN (...)
}
# DNF (rarely useful for scalars - AND on scalars is usually nonsensical)
filters = {
"eval_id": [[0], [1]] # (eval_id=0) OR (eval_id=1) - equivalent to [0, 1]
}
Set dimensions (facet filters - list membership with OR/AND/DNF):
# Single value
filters = {"tiers": "easy"} # 'easy' in tiers
# Multiple values (OR - point has ANY of these)
filters = {
"tiers": ["easy", "medium"], # 'easy' in tiers OR 'medium' in tiers
"groups": ["arch:moe", "size:mid"], # 'arch:moe' in groups OR 'size:mid' in groups
"surfaces": ["arithmetic_easy", "arithmetic_hard"] # membership in either surface
}
# AND semantics (point has ALL of these)
filters = {
"groups": [["arch:moe", "size:mid"]] # 'arch:moe' in groups AND 'size:mid' in groups
}
# DNF (complex boolean logic)
filters = {
"groups": [
["arch:moe", "size:mid"], # ('arch:moe' in groups AND 'size:mid' in groups)
["arch:dense"] # OR ('arch:dense' in groups)
]
}
filters = {
"surfaces": [
["arithmetic_easy", "surface_a"], # ('easy' in surfaces AND 'a' in surfaces)
["arithmetic_hard"] # OR ('hard' in surfaces)
]
}
Param filters (JSON queries - always AND):
filters = {
"params": {"max_depth": 2} # json_extract_string(params, '$.max_depth') = '2'
}
filters = {
"params": {"max_depth": 2, "length": 20} # max_depth = 2 AND length = 20
}
Combining Filters Across Dimensions¶
All top-level filter keys are combined with AND:
filters = {
"eval_id": [0, 1], # (eval 0 OR eval 1)
"base_task": ["arithmetic", "json"], # AND (arithmetic OR json)
"groups": [["arch:moe", "size:mid"]], # AND (has moe AND has mid)
"tiers": ["easy"] # AND (has easy tier)
}
# Equivalent SQL WHERE clause:
# WHERE (eval_id IN (0, 1))
# AND (base_task IN ('arithmetic', 'json'))
# AND (list_contains(groups, 'arch:moe') AND list_contains(groups, 'size:mid'))
# AND (list_contains(tiers, 'easy'))
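For set dimensions, the DNF shape maps mechanically onto `list_contains` conditions. A minimal sketch (hypothetical helper; the real `_build_where_clause` also handles scalar and params filters, and binds values as query parameters rather than interpolating strings):

```python
def facet_where(column, dnf_clauses):
    """Render DNF clauses for a set dimension as a SQL fragment.
    Illustrative only: no escaping or parameter binding."""
    clauses = []
    for and_terms in dnf_clauses:
        parts = [f"list_contains({column}, '{term}')" for term in and_terms]
        clauses.append("(" + " AND ".join(parts) + ")")
    return "(" + " OR ".join(clauses) + ")"

sql = facet_where("groups", [["arch:moe", "size:mid"], ["arch:dense"]])
# -> ((list_contains(groups, 'arch:moe') AND list_contains(groups, 'size:mid'))
#     OR (list_contains(groups, 'arch:dense')))   [on one line]
```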
Key Design Notes¶
- Short and CLI-friendly: `[a,b]` for OR, `[[a,b]]` for AND
- Uniform: Same syntax for scalars and sets
- No validation: Nonsense queries (e.g., `base_task: [['a','b']]`) produce empty results, not errors
- DNF expressiveness: Supports complex `(A AND B) OR (C AND D)` patterns when needed
Data Flow Architecture¶
1. Aggregate Processing Layer (evaluate.py)¶
Purpose: Process raw interview results and write to database
De-duplication:
- Points are de-duplicated on 5D identity (model, template, sampler, base_task, params)
- Multiple buckets with same identity but different degree/density are merged
- Tier mapping: (degree, density) pairs mapped to semantic tier labels (e.g., "easy", "medium", "hard")
- Uses `bulk_upsert_points()` with a delete-first approach per scenario
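The de-duplication step can be pictured as grouping buckets by their 5D key. A simplified sketch, assuming dict-shaped buckets (the real merge also recomputes Wilson statistics and token means):

```python
import json

def identity_key(bucket):
    """5D identity; params serialized canonically so equal dicts compare equal."""
    return (bucket["model"], bucket["template"], bucket["sampler"],
            bucket["base_task"], json.dumps(bucket["params"], sort_keys=True))

def merge_buckets(buckets):
    """Merge buckets sharing a 5D identity by summing raw counts."""
    merged = {}
    for b in buckets:
        key = identity_key(b)
        if key not in merged:
            merged[key] = dict(b)
        else:
            for field in ("correct", "invalid", "total"):
                merged[key][field] += b[field]
    return list(merged.values())
```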
Key Operations:
# Dataset mode (recommended) - processes all evals from config
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
# Interview mode (legacy) - for ad-hoc evaluation without database
python evaluate.py \
--interview "data/m12x/*Seed-OSS*/*" \
--tokenizer "seedai/Seed-OSS-36B" \
--output buckets.json
What gets written:
- All identity dimensions (model, template, sampler, base_task, params)
- Computed tiers[] from (degree, density) mapping via dataset config
- Wilson CI statistics (adjusted_successes, adjusted_trials, center, margin)
- Token statistics (means and totals)
- Optional: compression data (completion_tokens_list, compressed_sizes_list, answer_status_list)
- Optional: FFT data (fft_mean_list, fft_std_list)
- Initially empty: surfaces[], projections[], groups[] (populated during post-processing)
2. Enrich Layer (evaluate.py --dataset mode)¶
Purpose: Orchestrate evaluation and update group/surface memberships
Operations:
- Discover tasks - Parse config.evals[] to find missing evaluations
- Process evals - Process each eval into database with parallel bucket processing
- Update groups[] - Apply scenario metadata to points matching eval filters
- Update surfaces[] - Apply surface labels to points matching task param filters
Key Operations:
# Process all missing evals and update tags with 16 parallel workers
python evaluate.py --dataset data/dataset-m12x.json --parallel 16
Group Update Pattern:
# Step 1: Clear all groups
db.update_points_set(filters={}, updates={"groups": []})
# Step 2: For each eval in config.evals[]
for eval_id, eval_info in enumerate(config['evals']):
filters = eval_info['filters'] # e.g., {"model": "X", "template": "Y", "sampler": "Z"}
groups = eval_info['groups'] # e.g., ["arch:moe", "size:mid"]
db.update_points_append(filters=filters, appends={"groups": groups})
Surface Update Pattern:
# Step 1: Clear all surfaces
db.update_points_set(filters={}, updates={"surfaces": []})
# Step 2: For each surface in config.basetasks[task]['surfaces']
for surface in task_config['surfaces']:
filters = {
"base_task": base_task,
"params": surface['filter'] # e.g., {"max_depth": 2}
}
db.update_points_append(filters=filters, appends={"surfaces": [surface['id']]})
Projection Update Pattern:
# Step 1: Clear all projections
db.update_points_set(filters={}, updates={"projections": []})
# Step 2: For each projection in config.basetasks[task]['projections']
for projection in task_config['projections']:
projection_id = projection['id'] # e.g., "arithmetic_length"
filters = {
"base_task": base_task,
"params": projection.get('filter', {}) # Optional additional param constraints
}
db.update_points_append(filters=filters, appends={"projections": [projection_id]})
3. ReasonScore Layer (src/scores.py)¶
Purpose: Load evaluation data and aggregate into ranked leaderboard format
Key Function:
def aggregate_leaderboard(
db_path: str,
config: Dict,
filters: Optional[Dict] = None
)
What it does:
- Uses `PointsDB.aggregate()` with `explode=['tiers']` to get one row per (eval_id, tier, base_task)
- Adds eval_label from config.evals[]
- Computes adjusted_score = center + margin - truncated_ratio
- Aggregates into leaderboard structure with tier and task details
- Applies fair-sort ranking algorithm and assigns ranks
Returns:
List of dicts, one per eval (in rank order):
{
'rank': int,
'model': str (eval_label),
'reasonscore': float (overall reasonscore across all tiers),
'tiers': {
<tier_label>: {
'reasonscore': float,
'any_incomplete': bool,
'tasks': {
<base_task>: {
'center', 'margin', 'truncated_ratio',
'completion_tokens_mean', 'adjusted_score',
'expected_points', 'is_incomplete', ...
}
}
}
}
}
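A consumer can walk this structure directly. The values below are made up, but the field names follow the schema above:

```python
# Hypothetical leaderboard entry (shape per the return schema; values invented)
leaderboard = [{
    "rank": 1,
    "model": "Seed-OSS-36B",  # eval_label
    "reasonscore": 0.74,
    "tiers": {
        "easy": {
            "reasonscore": 0.82,
            "any_incomplete": False,
            "tasks": {
                "arithmetic": {"center": 0.85, "margin": 0.03,
                               "truncated_ratio": 0.0, "adjusted_score": 0.88},
            },
        },
    },
}]

# Render one line per eval, with per-tier scores appended
for entry in leaderboard:
    line = f"#{entry['rank']} {entry['model']}: {entry['reasonscore']:.2f}"
    for tier, tier_data in entry["tiers"].items():
        line += f" | {tier}={tier_data['reasonscore']:.2f}"
    print(line)
# #1 Seed-OSS-36B: 0.74 | easy=0.82
```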
Usage Patterns¶
Pattern 1: Check if scenario data exists¶
# Count unique points (no explosion)
count = db.count_points(
filters={
"model": "Seed-OSS-36B",
"template": "zerocot-nosys",
"sampler": "seedoss-0k"
}
)
if count == 0:
run_evaluate_py()
# Count exploded rows (e.g., total easy tier occurrences)
easy_count = db.count_points(
filters={"tiers": ["easy"]},
explode=["tiers"]
)
# This counts "how many times 'easy' tier appears across all points"
# A point with tiers=['easy','medium'] contributes 1 to this count
Pattern 2: Update group memberships¶
# Step 1: Clear all groups for a base_task
db.update_points_set(
filters={"base_task": "arithmetic"},
updates={"groups": []}
)
# Step 2: Append group labels based on scenario identity
db.update_points_append(
filters={"model": "Seed-OSS-36B", "template": "zerocot-nosys", "sampler": "seedoss-0k"},
appends={"groups": ["arch:moe", "size:mid"]}
)
Pattern 3: Update surface and projection memberships¶
# Step 1: Clear all surfaces for arithmetic task
db.update_points_set(
filters={"base_task": "arithmetic"},
updates={"surfaces": []}
)
# Step 2: Append surface labels based on param filters
db.update_points_append(
filters={"base_task": "arithmetic", "params": {"max_depth": 1}},
appends={"surfaces": ["arithmetic_easy"]}
)
db.update_points_append(
filters={"base_task": "arithmetic", "params": {"max_depth": 3}},
appends={"surfaces": ["arithmetic_hard"]}
)
# Step 3: Update projections for FFT analysis
db.update_points_set(
filters={"base_task": "arithmetic"},
updates={"projections": []}
)
# Projection 0: Length sweep (no additional filter)
db.update_points_append(
filters={"base_task": "arithmetic"},
appends={"projections": ["arithmetic_projection_0"]}
)
# Projection 1: Depth sweep (filter to specific params)
db.update_points_append(
filters={"base_task": "arithmetic", "params": {"length": 20}},
appends={"projections": ["arithmetic_projection_1"]}
)
Pattern 4: Leaderboard (multi-tier)¶
# Explode tiers to get one row per tier
df = db.aggregate(
filters={},
group_by=["eval_id", "base_task", "tier"],
explode=["tiers"]
)
# Returns one row per (eval_id, base_task, tier) combination
# Post-process: Look up eval_id in config.evals[] for labels
# tier column already has semantic labels ("easy", "medium", "hard")
Pattern 5: Leaderboard (single tier filter)¶
# Filter by tier, no explosion
df = db.aggregate(
filters={
"groups": ["arch:moe", "size:mid"], # Scenario group filters
"tiers": ["easy"]
},
group_by=["eval_id", "base_task"]
)
# Returns one row per (eval_id, base_task) combination
# Post-process: pivot by base_task to create leaderboard table
Pattern 6: Surface visualization (raw points)¶
points = db.query_points(
filters={
"eval_id": 0,
"base_task": "arithmetic",
"surfaces": ["arithmetic_easy"]
},
columns=["params", "adjusted_center", "completion_tokens_mean"]
)
# Extract x, y, z from params for 3D plot
Pattern 7: Compression/Hazard analysis (per-sample arrays with explosion)¶
# Get points with exploded tiers for compression analysis
points = db.query_points(
filters={
"eval_id": 0,
"base_task": "arithmetic"
},
columns=[
"completion_tokens_list",
"compressed_sizes_list",
"answer_status_list",
"tiers",
"tier" # Scalar column added by explosion
],
explode=["tiers"] # Explodes tiers list into one row per tier
)
# Result: Each point is duplicated for each tier it belongs to
# Row 1: point_id=1, tiers=['easy','medium'], tier='easy', ...
# Row 2: point_id=1, tiers=['easy','medium'], tier='medium', ...
# Row 3: point_id=2, tiers=['medium','hard'], tier='medium', ...
# Row 4: point_id=2, tiers=['medium','hard'], tier='hard', ...
# Now expand per-sample arrays into individual records in Python
records = []
for _, row in points.iterrows():
for token, compressed, status in zip(
row['completion_tokens_list'],
row['compressed_sizes_list'],
row['answer_status_list']
):
records.append({
'tier': row['tier'], # Use the exploded scalar value
'completion_tokens': token,
'compressed_size': compressed,
'answer_status': status
})
Pattern 8: Discovery - Building sub-queries¶
# Discover all unique combinations of eval_id and base_task
combinations = db.unique_values(
filters={
"groups": ["arch:moe"],
"tiers": ["easy"]
},
columns=["eval_id", "base_task"]
)
# Returns DataFrame with unique (eval_id, base_task) combinations
# Use these combinations to build sub-queries
for _, row in combinations.iterrows():
process_scenario(row["eval_id"], row["base_task"])
Pattern 9: FFT Analysis with Projections¶
# Get points belonging to specific projection for FFT analysis
points = db.query_points(
filters={
"eval_id": 0,
"base_task": "arithmetic",
"projections": ["arithmetic_projection_0"] # Filter to specific projection
},
columns=[
"params",
"fft_mean_list",
"fft_std_list",
"adjusted_center"
]
)
# Extract parameter value for grouping (e.g., length)
# Use projection config to know which param is the axis
axis_param = "length" # From projection config
for _, row in points.iterrows():
param_val = row['params'][axis_param]
fft_mean = row['fft_mean_list']
fft_std = row['fft_std_list']
# Plot FFT spectrum for this parameter value
Implementation Notes¶
Filter Building¶
All methods use _build_where_clause(filters) internally:
- Constructs WHERE clause dynamically
- Returns (where_clause, params) tuple
- Handles identity, group (lists), and JSON filters
Wilson Scores are SQL Macros¶
CREATE MACRO wilson_center(successes, trials) AS ...
CREATE MACRO wilson_margin(successes, trials) AS ...
- Pure SQL, no Python round-trip
- Used directly in `aggregate()` queries
- Composable in complex queries
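The macros implement the standard Wilson score interval. A Python sketch of the same math, assuming z = 1.96 for a 95% interval (the actual z used by the macros may differ):

```python
import math

def wilson(successes, trials, z=1.96):
    """Wilson score interval center and margin (sketch of the SQL macros)."""
    if trials == 0:
        return 0.0, 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    margin = (z * math.sqrt(p * (1 - p) / trials
                            + z * z / (4 * trials * trials))) / denom
    return center, margin

center, margin = wilson(80, 100)
# center is pulled toward 0.5 relative to the raw 0.80 success ratio
```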
List Operations are Native¶
- `list_contains(tiers, 'easy')` - Single membership
- `list_has_any(tiers, ['easy', 'medium'])` - Multiple membership (OR)
- `list_append(surfaces, 'new_label')` - Append to list
- `list_concat(surfaces, ['label1', 'label2'])` - Append multiple
JSON Operations are Native¶
- `json_extract_string(params, '$.max_depth')` - Extract field
- `json_extract_string(params, '$.length') = '20'` - Filter condition
- DuckDB optimizes JSON queries well
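A params filter expands to AND-combined JSON conditions. Sketched below (hypothetical helper; the real implementation binds values as query parameters instead of interpolating strings):

```python
def params_where(param_filters):
    """Render a params dict as AND-combined json_extract_string conditions."""
    return " AND ".join(
        f"json_extract_string(params, '$.{key}') = '{value}'"
        for key, value in param_filters.items()
    )

sql = params_where({"max_depth": 2, "length": 20})
# json_extract_string(params, '$.max_depth') = '2'
#   AND json_extract_string(params, '$.length') = '20'
```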
Explode for Multi-Value Dimensions¶
Both aggregate() and query_points() support the explode parameter for handling list-valued facet dimensions.
Supported explosions:
- explode=["tiers"] → adds scalar tier column
- explode=["surfaces"] → adds scalar surface column
- explode=["projections"] → adds scalar projection column
- explode=["groups"] → adds scalar group column
In aggregate() - Creates grouped statistics per exploded value:
# Without explode: filters force single tier
df = db.aggregate(
filters={"tiers": ["easy"]},
group_by=["eval_id", "base_task"]
)
# Returns one row per (eval_id, base_task)
# With explode: get all tiers
df = db.aggregate(
filters={},
group_by=["eval_id", "base_task", "tier"],
explode=["tiers"]
)
# Returns one row per (eval_id, base_task, tier)
In query_points() - Duplicates points for per-sample analysis:
# Without explode: points may belong to multiple tiers
df = db.query_points(
filters={"eval_id": 0},
columns=["tiers", "completion_tokens_list"]
)
# Row 1: tiers=['easy','medium'], completion_tokens_list=[45, 52, ...]
# Row 2: tiers=['medium','hard'], completion_tokens_list=[78, 91, ...]
# With explode: one row per tier per point
df = db.query_points(
filters={"eval_id": 0},
columns=["tiers", "tier", "completion_tokens_list"],
explode=["tiers"]
)
# Row 1: tiers=['easy','medium'], tier='easy', completion_tokens_list=[45, 52, ...]
# Row 2: tiers=['easy','medium'], tier='medium', completion_tokens_list=[45, 52, ...]
# Row 3: tiers=['medium','hard'], tier='medium', completion_tokens_list=[78, 91, ...]
# Row 4: tiers=['medium','hard'], tier='hard', completion_tokens_list=[78, 91, ...]
Why explode in query_points()? Essential for compression/hazard analysis where you need to: 1. Explode list dimensions (tiers) at the database level for efficiency 2. Then expand per-sample arrays (completion_tokens_list) at the Python level 3. Use the exploded scalar value (tier) to group samples correctly
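The database-level explode is equivalent to this pure-Python expansion (shown only to pin down the row-duplication semantics):

```python
def explode_rows(rows, list_col, scalar_col):
    """Duplicate each row once per element of list_col, adding the element
    under scalar_col (mirrors a SQL UNNEST over the list column)."""
    out = []
    for row in rows:
        for value in row[list_col]:
            out.append({**row, scalar_col: value})
    return out

points = [
    {"id": 1, "tiers": ["easy", "medium"]},
    {"id": 2, "tiers": ["medium", "hard"]},
]
exploded = explode_rows(points, "tiers", "tier")
# 4 rows: (1,'easy'), (1,'medium'), (2,'medium'), (2,'hard')
```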
Summary Table¶
| Method | Purpose | Explode Support | Returns |
|---|---|---|---|
| `count_points()` | Count/existence | Yes (NEW) | int |
| `bulk_upsert_points()` | Bulk insert/update | N/A (write op) | int (row count) |
| `update_points_set()` | Overwrite columns | N/A (write op) | int (row count) |
| `update_points_append()` | Append to lists | N/A (write op) | int (row count) |
| `aggregate()` | Grouped stats | Yes | DataFrame |
| `query_points()` | Raw points | Yes (NEW) | DataFrame |
| `unique_values()` | Discover unique combinations | Always (implicit) | DataFrame |
Testing¶
The implementation can be validated via the unit tests in tests/test_points_db.py, which cover:
- Basic CRUD operations
- Filtering patterns (identity, groups, params)
- Aggregation with and without explode
- Bulk operations
- Context manager support