Datasets: Organization, Processing, and Research¶
Overview¶
ReasonScape datasets represent research-objective sets — curated collections of model evaluations designed to answer specific questions. Datasets bridge the gap between raw evaluation data (Stage 2) and processed analysis results (Stages 4-5).
Cohorts: Re-usable Raw Data¶
What is a Cohort?¶
A cohort is a model-level directory containing:
- evals.json - Evaluation definitions (which tests to run, how to filter results)
- Result folders - Raw NDJSON result files from runner.py
Cohorts represent the raw evaluation outputs before processing. Multiple cohorts can be combined into a single dataset for comparison.
A cohort stores all variations of the evaluations performed on a particular model: quantization, samplers, templates, pruning, etc.
Cohort Structure¶
data/r12/ModelName/
├── evals.json # Evaluation definitions (Stage 2)
├── 2025-01-19-timestamp-1/ # Raw NDJSON from runner.py
│ ├── result-steps.ndjson # Individual test results
│ └── metadata.json # Evaluation metadata
├── 2025-01-19-timestamp-2/
│ └── result-steps.ndjson
└── ...
Example paths:
- data/r12/Qwen3-32B/
- data/r12/GPT-OSS-120B/
- data/r12/Seed-OSS-36B/
- data/r12/Llama-Nemotron-Super-49B/
The evals.json Format¶
Each cohort contains an evals.json file that defines which evaluation variants exist and how to identify them.
[
{
"evaluate": {
"glob": "data/r12/ModelName/*model-slug*template*sampler*/*"
},
"filters": {
"model": "Model-Version-String",
"template": "zeroshot-nosys",
"sampler": "greedy-max"
},
"label": "Human-readable description",
"groups": ["family:seedoss", "arch:dense", "size:large", "quant:fp16"],
"tags": ["leaderboard"],
"hf_id": "organization/model-name"
}
]
Fields explained:
- evaluate.glob - Pattern to find raw result directories matching this evaluation
- filters - Metadata identifying this evaluation (model, template, sampler)
- label - Human-readable description for UI/reports
- groups - Organizational tags (family:, arch:, size:, quant:, etc.)
- tags - Special tags (e.g., "leaderboard" to include in main reference)
- hf_id - Hugging Face model ID for reference
Live example from Qwen3-32B:
[
{
"evaluate": {
"glob": "data/r12/Qwen3-32B/*_Qwen3-32B-fp16-16k_zeroshot-nosys_*/*"
},
"filters": {
"model": "Qwen3-32B-fp16-16k",
"template": "zeroshot-nosys",
"sampler": "qwen3-think-max"
},
"label": "Qwen3-32B (FP16-16k) Thinking",
"groups": ["family:qwen3", "arch:dense", "size:large", "quant:fp16", "ctx:16k"],
"hf_id": "Qwen/Qwen3-32B"
}
]
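As a rough illustration of how evaluate.glob and filters are consumed, the following sketch (plain Python, standard library only) loads a cohort's evals.json, expands each glob pattern, and reports how many raw result files each evaluation matches. The cohort path is just an example, and the real pipeline in evaluate.py does considerably more than this.
import glob
import json

# Example cohort path; substitute any cohort directory you have pulled locally.
cohort_path = "data/r12/Qwen3-32B/evals.json"

with open(cohort_path) as f:
    entries = json.load(f)

for entry in entries:
    # Expand evaluate.glob to locate the raw NDJSON result files for this evaluation.
    matches = glob.glob(entry["evaluate"]["glob"])
    filters = entry["filters"]
    print(f'{entry["label"]}: {len(matches)} result files')
    print(f'  model={filters["model"]} template={filters["template"]} sampler={filters["sampler"]}')
    print(f'  groups={entry.get("groups", [])} tags={entry.get("tags", [])}')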
Listing/Searching all Cohorts¶
Use the cohort.py tool to discover and search all cohorts in the repository:
# List all cohorts with their evaluation counts
python cohort.py list
# Search for specific cohorts by name
python cohort.py list --search qwen
# Get detailed information about a cohort
python cohort.py info data/r12/Qwen3-32B
The cohort tool also supports context processing and postprocessing operations. For complete reference, see tools.md#cohortpy
Datasets: Post-processed Research Sets¶
What is a Dataset?¶
A dataset is a data/<name>.json configuration file that:
- References a Stage 1 config (manifold definitions, tasks)
- Lists Stage 2 cohorts to include
- Specifies output database location for Stage 3 processing
- Defines cache directory for intermediate files
Datasets are research-objective groupings. A single cohort can be included in multiple datasets.
Cohort-Dataset Relationship¶
Understanding the many-to-many relationship between cohorts and datasets is key to ReasonScape's architecture:
Key Concepts:
- 1 Cohort = 1 Model (all quantization variants, samplers, templates for that model)
- 1 Dataset = N Cohorts (multiple models combined to answer one research question)
- 1 Cohort → M Datasets (a single cohort can feed into multiple research studies)
Example Structure:
Cohorts (raw evaluation data):
├── data/r12/Qwen3-32B/ → feeds into: r12
├── data/r12/GPT-OSS-120B/ → feeds into: r12
└── data/r12/Seed-OSS-36B/ → feeds into: r12
Datasets (research configurations):
└── data/r12.json [imports all r12 cohorts]
This design enables:
- Reuse: Run evaluations once, analyze in multiple contexts
- Comparison: Combine different cohorts to answer specific questions
- Flexibility: Add new datasets without re-running evaluations
The 100GB+ Problem: Storage Architecture¶
ReasonScape datasets contain over 100GB of raw evaluation data (NDJSON result files). Storing this in Git would be impractical, so we separate concerns:
Git tracks (small):
- evals.json - Cohort metadata (which evaluations exist)
- data/*.json - Dataset configurations (which cohorts to import)
- data/manifest.json - Blob registry mapping eval_ids to content-addressed storage
- configs/*.yaml - Experiment configurations
data.py manages (large):
- data/r12/*/2026-*/ - Raw NDJSON result folders (~100GB uncompressed)
- data/pointsdb-*.db - Processed databases (<1GB each)
- data/.blobs/ - Content-addressed blob cache
This separation allows researchers to work with processed databases (~2GB) without downloading 100GB+ of raw data, while still maintaining the ability to access raw traces when needed for deep investigation.
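The blob side of this split follows the usual content-addressed storage pattern: a file's cryptographic hash doubles as its storage key, so identical blobs deduplicate naturally and the Git-tracked manifest only needs to record small mappings from evaluations to hashes. The sketch below shows that general idea with SHA-256; it is illustrative only and does not reproduce data.py's actual manifest schema or blob layout.
import hashlib
import json
from pathlib import Path

def blob_key(path: Path) -> str:
    """Hash a file's contents; the digest can serve as its content-addressed key."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Illustrative mapping of raw result files to content-addressed keys,
# roughly the kind of small registry a manifest can track in Git.
registry = {}
for ndjson in Path("data/r12").glob("*/*/result-steps.ndjson"):
    registry[str(ndjson)] = blob_key(ndjson)

print(json.dumps(registry, indent=2))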
Dataset Configuration Format¶
{
"name": "r12",
"db": "pointsdb-r12.db",
"cache": "r12/.cache",
"config": "../configs/r12.yaml",
"cohorts": [
{
"path": "r12/MyModel/evals.json",
"filters": {
"tags": "leaderboard"
}
},
{
"path": "r12/OtherModel/evals.json",
"filters": {
"sampler": "greedy-max"
}
}
]
}
Fields explained:
- name - Unique dataset identifier (used in commands, filenames)
- db - Path to output DuckDB database (created by evaluate.py)
- cache - Directory for intermediate caches
- config - Path to Stage 1 config (defines manifolds and views)
- cohorts - List of Stage 2 cohorts to include in this dataset
The config field typically points to ../configs/r12.yaml, which defines task manifolds and analysis views. For complete configuration reference, see config.md.
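To make the cohort-to-dataset wiring concrete, here is a rough sketch of how such a configuration could be resolved: load the dataset file, follow each cohort path to its evals.json, and keep the entries whose metadata matches the cohort's filter spec. Paths are assumed to be relative to data/ (as in the example above), and the tag/field matching shown is an illustration rather than the exact logic inside evaluate.py.
import json
from pathlib import Path

data_dir = Path("data")

with open(data_dir / "r12.json") as f:
    dataset = json.load(f)

selected = []
for cohort in dataset["cohorts"]:
    # Cohort paths in the config are assumed relative to data/.
    evals_path = data_dir / cohort["path"]
    with open(evals_path) as f:
        entries = json.load(f)

    for entry in entries:
        # Keep entries whose tags and filter fields match the cohort's filter spec.
        wanted = cohort.get("filters", {})
        tag_ok = "tags" not in wanted or wanted["tags"] in entry.get("tags", [])
        field_ok = all(
            entry["filters"].get(k) == v
            for k, v in wanted.items() if k != "tags"
        )
        if tag_ok and field_ok:
            selected.append(entry["label"])

print(f"{dataset['name']}: {len(selected)} evaluations selected")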
Download Data and Pre-processed DB: data.py Tool¶
After cloning the repository, cohort folders will be empty (just evals.json). Use data.py to pull evaluation data:
# Fast path: Pull just the database for analysis (<1GB)
python data.py pull dataset data/r12.json
# Ready for: analyze.py, leaderboard.py, spiderweb.py
# Selective: Pull specific cohorts for research (150MB each)
python data.py pull cohort data/r12/Qwen3-32B
python data.py pull cohort data/r12/Seed-OSS-36B
# Ready for: probe.py, re-evaluation
When to use each approach:
| Use Case | Command | Disk Space | Time |
|---|---|---|---|
| Analyzing existing results | pull dataset | ~2GB | ~5 min |
| Probing raw traces for specific models | pull cohort | ~150MB per cohort | ~2 min |
| Limited disk space | pull dataset | Minimal | Fast |
For complete reference, see: tools.md#datapy
Process Data into DB: evaluate.py Tool¶
When you run:
python evaluate.py --dataset data/r12.json
The pipeline does:
- Read Stage 2 cohorts
    - Find all matching NDJSON files from runner.py
    - Load raw step data (unaggregated test results)
- Extract Stage 1 context
    - Load config (../configs/r12.yaml) to get manifold definitions and view recipes
    - Extract base_task and params dimensions
- Compute facets
    - groups[] - From evals.json (family:, arch:, size:, quant:, etc.)
- Aggregate points
    - Apply excess accuracy correction
    - Track truncations separately
    - Compute Wilson confidence intervals (a minimal sketch follows this list)
- Pre-compute compression/FFT/hazard data
- Write PointsDB
    - DuckDB database at the db path from the dataset configuration
    - Indexed by (eval_id, base_task, params)
    - Supports fast filtering and aggregation
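For reference, the Wilson score interval mentioned above is a standard confidence interval for a binomial proportion (here, accuracy at a point) that behaves better than the normal approximation at small sample counts and extreme accuracies. Below is a minimal, self-contained sketch; the exact parameters (confidence level, any additional correction) used by evaluate.py may differ.
import math

def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95% confidence)."""
    if total == 0:
        return (0.0, 1.0)
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (max(0.0, center - margin), min(1.0, center + margin))

# Example: 37 correct answers out of 48 samples at a single manifold point.
low, high = wilson_interval(37, 48)
print(f"accuracy 95% CI: [{low:.3f}, {high:.3f}]")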
Discovery: What Evaluations are in a Dataset¶
# List all evaluations in dataset
python analyze.py evals data/r12.json
# List all tasks
python analyze.py tasks data/r12.json
# Get model metadata
python analyze.py modelinfo data/r12.json --eval-id <hash>
Filter during discovery:
# Search by name
python analyze.py evals data/r12.json --search qwen
# Filter by groups
python analyze.py evals data/r12.json --groups arch:moe,size:large
Available Datasets and Research Organization¶
Main Reference Dataset: r12¶
Location: data/r12.json
Purpose: Current primary evaluation set with improved difficulty calibration
Scope:
- All r12 cohorts
- 12 reasoning tasks
Use cases:
- Ranking models (Position workflow)
- Cross-model capability comparison (Profile workflow)
- Establishing baseline performance
r12 research subdatasets will be added as the corpus grows.
Leaderboard Dataset: r12-leaderboard¶
Location: data/r12-leaderboard.json
Purpose: Curated leaderboard view — only cohorts tagged leaderboard in their evals.json.
Scope:
- r12 cohorts with tags: leaderboard
- 12 reasoning tasks
Use cases:
- Canonical model rankings (Position workflow)
- Stable reference for comparisons (excludes experimental/WIP evaluations)
See Also¶
- architecture.md - Five-stage methodology overview
- r12.md - r12 evaluation system documentation
- config.md - Configuration reference (manifolds, templates, samplers)
- pointsdb.md - PointsDB API and query patterns
- workflow.md - Four research workflows (Position, Profile, Point, Probe)
- implementation.md - Complete tool reference