Summary
Replace the fragmented consumer read surface (/api/v1/discovery/*, /api/v1/records/{srn}, /search/*) with a single /data/ URL family owned by a new data domain. Delete the empty search and export domain shells. Delete the unused index domain (vector + keyword backends, ChromaDB infra). No live archives → no backwards compatibility.
Primary JTBD: .csv.gz streaming at a stable per-schema URL, the canonical "weekly dump on a cron" pattern. JSON for paginated exploratory reads, also from the same engine.
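On the consumer side, the "weekly dump on a cron" pattern amounts to reading the gzip stream in chunks and parsing CSV rows as they arrive, so the client never holds the whole dump in memory either. The following is a minimal sketch under that assumption; `iter_text_lines` and `iter_csv_rows` are hypothetical helper names, not part of any SDK.

```python
import codecs
import csv
import zlib
from typing import Iterable, Iterator


def iter_text_lines(chunks: Iterable[bytes]) -> Iterator[str]:
    """Incrementally gunzip and UTF-8-decode a byte stream, yielding text lines."""
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # +16 selects the gzip container
    decoder = codecs.getincrementaldecoder("utf-8")()
    pending = ""
    for chunk in chunks:
        pending += decoder.decode(decomp.decompress(chunk))
        # Everything before the last "\n" is complete; keep the remainder for later.
        *lines, pending = pending.split("\n")
        for line in lines:
            yield line + "\n"
    pending += decoder.decode(decomp.flush(), final=True)
    if pending:
        yield pending


def iter_csv_rows(chunks: Iterable[bytes]) -> Iterator[list[str]]:
    """Parse CSV rows lazily; csv.reader pulls more lines for quoted multi-line fields."""
    return csv.reader(iter_text_lines(chunks))
```

The `chunks` iterable could come from any streaming HTTP client, for example `requests.get(url, stream=True).iter_content(65536)`; that choice is an assumption and not something this proposal prescribes.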
Scope (lean v1)
- /data/{schema}/records[.csv|.csv.gz]: schema-scoped table read, primary JTBD
- /data/{schema}/{hook}[.csv|.csv.gz]: hook table dumps
- GET /data/records/{id}: single record by internal ID (server resolves schema via PK)
- POST filter body, e.g. "give me compounds with MW < 500 as csv.gz"
- Basic catalog and schema manifest (fields, hooks, counts; no example queries yet)
- Pluggable serializer registry (CSV, CSV.gz exposed; NDJSON, Parquet wired but unexposed)
- Reserved /data/datasets/ URL slot for v2 (operator-defined frozen datasets)
API surface
    GET  /data                                   node catalog
    GET  /data/{schema}                          schema manifest (basic)
    GET  /data/records/{id}                      single record by internal ID
    GET  /data/records/{id}@{version}            pinned version
    GET  /data/{schema}/records[.csv|.csv.gz]    schema-scoped table read
    POST /data/{schema}/records[.csv|.csv.gz]    filter body
    GET  /data/{schema}/{hook}[.csv|.csv.gz]     hook table read
    POST /data/{schema}/{hook}[.csv|.csv.gz]     filter body
    GET  /data/datasets                          (reserved, v2)
    GET  /data/datasets/{name}[@version]         (reserved, v2)
Schema versioning syntax: /data/{schema}@{semver}/{table}. Uses existing SchemaId.parse.
Reserved-path handling: GET /data/records and GET /data/datasets return 404. The catalog handler explicitly 404s on reserved schema names. Cross-schema records bulk read uses POST (not GET), so there's no GET to reserve for the future.
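To make the `{schema}@{semver}` syntax and the read-side reserved-name rule concrete, here is a minimal sketch of how a path segment might be split. `SchemaRef` and `parse_schema_segment` are hypothetical stand-ins; the real code goes through the existing SchemaId.parse, whose signature is not shown in this issue, and the semver check below is deliberately simplified.

```python
import re
from dataclasses import dataclass
from typing import Optional

RESERVED = frozenset({"records", "datasets"})
_SEMVER = re.compile(r"\d+\.\d+\.\d+")  # simplified: no pre-release or build tags


@dataclass(frozen=True)
class SchemaRef:
    """Hypothetical stand-in for whatever SchemaId.parse returns."""
    name: str
    version: Optional[str]  # None means "no version pinned"


def parse_schema_segment(segment: str) -> SchemaRef:
    """Split 'compounds@1.2.0' into name and version."""
    name, sep, version = segment.partition("@")
    if not name or name in RESERVED:
        # A route handler would translate this into a 404.
        raise LookupError(f"no such schema: {segment!r}")
    if sep and not _SEMVER.fullmatch(version):
        raise ValueError(f"invalid schema version: {version!r}")
    return SchemaRef(name, version if sep else None)
```

With this shape, `parse_schema_segment("compounds@1.2.0")` yields a pinned reference, `parse_schema_segment("compounds")` leaves the version unset, and `parse_schema_segment("records")` raises before any catalog lookup happens.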
Key decisions
| Decision | Choice |
| --- | --- |
| Auth on /data/* | All public for v1 (CDN-cacheable stable URLs) |
| Identifiers in URLs / filter DTOs | Internal IDs (UUIDv7/ULID); SRNs sidelined |
| Identifiers in response bodies | Include both id (bare) AND srn (full) for federation/citation |
| Filter dialect | POST body only (FilterExpr DSL from existing discovery domain) |
| Default format (no suffix) | JSON array, implicitly paginated (default page size 50, max 1000) |
| Bulk format | .csv.gz streams end-to-end (gzip-while-streaming via zlib.compressobj); bounded memory regardless of result size |
| Backwards compatibility | None. Pre-release; no consumers. |
| Cross-schema bulk read | Deferred (along with column-projection question). URL slot held. |
| Reserved words | Hook names and schema IDs cannot be records or datasets. Enforced at registration. Constant lives at domain/shared/model/reserved.py. |
| Engine IR | Plain AsyncIterator[RecordSummary]. No pyarrow in v1. Parquet (when it ships) owns its own row→Arrow conversion. |
Method semantics
| Method | Body | Use case | Cacheability |
| --- | --- | --- | --- |
| GET (no params) | none | First JSON page, default 50 rows | Low (cursor-dependent) |
| GET (?limit, ?cursor, ?sort) | none | Paginated JSON read | Per-cursor |
| GET (.csv / .csv.gz) | none | Bulk streaming dump | High (stable URL, CDN-friendly) |
| POST | FilterExpr | Filtered read, any format | Not cached |
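The paginated GET rows above imply two small pieces of logic: resolving ?limit against the documented default (50) and maximum (1000), and round-tripping an opaque ?cursor. The sketch below is illustrative only. Clamping oversized limits (rather than rejecting them) and the base64-JSON cursor layout are assumptions; this issue does not specify either.

```python
import base64
import json
from typing import Optional

DEFAULT_LIMIT = 50   # default page size from "Key decisions"
MAX_LIMIT = 1000     # maximum page size from "Key decisions"


def resolve_limit(raw: Optional[int]) -> int:
    """Apply the default, reject non-positive values, clamp to the maximum (assumed policy)."""
    if raw is None:
        return DEFAULT_LIMIT
    if raw < 1:
        raise ValueError("limit must be >= 1")
    return min(raw, MAX_LIMIT)


def encode_cursor(last_id: str) -> str:
    """Assumed cursor format: URL-safe base64 of a small JSON object."""
    payload = json.dumps({"after": last_id}).encode("utf-8")
    return base64.urlsafe_b64encode(payload).decode("ascii")


def decode_cursor(cursor: str) -> str:
    """Inverse of encode_cursor; returns the ID to resume after."""
    payload = base64.urlsafe_b64decode(cursor.encode("ascii"))
    return json.loads(payload)["after"]
```

Keeping the cursor opaque means its internal layout can change later without breaking the URL contract.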
Internal architecture
Route handler (factory-generated for table routes, hand-written for catalog/manifest/by-id)
↓ (parse URL params or POST body)
QueryPlan (IR: schema, table, filter, pagination, sort, format)
↓
Engine.execute(plan) → AsyncIterator[RecordSummary]
↓
Serializer.write(rows) → AsyncIterator[bytes]
↓
FastAPI StreamingResponse
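The pipeline above can be sketched end to end with plain async generators. Everything here is a simplified stand-in: the `RecordSummary` fields, the placeholder SRN strings, the in-memory `execute`, and the quoting-free `write_csv` are assumptions for illustration, not the real engine or serializer.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional


@dataclass(frozen=True)
class RecordSummary:
    """Simplified stand-in for the real row type."""
    id: str
    srn: str


@dataclass(frozen=True)
class QueryPlan:
    """Reduced IR: the real plan also carries filter, sort and format."""
    schema: str
    table: str = "records"
    limit: Optional[int] = None


async def execute(plan: QueryPlan) -> AsyncIterator[RecordSummary]:
    """Fake engine: yields in-memory rows instead of streaming from Postgres."""
    rows = [RecordSummary(str(i), f"srn:{plan.schema}:{i}") for i in range(1, 4)]
    for row in rows[: plan.limit]:
        yield row


async def write_csv(rows: AsyncIterator[RecordSummary]) -> AsyncIterator[bytes]:
    """Naive CSV serializer (no quoting): header first, then one chunk per row."""
    yield b"id,srn\r\n"
    async for row in rows:
        yield f"{row.id},{row.srn}\r\n".encode("utf-8")


async def collect(plan: QueryPlan) -> bytes:
    """Drain the pipeline into one bytes object (tests only; production streams)."""
    return b"".join([chunk async for chunk in write_csv(execute(plan))])
```

In a route handler the byte iterator would be handed to the streaming response directly instead of being collected, so only one row is in flight at a time.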
Service split
- DataQueryService: filter validation, stream_records, stream_features. Inherits validation helpers from existing DiscoveryService.
- DataCatalogService: get_node_catalog, get_schema_manifest, get_record_by_id.
Route file layout
    application/api/v1/routes/data/
        __init__.py   # router; wire-up
        catalog.py    # GET /data, GET /data/{schema}
        records.py    # GET /data/records/{id}
        tables.py     # register_table_routes factory + table-shaped routes
        models.py     # shared Pydantic models
Format registry — DataResponseFormat + metaprogrammed routes
    from dataclasses import dataclass

    # Serializer, the concrete serializers, and the make_*_endpoint builders
    # are assumed to be defined elsewhere in the data domain.

    @dataclass(frozen=True)
    class DataResponseFormat:
        serializer: type[Serializer]
        paginated: bool   # JSON: True; CSV/CSV.gz: False
        suffix: str       # "", "csv", "csv.gz"
        media_type: str

    FORMATS = [
        DataResponseFormat(JsonSerializer, paginated=True, suffix="", media_type="application/json"),
        DataResponseFormat(CsvSerializer, paginated=False, suffix="csv", media_type="text/csv"),
        DataResponseFormat(CsvGzipSerializer, paginated=False, suffix="csv.gz", media_type="application/gzip"),
    ]

    def register_table_routes(router, base_path, get_handler, post_handler, resource_name):
        """Register one GET and one POST route per format under base_path."""
        for fmt in FORMATS:
            # The empty suffix (JSON) maps to the bare path; others append ".<suffix>".
            path = f"{base_path}.{fmt.suffix}" if fmt.suffix else base_path
            builder = make_paginated_endpoint if fmt.paginated else make_streaming_endpoint
            router.add_api_route(path, builder(fmt, get_handler), methods=["GET"], operation_id=...)
            router.add_api_route(path, builder(fmt, post_handler), methods=["POST"], operation_id=...)
Adding NDJSON or Parquet later = one append to FORMATS. All resources get the format automatically.
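One detail worth verifying when the suffix loop is applied to the `{hook}` routes: if the router tries routes in registration order and a path parameter matches any run of non-slash characters (Starlette's default behaviour), a bare `/data/{schema}/{hook}` registered first may capture `features.csv.gz` as the hook name. The sketch below emulates that matching with plain regexes and shows a longest-suffix-first ordering as one possible mitigation. `table_paths`, `to_regex`, and `first_match` are hypothetical helpers, and the emulation is an approximation of the router, not its actual code.

```python
import re
from typing import Optional


def table_paths(base_path: str, suffixes: list[str]) -> list[str]:
    """Expand a base path into one path per suffix, longest suffix first."""
    ordered = sorted(suffixes, key=len, reverse=True)
    return [f"{base_path}.{s}" if s else base_path for s in ordered]


def to_regex(path: str) -> "re.Pattern[str]":
    """Rough emulation: each {param} matches one or more non-slash characters."""
    parts = re.split(r"\{(\w+)\}", path)  # literal, name, literal, name, ...
    pattern = "".join(
        f"(?P<{part}>[^/]+)" if i % 2 else re.escape(part)
        for i, part in enumerate(parts)
    )
    return re.compile(pattern)


def first_match(paths: list[str], url: str) -> Optional[str]:
    """Return the first registered path whose pattern matches the whole URL."""
    for path in paths:
        if to_regex(path).fullmatch(url):
            return path
    return None
```

Under these assumptions the literal `/data/{schema}/records` family is unaffected, because a fixed final segment cannot match a suffixed URL; only the parameterised `{hook}` family depends on ordering.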
Streaming details
- .csv.gz uses zlib.compressobj(level=6, wbits=MAX_WBITS|16).
- Memory footprint: ~32KB DEFLATE sliding window (zlib's full compressor state is roughly 256KB at default settings) + one row buffer. Constant regardless of total result size.
- Content-Length not set (chunked transfer encoding).
- Client disconnect cancels the async generator; engine propagates cancellation to the Postgres cursor (try/finally + session.stream() context manager).
- Pre-flight validation pulls the first batch before sending HTTP 200; errors before the first byte → 4xx. Errors after the first byte → partial, corrupt download.
- Empty CSV / CSV.gz result: 200 with header row + EOF.
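The gzip-while-streaming and pre-flight points above can be sketched as two small async helpers. This is a minimal illustration, not the actual serializer: `gzip_stream`, `preflight`, and `csv_lines` are hypothetical names, and the real code would sit behind the Serializer interface.

```python
import asyncio
import zlib
from typing import AsyncIterator


async def gzip_stream(chunks: AsyncIterator[bytes], level: int = 6) -> AsyncIterator[bytes]:
    """Compress an async byte stream into one gzip member, chunk by chunk."""
    comp = zlib.compressobj(level, zlib.DEFLATED, zlib.MAX_WBITS | 16)  # +16 => gzip header/trailer
    async for chunk in chunks:
        out = comp.compress(chunk)
        if out:  # the compressor buffers, so small inputs may produce no output yet
            yield out
    yield comp.flush()  # remaining data plus the gzip trailer (CRC32 and length)


async def preflight(rows: AsyncIterator[bytes]) -> AsyncIterator[bytes]:
    """Pull the first item eagerly so failures surface before any bytes are sent."""
    it = rows.__aiter__()
    try:
        first = [await it.__anext__()]
    except StopAsyncIteration:
        first = []

    async def replay() -> AsyncIterator[bytes]:
        for item in first:
            yield item
        async for item in it:
            yield item

    return replay()


async def csv_lines() -> AsyncIterator[bytes]:
    """Tiny fake row source: header plus three rows."""
    yield b"id,mw\r\n"
    for i in range(3):
        yield f"{i},{100 + i}\r\n".encode("utf-8")


async def collect() -> bytes:
    stream = await preflight(csv_lines())
    return b"".join([c async for c in gzip_stream(stream)])
```

Because `preflight` awaits the first item before returning, an exception raised while producing it can still be turned into a 4xx; anything raised later happens after the status line has gone out, which is the "partial corrupt download" case noted above.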
Runtime hardening (v1 minimum)
- Per-route Postgres timeouts via SET LOCAL:
  - Paginated JSON routes: statement_timeout = 30s.
  - Streaming CSV / CSV.gz routes: statement_timeout = 30min.
  - All routes: idle_in_transaction_session_timeout = 5min.
- Rate limit on POST routes via slowapi: 10 req/min per IP. Permissive default; tighten if needed.
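The per-route timeouts listed above could be expressed as a small lookup that produces the SET LOCAL statements to run at the start of each request's transaction. This is a sketch under assumptions: `RouteKind` and `timeout_statements` are hypothetical names, and how the statements get executed (session hook, dependency, or middleware) is left open.

```python
from enum import Enum


class RouteKind(Enum):
    PAGINATED = "paginated"   # JSON routes
    STREAMING = "streaming"   # CSV / CSV.gz routes


# Values taken from the list above.
STATEMENT_TIMEOUT = {RouteKind.PAGINATED: "30s", RouteKind.STREAMING: "30min"}
IDLE_IN_TX_TIMEOUT = "5min"


def timeout_statements(kind: RouteKind) -> list[str]:
    """Build the SET LOCAL statements for one request.

    SET LOCAL only takes effect inside a transaction and reverts at commit
    or rollback, so the settings cannot leak to other pooled connections' work.
    """
    return [
        f"SET LOCAL statement_timeout = '{STATEMENT_TIMEOUT[kind]}'",
        f"SET LOCAL idle_in_transaction_session_timeout = '{IDLE_IN_TX_TIMEOUT}'",
    ]
```

The values are fixed constants rather than user input, so formatting them into the SQL text is safe here; anything request-derived should not be interpolated this way.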
Deeper question of inline streaming vs async export jobs (200 vs 202) and its relationship to the deferred Dataset concept is tracked separately in issue #138.
What gets renamed
domain/discovery/ → domain/data/. Services rename + split (DataQueryService + DataCatalogService).
What gets deleted
- domain/index/: vector + keyword backends, FanOutToIndexBackends handler, IndexRecord events, ChromaDB infra.
- domain/search/: empty shell.
- domain/export/: empty shell.
- sdk/index/: unused SDK package.
- routes/discovery.py, routes/records.py, routes/search.py.
- The index-counts field on /stats (replaced with /data/ schema counts).
- chromadb and sentence-transformers from server/pyproject.toml.
What stays untouched
- Write side: /depositions, /conventions, /validation, /curation, /schemas, /ontologies, /auth, /admin, /ingestions, /health, /stats.
- /events?cursor=... (changefeed for mirror/federation). Different bounded context. Out of scope for /data/.
- RecordPublished event still emits. Only the indexing fan-out goes away.
Deferred to future issues
- Cross-schema bulk read (POST /data/records family) + column projection decision.
- /data/{schema}/openapi.json (the agent-affordance journey).
- /data/datasets/... (citation/artifact work; URL slot reserved). See issue #138 (research: production hardening for large query results + relationship to Dataset concept).
Web UI + SDK migration — OUT OF SCOPE
- Web frontend (web/) is currently broken and slated for significant rebuild. Not migrating URLs as part of this work.
- Python SDK lives in a separate repository. Required SDK changes are documented separately for the SDK team. No atomic cross-repo coordination required.
Acceptance criteria
- data domain exists; absorbs discovery query engine, splits into DataQueryService + DataCatalogService.
- index, search, export domains deleted; sdk/index/ deleted.
- /data/ URL family covers catalog, basic schema manifest, schema-scoped records read, hook tables, single-record-by-ID lookup.
- Single shared engine produces JSON, CSV, and CSV.gz from one row stream via the DataResponseFormat + factory pattern.
- .csv.gz streams end-to-end with bounded server memory (validated: pull 100K-row table at constant ~50MB server memory).
- POST-only filter dialect; no URL operator syntax.
- All routes public; no auth dependency on /data/*.
- Reserved-name policy (records, datasets) enforced at hook + schema registration via domain/shared/model/reserved.py.
- Read-side reserved paths (GET /data/records, GET /data/datasets) return 404 with explicit contract tests.
- All URLs and filter DTOs use bare internal IDs; response bodies include both id and srn.
- Per-route statement_timeout set; slowapi rate limit on POST routes.
- Existing discovery query handlers and engine code are reused, not rewritten.
- Contract tests cover the URL family + reserved-path 404s + streaming memory footprint.