jeroenflvr/datapress

datap-rs

A Rust Cargo workspace that exposes one or more Parquet / Delta datasets over a JSON HTTP API. The same surface area is implemented twice — once on top of DuckDB, once on top of Apache Arrow + DataFusion — so you can A/B the engines under identical workloads. A Python wheel (datap-rs, built with maturin + PyO3) bundles both engines and lets you configure and launch the server from Python.

Overview presentation → datap-rs.org · Documentation

  • Built on actix-web 4
  • Datasets declared in a single datasets.toml (Rust binaries) or programmatically (Python wrapper)
  • Dynamic schema inference at startup (no hard-coded columns)
  • Identical request/response shapes across both backends

Quick start

For testing, we use the Kaggle US Accidents (2016-2023) dataset.

# 1. Put a parquet file somewhere (or point the config at an existing one).
ls data/accidents.parquet

# 2. Edit datasets.toml — see the example shipped in this repo.

# 3. Run a backend.
task run:duckdb        # or: task run:datafusion

# 4. Talk to it.
curl http://localhost:8080/api/v1/datasets

Taskfile.yml wraps the typical cargo build --release -p … invocations; see task --list for the full menu.

Install the prebuilt binary

The quickest way to get the unified datapress binary (both backends bundled, selected at runtime via server.backend) without a Rust toolchain:

# Linux / macOS
curl -LsSf https://datap-rs.org/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://datap-rs.org/install.ps1 | iex"

# Homebrew (macOS / Linux)
brew install jeroenflvr/tap/datapress

# winget (Windows)
winget install datap-rs.DataPress

# Docker (mount a config that sets listen = "0.0.0.0")
docker run --rm -p 8080:8080 \
  -v "$PWD/datasets.toml:/etc/datapress/datasets.toml:ro" \
  jeroenflvr/datapress:latest

The install scripts drop the binary in a per-user directory (~/.local/bin on Unix, %LOCALAPPDATA%\datapress\bin on Windows) and tell you how to add it to your PATH. See packaging/ for details and release automation.

Prefer cargo? Install from crates.io:

cargo install datapress        # both DuckDB + DataFusion
datapress                      # reads ./datasets.toml (or $DATASETS_CONFIG)

For a slimmer single-backend build, or to opt into the docs / Swagger / metrics / auth features:

cargo install datapress --no-default-features --features duckdb
cargo install datapress --features swagger,auth,metrics

The installed binary resolves its config from (first match wins) --config <FILE>, $DATAPRESS_CONFIG_FILE, ./datasets.toml, then $HOME/datasets.toml. Generate a starter template with datapress init (writes datasets.toml.template to a directory, or $HOME when omitted):

datapress init                 # ~/datasets.toml.template
cp ~/datasets.toml.template ~/datasets.toml   # then edit and run `datapress`
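
The lookup order described above can be sketched in Python. This is an illustration of the documented precedence only, not the binary's own code; `resolve_config` and its parameters are hypothetical names, and it assumes that an explicit `--config` or `$DATAPRESS_CONFIG_FILE` value is used without an existence check.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_config(
    cli_config: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
    exists: Callable[[Path], bool] = Path.exists,
) -> Optional[Path]:
    """Return the first config candidate, following the documented order."""
    # 1. An explicit --config <FILE> wins.
    if cli_config:
        return Path(cli_config)
    # 2. Then the environment variable.
    from_env = env.get("DATAPRESS_CONFIG_FILE")
    if from_env:
        return Path(from_env)
    # 3. and 4. Finally ./datasets.toml, then $HOME/datasets.toml.
    for candidate in (cwd / "datasets.toml", home / "datasets.toml"):
        if exists(candidate):
            return candidate
    return None
```

Calling `resolve_config(None, os.environ, Path.cwd(), Path.home())` would mimic a plain `datapress` invocation with no flags.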

From Python

The same server can be configured and launched from Python via the datapress wheel (one wheel, both engines bundled):

import asyncio
from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig

async def main():
    ds = DatasetConfig(
        name="accidents",
        source="data/accidents.parquet",
        format="parquet",   # or "delta"
        mode="auto",        # index policy: "auto" | "none" | "list"
    )
    cfg = DataPressConfig(backend="duckdb", listen="0.0.0.0", port=8000, workers=8)
    server = DataPress(cfg, datasets=[ds])
    await server.run()      # blocks until SIGINT

asyncio.run(main())

Build the wheel with task py:develop (uses uv + maturin).


Standalone clients

The datapress wheel above can both run a server and talk to it. If you only need to talk to an already-running server, three standalone clients share one lightweight Rust core (datapress-client) and pull in no server engines:

| Client | Package | Install |
| --- | --- | --- |
| Command line | datapress-cli | install script · cargo install datapress-cli |
| Python | datap-rs-client | uv pip install datap-rs-client[arrow] |
| Rust library | datapress-client | cargo add datapress-client |

# CLI: install script (Linux / macOS)
curl -LsSf https://datap-rs.org/install-cli.sh | sh
# CLI: install script (Windows)
powershell -ExecutionPolicy ByPass -c "irm https://datap-rs.org/install-cli.ps1 | iex"

datapress-cli datasets
datapress-cli query accidents --select State,Severity --where Severity:gte:3 --page-size 1000

# Python: dict in, dict out; query_arrow() returns a pyarrow.Table for
# Polars / pandas / DuckDB / PySpark / DataFusion.
from datap_rs_client import DataPressClient

client = DataPressClient("http://127.0.0.1:8000")
table = client.query_arrow("accidents", columns=["State", "Severity"], page_size=100_000)

See the clients documentation for the full reference. The CLI install scripts behave like the server's — per-user directory, checksum-verified, no PATH edits — using DATAPRESS_CLI_VERSION / DATAPRESS_CLI_INSTALL_DIR overrides.


The two backends

| Aspect | datapress-duckdb | datapress-datafusion |
| --- | --- | --- |
| Engine | DuckDB (embedded C++) | Arrow compute + DataFusion (pure Rust) |
| Storage | DuckDB in-memory table per dataset | One contiguous RecordBatch per dataset |
| Concurrency model | Connection pool, blocking → web::block | Async-native, multi-threaded MemTable partitions |
| Predicate execution | DuckDB optimiser + parallel hash/vector ops | Equality index → SIMD scan → DataFusion SQL |
| Indexes | Native DuckDB internals (zone maps, etc.) | Per-dataset eq-index built at startup (configurable) |
| Memory profile | DuckDB's own buffer manager | Whole dataset resident in RAM |
| Binary size | Bundled DuckDB ≈ tens of MB | Lean (pure Rust) |
| Startup time | Fast (just read_parquet) | Slower: reads all rows + builds eq-index |
| Best at | Heterogeneous SQL, joins, aggregations | Dense filter scans, low-latency point lookups |

When to pick which

  • DuckDB is the right default. It handles arbitrary SQL well, has a battle-tested optimiser, manages memory itself, and starts up in milliseconds because it lazily reads parquet pages on demand.
  • DataFusion shines when:
    • the dataset fits comfortably in RAM,
    • you query the same columns repeatedly with equality/IN predicates (the in-process equality index turns those into O(1) lookups), and
    • you want a single static binary without a vendored C++ runtime.

The HTTP API is identical, so the practical comparison is "throughput and p99 on your queries" — see TEST_Q.md for a benchmark suite.


Configuration: datasets.toml

Every instance reads this file at startup. One [server] block plus one [[dataset]] entry per table you want to expose.

[server]
backend = "datafusion"   # "datafusion" (default) | "duckdb"
listen  = "127.0.0.1"    # default; set to "0.0.0.0" to expose
port    = 8080
# workers = 8            # omit for one worker per CPU
# compress = true        # negotiate gzip/brotli/zstd via Accept-Encoding (default)
# max_body_bytes     = 1048576  # 413 above this; default 1 MiB
# max_page_size      = 100000   # clamp query page_size above this
# request_timeout_ms = 30000    # 504 above this; 0 disables; default 30s
# shutdown_timeout_secs = 30    # SIGTERM/SIGINT grace period, in seconds

# DuckDB backend only: enable the experimental Quack remote protocol.
# [server.quack]
# enabled = false
# uri = "quack:localhost"
# token = "change-me"
# read_only = true

[[dataset]]
name = "accidents"                    # used in the URL: /api/v1/datasets/accidents/...

  [dataset.source]
  kind     = "parquet"                # "parquet" | "delta"
  location = "data/accidents.parquet" # file, directory of *.parquet, or s3://…

  # Optional — DataFusion only. DuckDB ignores this block.
  [dataset.index]
  mode             = "auto"           # "auto" | "none" | "list"
  columns          = []               # required when mode = "list"
  max_cardinality  = 100000           # used by "auto" to skip wide cols

Server

| Field | Default | Notes |
| --- | --- | --- |
| backend | datafusion | Informational hint; logged at startup. Each binary always runs as its own backend regardless of this value. |
| listen | 127.0.0.1 | Loopback by default: the service is not exposed on a network interface unless you opt in. |
| port | 8080 | |
| workers | (unset) | Actix worker threads. Unset = one per CPU. |
| prefix | "" | URL path prefix mounted in front of every route (e.g. "/datapress"); useful behind a reverse proxy that passes the path through unchanged. Must start with / and not end with /. |
| compress | true | Negotiate response compression via Accept-Encoding (gzip / brotli / zstd). Disable when sitting behind a proxy that compresses for you. |
| max_body_bytes | 1048576 | Maximum accepted JSON request body, in bytes. Bigger bodies are rejected with 413 Payload Too Large. |
| max_page_size | 100000 | Maximum rows returned by one /query page. Larger page_size values are clamped. |
| request_timeout_ms | 30000 | Per-request handler timeout, in milliseconds. Long-running handlers are cancelled and the client gets 504 Gateway Timeout. 0 disables the timeout. |
| shutdown_timeout_secs | 30 | Grace period for in-flight requests after the process receives SIGTERM / SIGINT, in seconds. The listening socket is closed immediately; existing connections then have up to this many seconds to finish before workers are force-stopped. |

DuckDB builds can also opt into [server.quack], DuckDB's experimental remote protocol server. Keep it disabled unless you intentionally want DuckDB clients to attach/query this process directly. It binds to quack:localhost by default, uses token authentication, and DataPress installs a read-only authorization hook by default.

The server exposes three probe endpoints plus a /version metadata route. /healthz and /readyz are mounted at the bare host root (regardless of prefix) so orchestrators don't need to know how the service is exposed. /health lives under prefix and is intended for in-app health checks.

| Route | Status | Body |
| --- | --- | --- |
| /healthz | Liveness: always 200 while the process is running. | {"status":"ok"} |
| /readyz | Readiness: 200 once at least one dataset is registered, 503 otherwise. | {"status":"ready","datasets":N} / {"status":"not ready","reason":"no datasets registered"} |
| /version | Build / version metadata: always 200. | {"name":"datapress-core","version":"x.y.z","backend":"DuckDB" or "DataFusion","profile":"debug" or "release", ...} |
| {prefix}/health | App-level liveness: always 200. | {"status":"ok"} |

/healthz does not touch the backend, so it stays 200 even while the dataset registry is still loading at startup. Use /readyz to gate traffic until the server is actually able to serve queries.
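
Gating traffic on /readyz might look like the following standard-library sketch. The helper names (`http_status`, `wait_until_ready`), the retry count, and the delay are assumptions made for illustration; only the route and its 200 / 503 behaviour come from the table above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while no dataset is registered
    except OSError:
        return 0  # connection refused, DNS failure, timeout


def wait_until_ready(
    base_url: str,
    attempts: int = 30,
    delay_s: float = 1.0,
    get_status: Callable[[str], int] = http_status,
) -> bool:
    """Poll /readyz until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if get_status(f"{base_url}/readyz") == 200:
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

`get_status` is injectable so the loop can be exercised without a running server; in normal use `wait_until_ready("http://localhost:8080")` is enough.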

/version also includes optional fields populated from build-time env vars when set: git_sha (DATAPRESS_GIT_SHA), build_time (DATAPRESS_BUILD_TIME, ISO-8601), and target (DATAPRESS_TARGET, e.g. aarch64-apple-darwin). Unset vars are omitted from the JSON. Example:

DATAPRESS_GIT_SHA=$(git rev-parse --short HEAD) \
DATAPRESS_BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
DATAPRESS_TARGET=$(rustc -vV | awk '/host:/ {print $2}') \
  cargo build --release -p datapress-duckdb

Online documentation

DataPress can embed two browsable sources of documentation into the binary itself:

  • An MkDocs Material site (the one you are reading) at [docs].path (default /mkdocs).
  • An interactive Swagger UI with a hand-written OpenAPI spec at [swagger].path (default /docs). The raw spec is also exposed at <path>/openapi.json.

Both are opt-in at build time (so wheels stay slim when you don't want them) and enabled by default at runtime once compiled in — set enabled = false to disable in prod.

  1. Build the MkDocs site (only needed for the docs feature):

    task docs:build
  2. Build the backend with one or both features:

    cargo build --release -p datapress-duckdb --features docs,swagger
  3. Tweak in datasets.toml if you want to relocate or disable either:

    [docs]
    enabled = true        # default: true
    path    = "/mkdocs"   # default: /mkdocs
    
    [swagger]
    enabled = true        # default: true (set to false in prod)
    path    = "/docs"     # default: /docs

Both path values must start with /, not end with /, not collide with /api, /api/v1, /health{z,}, /readyz, or /version, and must differ from each other. When the binary is built without the relevant feature but the TOML enables it, the server logs a warning at startup and continues without that surface.
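
The path rules above can be restated as a small Python check. This is a sketch, not the server's validator: `validate_surface_paths` is a hypothetical name, and "collide" is assumed to mean an exact match against the reserved routes.

```python
from typing import List

# Reserved routes named in the text; /health{z,} expands to two entries.
RESERVED = {"/api", "/api/v1", "/health", "/healthz", "/readyz", "/version"}


def validate_surface_paths(docs_path: str, swagger_path: str) -> List[str]:
    """Return a list of problems; an empty list means both paths pass."""
    problems = []
    for label, path in (("docs", docs_path), ("swagger", swagger_path)):
        if not path.startswith("/"):
            problems.append(f"{label}: must start with '/'")
        if path.endswith("/"):
            problems.append(f"{label}: must not end with '/'")
        # Assumption: a collision is an exact match with a reserved route.
        if path in RESERVED:
            problems.append(f"{label}: collides with reserved route {path}")
    if docs_path == swagger_path:
        problems.append("docs and swagger paths must differ")
    return problems
```

With the defaults, `validate_surface_paths("/mkdocs", "/docs")` returns an empty list.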

Browser explorer

Build with --features explorer to embed a small browser UI at /explore for poking at datasets: list / schema browsing plus an API Query tab that issues /query calls and decodes the Arrow IPC responses client-side. Like the docs surfaces, it is opt-in at build time and served by default once compiled in.

Authentication (OIDC / OAuth2)

Build with --features auth to enable JWT bearer enforcement against any OpenID-Connect issuer (Entra ID, Auth0, Keycloak, Okta, …). When enabled, the server fetches the issuer's JWKS at startup, refreshes it in the background, and validates Authorization: Bearer <jwt> headers against the configured issuer, audience, algorithms, and scopes.

[auth]
enabled         = true
issuer          = "https://login.microsoftonline.com/<tenant-id>/v2.0"
audience        = "api://datapress"
algorithms      = ["RS256"]
read_scopes     = ["datasets:read"]
reload_scopes   = ["datasets:reload"]
anonymous_read  = false      # set true to keep read endpoints public
tenant_claim    = "/tid"     # JSON-pointer into the JWT claims
allowed_tenants = ["<tenant-id>"]
admin_token_fallback = true  # keep X-Admin-Token working in parallel

Health probes (/healthz, /readyz, /version) stay unauthenticated so load balancers keep working. The legacy X-Admin-Token header keeps working for POST .../reload as long as admin_token_fallback = true.

To turn the Swagger UI itself into an SSO client, add a [swagger.oauth2] block; it is rendered as an OpenIdConnect security scheme with PKCE.

Source

[dataset.source] is a tagged enum.

| kind | location | Notes |
| --- | --- | --- |
| parquet | a .parquet file | Read as-is. |
| parquet | a directory | Every *.parquet inside (sorted, non-recursive). No glob patterns. |
| parquet | s3://bucket/key.parquet or s3://bucket/prefix/ | Requires a [dataset.s3] block. DuckDB autoloads httpfs. |
| delta | a local directory | Pointed at the table root (the dir containing _delta_log/). |
| delta | s3://bucket/path/to/table | Requires [dataset.s3]. DuckDB autoloads delta; DataFusion uses the deltalake crate. |

S3 / S3-compatible storage

[[dataset]]
name = "events"

  [dataset.source]
  kind     = "parquet"           # or "delta"
  location = "s3://events/2025/"   # a key or a prefix

  [dataset.s3]
  region            = "us-east-1"
  endpoint          = "http://localhost:9000"  # omit for AWS
  addressing_style  = "path"                   # "virtual" (default) | "path"
  allow_http        = true                     # only for non-https endpoints

| Field | Default | Notes |
| --- | --- | --- |
| region | us-east-1 | Falls back to AWS_REGION env, then us-east-1. |
| endpoint | (unset) | Custom S3 endpoint (MinIO, R2, Wasabi, Backblaze, …). |
| addressing_style | virtual | virtual = https://bucket.host, path = https://host/bucket (MinIO). |
| allow_http | false | Must be true if endpoint is http://…. |
| partitioning | auto | Hive partition discovery: auto, hive (force on), none (force off). |
| endpoint_bucket_in_host | auto | Fold the bucket into the endpoint host: auto (follows addressing_style), true, false. |
| access_key_id, secret_access_key, session_token | (unset) | Inline creds. Discouraged for prod; use env vars instead. |

Credential precedence (highest → lowest):

  1. Per-dataset env vars: ${PREFIX}_AWS_ACCESS_KEY_ID, ${PREFIX}_AWS_SECRET_ACCESS_KEY, ${PREFIX}_AWS_SESSION_TOKEN, ${PREFIX}_AWS_REGION. PREFIX is the dataset name uppercased with every non-alphanumeric character mapped to _ (e.g. accidents → ACCIDENTS_AWS_…, my-bucket → MY_BUCKET_AWS_…).
  2. Inline [dataset.s3] keys.
  3. Plain AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION.
  4. The backend's default credential chain (~/.aws/credentials, IMDS, etc.).
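
The prefix rule and the first three precedence steps can be sketched in Python as follows. The function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and only the access key id is shown; the other credentials would follow the same pattern.

```python
import re
from typing import Mapping, Optional


def env_prefix(dataset_name: str) -> str:
    """Uppercase the name and map every non-alphanumeric character to '_'."""
    # Assumption: "alphanumeric" means ASCII letters and digits.
    return re.sub(r"[^A-Za-z0-9]", "_", dataset_name).upper()


def resolve_access_key(
    dataset_name: str,
    inline: Mapping[str, str],
    env: Mapping[str, str],
) -> Optional[str]:
    """Pick the access key id by the documented precedence (steps 1 to 3)."""
    candidates = (
        env.get(f"{env_prefix(dataset_name)}_AWS_ACCESS_KEY_ID"),  # 1. per-dataset env
        inline.get("access_key_id"),                               # 2. inline [dataset.s3]
        env.get("AWS_ACCESS_KEY_ID"),                              # 3. plain env
    )
    for value in candidates:
        if value:
            return value
    return None  # 4. defer to the backend's default credential chain
```

For a dataset named `my-bucket`, setting `MY_BUCKET_AWS_ACCESS_KEY_ID` therefore wins over both the inline key and the plain `AWS_ACCESS_KEY_ID`.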

Python: the S3Config binding also accepts a credentials_provider — a zero-argument callable returning an HMACKeyPair. It is invoked once when DataPress(...) is constructed, the result is cached indefinitely, and it overrides any inline access_key_id / secret_access_key. See the Python S3 docs.

When kind = "delta" and location is an s3://… URL, both backends fully materialise the table at startup. There is no incremental scan path — switch to parquet if you need on-demand page reads.

Equality-index policy (DataFusion only)

The DataFusion backend builds an in-memory value -> [row ids] map at startup so that eq / in predicates resolve in O(1).

| mode | Behaviour |
| --- | --- |
| auto | Index every column whose distinct count stays below max_cardinality. |
| none | Skip the index entirely; every query goes through DataFusion SQL. |
| list | Index only the named columns. Useful for huge datasets. |
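
To make the idea concrete, here is a toy Python version of a value -> [row ids] map with a cardinality cut-off. It illustrates the technique only and is not the backend's Rust implementation; the function names are hypothetical, and giving up once the distinct count exceeds `max_cardinality` is an assumption about how the "auto" threshold is applied.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Sequence


def build_eq_index(
    values: Sequence[Hashable], max_cardinality: int
) -> Optional[Dict[Hashable, List[int]]]:
    """Map each distinct value to its row ids; None if the column is too wide."""
    index: Dict[Hashable, List[int]] = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
        # Assumption: stop as soon as the distinct count passes the threshold.
        if len(index) > max_cardinality:
            return None
    return dict(index)


def lookup_eq(index: Dict[Hashable, List[int]], value: Hashable) -> List[int]:
    """eq predicate: one hash lookup."""
    return index.get(value, [])


def lookup_in(index: Dict[Hashable, List[int]], values: Iterable[Hashable]) -> List[int]:
    """in predicate: one hash lookup per listed value, merged in row order."""
    return sorted(row for v in set(values) for row in index.get(v, []))
```

For a column `["TX", "CA", "TX", "NY"]`, `lookup_eq(index, "TX")` yields rows 0 and 2 without scanning the column.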

Override the config path with DATASETS_CONFIG=/path/to/file.toml.

HTTP API

Five core routes on both backends (list, schema, query, count, reload), plus an opt-in raw-SQL endpoint (see below).

API versioning

The canonical paths live under /api/v1/.... The un-versioned /api/... paths continue to work as a legacy alias for v1, so existing clients keep running. To upgrade, replace /api/ with /api/v1/ in your URLs — nothing else changes.

POST /api/v1/datasets/accidents/query      # canonical (recommended)
POST /api/datasets/accidents/query         # legacy alias, still v1

When a breaking schema change is introduced, it will ship as /api/v2 in a module that sits alongside the v1 handlers (crates/core/src/handlers/v1.rs), and v1 will stay mounted next to it for a deprecation window.

GET /api/v1/datasets

{ "datasets": [ { "name": "accidents", "columns": 47 } ] }

GET /api/v1/datasets/{name}/schema

Returns the inferred columns plus a sample row so a client can see what values look like without issuing a query.

{
  "name": "accidents",
  "columns": [
    { "name": "ID",       "logical": "utf8", "sql_type": "VARCHAR",   "nullable": false },
    { "name": "Severity", "logical": "int",  "sql_type": "INTEGER",   "nullable": true  },
    { "name": "Start_Time", "logical": "temporal", "sql_type": "TIMESTAMP", "nullable": true }
  ],
  "sample": { "ID": "A-1", "Severity": 2, "Start_Time": "2016-02-08 05:46:00", ... }
}

logical values: bool | int | float | utf8 | temporal | other. Temporal columns are returned as strings.

POST /api/v1/datasets/{name}/query

{
  "columns":   ["ID", "City", "State", "Severity"],
  "predicates": [
    { "col": "State",    "op": "eq",  "val": "TX" },
    { "col": "Severity", "op": "gte", "val": 3   }
  ],
  "order_by": [
    { "col": "Severity", "dir": "desc" },
    { "col": "ID" }
  ],
  "limit":     1000,
  "page":      1,
  "page_size": 50
}

Response:

{ "data": [ { ... }, ... ], "page": 1, "page_size": 50 }

Request fields

| Field | Type | Default | Notes |
| --- | --- | --- | --- |
| columns | string[] | [] | Empty = all columns. |
| predicates | Predicate[] | [] | ANDed together. |
| order_by | OrderBy[] | [] | { col, dir? }; dir is asc (default) or desc, case-insensitive. When group_by is set, col must be a group column or aggregation alias. |
| group_by | string[] | [] | Columns to group by. When set, columns is ignored. Empty aggregations implies [{ op: "count" }]. |
| aggregations | Aggregation[] | [] | { col?, op, alias? }; op is one of count, sum, avg, min, max. col may be omitted only for count (= COUNT(*)). Requires group_by. |
| distinct | bool | false | Dedup the projected columns. Mutually exclusive with group_by / aggregations. |
| limit | int >= 0 or null | null | Hard cap on total rows across all pages. null = unlimited. |
| page | int >= 1 | 1 | 1-based. |
| page_size | int >= 1 | 1000 | Clamped to server.max_page_size (100_000 by default). |
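
How page, page_size and limit might combine into an OFFSET / row-count pair is sketched below. This is a plausible reading of the field descriptions, not the server's code: `page_window` is a hypothetical name, and treating limit as a cap on the absolute row position is an assumption.

```python
from typing import Optional, Tuple


def page_window(
    page: int,
    page_size: int,
    limit: Optional[int],
    max_page_size: int = 100_000,
) -> Tuple[int, int]:
    """Return (offset, row_count) for one page of results."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    if limit is not None and limit < 0:
        raise ValueError("limit must be >= 0 or null")
    size = min(page_size, max_page_size)  # clamp to server.max_page_size
    offset = (page - 1) * size            # pages are 1-based
    if limit is None:
        return offset, size
    # Assumption: limit caps the total rows across all pages.
    remaining = max(0, limit - offset)
    return offset, min(size, remaining)
```

Under this reading, a request with `limit: 1000` and `page_size: 400` gets 400, 400 and 200 rows on pages 1 to 3, and nothing after that.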

Predicate shape

{ "col": "<column>", "op": "<operator>", "val": <json value | array | omitted> }

| op | val | Meaning |
| --- | --- | --- |
| eq | scalar | col = val |
| neq | scalar | col <> val |
| gt / gte | number / string | col > val / col >= val |
| lt / lte | number / string | col < val / col <= val |
| like | string with % / _ | SQL LIKE |
| ilike | string with % / _ | Case-insensitive LIKE |
| in | non-empty array | col IN (v1, v2, …) |
| is_null | omit | col IS NULL |
| is_not_null | omit | col IS NOT NULL |

Column names are looked up case-insensitively against the inferred schema and quoted automatically, so Temperature(F) and similar identifiers work.
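
One way to picture the mapping from predicates to SQL is the Python sketch below, which resolves column names case-insensitively, quotes them, and emits a parameterised WHERE clause. It is an illustration under assumptions, not the server's translator: `build_where` and `quote_ident` are hypothetical names, and the use of `?` placeholders is a choice made for this example.

```python
from typing import Any, Dict, List, Sequence, Tuple

BINARY_OPS = {
    "eq": "=", "neq": "<>", "gt": ">", "gte": ">=",
    "lt": "<", "lte": "<=", "like": "LIKE", "ilike": "ILIKE",
}
NULL_OPS = {"is_null": "IS NULL", "is_not_null": "IS NOT NULL"}


def quote_ident(name: str) -> str:
    """Double-quote an identifier so names like Temperature(F) are valid SQL."""
    return '"' + name.replace('"', '""') + '"'


def build_where(
    predicates: Sequence[Dict[str, Any]], schema_columns: Sequence[str]
) -> Tuple[str, List[Any]]:
    """Return (where_clause, params); predicates are ANDed together."""
    by_lower = {c.lower(): c for c in schema_columns}
    clauses: List[str] = []
    params: List[Any] = []
    for pred in predicates:
        actual = by_lower.get(pred["col"].lower())  # case-insensitive lookup
        if actual is None:
            raise ValueError(f"unknown column: {pred['col']}")
        col, op = quote_ident(actual), pred["op"]
        if op in BINARY_OPS:
            clauses.append(f"{col} {BINARY_OPS[op]} ?")
            params.append(pred["val"])
        elif op == "in":
            values = pred.get("val") or []
            if not values:
                raise ValueError("in requires a non-empty array")
            clauses.append(f"{col} IN ({', '.join('?' for _ in values)})")
            params.extend(values)
        elif op in NULL_OPS:
            clauses.append(f"{col} {NULL_OPS[op]}")
        else:
            raise ValueError(f"unknown operator: {op}")
    return " AND ".join(clauses), params
```

Passing values as parameters rather than splicing them into the SQL text keeps user input out of the statement itself.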

Response format — JSON or Arrow IPC

/query can return its result set in two wire formats. Same body, same predicates, same pagination — only the response encoding differs.

| Aspect | JSON (default) | Arrow IPC stream |
| --- | --- | --- |
| Content-Type | application/json | application/vnd.apache.arrow.stream |
| How to ask | nothing, it's the default | Accept: application/vnd.apache.arrow.stream or ?format=arrow on the URL |
| Shape | Array of row objects ([{...}, {...}, ...]) | Self-describing stream: 1 schema message + N RecordBatch messages + EOS |
| Layout | Row-oriented; column names repeated on every row | Columnar; one contiguous buffer per column per batch |
| Types preserved | Scalars become JSON (int/float/bool/string); temporals stringified to ISO-8601 | Native Arrow types (Int32, Timestamp(ns), Decimal128, dictionary, etc.) retained end-to-end |
| Page metadata | In the body (just the rows, no envelope) | In headers: X-Page, X-Page-Size |
| Empty result | [] | Valid stream with the schema message only, zero batches |
| Compression | Big win, since JSON is text | Smaller starting point; gzip/zstd still help on wide / repetitive cols, brotli usually skipped |
| Client cost | json.loads + per-row dict construction | pyarrow.ipc.open_stream(...).read_all() → zero-copy pyarrow.Table |
| Best for | Small responses, browsers, ad-hoc curl, dashboards | Bulk data into Polars / pandas / DuckDB-on-the-client, ML feature pipelines |

When to pick which. Use JSON when the consumer is JavaScript, the response is small (<~10k rows), or you're poking at the API by hand. Use Arrow IPC when you're moving result pages into a dataframe library, the schema has non-string types you want preserved, or page sizes are large enough that JSON parse time shows up in profiles.

# JSON (default)
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

# Arrow IPC — via Accept header
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/vnd.apache.arrow.stream' \
  --output result.arrow \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

# Arrow IPC — via query string (handy when you can't set headers)
curl -X POST 'http://localhost:8080/api/v1/datasets/accidents/query?format=arrow' \
  -H 'Content-Type: application/json' \
  --output result.arrow \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

import requests
import pyarrow.ipc as ipc

url = "http://localhost:8080/api/v1/datasets/accidents/query"
req = {"predicates": [{"col": "State", "op": "eq", "val": "TX"}]}
r = requests.post(url, json=req, headers={"Accept": "application/vnd.apache.arrow.stream"})
table = ipc.open_stream(r.content).read_all()  # → pyarrow.Table
page  = int(r.headers["X-Page"])
size  = int(r.headers["X-Page-Size"])

Supported on both backends — DuckDB streams batches out via its native query_arrow API, DataFusion uses its Arrow plan directly. The Compress middleware still applies. count, schema, and the dataset-listing endpoints are JSON-only.
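
The two ways of asking for Arrow can be summarised in a small Python sketch. It is not the server's negotiation logic: `wants_arrow` is a hypothetical helper, and it assumes that either signal alone is enough to select Arrow and that Accept quality values are ignored.

```python
from urllib.parse import parse_qs

ARROW_MIME = "application/vnd.apache.arrow.stream"


def wants_arrow(accept_header: str, query_string: str) -> bool:
    """True if the request asks for Arrow IPC via ?format=arrow or Accept."""
    # Query-string switch, handy when headers cannot be set.
    if parse_qs(query_string).get("format") == ["arrow"]:
        return True
    # Accept header: compare media types only, dropping parameters like ;q=0.9.
    media_types = [
        part.split(";")[0].strip().lower() for part in accept_header.split(",")
    ]
    return ARROW_MIME in media_types
```

Anything else, including a missing Accept header or `*/*`, falls through to the JSON default in this sketch.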

Grouping / aggregation

When group_by is non-empty the SELECT list is derived from the group columns plus each aggregation's output alias — the top-level columns field is ignored. Supported ops: count, sum, avg, min, max (case-insensitive). col may be omitted only for count (= COUNT(*)). If aggregations is omitted an implicit COUNT(*) AS count is added.

curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "group_by": ["State"],
    "aggregations": [
      { "op":  "count" },
      { "col": "Severity", "op": "avg", "alias": "avg_sev" }
    ],
    "order_by": [{ "col": "count", "dir": "desc" }],
    "page_size": 10
  }'
# → { "data": [ { "State": "CA", "count": 1741433, "avg_sev": 2.21 }, ... ], ... }

aggregations without group_by returns 400. order_by keys must reference a group column or an aggregation alias (no arbitrary dataset columns — they are not in scope after GROUP BY). Grouped queries always go through the SQL engine; no in-memory fast path applies.

Distinct rows

distinct: true deduplicates on the projected columns. Useful for building dropdowns / facet lists.

curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns":  ["State"],
    "distinct": true,
    "order_by": [{ "col": "State" }],
    "page_size": 100
  }'
# → { "data": [ { "State": "AL" }, { "State": "AR" }, ... ], ... }

Mutually exclusive with group_by / aggregations (returns 400 if combined). Also bypasses the in-memory fast paths.
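
The shape rules for grouped and distinct queries can be collected into one Python check. This is a sketch of the documented constraints rather than the server's validator; `grouped_output_columns` is a hypothetical name, and falling back to the op name when an aggregation has no alias is an assumption (the text only confirms it for the implicit `count`).

```python
from typing import Any, Dict, List

AGG_OPS = {"count", "sum", "avg", "min", "max"}


def grouped_output_columns(req: Dict[str, Any]) -> List[str]:
    """Return the output columns of a grouped query; raise ValueError for a 400."""
    group_by = req.get("group_by") or []
    aggs = req.get("aggregations") or []
    if aggs and not group_by:
        raise ValueError("aggregations requires group_by")
    if req.get("distinct") and (group_by or aggs):
        raise ValueError("distinct is mutually exclusive with group_by / aggregations")
    if not group_by:
        return []  # not a grouped query; the columns field applies instead
    if not aggs:
        aggs = [{"op": "count"}]  # implicit COUNT(*) AS count
    outputs = list(group_by)
    for agg in aggs:
        op = agg["op"].lower()  # ops are case-insensitive
        if op not in AGG_OPS:
            raise ValueError(f"unsupported aggregation: {agg['op']}")
        if "col" not in agg and op != "count":
            raise ValueError(f"{op} requires col")
        # Assumption: without an alias the output column is named after the op.
        outputs.append(agg.get("alias") or op)
    for key in req.get("order_by") or []:
        if key["col"] not in outputs:
            raise ValueError(f"order_by column not in scope: {key['col']}")
    return outputs
```

Applied to the grouped example above, the function returns `["State", "count", "avg_sev"]`, which is also the set of names `order_by` may reference.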

POST /api/v1/datasets/{name}/count

Returns the number of rows matching predicates. Same predicate shape as /query; only the predicates field is read. Empty body counts every row.

curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
  -H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }

curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3   }
    ]
  }'
# → { "count": 187423 }

On materialised DataFusion datasets the no-predicate path is O(1) (uses the resident chunk metadata, no scan); indexable predicates short-circuit through the equality index. Otherwise it runs SELECT COUNT(*) … WHERE … through the engine.

POST /api/v1/sql (raw SQL — opt-in)

Runs a single read-only SELECT / WITH … SELECT (or DESCRIBE <table>) that references exactly one registered dataset by its configured name. Disabled by default — while off the route returns 404, so its presence isn't even revealed. Every statement is parsed and validated (no file functions, ATTACH, COPY, PRAGMA, DDL or DML) before any engine sees it.

Enable it with a top-level [sql] block:

[sql]
enabled  = false     # set true to expose POST /api/v1/sql (default false)
max_rows = 100000    # server-side hard cap; result wrapped in an outer LIMIT

Example request body:

{ "sql": "SELECT State, COUNT(*) AS n FROM accidents GROUP BY State", "max_rows": 500 }

max_rows is clamped into [1, sql.max_rows] and can never raise the server cap; omit it to use the configured cap. Like /query, the response is content-negotiated — send Accept: application/vnd.apache.arrow.stream (or ?format=arrow) for an Arrow IPC stream instead of the JSON envelope.
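
The clamping rule is small enough to state in a few lines of Python. `effective_max_rows` is a hypothetical name used only for this illustration.

```python
from typing import Optional


def effective_max_rows(requested: Optional[int], server_cap: int) -> int:
    """Clamp a per-request max_rows into [1, server_cap]."""
    if requested is None:
        return server_cap  # omitted: use the configured cap
    return max(1, min(requested, server_cap))
```

With the default cap of 100000, a request for 500 rows keeps 500, while a request for a billion rows is cut back to 100000.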

POST /api/v1/datasets/{name}/reload (admin)

Rebuilds the dataset from its configured source and publishes the new contents without a server restart. Running queries finish against a consistent old snapshot; later queries see the new data. If the rebuild fails, the previously published dataset stays live.

Requires X-Admin-Token: $ADMIN_TOKEN. If ADMIN_TOKEN is unset the endpoint is disabled — the secure default. The comparison is constant-time.

curl -s -X POST \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  http://localhost:8080/api/v1/datasets/accidents/reload
# { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }

| Status | Body | Meaning |
| --- | --- | --- |
| 200 | { dataset, rows, elapsed_ms } | New data live. |
| 403 | { "error": "forbidden: …" } | Token missing/wrong, or ADMIN_TOKEN not set. |
| 404 | { "error": "not found: dataset: …" } | No such dataset in datasets.toml. |
| 500 | { "error": "internal error: …" } | Parquet read failed; old data stays live. |

Concurrent reloads of the same dataset are serialised (per-name mutex); reloads of different datasets run in parallel.
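
The two guards around reload, a constant-time token check and a per-dataset lock, can be illustrated in Python. This is a sketch of the ideas only, not the Rust implementation; `admin_token_ok` and `reload_lock` are hypothetical names.

```python
import hmac
import threading
from collections import defaultdict
from typing import DefaultDict, Optional


def admin_token_ok(expected: Optional[str], presented: Optional[str]) -> bool:
    """Constant-time comparison; an unset ADMIN_TOKEN disables the endpoint."""
    if not expected or presented is None:
        return False  # maps to 403 in the status table
    return hmac.compare_digest(expected.encode(), presented.encode())


_registry_lock = threading.Lock()
_reload_locks: DefaultDict[str, threading.Lock] = defaultdict(threading.Lock)


def reload_lock(dataset: str) -> threading.Lock:
    """One lock per dataset name: same-name reloads queue, others run freely."""
    with _registry_lock:  # protect the dict itself from concurrent inserts
        return _reload_locks[dataset]
```

A handler would first reject the request unless `admin_token_ok(...)` is true, then run the rebuild inside `with reload_lock(name):`.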

Backend-specific reload semantics

  • DataFusion uses a service-level double buffer. The backend builds a fresh DatasetState off to the side (parquet/Delta read, Arrow RecordBatch chunks, equality indexes, partition metadata), registers the new provider, then publishes it with an ArcSwap snapshot update. Queries that already captured the old Arc keep running; later queries see the new state. The old buffers are dropped once the last reader releases its reference. Trade-off: for materialised datasets, peak RSS can approach roughly twice the dataset size plus index overhead during reload.
  • DuckDB delegates publication to the database engine. Reload runs CREATE OR REPLACE TABLE ... AS SELECT ... against the dataset source. DuckDB treats that as an ACID transaction over the table/catalog replacement: if the source read or table creation fails, the existing table remains live; if it succeeds, later queries see the replacement atomically. In-flight queries continue against the snapshot they started with through DuckDB's transaction/MVCC semantics. DataPress then refreshes only the small cached schema and row-count metadata.

The HTTP contract is the same for both backends: clients observe either the old dataset or the new dataset, never a partially loaded one. The resource profile differs: DataFusion owns the Arrow buffers in process; DuckDB relies on DuckDB's storage engine and buffer manager.
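
The "old or new, never partial" contract can be demonstrated with a minimal snapshot holder in Python. It mirrors the double-buffer idea described for DataFusion but is not that code: the `Snapshot` class is hypothetical, and it relies on attribute assignment being atomic in CPython where the Rust backend uses ArcSwap.

```python
import threading
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Snapshot(Generic[T]):
    """Hold the published dataset state and swap it in one step."""

    def __init__(self, initial: T) -> None:
        self._current = initial
        self._reload_lock = threading.Lock()

    def load(self) -> T:
        # Readers keep whatever reference they captured, even across a reload.
        return self._current

    def reload(self, build: Callable[[], T]) -> T:
        with self._reload_lock:  # serialise concurrent reloads
            fresh = build()      # if this raises, the old state stays live
            self._current = fresh  # publish: later load() calls see the new state
            return fresh
```

While `build()` runs, both the old and the new state are alive, which is the reason peak memory can approach twice the dataset size during a reload.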


Examples

# Discovery
curl -s http://localhost:8080/api/v1/datasets | jq
curl -s http://localhost:8080/api/v1/datasets/accidents/schema | jq

# Equality + range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns": ["ID","Severity","City","State","Start_Time"],
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3 }
    ],
    "page": 1, "page_size": 5
  }' | jq

# Substring + numeric range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "Description",    "op": "ilike", "val": "%fog%" },
      { "col": "Temperature(F)", "op": "lt",    "val": 32 }
    ],
    "page_size": 10
  }' | jq

# IN list
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "State", "op": "in", "val": ["NY","NJ","CT"] }
    ]
  }' | jq

For a deeper benchmark catalogue (light load + CPU/memory stress tests), see TEST_Q.md.


Project specifics

Core re-exports compile without any backend; each backend crate adds the feature flag it needs on datapress-core. The Python crate depends on both backends, so the wheel can dispatch between them at runtime based on DataPressConfig(backend=...).


Build flags

# DuckDB only
cargo build --release -p datapress-duckdb

# DataFusion only
cargo build --release -p datapress-datafusion

# Both Rust binaries
task build

# Python wheel (compiles both backends into one extension)
task py:develop     # editable install into ./.venv (uses uv + maturin)
task py:build       # release wheel into ./target/wheels/

Release builds use thin LTO (see [profile.release] in Cargo.toml); fat LTO was dropped because it OOM-killed rustc when cross-building the aarch64 wheel under QEMU. Expect somewhat longer link times in exchange for tighter inner loops.


Environment variables

| Variable | Default | Purpose |
| --- | --- | --- |
| DATASETS_CONFIG | datasets.toml | Path to the dataset registry file. |
| ADMIN_TOKEN | (unset) | Enables POST /api/v1/datasets/{name}/reload. Unset = admin endpoints disabled. |
| DB_POOL_SIZE | num_cpus | DuckDB connection pool size (DuckDB only). |
| RUST_LOG | info | Standard env_logger filter. |
| AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN | (unset) | Fallback S3 credentials used by any dataset that doesn't override them. |
| AWS_REGION | us-east-1 | Fallback S3 region. |
| ${PREFIX}_AWS_* | (unset) | Per-dataset overrides for the four AWS_* vars above. See "Credential precedence" under [dataset.s3]. |

Bind address, port, worker count and backend selection live in [server] in datasets.toml, not in env vars.


Status / non-goals

  • No rate-limiting on query routes — put this behind your own gateway. Authentication is opt-in: build with --features auth for OIDC / OAuth2 bearer enforcement (see "Authentication" above). The reload admin route is additionally gated by a shared-secret header (X-Admin-Token) and disabled unless ADMIN_TOKEN is set.
  • No write path: parquet sources are read-only. The only mutation is reloading a dataset from disk via the admin route.
  • No cursor pagination — pagination is plain OFFSET / LIMIT, so deep pages get expensive (see H5 in TEST_Q.md). ORDER BY is supported via the order_by field, but sorted queries always go through the SQL engine (no in-memory fast path).
  • DataFusion backend keeps the whole dataset in memory. DuckDB does not.

About

A config-driven Rust server that publishes Parquet and Delta datasets as fast, typed HTTP APIs from local disk or object storage, with interchangeable DuckDB or Arrow+DataFusion backends, JSON and Arrow IPC output, and production-ready features like auth, metrics, and hot reloads.
