datap-rs

A Rust Cargo workspace that exposes one or more Parquet / Delta datasets over a JSON HTTP API. The same surface area is implemented twice — once on top of DuckDB, once on top of Apache Arrow + DataFusion — so you can A/B the engines under identical workloads. A Python wheel (datap-rs, built with maturin + PyO3) bundles both engines and lets you configure and launch the server from Python.

Overview presentation → datap-rs.org · Documentation

- Built on actix-web 4
- Datasets declared in a single datasets.toml (Rust binaries) or programmatically (Python wrapper)
- Dynamic schema inference at startup (no hard-coded columns)
- Identical request/response shapes across both backends

Quick start

For testing, we use the Kaggle US Accidents (2016-2023) dataset.

# 1. Put a parquet file somewhere (or point the config at an existing one).
ls data/accidents.parquet

# 2. Edit datasets.toml — see the example shipped in this repo.

# 3. Run a backend.
task run:duckdb        # or: task run:datafusion

# 4. Talk to it.
curl http://localhost:8080/api/v1/datasets

Taskfile.yml wraps the typical cargo build --release -p … invocations; see task --list for the full menu.

Install the prebuilt binary

The quickest way to get the unified datapress binary (both backends bundled, selected at runtime via server.backend) without a Rust toolchain:

# Linux / macOS
curl -LsSf https://datap-rs.org/install.sh | sh

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://datap-rs.org/install.ps1 | iex"

# Homebrew (macOS / Linux)
brew install jeroenflvr/tap/datapress

# winget (Windows)
winget install datap-rs.DataPress

# Docker (mount a config that sets listen = "0.0.0.0")
docker run --rm -p 8080:8080 \
  -v "$PWD/datasets.toml:/etc/datapress/datasets.toml:ro" \
  jeroenflvr/datapress:latest

The install scripts drop the binary in a per-user directory (~/.local/bin on Unix, %LOCALAPPDATA%\datapress\bin on Windows) and tell you how to add it to your PATH. See packaging/ for details and release automation.

Prefer cargo? Install from crates.io:

cargo install datapress        # both DuckDB + DataFusion
datapress                      # reads ./datasets.toml (or $DATASETS_CONFIG)

For a slimmer single-backend build, or to opt into the docs / Swagger / metrics / auth features:

cargo install datapress --no-default-features --features duckdb
cargo install datapress --features swagger,auth,metrics

The installed binary resolves its config from (first match wins) --config <FILE>, $DATAPRESS_CONFIG_FILE, ./datasets.toml, then $HOME/datasets.toml. Generate a starter template with datapress init (writes datasets.toml.template to a directory, or $HOME when omitted):

datapress init                 # ~/datasets.toml.template
cp ~/datasets.toml.template ~/datasets.toml   # then edit and run `datapress`
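
The first-match-wins lookup described above can be pictured with a small Python sketch. It is not the binary's actual code: the function name `resolve_config_path` and the injectable `exists` callback are assumptions made for illustration, and so is the choice to return an explicit `--config` or environment value without checking that the file exists. Only the four candidates and their order come from the text.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional


def resolve_config_path(
    cli_config: Optional[str],
    env: Mapping[str, str],
    cwd: Path,
    home: Path,
    exists: Callable[[Path], bool] = lambda p: p.is_file(),
) -> Optional[Path]:
    """Return the first config candidate that applies, or None."""
    # 1. An explicit --config <FILE> always wins.
    if cli_config:
        return Path(cli_config)
    # 2. Then the environment variable named in the text.
    from_env = env.get("DATAPRESS_CONFIG_FILE")
    if from_env:
        return Path(from_env)
    # 3. and 4. Fall back to the two well-known locations, in order.
    for candidate in (cwd / "datasets.toml", home / "datasets.toml"):
        if exists(candidate):
            return candidate
    return None
```

Calling `resolve_config_path(None, os.environ, Path.cwd(), Path.home())` (after `import os`) would mimic a run without `--config`.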

From Python

The same server can be configured and launched from Python via the datapress wheel (one wheel, both engines bundled):

import asyncio
from datap_rs.datapress import DataPress, DataPressConfig, DatasetConfig

async def main():
    ds = DatasetConfig(
        name="accidents",
        source="data/accidents.parquet",
        format="parquet",   # or "delta"
        mode="auto",        # index policy: "auto" | "none" | "list"
    )
    cfg = DataPressConfig(backend="duckdb", listen="0.0.0.0", port=8000, workers=8)
    server = DataPress(cfg, datasets=[ds])
    await server.run()      # blocks until SIGINT

asyncio.run(main())

Build the wheel with task py:develop (uses uv + maturin).

Standalone clients

The datapress wheel above can both run a server and talk to it. If you only need to talk to an already-running server, three standalone clients share one lightweight Rust core (datapress-client) and pull in no server engines:

Client	Package	Install
Command line	`datapress-cli`	install script · `cargo install datapress-cli`
Python	`datap-rs-client`	`uv pip install datap-rs-client[arrow]`
Rust library	`datapress-client`	`cargo add datapress-client`

# CLI: install script (Linux / macOS)
curl -LsSf https://datap-rs.org/install-cli.sh | sh
# CLI: install script (Windows)
powershell -ExecutionPolicy ByPass -c "irm https://datap-rs.org/install-cli.ps1 | iex"

datapress-cli datasets
datapress-cli query accidents --select State,Severity --where Severity:gte:3 --page-size 1000

# Python: dict in, dict out; query_arrow() returns a pyarrow.Table for
# Polars / pandas / DuckDB / PySpark / DataFusion.
from datap_rs_client import DataPressClient

client = DataPressClient("http://127.0.0.1:8000")
table = client.query_arrow("accidents", columns=["State", "Severity"], page_size=100_000)

See the clients documentation for the full reference. The CLI install scripts behave like the server's — per-user directory, checksum-verified, no PATH edits — using DATAPRESS_CLI_VERSION / DATAPRESS_CLI_INSTALL_DIR overrides.

The two backends

Aspect	`datapress-duckdb`	`datapress-datafusion`
Engine	DuckDB (embedded C++)	Arrow compute + DataFusion (pure Rust)
Storage	DuckDB in-memory table per dataset	One contiguous `RecordBatch` per dataset
Concurrency model	Connection pool, blocking → `web::block`	Async-native, multi-threaded `MemTable` partitions
Predicate execution	DuckDB optimiser + parallel hash/vector ops	Equality index → SIMD scan → DataFusion SQL
Indexes	Native DuckDB internals (zone maps, etc.)	Per-dataset eq-index built at startup (configurable)
Memory profile	DuckDB's own buffer manager	Whole dataset resident in RAM
Binary size	Bundled DuckDB ≈ tens of MB	Lean — pure Rust
Startup time	Fast (just `read_parquet`)	Slower — reads all rows + builds eq-index
Best at	Heterogeneous SQL, joins, aggregations	Dense filter scans, low-latency point lookups

When to pick which

DuckDB is the right default. It handles arbitrary SQL well, has a battle-tested optimiser, manages memory itself, and starts up in milliseconds because it lazily reads parquet pages on demand.
DataFusion shines when:
- the dataset fits comfortably in RAM,
- you query the same columns repeatedly with equality/IN predicates (the in-process equality index turns those into O(1) lookups), and
- you want a single static binary without a vendored C++ runtime.

The HTTP API is identical, so the practical comparison is "throughput and p99 on your queries" — see TEST_Q.md for a benchmark suite.

Configuration: `datasets.toml`

Every instance reads this file at startup. One [server] block plus one [[dataset]] entry per table you want to expose.

[server]
backend = "datafusion"   # "datafusion" (default) | "duckdb"
listen  = "127.0.0.1"    # default; set to "0.0.0.0" to expose
port    = 8080
# workers = 8            # omit for one worker per CPU
# compress = true        # negotiate gzip/brotli/zstd via Accept-Encoding (default)
# max_body_bytes     = 1048576  # 413 above this; default 1 MiB
# max_page_size      = 100000   # clamp query page_size above this
# request_timeout_ms = 30000    # 504 above this; 0 disables; default 30s
# shutdown_timeout_secs = 30    # SIGTERM/SIGINT grace period, in seconds

# DuckDB backend only: enable the experimental Quack remote protocol.
# [server.quack]
# enabled = false
# uri = "quack:localhost"
# token = "change-me"
# read_only = true

[[dataset]]
name = "accidents"                    # used in the URL: /api/v1/datasets/accidents/...

  [dataset.source]
  kind     = "parquet"                # "parquet" | "delta"
  location = "data/accidents.parquet" # file, directory of *.parquet, or s3://…

  # Optional — DataFusion only. DuckDB ignores this block.
  [dataset.index]
  mode             = "auto"           # "auto" | "none" | "list"
  columns          = []               # required when mode = "list"
  max_cardinality  = 100000           # used by "auto" to skip wide cols

Server

Field	Default	Notes
`backend`	`datafusion`	Informational hint; logged at startup. Each binary always runs as its own backend regardless of this value.
`listen`	`127.0.0.1`	Loopback by default — the service is not exposed on a network interface unless you opt in.
`port`	`8080`
`workers`	(unset)	Actix worker threads. Unset = one per CPU.
`prefix`	`""`	URL path prefix mounted in front of every route (e.g. `"/datapress"`) — useful behind a reverse proxy that passes the path through unchanged. Must start with `/` and not end with `/`.
`compress`	`true`	Negotiate response compression via `Accept-Encoding` (gzip / brotli / zstd). Disable when sitting behind a proxy that compresses for you.
`max_body_bytes`	`1048576`	Maximum accepted JSON request body, in bytes. Bigger bodies are rejected with `413 Payload Too Large`.
`max_page_size`	`100000`	Maximum rows returned by one `/query` page. Larger `page_size` values are clamped.
`request_timeout_ms`	`30000`	Per-request handler timeout, in milliseconds. Long-running handlers are cancelled and the client gets `504 Gateway Timeout`. `0` disables the timeout.
`shutdown_timeout_secs`	`30`	Grace period for in-flight requests after the process receives `SIGTERM` / `SIGINT`, in seconds. The listening socket is closed immediately; existing connections then have up to this many seconds to finish before workers are force-stopped.

DuckDB builds can also opt into [server.quack], DuckDB's experimental remote protocol server. Keep it disabled unless you intentionally want DuckDB clients to attach/query this process directly. It binds to quack:localhost by default, uses token authentication, and DataPress installs a read-only authorization hook by default.

The server exposes three probe endpoints plus a /version metadata route. /healthz and /readyz are mounted at the bare host root (regardless of prefix) so orchestrators don't need to know how the service is exposed. /health lives under prefix and is intended for in-app health checks.

Route	Status	Body
`/healthz`	Liveness — always `200` while the process is running.	`{"status":"ok"}`
`/readyz`	Readiness — `200` once at least one dataset is registered, `503` otherwise.	`{"status":"ready","datasets":N}` / `{"status":"not ready","reason":"no datasets registered"}`
`/version`	Build / version metadata — always `200`.	`{"name":"datapress-core","version":"x.y.z","backend":"DuckDB\|DataFusion","profile":"debug\|release", ...}`
`{prefix}/health`	App-level liveness — always `200`.	`{"status":"ok"}`

/healthz does not touch the backend, so it stays 200 even while the dataset registry is still loading at startup. Use /readyz to gate traffic until the server is actually able to serve queries.
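
A deployment script can gate traffic on /readyz along these lines. This is a minimal sketch, not part of DataPress: `wait_until_ready` and `http_status` are hypothetical helper names, and the retry count and delay are arbitrary defaults. The status lookup is injected so the loop can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status for a GET, or 0 when the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while no dataset is registered
    except OSError:
        return 0  # connection refused, timeout, DNS failure


def wait_until_ready(
    get_status: Callable[[str], int] = http_status,
    base_url: str = "http://localhost:8080",
    attempts: int = 30,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll /readyz until it answers 200 or the attempts run out."""
    for attempt in range(attempts):
        if get_status(f"{base_url}/readyz") == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)
    return False
```

Polling /healthz instead would return 200 too early, since it does not wait for the dataset registry.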

/version also includes optional fields populated from build-time env vars when set: git_sha (DATAPRESS_GIT_SHA), build_time (DATAPRESS_BUILD_TIME, ISO-8601), and target (DATAPRESS_TARGET, e.g. aarch64-apple-darwin). Unset vars are omitted from the JSON. Example:

DATAPRESS_GIT_SHA=$(git rev-parse --short HEAD) \
DATAPRESS_BUILD_TIME=$(date -u +%Y-%m-%dT%H:%M:%SZ) \
DATAPRESS_TARGET=$(rustc -vV | awk '/host:/ {print $2}') \
  cargo build --release -p datapress-duckdb

Online documentation

DataPress can embed two browsable sources of documentation into the binary itself:

- An MkDocs Material site (the one you are reading) at [docs].path (default /mkdocs).
- An interactive Swagger UI with a hand-written OpenAPI spec at [swagger].path (default /docs). The raw spec is also exposed at <path>/openapi.json.

Both are opt-in at build time (so wheels stay slim when you don't want them) and enabled by default at runtime once compiled in — set enabled = false to disable in prod.

Build the MkDocs site (only needed for the docs feature):
```
task docs:build
```

Build the backend with one or both features:

cargo build --release -p datapress-duckdb --features docs,swagger

Tweak in datasets.toml if you want to relocate or disable either:

[docs]
enabled = true        # default: true
path    = "/mkdocs"   # default: /mkdocs

[swagger]
enabled = true        # default: true (set to false in prod)
path    = "/docs"     # default: /docs

Both path values must start with /, not end with /, not collide with /api, /api/v1, /health{z,}, /readyz, or /version, and must differ from each other. When the binary is built without the relevant feature but the TOML enables it, the server logs a warning at startup and continues without that surface.
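
The path rules above are easy to check before deploying a config. The sketch below restates them in Python; `validate_doc_paths` is a hypothetical helper, and treating "collide" as an exact match against the reserved routes is an assumption (the server may also reject nested paths).

```python
from typing import List

# Reserved routes listed in the text; /health{z,} expands to two entries.
RESERVED = {"/api", "/api/v1", "/health", "/healthz", "/readyz", "/version"}


def validate_doc_paths(docs_path: str, swagger_path: str) -> List[str]:
    """Return a list of rule violations; an empty list means both paths pass."""
    errors: List[str] = []
    for label, path in (("docs", docs_path), ("swagger", swagger_path)):
        if not path.startswith("/"):
            errors.append(f"[{label}].path must start with '/'")
        if path.endswith("/"):
            errors.append(f"[{label}].path must not end with '/'")
        if path in RESERVED:  # assumption: exact-match collision only
            errors.append(f"[{label}].path collides with reserved route {path}")
    if docs_path == swagger_path:
        errors.append("[docs].path and [swagger].path must differ")
    return errors
```

With the defaults, `validate_doc_paths("/mkdocs", "/docs")` returns an empty list.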

Browser explorer

Build with --features explorer to embed a small browser UI at /explore for poking at datasets — list / schema browsing plus an API Query tab that issues /query calls and decodes the Arrow IPC responses client-side. Like the docs surfaces it is compiled in opt-in and served by default once present.

Authentication (OIDC / OAuth2)

Build with --features auth to enable JWT bearer enforcement against any OpenID-Connect issuer (Entra ID, Auth0, Keycloak, Okta, …). When enabled, the server fetches the issuer's JWKS at startup, refreshes it in the background, and validates Authorization: Bearer <jwt> headers against the configured issuer, audience, algorithms, and scopes.

[auth]
enabled         = true
issuer          = "https://login.microsoftonline.com/<tenant-id>/v2.0"
audience        = "api://datapress"
algorithms      = ["RS256"]
read_scopes     = ["datasets:read"]
reload_scopes   = ["datasets:reload"]
anonymous_read  = false      # set true to keep read endpoints public
tenant_claim    = "/tid"     # JSON-pointer into the JWT claims
allowed_tenants = ["<tenant-id>"]
admin_token_fallback = true  # keep X-Admin-Token working in parallel

Health probes (/healthz, /readyz, /version) stay unauthenticated so load balancers keep working. The legacy X-Admin-Token header keeps working for POST .../reload as long as admin_token_fallback = true.

To turn the Swagger UI itself into an SSO client, add a [swagger.oauth2] block. It is rendered as an OpenIdConnect security scheme with PKCE.

Source

[dataset.source] is a tagged enum.

`kind`	`location`	Notes
`parquet`	a `.parquet` file	Read as-is.
`parquet`	a directory	Every `*.parquet` inside (sorted, non-recursive). No glob patterns.
`parquet`	`s3://bucket/key.parquet` or `s3://bucket/prefix/`	Requires a `[dataset.s3]` block. DuckDB autoloads `httpfs`.
`delta`	a local directory	Pointed at the table root (the dir containing `_delta_log/`).
`delta`	`s3://bucket/path/to/table`	Requires `[dataset.s3]`. DuckDB autoloads `delta`; DataFusion uses the `deltalake` crate.

S3 / S3-compatible storage

[[dataset]]
name = "events"

  [dataset.source]
  kind     = "parquet"           # or "delta"
  location = "s3://events/2025/"

  [dataset.s3]
  region            = "us-east-1"
  endpoint          = "http://localhost:9000"  # omit for AWS
  addressing_style  = "path"                   # "virtual" (default) | "path"
  allow_http        = true                     # only for non-https endpoints

Field	Default	Notes
`region`	`us-east-1`	Falls back to `AWS_REGION` env, then `us-east-1`.
`endpoint`	(unset)	Custom S3 endpoint (MinIO, R2, Wasabi, Backblaze, …).
`addressing_style`	`virtual`	`virtual` = `https://bucket.host`, `path` = `https://host/bucket` (MinIO).
`allow_http`	`false`	Must be `true` if `endpoint` is `http://…`.
`partitioning`	`auto`	Hive partition discovery: `auto`, `hive` (force on), `none` (force off).
`endpoint_bucket_in_host`	`auto`	Fold the bucket into the endpoint host: `auto` (follows `addressing_style`), `true`, `false`.
`access_key_id`, `secret_access_key`, `session_token`	(unset)	Inline creds. Discouraged for prod — use env vars instead.
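
To make the `addressing_style` and `allow_http` rows concrete, the sketch below builds an object URL both ways. It is illustrative only: `s3_object_url` and `endpoint_allowed` are hypothetical names, and the backends construct their URLs internally rather than through anything like this.

```python
from urllib.parse import urlsplit


def s3_object_url(endpoint: str, bucket: str, key: str,
                  addressing_style: str = "virtual") -> str:
    """Build an object URL for a custom endpoint in either addressing style."""
    parts = urlsplit(endpoint)
    key = key.lstrip("/")
    if addressing_style == "virtual":
        # virtual: the bucket becomes part of the host name
        return f"{parts.scheme}://{bucket}.{parts.netloc}/{key}"
    if addressing_style == "path":
        # path: the bucket is the first path segment (typical for MinIO)
        return f"{parts.scheme}://{parts.netloc}/{bucket}/{key}"
    raise ValueError(f"unknown addressing_style: {addressing_style!r}")


def endpoint_allowed(endpoint: str, allow_http: bool = False) -> bool:
    """Plain-http endpoints are only acceptable when allow_http is true."""
    return urlsplit(endpoint).scheme == "https" or allow_http
```

For the MinIO-style example config, `s3_object_url("http://localhost:9000", "events", "2025/a.parquet", "path")` yields `http://localhost:9000/events/2025/a.parquet`.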

Credential precedence (highest → lowest):

1. Per-dataset env vars: ${PREFIX}_AWS_ACCESS_KEY_ID, ${PREFIX}_AWS_SECRET_ACCESS_KEY, ${PREFIX}_AWS_SESSION_TOKEN, ${PREFIX}_AWS_REGION. PREFIX is the dataset name uppercased with every non-alphanumeric character mapped to _ (e.g. accidents → ACCIDENTS_AWS_…, my-bucket → MY_BUCKET_AWS_…).
2. Inline [dataset.s3] keys.
3. Plain AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, AWS_REGION.
4. The backend's default credential chain (~/.aws/credentials, IMDS, etc.).
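
The precedence list and the PREFIX rule can be expressed as a short lookup. This is a sketch under stated assumptions: `dataset_env_prefix` and `resolve_setting` are hypothetical names, "alphanumeric" is taken to mean ASCII letters and digits, and a `None` result stands for "fall through to the backend's default credential chain".

```python
import re
from typing import Mapping, Optional


def dataset_env_prefix(name: str) -> str:
    """Uppercase the dataset name and map non-alphanumeric characters to '_'."""
    return re.sub(r"[^A-Za-z0-9]", "_", name).upper()


def resolve_setting(dataset: str, key: str, env: Mapping[str, str],
                    inline: Mapping[str, str]) -> Optional[str]:
    """Resolve one AWS_* setting; key looks like 'AWS_ACCESS_KEY_ID'."""
    # 1. Per-dataset environment variable, e.g. MY_BUCKET_AWS_REGION.
    value = env.get(f"{dataset_env_prefix(dataset)}_{key}")
    if value:
        return value
    # 2. Inline [dataset.s3] key, e.g. access_key_id for AWS_ACCESS_KEY_ID.
    value = inline.get(key[len("AWS_"):].lower())
    if value:
        return value
    # 3. Plain AWS_* environment variable; 4. otherwise the default chain.
    return env.get(key) or None
```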

Python: the S3Config binding also accepts a credentials_provider — a zero-argument callable returning an HMACKeyPair. It is invoked once when DataPress(...) is constructed, the result is cached indefinitely, and it overrides any inline access_key_id / secret_access_key. See the Python S3 docs.

When kind = "delta" and location is an s3://… URL, both backends fully materialise the table at startup. There is no incremental scan path — switch to parquet if you need on-demand page reads.

Equality-index policy (DataFusion only)

The DataFusion backend builds an in-memory value -> [row ids] map at startup so that eq / in predicates resolve in O(1).

`mode`	Behaviour
`auto`	Index every column whose distinct count stays below `max_cardinality`.
`none`	Skip the index entirely — every query goes through DataFusion SQL.
`list`	Index only the named `columns`. Useful for huge datasets.
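
As a mental model, the equality index is a hash map from each distinct value to the row ids that hold it. The Python sketch below mirrors that idea and the `auto` cardinality cut-off; it is not the Rust implementation, and the function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Optional, Sequence


def build_eq_index(values: Sequence[Hashable],
                   max_cardinality: int) -> Optional[Dict[Hashable, List[int]]]:
    """Map value -> [row ids]; give up once the column proves too wide."""
    index: Dict[Hashable, List[int]] = defaultdict(list)
    for row_id, value in enumerate(values):
        index[value].append(row_id)
        if len(index) >= max_cardinality:
            return None  # distinct count did not stay below the limit
    return dict(index)


def lookup_eq(index: Dict[Hashable, List[int]], value: Hashable) -> List[int]:
    """eq predicate: one hash lookup, no scan."""
    return index.get(value, [])


def lookup_in(index: Dict[Hashable, List[int]],
              values: Iterable[Hashable]) -> List[int]:
    """in predicate: one lookup per distinct listed value, merged in row order."""
    rows: List[int] = []
    for value in dict.fromkeys(values):  # drop duplicate values, keep order
        rows.extend(index.get(value, []))
    return sorted(rows)
```

The lookup itself is O(1) per value; returning the matching rows still costs time proportional to how many rows match.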

Override the config path with DATASETS_CONFIG=/path/to/file.toml.

HTTP API

Both backends expose five core routes (list, schema, query, count, reload) plus an opt-in raw-SQL endpoint, all described below.

API versioning

The canonical paths live under /api/v1/.... The un-versioned /api/... paths continue to work as a legacy alias for v1, so existing clients keep running. To upgrade, replace /api/ with /api/v1/ in your URLs — nothing else changes.

POST /api/v1/datasets/accidents/query      # canonical (recommended)
POST /api/datasets/accidents/query         # legacy alias, still v1

When a breaking schema change is introduced, it will ship as /api/v2 in a module that sits next to the current one (crates/core/src/handlers/v1.rs), and v1 will stay mounted alongside it for a deprecation window.

`GET /api/v1/datasets`

{ "datasets": [ { "name": "accidents", "columns": 47 } ] }

`GET /api/v1/datasets/{name}/schema`

Returns the inferred columns plus a sample row so a client can see what values look like without issuing a query.

{
  "name": "accidents",
  "columns": [
    { "name": "ID",       "logical": "utf8", "sql_type": "VARCHAR",   "nullable": false },
    { "name": "Severity", "logical": "int",  "sql_type": "INTEGER",   "nullable": true  },
    { "name": "Start_Time", "logical": "temporal", "sql_type": "TIMESTAMP", "nullable": true }
  ],
  "sample": { "ID": "A-1", "Severity": 2, "Start_Time": "2016-02-08 05:46:00", ... }
}

`POST /api/v1/datasets/{name}/query`

{
  "columns":   ["ID", "City", "State", "Severity"],
  "predicates": [
    { "col": "State",    "op": "eq",  "val": "TX" },
    { "col": "Severity", "op": "gte", "val": 3   }
  ],
  "order_by": [
    { "col": "Severity", "dir": "desc" },
    { "col": "ID" }
  ],
  "limit":     1000,
  "page":      1,
  "page_size": 50
}

Response:

{ "data": [ { ... }, ... ], "page": 1, "page_size": 50 }

Request fields

Field	Type	Default	Notes
`columns`	`string[]`	`[]`	Empty = all columns.
`predicates`	`Predicate[]`	`[]`	ANDed together.
`order_by`	`OrderBy[]`	`[]`	`{ col, dir? }`; `dir` is `asc` (default) or `desc`, case-insensitive. When `group_by` is set, `col` must be a group column or aggregation alias.
`group_by`	`string[]`	`[]`	Columns to group by. When set, `columns` is ignored. Empty `aggregations` implies `[{ op: "count" }]`.
`aggregations`	`Aggregation[]`	`[]`	`{ col?, op, alias? }`; `op` is `count\|sum\|avg\|min\|max`. `col` may be omitted only for `count` (= `COUNT(*)`). Requires `group_by`.
`distinct`	`bool`	`false`	Dedup the projected columns. Mutually exclusive with `group_by` / `aggregations`.
`limit`	`int >= 0` or null	`null`	Hard cap on total rows across all pages. `null` = unlimited.
`page`	`int >= 1`	`1`	1-based.
`page_size`	`int >= 1`	`1000`	Clamped to `server.max_page_size` (`100_000` by default).
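
The interplay of `page`, `page_size`, and `limit` can be worked through as OFFSET / LIMIT arithmetic. The sketch below is an interpretation, not the server's code: `page_window` is a hypothetical helper, and the way `limit` truncates the last page is an assumption derived from "hard cap on total rows across all pages".

```python
from typing import Optional, Tuple


def page_window(page: int = 1, page_size: int = 1000,
                limit: Optional[int] = None,
                max_page_size: int = 100_000) -> Tuple[int, int]:
    """Return (offset, row_count) for one page of a query."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    if limit is not None and limit < 0:
        raise ValueError("limit must be >= 0 or None")
    size = min(page_size, max_page_size)  # clamp, do not reject
    offset = (page - 1) * size            # page is 1-based
    if limit is None:
        return offset, size
    remaining = max(limit - offset, 0)    # rows still allowed by the cap
    return offset, min(size, remaining)
```

With `limit=1000` and `page_size=400`, page 3 would cover offset 800 and only 200 rows, and page 4 would be empty.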

Predicate shape

{ "col": "<column>", "op": "<operator>", "val": <json value | array | omitted> }

`op`	`val`	Meaning
`eq`	scalar	`col = val`
`neq`	scalar	`col <> val`
`gt` / `gte`	number / string	`col > val` / `col >= val`
`lt` / `lte`	number / string	`col < val` / `col <= val`
`like`	string with `%` / `_`	SQL `LIKE`
`ilike`	string with `%` / `_`	Case-insensitive `LIKE`
`in`	non-empty array	`col IN (v1, v2, …)`
`is_null`	omit	`col IS NULL`
`is_not_null`	omit	`col IS NOT NULL`

Column names are looked up case-insensitively against the inferred schema and quoted automatically, so Temperature(F) and similar identifiers work.
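
A client can catch malformed predicates before sending them. The helper below is a hypothetical convenience (`predicate` is not part of any DataPress client) that encodes the operator table: scalar operators need a scalar, `in` needs a non-empty array, and the null checks take no `val`. Requiring a string for `like` / `ilike` is a stricter reading than the table strictly states.

```python
from typing import Any, Dict

SCALAR_OPS = {"eq", "neq", "gt", "gte", "lt", "lte", "like", "ilike"}
NO_VALUE_OPS = {"is_null", "is_not_null"}
_MISSING = object()  # distinguishes "val omitted" from an explicit None


def predicate(col: str, op: str, val: Any = _MISSING) -> Dict[str, Any]:
    """Build one predicate object, raising ValueError on a shape mismatch."""
    if op in NO_VALUE_OPS:
        if val is not _MISSING:
            raise ValueError(f"{op} takes no val")
        return {"col": col, "op": op}
    if val is _MISSING:
        raise ValueError(f"{op} requires val")
    if op == "in":
        if not isinstance(val, (list, tuple)) or len(val) == 0:
            raise ValueError("in requires a non-empty array")
        return {"col": col, "op": op, "val": list(val)}
    if op in SCALAR_OPS:
        if isinstance(val, (list, tuple, dict)):
            raise ValueError(f"{op} requires a scalar val")
        if op in {"like", "ilike"} and not isinstance(val, str):
            raise ValueError(f"{op} requires a string pattern")
        return {"col": col, "op": op, "val": val}
    raise ValueError(f"unknown op: {op!r}")
```

A request body is then `{"predicates": [predicate("State", "eq", "TX"), predicate("Severity", "gte", 3)]}`.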

Response format — JSON or Arrow IPC

/query can return its result set in two wire formats. Same body, same predicates, same pagination — only the response encoding differs.

Aspect	JSON (default)	Arrow IPC stream
Content-Type	`application/json`	`application/vnd.apache.arrow.stream`
How to ask	nothing — it's the default	`Accept: application/vnd.apache.arrow.stream` or `?format=arrow` on the URL
Shape	Envelope with a `data` array of row objects (`{ "data": [{...}, ...], "page": 1, "page_size": 50 }`)	Self-describing stream: 1 schema message + N `RecordBatch` messages + EOS
Layout	Row-oriented; column names repeated on every row	Columnar; one contiguous buffer per column per batch
Types preserved	Scalars become JSON (`int`/`float`/`bool`/`string`); temporals stringified to ISO-8601	Native Arrow types — `Int32`, `Timestamp(ns)`, `Decimal128`, dictionary, etc. retained end-to-end
Page metadata	In the body: `page` and `page_size` next to `data`	In headers: `X-Page`, `X-Page-Size`
Empty result	`"data": []`	Valid stream with the schema message only, zero batches
Compression	Big win — JSON is text	Smaller starting point; gzip/zstd still help on wide / repetitive cols, brotli usually skipped
Client cost	`json.loads` + per-row dict construction	`pyarrow.ipc.open_stream(...).read_all()` → zero-copy `pyarrow.Table`
Best for	Small responses, browsers, ad-hoc `curl`, dashboards	Bulk data into Polars / pandas / DuckDB-on-the-client, ML feature pipelines

When to pick which. Use JSON when the consumer is JavaScript, the response is small (<~10k rows), or you're poking at the API by hand. Use Arrow IPC when you're moving result pages into a dataframe library, the schema has non-string types you want preserved, or page sizes are large enough that JSON parse time shows up in profiles.

# JSON (default)
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

# Arrow IPC — via Accept header
curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -H 'Accept: application/vnd.apache.arrow.stream' \
  --output result.arrow \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

# Arrow IPC — via query string (handy when you can't set headers)
curl -X POST 'http://localhost:8080/api/v1/datasets/accidents/query?format=arrow' \
  -H 'Content-Type: application/json' \
  --output result.arrow \
  -d '{ "predicates": [{ "col": "State", "op": "eq", "val": "TX" }] }'

import requests, pyarrow.ipc as ipc
r = requests.post(url, json=req, headers={"Accept": "application/vnd.apache.arrow.stream"})
table = ipc.open_stream(r.content).read_all()  # → pyarrow.Table
page  = int(r.headers["X-Page"])
size  = int(r.headers["X-Page-Size"])

Supported on both backends — DuckDB streams batches out via its native query_arrow API, DataFusion uses its Arrow plan directly. The Compress middleware still applies. count, schema, and the dataset-listing endpoints are JSON-only.

Grouping / aggregation

When group_by is non-empty the SELECT list is derived from the group columns plus each aggregation's output alias — the top-level columns field is ignored. Supported ops: count, sum, avg, min, max (case-insensitive). col may be omitted only for count (= COUNT(*)). If aggregations is omitted an implicit COUNT(*) AS count is added.

curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "group_by": ["State"],
    "aggregations": [
      { "op":  "count" },
      { "col": "Severity", "op": "avg", "alias": "avg_sev" }
    ],
    "order_by": [{ "col": "count", "dir": "desc" }],
    "page_size": 10
  }'
# → { "data": [ { "State": "CA", "count": 1741433, "avg_sev": 2.21 }, ... ], ... }

aggregations without group_by returns 400. order_by keys must reference a group column or an aggregation alias (no arbitrary dataset columns — they are not in scope after GROUP BY). Grouped queries always go through the SQL engine; no in-memory fast path applies.

Distinct rows

distinct: true deduplicates on the projected columns. Useful for building dropdowns / facet lists.

curl -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns":  ["State"],
    "distinct": true,
    "order_by": [{ "col": "State" }],
    "page_size": 100
  }'
# → { "data": [ { "State": "AL" }, { "State": "AR" }, ... ], ... }

Mutually exclusive with group_by / aggregations (returns 400 if combined). Also bypasses the in-memory fast paths.
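
The grouping and distinct rules can be collected into a client-side pre-flight check that also predicts the output columns. This is a sketch, not the server's validator: `output_columns` is a hypothetical name, column matching here is exact rather than case-insensitive, and because the text only documents the default alias for `COUNT(*)`, the sketch insists on an explicit `alias` for every other aggregation.

```python
from typing import Any, Dict, List

AGG_OPS = {"count", "sum", "avg", "min", "max"}


def output_columns(body: Dict[str, Any]) -> List[str]:
    """Validate a /query body and return the column names it would produce."""
    group_by = body.get("group_by") or []
    aggs = body.get("aggregations") or []
    if body.get("distinct") and (group_by or aggs):
        raise ValueError("distinct cannot be combined with group_by / aggregations")
    if aggs and not group_by:
        raise ValueError("aggregations requires group_by")
    if not group_by:
        return list(body.get("columns") or [])  # empty list = all columns
    if not aggs:
        aggs = [{"op": "count"}]                # implicit COUNT(*) AS count
    out = list(group_by)
    for agg in aggs:
        op = str(agg["op"]).lower()             # ops are case-insensitive
        if op not in AGG_OPS:
            raise ValueError(f"unsupported op: {op}")
        if "col" not in agg and op != "count":
            raise ValueError(f"{op} requires col")
        alias = agg.get("alias")
        if alias is None:
            if op == "count" and "col" not in agg:
                alias = "count"
            else:
                raise ValueError("sketch limitation: give this aggregation an alias")
        out.append(alias)
    for key in body.get("order_by") or []:
        if key["col"] not in out:
            raise ValueError(f"order_by {key['col']!r} is not a group column or alias")
    return out
```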

`POST /api/v1/datasets/{name}/count`

Returns the number of rows matching predicates. Same predicate shape as /query; only the predicates field is read. Empty body counts every row.

curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
  -H 'Content-Type: application/json' -d '{}'
# → { "count": 7728394 }

curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/count \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3   }
    ]
  }'
# → { "count": 187423 }

On materialised DataFusion datasets the no-predicate path is O(1) (uses the resident chunk metadata, no scan); indexable predicates short-circuit through the equality index. Otherwise it runs SELECT COUNT(*) … WHERE … through the engine.

`POST /api/v1/sql` (raw SQL — opt-in)

Runs a single read-only SELECT / WITH … SELECT (or DESCRIBE <table>) that references exactly one registered dataset by its configured name. Disabled by default — while off the route returns 404, so its presence isn't even revealed. Every statement is parsed and validated (no file functions, ATTACH, COPY, PRAGMA, DDL or DML) before any engine sees it.

Enable it with a top-level [sql] block:

[sql]
enabled  = true      # exposes POST /api/v1/sql (default: false)
max_rows = 100000    # server-side hard cap; result wrapped in an outer LIMIT

{ "sql": "SELECT State, COUNT(*) AS n FROM accidents GROUP BY State", "max_rows": 500 }

max_rows is clamped into [1, sql.max_rows] and can never raise the server cap; omit it to use the configured cap. Like /query, the response is content-negotiated — send Accept: application/vnd.apache.arrow.stream (or ?format=arrow) for an Arrow IPC stream instead of the JSON envelope.
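
The clamping rule for `max_rows` amounts to a one-liner. The sketch below uses a hypothetical function name and the `100000` cap from the example config.

```python
from typing import Optional


def effective_max_rows(requested: Optional[int], server_cap: int = 100_000) -> int:
    """Clamp a request's max_rows into [1, server_cap]; omitted means the cap."""
    if requested is None:
        return server_cap
    return max(1, min(requested, server_cap))
```

A request asking for `500` rows gets 500, while one asking for a billion is held to the configured cap.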

`POST /api/v1/datasets/{name}/reload` (admin)

Rebuilds the dataset from its configured source and publishes the new contents without a server restart. Running queries finish against a consistent old snapshot; later queries see the new data. If the rebuild fails, the previously published dataset stays live.

Requires X-Admin-Token: $ADMIN_TOKEN. If ADMIN_TOKEN is unset the endpoint is disabled — the secure default. The comparison is constant-time.

curl -s -X POST \
  -H "X-Admin-Token: $ADMIN_TOKEN" \
  http://localhost:8080/api/v1/datasets/accidents/reload
# { "dataset": "accidents", "rows": 7728394, "elapsed_ms": 1842 }

Status	Body	Meaning
`200`	`{ dataset, rows, elapsed_ms }`	New data live.
`403`	`{ "error": "forbidden: …" }`	Token missing/wrong, or `ADMIN_TOKEN` not set.
`404`	`{ "error": "not found: dataset: …" }`	No such dataset in `datasets.toml`.
`500`	`{ "error": "internal error: …" }`	Parquet read failed — old data stays live.

Concurrent reloads of the same dataset are serialised (per-name mutex); reloads of different datasets run in parallel.
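
The token gate on the reload route can be illustrated in Python with `hmac.compare_digest`, which compares in constant time. This is an analogue of the behaviour described above, not the server's Rust code: `admin_token_ok` is a hypothetical name, and treating an empty `ADMIN_TOKEN` the same as an unset one is an assumption.

```python
import hmac
from typing import Optional


def admin_token_ok(header_token: Optional[str], admin_token: Optional[str]) -> bool:
    """Return True only when the X-Admin-Token header matches ADMIN_TOKEN."""
    if not admin_token:
        return False  # ADMIN_TOKEN unset: the endpoint stays disabled
    if header_token is None:
        return False  # header missing
    # Constant-time comparison, so timing does not leak how many bytes matched.
    return hmac.compare_digest(header_token.encode(), admin_token.encode())


def reload_status(header_token: Optional[str], admin_token: Optional[str]) -> int:
    """Map the check onto the status table: 403 unless the token is accepted."""
    return 200 if admin_token_ok(header_token, admin_token) else 403
```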

Backend-specific reload semantics

DataFusion uses a service-level double buffer. The backend builds a fresh DatasetState off to the side (parquet/Delta read, Arrow RecordBatch chunks, equality indexes, partition metadata), registers the new provider, then publishes it with an ArcSwap snapshot update. Queries that already captured the old Arc keep running; later queries see the new state. The old buffers are dropped once the last reader releases its reference. Trade-off: for materialised datasets, peak RSS can approach roughly twice the dataset size plus index overhead during reload.
DuckDB delegates publication to the database engine. Reload runs CREATE OR REPLACE TABLE ... AS SELECT ... against the dataset source. DuckDB treats that as an ACID transaction over the table/catalog replacement: if the source read or table creation fails, the existing table remains live; if it succeeds, later queries see the replacement atomically. In-flight queries continue against the snapshot they started with through DuckDB's transaction/MVCC semantics. DataPress then refreshes only the small cached schema and row-count metadata.

The HTTP contract is the same for both backends: clients observe either the old dataset or the new dataset, never a partially loaded one. The resource profile differs: DataFusion owns the Arrow buffers in process; DuckDB relies on DuckDB's storage engine and buffer manager.
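
The "old or new, never partial" contract can be modelled in a few lines of Python. This is only an analogue of the double-buffer idea: the class and method names are hypothetical, plain lists stand in for Arrow buffers, and it relies on CPython replacing a dict entry as a single reference swap rather than on anything like `ArcSwap`.

```python
import threading
from typing import Callable, Dict, List


class DatasetRegistry:
    """Readers grab a snapshot reference; reloads build aside, then swap."""

    def __init__(self) -> None:
        self._snapshots: Dict[str, List] = {}
        self._reload_locks: Dict[str, threading.Lock] = {}
        self._locks_guard = threading.Lock()

    def publish(self, name: str, rows: List) -> None:
        self._snapshots[name] = rows

    def snapshot(self, name: str) -> List:
        # A running query keeps this reference even if a reload lands later.
        return self._snapshots[name]

    def reload(self, name: str, load: Callable[[], List]) -> int:
        with self._locks_guard:
            lock = self._reload_locks.setdefault(name, threading.Lock())
        with lock:                         # same-name reloads are serialised
            fresh = load()                 # may raise: old snapshot stays live
            self._snapshots[name] = fresh  # publish with one reference swap
            return len(fresh)
```

Because each dataset has its own lock, reloads of different datasets do not block each other, and the old list is garbage-collected once the last reader drops it.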

Examples

# Discovery
curl -s http://localhost:8080/api/v1/datasets | jq
curl -s http://localhost:8080/api/v1/datasets/accidents/schema | jq

# Equality + range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "columns": ["ID","Severity","City","State","Start_Time"],
    "predicates": [
      { "col": "State",    "op": "eq",  "val": "TX" },
      { "col": "Severity", "op": "gte", "val": 3 }
    ],
    "page": 1, "page_size": 5
  }' | jq

# Substring + numeric range
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "Description",    "op": "ilike", "val": "%fog%" },
      { "col": "Temperature(F)", "op": "lt",    "val": 32 }
    ],
    "page_size": 10
  }' | jq

# IN list
curl -s -X POST http://localhost:8080/api/v1/datasets/accidents/query \
  -H 'Content-Type: application/json' \
  -d '{
    "predicates": [
      { "col": "State", "op": "in", "val": ["NY","NJ","CT"] }
    ]
  }' | jq

For a deeper benchmark catalogue (light load + CPU/memory stress tests), see TEST_Q.md.

Project specifics

Core re-exports compile without any backend; each backend crate adds the feature flag it needs on datapress-core. The Python crate depends on both backends, so the wheel can dispatch between them at runtime based on DataPressConfig(backend=...).

Build flags

# DuckDB only
cargo build --release -p datapress-duckdb

# DataFusion only
cargo build --release -p datapress-datafusion

# Both Rust binaries
task build

# Python wheel (compiles both backends into one extension)
task py:develop     # editable install into ./.venv (uses uv + maturin)
task py:build       # release wheel into ./target/wheels/

Release builds use thin LTO (see [profile.release] in Cargo.toml); fat LTO was dropped because it OOM-killed rustc when cross-building the aarch64 wheel under QEMU. Expect somewhat longer link times in exchange for tighter inner loops.

Environment variables

Variable	Default	Purpose
`DATASETS_CONFIG`	`datasets.toml`	Path to the dataset registry file.
`ADMIN_TOKEN`	(unset)	Enables `POST /api/v1/datasets/{name}/reload`. Unset = admin endpoints disabled.
`DB_POOL_SIZE`	`num_cpus`	DuckDB connection pool size (DuckDB only).
`RUST_LOG`	`info`	Standard `env_logger` filter.
`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`	(unset)	Fallback S3 credentials used by any dataset that doesn't override them.
`AWS_REGION`	`us-east-1`	Fallback S3 region.
`${PREFIX}_AWS_*`	(unset)	Per-dataset overrides for the four `AWS_*` vars above. See "Credential precedence" under `[dataset.s3]`.

Bind address, port, worker count and backend selection live in [server] in datasets.toml, not in env vars.

Status / non-goals

No rate-limiting on query routes — put this behind your own gateway. Authentication is opt-in: build with --features auth for OIDC / OAuth2 bearer enforcement (see "Authentication" above). The reload admin route is additionally gated by a shared-secret header (X-Admin-Token) and disabled unless ADMIN_TOKEN is set.
No write path: parquet sources are read-only. The only mutation is reloading a dataset from disk via the admin route.
No cursor pagination — pagination is plain OFFSET / LIMIT, so deep pages get expensive (see H5 in TEST_Q.md). ORDER BY is supported via the order_by field, but sorted queries always go through the SQL engine (no in-memory fast path).
DataFusion backend keeps the whole dataset in memory. DuckDB does not.
