A personal data integration platform built on Cloudflare that brings your data together for querying, analysis, and dashboarding — with a privacy-preserving twist: the compute engine runs in your own browser.
| Component | Status |
|---|---|
| OAuth 2.0 worker (Google, PKCE, DCR, JWT, refresh rotation) | ✅ Done |
Email allowlist: allowed_emails table gates new signups; existing users unaffected |
✅ Done |
| D1 schema: users, oauth tables, credentials, storage_backends, allowed_emails | ✅ Done |
| Credential + storage backend vault (AES-GCM encrypted in D1) | ✅ Done |
| S3-compatible storage proxy: KV-backed proxy credentials, GET/HEAD/PUT/LIST/OPTIONS with CORS and ETags | ✅ Done |
Partitioned writes and hive-partitioned reads: DuckDB COPY TO … PARTITION_BY and read_parquet('…/**/*.parquet', hive_partitioning=true) |
✅ Done |
r2://name/key URI scheme — universal for both backend types, resolved by name first, ID fallback |
✅ Done |
| DuckDB-WASM query engine in the browser | ✅ Done |
| Query UI: URI resolver + SQL editor + results table | ✅ Done |
| Settings UI: manage credentials and storage backends; inline edit for backends | ✅ Done |
http credential type: configurable headers with {{variable}} templates |
✅ Done |
HTTP data source proxy: http-ds:// URI scheme + token endpoint for DuckDB |
✅ Done |
| Settings test dialog: call any HTTP data source path and see the raw response | ✅ Done |
| Catalog Durable Object: tables, snapshots, commits; per-user DO isolation | ✅ Done |
| Catalog UI: commit snapshots, edit URI/format, URI convention docs | ✅ Done |
| Browser auto-registers DuckDB views from catalog on startup | ✅ Done |
| Load jobs: cron-triggered HTTP→R2/S3 ETL with catalog commit, Queue-based execution | ✅ Done |
| Load Jobs UI: create/edit/delete jobs, "Run now" trigger, last-run status | ✅ Done |
| Session Durable Object: WebSocket hibernation, MCP query relay, transform job dispatch on connect | ✅ Done |
MCP server: Streamable HTTP (2025-03-26), get_warehouse_schema / list_data_sources / run_query / read_data tools |
✅ Done |
| Transform jobs + triggers: catalog DO queue, session DO dispatch, browser DuckDB execution | ✅ Done |
Multi-table trigger coordination: policy: 'any'/'all', watches as JSON array, last_completed_at freshness tracking |
✅ Done |
| Google Sheets credential type: OAuth 2.0 popup flow, encrypted refresh token in D1 | ✅ Done |
| Google Sheets load jobs: cron-triggered Sheets API v4 → NDJSON → R2 → catalog | ✅ Done |
Google Sheets S3 proxy backend: GET returns JSON (for read_json), PUT writes via Sheets values API; key = sheet tab name |
✅ Done |
Credential test endpoint for google-sheets type: verifies token refresh works |
✅ Done |
Catalog WebSocket broadcast: GET /catalog/ws + GET /catalog/snapshots-latest; commit broadcasts to all connected tabs |
✅ Done |
Live view sync: catalogWs.ts + refreshSingleView; amber navbar flash; Promise-chained catalog readiness for transform jobs |
✅ Done |
Batch catalog init: single LEFT JOIN replaces N+1 fetch; http-ds:// URIs batch-resolved before view creation |
✅ Done |
Catalog state lifted to App: persists across tab switches; QueryPanel receives state as props |
✅ Done |
Dashboard platform: submit_dashboard MCP tool (50 KB limit, sandbox iframe security model); REST list/get/delete |
✅ Done |
| Dashboard viewer: sandboxed iframe (React 18 + Babel standalone + Recharts, pinned versions); re-runs on catalog commit | ✅ Done |
Proxy mode: GET /api/table-data/:tableName streams raw JSON/NDJSON from catalog without DuckDB |
✅ Done |
URL routing with preact-iso; direct links to dashboards (/dashboards/dash_xxx) |
✅ Done |
| DuckDB session toggle: default off on coarse-pointer/mobile; persisted in localStorage | ✅ Done |
catalog://tableName URI in read_data MCP tool; get_warehouse_schema direct-access indicator per table |
✅ Done |
Dashboard slugs: auto-generated from title, PATCH /api/dashboards/:id REST endpoint, lookup by slug or id |
✅ Done |
Dashboard versioning: dashboard_snapshots table auto-snapshots on every update and delete |
✅ Done |
MCP list_dashboards, get_dashboard, update_dashboard tools — full dashboard lifecycle without leaving Claude |
✅ Done |
PWA manifest + icons: manifest.webmanifest, 192px/512px icons, theme-color meta; main app is installable |
✅ Done |
| Per-dashboard home screen shortcuts: Cloudflare Pages Functions serve per-slug HTML with dashboard title and dedicated manifest | ✅ Done |
Android PWA OAuth fix: PKCE verifier/state stored in __Host- cookies (shared across PWA + Custom Tab contexts) |
✅ Done |
| Standalone mode: navbar hidden on dashboard viewer pages; back button/title hidden when opened as PWA shortcut | ✅ Done |
Saved queries: GET/POST/DELETE /api/saved-queries, D1 saved_queries table, workbench sidebar integration |
✅ Done |
HTTP load job cursor pagination: pagination_config JSON, cursor-based multi-page fetch with streaming multipart upload |
✅ Done |
Workbench IDE: full VS Code-style IDE at /workbench; activity rail, resizable sidebar/dock, tab strip, ⌘K palette, Console REPL, History |
✅ Done |
| Workbench SQL editor: CodeMirror 6 with catalog-seeded autocomplete, per-kind tab views (SQL, table, transform, dashboard, credential, backend, job) | ✅ Done |
| Workbench dashboard editor: create/edit/delete dashboards with JS artifact + SQL query editors and sandboxed live preview | ✅ Done |
Open issues are tracked via bd list (Beads issue tracker).
A signed-in user can query files in R2 or any S3-compatible backend from DuckDB running in the browser. The frontend requests short-lived proxy credentials from the worker (POST /api/storage/proxy-credentials), configures DuckDB's httpfs with a CREATE SECRET, and all storage traffic flows through /api/storage/s3proxy on the worker.
Seed a file to try it:
wrangler r2 object put data-shack-storage/sample.ndjson --file sample.ndjson
Then query it from the Query tab:
-- Single file (r2-bound backend — use the backend name from Settings)
SELECT * FROM read_json('r2://primary-r2/sample.ndjson') LIMIT 100
-- Hive-partitioned dataset — use backend name (r2-bound or r2-s3compat)
-- Worker serves the ListObjectsV2 DuckDB needs to discover partitions
SELECT * FROM read_parquet('r2://my-s3-backend/data/**/*.parquet', hive_partitioning=true)The r2://backendName/key scheme works for both r2-bound and r2-s3compat backends. The Worker resolves the name to the appropriate backend at request time. The legacy r2-s3compat://backendId/key scheme still works for backwards compatibility.
The r2://backendName/key URI scheme is translated to s3://backendName/key internally — DuckDB never sees raw storage credentials. The Worker resolves the backend name (or ID for the legacy r2-s3compat:// scheme) at credential-vend time.
DuckDB writes natively to the S3 proxy using the same proxy credential as reads. Partitioned writes work because DuckDB drives every PUT through the credential — no special setup required:
-- Single file (r2-bound backend named "primary-r2")
COPY (SELECT * FROM read_json('r2://primary-r2/raw.ndjson'))
TO 'r2://primary-r2/output.parquet' (FORMAT PARQUET)
-- Partitioned by column — DuckDB generates one file per partition automatically
-- Works with any backend type; use the backend name from Settings
COPY (SELECT date_trunc('month', created_at) AS month, amount FROM transactions)
TO 'r2://my-s3-backend/data/spending' (FORMAT PARQUET, PARTITION_BY (month), OVERWRITE_OR_IGNORE true)No CORS policy is needed on the upstream bucket. All storage traffic goes through the worker proxy endpoint, which re-signs requests to the real backend before forwarding.
The Catalog tab lets you register files in object storage as named tables that DuckDB automatically sees on startup.
-
Go to the Catalog tab and fill in the commit form:
- Table name — e.g.
transactions - URI — e.g.
r2://primary-r2/transactions/2026-05.parquet - Storage backend — the backend name from Settings (e.g.
primary-r2) - Format — leave as Auto to infer from the file extension, or pick explicitly if the extension is misleading (e.g. a
.jsonfile that is actually NDJSON)
- Table name — e.g.
-
URI convention — use
r2://backendName/pathfor all backend types:r2://primary-r2/folder/file.parquet— backend nameprimary-r2is resolved by the Worker; forr2-bound, everything after the first/is scoped to your user namespace in the bucket.r2://my-s3-backend/folder/file.parquet— same scheme forr2-s3compatbackends; the path is relative to the bucket root.- The legacy
r2-s3compat://backendId/pathscheme still works if you have existing URIs with backend IDs.
-
Go to Settings → Storage Backends to rename or edit a backend's config. The Edit button opens a dialog pre-filled with the decrypted config.
-
Open the Query tab. The catalog is loaded automatically — registered tables appear as badges and are available as DuckDB views:
SELECT * FROM transactions LIMIT 100
Both
r2://andhttp-ds://snapshot URIs are supported.r2://tables create views via S3 proxy credentials;http-ds://tables are resolved to token URLs and createread_jsonviews. Tables with unresolvable URIs (missing files, wrong backend, expired token) show a ⚠ badge but don't block other tables from loading. -
To correct a committed snapshot, click Edit on its row in the Catalog tab to update the URI or format in-place.
External HTTP APIs can be queried directly from DuckDB using the http credential type and the http-ds:// URI scheme:
- Go to Settings → Credentials and add a credential with type
http. The form pre-fills a template — fill in yourbaseUrl, headers (using{{variableName}}for secrets), andvariables. For Akahu:{ "baseUrl": "https://api.akahu.io/v1", "headers": { "Authorization": "Bearer {{userToken}}", "X-Akahu-Id": "{{appToken}}" }, "variables": { "userToken": "user_token_...", "appToken": "app_..." } } - Use the Test button on the credential row to verify it works against a specific path.
- In the Query tab, use the credential ID in an
http-ds://URI:The worker resolves the URI to a short-lived signed proxy URL. DuckDB fetches through the Worker, which injects the configured auth headers before forwarding to the API.SELECT * FROM read_json('http-ds://cred_abc123/accounts') LIMIT 10
The Load Jobs tab lets you define cron-triggered jobs that pull from an HTTP API and write the response to a storage backend, committing a new snapshot URI to the catalog.
- Go to Settings → Storage Backends and add an
r2-boundbackend (bucket =data-shack-storage) if you want to write to the primary R2 bucket. - Go to Settings → Credentials and add an
httpcredential for your API. - Open the Load Jobs tab and click New job:
- Table name — SQL identifier for the catalog table (e.g.
accounts) - Storage path — optional custom directory within the backend (defaults to table name). For
r2-bound, this is relative to your user namespace; forr2-s3compat, relative to the bucket root. - HTTP credential — the credential to use for the upstream API call
- Storage backend — where to write the output file
- HTTP Path — path appended to the credential's base URL (e.g.
/v1/accounts) - Format —
ndjson,json,csv, orparquet - Cron schedule — standard 5-field cron expression (e.g.
0 * * * *= hourly)
- Table name — SQL identifier for the catalog table (e.g.
- Click Run now to trigger an immediate run without waiting for the cron. The job fetches the API response, streams it to storage (using
FixedLengthStreamwhenContent-Lengthis available, buffering otherwise), and commits the file URI to the catalog. - Each run creates a new timestamped file (
{tableDir}/load-{timestamp}.{ext}). Old files are retained; the catalog always points to the latest snapshot.
The Settings tab stores storage backend configs (encrypted in D1). Backends can be created, renamed (via the Edit button), and deleted. Both r2-bound and r2-s3compat backends are fully dispatched — r2-bound uses the Worker's R2 binding directly; r2-s3compat re-signs upstream SigV4 requests using config from D1. Full dispatch for s3, gcs, and azure backend types is deferred to a later stage.
The system is built on Cloudflare's stack (Workers, D1, R2, Durable Objects, Pages) with OAuth 2.0 (Google) and JWT-based authentication. SQL execution happens in a browser-local DuckDB-WASM instance rather than on the server. The server orchestrates; your browser computes.
The Worker proxy acts as an S3-compatible endpoint: the browser requests short-lived proxy credentials from POST /api/storage/proxy-credentials, configures DuckDB's httpfs with a CREATE OR REPLACE SECRET, and all subsequent storage operations (GET, HEAD, PUT, ListObjectsV2) flow through /api/storage/s3proxy on the Worker. Storage backends are pluggable — the Worker translates the S3 protocol to R2 binding calls (for r2-bound) or re-signed upstream S3 requests (for r2-s3compat), and DuckDB is unaware of which backend it's talking to.
A Cloudflare Pages frontend hosts a DuckDB-WASM instance. The browser is the query engine — all SQL execution, including joins across sources, runs locally. Raw data never passes through a central compute server.
The browser communicates with the Cloudflare stack in two ways:
- Catalog WebSocket — a long-lived connection to the catalog Durable Object. The browser receives a broadcast whenever new data commits, and re-runs any affected queries automatically.
- Session WebSocket — a connection to the session Durable Object, which enables the MCP server to route queries into the browser and stream results back.
For data access, the browser requests a resolved HTTPS URL from the Worker proxy storage resolver and reads Parquet and NDJSON files directly from the storage backend via httpfs. For transform job outputs, the browser requests a writable URL and writes Parquet directly to the storage backend. The browser never knows which backend it is talking to — it always receives a plain HTTPS URL.
Cloudflare Workers sit between all clients (browser, ETL Workers, MCP server) and the backend. The Worker proxy is responsible for:
- Validating Bearer JWTs on every request
- Decrypting source credentials and storage backend config from D1 at request time
- Resolving storage URIs to readable or writable HTTPS URLs — signed URLs for R2/S3/GCS, SAS tokens for Azure, passthrough for public HTTPS endpoints — with URI prefix validation before any writable URL is issued
- Forwarding catalog and job queue operations to the catalog Durable Object
For the primary R2 backend, ETL Workers use the native R2 Worker binding directly. For other backend types they use fetch() with credentials from D1. The catalog and browser are unaware of which mechanism was used.
The catalog DO is the source of truth for all warehouse metadata. It holds a SQLite database with:
tables— table definitions (name, description, created_at)snapshots— every storage URI that constitutes a table, with backend ID, access mode, format, and timestampcommits— a log of every change, enabling time-travel queriesjobs— the pending/claimed/done queue for transform jobstriggers— the mapping from input tables to downstream transform jobs
Each user gets their own DO instance, isolated by user ID. All writes are processed single-threadedly inside the DO isolate, eliminating race conditions between concurrent ETL Workers. After every commit, the DO broadcasts { type: "commit", ... } to all connected browser tabs over WebSocket via ctx.getWebSockets().
This is a distinct class from the session DO. The catalog DO holds persistent state that outlives any browser session. The session DO is ephemeral — it exists only to route MCP queries to an active browser tab.
The session DO pairs an MCP query request with an active browser tab. When the browser connects, it registers a WebSocket with the session DO. When the MCP server receives a query, it forwards it through the session DO to the browser, and streams the result back. If no browser tab is open, the MCP server reports that no session is available rather than hanging.
On browser connect, the session DO also checks the job queue for pending transform jobs and dispatches them to the browser.
Data enters the warehouse via two job types with a clear semantic split:
Load jobs run entirely server-side in Cloudflare Workers on a cron schedule. They have no browser dependency and are always runnable. A load job pulls from an external API, writes raw data to a configured storage backend (as NDJSON for simplicity, or Parquet for large batches), and commits the new snapshot URI to the catalog DO. The load job definition records which storage_backend ID to write to; the Worker looks up the backend config from D1. On completion, the catalog DO evaluates the triggers table and enqueues any downstream transform jobs.
Transform jobs require a browser DuckDB session. They read from storage (resolving snapshot URIs to HTTPS URLs via the Worker proxy), register each input as a named DuckDB view, run a SQL query, write the result as Parquet to a storage backend via a Worker-resolved writable URL, and commit the new snapshot URI to the catalog DO. If the browser session closes mid-job, the session DO resets the job to pending and the next browser session re-claims it. Transform jobs are idempotent by construction — the output URI and input URIs are deterministic, so re-running produces the same result.
The primary transform job type is compaction: accumulating NDJSON staging files are merged into a single sorted, ZSTD-compressed Parquet file with row group statistics that enable DuckDB to skip irrelevant row groups on filtered queries.
Trigger-on-commit is the scheduling mechanism. There is no separate DAG scheduler. After every catalog commit, the DO checks whether all watched input tables for any pending transform trigger have commits newer than the trigger's last run. If so, it enqueues the transform job. This handles both single-input triggers (compact transactions after every Akahu load) and multi-input triggers (derive monthly_spending only after both transactions and budget have fresh commits).
Supported data sources:
- Akahu — NZ banking API (load job, cron-triggered)
- Google Sheets — via OAuth (load job, cron-triggered; OAuth flow handled in the browser)
- R2 — Parquet and NDJSON files directly
- D1 — structured tables
An MCP server exposes the warehouse to AI clients like Claude. Because the catalog DO holds all schema and snapshot metadata, many operations do not require an active browser session:
Currently implemented:
get_warehouse_schema— table list, schemas, snapshot metadata; includes a direct-access indicator per table (no browser session required)list_data_sources— lists HTTP credentials with their names and base URLs (no browser session required)run_query— execute SQL in the browser DuckDB session; requires an active browser tabread_data— read JSON/NDJSON from anhttp-ds://,r2://, orcatalog://tableNameURI (no browser session required, 1 MB limit)submit_dashboard— persist a React + Recharts artifact (up to 50 KB) with bound SQL queries; accepts optionalslugfor URL-friendly names; renders in a sandboxed iframe in the Dashboards tab (no browser session required)list_dashboards— list all saved dashboards with id, slug, title, and creation date (no browser session required)get_dashboard— retrieve full source and queries for a dashboard by id or slug (no browser session required)update_dashboard— update title, artifact, queries, and/or slug; auto-snapshots the previous version (no browser session required)
Planned (not yet implemented):
list_etl_jobs— active job definitions and schedulescreate_load_job— define a new source connector and cron schedulecreate_transform_job— define a derived table with its SQL and input dependenciespause_etl_job— disable a job without deleting it
Dashboards are React artifacts with props bound to SQL queries, persisted to D1 and rendered in a sandboxed iframe.
Authoring workflow:
- The user iterates on visualisations and calculations with Claude inside Claude.ai, with live warehouse data via the MCP server and browser session.
- Once satisfied, Claude calls
submit_dashboardvia MCP — the Worker validates the artifact size (50 KB limit) and persists it to D1 alongside the bound SQL queries. - Navigate to the Dashboards tab to see the list of saved dashboards.
- Click Open to load a dashboard: the Worker fetches the detail, runs the bound queries (in DuckDB if the session toggle is on, or via the proxy endpoint if it's off), and renders the artifact in a
sandbox="allow-scripts"iframe. Theallow-same-originflag is intentionally absent — the iframe cannot access parent-origin cookies, localStorage, or sessionStorage. - The dashboard subscribes to the catalog WebSocket and re-runs its queries automatically whenever a relevant table commits.
Dashboard URL routing: Each dashboard has a direct URL (/dashboards/dash_xxx). Sharing the link opens the viewer directly.
Proxy mode: When the DuckDB session toggle is off (default on mobile), dashboards fall back to GET /api/table-data/:tableName, which streams raw JSON/NDJSON from storage through the Worker. Single-table queries only — JOINs require DuckDB.
Iframe runtime: React 18.3.1, Babel standalone 7.26.4, Recharts 2.13.3 — all pinned to exact versions to prevent silent breakage.
Storage backends are configured in D1's storage_backends table with a UNIQUE(user_id, name) constraint. The catalog stores URIs using the r2://backendName/key scheme (universal for all backend types). The Worker resolves the backend name to its configuration at credential-vend time via getStorageBackendByNameOrId (name lookup first, ID fallback for legacy URIs). DuckDB's httpfs is unaware of which backend it is reading from.
| Backend type | Proxy mechanism | Notes |
|---|---|---|
r2-bound |
S3 proxy → R2 Worker binding | Primary bucket in same Cloudflare account |
r2-s3compat |
S3 proxy → re-signed upstream S3 request | R2 in another account, or any S3-compatible service |
s3, gcs, azure |
Not yet implemented | Deferred to a later stage |
https |
Passthrough (domain allowlist check) | Publicly accessible data |
| Operation | Mechanism |
|---|---|
| ETL Worker writes (primary R2) | Native Worker binding — direct R2 access |
| ETL Worker writes (other backends) | fetch() with credentials from D1 |
| Browser reads / writes / lists storage | S3 proxy (/api/storage/s3proxy) with KV-backed proxy credentials; Worker forwards to real backend |
| Catalog and credential operations | Proxied through Worker; small payloads |
Writable URLs are only resolved for URIs matching the user's namespace and expected schema. The Worker rejects out-of-namespace requests before resolving. Adding a new storage backend requires a resolver case in the Worker proxy and a row in storage_backends — no catalog schema changes or browser changes required.
- Browser is the compute boundary. Query execution, transform jobs, and data compaction all run in DuckDB-WASM in the browser. Raw data never passes through a central compute server.
- Two DO classes with distinct jobs. The catalog DO holds persistent warehouse state. The session DO is ephemeral — no storage, exists only to route MCP queries.
- D1 for credentials and backend config only. The catalog DO owns all warehouse metadata. D1 is a credential vault and storage backend registry, not a catalog.
- Load jobs are always runnable. Transform jobs are conditionally runnable. Load jobs have no browser dependency. Transform jobs queue until a browser session claims them.
- Trigger-on-commit, not a DAG scheduler. The catalog DO fires transform jobs reactively. Chains work naturally — a transform's completion commit triggers its own downstream transforms.
- AI-first authoring. Dashboards and analyses are built conversationally with Claude, then submitted as validated artifacts.
- Storage is backend-agnostic. The catalog stores URIs, not R2 keys. The Worker proxy resolves URIs to HTTPS URLs at access time. The browser and DuckDB never know which storage backend they are talking to. New backends are added by implementing a resolver case in the Worker proxy and registering a row in
storage_backends— no catalog or browser changes required. - Encrypted at rest. Data source credentials and storage backend config are encrypted in D1; only Workers decrypt them at request time. The browser never sees raw credentials.
- Control plane through the Worker, data plane direct. The Worker proxy handles auth, metadata, and credential operations. Large data transfers go browser↔storage directly via Worker-resolved URLs.
- Cloudflare-native by default, not by constraint. R2 is the primary storage backend but not the only one. Workers, D1, Durable Objects, and Pages do the orchestration; the data layer is replaceable.
- Whether to surface pending transform jobs in the UI with progress indicators or run them silently in the background
- Compaction granularity — per-sync, per-day, or per-month Parquet files as the default partition scheme
- Whether to expose time-travel queries (querying a table at a past commit ID) via the MCP server or only in the browser UI