This repo is the home for our dlt-* data ingestion platform: Cloudflare-hosted orchestration and Python dlt runners that move operational data into BigQuery.
The first source is Cloudflare Workers KV. The intended shape is broader: up to a dozen KV namespaces, internal APIs, SaaS APIs, webhook feeds, and other batch/incremental sources managed through one Cloudflare control plane.
This is not just a script. It is the beginning of a small data platform:
- Cloudflare runs the scheduler, job queue, operational ledger, raw staging, and containerized execution.
- dlt runs extraction, schema inference/evolution, normalization, and loading.
- BigQuery remains the analytical warehouse.
- R2 stores optional raw snapshots for replay and audit.
- D1 will store source/job/run metadata so humans and agents can inspect and manage jobs.
flowchart LR
U["Operators / agents / webhooks"] --> W["dlt-orchestrator Worker"]
CRON["Cron Triggers"] --> W
W --> D1["D1: dlt_control"]
W --> Q["Queue: dlt-jobs-v2"]
Q --> C["Container: dlt-runner"]
C --> SRC["KV / APIs / SaaS sources"]
C --> R2["R2: dlt-raw-staging"]
C --> BQ["BigQuery"]
C --> D1
Q --> DLQ["Queue: dlt-jobs-v2-dlq"]
Production is deployed on Cloudflare:
- Worker/API/admin:
https://dlt-orchestrator.hgdc.workers.dev - Container app:
dlt-orchestrator-dltrunner - Queue:
dlt-jobs-v2 - Dead letter queue:
dlt-jobs-v2-dlq - D1 database:
dlt_control - BigQuery dataset:
hello-gravel-data.source_cloudflare - Schedule: hourly at minute
0UTC - Runner size:
standard-1Cloudflare Container
Implemented:
- Python package scaffold in
src/dlt_cloudflare_kv/ - local CLI command:
dlt-kv - Cloudflare KV namespace listing
- KV key/value extraction
- JSON parsing and warehouse row shaping
- BigQuery destination wiring through
dlt - D1-backed source definitions and run ledger
- Worker operator API and minimal
/adminpage - Queue-backed manual and scheduled runs
- Containerized Python
dltrunner - Batched Cloudflare KV backfills using explicit key lists
- Incremental scheduled
leads_kvloads using recent key-prefix windows - Disabled
vendor_kvsource definition ready for first vendor snapshot
python -m venv .venv
source .venv/bin/activate
pip install -e .
cp .env.example .envFill in .env:
CLOUDFLARE_ACCOUNT_IDCLOUDFLARE_API_TOKEN- BigQuery credentials through
GOOGLE_APPLICATION_CREDENTIALSor dlt'sDESTINATION__BIGQUERY__...variables
Source definitions live outside .env:
cp config/sources.example.json config/sources.jsonAdd one entry per KV namespace or source job. .env is for secrets and account-level settings; config/sources.json is for non-secret pipeline configuration like namespace IDs, prefixes, datasets, tables, and write dispositions.
List available KV namespaces:
dlt-kv namespacesSample records before loading:
dlt-kv sources
dlt-kv keys --source-id cloudflare_kv_orders --limit 20
dlt-kv profile --source-id cloudflare_kv_orders --limit 100
dlt-kv profile --source-id cloudflare_kv_leads --limit 100
dlt-kv profile --source-id cloudflare_kv_vendors --limit 100
dlt-kv sample --source-id cloudflare_kv_orders --limit 20Load records to BigQuery:
dlt-kv load --source-id cloudflare_kv_orders
dlt-kv load --source-id cloudflare_kv_leads --limit 500The current loader preserves:
cf_kv_keycf_kv_namespace_idcf_kv_namespace_titlecf_kv_metadatacf_kv_expirationcf_kv_fetched_atcf_kv_value_is_jsoncf_kv_value_raw
If the KV value is a JSON object, its fields are expanded into BigQuery columns and the cf_kv_* audit fields are protected.
Humans and agents should interact with the platform through the dlt-orchestrator Worker rather than shelling into the runner.
Set these in .env:
DLT_ORCHESTRATOR_URL=https://dlt-orchestrator.hgdc.workers.dev
DLT_OPERATOR_TOKEN=...Open:
https://dlt-orchestrator.hgdc.workers.dev/admin?token=<DLT_OPERATOR_TOKEN>
The token is converted into an HttpOnly cookie, so refreshes and form-based runs work without keeping the token in the URL.
Implemented endpoints:
GET /health
GET /sources
POST /sources
GET /sources/:source_id
PUT /sources/:source_id
POST /sources/:source_id/runs
GET /runs
GET /runs/:run_id
Create and edit source definitions through POST /sources and PUT /sources/:source_id. Triggering a run returns a queued run_id; poll GET /runs/:run_id for completion.
Status-only local check:
.venv/bin/python scripts/check_deployment_status.pyLive limited-run verification:
.venv/bin/python scripts/verify_deployment.pyThe Worker stores job definitions and run history in D1. Operators can inspect D1 directly for debugging, but writes should go through the Worker API so validation and audit logging stay consistent.
Current tables:
sourcessource_versionsrunsrun_events
dlt-jobs-v2 holds pending work. dlt-jobs-v2-dlq holds jobs that exhausted retries. Operators should use the Worker API for normal retries; direct queue inspection is for incident response. The original dlt-jobs queue is retained only as incident history from the first large KV backfill attempt.
BigQuery is where analysts consume loaded tables. It is not the job-management surface. Pipeline metadata may be copied into BigQuery later for analytics, but D1 is the operational source of truth.
The local CLI remains useful for development, source discovery, and one-off backfills:
dlt-kv namespaces
dlt-kv sources
dlt-kv keys --source-id cloudflare_kv_orders --limit 20
dlt-kv profile --source-id cloudflare_kv_orders --limit 100
dlt-kv sample --source-id cloudflare_kv_orders --limit 20
dlt-kv load --source-id cloudflare_kv_orders --limit 500For production, prefer creating/running jobs through the Worker API.
Large KV namespaces are loaded as explicit key batches. The current tested leads batch size is 500 keys:
.venv/bin/python scripts/enqueue_kv_batches.py \
--source-id cloudflare_kv_leads \
--batch-size 500 \
--skip-batches 0 \
--max-batches 1Use --skip-batches to continue a backfill without replaying earlier batches. Loads use merge on cf_kv_key, so accidental overlap is safe but wastes time.
Scheduled cloudflare_kv_leads runs are not full namespace backfills. They use configured timestamp/date key prefixes over a recent lookback window, then merge by cf_kv_key. Full leads backfills should be launched manually with scripts/enqueue_kv_batches.py.
Normal manual runs for sources with extract.incremental.enabled use the same filtered incremental window as scheduled runs. Use explicit key batches when intentionally doing a full leads backfill.
cloudflare_kv_vendors is configured as a disabled snapshot source for source_cloudflare.vendor_kv. Enable and run it after VENDOR_KV contains records.
The control plane should treat every source as a configured job:
{
"source_id": "cloudflare_kv_orders",
"source_type": "cloudflare_kv",
"enabled": true,
"schedule": "0 * * * *",
"destination": {
"type": "bigquery",
"dataset": "source_cloudflare",
"table": "orders_kv",
"write_disposition": "replace"
},
"extract": {
"namespace_id": "KV_NAMESPACE_ID",
"namespace_title": "ORDERS_KV",
"prefix": null
},
"staging": {
"raw_snapshot": true,
"r2_prefix": "cloudflare_kv/orders"
}
}Locally, these jobs live in config/sources.json. In production, the same shape will move into the dlt_control D1 database behind the dlt-orchestrator API.
See Architecture and Operations for the fuller design.
Cloudflare resources should use the dlt- namespace:
dlt-orchestratordlt-runnerdlt-jobs-v2dlt-jobs-v2-dlqdlt-raw-stagingdlt_control
Warehouse datasets should be explicit by source domain, for example cloudflare_kv, posthog_raw, or klaviyo_raw, rather than hiding all data under a single generic dataset.