Distributed observability-first streaming lakehouse platform for modern analytics and AI workloads.
PulseLake is structured as a single codebase that supports two operating modes through configuration rather than branch divergence:
- Local mode: NATS + MinIO + Trino + Grafana/Tempo/Loki/Prometheus via Docker Compose
- Cloud mode: Pub/Sub + GCS + Google Cloud observability primitives via Terraform
The current foundation implements:
- Go ingestion API with schema validation, trace and event identifiers, and broker publishing
- Distributed worker with lease acquisition, batching, retries, DLQ object landing, checkpointing, raw batch archiving, and Parquet writes
- Coordination service for partition ownership and checkpoints
- Replay CLI that republishes landed raw batches by date window and tenant filter
- Synthetic generator CLI for payment, clickstream, and IoT event pressure
- Local observability stack and Terraform/CI scaffolding
The diagram above shows the end-to-end platform across both deployment modes in one view.
What is implemented in this repository today:
- Event producers publish through the ingestion API at
/v1/events - The API validates payloads against Avro schemas and normalizes
event_id,trace_id, andtenant_id - Messages flow through NATS JetStream in local mode or Google Pub/Sub in cloud mode
- Workers batch events by partition, acquire leases, write raw NDJSON and Parquet objects, and persist checkpoints
- Replay republishes archived raw batches by date window and tenant filter
- OpenTelemetry traces and metrics are emitted from the API, worker, replay, and coordinator services
Current scope and limits:
- The worker writes Parquet objects plus
_pulsemanifest files to object storage - The local stack includes Iceberg REST and Trino wiring for the warehouse path
- This repository does not yet include a catalog commit layer that turns landed Parquet batches into managed Iceberg table snapshots
- Terraform provisions storage, messaging, IAM, and observability resources; it does not load data into BigQuery by itself
cmd/ Service entrypoints
services/ Business logic and platform adapters
deployments/local/ Local runtime stack
infrastructure/ Terraform environments and modules
observability/ Collector, metrics, tracing, and dashboard config
schemas/ Versioned Avro event schemas
datasets/ Sample, public, and synthetic dataset assets
Prerequisites:
- Docker and Docker Compose
- Go 1.23+
Start the full local platform and seed synthetic events:
make demoKey endpoints:
- API:
http://localhost:8080/v1/events - Coordinator:
http://localhost:8081/healthz - Trino:
http://localhost:8082 - Grafana:
http://localhost:3000(admin/admin) - Prometheus:
http://localhost:9090 - MinIO Console:
http://localhost:9001
Generate more synthetic load:
make generateReplay a tenant/date range:
make replaymake local or make demo starts:
apiworkercoordinatorgeneratornatsminioiceberg-resttrinootel-collectorprometheustempolokigrafana
Example event:
{
"trace_id": "uuid",
"tenant_id": "tenant_001",
"event_type": "payment",
"amount": 1500,
"currency": "THB",
"timestamp": "2026-05-14T10:00:00Z"
}Every event is normalized with:
trace_idevent_idtenant_id
Partition layout:
warehouse/tenant_id=tenant_001/event_type=payment/date=2026-05-14/batch-*.parquet
raw/tenant_id=tenant_001/event_type=payment/date=2026-05-14/batch-*.ndjson
dlq/2026-05-14/*.json
Terraform environments live under infrastructure/terraform/environments/{dev,staging,prod} and use reusable modules for:
- pubsub
- gcs
- bigquery
- iam
- observability
Validate locally:
make terraform-planThe repository includes:
ci.ymlforterraform fmt,terraform validate,tflint,checkov,go test, and Docker buildsdeploy.ymlfor Workload Identity Federation based deployment without static JSON keys
- The worker lands replayable raw batches and Parquet objects into the same object-store abstraction used by local and cloud modes.
- The local stack includes Iceberg REST catalog and Trino wiring around the warehouse storage path.
- A future catalog commit layer can consume
_pulsemanifests and promote landed Parquet batches into managed table snapshots.
