PulseLake

A distributed, observability-first streaming lakehouse platform for modern analytics and AI workloads.

What This Repository Provides

PulseLake is a single codebase that supports two operating modes, selected through configuration rather than separate branches:

Local mode: NATS + MinIO + Trino + Grafana/Tempo/Loki/Prometheus via Docker Compose
Cloud mode: Pub/Sub + GCS + Google Cloud observability primitives via Terraform

The current foundation implements:

Go ingestion API with schema validation, trace and event identifiers, and broker publishing
Distributed worker with lease acquisition, batching, retries, DLQ object landing, checkpointing, raw batch archiving, and Parquet writes
Coordination service for partition ownership and checkpoints
Replay CLI that republishes landed raw batches by date window and tenant filter
Synthetic generator CLI for payment, clickstream, and IoT event pressure
Local observability stack and Terraform/CI scaffolding

Architecture

The architecture diagram (pulselake_architecture.png in the repository root) shows the end-to-end platform across both deployment modes in one view.

What is implemented in this repository today:

Event producers publish through the ingestion API at /v1/events
The API validates payloads against Avro schemas and normalizes event_id, trace_id, and tenant_id
Messages flow through NATS JetStream in local mode or Google Pub/Sub in cloud mode
Workers batch events by partition, acquire leases, write raw NDJSON and Parquet objects, and persist checkpoints
Replay republishes archived raw batches by date window and tenant filter
OpenTelemetry traces and metrics are emitted from the API, worker, replay, and coordinator services
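
The per-partition batching step in the list above can be sketched as follows. The `Batcher` type, its size-only flush rule, and the `Event` fields are assumptions made for illustration; the actual worker may also flush on a timer or on byte size.

```go
package main

import "fmt"

// Event carries only the fields needed for partitioning in this sketch.
type Event struct {
	TenantID  string
	EventType string
	Date      string // YYYY-MM-DD, assumed to be derived from the event timestamp
	Payload   string
}

// PartitionKey mirrors the tenant_id / event_type / date layout.
type PartitionKey struct {
	TenantID, EventType, Date string
}

// Batcher buffers events per partition and hands back a batch once
// maxSize events have accumulated for that partition.
type Batcher struct {
	maxSize int
	pending map[PartitionKey][]Event
}

func NewBatcher(maxSize int) *Batcher {
	return &Batcher{maxSize: maxSize, pending: make(map[PartitionKey][]Event)}
}

// Add buffers e. When the partition's buffer reaches maxSize, Add returns
// the full batch with true and resets that partition's buffer.
func (b *Batcher) Add(e Event) (PartitionKey, []Event, bool) {
	key := PartitionKey{e.TenantID, e.EventType, e.Date}
	b.pending[key] = append(b.pending[key], e)
	if len(b.pending[key]) < b.maxSize {
		return key, nil, false
	}
	batch := b.pending[key]
	delete(b.pending, key)
	return key, batch, true
}

func main() {
	b := NewBatcher(2)
	events := []Event{
		{"tenant_001", "payment", "2026-05-14", "a"},
		{"tenant_001", "clickstream", "2026-05-14", "b"},
		{"tenant_001", "payment", "2026-05-14", "c"},
	}
	for _, e := range events {
		if key, batch, full := b.Add(e); full {
			fmt.Printf("flush %s/%s/%s with %d events\n",
				key.TenantID, key.EventType, key.Date, len(batch))
		}
	}
}
```

In this sketch the clickstream event stays buffered because its partition never reaches the size limit, which is why a real worker would normally pair the size rule with a time-based flush.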

Current scope and limits:

The worker writes Parquet objects plus _pulse manifest files to object storage
The local stack includes Iceberg REST and Trino wiring for the warehouse path
This repository does not yet include a catalog commit layer that turns landed Parquet batches into managed Iceberg table snapshots
Terraform provisions storage, messaging, IAM, and observability resources; it does not load data into BigQuery by itself

Repository Layout

cmd/               Service entrypoints
services/          Business logic and platform adapters
deployments/local/ Local runtime stack
infrastructure/    Terraform environments and modules
observability/     Collector, metrics, tracing, and dashboard config
schemas/           Versioned Avro event schemas
datasets/          Sample, public, and synthetic dataset assets

Local Demo

Prerequisites:

Docker and Docker Compose
Go 1.23+

Start the full local platform and seed synthetic events:

make demo

Key endpoints:

API: http://localhost:8080/v1/events
Coordinator: http://localhost:8081/healthz
Trino: http://localhost:8082
Grafana: http://localhost:3000 (admin / admin)
Prometheus: http://localhost:9090
MinIO Console: http://localhost:9001

Generate more synthetic load:

make generate

Replay a tenant/date range:

make replay

Local Runtime Services

Running make local or make demo starts the following services:

api
worker
coordinator
generator
nats
minio
iceberg-rest
trino
otel-collector
prometheus
tempo
loki
grafana

Event Contract

Example event:

{
  "trace_id": "uuid",
  "tenant_id": "tenant_001",
  "event_type": "payment",
  "amount": 1500,
  "currency": "THB",
  "timestamp": "2026-05-14T10:00:00Z"
}

Every event is normalized with:

trace_id
event_id
tenant_id
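
As an illustration of what normalizing these three identifiers might involve, the sketch below fills in a missing event_id or trace_id and rejects an empty tenant_id. Which fields the ingestion API generates versus requires is not stated here, so that policy, along with the `Normalize` and `newID` names, is an assumption.

```go
package main

import (
	"crypto/rand"
	"errors"
	"fmt"
	"strings"
)

// Event holds the identifier fields from the event contract.
type Event struct {
	EventID   string `json:"event_id"`
	TraceID   string `json:"trace_id"`
	TenantID  string `json:"tenant_id"`
	EventType string `json:"event_type"`
}

// newID returns a random identifier in UUID version 4 text form.
func newID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// Normalize trims tenant_id, requires it to be non-empty, and generates
// event_id and trace_id when the producer did not supply them.
func Normalize(e Event) (Event, error) {
	e.TenantID = strings.TrimSpace(e.TenantID)
	if e.TenantID == "" {
		return e, errors.New("tenant_id is required")
	}
	for _, id := range []*string{&e.EventID, &e.TraceID} {
		if *id != "" {
			continue
		}
		v, err := newID()
		if err != nil {
			return e, err
		}
		*id = v
	}
	return e, nil
}

func main() {
	e, err := Normalize(Event{TenantID: " tenant_001 ", EventType: "payment"})
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("%+v\n", e)
}
```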

Partition layout:

warehouse/tenant_id=tenant_001/event_type=payment/date=2026-05-14/batch-*.parquet
raw/tenant_id=tenant_001/event_type=payment/date=2026-05-14/batch-*.ndjson
dlq/2026-05-14/*.json

Terraform

Terraform environments live under infrastructure/terraform/environments/{dev,staging,prod} and use reusable modules for:

pubsub
gcs
bigquery
iam
observability

Validate locally:

make terraform-plan

GitHub Actions

The repository includes:

ci.yml for terraform fmt, terraform validate, tflint, checkov, go test, and Docker builds
deploy.yml for deployment through Workload Identity Federation, without static JSON keys

Notes

The worker lands replayable raw batches and Parquet objects into the same object-store abstraction used by local and cloud modes.
The local stack includes Iceberg REST catalog and Trino wiring around the warehouse storage path.
A future catalog commit layer can consume _pulse manifests and promote landed Parquet batches into managed table snapshots.
