Skip to content

prochalo/pulselake

Repository files navigation

PulseLake

Distributed observability-first streaming lakehouse platform for modern analytics and AI workloads.

What This Repository Provides

PulseLake is structured as a single codebase that supports two operating modes through configuration rather than branch divergence:

  • Local mode: NATS + MinIO + Trino + Grafana/Tempo/Loki/Prometheus via Docker Compose
  • Cloud mode: Pub/Sub + GCS + Google Cloud observability primitives via Terraform

The current foundation implements:

  • Go ingestion API with schema validation, trace and event identifiers, and broker publishing
  • Distributed worker with lease acquisition, batching, retries, DLQ object landing, checkpointing, raw batch archiving, and Parquet writes
  • Coordination service for partition ownership and checkpoints
  • Replay CLI that republishes landed raw batches by date window and tenant filter
  • Synthetic generator CLI for payment, clickstream, and IoT event pressure
  • Local observability stack and Terraform/CI scaffolding

Architecture

PulseLake architecture

The diagram above shows the end-to-end platform across both deployment modes in one view.

What is implemented in this repository today:

  • Event producers publish through the ingestion API at /v1/events
  • The API validates payloads against Avro schemas and normalizes event_id, trace_id, and tenant_id
  • Messages flow through NATS JetStream in local mode or Google Pub/Sub in cloud mode
  • Workers batch events by partition, acquire leases, write raw NDJSON and Parquet objects, and persist checkpoints
  • Replay republishes archived raw batches by date window and tenant filter
  • OpenTelemetry traces and metrics are emitted from the API, worker, replay, and coordinator services

Current scope and limits:

  • The worker writes Parquet objects plus _pulse manifest files to object storage
  • The local stack includes Iceberg REST and Trino wiring for the warehouse path
  • This repository does not yet include a catalog commit layer that turns landed Parquet batches into managed Iceberg table snapshots
  • Terraform provisions storage, messaging, IAM, and observability resources; it does not load data into BigQuery by itself

Repository Layout

cmd/               Service entrypoints
services/          Business logic and platform adapters
deployments/local/ Local runtime stack
infrastructure/    Terraform environments and modules
observability/     Collector, metrics, tracing, and dashboard config
schemas/           Versioned Avro event schemas
datasets/          Sample, public, and synthetic dataset assets

Local Demo

Prerequisites:

  • Docker and Docker Compose
  • Go 1.23+

Start the full local platform and seed synthetic events:

make demo

Key endpoints:

  • API: http://localhost:8080/v1/events
  • Coordinator: http://localhost:8081/healthz
  • Trino: http://localhost:8082
  • Grafana: http://localhost:3000 (admin / admin)
  • Prometheus: http://localhost:9090
  • MinIO Console: http://localhost:9001

Generate more synthetic load:

make generate

Replay a tenant/date range:

make replay

Local Runtime Services

make local or make demo starts:

  • api
  • worker
  • coordinator
  • generator
  • nats
  • minio
  • iceberg-rest
  • trino
  • otel-collector
  • prometheus
  • tempo
  • loki
  • grafana

Event Contract

Example event:

{
  "trace_id": "uuid",
  "tenant_id": "tenant_001",
  "event_type": "payment",
  "amount": 1500,
  "currency": "THB",
  "timestamp": "2026-05-14T10:00:00Z"
}

Every event is normalized with:

  • trace_id
  • event_id
  • tenant_id

Partition layout:

warehouse/tenant_id=tenant_001/event_type=payment/date=2026-05-14/batch-*.parquet
raw/tenant_id=tenant_001/event_type=payment/date=2026-05-14/batch-*.ndjson
dlq/2026-05-14/*.json

Terraform

Terraform environments live under infrastructure/terraform/environments/{dev,staging,prod} and use reusable modules for:

  • pubsub
  • gcs
  • bigquery
  • iam
  • observability

Validate locally:

make terraform-plan

GitHub Actions

The repository includes:

  • ci.yml for terraform fmt, terraform validate, tflint, checkov, go test, and Docker builds
  • deploy.yml for Workload Identity Federation based deployment without static JSON keys

Notes

  • The worker lands replayable raw batches and Parquet objects into the same object-store abstraction used by local and cloud modes.
  • The local stack includes Iceberg REST catalog and Trino wiring around the warehouse storage path.
  • A future catalog commit layer can consume _pulse manifests and promote landed Parquet batches into managed table snapshots.

About

Distributed observability-first streaming lakehouse platform built with Go, Apache Iceberg, GCP, Terraform, OpenTelemetry, and fully local POC support.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors