Before enabling analytics for new customers in production, we need a measured latency baseline (P95/P99) for the ingest path under realistic load, and a monitoring plan that gives us early warning if something degrades as the feature is gradually enabled.
Currently we have no load test suite and no defined SLOs for the ingest endpoint. This issue covers both gaps.
Scope
Single traffic pattern: POST /v1/event/ingest — fire-and-forget, async, bounded by the executor queue (capacity 500, max 20 threads) and the per-customer HikariCP pool. Target: high throughput, low rejection rate, consistent 202 response time.
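For sizing the scenarios, a back-of-envelope model of these bounds may help. The sketch below is an estimate under stated assumptions, not the service's code: it assumes all 20 threads stay busy, each async write takes a constant `service_time_ms` (a hypothetical parameter; the 50 ms default is a placeholder, not a measurement), and arrivals are constant. It ignores the per-customer pool limit.

```python
def sustainable_rps(workers: int = 20, service_time_ms: float = 50) -> float:
    """Steady-state completions per second if every worker is always busy."""
    return workers * 1000 / service_time_ms


def seconds_until_first_rejection(arrival_rps: float,
                                  workers: int = 20,
                                  queue_capacity: int = 500,
                                  service_time_ms: float = 50):
    """Approximate time for a constant overload to fill the bounded queue.

    Returns None when arrivals do not exceed the drain rate, i.e. the queue
    is not expected to fill under this simplified model.
    """
    drain = sustainable_rps(workers, service_time_ms)
    if arrival_rps <= drain:
        return None
    # The backlog grows at (arrival - drain) tasks per second.
    return queue_capacity / (arrival_rps - drain)
```

Under the placeholder 50 ms write time, 20 workers drain about 400 RPS, so a constant 500 RPS would fill the 500-slot queue in roughly 5 seconds. The measured write time from the load test should replace the placeholder.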
Known failure modes to stress
| Failure | Signal | Triggered by |
| --- | --- | --- |
| Executor queue saturation | 503 SERVICE_UNAVAILABLE | Ingest flood beyond 20 threads + 500 queued tasks |
| DB connection pool exhaustion | 503 SERVICE_UNAVAILABLE | Concurrent async workers per customer exceeding DB_POOL_MAX_PER_CUSTOMER (default 5) |
| ClickHouse slow write | durationMs spike on async thread | Network latency or CH node degradation |
Load test scenarios
Ingest ramp — single customer, ramp from 10 to 500 RPS over 5 min; measure throughput, 202 rate, and at what RPS the first 503 appears.
Multi-customer concurrency — 10 customers ingesting simultaneously at 50 RPS each; verify per-customer pool isolation holds and one customer's load does not affect others.
Executor saturation boundary — flood beyond 20 threads + 500 queue capacity; verify 503 fires cleanly, structured log emits "Event ingestion queue is full" with active and queued MDC fields, and the service recovers once load drops.
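The ramp in the first scenario can be written down as a simple schedule, which also lets the timestamp of the first 503 be mapped back to an offered RPS. This is a minimal sketch assuming a linear ramp; the function names are illustrative and are not part of k6 or Gatling.

```python
def ramp_rps(t_s: float, start_rps: float = 10, end_rps: float = 500,
             duration_s: float = 300) -> float:
    """Target request rate at t_s seconds into a linear ramp."""
    if t_s <= 0:
        return float(start_rps)
    if t_s >= duration_s:
        return float(end_rps)
    return start_rps + (end_rps - start_rps) * t_s / duration_s


def total_requests(start_rps: float = 10, end_rps: float = 500,
                   duration_s: float = 300) -> float:
    """Requests sent over the whole ramp (area under the linear schedule)."""
    return (start_rps + end_rps) * duration_s / 2
```

With the numbers above, the ramp sends about 76,500 requests in total, and a first 503 observed 150 s into the run corresponds to an offered load of 255 RPS. The multi-customer scenario offers 10 × 50 = 500 RPS in aggregate, the same as the peak of the ramp, so differences between the two results should mostly reflect per-customer pool behaviour.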
Production rollout monitoring plan
Rollout gates: enable analytics for 1 customer → 5 → 25 → all new customers. Hold 48 h at each gate before proceeding.
Grafana dashboards and alerts to have in place before gate 1:
| Signal | Alert threshold | Action |
| --- | --- | --- |
| P99 ingest durationMs (HTTP layer) | > 500 ms sustained 5 min | Hold rollout, investigate executor or CH write latency |
| 503 rate on /v1/event/ingest | > 1% of requests over 5 min | Hold rollout; check executor queue depth and pool sizing |
| 500 rate on /v1/event/ingest | > 0.5% over 5 min | Investigate; likely CH node or persistence failure |
| ClickHouse replication lag | > 30 s | Alert on-call; ingested data may not be visible yet |
| ClickHouse memory usage per node | > 80% sustained 10 min | Alert on-call; risk of OOM during background merges |
| ClickHouse disk usage per node | > 70% | Alert on-call; plan storage expansion before ingestion is blocked |
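The thresholds mix two kinds of condition: "sustained N min" (every sample in the window is over the limit) and "rate over N min" (errors divided by requests across the window). The sketch below shows one possible reading of each, assuming one sample per minute; it is independent of the alerting backend and the function names are illustrative only.

```python
def sustained_above(samples, threshold, window):
    """True if each of the last `window` per-minute samples exceeds threshold."""
    if len(samples) < window:
        return False  # not enough history to call it sustained
    return all(v > threshold for v in samples[-window:])


def window_rate_pct(errors, totals, window):
    """Error percentage aggregated over the last `window` per-minute buckets."""
    total = sum(totals[-window:])
    if total == 0:
        return 0.0  # no traffic in the window, so no error rate to report
    return 100.0 * sum(errors[-window:]) / total
```

For example, a single minute with 30 errors out of 1,000 requests, surrounded by clean minutes at the same volume, gives a 5-minute rate of 0.6%: enough to trip the 0.5% rule for 500s but not the 1% rule for 503s. Whichever reading is chosen should be written into the alert definitions so the fire and resolve tests in staging exercise the same semantics.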
Key MDC fields already in structured logs that Grafana queries should use: durationMs, endpoint, customerId, layer, requestId (for drilling into individual failures), active and queued (emitted on queue saturation).
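As a cross-check on the dashboard queries, the same fields can be used to compute the baseline offline from a log export. The sketch below assumes the structured logs are JSON, one object per line, and that the HTTP-layer entries carry `layer` equal to `"http"`; both are assumptions, since the exact log format and `layer` values are not stated here. It uses the nearest-rank percentile definition.

```python
import json
from collections import defaultdict


def nearest_rank(values, pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def ingest_percentiles(log_lines, endpoint="/v1/event/ingest", layer="http"):
    """Group durationMs by customerId and return P95/P99 per customer."""
    by_customer = defaultdict(list)
    for line in log_lines:
        rec = json.loads(line)
        # Keep only entries for the ingest endpoint at the assumed layer value.
        if rec.get("endpoint") == endpoint and rec.get("layer") == layer:
            by_customer[rec["customerId"]].append(float(rec["durationMs"]))
    return {
        customer: {"p95": nearest_rank(d, 95), "p99": nearest_rank(d, 99)}
        for customer, d in by_customer.items()
    }
```

Grouping by customerId keeps a single noisy customer from hiding inside a global percentile, which matters for the pool isolation check in the multi-customer scenario.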
Note on ClickHouse infrastructure metrics: beyond application-level signals, the Grafana dashboard must also cover ClickHouse node health. Memory pressure is a known risk during background merge operations (SummingMergeTree and ReplicatedMergeTree merge parts continuously as data accumulates). Disk usage grows unboundedly as more customers are onboarded — without a storage alert, the first sign of a problem would be a failed insert. Both metrics should be scraped via the ClickHouse Prometheus exporter and tracked from gate 1, not added reactively when a problem appears.
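To make the disk alert actionable, the dashboard could also show a projected runway. A minimal sketch, assuming one disk-usage sample per day and linear growth (which will understate growth while customers are still being onboarded); the function name and inputs are hypothetical.

```python
def disk_runway_days(usage_pct, threshold_pct: float = 70.0):
    """Days until usage reaches threshold_pct, from daily samples (oldest first).

    Returns 0.0 if already at or over the threshold, and None if usage is
    flat or shrinking, since no crossing can be projected.
    """
    if len(usage_pct) < 2:
        raise ValueError("need at least two daily samples")
    current = usage_pct[-1]
    if current >= threshold_pct:
        return 0.0
    # Average growth per day across the sampled period.
    growth_per_day = (current - usage_pct[0]) / (len(usage_pct) - 1)
    if growth_per_day <= 0:
        return None
    return (threshold_pct - current) / growth_per_day
```

For instance, daily samples of 40, 42, 44 and 46% project about 12 days until the 70% alert threshold. The projection should be recomputed after each rollout gate, since every gate changes the growth rate.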
Acceptance Criteria
Load test suite (k6 or Gatling) covers all three scenarios above and is runnable with a single command against a staging environment with ClickHouse running.
P95 and P99 latency baselines are documented for the ingest endpoint under the multi-customer concurrency scenario.
SLOs are agreed and written down: target P99 for ingest, max acceptable 503 rate, max acceptable 500 rate.
Grafana dashboard panel plan is defined and agreed, covering: P95/P99 durationMs on ingest, 202/503/500 rates, executor active and queued depths, active and pending HikariCP connections per customer, ClickHouse memory usage per node, ClickHouse disk usage per node and projected runway, ClickHouse replication lag, and background merge queue depth. Implementation of the dashboard is tracked as a separate issue.
All six alert rules from the monitoring plan are configured and tested (fire + resolve) in staging before gate 1.
Rollout runbook documents the gate criteria, who approves each gate, and the rollback procedure if an alert fires.
Load test results and agreed SLOs are attached to this issue before gate 1 is opened.
Priority
High