Load testing baseline + production monitoring plan for analytics rollout #35814

@freddyDOTCMS

Description

Before enabling analytics for new customers in production, we need a measured latency baseline (P95/P99) for the ingest path under realistic load, and a monitoring plan that gives us early warning if something degrades as the feature is gradually enabled.

Currently we have no load test suite and no defined SLOs for the ingest endpoint. This issue covers both gaps.

Scope

Single traffic pattern: POST /v1/event/ingest — fire-and-forget, async, bounded by the executor queue (capacity 500, max 20 threads) and the per-customer HikariCP pool. Target: high throughput, low rejection rate, consistent 202 response time.
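The admission bound described above can be sketched as a small model. This is only an illustration under stated assumptions, not the service's implementation: `BoundedExecutorModel`, `submit`, `complete_one` and `sustainable_rps` are hypothetical names, and the model assumes a request is rejected only when all 20 threads are busy and all 500 queue slots are taken. The real executor may fill the queue and grow threads in a different order.

```python
from dataclasses import dataclass


@dataclass
class BoundedExecutorModel:
    """Toy model of a worker pool with a bounded queue (hypothetical names)."""
    max_threads: int = 20       # from the issue: max 20 threads
    queue_capacity: int = 500   # from the issue: queue capacity 500
    active: int = 0             # tasks currently running
    queued: int = 0             # tasks waiting for a thread

    def submit(self) -> int:
        """Return the HTTP status the ingest endpoint would answer with."""
        if self.active < self.max_threads:
            self.active += 1
            return 202
        if self.queued < self.queue_capacity:
            self.queued += 1
            return 202
        return 503  # all threads busy and queue full

    def complete_one(self) -> None:
        """One worker finishes and picks up the next queued task, if any."""
        if self.queued > 0:
            self.queued -= 1    # the worker stays active with the dequeued task
        elif self.active > 0:
            self.active -= 1


def sustainable_rps(max_threads: int, mean_task_seconds: float) -> float:
    """Little's law upper bound on steady-state throughput for the pool."""
    return max_threads / mean_task_seconds
```

With these defaults the model accepts 520 in-flight tasks before the first 503. `sustainable_rps(20, 0.25)` gives 80.0, meaning that if each async write took 250 ms on average (an assumed figure), sustained load above 80 RPS would eventually fill the queue.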

Known failure modes to stress

| Failure | Signal | Triggered by |
|---|---|---|
| Executor queue saturation | 503 SERVICE_UNAVAILABLE | Ingest flood beyond 20 threads + 500 queued tasks |
| DB connection pool exhaustion | 503 SERVICE_UNAVAILABLE | Concurrent async workers per customer exceeding DB_POOL_MAX_PER_CUSTOMER (default 5) |
| ClickHouse slow write | durationMs spike on async thread | Network latency or CH node degradation |

Load test scenarios

  1. Ingest ramp — single customer, ramp from 10 to 500 RPS over 5 min; measure throughput, 202 rate, and the RPS at which the first 503 appears.
  2. Multi-customer concurrency — 10 customers ingesting simultaneously at 50 RPS each; verify per-customer pool isolation holds and one customer's load does not affect others.
  3. Executor saturation boundary — flood beyond 20 threads + 500 queue capacity; verify 503 fires cleanly, structured log emits "Event ingestion queue is full" with active and queued MDC fields, and the service recovers once load drops.
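The ramp in scenario 1 and its "first 503" measurement can be sketched independently of the load tool. This is a minimal sketch, not a k6 or Gatling script: `linear_ramp` and `first_rejection_rps` are hypothetical helpers, and the 10-second step size is an assumption.

```python
from typing import Iterable, List, Optional, Tuple


def linear_ramp(start_rps: float, end_rps: float,
                duration_s: int, step_s: int = 10) -> List[Tuple[int, float]]:
    """Return (elapsed_seconds, target_rps) pairs for a linear ramp.

    The schedule includes both the start point and the end point.
    """
    if duration_s <= 0 or step_s <= 0:
        raise ValueError("duration_s and step_s must be positive")
    points: List[Tuple[int, float]] = []
    t = 0
    while t < duration_s:
        points.append((t, start_rps + (end_rps - start_rps) * t / duration_s))
        t += step_s
    points.append((duration_s, float(end_rps)))
    return points


def first_rejection_rps(samples: Iterable[Tuple[float, int]]) -> Optional[float]:
    """Given (target_rps_at_send_time, http_status) in send order,
    return the target RPS at which the first 503 was observed, or None."""
    for rps, status in samples:
        if status == 503:
            return rps
    return None
```

For scenario 1, `linear_ramp(10, 500, 300)` yields the target rate for each 10-second step of the 5-minute ramp; the recorded responses can then be passed to `first_rejection_rps` to report the saturation point.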

Production rollout monitoring plan

Rollout gates: enable analytics for 1 customer → 5 → 25 → all new customers. Hold 48 h at each gate before proceeding.
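The gate progression can be expressed as a simple decision rule. The sketch below is illustrative only: `gate_decision` and its return values are hypothetical, and it assumes that any firing alert holds the rollout regardless of how long the current gate has been open.

```python
from datetime import datetime, timedelta

# Gate sizes and hold time as stated in the rollout plan.
GATES = ["1 customer", "5 customers", "25 customers", "all new customers"]
HOLD = timedelta(hours=48)


def gate_decision(gate_index: int, entered_at: datetime,
                  now: datetime, alert_firing: bool) -> str:
    """Return 'hold', 'wait', 'advance' or 'done' for the current gate."""
    if alert_firing:
        return "hold"       # assumption: any firing alert blocks progression
    if gate_index >= len(GATES) - 1:
        return "done"       # already at the final gate
    if now - entered_at < HOLD:
        return "wait"       # 48 h hold has not elapsed yet
    return "advance"
```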

Grafana dashboards and alerts to have in place before gate 1:

| Signal | Alert threshold | Action |
|---|---|---|
| P99 ingest durationMs (HTTP layer) | > 500 ms sustained 5 min | Hold rollout, investigate executor or CH write latency |
| 503 rate on /v1/event/ingest | > 1% of requests over 5 min | Hold rollout; check executor queue depth and pool sizing |
| 500 rate on /v1/event/ingest | > 0.5% over 5 min | Investigate; likely CH node or persistence failure |
| ClickHouse replication lag | > 30 s | Alert on-call; ingested data may not be visible yet |
| ClickHouse memory usage per node | > 80% sustained 10 min | Alert on-call; risk of OOM during background merges |
| ClickHouse disk usage per node | > 70% | Alert on-call; plan storage expansion before ingestion is blocked |

Key MDC fields already in structured logs that Grafana queries should use: durationMs, endpoint, customerId, layer, requestId (for drilling into individual failures), active and queued (emitted on queue saturation).
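To make the two error-rate thresholds concrete, here is a minimal sketch of how they evaluate over a 5-minute window. It is not a Grafana or Prometheus rule: `status_rate` and `rollout_actions` are hypothetical helpers, and the input is assumed to be `(timestamp_seconds, http_status)` pairs extracted from the structured logs for the ingest endpoint.

```python
from typing import List, Sequence, Tuple

WINDOW_S = 300  # 5-minute evaluation window from the alert table


def status_rate(events: Sequence[Tuple[float, int]], status: int,
                now_s: float, window_s: int = WINDOW_S) -> float:
    """Fraction of requests in the trailing window that returned `status`."""
    in_window = [s for t, s in events if now_s - window_s < t <= now_s]
    if not in_window:
        return 0.0  # no traffic in the window: nothing to alert on
    return sum(1 for s in in_window if s == status) / len(in_window)


def rollout_actions(events: Sequence[Tuple[float, int]], now_s: float) -> List[str]:
    """Apply the 503 (> 1%) and 500 (> 0.5%) thresholds from the plan."""
    actions: List[str] = []
    if status_rate(events, 503, now_s) > 0.01:
        actions.append("hold rollout: check executor queue depth and pool sizing")
    if status_rate(events, 500, now_s) > 0.005:
        actions.append("investigate: likely CH node or persistence failure")
    return actions
```

In practice these checks would live in the alerting system rather than in application code; the sketch only pins down what "rate over 5 min" means so the staging fire-and-resolve tests have an unambiguous expectation.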

Note on ClickHouse infrastructure metrics: beyond application-level signals, the Grafana dashboard must also cover ClickHouse node health. Memory pressure is a known risk during background merge operations (SummingMergeTree and ReplicatedMergeTree tables merge parts continuously in the background as data accumulates). Disk usage keeps growing as more customers are onboarded; without a storage alert, the first sign of a problem would be a failed insert. Both metrics should be scraped via the ClickHouse Prometheus exporter and tracked from gate 1, not added reactively when a problem appears.
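One way to turn the disk-usage series into a runway figure is a linear extrapolation towards the 70% alert threshold. The sketch below assumes one usage sample per day, expressed as a fraction of capacity, and roughly linear growth; `days_until_threshold` is a hypothetical helper, not something provided by ClickHouse or the exporter.

```python
from typing import List, Optional


def days_until_threshold(daily_used_fraction: List[float],
                         threshold: float = 0.70) -> Optional[float]:
    """Estimate days until disk usage reaches `threshold`.

    Fits a least-squares line through the daily samples and extrapolates
    from the latest sample. Returns 0.0 if the threshold is already
    reached, or None if usage is flat or shrinking.
    """
    n = len(daily_used_fraction)
    if n < 2:
        raise ValueError("need at least two daily samples")
    if daily_used_fraction[-1] >= threshold:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_fraction) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_used_fraction))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var  # growth in used fraction per day
    if slope <= 0:
        return None
    return (threshold - daily_used_fraction[-1]) / slope
```

For example, samples of 40%, 42%, 44% and 46% on consecutive days grow by 2 points per day, which leaves roughly 12 days before the 70% alert would fire. Onboarding at each gate will likely change the slope, so the estimate is only meaningful when recomputed after each gate.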


Acceptance Criteria

  • Load test suite (k6 or Gatling) covers all three scenarios above and is runnable with a single command against a staging environment with ClickHouse running.
  • P95 and P99 latency baselines are documented for the ingest endpoint under the multi-customer concurrency scenario.
  • SLOs are agreed and written down: target P99 for ingest, max acceptable 503 rate, max acceptable 500 rate.
  • Grafana dashboard panel plan is defined and agreed, covering: P95/P99 durationMs on ingest, 202/503/500 rates, executor active and queued depths, active and pending HikariCP connections per customer, ClickHouse memory usage per node, ClickHouse disk usage per node and projected runway, ClickHouse replication lag, and background merge queue depth. Implementation of the dashboard is tracked as a separate issue.
  • All six alert rules from the monitoring plan are configured and tested (fire + resolve) in staging before gate 1.
  • Rollout runbook documents the gate criteria, who approves each gate, and the rollback procedure if an alert fires.
  • Load test results and agreed SLOs are attached to this issue before gate 1 is opened.
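For documenting the baselines, it helps to fix the percentile definition up front. The sketch below uses the nearest-rank method on raw latency samples; `percentile` and `latency_baseline` are hypothetical helpers, and the numbers reported by k6, Gatling or Grafana may differ slightly because those tools can use interpolation or histogram buckets.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_baseline(duration_ms: Sequence[float]) -> Dict[str, float]:
    """Summarise durationMs samples as the P95/P99 baseline."""
    return {"p95": percentile(duration_ms, 95),
            "p99": percentile(duration_ms, 99)}
```

Recording which method produced the baseline alongside the numbers keeps later comparisons against the Grafana panels honest.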

Priority

High

Additional Context

No response
