Load testing baseline + production monitoring plan for analytics rollout

### Description

Before enabling analytics for new customers in production, we need a measured latency baseline (P95/P99) for the ingest path under realistic load, and a monitoring plan that gives us early warning if something degrades as the feature is gradually enabled.

Currently we have no load test suite and no defined SLOs for the ingest endpoint. This issue covers both gaps.

## Scope

Single traffic pattern: `POST /v1/event/ingest` — fire-and-forget, async, bounded by the executor queue (capacity 500, max 20 threads) and the per-customer HikariCP pool. Target: high throughput, low rejection rate, consistent 202 response time.
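
The capacity bound above can be expressed as a small admission model. This is a simplified sketch, not the service's executor: it assumes a task is accepted while a worker is free or the queue has room, and rejected (surfacing as a 503) otherwise. The name `admit` and both constants are illustrative; the order in which the real executor fills threads and queue depends on its configuration, but the total capacity of 20 + 500 does not.

```python
MAX_THREADS = 20      # max worker threads, taken from the scope above
QUEUE_CAPACITY = 500  # bounded executor queue, taken from the scope above

# Upper bound on tasks in flight (running + waiting) before rejections start.
MAX_IN_FLIGHT = MAX_THREADS + QUEUE_CAPACITY


def admit(active: int, queued: int) -> bool:
    """Return True if one more task would be accepted, False if it would be rejected.

    Simplified model (an assumption): a task runs immediately while a worker
    is free, otherwise it waits in the queue while the queue has room.
    """
    if active < MAX_THREADS:
        return True
    return queued < QUEUE_CAPACITY
```

For example, `admit(20, 499)` is still accepted, while `admit(20, 500)` is the first rejected request.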

## Known failure modes to stress

| Failure | Signal | Triggered by |
|---|---|---|
| Executor queue saturation | `503 SERVICE_UNAVAILABLE` | Ingest flood beyond 20 threads + 500 queued tasks |
| DB connection pool exhaustion | `503 SERVICE_UNAVAILABLE` | Concurrent async workers per customer exceeding `DB_POOL_MAX_PER_CUSTOMER` (default 5) |
| ClickHouse slow write | `durationMs` spike on async thread | Network latency or CH node degradation |

## Load test scenarios

1. **Ingest ramp** — single customer, ramp from 10 to 500 RPS over 5 min; measure throughput, 202 rate, and at what RPS the first 503 appears.
2. **Multi-customer concurrency** — 10 customers ingesting simultaneously at 50 RPS each; verify per-customer pool isolation holds and one customer's load does not affect others.
3. **Executor saturation boundary** — flood beyond 20 threads + 500 queue capacity; verify 503 fires cleanly, structured log emits `"Event ingestion queue is full"` with `active` and `queued` MDC fields, and the service recovers once load drops.
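
To size the test data and pick the ramp steps for scenario 1, it helps to know the target rate at any point and the total request count. The sketch below assumes the ramp is linear, which the scenario does not state; the function names are illustrative and independent of k6 or Gatling.

```python
def ramp_rps(t_seconds: float, start_rps: float = 10.0,
             end_rps: float = 500.0, duration_s: float = 300.0) -> float:
    """Target request rate at time t for a linear ramp, clamped to the ramp window."""
    t = min(max(t_seconds, 0.0), duration_s)
    return start_rps + (end_rps - start_rps) * t / duration_s


def total_requests(start_rps: float = 10.0, end_rps: float = 500.0,
                   duration_s: float = 300.0) -> float:
    """Total requests sent during a linear ramp: mean rate times duration."""
    return (start_rps + end_rps) / 2.0 * duration_s


# Scenario 2 for comparison: 10 customers at 50 RPS each.
MULTI_CUSTOMER_AGGREGATE_RPS = 10 * 50
```

Under this assumption the ramp sends about 76,500 requests in total, and the aggregate rate of scenario 2 (500 RPS) equals the peak of the single-customer ramp, so the two runs are directly comparable at that load level.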

## Production rollout monitoring plan

Rollout gates: enable analytics for 1 customer → 5 → 25 → all new customers. Hold 48 h at each gate before proceeding.

Grafana dashboards and alerts to have in place before gate 1:

| Signal | Alert threshold | Action |
|---|---|---|
| P99 ingest `durationMs` (HTTP layer) | > 500 ms sustained 5 min | Hold rollout; investigate executor or CH write latency |
| `503` rate on `/v1/event/ingest` | > 1% of requests over 5 min | Hold rollout; check executor queue depth and pool sizing |
| `500` rate on `/v1/event/ingest` | > 0.5% over 5 min | Investigate; likely CH node or persistence failure |
| ClickHouse replication lag | > 30 s | Alert on-call; ingested data may not be visible yet |
| ClickHouse memory usage per node | > 80% sustained 10 min | Alert on-call; risk of OOM during background merges |
| ClickHouse disk usage per node | > 70% | Alert on-call; plan storage expansion before ingestion is blocked |
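
The two rate thresholds in the table can be pinned down unambiguously before they are translated into Grafana alert rules. The sketch below only illustrates the intended semantics for one 5-minute window; the `WindowCounts` type and the alert names are hypothetical, and the real rules would be written in whatever query language the log or metrics datasource uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowCounts:
    """Response counts for /v1/event/ingest over one 5-minute window."""
    total: int
    status_503: int
    status_500: int


def rate_alerts(w: WindowCounts) -> list[str]:
    """Return the names of the rate alerts that would fire for this window."""
    if w.total == 0:
        return []  # no traffic in the window, so there is no rate to evaluate
    fired = []
    if w.status_503 / w.total > 0.01:   # more than 1% of requests
        fired.append("503-rate")
    if w.status_500 / w.total > 0.005:  # more than 0.5% of requests
        fired.append("500-rate")
    return fired
```

Both comparisons are strict, matching the `>` in the table: a window with exactly 0.5% of responses being `500` does not fire.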

Key MDC fields already in structured logs that Grafana queries should use: `durationMs`, `endpoint`, `customerId`, `layer`, `requestId` (for drilling into individual failures), `active` and `queued` (emitted on queue saturation).

> **Note on ClickHouse infrastructure metrics:** beyond application-level signals, the Grafana dashboard must also cover ClickHouse node health. Memory pressure is a known risk during background merge operations (SummingMergeTree and ReplicatedMergeTree merge parts continuously as data accumulates). Disk usage grows unboundedly as more customers are onboarded — without a storage alert, the first sign of a problem would be a failed insert. Both metrics should be scraped via the ClickHouse Prometheus exporter and tracked from gate 1, not added reactively when a problem appears.
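
A rough way to turn the disk metric into an early warning is to project how many days remain before the 70% alert threshold is reached. The sketch below assumes linear growth at a constant daily rate, which is a simplification: growth will likely step up at each rollout gate. The function name and parameters are illustrative and not tied to any exporter metric.

```python
def days_until_threshold(used_bytes: float, capacity_bytes: float,
                         growth_bytes_per_day: float,
                         threshold: float = 0.70) -> float:
    """Days until disk usage reaches `threshold` of capacity, assuming linear growth."""
    limit = threshold * capacity_bytes
    if used_bytes >= limit:
        return 0.0  # already at or past the alert threshold
    if growth_bytes_per_day <= 0:
        return float("inf")  # usage is flat or shrinking
    return (limit - used_bytes) / growth_bytes_per_day
```

For a node at 50% of capacity that grows by 1% of capacity per day, this gives roughly 20 days before the disk alert would fire.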

---

### Acceptance Criteria

- [ ] Load test suite (k6 or Gatling) covers all three scenarios above and is runnable with a single command against a staging environment with ClickHouse running.
- [ ] P95 and P99 latency baselines are documented for the ingest endpoint under the multi-customer concurrency scenario.
- [ ] SLOs are agreed and written down: target P99 for ingest, max acceptable 503 rate, max acceptable 500 rate.
- [ ] Grafana dashboard panel plan is defined and agreed. Implementation of the dashboard is tracked as a separate issue. The plan covers:
  - P95/P99 `durationMs` on ingest
  - 202/503/500 rates
  - executor `active` and `queued` depths
  - active and pending HikariCP connections per customer
  - ClickHouse memory usage per node
  - ClickHouse disk usage per node and projected runway
  - ClickHouse replication lag
  - background merge queue depth
- [ ] All six alert rules from the monitoring plan are configured and tested (fire + resolve) in staging before gate 1.
- [ ] Rollout runbook documents the gate criteria, who approves each gate, and the rollback procedure if an alert fires.
- [ ] Load test results and agreed SLOs are attached to this issue before gate 1 is opened.
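
When the P95/P99 baselines are written down, it is worth recording how the percentile was computed, because tools differ (some interpolate between samples). The sketch below uses the nearest-rank definition as one possible convention; it is an assumption, not necessarily what k6, Gatling, or Grafana report.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]
```

With 100 latency samples of 1 ms through 100 ms, `percentile(samples, 95)` returns 95 and `percentile(samples, 99)` returns 99.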

### Priority

High

### Additional Context

_No response_

Load testing baseline + production monitoring plan for analytics rollout #35814
