perf(apiserver): share etcd client across project control planes by scotwells · Pull Request #669 · milo-os/milo

scotwells · 2026-06-25T17:39:14Z

Why

The milo apiserver opens a separate, dedicated connection to etcd for every project × resource-type combination and never shares them. Across our ~500 project control planes that's tens of thousands of mostly-idle connections, each carrying its own buffers and background workers.

In production that fan-out — not the object cache — is what drives apiserver memory: it accounts for the large majority of a pod's ~23 GB, split between the per-connection buffers and the ~1.4M goroutines those connections spin up. It's the main contributor to the OOM churn tracked in datum-cloud/engineering#323, and a meaningful chunk of the storage pressure behind #596.

What

Share a single etcd client per transport config across all projects and resources instead of one-per-combination. Every project already talks to the same etcd with the same credentials — only the key prefix differs, and that prefix is applied at the store layer, so the connection carries no per-project state and is safe to pool. etcd connections multiplex many watches over one link, so collapsing the duplicates removes redundant overhead without changing behavior or throughput.

Clients are reference-counted and closed only when the last project storage using them is torn down — the same pattern the etcd compactor already uses.

This is implemented entirely in a new self-contained etcdshared package — no upstream or vendored changes, no fork. It builds a shared-client storage backend and wraps it with the unchanged upstream watch-cache, then points the existing project-aware decorator at it (one-line swap).

Expected outcome

Each apiserver pod should drop well below its current ~16 GB heap and shed most of those background goroutines, giving real headroom under the memory limit and quieting the recurring near-limit alerts — a low-risk stepping stone ahead of the larger storage rework in #596, not a competing effort.

Status — draft, for testing

Opened in draft to validate in a real environment before review.

Done:

Builds clean (go build ./...).
Reference-count lifecycle is unit-tested, including -race: clients are shared across projects, kept alive while any project still holds a reference, and closed exactly once at the last release.

Remaining before ready-for-review:

Real-etcd validation that two project prefixes sharing one connection read/write fully isolated keyspaces (currently argued structurally; embedded etcd isn't available in unit tests).
Compactor lifecycle guard. The shared-client compactor uses a process-global per-endpoint map that is also used by the default storage path; teardown ordering across the two owners needs a guard before this runs in prod.
Confirm steady-state memory reduction on a staging apiserver under representative project load.
Re-diff the copied storage helpers on any k8s.io/apiserver bump (currently pinned to v0.35.0).

Refs: datum-cloud/engineering#323, #596

The apiserver opened a dedicated etcd connection for every (project x resource) pair and never shared them. Across ~500 project control planes this produced tens of thousands of mostly-idle connections that dominated apiserver memory through per-connection gRPC read/write buffers and the goroutines each connection spawns. Share a single etcd client per transport config across all projects and resources. Per-project isolation is the etcd key prefix, applied at the store layer, so the connection carries no per-project state and is safe to pool. Clients are reference-counted and closed only when the last project storage using them is torn down. Implemented as a self-contained etcdshared package with no upstream or vendor changes: it builds a shared-client raw storage backend and wraps it with the unchanged upstream cacher, then points the project-aware decorator at it. Claude-Session: https://claude.ai/code/session_01PgQX8ky2mbuEieE7BR5Eu8

savme · 2026-06-26T12:41:20Z

Hey @scotwells, just a heads up: I'm testing this build in staging, so I pushed a tiny change to transportKey directly to your branch — just wanted to let you know so you're not surprised by it!

scotwells · 2026-06-26T13:04:26Z

@savme no worries! Feel free to change as needed

savme · 2026-06-26T16:52:23Z

Deployed the prototype to staging. Initial results look really promising!

❯ go tool pprof -top -inuse_space milo-apiserver-2026-06-26-staging-75fcc7dc9b-t24pg-heap.pprof
Type: inuse_space
Time: 2026-06-26 17:55:26 CEST
Showing nodes accounting for 993.56MB, 73.12% of 1358.75MB total
Dropped 1091 nodes (cum <= 6.79MB)
      flat  flat%   sum%        cum   cum%
  178.67MB 13.15% 13.15%   230.19MB 16.94%  k8s.io/apiserver/pkg/storage/cacher.NewCacherFromConfig
   96.54MB  7.11% 20.25%    96.54MB  7.11%  runtime.malg
   70.02MB  5.15% 25.41%    80.02MB  5.89%  encoding/json.(*decodeState).objectInterface
   69.67MB  5.13% 30.54%    70.67MB  5.20%  k8s.io/apiserver/pkg/storage/etcd3.(*watcher).createWatchChan
   63.01MB  4.64% 35.17%    63.01MB  4.64%  k8s.io/apimachinery/pkg/runtime.DeepCopyJSONValue
   45.01MB  3.31% 38.49%    45.01MB  3.31%  net/http.(*Request).WithContext (partial-inline)
   41.66MB  3.07% 41.55%    41.66MB  3.07%  bufio.NewWriterSize (inline)
   26.01MB  1.91% 43.47%   147.03MB 10.82%  k8s.io/apimachinery/pkg/runtime.structToUnstructured
   ...

At ~90 minutes post-deployment, we're at around 1GB of heap with no transport, bufio.Reader, or http2 in sight. For reference, a deployment of main typically climbs to 3.5GB of heap at this point.

scotwells · 2026-06-26T20:09:44Z

Deployed this branch to staging (v0.0.0-perf-shared-etcd-client-pooling-20260626-143242) and it knocks over watch-cache consistency, which in turn crashloops network-services-operator (and any client using streaming WatchList / sendInitialEvents).

Symptoms

milo-apiserver logs flooded on all 3 replicas (thousands/min):
- cacher Cache consistency check error: failed calculating etcd digest: The resourceVersion for the provided list is too old
- etcd3 Too large resource version: 1280725601, current: 1280631907
NSO's Project / DNSRecordSet informers never sync — their sendInitialEvents=true watch closes at ~3s with no end-bookmark — so the manager fails startup (unable to start: failed to wait for ... caches to sync) and CrashLoopBackOffs.
Onset was gradual: milo rolled at 14:43, NSO started crashing ~18:00, as the desync widened across compaction cycles.

Likely cause

We share one *kubernetes.Client across every project×resource cacher. That client isn't just a connection — it carries single-consumer watch/progress state, and upstream creates one per store for a reason. The cacher's consistent-list / WatchList path asks for the current global etcd revision and then waits for its cache to reach it, relying on per-watch progress notifications to advance idle per-prefix caches. With hundreds of cachers multiplexed onto one shared client, that per-store RV/progress routing gets crossed, so a cache can't reach the requested global RV → Too large resource version (and, post-compaction, resourceVersion too old).

This is the same risk flagged in the unchecked box — "two project prefixes sharing one connection read/write fully isolated keyspaces" — they don't, under the shared watch-cache.

Suggested directions

Share the gRPC ClientConn, not the kubernetes.Client — one dialed connection per transportKey, but a separate kubernetes.Client/cacher per store over it. Kills the per-connection buffers (the dominant memory cost) while keeping each watch cache's progress/RV isolated and upstream-identical. (Per-client watcher goroutines remain — worth measuring whether that alone clears the OOM headroom.)
Or skip the cacher for project storage — serve LIST/WatchList straight from etcd via the shared client. Loses cache benefits, keeps pooling, guaranteed-correct.
Compactor guard (the existing open checklist item) — startCompactorOnce keys on fmt.Sprintf("%v", c.Transport) while acquireClient keys on transportKey(), and both feed the process-global per-endpoint map shared with the default path. Two compactors racing one endpoint can compact inside the cacher's history window and accelerate the "compacted" failures.

Staging stopgap while iterating: revert milo to v0.0.0-main-20260625-165547, or set --watch-cache=false / feature-gates=WatchList=false,ConsistentListFromCache=false.

Sharing a single etcd client across every project control plane collapsed ~50k watch streams onto one client (~19.7k watchers multiplexed ~60 per stream). etcd delivery stayed healthy, but milo watch-cache p99 read-wait pegged at the 3s block-timeout and consistency-check errors climbed steadily, because per-cacher RequestWatchProgress fans out across every watcher on the shared client - O(N^2) progress amplification - so idle per-prefix caches never reach the global revision a consistent read demands. Streaming WatchList consumers (network-services-operator) then fail their initial cache sync and crashloop. Pool sharedClientPoolSize (32) connections per transport and assign stores round-robin. This cuts watchers-per-client ~32x, shrinking the progress fan-out per client by the same factor, while still collapsing the tens of thousands of per-(project x resource) connections the package replaced - the memory win is preserved (a few dozen connections, not one per resource).

Replace the hardcoded sharedClientPoolSize const with a --shared-etcd-client-pool-size flag (default 32, min 1) so the per-transport pool can be tuned per environment without rebuilding the image. The parsed value is pushed into the etcdshared package once at apiserver startup, before any storage is built. Key changes: - Add SetSharedClientPoolSize to the etcdshared package; sharedClientPoolSize becomes a package var clamped to a floor of 1 - Register --shared-etcd-client-pool-size and apply it at the top of Run

Map the new --shared-etcd-client-pool-size flag to a SHARED_ETCD_CLIENT_POOL_SIZE env var (default 32) in the apiserver deployment so the pool can be tuned per environment without rebuilding.

savme · 2026-06-29T10:38:26Z

etcdshared: LIST serialization regression

Observed. Staging showed LIST latency of item_count × 155 ms — e.g. 22 dnszones → 3.5 s, 58–71 dnsrecordsets → 9–11 s. Individual GET p50 was the expected ~150 ms (one round trip).

Likely root cause:

clientv3.Watcher's internal run() goroutine sends one WatchCreate to etcd at a time, waiting for a Created confirmation (one round trip, 155 ms) before sending the next. When 78 stores concurrently create or recreate watches, they queue and confirm serially: up to 78 × 155 ms ≈ 12 s. During that window each cacher is !Ready() and consistent LISTs fall back to raw etcd immediately.
The same run() goroutine dispatches all incoming watch events from etcd to individual per-watch goroutines via unbuffered channels. If any downstream consumer is slow, the dispatcher stalls and no events reach any of the other 77 watches. This causes watch caches to persistently lag behind the current etcd revision. Consistent LISTs (RV="") call GetCurrentResourceVersion() (one etcd round trip, 155 ms) then block in waitUntilFreshAndBlock waiting for the cache to catch up; if it doesn't within 3 s, the request falls back to raw etcd. This is the steady-state cost: 155 ms + 3 s timeout + 155 ms fallback ≈ 3.3 s per consistent LIST.

Fix options.

Option A (implemented): per-store clientv3.Watcher over the shared TCP connection — no new dial. Each store's run() goroutine is isolated; watch creations are parallel.
Option B: bump pool size — effectively defeats the point of pooling.

Implementation. Each store gets a dedicated clientv3.Watcher over the pool's existing TCP connection; KV and lease operations continue to use the shared pool client.

scotwells · 2026-06-30T18:49:50Z

@savme could we create a Grafana dashboard that we can use to monitor performance around this part of the system? I'd be curious to see how metrics have changed over time based on tweaks we're making here.

savme · 2026-07-01T09:50:53Z

@savme could we create a Grafana dashboard that we can use to monitor performance around this part of the system? I'd be curious to see how metrics have changed over time based on tweaks we're making here.

I've put together a baseline dashboard here: https://grafana.staging.env.datum.net/goto/Twc2DJfDg?orgId=1

We can iterate on it as we go

scotwells and others added 2 commits June 25, 2026 12:38

perf(apiserver): specific fields in transportKey

41e2274

perf(apiserver): always use etcdshared decorator

a777ab0

scotwells added 3 commits June 26, 2026 15:24

chore(apiserver): expose shared etcd client pool size via deployment env

f336b60

Map the new --shared-etcd-client-pool-size flag to a SHARED_ETCD_CLIENT_POOL_SIZE env var (default 32) in the apiserver deployment so the pool can be tuned per environment without rebuilding.

chore(apiserver): ensure watch gRPC streams are multiplexed

1790fe5

savme force-pushed the perf/shared-etcd-client-pooling branch from 6829f49 to 1790fe5 Compare June 29, 2026 11:12

scotwells and others added 2 commits June 29, 2026 08:20

Merge branch 'main' into perf/shared-etcd-client-pooling

4ca9e99

Merge branch 'main' into perf/shared-etcd-client-pooling

e844773

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(apiserver): share etcd client across project control planes#669

perf(apiserver): share etcd client across project control planes#669
scotwells wants to merge 9 commits into
mainfrom
perf/shared-etcd-client-pooling

scotwells commented Jun 25, 2026

Uh oh!

savme commented Jun 26, 2026

Uh oh!

scotwells commented Jun 26, 2026

Uh oh!

savme commented Jun 26, 2026

Uh oh!

scotwells commented Jun 26, 2026 •

edited

Loading

Uh oh!

savme commented Jun 29, 2026 •

edited

Loading

Uh oh!

scotwells commented Jun 30, 2026

Uh oh!

savme commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

scotwells commented Jun 25, 2026

Why

What

Expected outcome

Status — draft, for testing

Uh oh!

savme commented Jun 26, 2026

Uh oh!

scotwells commented Jun 26, 2026

Uh oh!

savme commented Jun 26, 2026

Uh oh!

scotwells commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Symptoms

Likely cause

Suggested directions

Uh oh!

savme commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

etcdshared: LIST serialization regression

Uh oh!

scotwells commented Jun 30, 2026

Uh oh!

savme commented Jul 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

scotwells commented Jun 26, 2026 •

edited

Loading

savme commented Jun 29, 2026 •

edited

Loading