perf(apiserver): share etcd client across project control planes #669
scotwells wants to merge 9 commits into
The apiserver opened a dedicated etcd connection for every (project x resource) pair and never shared them. Across ~500 project control planes this produced tens of thousands of mostly-idle connections that dominated apiserver memory through per-connection gRPC read/write buffers and the goroutines each connection spawns.

Share a single etcd client per transport config across all projects and resources. Per-project isolation is the etcd key prefix, applied at the store layer, so the connection carries no per-project state and is safe to pool. Clients are reference-counted and closed only when the last project storage using them is torn down.

Implemented as a self-contained etcdshared package with no upstream or vendor changes: it builds a shared-client raw storage backend, wraps it with the unchanged upstream cacher, then points the project-aware decorator at it.

Claude-Session: https://claude.ai/code/session_01PgQX8ky2mbuEieE7BR5Eu8
|
Hey @scotwells, just a heads up: I'm testing this build in staging, so I pushed a tiny change to |
|
@savme no worries! Feel free to change as needed |
|
Deployed the prototype to staging. Initial results look really promising! At ~90 minutes post-deployment, we're at around 1GB of heap with no |
|
Deployed this branch to staging (
Symptoms
Likely cause
We share one
This is the same risk flagged in the unchecked box ("two project prefixes sharing one connection read/write fully isolated keyspaces"): under the shared watch-cache, they don't.
Suggested directions
Staging stopgap while iterating: revert milo to |
Sharing a single etcd client across every project control plane collapsed ~50k watch streams onto one client (~19.7k watchers multiplexed ~60 per stream). etcd delivery stayed healthy, but milo watch-cache p99 read-wait pegged at the 3s block-timeout and consistency-check errors climbed steadily. The cause: per-cacher RequestWatchProgress fans out across every watcher on the shared client (O(N^2) progress amplification), so idle per-prefix caches never reach the global revision a consistent read demands. Streaming WatchList consumers (network-services-operator) then fail their initial cache sync and crashloop.

Pool sharedClientPoolSize (32) connections per transport and assign stores round-robin. This cuts watchers-per-client ~32x, shrinking the progress fan-out per client by the same factor, while still collapsing the tens of thousands of per-(project x resource) connections the package replaced. The memory win is preserved (a few dozen connections, not one per resource).
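The round-robin assignment described above could look roughly like the sketch below. It is illustrative only and not the code in this PR: `conn`, `clientPool`, `newClientPool`, and `assign` are hypothetical names, and `conn` stands in for a real etcd client so the example runs with the standard library alone.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a stand-in for an etcd client (assumption: the real pool
// would hold actual etcd clients, one per pooled connection).
type conn struct{ id int }

// clientPool hands out a fixed set of connections in rotation.
type clientPool struct {
	mu    sync.Mutex
	conns []*conn
	next  int
}

// newClientPool builds a pool of the given size, with a floor of 1.
func newClientPool(size int) *clientPool {
	if size < 1 {
		size = 1
	}
	p := &clientPool{conns: make([]*conn, size)}
	for i := range p.conns {
		p.conns[i] = &conn{id: i}
	}
	return p
}

// assign returns the next connection in round-robin order.
func (p *clientPool) assign() *conn {
	p.mu.Lock()
	defer p.mu.Unlock()
	c := p.conns[p.next]
	p.next = (p.next + 1) % len(p.conns)
	return c
}

func main() {
	pool := newClientPool(32)
	perConn := make(map[int]int)
	// 640 stores spread evenly over 32 connections.
	for i := 0; i < 640; i++ {
		perConn[pool.assign().id]++
	}
	fmt.Println(len(perConn), perConn[0]) // prints "32 20"
}
```

With 640 stores and a pool of 32, each connection carries 20 stores, which is the same ~32x reduction in watchers-per-client the commit message describes.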
Replace the hardcoded sharedClientPoolSize const with a --shared-etcd-client-pool-size flag (default 32, min 1) so the per-transport pool can be tuned per environment without rebuilding the image. The parsed value is pushed into the etcdshared package once at apiserver startup, before any storage is built.

Key changes:
- Add SetSharedClientPoolSize to the etcdshared package; sharedClientPoolSize becomes a package var clamped to a floor of 1
- Register --shared-etcd-client-pool-size and apply it at the top of Run
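A minimal sketch of the flag-to-setter flow, under stated assumptions: `SetSharedClientPoolSize` and `sharedClientPoolSize` are the names from the commit message, but their bodies here are guesses, `parsePoolSize` is a hypothetical helper, and the standard library `flag` package is used for self-containment (the apiserver may register its flags through a different library).

```go
package main

import (
	"flag"
	"fmt"
)

// sharedClientPoolSize is the per-transport pool size (default 32).
var sharedClientPoolSize = 32

// SetSharedClientPoolSize stores n, clamped to a floor of 1.
// Intended to be called once at startup, before any storage is built.
func SetSharedClientPoolSize(n int) {
	if n < 1 {
		n = 1
	}
	sharedClientPoolSize = n
}

// parsePoolSize (hypothetical) registers the flag and parses args.
func parsePoolSize(args []string) (int, error) {
	fs := flag.NewFlagSet("apiserver", flag.ContinueOnError)
	size := fs.Int("shared-etcd-client-pool-size", 32,
		"etcd clients pooled per transport config (minimum 1)")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return *size, nil
}

func main() {
	size, err := parsePoolSize([]string{"--shared-etcd-client-pool-size=0"})
	if err != nil {
		panic(err)
	}
	SetSharedClientPoolSize(size)
	fmt.Println(sharedClientPoolSize) // prints "1": 0 is clamped to the floor
}
```

Clamping in the setter rather than rejecting the flag value keeps a misconfigured environment running with a single client instead of failing at startup; whether the PR makes the same choice is not stated beyond "clamped to a floor of 1".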
Map the new --shared-etcd-client-pool-size flag to a SHARED_ETCD_CLIENT_POOL_SIZE env var (default 32) in the apiserver deployment so the pool can be tuned per environment without rebuilding.
etcdshared: LIST serialization regression
Observed. Staging showed LIST latency of
Likely root cause:
Fix options.
Implementation. Each store gets a dedicated clientv3.Watcher over the pool's existing TCP connection; KV and lease operations continue to use the shared pool client. |
6829f49 to 1790fe5
|
@savme could we create a Grafana dashboard that we can use to monitor performance around this part of the system? I'd be curious to see how metrics have changed over time based on tweaks we're making here. |
I've put together a baseline dashboard here: https://grafana.staging.env.datum.net/goto/Twc2DJfDg?orgId=1 We can iterate on it as we go |
Why
The milo apiserver opens a separate, dedicated connection to etcd for every project × resource-type combination and never shares them. Across our ~500 project control planes that's tens of thousands of mostly-idle connections, each carrying its own buffers and background workers.
In production that fan-out — not the object cache — is what drives apiserver memory: it accounts for the large majority of a pod's ~23 GB, split between the per-connection buffers and the ~1.4M goroutines those connections spin up. It's the main contributor to the OOM churn tracked in datum-cloud/engineering#323, and a meaningful chunk of the storage pressure behind #596.
What
Share a single etcd client per transport config across all projects and resources instead of one-per-combination. Every project already talks to the same etcd with the same credentials — only the key prefix differs, and that prefix is applied at the store layer, so the connection carries no per-project state and is safe to pool. etcd connections multiplex many watches over one link, so collapsing the duplicates removes redundant overhead without changing behavior or throughput.
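To illustrate why the connection itself carries no per-project state, here is a rough sketch of prefix scoping at the store layer. Everything in it is hypothetical: `sharedKV` stands in for the one etcd keyspace, `projectStore` for a per-project store, and the `/projects/<name>/` layout is an assumed prefix format, not the one milo uses. The map is not safe for concurrent use; it exists only to show the scoping.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sharedKV stands in for the single keyspace behind the shared client.
type sharedKV map[string]string

// projectStore scopes every operation to one project's key prefix.
type projectStore struct {
	kv     sharedKV
	prefix string // ends in "/" so project "a" never matches "ab"
}

// newProjectStore builds a store for one project (assumed key layout).
func newProjectStore(kv sharedKV, project string) *projectStore {
	return &projectStore{kv: kv, prefix: "/projects/" + project + "/"}
}

// Put writes the value under this project's prefix.
func (s *projectStore) Put(key, value string) {
	s.kv[s.prefix+key] = value
}

// List returns this project's keys, prefix stripped, in sorted order.
func (s *projectStore) List() []string {
	var keys []string
	for k := range s.kv {
		if strings.HasPrefix(k, s.prefix) {
			keys = append(keys, strings.TrimPrefix(k, s.prefix))
		}
	}
	sort.Strings(keys)
	return keys
}

func main() {
	kv := sharedKV{}
	a := newProjectStore(kv, "a")
	ab := newProjectStore(kv, "ab")
	a.Put("pods/x", "1")
	ab.Put("pods/y", "2")
	fmt.Println(a.List(), ab.List()) // prints "[pods/x] [pods/y]"
}
```

Both stores write through the same backing keyspace, yet each sees only its own keys, because the isolation lives in the prefix rather than in the connection.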
Clients are reference-counted and closed only when the last project storage using them is torn down — the same pattern the etcd compactor already uses.
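The reference-counting pattern can be sketched as follows. This is an assumption-laden illustration rather than the etcdshared implementation: `registry`, `acquire`, and `fakeClient` are hypothetical names, the string key stands in for a transport config, and `fakeClient` replaces the real etcd client so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
)

// fakeClient stands in for an etcd client; closed counts Close calls.
type fakeClient struct{ closed int }

func (c *fakeClient) Close() { c.closed++ }

type entry struct {
	client *fakeClient
	refs   int
}

// registry shares one client per transport key and counts references.
type registry struct {
	mu      sync.Mutex
	clients map[string]*entry
}

func newRegistry() *registry {
	return &registry{clients: map[string]*entry{}}
}

// acquire returns the shared client for key, plus a release func that
// is safe to call more than once and closes the client at zero refs.
func (r *registry) acquire(key string) (*fakeClient, func()) {
	r.mu.Lock()
	defer r.mu.Unlock()
	e, ok := r.clients[key]
	if !ok {
		e = &entry{client: &fakeClient{}}
		r.clients[key] = e
	}
	e.refs++
	var once sync.Once
	release := func() {
		once.Do(func() {
			r.mu.Lock()
			defer r.mu.Unlock()
			e.refs--
			if e.refs == 0 {
				delete(r.clients, key)
				e.client.Close()
			}
		})
	}
	return e.client, release
}

func main() {
	r := newRegistry()
	a, releaseA := r.acquire("etcd:2379")
	b, releaseB := r.acquire("etcd:2379")
	fmt.Println(a == b) // prints "true": both holders share one client
	releaseA()
	fmt.Println(a.closed) // prints "0": still referenced by b
	releaseB()
	releaseB()            // a repeated release is a no-op
	fmt.Println(a.closed) // prints "1": closed exactly once
}
```

Wrapping each release in a `sync.Once` guards against a double teardown decrementing the count twice, which would otherwise close a client that another project still holds.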
This is implemented entirely in a new self-contained etcdshared package, with no upstream or vendored changes and no fork. It builds a shared-client storage backend and wraps it with the unchanged upstream watch-cache, then points the existing project-aware decorator at it (one-line swap).
Expected outcome
Each apiserver pod should drop well below its current ~16 GB heap and shed most of those background goroutines, giving real headroom under the memory limit and quieting the recurring near-limit alerts — a low-risk stepping stone ahead of the larger storage rework in #596, not a competing effort.
Status — draft, for testing
Opened in draft to validate in a real environment before review.
Done:
go build ./...).
-race: clients are shared across projects, kept alive while any project still holds a reference, and closed exactly once at the last release.
Remaining before ready-for-review:
k8s.io/apiserver bump (currently pinned to v0.35.0).
Refs: datum-cloud/engineering#323, #596