Skip to content

perf(apiserver): share etcd client across project control planes#669

Draft
scotwells wants to merge 9 commits into
mainfrom
perf/shared-etcd-client-pooling
Draft

perf(apiserver): share etcd client across project control planes#669
scotwells wants to merge 9 commits into
mainfrom
perf/shared-etcd-client-pooling

Conversation

@scotwells

Copy link
Copy Markdown
Contributor

Why

The milo apiserver opens a separate, dedicated connection to etcd for every project × resource-type combination and never shares them. Across our ~500 project control planes that's tens of thousands of mostly-idle connections, each carrying its own buffers and background workers.

In production that fan-out — not the object cache — is what drives apiserver memory: it accounts for the large majority of a pod's ~23 GB, split between the per-connection buffers and the ~1.4M goroutines those connections spin up. It's the main contributor to the OOM churn tracked in datum-cloud/engineering#323, and a meaningful chunk of the storage pressure behind #596.

What

Share a single etcd client per transport config across all projects and resources instead of one-per-combination. Every project already talks to the same etcd with the same credentials — only the key prefix differs, and that prefix is applied at the store layer, so the connection carries no per-project state and is safe to pool. etcd connections multiplex many watches over one link, so collapsing the duplicates removes redundant overhead without changing behavior or throughput.

Clients are reference-counted and closed only when the last project storage using them is torn down — the same pattern the etcd compactor already uses.

This is implemented entirely in a new self-contained etcdshared package — no upstream or vendored changes, no fork. It builds a shared-client storage backend and wraps it with the unchanged upstream watch-cache, then points the existing project-aware decorator at it (one-line swap).

Expected outcome

Each apiserver pod should drop well below its current ~16 GB heap and shed most of those background goroutines, giving real headroom under the memory limit and quieting the recurring near-limit alerts — a low-risk stepping stone ahead of the larger storage rework in #596, not a competing effort.

Status — draft, for testing

Opened in draft to validate in a real environment before review.

Done:

  • Builds clean (go build ./...).
  • Reference-count lifecycle is unit-tested, including -race: clients are shared across projects, kept alive while any project still holds a reference, and closed exactly once at the last release.

Remaining before ready-for-review:

  • Real-etcd validation that two project prefixes sharing one connection read/write fully isolated keyspaces (currently argued structurally; embedded etcd isn't available in unit tests).
  • Compactor lifecycle guard. The shared-client compactor uses a process-global per-endpoint map that is also used by the default storage path; teardown ordering across the two owners needs a guard before this runs in prod.
  • Confirm steady-state memory reduction on a staging apiserver under representative project load.
  • Re-diff the copied storage helpers on any k8s.io/apiserver bump (currently pinned to v0.35.0).

Refs: datum-cloud/engineering#323, #596

scotwells and others added 2 commits June 25, 2026 12:38
The apiserver opened a dedicated etcd connection for every
(project x resource) pair and never shared them. Across ~500 project
control planes this produced tens of thousands of mostly-idle
connections that dominated apiserver memory through per-connection gRPC
read/write buffers and the goroutines each connection spawns.

Share a single etcd client per transport config across all projects and
resources. Per-project isolation is the etcd key prefix, applied at the
store layer, so the connection carries no per-project state and is safe
to pool. Clients are reference-counted and closed only when the last
project storage using them is torn down.

Implemented as a self-contained etcdshared package with no upstream or
vendor changes: it builds a shared-client raw storage backend and wraps
it with the unchanged upstream cacher, then points the project-aware
decorator at it.

Claude-Session: https://claude.ai/code/session_01PgQX8ky2mbuEieE7BR5Eu8
@savme

savme commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Hey @scotwells, just a heads up: I'm testing this build in staging, so I pushed a tiny change to transportKey directly to your branch — just wanted to let you know so you're not surprised by it!

@scotwells

Copy link
Copy Markdown
Contributor Author

@savme no worries! Feel free to change as needed

@savme

savme commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Deployed the prototype to staging. Initial results look really promising!

❯ go tool pprof -top -inuse_space milo-apiserver-2026-06-26-staging-75fcc7dc9b-t24pg-heap.pprof
Type: inuse_space
Time: 2026-06-26 17:55:26 CEST
Showing nodes accounting for 993.56MB, 73.12% of 1358.75MB total
Dropped 1091 nodes (cum <= 6.79MB)
      flat  flat%   sum%        cum   cum%
  178.67MB 13.15% 13.15%   230.19MB 16.94%  k8s.io/apiserver/pkg/storage/cacher.NewCacherFromConfig
   96.54MB  7.11% 20.25%    96.54MB  7.11%  runtime.malg
   70.02MB  5.15% 25.41%    80.02MB  5.89%  encoding/json.(*decodeState).objectInterface
   69.67MB  5.13% 30.54%    70.67MB  5.20%  k8s.io/apiserver/pkg/storage/etcd3.(*watcher).createWatchChan
   63.01MB  4.64% 35.17%    63.01MB  4.64%  k8s.io/apimachinery/pkg/runtime.DeepCopyJSONValue
   45.01MB  3.31% 38.49%    45.01MB  3.31%  net/http.(*Request).WithContext (partial-inline)
   41.66MB  3.07% 41.55%    41.66MB  3.07%  bufio.NewWriterSize (inline)
   26.01MB  1.91% 43.47%   147.03MB 10.82%  k8s.io/apimachinery/pkg/runtime.structToUnstructured
   ...

At ~90 minutes post-deployment, we're at around 1GB of heap with no transport, bufio.Reader, or http2 in sight. For reference, a deployment of main typically climbs to 3.5GB of heap at this point.

@scotwells

scotwells commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

Deployed this branch to staging (v0.0.0-perf-shared-etcd-client-pooling-20260626-143242) and it knocks over watch-cache consistency, which in turn crashloops network-services-operator (and any client using streaming WatchList / sendInitialEvents).

Symptoms

  • milo-apiserver logs flooded on all 3 replicas (thousands/min):
    • cacher Cache consistency check error: failed calculating etcd digest: The resourceVersion for the provided list is too old
    • etcd3 Too large resource version: 1280725601, current: 1280631907
  • NSO's Project / DNSRecordSet informers never sync — their sendInitialEvents=true watch closes at ~3s with no end-bookmark — so the manager fails startup (unable to start: failed to wait for ... caches to sync) and CrashLoopBackOffs.
  • Onset was gradual: milo rolled at 14:43, NSO started crashing ~18:00, as the desync widened across compaction cycles.

Likely cause

We share one *kubernetes.Client across every project×resource cacher. That client isn't just a connection — it carries single-consumer watch/progress state, and upstream creates one per store for a reason. The cacher's consistent-list / WatchList path asks for the current global etcd revision and then waits for its cache to reach it, relying on per-watch progress notifications to advance idle per-prefix caches. With hundreds of cachers multiplexed onto one shared client, that per-store RV/progress routing gets crossed, so a cache can't reach the requested global RV → Too large resource version (and, post-compaction, resourceVersion too old).

This is the same risk flagged in the unchecked box — "two project prefixes sharing one connection read/write fully isolated keyspaces" — they don't, under the shared watch-cache.

Suggested directions

  1. Share the gRPC ClientConn, not the kubernetes.Client — one dialed connection per transportKey, but a separate kubernetes.Client/cacher per store over it. Kills the per-connection buffers (the dominant memory cost) while keeping each watch cache's progress/RV isolated and upstream-identical. (Per-client watcher goroutines remain — worth measuring whether that alone clears the OOM headroom.)
  2. Or skip the cacher for project storage — serve LIST/WatchList straight from etcd via the shared client. Loses cache benefits, keeps pooling, guaranteed-correct.
  3. Compactor guard (the existing open checklist item) — startCompactorOnce keys on fmt.Sprintf("%v", c.Transport) while acquireClient keys on transportKey(), and both feed the process-global per-endpoint map shared with the default path. Two compactors racing one endpoint can compact inside the cacher's history window and accelerate the "compacted" failures.

Staging stopgap while iterating: revert milo to v0.0.0-main-20260625-165547, or set --watch-cache=false / feature-gates=WatchList=false,ConsistentListFromCache=false.

Sharing a single etcd client across every project control plane collapsed
~50k watch streams onto one client (~19.7k watchers multiplexed ~60 per
stream). etcd delivery stayed healthy, but milo watch-cache p99 read-wait
pegged at the 3s block-timeout and consistency-check errors climbed
steadily, because per-cacher RequestWatchProgress fans out across every
watcher on the shared client - O(N^2) progress amplification - so idle
per-prefix caches never reach the global revision a consistent read
demands. Streaming WatchList consumers (network-services-operator) then
fail their initial cache sync and crashloop.

Pool sharedClientPoolSize (32) connections per transport and assign stores
round-robin. This cuts watchers-per-client ~32x, shrinking the progress
fan-out per client by the same factor, while still collapsing the tens of
thousands of per-(project x resource) connections the package replaced -
the memory win is preserved (a few dozen connections, not one per
resource).
Replace the hardcoded sharedClientPoolSize const with a
--shared-etcd-client-pool-size flag (default 32, min 1) so the
per-transport pool can be tuned per environment without rebuilding the
image. The parsed value is pushed into the etcdshared package once at
apiserver startup, before any storage is built.

Key changes:
- Add SetSharedClientPoolSize to the etcdshared package; sharedClientPoolSize
  becomes a package var clamped to a floor of 1
- Register --shared-etcd-client-pool-size and apply it at the top of Run
Map the new --shared-etcd-client-pool-size flag to a
SHARED_ETCD_CLIENT_POOL_SIZE env var (default 32) in the apiserver
deployment so the pool can be tuned per environment without rebuilding.
@savme

savme commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

etcdshared: LIST serialization regression

Observed. Staging showed LIST latency of item_count × 155 ms — e.g. 22 dnszones → 3.5 s, 58–71 dnsrecordsets → 9–11 s. Individual GET p50 was the expected ~150 ms (one round trip).

Likely root cause:

  • clientv3.Watcher's internal run() goroutine sends one WatchCreate to etcd at a time, waiting for a Created confirmation (one round trip, 155 ms) before sending the next. When 78 stores concurrently create or recreate watches, they queue and confirm serially: up to 78 × 155 ms ≈ 12 s. During that window each cacher is !Ready() and consistent LISTs fall back to raw etcd immediately.
  • The same run() goroutine dispatches all incoming watch events from etcd to individual per-watch goroutines via unbuffered channels. If any downstream consumer is slow, the dispatcher stalls and no events reach any of the other 77 watches. This causes watch caches to persistently lag behind the current etcd revision. Consistent LISTs (RV="") call GetCurrentResourceVersion() (one etcd round trip, 155 ms) then block in waitUntilFreshAndBlock waiting for the cache to catch up; if it doesn't within 3 s, the request falls back to raw etcd. This is the steady-state cost: 155 ms + 3 s timeout + 155 ms fallback ≈ 3.3 s per consistent LIST.

Fix options.

  • Option A (implemented): per-store clientv3.Watcher over the shared TCP connection — no new dial. Each store's run() goroutine is isolated; watch creations are parallel.
  • Option B: bump pool size — effectively defeats the point of pooling.

Implementation. Each store gets a dedicated clientv3.Watcher over the pool's existing TCP connection; KV and lease operations continue to use the shared pool client.

@savme savme force-pushed the perf/shared-etcd-client-pooling branch from 6829f49 to 1790fe5 Compare June 29, 2026 11:12
@scotwells

Copy link
Copy Markdown
Contributor Author

@savme could we create a Grafana dashboard that we can use to monitor performance around this part of the system? I'd be curious to see how metrics have changed over time based on tweaks we're making here.

@savme

savme commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@savme could we create a Grafana dashboard that we can use to monitor performance around this part of the system? I'd be curious to see how metrics have changed over time based on tweaks we're making here.

I've put together a baseline dashboard here: https://grafana.staging.env.datum.net/goto/Twc2DJfDg?orgId=1

We can iterate on it as we go

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants