feat: configurable reconcile worker pool (--max-concurrent-reconciles) #379

Open
xrl wants to merge 1 commit into etcd-io:main from xrl:pr/reconcile-pool-and-etcd-qos

Conversation

@xrl xrl commented Jun 18, 2026

controller-runtime defaults to a single reconcile worker, so one slow cluster blocks progress on every other cluster. This adds a --max-concurrent-reconciles flag (default 5) threaded into SetupWithManager via builder.WithOptions; each EtcdCluster keeps its own workqueue key, so concurrency only parallelizes distinct clusters and a value <= 0 falls back to a single worker. Documented in docs/operator-flags.md, with test/e2e/STRESS.md recording the measured spinup-burst budget behind the stress-tier batching. Tested with an envtest-backed assertion that the pool size is threaded for widened, single-worker, and non-positive values.
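A minimal sketch of the flag handling described above, using only the standard library so it runs on its own. The helper names `effectiveWorkers` and `parseMaxConcurrentReconciles` and the help text are hypothetical, not taken from the PR's diff; the real change passes the value to controller-runtime through `builder.WithOptions(controller.Options{MaxConcurrentReconciles: n})`, which is only mentioned in a comment here.

```go
package main

import (
	"flag"
	"fmt"
)

// effectiveWorkers mirrors the fallback described in the PR text:
// any value <= 0 is treated as a single reconcile worker.
// (Hypothetical helper name, for illustration only.)
func effectiveWorkers(n int) int {
	if n <= 0 {
		return 1
	}
	return n
}

// parseMaxConcurrentReconciles parses args with a dedicated FlagSet so the
// sketch can be exercised without touching the process-wide flag set.
// The default of 5 comes from the PR description; the help text is assumed.
func parseMaxConcurrentReconciles(args []string) (int, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	n := fs.Int("max-concurrent-reconciles", 5,
		"Number of EtcdCluster reconciles that may run in parallel; values <= 0 mean 1.")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return effectiveWorkers(*n), nil
}

func main() {
	n, err := parseMaxConcurrentReconciles([]string{"--max-concurrent-reconciles=0"})
	if err != nil {
		panic(err)
	}
	// In the operator, n would be handed to controller-runtime, for example
	// builder.WithOptions(controller.Options{MaxConcurrentReconciles: n}).
	fmt.Println("reconcile workers:", n) // prints: reconcile workers: 1
}
```

Keeping the fallback in one small function makes the three cases the PR tests (widened, single-worker, non-positive) easy to assert without starting a manager.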


PR series — operability fixes & TLS

Small single-purpose PRs from live kind-cluster testing of the operator. Each stands alone unless an After is listed. → = this PR.

| PR | Lands | After |
|---|---|---|
| Reviewable now: small, any order | | |
| 🟢 #374 | Requeue instead of swallowing the client-cert provisioning error | |
| 🟢 #391 | Grant events.k8s.io RBAC so operator Events are actually recorded | |
| 🟢 #392 | Correlate members[]/leaderID from one health snapshot (consistent leader) | |
| 🟢 #393 | Propagate user-supplied altNames.ipAddresses into certificates | |
| 🟢 #394 | Accept day-suffix validityDuration (365d, 100d12h) as documented | |
| 🟢 #395 | Surface early reconcile errors as a Degraded condition (was empty status) | |
| 🟢→ #379 | Configurable reconcile worker pool (--max-concurrent-reconciles) | |
| 🟢 #369 | kind-based stress/scale e2e harness (1/3/7 members, churn, quorum watcher) | |
| TLS stack: in order | | |
| 🟢 #376 | Independent spec.tls.{peer,client} surfaces (breaking alpha API) | |
| #377 | TLSReady condition + TLS lifecycle Events | #376 |
| #378 | Multi-member TLS quorum e2e + PeerCANotShared | #377 |
| Parked as drafts pending #363 | | |
| #382 | Per-cluster domain metrics on the operator /metrics endpoint | |
| #384 | EtcdCluster admission webhooks (consolidating with #328) | #363 |
| #386 | EtcdBackup CR → object storage (S3/GCS) | #363 |
| #387 | Automatic quorum-loss disaster recovery (bootstrap-latch guarded) | #363 |

🟢 ready · ⚪ draft

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @xrl. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Add a --max-concurrent-reconciles flag (default 5) and thread it into
SetupWithManager via builder.WithOptions(controller.Options{...}).

controller-runtime defaults to a single reconcile worker. Each EtcdCluster
is reconciled on its own workqueue key (deduped by namespaced name), so a
cluster is never reconciled by two workers at once -- concurrency only
parallelizes distinct clusters, with no intra-cluster races. Reconciles
here are heavy and long-running (StatefulSet patches, member-list/health
RPCs against managed etcd, certificate work), so a small pool improves
multi-cluster throughput; the cost of a larger pool is more simultaneous
apiserver and managed-etcd load. A value <= 0 falls back to the safe
default of 1.

Document the flag in its help text, a doc comment on the reconciler field,
and a new docs/operator-flags.md (linked from the README). Add an
envtest-backed test asserting SetupWithManager threads the pool size for
widened, single-worker, and non-positive fallback values.

test/e2e/STRESS.md records the measured spinup-burst budget behind the
stress e2e batching and why the worker pool, not namespace isolation, is
the lever for overlapping heavy spinups.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
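The commit message's claim that a cluster is never reconciled by two workers at once, while distinct clusters run in parallel, can be illustrated with a toy deduplicating queue. This is a sketch under stated assumptions, not the client-go workqueue or the PR's code: `keyedQueue` and `runPool` are hypothetical names, `Get` is non-blocking (workers exit when the queue is empty) to keep the example short, and the keys are made-up namespaced names.

```go
package main

import (
	"fmt"
	"sync"
)

// keyedQueue is a toy stand-in for a deduplicating workqueue: a key already
// waiting is not queued twice, and a key being processed is not handed to a
// second worker until Done is called for it.
type keyedQueue struct {
	mu         sync.Mutex
	queue      []string
	dirty      map[string]bool // keys that still need a reconcile
	processing map[string]bool // keys currently held by a worker
}

func newKeyedQueue() *keyedQueue {
	return &keyedQueue{dirty: map[string]bool{}, processing: map[string]bool{}}
}

// Add marks key as needing work; duplicates and in-flight keys are not re-queued here.
func (q *keyedQueue) Add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.dirty[key] {
		return
	}
	q.dirty[key] = true
	if q.processing[key] {
		return // Done will re-queue it once the in-flight reconcile finishes
	}
	q.queue = append(q.queue, key)
}

// Get returns the next key, or false when nothing is waiting.
func (q *keyedQueue) Get() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.queue) == 0 {
		return "", false
	}
	key := q.queue[0]
	q.queue = q.queue[1:]
	q.processing[key] = true
	delete(q.dirty, key)
	return key, true
}

// Done releases key and re-queues it if an Add arrived while it was in flight.
func (q *keyedQueue) Done(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.processing, key)
	if q.dirty[key] {
		q.queue = append(q.queue, key)
	}
}

// runPool starts the given number of workers and waits until all have exited.
func runPool(q *keyedQueue, workers int, reconcile func(key string)) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key, ok := q.Get(); ok; key, ok = q.Get() {
				reconcile(key)
				q.Done(key)
			}
		}()
	}
	wg.Wait()
}

func main() {
	q := newKeyedQueue()
	for _, k := range []string{"ns1/a", "ns1/b", "ns1/a"} { // duplicate key collapses
		q.Add(k)
	}
	var mu sync.Mutex
	runs := map[string]int{}
	runPool(q, 2, func(key string) {
		mu.Lock()
		runs[key]++
		mu.Unlock()
	})
	fmt.Println(len(runs), runs["ns1/a"], runs["ns1/b"]) // prints: 2 1 1
}
```

With two workers, `ns1/a` and `ns1/b` can be reconciled at the same time, but a second `Add("ns1/a")` while it is in flight only schedules a follow-up run after `Done`.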
@xrl force-pushed the pr/reconcile-pool-and-etcd-qos branch from 40cd013 to a189e28 on July 2, 2026 03:00
@kubernetes-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xrl changed the title from "feat: configurable reconcile worker pool + Burstable etcd QoS" to "feat: configurable reconcile worker pool (--max-concurrent-reconciles)" on Jul 2, 2026
xrl commented Jul 2, 2026

Author

Split out the etcd-container QoS change (it edits the StatefulSet template, which #363 is about to replace — parked on a separate branch until that lands). This PR is now only the worker pool.
