feat: configurable reconcile worker pool (--max-concurrent-reconciles) by xrl · Pull Request #379 · etcd-io/etcd-operator

xrl · 2026-06-18T17:55:35Z

controller-runtime defaults to a single reconcile worker, so one slow cluster blocks progress on every other cluster. This adds a --max-concurrent-reconciles flag (default 5) threaded into SetupWithManager via builder.WithOptions; each EtcdCluster keeps its own workqueue key, so concurrency only parallelizes distinct clusters and a value <= 0 falls back to a single worker. Documented in docs/operator-flags.md, with test/e2e/STRESS.md recording the measured spinup-burst budget behind the stress-tier batching. Tested with an envtest-backed assertion that the pool size is threaded for widened, single-worker, and non-positive values.
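The flag default and the non-positive fallback described above can be illustrated with a minimal stdlib-only sketch. The names `parseWorkers` and `effectiveWorkers` are hypothetical and are not taken from the PR; in the operator itself the resulting value is passed to controller-runtime through `builder.WithOptions`, and where exactly the fallback is applied is an assumption here.

```go
package main

import (
	"flag"
	"fmt"
)

// effectiveWorkers is a hypothetical helper: any value <= 0 falls back to a
// single worker, matching the behavior the PR description states.
func effectiveWorkers(requested int) int {
	if requested <= 0 {
		return 1
	}
	return requested
}

// parseWorkers is a hypothetical stand-in for the operator's flag wiring.
// It registers --max-concurrent-reconciles with a default of 5 and returns
// the pool size after applying the fallback.
func parseWorkers(args []string) (int, error) {
	fs := flag.NewFlagSet("operator", flag.ContinueOnError)
	requested := fs.Int("max-concurrent-reconciles", 5,
		"maximum number of EtcdCluster reconciles to run in parallel; values <= 0 fall back to 1")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	return effectiveWorkers(*requested), nil
}

func main() {
	cases := [][]string{
		nil, // flag omitted: default applies
		{"--max-concurrent-reconciles=10"},
		{"--max-concurrent-reconciles=0"},
		{"--max-concurrent-reconciles=-3"},
	}
	for _, args := range cases {
		n, err := parseWorkers(args)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(args, "->", n)
	}
}
```

Normalizing in one small function keeps the "widened, single-worker, and non-positive" cases easy to cover in a table-driven test, which is the same split the PR's envtest assertion uses.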

PR series — operability fixes & TLS

Small single-purpose PRs from live kind-cluster testing of the operator. Each stands alone unless an After is listed. → = this PR.

	PR	Lands	After
	Reviewable now — small, any order
🟢	#374	Requeue instead of swallowing the client-cert provisioning error	—
🟢	#391	Grant `events.k8s.io` RBAC so operator Events are actually recorded	—
🟢	#392	Correlate `members[]`/`leaderID` from one health snapshot (consistent leader)	—
🟢	#393	Propagate user-supplied `altNames.ipAddresses` into certificates	—
🟢	#394	Accept day-suffix `validityDuration` (`365d`, `100d12h`) as documented	—
🟢	#395	Surface early reconcile errors as a `Degraded` condition (was empty status)	—
🟢→	#379	Configurable reconcile worker pool (`--max-concurrent-reconciles`)	—
🟢	#369	kind-based stress/scale e2e harness (1/3/7 members, churn, quorum watcher)	—
	TLS stack — in order
🟢	#376	Independent `spec.tls.{peer,client}` surfaces (breaking alpha API)	—
⚪	#377	`TLSReady` condition + TLS lifecycle Events	#376
⚪	#378	Multi-member TLS quorum e2e + `PeerCANotShared`	#377
	Parked as drafts pending #363
⚪	#382	Per-cluster domain metrics on the operator `/metrics` endpoint	—
⚪	#384	EtcdCluster admission webhooks (consolidating with #328)	#363
⚪	#386	`EtcdBackup` CR → object storage (S3/GCS)	#363
⚪	#387	Automatic quorum-loss disaster recovery (bootstrap-latch guarded)	#363

🟢 ready · ⚪ draft

k8s-ci-robot · 2026-06-18T17:55:39Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2026-06-18T17:55:46Z

Hi @xrl. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Add a --max-concurrent-reconciles flag (default 5) and thread it into SetupWithManager via builder.WithOptions(controller.Options{...}).

controller-runtime defaults to a single reconcile worker. Each EtcdCluster is reconciled on its own workqueue key (deduped by namespaced name), so a cluster is never reconciled by two workers at once -- concurrency only parallelizes distinct clusters, with no intra-cluster races. Reconciles here are heavy and long-running (StatefulSet patches, member-list/health RPCs against managed etcd, certificate work), so a small pool improves multi-cluster throughput; the cost of a larger pool is more simultaneous apiserver and managed-etcd load. A value <= 0 falls back to the safe default of 1.

Document the flag in its help text, a doc comment on the reconciler field, and a new docs/operator-flags.md (linked from the README). Add an envtest-backed test asserting SetupWithManager threads the pool size for widened, single-worker, and non-positive fallback values.

test/e2e/STRESS.md records the measured spinup-burst budget behind the stress e2e batching and why the worker pool, not namespace isolation, is the lever for overlapping heavy spinups.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Xavier Lange <xrlange@gmail.com>
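The per-key deduplication argument in the commit message can be made concrete with a toy model. This is a simplified, single-threaded sketch of the dirty/processing bookkeeping idea, not the client-go workqueue implementation; the `keyQueue` type and the cluster names are hypothetical.

```go
package main

import "fmt"

// keyQueue is a toy model of a deduplicating workqueue keyed by namespaced
// name. It only illustrates why one key is never held by two workers at once.
type keyQueue struct {
	order      []string            // keys waiting to be handed to a worker
	dirty      map[string]struct{} // keys that need (re)processing
	processing map[string]struct{} // keys currently held by a worker
}

func newKeyQueue() *keyQueue {
	return &keyQueue{
		dirty:      map[string]struct{}{},
		processing: map[string]struct{}{},
	}
}

// Add enqueues key unless it is already pending. If a worker currently holds
// the key, it is only marked dirty and re-queued when that worker calls Done.
func (q *keyQueue) Add(key string) {
	if _, pending := q.dirty[key]; pending {
		return
	}
	q.dirty[key] = struct{}{}
	if _, inFlight := q.processing[key]; inFlight {
		return
	}
	q.order = append(q.order, key)
}

// Get hands the next ready key to a worker, or reports false if none is ready.
func (q *keyQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	q.processing[key] = struct{}{}
	delete(q.dirty, key)
	return key, true
}

// Done releases key; if it was re-added while in flight it is queued again.
func (q *keyQueue) Done(key string) {
	delete(q.processing, key)
	if _, pending := q.dirty[key]; pending {
		q.order = append(q.order, key)
	}
}

func main() {
	q := newKeyQueue()
	q.Add("default/cluster-a")
	q.Add("default/cluster-b")

	a, _ := q.Get()            // worker 1 takes cluster-a
	q.Add("default/cluster-a") // a new event arrives while cluster-a is in flight
	b, _ := q.Get()            // worker 2 gets cluster-b, not a second cluster-a
	_, ready := q.Get()        // a third worker finds nothing ready
	fmt.Println(a, b, ready)

	q.Done(a) // only now does cluster-a become available again
	again, _ := q.Get()
	fmt.Println(again)
}
```

In this model a larger pool only helps when distinct keys are waiting, which matches the claim that concurrency parallelizes distinct clusters without introducing intra-cluster races.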

kubernetes-prow · 2026-07-02T03:00:11Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xrl
Once this PR has been reviewed and has the lgtm label, please assign ivanvc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

xrl · 2026-07-02T03:00:46Z

Split out the etcd-container QoS change (it edits the StatefulSet template, which #363 is about to replace — parked on a separate branch until that lands). This PR is now only the worker pool.

k8s-ci-robot added the needs-ok-to-test label Jun 18, 2026

k8s-ci-robot added the size/L label Jun 18, 2026

xrl force-pushed the pr/reconcile-pool-and-etcd-qos branch from 40cd013 to a189e28 Compare July 2, 2026 03:00

xrl changed the title ~~feat: configurable reconcile worker pool + Burstable etcd QoS~~ feat: configurable reconcile worker pool (--max-concurrent-reconciles) Jul 2, 2026

xrl wants to merge 1 commit into etcd-io:main from xrl:pr/reconcile-pool-and-etcd-qos
