Configurable Leader Election #68

Open
JoseSzycho wants to merge 1 commit into main from feat/configurable-leader-election

Summary

Implements configurable leader election flags and a preStop lifecycle hook to prevent transient deployment alert flapping during rolling updates.

Issue

Rolling updates triggered the alert:

🔥 BillingControllerManagerDeploymentUnavailable • 1 replica(s) unavailable for 5 minutes

Probable Root Causes

  1. CSI Certificate Mounting Latency: New pods spent ~5.5 minutes in ContainerCreating waiting for csi.cert-manager.io to issue certificates and GKE to provision nodes.
  2. Transient API Server Timeouts: Pods briefly timed out trying to renew leader leases against milo-apiserver during concurrent upgrades, causing lease failures.

Solution

  1. Configurable Flags: Added --leader-elect-lease-duration (30s), --leader-elect-renew-deadline (20s), and --leader-elect-retry-period (5s) flags to make the operator resilient to brief API server unavailability.
  2. PreStop Hook: Added a sleep 15 preStop lifecycle hook and raised terminationGracePeriodSeconds to 60 in manager.yaml so that the terminating leader pod can finish its current lease cycle gracefully.

Warning

If the alert keeps firing after this change, further investigation will be needed.

JoseSzycho requested review from savme and scotwells on June 26, 2026 16:13