Add update throttling to prevent rapid deployment churn #187

Merged
toelke merged 1 commit into master from feature/backoff-issue-182
Mar 3, 2026

Conversation

@toelke

@toelke toelke commented Dec 4, 2025

Collaborator

Implements a minimum interval between updates (default: 10s, configurable) to prevent Wave from updating deployments too frequently when secrets or configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets causes Wave to rapidly update deployments, which can overwhelm the Kubernetes API server.

Key features:

  • Fixed minimum interval between updates (default: 10s)
  • Configurable via --min-update-interval flag
  • Configurable via Helm chart (minUpdateInterval value)
  • State tracked in-memory within the operator
  • Thread-safe implementation with mutex protection
  • Applies to ALL updates, even when config hashes change
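
A minimal sketch of the per-workload interval check described in the list above, assuming the state is a mutex-guarded map keyed by namespace/name. `Throttle`, `Allow` and `simulate` are hypothetical names chosen for illustration, not Wave's actual code, and the clock is passed in so the behavior can be replayed deterministically:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Throttle is a hypothetical helper: it remembers when each workload was
// last updated and rejects updates that arrive too soon after that.
type Throttle struct {
	mu          sync.Mutex
	minInterval time.Duration
	lastUpdate  map[string]time.Time
}

func NewThrottle(minInterval time.Duration) *Throttle {
	return &Throttle{
		minInterval: minInterval,
		lastUpdate:  make(map[string]time.Time),
	}
}

// Allow reports whether key may be updated at time now. When the update is
// rejected, it also returns how long the caller should wait before retrying.
func (t *Throttle) Allow(key string, now time.Time) (bool, time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if last, ok := t.lastUpdate[key]; ok {
		if elapsed := now.Sub(last); elapsed < t.minInterval {
			return false, t.minInterval - elapsed
		}
	}
	// Only allowed updates move the timestamp forward.
	t.lastUpdate[key] = now
	return true, 0
}

// simulate replays update attempts for one workload at the given offsets
// (in seconds) and reports which attempts were allowed.
func simulate(minIntervalSec int, offsetsSec []int) []bool {
	th := NewThrottle(time.Duration(minIntervalSec) * time.Second)
	start := time.Now()
	allowed := make([]bool, 0, len(offsetsSec))
	for _, off := range offsetsSec {
		at := start.Add(time.Duration(off) * time.Second)
		ok, _ := th.Allow("default/my-deployment", at)
		allowed = append(allowed, ok)
	}
	return allowed
}

func main() {
	// Attempts at t+0s, t+3s and t+10s with the default 10s interval.
	fmt.Println(simulate(10, []int{0, 3, 10}))
}
```

With a 10s interval, the attempt at t+3s is rejected and the one at t+10s goes through. The returned wait time could serve as a requeue delay, though how the operator reacts to a rejected update is an assumption here.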

Fixes #182

🤖 Generated with Claude Code

@toelke

toelke commented Dec 4, 2025

Collaborator Author

I did not want to close #183, I just wanted to rename the branch :-(

@toelke

toelke commented Dec 4, 2025

Collaborator Author

I decided that the semantics of an increasing backoff period are not easy: When to decrease the backoff-period again?

So this is now with a static backoff-period. I think this is enough for all workloads.

@toelke toelke marked this pull request as ready for review December 4, 2025 12:13
@toelke toelke requested a review from a team as a code owner December 4, 2025 12:13
@jabdoa2

jabdoa2 commented Dec 4, 2025

Contributor

> So this is now with a static backoff-period. I think this is enough for all workloads.

I guess that should be good enough for most cases. It would have delayed the issue in our case. Not sure if it would have prevented the issue completely though (given enough time passed). In case it happens again I would try to add more backoff.

> I decided that the semantics of an increasing backoff period are not easy: When to decrease the backoff-period again?

I guess we would have to check if the last backoff passed when an update occurs (maybe with a few secs of slack). If that is the case we could reset the backoff. That would not make sure that the Deployment got ready but that updates did not come in too fast.
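
One way to read the reset rule proposed here, as a rough sketch only: the backoff window doubles while updates keep arriving within the current window plus some slack, and drops back to the base value once a gap longer than that is observed. `Backoff`, `Observe`, `simulate` and all the numbers are hypothetical. None of this was merged, and the type is not safe for concurrent use:

```go
package main

import (
	"fmt"
	"time"
)

// Backoff is a hypothetical escalating backoff for a single workload.
type Backoff struct {
	base, max, slack time.Duration
	current          time.Duration // window that followed the last update
	last             time.Time     // time of the last observed update
}

// Observe records an update at time now and returns the window during
// which further updates to the same workload should be held back.
func (b *Backoff) Observe(now time.Time) time.Duration {
	if b.last.IsZero() || now.Sub(b.last) >= b.current+b.slack {
		// First update, or the previous window passed quietly: reset.
		b.current = b.base
	} else {
		// Updates are still arriving too fast: escalate, capped at max.
		b.current *= 2
		if b.current > b.max {
			b.current = b.max
		}
	}
	b.last = now
	return b.current
}

// simulate replays updates at the given offsets (in seconds) and returns
// the backoff window chosen after each one.
func simulate(baseSec, maxSec, slackSec int, offsetsSec []int) []time.Duration {
	b := &Backoff{
		base:  time.Duration(baseSec) * time.Second,
		max:   time.Duration(maxSec) * time.Second,
		slack: time.Duration(slackSec) * time.Second,
	}
	start := time.Now()
	windows := make([]time.Duration, 0, len(offsetsSec))
	for _, off := range offsetsSec {
		at := start.Add(time.Duration(off) * time.Second)
		windows = append(windows, b.Observe(at))
	}
	return windows
}

func main() {
	// Updates released at the end of each window, then a long pause.
	fmt.Println(simulate(10, 80, 2, []int{0, 10, 30, 200}))
}
```

The slack matters because an update held back until the end of a window arrives exactly one window after the previous one. Without slack, that update would count as "the window passed" and reset the backoff on every step.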

@toelke toelke force-pushed the feature/backoff-issue-182 branch from 3dc7cc5 to a648036 Compare December 4, 2025 12:26
@toelke

toelke commented Dec 4, 2025

Collaborator Author

> I guess we would have to check if the last backoff passed when an update occurs (maybe with a few secs of slack). If that is the case we could reset the backoff. That would not make sure that the Deployment got ready but that updates did not come in too fast.

Yes, I was thinking along those lines: if the last update was not stopped and is currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…

@jabdoa2

jabdoa2 commented Dec 4, 2025

Contributor

> > I guess we would have to check if the last backoff passed when an update occurs (maybe with a few secs of slack). If that is the case we could reset the backoff. That would not make sure that the Deployment got ready but that updates did not come in too fast.
>
> Yes, I was thinking along those lines: if the last update was not stopped and is currentBackoff in the past, then reduce the limit for n steps. But ultimately I felt that 1 update per 10 seconds is not going to hurt in any case…

For Deployments that depends on how fast your pods get ready. Kubernetes will only then remove old ReplicaSets. If your pods are slow to start (e.g. Prometheus) then this might still create a lot of ReplicaSets.

It should not explode too fast though as it would be limited to 360 RS per hour per deployment. However, if this happens to a lot of different Deployments it might still hit. Maybe we need a global max updates per minute?

@toelke

toelke commented Dec 4, 2025

Collaborator Author

A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap.

@jabdoa2

jabdoa2 commented Dec 4, 2025

Contributor

> A global cool-down could break the base expectation that there is a restart for the "last" change of a configmap.

In general, yes. However, even the Kubernetes api has a per client qps limit (with burst) to prevent ddos. I imagine something like that.

In our larger clusters when we update the trust chain for all namespaces (using trust manager) this currently causes wave to update a few thousand pods at once (as almost all of them mount ca-certs from a configmap). Initially trust manager had bugs (which changed the configmap a few times for each cert) and that actually caused our api server to crash as well. We attributed that to trust manager but now that I think of it wave might have amplified that as well.

@toelke

toelke commented Dec 4, 2025

Collaborator Author

In that case you require all Deployments to be restarted, you just want to do it slowly, right?

@jabdoa2

jabdoa2 commented Dec 4, 2025

Contributor

> In that case you require all Deployments to be restarted, you just want to do it slowly, right?

Exactly, kubernetes api is fragile as soon as clusters grow. Guess this would be another issue. I can write something up later.

@toelke

toelke commented Dec 4, 2025

Collaborator Author

That sounds like it wants to be solved by wave putting the updates into a queue that is slowly drained (leaky bucket for rate limit?)

@jabdoa2

jabdoa2 commented Dec 4, 2025

Contributor

> That sounds like it wants to be solved by wave putting the updates into a queue that is slowly drained (leaky bucket for rate limit?)

Basically, we already got a queue for the reconciler. As soon as we block the reconciler we got that behavior. Naively we could just add a custom rate limiter during controller setup. However, that would not be 100% accurate. It would probably be better to manually call When on a rate.Limiter before an update.

@toelke toelke force-pushed the feature/backoff-issue-182 branch from a648036 to fc2375d Compare December 5, 2025 14:37
@toelke

toelke commented Dec 5, 2025

Collaborator Author

Like this?

@jabdoa2

jabdoa2 commented Dec 6, 2025

Contributor

> Like this?

Yeah that is great. Simple code and should be very effective. I guess I would set burst to 100 by default. That way wave will act instantly until you do a lot of updates at once.

Implements global rate limiting using golang.org/x/time/rate.Limiter to
prevent Wave from updating deployments too frequently when secrets or
configmaps change rapidly.

This prevents scenarios where a buggy controller rapidly updating secrets
causes Wave to rapidly update deployments, which can overwhelm the
Kubernetes API server.

Key features:
- Token bucket rate limiting (default: 1 update/sec globally, burst of 10)
- Global rate limiting shared across all deployments/statefulsets/daemonsets
- Configurable via --update-rate and --update-burst flags
- Configurable via Helm chart (updateRate and updateBurst values)
- Stalls operator pipeline via Wait() instead of requeueing
- Thread-safe implementation using rate.Limiter

Technical details:
- Uses golang.org/x/time/rate (Go's supplementary x/time module, not part of the standard library) for token bucket algorithm
- Burst allows initial rapid updates, then enforces steady-state rate
- Infinite rate (math.Inf) disables rate limiting for testing
- All resources share single global rate limiter
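
To illustrate the burst and steady-state behavior listed above, here is a simplified token bucket written against the standard library only. It is a stand-in for `golang.org/x/time/rate.Limiter`, not the merged code: `Bucket`, `Reserve`, `Unlimited` and `simulate` are hypothetical names, the clock is passed in explicitly, and the type is not safe for concurrent use. The rate of 1 update/sec and burst of 10 are the defaults named in this commit message:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Unlimited disables rate limiting, mirroring the "infinite rate" option.
var Unlimited = math.Inf(1)

// Bucket is a simplified token bucket shared by all workloads.
type Bucket struct {
	rate   float64 // tokens added per second; must be > 0
	burst  float64 // bucket capacity
	tokens float64 // tokens at time last; negative means already reserved
	last   time.Time
}

func NewBucket(rate float64, burst int, now time.Time) *Bucket {
	return &Bucket{rate: rate, burst: float64(burst), tokens: float64(burst), last: now}
}

// Reserve takes one token at time now and returns how long the caller has
// to wait before performing its update. now must not go backwards.
func (b *Bucket) Reserve(now time.Time) time.Duration {
	if math.IsInf(b.rate, 1) {
		return 0
	}
	// Refill for the elapsed time, never beyond the burst capacity.
	refill := now.Sub(b.last).Seconds() * b.rate
	b.tokens = math.Min(b.burst, b.tokens+refill)
	b.last = now
	b.tokens--
	if b.tokens >= 0 {
		return 0
	}
	// The deficit is paid back at the steady-state rate.
	return time.Duration(-b.tokens / b.rate * float64(time.Second))
}

// simulate issues n updates at the same instant and returns each delay.
func simulate(rate float64, burst, n int) []time.Duration {
	now := time.Now()
	b := NewBucket(rate, burst, now)
	delays := make([]time.Duration, 0, n)
	for i := 0; i < n; i++ {
		delays = append(delays, b.Reserve(now))
	}
	return delays
}

func main() {
	fmt.Println(simulate(1, 10, 12))        // burst of 10, then growing delays
	fmt.Println(simulate(Unlimited, 10, 3)) // limiting disabled
}
```

In this sketch the first ten simultaneous updates pass without delay and later ones are spaced one second apart. A caller that sleeps for the returned delay stalls its worker rather than requeueing, which is roughly the effect the commit message attributes to `Wait()`.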

Fixes #182

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@toelke toelke force-pushed the feature/backoff-issue-182 branch from fc2375d to d0776bd Compare March 2, 2026 14:17
@toelke

toelke commented Mar 2, 2026

Collaborator Author

So should I just merge this now and do a release?

@jabdoa2

jabdoa2 commented Mar 2, 2026

Contributor

> So should I just merge this now and do a release?

Yes do it :-)

@toelke

toelke commented Mar 2, 2026

Collaborator Author

You are not able to do a review on github, are you? Then I will force the merge...

@jabdoa2 jabdoa2 left a comment

Contributor


LGTM

@jabdoa2

jabdoa2 commented Mar 2, 2026

Contributor

Looks like I am not in the reviewer group.

@toelke toelke merged commit 1b34ce4 into master Mar 3, 2026
6 checks passed
@toelke toelke deleted the feature/backoff-issue-182 branch March 3, 2026 03:47
@toelke

toelke commented Mar 3, 2026

Collaborator Author

Good enough for me.

Development

Successfully merging this pull request may close these issues.

Wave is very fast to update Deployments and can DDoS the kubernetes API with a lot of ReplicaSets

2 participants