transparently requeue Conflict errors when using io.WithOptimisticLock() #116

Merged
kdorosh merged 1 commit into reddit:main from SriRamanujam:push-tumvswkyxszz
Jun 2, 2026
@SriRamanujam SriRamanujam commented May 27, 2026

Contributor

💸 TL;DR

When Conflict errors occur during OutputSet applies, that error gets propagated back up through the SDK and results in an ErrorResult being emitted by the controller. This queues a retry, which is great, but it also increments error metrics for an expected failure mode.

This PR checks whether the error returned from applyOutputs is a Conflict. If it is, it sets a custom Reason and propagates a RequeueResultWithBackoff up to the caller.
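
A minimal, standard-library-only sketch of the reclassification described above. Everything in it is a hypothetical stand-in rather than the achilles-sdk API: `errConflict` plays the role of a Kubernetes 409 Conflict (which real code would detect with k8serrors.IsConflict), `outcome` loosely mirrors the choice between an ErrorResult and a RequeueResultWithBackoff, and `applyOutputs` here only simulates a failing apply.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a stand-in for a Kubernetes 409 Conflict API error.
var errConflict = errors.New("the object has been modified")

// errUnrelated is a stand-in for any other apply failure.
var errUnrelated = errors.New("connection refused")

// outcome is a hypothetical, simplified view of what the reconciler reports.
type outcome struct {
	Requeue bool   // retry with backoff without reporting an error
	Err     error  // non-nil means the failure is reported as an error
	Reason  string // condition reason surfaced in status
	Message string // condition message surfaced in status
}

// applyOutputs is a hypothetical stand-in that wraps the injected failure,
// the way an apply step might wrap an API error before returning it.
func applyOutputs(fail error) error {
	if fail == nil {
		return nil
	}
	return fmt.Errorf("applying outputs: %w", fail)
}

// classify turns Conflict errors into a requeue with a custom reason and
// passes every other error through unchanged.
func classify(err error) outcome {
	switch {
	case err == nil:
		return outcome{}
	case errors.Is(err, errConflict):
		return outcome{
			Requeue: true,
			Reason:  "ApplyOutputsConflict",
			Message: fmt.Sprintf("Conflict when applying outputs: %v", err),
		}
	default:
		return outcome{Err: err}
	}
}

func main() {
	for _, cause := range []error{nil, errConflict, errUnrelated} {
		o := classify(applyOutputs(cause))
		fmt.Printf("requeue=%t reason=%q err=%v\n", o.Requeue, o.Reason, o.Err)
	}
}
```

The sketch checks with errors.Is so that a wrapped conflict is still recognized; only the Conflict case changes, and other failures keep flowing through the error path.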

📜 Details

Jira

🧪 Testing Steps / Validation

Ran make test; no errors.

✅ Checks

  • CI tests (if present) are passing
  • Adheres to code style for repo
  • Contributor License Agreement (CLA) completed if not a Reddit employee

@SriRamanujam SriRamanujam marked this pull request as ready for review May 27, 2026 21:04
@SriRamanujam SriRamanujam requested a review from a team as a code owner May 27, 2026 21:04
@DMilmont

Contributor

Took a look at this. I don't have the full context on the motivation for this change, but I wonder if this would be better handled within reconciler.go, if the goal is truly just reducing the controller-runtime error increments.

I say this as I think this diff goes against: https://github.com/reddit/achilles-sdk/blob/main/docs/sdk-apply-objects.md#achilles-sdk-conventions

> Following these two assumptions, we can optimistically apply all object updates without utilizing Kubernetes' resource version because there is no risk that any actor's update will conflict with or overwrite that of a different actor's.

An idea:

Just add something like

if k8serrors.IsConflict(err) {
    log.Infof("conflict detected, requeueing with backoff: %v", err)
    return ctrl.Result{Requeue: true}, nil
}

To

func (r *fsmReconciler[T, Obj]) Reconcile(ctx context.Context, req ctrl.Request) (res ctrl.Result, err error) {

And

func (r Result) Get(log *zap.SugaredLogger) (reconcile.Result, error) {

That should give the same retry behavior: controller-runtime's workqueue picks it up via rate-limited backoff, and the next reconcile reads "fresh" state through the cache.

One other thing I noticed:
APIApplicator.Apply mutates its input via client.Get

When wrapped in retry.RetryOnConflict, the second attempt's desired := current.DeepCopyObject() copies the server's version, not the caller's intent, so a "successful" retry can send a no-op patch where the caller expected their change to be applied. If this did happen, it would eventually correct itself, but only on the next reconciliation. Worth a closer look at least.

See:

// svc is mutated with state set on the server side (such as default values)
Eventually(func(g Gomega) {
    actual := svc.DeepCopy()
    g.Expect(c.Get(ctx, client.ObjectKeyFromObject(actual), actual)).To(Succeed())
    g.Expect(actual).To(Equal(svc))
}).Should(Succeed())
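
The hazard with an apply that mutates its input can be reproduced with a toy model. This is a sketch under stated assumptions, not the APIApplicator implementation: `object`, `server`, `apply`, and `retryOnConflict` are all hypothetical stand-ins (the last one only imitates the shape of retry.RetryOnConflict), and the conflict is injected artificially.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict is a stand-in for a Kubernetes 409 Conflict API error.
var errConflict = errors.New("conflict")

// object is a toy resource with a single field.
type object struct{ Replicas int }

// server is a toy API server that rejects the first conflictsLeft writes.
type server struct {
	stored        object
	conflictsLeft int
}

// get overwrites *into with the server's copy, like a client Get call.
func (s *server) get(into *object) { *into = s.stored }

func (s *server) update(o object) error {
	if s.conflictsLeft > 0 {
		s.conflictsLeft--
		return errConflict
	}
	s.stored = o
	return nil
}

// apply copies its input as the desired state, then refreshes the input
// from the server. After it returns, obj no longer holds the caller's intent.
func (s *server) apply(obj *object) error {
	desired := *obj
	s.get(obj)
	return s.update(desired)
}

// retryOnConflict re-runs fn while it keeps returning a conflict.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// runNaive reuses the same object across attempts, so the second attempt
// sends the server's own state back and the retry "succeeds" as a no-op.
func runNaive() int {
	s := &server{stored: object{Replicas: 1}, conflictsLeft: 1}
	obj := object{Replicas: 3}
	_ = retryOnConflict(3, func() error { return s.apply(&obj) })
	return s.stored.Replicas
}

// runWithFreshCopy rebuilds the desired object from the caller's intent on
// every attempt, so the retry still carries the intended change.
func runWithFreshCopy() int {
	s := &server{stored: object{Replicas: 1}, conflictsLeft: 1}
	intent := object{Replicas: 3}
	_ = retryOnConflict(3, func() error {
		attempt := intent
		return s.apply(&attempt)
	})
	return s.stored.Replicas
}

func main() {
	fmt.Println("naive retry stored replicas:", runNaive())
	fmt.Println("fresh-copy retry stored replicas:", runWithFreshCopy())
}
```

In this model the naive retry leaves the server at 1 replica while the fresh-copy retry reaches 3. Requeueing the whole reconcile has the same effect as the fresh-copy variant, since the desired object is rebuilt on the next pass.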

Comment thread pkg/fsm/io/outputs.go Outdated
@SriRamanujam SriRamanujam force-pushed the push-tumvswkyxszz branch 2 times, most recently from f55be16 to 8f3a5cf Compare May 28, 2026 00:12
@SriRamanujam

Contributor Author

@DMilmont @erik-ringsmuth

Thanks for the reviews guys! I took a second look at it and decided to move the error checking one layer up. I think this works out better. I also added an envtest.

// Verify that conflict was detected, Reason is set, and Ready = false
Eventually(func(g Gomega) {
actualClaim := &testv1alpha1.TestClaim{}
g.Expect(c.Get(ctx, client.ObjectKeyFromObject(claim), actualClaim)).ToNot(HaveOccurred())


Can you use .To(Succeed()) instead of .ToNot(HaveOccurred())? It reads a little clearer.

Contributor Author


np, i was copy-pasting from the test below which also uses ToNot(HaveOccurred())

Comment thread pkg/fsm/internal/test/core/controller_test.go
Comment thread pkg/fsm/internal/reconciler.go Outdated
if k8serrors.IsConflict(err) {
condition.Status = corev1.ConditionFalse
condition.Reason = "ApplyOutputsConflict"
condition.Message = "Conflict when applying outputs"


Optional: It might be good to include the error: fmt.Sprintf("Conflict when applying outputs: %v", err)

Contributor


➕ to this - the err should be added.

@DMilmont DMilmont left a comment

Contributor


The other thing worth thinking about is that this changes behavior for users of WithOptimisticLock.

My understanding is that before this change: a caller using io.WithOptimisticLock() who hits a conflict would see an ErrorResult, which in some cases they may have been counting on as a signal to abort and report rather than retry-loop.

After this change: the same caller now sees an automatic requeue. The FSM will rebuild the desired object from scratch on the next reconcile, which is probably what optimistic-lock users want anyway, but it's a semantic change worth a one-line note in the PR description / release notes and perhaps an update to the sections in the docs within this repo.

https://github.com/reddit/achilles-sdk/blob/8ff412a3b9bc875609e962741fd593ca0e842f82/docs/sdk-apply-objects.md#advanced-apply-patterns - the Kubernetes Resource Lock section should mention that Conflict errors are now handled transparently by the FSM (requeue + ApplyOutputsConflict condition) rather than being surfaced as reconcile errors.

And probably even: https://github.com/reddit/achilles-sdk/blob/8ff412a3b9bc875609e962741fd593ca0e842f82/docs/sdk-fsm-reconciler.md

Adding something under the result type descriptions mentioning that Conflict errors from OutputSet writes get reclassified as requeue results.

Comment thread pkg/fsm/internal/reconciler.go Outdated
condition.Reason = "ApplyOutputsConflict"
condition.Message = "Conflict when applying outputs"
conditions.SetConditions(condition)
return obj, conditions, types.RequeueResultWithBackoff("conflict detected when applying outputs, requeueing")

Contributor


This should also surface err

fmt.Sprintf("conflict detected when applying outputs, requeueing: %v", err)

This way it should show up in .status.conditions or controller logs, so it's easier to reason about what is happening.

@SriRamanujam SriRamanujam changed the title from "wrap OutputSet apply + patch operations in retry.RetryOnConflict()" to "transparently requeue Conflict errors when using io.WithOptimisticLock()" May 28, 2026
@SriRamanujam SriRamanujam force-pushed the push-tumvswkyxszz branch 3 times, most recently from 0fae78f to 37bc0e6 Compare May 28, 2026 20:16
@SriRamanujam

Contributor Author

Feedback addressed. Thanks for the reminder to update the docs as well! Open to notes re: phrasing/language there.

> My understanding is that before this change: a caller using io.WithOptimisticLock() who hits a conflict would see an ErrorResult, which in some cases they may have been counting on as a signal to abort and report rather than retry-loop.
>
> After this change: the same caller now sees an automatic requeue. The FSM will rebuild the desired object from scratch on the next reconcile, which is probably what optimistic-lock users want anyway, but it's a semantic change worth a one-line note in the PR description / release notes and perhaps an update to the sections in the docs within this repo.

My understanding is slightly different - OutputSet apply errors happen in the transition between states, so the caller-controlled FSM state code has no opportunity to intercept or act on them. And since an ErrorResult queues a retry with backoff regardless, my hope is that this is a very roundabout way to change the log level of Conflict errors and nothing more 😆
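
To make the distinction concrete, here is a toy model of how a controller loop might treat the two return shapes. It is a sketch, not controller-runtime code: `reconcileResult` stands in for reconcile.Result, and `stats` and `observe` are hypothetical names. The assumption is the one stated in this thread: both paths are retried with backoff, but only a returned error is counted as an error.

```go
package main

import (
	"errors"
	"fmt"
)

// reconcileResult is a stand-in for controller-runtime's reconcile.Result.
type reconcileResult struct{ Requeue bool }

// stats counts what a toy controller loop observes.
type stats struct {
	Errors   int // stand-in for an error metric
	Requeues int // every retry, whether caused by an error or a requeue
}

// observe mimics handling one Reconcile return value: a non-nil error is
// counted and retried, while Requeue: true is retried without being counted.
func (s *stats) observe(res reconcileResult, err error) {
	switch {
	case err != nil:
		s.Errors++
		s.Requeues++
	case res.Requeue:
		s.Requeues++
	}
}

// errorPath models the behavior before the change: the conflict is returned.
func errorPath() (reconcileResult, error) {
	return reconcileResult{}, errors.New("conflict")
}

// requeuePath models the behavior after the change: the conflict is
// swallowed and replaced by an explicit requeue.
func requeuePath() (reconcileResult, error) {
	return reconcileResult{Requeue: true}, nil
}

func main() {
	var before, after stats
	before.observe(errorPath())
	after.observe(requeuePath())
	fmt.Printf("before: errors=%d requeues=%d\n", before.Errors, before.Requeues)
	fmt.Printf("after:  errors=%d requeues=%d\n", after.Errors, after.Requeues)
}
```

In this model the retry count is the same on both paths and only the error counter differs, which matches the intent of reducing alerting noise without changing retry behavior.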

Definitely still worth a callout in the release notes though, that's a good idea.

@kdorosh

kdorosh commented Jun 2, 2026

Contributor

👋 GM @DMilmont; any more feedback here? This behavior is causing us spurious errors and alerting noise, so we'd like to get this looked at whenever you have some cycles. Thanks!

@kdorosh

kdorosh commented Jun 2, 2026

Contributor

Hey! I saw your comment in GitHub. I am OOO and can't login to GitHub to approve but feel free to merge in that PR - no further comments from me

DMilmont

@kdorosh kdorosh merged commit 0e7a46d into reddit:main Jun 2, 2026
1 check passed

5 participants