transparently requeue Conflict errors when using io.WithOptimisticLock()#116
Conversation
|
Took a look at this, and I don't have the full context on the motivation of this change, but I wonder if this would be better handled within I say this as I think this diff goes against: https://github.com/reddit/achilles-sdk/blob/main/docs/sdk-apply-objects.md#achilles-sdk-conventions
An idea: Just add something like if k8serrors.IsConflict(err) {
log.Infof("conflict detected, requeueing with backoff: %v", err)
return ctrl.Result{Requeue: true}, nil
}To And achilles-sdk/pkg/fsm/types/results.go Line 63 in 8ff412a That should get the same retry behavior and controller-runtime's workqueue picks it up via rate-limited backoff and the next reconcile reads "fresh" state through the cache. One other thing I noticed: When wrapped in See: achilles-sdk/pkg/io/applicator_test.go Lines 76 to 81 in 8ff412a |
f55be16 to
8f3a5cf
Compare
|
Thanks for the reviews guys! I took a second look at it and decided to move the error checking one layer up. I think this works out better. I also added an envtest. |
| // Verify that conflict was detected, Reason is set, and Ready = false | ||
| Eventually(func(g Gomega) { | ||
| actualClaim := &testv1alpha1.TestClaim{} | ||
| g.Expect(c.Get(ctx, client.ObjectKeyFromObject(claim), actualClaim)).ToNot(HaveOccurred()) |
There was a problem hiding this comment.
Can you use .To(Succeed()) instead of .ToNot(HaveOccurred())? It reads a little clearer.
There was a problem hiding this comment.
np, i was copy-pasting from the test below which also uses ToNot(HaveOccurred())
| if k8serrors.IsConflict(err) { | ||
| condition.Status = corev1.ConditionFalse | ||
| condition.Reason = "ApplyOutputsConflict" | ||
| condition.Message = "Conflict when applying outputs" |
There was a problem hiding this comment.
Optional: It might be good to include the error fmt.Sprintf("Conflict when applying outputs: %v", err)
There was a problem hiding this comment.
➕ to this - the err should be added.
DMilmont
left a comment
There was a problem hiding this comment.
The other thing worth thinking about is that there would be a behavior change with the usage of WithOptimisticLock
My understanding is that before this change: a caller using io.WithOptimisticLock() who hits a conflict would see an ErrorResult, which in some cases they may have been counting on as a signal to abort and report rather than retry-loop.
After this change: the same caller now sees an automatic requeue. The FSM will rebuild the desired object from scratch on the next reconcile, which is probably what optimistic-lock users want anyway, but it's a semantic change worth a one-line note in the PR description / release notes and perhaps an update to the sections in the docs within this repo.
https://github.com/reddit/achilles-sdk/blob/8ff412a3b9bc875609e962741fd593ca0e842f82/docs/sdk-apply-objects.md#advanced-apply-patterns - The Kubernetes Resource Lock section, this should mention that Conflict errors are now handled transparently by the FSM (requeue + ApplyOutputsConflict condition) rather than being surfaced as reconcile errors.
And probably even: https://github.com/reddit/achilles-sdk/blob/8ff412a3b9bc875609e962741fd593ca0e842f82/docs/sdk-fsm-reconciler.md
Adding something under the result type descriptions mentioning that Conflict errors from OutputSet writes get reclassified as requeue results.
| if k8serrors.IsConflict(err) { | ||
| condition.Status = corev1.ConditionFalse | ||
| condition.Reason = "ApplyOutputsConflict" | ||
| condition.Message = "Conflict when applying outputs" |
There was a problem hiding this comment.
➕ to this - the err should be added.
| condition.Reason = "ApplyOutputsConflict" | ||
| condition.Message = "Conflict when applying outputs" | ||
| conditions.SetConditions(condition) | ||
| return obj, conditions, types.RequeueResultWithBackoff("conflict detected when applying outputs, requeueing") |
There was a problem hiding this comment.
This should also surface err
fmt.Sprintf("conflict detected when applying outputs, requeueing: %v", err)
This way it should show up in .status.conditions or controller logs so its easier to reason about what is happening.
retry.RetryOnConflict()Conflict errors when using io.WithOptimisticLock()
0fae78f to
37bc0e6
Compare
37bc0e6 to
9fe1bbb
Compare
|
Feedback addressed. Thanks for the reminder to update the docs as well! Open to notes re: phrasing/language there.
My understanding is slightly different - OutputSet apply errors happen in the transition between states, so the caller-controlled FSM state code has no opportunity to intercept or act on them. And since an Definitely still worth a callout in the release notes though, that's a good idea. |
|
👋 GM @DMilmont; any more feedback here? This behavior is causing us spurious errors and alerting noise, so we'd like to get this looked at whenever you have some cycles. Thanks! |
DMilmont |
💸 TL;DR
When Conflict errors occur during OutputSet applies, that error gets propagated back up through the SDK and results in an ErrorResult being emitted by the controller. This queues a retry, which is great, but it also increments error metrics for an expected failure mode.
This PR looks for the error coming from
applyOutputsto be a Conflict, and if it is, it sets a custom Reason and propagates aRequeueResultWithBackoffup to the caller.📜 Details
Jira
🧪 Testing Steps / Validation
ran
make test, no errors.✅ Checks