Adding in retry support for transient errors in the google compute drivers package#153
Adding in retry support for transient errors in the google compute drivers package#153ShiftedMr wants to merge 22 commits into
Conversation
…n is used for determining if an error may be a transient error
…ng in missed transient error
…d removes the manual sleep of 1 second to instead properly use the precreated ratelimiter with support for context cancellation
|
|
||
| func (p *provider) waitZoneOperation(ctx context.Context, name string, zone string) error { | ||
| var op *compute.Operation | ||
| for { |
There was a problem hiding this comment.
@ShiftedMr I think this outer for loop is not needed now since we are using p.rateLimiter.Wait(ctx) ?
There was a problem hiding this comment.
I'll double check tomorrow and push up a change for this
There was a problem hiding this comment.
I think the outer loop is for waiting til the operation is completed isn't it? The retry is only dealing with transient errors as the ZoneOperations.Get can still return that the operation is pending, at least that's my understanding.
| if gerr, ok := err.(*googleapi.Error); ok && | ||
| gerr.Code == http.StatusNotFound { | ||
| return autoscaler.ErrInstanceNotFound | ||
| if err := p.rateLimiter.Wait(ctx); err != nil { |
There was a problem hiding this comment.
Before (provider.go:125-145, removed):
for {
if p.rateLimiter.Allow() {
op, err := p.service.ZoneOperations.Get(...).Do()
// ... if DONE return ...
}
time.Sleep(time.Second) // <-- this is gone
}
After (added in PR):
for {
if err := p.rateLimiter.Wait(ctx); err != nil {
return err
}
err := retry.Do(func() error { ... }, ...)
// ... if DONE return ...
}
With rate.NewLimiter(rate.Every(time.Second/25), 1), this implementation returns in approximately 40 milliseconds, polling around 25 times per second. This results in a 25-fold increase in ZoneOperations.Get and GlobalOperations.Get traffic for each pending instance. The original rate limiter was designed to cap the request rate, not to control the polling frequency. The pacing came from the explicit time.Sleep(time.Second) that was removed in this pull request. As a result, we may see an increase in 429 errors (Too Many Requests) rather than a decrease.
Fix: Maintain an explicit polling interval (for example, by adding time.Sleep(time.Second) between iterations of the outer for loop), in addition to keeping the Wait(ctx) function as a safety cap.
There was a problem hiding this comment.
Thank you for the clarity; I've added in a case that enforces a 1second wait but will also still allow ctrlC / sigterm canceling. Let me know if this covers the concern or if I missed something related to it
| err := retry.Do( | ||
| func() error { | ||
| var err error | ||
| op, err = p.service.ZoneOperations.Get(p.project, zone, name).Context(ctx).Do() |
There was a problem hiding this comment.
@ShiftedMr i think we should do this: #153 (comment)
Currently retry calls are not being considered in the rate limiter. Since we consider 429 a transient error, this can further cause 429 errors.
There was a problem hiding this comment.
@Sapa96 I can't figure out which comment that link is supposed to go to; it keeps sending me back to the main page and the comment ID doesn't seem to link to any specific comment I can find :/. My best guess is it is this one: https://github.com/drone/autoscaler/pull/153/files#r3135606135 where Raghav is talking about the outer loop not being needed. I'll take another look but if you could, could you respond to my question in that thread?
If it isn't that thread can you add a quote from the thread here?
There was a problem hiding this comment.
I think I figured out what you meant regarding this; I've update the ratelimiter location.
There was a problem hiding this comment.
i was referring to your previous comment especially with the alternative section as i understood retry calls are not being counted in rate limiter flow
"@raghavharness Okay I've updated it as requested. I figured we'd want errors checking rate limits to be treated as transient but I can understand your concern.
I've made the change
alternatively: we could probably put the ratelimit checks within the retry but allow certain errors to eject from retry as well? (e.g. context.cancelled?)"
There was a problem hiding this comment.
Ah grand! Thanks @Sapa96 I'll give that a go and post it today, Thanks!
There was a problem hiding this comment.
Okay I realized while looking at this that the 429 handling in my transient section was not ideal;
I've made 2 changes:
- To make it less likely for any of the retries to not work as we want by moving the retry wrapper entirely into util.go
- Adding in parsing for the retry-after in the util.go function
I expect we may want to rename util.go into http_retries.go but I didn't want to break the review panels yet.
Please let me know if this covers the concerns and if you'd like me to do any other tweaks. Thanks!
There was a problem hiding this comment.
Thanks, @ShiftedMr. The changes look good to me. we are validating internally with testing.
ShiftedMr
left a comment
There was a problem hiding this comment.
Updated calls to ratelimiter to avoid retries causing more rate limit violations;
Removed dead code for op nil checks as op is checked for nil within the retry calls
Added another transient check to the instances.get call
… to include retry after parsing; and moving retry decorators into util.go
Why
Adding in support for retries to run when a likely transient error was the cause of an http error on a google compute api call. This is to hopefully prevent orphaned build agents that we see on a reoccuring basis.
Example Errors we ran into
In these examples we'll end up with build agents running in our compute VM list but not listed as functioning in the autoscaler which means the autoscaler won't clean them up and can lead to wasted resources
What
New files
util & util_test.go: contains functionality for checking if an error is a transient HTTP error that should be retried
Updates
create.go: Calls insert with a UUID to protect idempotency of an agent insert. Uses a retry loop to repeat the insert calls (with UUID) in case of transient errors
provider.go: Retries added to both waitZone and waitGlobal operations to protect against transient errors calling an unnecessary error/abort
create_test & provider_test: new tests for related changes