Skip to content

Adding in retry support for transient errors in the google compute drivers package#153

Open
ShiftedMr wants to merge 22 commits into
drone:masterfrom
ShiftedMr:attempt_to_fix_autoscaler_orphaned_agents
Open

Adding in retry support for transient errors in the google compute drivers package#153
ShiftedMr wants to merge 22 commits into
drone:masterfrom
ShiftedMr:attempt_to_fix_autoscaler_orphaned_agents

Conversation

@ShiftedMr

@ShiftedMr ShiftedMr commented Mar 30, 2026

Copy link
Copy Markdown

Why

Adding in support for retries to run when a likely transient error was the cause of an http error on a google compute api call. This is to hopefully prevent orphaned build agents that we see on a reoccuring basis.

Example Errors we ran into

  • Error example 1:
    • "error": "googleapi: Error 503: Policy checks are unavailable., backendError"
    • "msg": "instance insert operation failed",
  • Error example 2:
    • "error": "googleapi: Error 503: Policy checks are unavailable., backendError"
    • "msg": "failed to provision server",
  • Error example 3:
    • "error": "googleapi: Error 503: Visibility check was unavailable. Please retry the request and contact support if the problem persists, backendError"

In these examples we'll end up with build agents running in our compute VM list but not listed as functioning in the autoscaler which means the autoscaler won't clean them up and can lead to wasted resources

What

New files

util & util_test.go: contains functionality for checking if an error is a transient HTTP error that should be retried

Updates

create.go: Calls insert with a UUID to protect idempotency of an agent insert. Uses a retry loop to repeat the insert calls (with UUID) in case of transient errors

provider.go: Retries added to both waitZone and waitGlobal operations to protect against transient errors calling an unnecessary error/abort

create_test & provider_test: new tests for related changes

Comment thread drivers/google/provider.go Outdated
Comment thread drivers/google/provider.go
Comment thread drivers/google/provider.go
@ShiftedMr ShiftedMr requested a review from raghavharness April 22, 2026 13:53

func (p *provider) waitZoneOperation(ctx context.Context, name string, zone string) error {
var op *compute.Operation
for {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShiftedMr I think this outer for loop is not needed now since we are using p.rateLimiter.Wait(ctx) ?

@ShiftedMr ShiftedMr Apr 28, 2026

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double check tomorrow and push up a change for this

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the outer loop is for waiting til the operation is completed isn't it? The retry is only dealing with transient errors as the ZoneOperations.Get can still return that the operation is pending, at least that's my understanding.

Comment thread drivers/google/provider.go
@ShiftedMr ShiftedMr requested a review from raghavharness April 29, 2026 12:33
Comment thread drivers/google/provider.go Outdated
if gerr, ok := err.(*googleapi.Error); ok &&
gerr.Code == http.StatusNotFound {
return autoscaler.ErrInstanceNotFound
if err := p.rateLimiter.Wait(ctx); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before (provider.go:125-145, removed):

  for {
      if p.rateLimiter.Allow() {
          op, err := p.service.ZoneOperations.Get(...).Do()
          // ... if DONE return ...
      }   
      time.Sleep(time.Second)   // <-- this is gone
  }   

After (added in PR):

  for {
      if err := p.rateLimiter.Wait(ctx); err != nil {
          return err
      }   
      err := retry.Do(func() error { ... }, ...)
      // ... if DONE return ...
  }   

With rate.NewLimiter(rate.Every(time.Second/25), 1), this implementation returns in approximately 40 milliseconds, polling around 25 times per second. This results in a 25-fold increase in ZoneOperations.Get and GlobalOperations.Get traffic for each pending instance. The original rate limiter was designed to cap the request rate, not to control the polling frequency. The pacing came from the explicit time.Sleep(time.Second) that was removed in this pull request. As a result, we may see an increase in 429 errors (Too Many Requests) rather than a decrease.

Fix: Maintain an explicit polling interval (for example, by adding time.Sleep(time.Second) between iterations of the outer for loop), in addition to keeping the Wait(ctx) function as a safety cap.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the clarity; I've added in a case that enforces a 1second wait but will also still allow ctrlC / sigterm canceling. Let me know if this covers the concern or if I missed something related to it

@ShiftedMr ShiftedMr requested a review from maanavshah May 11, 2026 18:23
Comment thread drivers/google/provider.go Outdated
err := retry.Do(
func() error {
var err error
op, err = p.service.ZoneOperations.Get(p.project, zone, name).Context(ctx).Do()

@Sapa96 Sapa96 May 17, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShiftedMr i think we should do this: #153 (comment)
Currently retry calls are not being considered in the rate limiter. Since we consider 429 a transient error, this can further cause 429 errors.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sapa96 I can't figure out which comment that link is supposed to go to; it keeps sending me back to the main page and the comment ID doesn't seem to link to any specific comment I can find :/. My best guess is it is this one: https://github.com/drone/autoscaler/pull/153/files#r3135606135 where Raghav is talking about the outer loop not being needed. I'll take another look but if you could, could you respond to my question in that thread?

If it isn't that thread can you add a quote from the thread here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I figured out what you meant regarding this; I've update the ratelimiter location.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was referring to your previous comment especially with the alternative section as i understood retry calls are not being counted in rate limiter flow

"@raghavharness Okay I've updated it as requested. I figured we'd want errors checking rate limits to be treated as transient but I can understand your concern.

I've made the change

alternatively: we could probably put the ratelimit checks within the retry but allow certain errors to eject from retry as well? (e.g. context.cancelled?)"

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah grand! Thanks @Sapa96 I'll give that a go and post it today, Thanks!

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I realized while looking at this that the 429 handling in my transient section was not ideal;

I've made 2 changes:

  1. To make it less likely for any of the retries to not work as we want by moving the retry wrapper entirely into util.go
  2. Adding in parsing for the retry-after in the util.go function

I expect we may want to rename util.go into http_retries.go but I didn't want to break the review panels yet.

Please let me know if this covers the concerns and if you'd like me to do any other tweaks. Thanks!

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @ShiftedMr. The changes look good to me. we are validating internally with testing.

@ShiftedMr ShiftedMr requested a review from Sapa96 May 18, 2026 17:04

@ShiftedMr ShiftedMr left a comment

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated calls to ratelimiter to avoid retries causing more rate limit violations;
Removed dead code for op nil checks as op is checked for nil within the retry calls

Added another transient check to the instances.get call

… to include retry after parsing; and moving retry decorators into util.go
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants