Code quality: retry on check failure has no backoff and retries non-transient errors #48

@tkadauke

Description

app/models/health_check.rb lines 153–158 run a failing check up to 3 times in total (the initial attempt plus two retries), with no delay between attempts:

retry_times = 3
begin
  runner.run!
rescue Exception => e
  retry_times -= 1
  retry unless retry_times == 0
end

Problems

  1. No delay: All 3 attempts happen back-to-back within milliseconds. If the failure is a server-side 500 error, a timeout, or a flapping element, immediate retries are unlikely to succeed.
  2. Retries non-transient errors: Element not found, wrong URL, and assertion failures are permanent failures that should not be retried. Retrying them wastes time and inflates check_run counts.
  3. No jitter: If many checks fail simultaneously (site goes down), all retry together, creating a request storm.
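
Problem 3 can be made concrete with a small sketch of a jittered delay. The helper name `backoff_delay` and its `base`, `cap`, and `rng` parameters are hypothetical and not taken from the codebase. It assumes a "full jitter" strategy: each delay is drawn at random between 0 and an exponentially growing ceiling, so checks that fail at the same moment do not all retry in lockstep.

```ruby
# Hypothetical helper (not in the codebase): returns a randomized delay in
# seconds for the given 1-based retry attempt.
def backoff_delay(attempt, base: 1.0, cap: 30.0, rng: Random.new)
  ceiling = [cap, base * (2**(attempt - 1))].min # 1s, 2s, 4s, ... up to cap
  rng.rand * ceiling                             # uniform in [0, ceiling)
end
```

The `cap` keeps late attempts from sleeping unreasonably long; the values of 1 second and 30 seconds are placeholders, not recommendations derived from this app.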

Suggested approach

require "net/http"  # defines Net::ReadTimeout and Net::OpenTimeout

RETRYABLE_ERRORS = [Net::ReadTimeout, Net::OpenTimeout, Errno::ECONNREFUSED].freeze

retry_times = 3
begin
  runner.run!
rescue *RETRYABLE_ERRORS => e
  retry_times -= 1
  if retry_times > 0
    sleep(2 ** (2 - retry_times))  # exponential backoff: 1s, then 2s
    retry                          # only sleep when another attempt follows
  end
rescue StandardError => e
  # non-retryable: fail immediately
end
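
The same idea could be packaged as a small reusable helper. This is a sketch under assumptions: `with_retries`, `max_attempts`, `base_delay`, and the injectable `sleeper` are hypothetical names, and the retryable error list is simply the one proposed above. Unlike the snippet, it re-raises once attempts are exhausted and lets non-retryable errors propagate untouched; whether the caller should record or swallow them is left open.

```ruby
require "net/http" # defines Net::ReadTimeout and Net::OpenTimeout

RETRYABLE_ERRORS = [Net::ReadTimeout, Net::OpenTimeout, Errno::ECONNREFUSED].freeze

# Hypothetical helper: runs the block, retrying only on RETRYABLE_ERRORS.
# `sleeper` is injectable so the delay can be stubbed out in tests.
def with_retries(max_attempts: 3, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *RETRYABLE_ERRORS
    raise if attempts >= max_attempts               # out of attempts: surface the error
    sleeper.call(base_delay * (2**(attempts - 1)))  # wait 1s, 2s, 4s, ... before retrying
    retry
  end
end

# Usage (runner is the object from the snippet above):
#   with_retries { runner.run! }
```

Because only the listed errors are rescued, an assertion failure or a missing element raises on the first attempt without any delay.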

Effort: small
