Skip to content

Pace job starts and raise rate-limit retries#394

Merged
simonsmallchua merged 2 commits into
mainfrom
work/quirky-wilbur-be0a2f
May 27, 2026
Merged

Pace job starts and raise rate-limit retries#394
simonsmallchua merged 2 commits into
mainfrom
work/quirky-wilbur-be0a2f

Conversation

@simonsmallchua
Copy link
Copy Markdown
Contributor

@simonsmallchua simonsmallchua commented May 27, 2026

Summary

Two independent, cautious changes to how the crawler handles target-domain rate limiting (429/403/503). Motivated by a 10k-task load test where 44% of tasks hard-failed, almost entirely on HTTP 429 — DB and infra were healthy, so the dominant limiter on successful throughput is target-domain throttling, not infrastructure.

1. Raise blocking retries (deferred, not failed)

  • MaxBlockingRetries was hardcoded to 3, so a rate-limited task burned all attempts in <1s (backoff 100/200/400ms) and hard-failed while the domain was still throttling.
  • Now reads GNH_RATE_LIMIT_MAX_RETRIES (previously documented in ENV_VARS.md but never wired) with a raised default of 8, so a persistently-throttling domain is paced-and-deferred across the widening per-domain delay (up to the 60s cap) rather than counted as a failure.

2. Pace every job start (no first-wave burst)

  • The pacer warm-up delay was only applied to never-crawled domains. A domain with a learned (or zero) adaptive delay seeded 0 → no per-domain cap → the full per-job concurrency burst out on the first dispatch wave and instantly triggered a 429 storm.
  • Job starts are now floored to a minimum seed delay (GNH_PACER_START_FLOOR_DELAY_MS, default 500ms, 0 restores prior behaviour). A previously-crawled domain now starts at ~3 concurrent requests and widens on success, instead of bursting. A never-crawled domain keeps the cautious 2000ms warm-up (~1 worker). A domain with a higher learned delay is unchanged.
  • Seed decision extracted into seedAdaptiveDelayMS(...) and unit-tested.

Resulting job-start ramp

Domain state Seed delay Starting workers
Never crawled 2000ms ~1
Crawled, learned ≤500ms 500ms ~3
Crawled, learned high learned value paced

The cap then widens to full job concurrency via the existing success step-down.

Reviewer notes

  • Both env vars default to the new behaviour; set GNH_PACER_START_FLOOR_DELAY_MS=0 to revert Part 2.
  • The per-domain spacing logic itself is unchanged (first 429 → 500ms, +500ms per further 429, up to 60s).
  • Known pre-existing sharp edge (not introduced here): when the adaptive delay reaches 0 the cap is removed entirely, so the final ramp step jumps from ~3 to full rather than incrementing. Candidate for a follow-up.
  • Load-test evidence is synthetic (hammering domains); worth confirming against real customer traffic before further tuning.

Test plan

  • go build ./...
  • go test ./internal/jobs/ ./internal/broker/
  • New unit tests for seedAdaptiveDelayMS (warm-up, floor, disabled-floor cases)
  • gosec clean; pre-commit security gates pass

View with Codesmith Autofix with Codesmith
Need help on this PR? Tag @codesmith with what you need. Autofix is disabled.

Summary by CodeRabbit

  • Improvements

    • Rate-limited task failures now retry up to 8 times (was 3); max retries configurable via environment.
    • Jobs use a paced start to avoid initial concurrency bursts; start-floor delay configurable (default 500ms) and can be disabled.
  • Documentation

    • Updated runtime config docs with new env vars and defaults (max retries, pacer warmup delay default 2000ms, start-floor default 500ms).
  • Tests

    • Added unit tests covering pacing and retry configuration and defaults.

Review Change Stack

@supabase
Copy link
Copy Markdown

supabase Bot commented May 27, 2026

Updates to Preview Branch (work/quirky-wilbur-be0a2f) ↗︎

Deployments Status Updated
Database Wed, 27 May 2026 11:16:00 UTC
Services Wed, 27 May 2026 11:16:00 UTC
APIs Wed, 27 May 2026 11:16:00 UTC

Tasks are run on every commit but only new migration files are pushed.
Close and reopen this PR if you want to apply changes from existing seed or migration files.

Tasks Status Updated
Configurations Wed, 27 May 2026 11:16:02 UTC
Migrations Wed, 27 May 2026 11:16:04 UTC
Seeding Wed, 27 May 2026 11:16:05 UTC
Edge Functions Wed, 27 May 2026 11:16:11 UTC

View logs for this Workflow Run ↗︎.
Learn more about Supabase for Git ↗︎.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 27, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: f1beae8d-e5f9-4968-8f1c-f9e60e2b9dfa

📥 Commits

Reviewing files that changed from the base of the PR and between 322103d and e1a0c67.

📒 Files selected for processing (1)
  • internal/jobs/pacer_seed_test.go

📝 Walkthrough

Walkthrough

The PR increases the default job retry cap for rate-limited failures from 3 to 8 (via configurable GNH_RATE_LIMIT_MAX_RETRIES), and introduces a new pacer start-floor delay (GNH_PACER_START_FLOOR_DELAY_MS, default 500ms) that floors adaptive-delay seeds for learned domains to prevent 429 burst storms during initial ramp-up.

Changes

Job Scheduler Rate-Limiting and Pacer Configuration

Layer / File(s) Summary
Retry Cap Configuration
internal/jobs/types.go, internal/jobs/executor.go
Adds DefaultMaxBlockingRetries constant (8) and updates DefaultExecutorConfig() to read MaxBlockingRetries from GNH_RATE_LIMIT_MAX_RETRIES environment variable instead of using a hardcoded value.
Pacer Start-Floor Configuration and Seeding Logic
internal/jobs/stream_worker.go
Introduces defaultPacerStartFloorDelayMS (500ms), implements pacerStartFloorDelayMS() to read the configurable floor from GNH_PACER_START_FLOOR_DELAY_MS, and implements seedAdaptiveDelayMS() to compute initial adaptive-delay seeds: never-crawled domains use warmup delay, while learned domains are floored to the start-floor delay.
Pacer Seeding Integration in Job Loading
internal/jobs/stream_worker.go
Integrates the new seeding logic into the job-info fetch path by computing the floor from database state and deriving adaptive delay via seedAdaptiveDelayMS().
Test Coverage for Adaptive Delay Seeding
internal/jobs/pacer_seed_test.go
Tests seedAdaptiveDelayMS() across multiple scenarios: never-crawled domains, zero/below/at/above-floor learned values, and floor-disabled behavior (floor = 0). Also tests pacer env parsing and executor MaxBlockingRetries env override.
Documentation Updates
CHANGELOG.md, docs/operations/ENV_VARS.md
Documents new retry and pacer configuration variables: GNH_RATE_LIMIT_MAX_RETRIES (default 8), GNH_PACER_WARMUP_DELAY_MS (default 2000ms), and GNH_PACER_START_FLOOR_DELAY_MS (default 500ms).

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • Good-Native/hover#390: Both PRs modify pacer seeding logic in internal/jobs/stream_worker.go to introduce/configure initial adaptive-delay warm-up behavior via GNH_PACER_WARMUP_DELAY_MS.
  • Good-Native/hover#383: Both PRs modify per-domain pacer seeding and ramp-up logic in internal/jobs/stream_worker.go for adaptive delay initialization.

Poem

🐰 I nibbled env vars, warmed up the pace,
Retries hopped from three to eight with grace,
Floors softly land learned domains on start,
No sudden storms to tear the crawl apart,
A rabbit's tweak to keep the jobs on pace.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely summarizes the two main changes: pacing job starts and raising rate-limit retry limits.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch work/quirky-wilbur-be0a2f

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 27, 2026

Release Versions

App patch: v0.34.19v0.34.20

Changelog

Changed

  • Rate-limited (429/403/503) tasks now retry up to 8 times before hard-failing,
    raised from 3, so a domain that throttles hard is paced-and-deferred rather
    than failed mid-job. Tunable via GNH_RATE_LIMIT_MAX_RETRIES (previously
    documented but never wired).
  • Every job now starts paced instead of bursting the full per-job concurrency at
    a domain. Previously only a never-crawled domain got a warm-up delay; a domain
    with a learned (or zero) adaptive delay seeded un-paced and could trigger an
    instant 429 storm. Job starts are now floored to a minimum seed delay, so a
    previously-crawled domain ramps up from a few concurrent requests and widens
    on success. Tunable via GNH_PACER_START_FLOOR_DELAY_MS (0 restores the
    prior behaviour).

@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 88.23529% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
internal/jobs/stream_worker.go 87.50% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
internal/jobs/types.go (1)

1-149: ⚡ Quick win

Run gofmt and goimports on this file.

As per coding guidelines, touched Go files should be formatted with gofmt and goimports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/jobs/types.go` around lines 1 - 149, This file is not formatted per
project Go style; run gofmt (or gofmt -s) and goimports on this file to fix
spacing, import ordering, and unused-imports issues, then re-run `go vet`/build
to ensure no unused imports remain; focus changes will appear around the
package/import block and declarations such as JobStatus/TaskStatus constants and
structs Job, Task, JobOptions, and QuotaExceededError—commit the formatted file.
internal/jobs/executor.go (1)

1-549: ⚡ Quick win

Run gofmt and goimports on this file.

As per coding guidelines, touched Go files should be formatted with gofmt and goimports.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/jobs/executor.go` around lines 1 - 549, The file
internal/jobs/executor.go is not formatted per project policy; run gofmt and
goimports on it to fix whitespace, indentation and import ordering/unused
imports (this will update imports used by symbols like TaskExecutor, Execute,
ConstructTaskURL, buildHTMLUpload, gzipHTML, canonicalHTMLContentType,
normalisedHTMLContentType, etc.). Use gofmt (or go fmt) to normalize formatting
and goimports to group and remove unused imports so the file compiles and
matches repository style. After running both tools, re-run go vet/build to
ensure no new import errors remain and commit the formatted file.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/jobs/executor.go`:
- Around line 1-549: The file internal/jobs/executor.go is not formatted per
project policy; run gofmt and goimports on it to fix whitespace, indentation and
import ordering/unused imports (this will update imports used by symbols like
TaskExecutor, Execute, ConstructTaskURL, buildHTMLUpload, gzipHTML,
canonicalHTMLContentType, normalisedHTMLContentType, etc.). Use gofmt (or go
fmt) to normalize formatting and goimports to group and remove unused imports so
the file compiles and matches repository style. After running both tools, re-run
go vet/build to ensure no new import errors remain and commit the formatted
file.

In `@internal/jobs/types.go`:
- Around line 1-149: This file is not formatted per project Go style; run gofmt
(or gofmt -s) and goimports on this file to fix spacing, import ordering, and
unused-imports issues, then re-run `go vet`/build to ensure no unused imports
remain; focus changes will appear around the package/import block and
declarations such as JobStatus/TaskStatus constants and structs Job, Task,
JobOptions, and QuotaExceededError—commit the formatted file.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 80e771cd-8deb-4f76-b883-cfcfadfc5c59

📥 Commits

Reviewing files that changed from the base of the PR and between feb18c9 and 322103d.

📒 Files selected for processing (6)
  • CHANGELOG.md
  • docs/operations/ENV_VARS.md
  • internal/jobs/executor.go
  • internal/jobs/pacer_seed_test.go
  • internal/jobs/stream_worker.go
  • internal/jobs/types.go

@github-actions
Copy link
Copy Markdown
Contributor

🐝 Review App Deployed

Homepage: https://hover-pr-394.fly.dev
Dashboard: https://hover-pr-394.fly.dev/dashboard

@simonsmallchua simonsmallchua merged commit 463cced into main May 27, 2026
32 of 33 checks passed
@simonsmallchua simonsmallchua deleted the work/quirky-wilbur-be0a2f branch May 27, 2026 11:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant