Summary
The hover-analysis (lighthouse) autoscaler hit its MAX of 10 machines and the Grafana alert "Lighthouse autoscaler at MAX" fired — backlog demanded ~12 machines for 15+ minutes, so audits queued.
The lighthouse-stream backlog has been demanding more than 10 analysis machines for 15+ minutes. fly-autoscaler is at MAX and audits are queueing.
Observed during a deliberately heavy load test (~10× realistic crawl volume); no audits were lost (they queue in the Redis lighthouse stream and drain), but audit latency grows while saturated.
Mechanism
- Autoscaler scales
1→10 machines on max(1, min(10, ceil(lighthouse_backlog / 25))) (fly.autoscaler-analysis.toml).
- Each analysis machine runs
LIGHTHOUSE_MAX_CONCURRENCY=1 Chromium audit at a time (~20–30s, ~25 tasks per 12–15 min). At 10 machines the drain rate caps out; a backlog above ~250 outpaces it.
Connection-budget interaction (important)
Analysis at 10 machines × DB_MAX_OPEN_CONNS=5 = 50 connections. With worker (~30–40 under pressure) + API (25) that's ~90/120 — i.e. 10 machines is already close to the Postgres connection-budget ceiling. Raising the machine cap to clear backlog faster costs +5 conns/machine and competes with the worker for the same 120 max_connections (see fly.worker.toml budget note). This was the budget recently fixed in #396's investigation (analysis previously inherited a 70-conn default).
Options (not yet decided)
- Accept — under realistic (~10× lower) load this likely never saturates; leave the cap at 10.
- Raise the machine cap — only viable if the worker pool is trimmed to keep total < 120; otherwise it re-triggers connection exhaustion.
- More audits per machine —
LIGHTHOUSE_MAX_CONCURRENCY > 1 on larger VMs (Chromium is RAM-heavy: currently 8GB/2cpu). No extra connections, but needs memory headroom validation.
- Bigger picture — resource sizing was deliberately reduced; revisit if audit latency becomes a real (non-load-test) SLO concern.
Next step
Confirm whether audit-queue latency under realistic load is acceptable. If yes, tune the alert threshold/duration to reduce noise; if no, pick option 2/3 with the connection budget in mind.
Summary
The
hover-analysis(lighthouse) autoscaler hit its MAX of 10 machines and the Grafana alert "Lighthouse autoscaler at MAX" fired — backlog demanded ~12 machines for 15+ minutes, so audits queued.Observed during a deliberately heavy load test (~10× realistic crawl volume); no audits were lost (they queue in the Redis lighthouse stream and drain), but audit latency grows while saturated.
Mechanism
1→10machines onmax(1, min(10, ceil(lighthouse_backlog / 25)))(fly.autoscaler-analysis.toml).LIGHTHOUSE_MAX_CONCURRENCY=1Chromium audit at a time (~20–30s, ~25 tasks per 12–15 min). At 10 machines the drain rate caps out; a backlog above ~250 outpaces it.Connection-budget interaction (important)
Analysis at 10 machines ×
DB_MAX_OPEN_CONNS=5= 50 connections. With worker (~30–40 under pressure) + API (25) that's ~90/120 — i.e. 10 machines is already close to the Postgres connection-budget ceiling. Raising the machine cap to clear backlog faster costs +5 conns/machine and competes with the worker for the same 120max_connections(seefly.worker.tomlbudget note). This was the budget recently fixed in #396's investigation (analysis previously inherited a 70-conn default).Options (not yet decided)
LIGHTHOUSE_MAX_CONCURRENCY > 1on larger VMs (Chromium is RAM-heavy: currently 8GB/2cpu). No extra connections, but needs memory headroom validation.Next step
Confirm whether audit-queue latency under realistic load is acceptable. If yes, tune the alert threshold/duration to reduce noise; if no, pick option 2/3 with the connection budget in mind.