Lighthouse/analysis autoscaler saturates at MAX under load

## Summary
The `hover-analysis` (lighthouse) autoscaler hit its MAX of 10 machines and the Grafana alert "Lighthouse autoscaler at MAX" fired — backlog demanded ~12 machines for 15+ minutes, so audits queued.

> The lighthouse-stream backlog has been demanding more than 10 analysis machines for 15+ minutes. fly-autoscaler is at MAX and audits are queueing.

Observed during a deliberately heavy load test (~10× realistic crawl volume); no audits were lost (they queue in the Redis lighthouse stream and drain), but audit latency grows while saturated.

## Mechanism
- Autoscaler scales `1→10` machines on `max(1, min(10, ceil(lighthouse_backlog / 25)))` (`fly.autoscaler-analysis.toml`).
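- Each analysis machine runs one Chromium audit at a time (`LIGHTHOUSE_MAX_CONCURRENCY=1`). An audit takes ~20–30s, and a machine clears ~25 tasks in 12–15 min. At 10 machines the drain rate caps out; once the backlog passes ~250 (10 machines × 25 per machine), the formula asks for more machines than the cap allows and audits queue.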
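
To make the saturation point concrete, here is a minimal Python sketch of the scaling expression quoted above. It is not the autoscaler's own code: the function and constant names are made up for illustration, and only the expression `max(1, min(10, ceil(lighthouse_backlog / 25)))` comes from `fly.autoscaler-analysis.toml`.

```python
import math

# Values taken from the expression in fly.autoscaler-analysis.toml.
BACKLOG_PER_MACHINE = 25
MIN_MACHINES = 1
MAX_MACHINES = 10


def uncapped_demand(backlog: int) -> int:
    """Machines the backlog asks for before the upper cap is applied."""
    return max(MIN_MACHINES, math.ceil(backlog / BACKLOG_PER_MACHINE))


def target_machines(backlog: int) -> int:
    """Machine count after clamping: max(1, min(10, ceil(backlog / 25)))."""
    return max(MIN_MACHINES, min(MAX_MACHINES, math.ceil(backlog / BACKLOG_PER_MACHINE)))


def is_saturated(backlog: int) -> bool:
    """True when the backlog demands more machines than the cap allows."""
    return uncapped_demand(backlog) > MAX_MACHINES


if __name__ == "__main__":
    for backlog in (0, 25, 26, 250, 251, 300):
        print(backlog, target_machines(backlog), uncapped_demand(backlog), is_saturated(backlog))
```

Under this reading, a demand of ~12 machines corresponds to a backlog somewhere in the 276–300 range.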

## Connection-budget interaction (important)
Analysis at 10 machines × `DB_MAX_OPEN_CONNS=5` = **50 connections**. Adding the worker (~30–40 under pressure) and the API (25) gives ~105–115 of the 120 `max_connections` by these figures, so **10 machines is already close to the Postgres connection-budget ceiling**. Raising the machine cap to clear backlog faster costs +5 connections per machine and competes with the worker for the same 120 `max_connections` (see the budget note in `fly.worker.toml`). This is the budget that was recently fixed during the #396 investigation (analysis previously inherited a 70-conn default).
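
The budget arithmetic can be checked with a rough sketch that takes the quoted figures at face value (120 `max_connections`, 25 for the API, 5 per analysis machine). The function names and the `reserve` parameter are hypothetical, added only for illustration, and the sketch assumes every pool is fully used.

```python
MAX_CONNECTIONS = 120      # Postgres max_connections, as quoted in this issue
API_CONNS = 25             # API pool, as quoted in this issue
CONNS_PER_ANALYSIS = 5     # DB_MAX_OPEN_CONNS per analysis machine


def max_analysis_machines(worker_conns: int, reserve: int = 0) -> int:
    """Largest analysis machine count whose pools still fit in the budget.

    `reserve` is a hypothetical safety margin (e.g. for admin sessions).
    """
    available = MAX_CONNECTIONS - API_CONNS - worker_conns - reserve
    return max(0, available // CONNS_PER_ANALYSIS)


def worker_conns_allowed(analysis_machines: int, reserve: int = 0) -> int:
    """Connections left for the worker at a given analysis machine count."""
    used = API_CONNS + analysis_machines * CONNS_PER_ANALYSIS + reserve
    return max(0, MAX_CONNECTIONS - used)


if __name__ == "__main__":
    for worker in (30, 40):
        print("worker conns:", worker, "-> max analysis machines:", max_analysis_machines(worker))
    print("worker budget at 12 analysis machines:", worker_conns_allowed(12))
```

On these assumptions, running the ~12 machines the load test demanded would require the worker to stay at or below 35 connections, with no spare headroom.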

## Options (not yet decided)
1. **Accept** — under realistic (~10× lower) load this likely never saturates; leave the cap at 10.
2. **Raise the machine cap** — only viable if the worker pool is trimmed to keep total < 120; otherwise it re-triggers connection exhaustion.
3. **More audits per machine** — `LIGHTHOUSE_MAX_CONCURRENCY > 1` on larger VMs (Chromium is RAM-heavy: currently 8GB/2cpu). No extra connections, but needs memory headroom validation.
4. **Bigger picture** — resource sizing was deliberately reduced; revisit if audit latency becomes a real (non-load-test) SLO concern.
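
To compare options 2 and 3 on drain time, here is a back-of-the-envelope sketch using the per-machine throughput from the Mechanism section (~25 tasks per 12–15 min). It assumes no new audits arrive while draining and that throughput scales linearly with both machine count and `LIGHTHOUSE_MAX_CONCURRENCY`. The linear-scaling assumption is unvalidated, especially for concurrency on RAM-heavy Chromium. `drain_minutes` is a hypothetical helper, not existing code.

```python
def drain_minutes(
    backlog: int,
    machines: int,
    concurrency: int = 1,
    tasks_per_window: float = 25.0,
    window_minutes: float = 15.0,
) -> float:
    """Estimated minutes to clear `backlog` with no new arrivals.

    Assumes each audit slot clears `tasks_per_window` tasks per
    `window_minutes`, and that slots scale linearly (an assumption).
    """
    if machines < 1 or concurrency < 1:
        raise ValueError("machines and concurrency must be >= 1")
    tasks_per_minute = machines * concurrency * tasks_per_window / window_minutes
    return backlog / tasks_per_minute


if __name__ == "__main__":
    backlog = 300
    print("current cap (10 machines):", round(drain_minutes(backlog, 10), 1), "min")
    print("option 2 (12 machines):   ", round(drain_minutes(backlog, 12), 1), "min")
    print("option 3 (concurrency 2): ", round(drain_minutes(backlog, 10, concurrency=2), 1), "min")
```

Option 3 looks better in this model because it adds audit slots without adding connections, but the model ignores the memory-headroom question raised above.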

## Next step
Confirm whether audit-queue latency under realistic load is acceptable. If yes, tune the alert threshold/duration to reduce noise; if no, pick option 2/3 with the connection budget in mind.
