Lighthouse/analysis autoscaler saturates at MAX under load #397

@simonsmallchua

Summary

The hover-analysis (lighthouse) autoscaler hit its MAX of 10 machines and the Grafana alert "Lighthouse autoscaler at MAX" fired. The lighthouse-stream backlog demanded ~12 analysis machines for 15+ minutes, so fly-autoscaler stayed pinned at MAX and audits queued.

Observed during a deliberately heavy load test (~10× realistic crawl volume); no audits were lost (they queue in the Redis lighthouse stream and drain), but audit latency grows while saturated.

Mechanism

  • Autoscaler scales 1→10 machines on max(1, min(10, ceil(lighthouse_backlog / 25))) (fly.autoscaler-analysis.toml).
  • Each analysis machine runs one Chromium audit at a time (LIGHTHOUSE_MAX_CONCURRENCY=1; ~20–30s per audit, ~25 tasks per 12–15 min). At 10 machines the drain rate caps out; a backlog above ~250 (10 machines × 25) demands more machines than the cap allows.
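
A minimal Python sketch of the scaling expression above, to make the saturation point concrete. The constant and function names are illustrative, not taken from fly.autoscaler-analysis.toml; only the formula itself comes from this issue.

```python
import math

# Values quoted in this issue; the names are hypothetical.
MIN_MACHINES = 1
MAX_MACHINES = 10
BACKLOG_PER_MACHINE = 25


def uncapped_demand(lighthouse_backlog: int) -> int:
    """Machines the backlog asks for before the min/max clamp."""
    return math.ceil(lighthouse_backlog / BACKLOG_PER_MACHINE)


def desired_machines(lighthouse_backlog: int) -> int:
    """max(1, min(10, ceil(lighthouse_backlog / 25)))"""
    return max(MIN_MACHINES, min(MAX_MACHINES, uncapped_demand(lighthouse_backlog)))


def is_saturated(lighthouse_backlog: int) -> bool:
    """True when the backlog demands more machines than the cap allows."""
    return uncapped_demand(lighthouse_backlog) > MAX_MACHINES


if __name__ == "__main__":
    for backlog in (0, 25, 26, 250, 251, 300):
        print(backlog, desired_machines(backlog), is_saturated(backlog))
```

Under this formula a backlog of about 300 asks for 12 machines, which matches the ~12 machines of demand seen during the load test.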

Connection-budget interaction (important)

Analysis at 10 machines × DB_MAX_OPEN_CONNS=5 = 50 connections. With worker (~30–40 under pressure) + API (25) that's ~105–115 of 120 on these figures, i.e. 10 machines is already close to the Postgres connection-budget ceiling. Raising the machine cap to clear backlog faster costs +5 conns/machine and competes with the worker for the same 120 max_connections (see fly.worker.toml budget note). This was the budget recently fixed in #396's investigation (analysis previously inherited a 70-conn default).
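
The budget arithmetic can be sketched in Python as a quick check. This only sums the figures quoted in this issue; the names are hypothetical, and it assumes all 120 max_connections are usable by these three services (no reserved or superuser slots), which may not hold in practice.

```python
# Figures quoted in this issue; names are illustrative.
MAX_CONNECTIONS = 120        # Postgres max_connections
DB_MAX_OPEN_CONNS = 5        # per analysis machine
API_CONNS = 25
WORKER_CONNS_LOW = 30        # worker pool under pressure, low estimate
WORKER_CONNS_HIGH = 40       # worker pool under pressure, high estimate


def total_connections(analysis_machines: int, worker_conns: int) -> int:
    """Sum of analysis, worker and API connections."""
    return analysis_machines * DB_MAX_OPEN_CONNS + worker_conns + API_CONNS


def headroom(analysis_machines: int, worker_conns: int) -> int:
    """Connections left before max_connections is reached."""
    return MAX_CONNECTIONS - total_connections(analysis_machines, worker_conns)


if __name__ == "__main__":
    for worker in (WORKER_CONNS_LOW, WORKER_CONNS_HIGH):
        print(worker, total_connections(10, worker), headroom(10, worker))
```

Each extra analysis machine reduces the headroom by 5, so the high worker estimate leaves room for at most one more machine before the total reaches 120.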

Options (not yet decided)

  1. Accept — under realistic (~10× lower) load this likely never saturates; leave the cap at 10.
  2. Raise the machine cap — only viable if the worker pool is trimmed to keep total < 120; otherwise it re-triggers connection exhaustion.
  3. More audits per machine: LIGHTHOUSE_MAX_CONCURRENCY > 1 on larger VMs (Chromium is RAM-heavy: currently 8GB/2cpu). No extra connections, but needs memory headroom validation.
  4. Bigger picture — resource sizing was deliberately reduced; revisit if audit latency becomes a real (non-load-test) SLO concern.
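
To compare options 2 and 3, here is a rough Python sketch. It assumes DB_MAX_OPEN_CONNS stays at 5 per machine regardless of audit concurrency and that the full 120 connections are available to these services; both are assumptions, and the helper names are made up for illustration.

```python
MAX_CONNECTIONS = 120
DB_MAX_OPEN_CONNS = 5   # per analysis machine, assumed independent of concurrency
API_CONNS = 25


def worker_conn_ceiling(analysis_machines: int) -> int:
    """Largest worker pool that keeps the total strictly below max_connections."""
    analysis_conns = analysis_machines * DB_MAX_OPEN_CONNS
    return MAX_CONNECTIONS - 1 - API_CONNS - analysis_conns


def parallel_audits(analysis_machines: int, concurrency: int) -> int:
    """Audits that can run at once across the analysis fleet."""
    return analysis_machines * concurrency


if __name__ == "__main__":
    # Option 2: more machines at concurrency 1 shrink the room left for the worker.
    for machines in (10, 12, 14):
        print("machines", machines, "worker ceiling", worker_conn_ceiling(machines))
    # Option 3: same 10 machines, higher concurrency, same connection count.
    print("parallel audits at concurrency 2:", parallel_audits(10, 2))
```

On these numbers, covering the ~12 machines of demand via option 2 means keeping the worker at 34 connections or fewer, whereas option 3 at concurrency 2 gives 20 parallel audits with no change to the connection budget, at the cost of memory that still needs validating.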

Next step

Confirm whether audit-queue latency under realistic load is acceptable. If yes, tune the alert threshold/duration to reduce noise; if no, pick option 2/3 with the connection budget in mind.
