Warm the rotation-tracks picker LRU on backend boot #998

@jakebromberg

Problem

GET /library/rotation/:rotation_id/tracks (#940) powers the dj-site rotation entry-mode picker. The endpoint composes against LML's /api/v1/discogs/release/{id} after resolving the rotation row to a Discogs release id via getDiscogsReleaseIdByRotationId in apps/backend/services/library.service.ts.

That resolver has three tiers:

  1. rotation.discogs_release_id (direct, mirrored from tubafrenzy) — 0/21,563 rows have a value in prod as of 2026-05-21.
  2. library_identity.discogs_release_id via the album_id bridge (fallback, written by jobs/library-identity-consumer / BS#802) — column is structurally NULL today until BS#801 extends LML's bulk-resolve-libraries contract with release-level resolution.
  3. LML POST /api/v1/lookup on (artist_name, album_title) (runtime, added in #987, "feat(library): resolve rotation Discogs id via LML /lookup on tier-1/2 miss (#986)"). This is the only tier carrying the picker today. Results are memoized in per-rotation_id LRUs (rotationLmlPositiveCache + rotationLmlNegativeCache) so the same rotation row isn't re-queried for every picker open.
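
The tier order and the positive/negative memoization can be sketched as follows. This is a minimal illustration, not the code in library.service.ts: the Sources interface, TinyLru, the capacity of 1000, and the choice to consult the caches only after tiers 1 and 2 miss are all assumptions made for the example.

```typescript
// Hypothetical stand-ins for the three tiers; the real resolver's internals
// are not shown in this issue.
type Tier = "rotation" | "library_identity" | "lml";

interface Resolution {
  releaseId: number | null;
  tier: Tier;
}

interface Sources {
  rotationColumn(rotationId: number): Promise<number | null>; // tier 1
  identityBridge(rotationId: number): Promise<number | null>; // tier 2
  lmlLookup(rotationId: number): Promise<number | null>; // tier 3
}

// Minimal LRU built on Map insertion order.
class TinyLru<V> {
  private readonly entries = new Map<number, V>();
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
  }

  get(key: number): V | undefined {
    const value = this.entries.get(key);
    if (value === undefined) return undefined;
    // Re-insert so the key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: number, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.capacity) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }
}

// Assumed capacity; the real cache sizes are not stated in the issue.
const positive = new TinyLru<number>(1000);
const negative = new TinyLru<true>(1000);

async function resolveReleaseId(id: number, sources: Sources): Promise<Resolution> {
  const direct = await sources.rotationColumn(id);
  if (direct !== null) return { releaseId: direct, tier: "rotation" };

  const bridged = await sources.identityBridge(id);
  if (bridged !== null) return { releaseId: bridged, tier: "library_identity" };

  // Tier 3: answer from memory when possible, otherwise pay the round-trip once.
  const cached = positive.get(id);
  if (cached !== undefined) return { releaseId: cached, tier: "lml" };
  if (negative.get(id) === true) return { releaseId: null, tier: "lml" };

  const looked = await sources.lmlLookup(id);
  if (looked === null) negative.set(id, true);
  else positive.set(id, looked);
  return { releaseId: looked, tier: "lml" };
}
```

Caching misses separately from hits is what lets a row with no Discogs match skip the round-trip on later picker opens.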

The LRUs are process-local. They start cold on every backend restart, so the first picker open per rotation row pays a full LML round-trip. #993 bounds that at 5 s, but it is still a user-visible stall that we can avoid.

Tubafrenzy's RotationTracklistCache.warmCache warms the equivalent JVM-local map on startup. We want the backend equivalent.

End state

On every backend boot, walk the ~310 active rotation rows (kill_date IS NULL OR kill_date > CURRENT_DATE — the predicate getRotationFromDB already uses) and call getDiscogsReleaseIdByRotationId(id) for each. Hits to tier 1 or 2 cost ~1 ms; hits to tier 3 spend a lookupSemaphore permit but populate either the positive or the negative LRU. The first picker open per row in the next deploy window is instant.
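
For unit-test fixtures it can help to have an in-memory mirror of the active-row predicate. The sketch below assumes a hypothetical RotationRow shape with kill_date carried as an ISO "YYYY-MM-DD" string; the real query runs in SQL and is not reproduced here.

```typescript
// Hypothetical row shape: only the columns the predicate needs.
interface RotationRow {
  id: number;
  killDate: string | null; // ISO "YYYY-MM-DD", or null when no kill date is set
}

// Mirrors: kill_date IS NULL OR kill_date > CURRENT_DATE.
// ISO dates order correctly as plain strings, so no Date parsing is needed.
function isActive(row: RotationRow, today: string): boolean {
  return row.killDate === null || row.killDate > today;
}

// Ids of the rows the warm pass would visit.
function activeIds(rows: RotationRow[], today: string): number[] {
  return rows.filter((row) => isActive(row, today)).map((row) => row.id);
}
```

Note that the comparison is strict: a row whose kill_date equals today is already inactive, matching the SQL predicate.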

Design

Startup warm task (Option A) over admin-triggered endpoint (Option B).

A separate jobs/ package would warm its own LRU and exit — useless for the API process. So the warm pass must run inside the backend process. Between A and B, the startup task wins because:

  • Backend restarts are infrequent (days between deploys), so the boot cost is amortized.
  • B requires an operator action after every deploy. The warm is too small a win to justify that friction; in practice it'd never be run.
  • The walk shares the existing 5-permit lookupSemaphore and lookupTokenBucket with concurrent live traffic, so it can't starve user requests; at most it deepens the queue while it runs. The per-call 5 s timeout from #993 ("feat(lml): per-call timeout override; tighten rotation picker to 5 s (#992)") caps tail latency.
  • Sequential iteration (no extra fan-out) bounds outstanding LML calls at exactly the semaphore's capacity. Driving concurrency from the warmer would only deepen the queue without raising throughput.
  • Fire-and-forget on boot means health checks and live traffic don't pay startup latency. Top-level walk failures (DB outage) are Sentry-captured but don't crash the listen callback.
  • Per-row failures are Sentry-captured and don't halt the walk.
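
The last three points (sequential iteration, fire-and-forget kickoff, per-row error isolation) could look roughly like this. The WarmDeps hooks are hypothetical placeholders for the DB query, getDiscogsReleaseIdByRotationId, and Sentry capture; warmWalk and startWarm are illustrative names, not the exports listed in the acceptance criteria.

```typescript
// Hypothetical hooks, injected so the sketch stays self-contained.
interface WarmDeps {
  listActiveIds(): Promise<number[]>;
  resolve(rotationId: number): Promise<unknown>;
  report(error: unknown, context: Record<string, unknown>): void;
}

interface WalkResult {
  scanned: number;
  errors: number;
}

async function warmWalk(deps: WarmDeps): Promise<WalkResult> {
  // A rejection here is a top-level failure and propagates to the caller.
  const ids = await deps.listActiveIds();
  const result: WalkResult = { scanned: 0, errors: 0 };

  for (const id of ids) {
    // Sequential on purpose: the warmer has one call in flight at a time.
    try {
      await deps.resolve(id);
    } catch (error) {
      // A failing row is reported and counted, then the walk moves on.
      result.errors += 1;
      deps.report(error, { rotation_id: id });
    }
    result.scanned += 1;
  }
  return result;
}

// Fire-and-forget: returns immediately and never rejects to the caller.
function startWarm(deps: WarmDeps): void {
  void warmWalk(deps).catch((error) => {
    deps.report(error, { phase: "walk" });
  });
}
```

Because startWarm returns void and swallows the rejection into the report hook, calling it from the listen callback cannot delay or crash server startup in this sketch.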

Acceptance criteria

  • New service apps/backend/services/rotation-tracks-cache-warm.service.ts exporting warmRotationTracksCache() (one walk + counters) and startRotationTracksCacheWarm() (fire-and-forget kickoff).
  • Wired into apps/backend/app.ts post-server.listen alongside startPlaylistProxy / startAlbumPlaysRefresh / setupCdcWebSocket.
  • Active-row predicate matches getRotationFromDB: kill_date IS NULL OR kill_date > CURRENT_DATE.
  • Reuses getDiscogsReleaseIdByRotationId end-to-end, so the existing semaphore, token bucket, and 5 s timeout from #993 all apply unchanged, with no new code paths through LML.
  • Per-row errors are Sentry-captured with subsystem=rotation-tracks-cache-warm and extra.rotation_id, but do not halt the walk.
  • Progress log every 50 rows; final summary log carries scanned, preResolved, lmlPositive, lmlNegative, errors, elapsedMs.
  • Unit tests under tests/unit/services/: walk visits every row, sibling failures don't block siblings, counter classification works (preResolved vs lmlPositive vs lmlNegative), top-level failures are caught and don't escape start....
  • Pre-flight clean: typecheck, lint, format:check, test:unit, build, scripts/check-precondition-guards.sh.
  • No new env vars.
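
One way to keep the counter classification and log cadence testable is to isolate them as pure functions. The counter names below come from the criteria above; the Outcome shape is a hypothetical way of telling the warmer which tier answered, since the issue does not say how the real service distinguishes preResolved from lmlPositive.

```typescript
interface WarmCounters {
  scanned: number;
  preResolved: number;
  lmlPositive: number;
  lmlNegative: number;
  errors: number;
}

// Hypothetical per-row outcome reported back to the warmer.
type Outcome =
  | { kind: "resolved"; tier: 1 | 2 | 3 }
  | { kind: "miss" } // tier 3 ran and found nothing
  | { kind: "error" };

function newCounters(): WarmCounters {
  return { scanned: 0, preResolved: 0, lmlPositive: 0, lmlNegative: 0, errors: 0 };
}

function record(counters: WarmCounters, outcome: Outcome): void {
  counters.scanned += 1;
  switch (outcome.kind) {
    case "resolved":
      // Tiers 1 and 2 never touch LML, so they count as pre-resolved.
      if (outcome.tier === 3) counters.lmlPositive += 1;
      else counters.preResolved += 1;
      break;
    case "miss":
      counters.lmlNegative += 1;
      break;
    case "error":
      counters.errors += 1;
      break;
  }
}

const PROGRESS_EVERY = 50;

// True on rows 50, 100, 150, and so on.
function shouldLogProgress(scanned: number): boolean {
  return scanned > 0 && scanned % PROGRESS_EVERY === 0;
}

// Final summary payload: the counters plus elapsedMs.
function summaryLine(counters: WarmCounters, elapsedMs: number): string {
  return JSON.stringify({ ...counters, elapsedMs });
}
```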
