
fix(lifecycle): tighten shutdown ordering for DLQ metrics + AI workers + DLQ replay cap#68

Merged: aksOps merged 1 commit into main from fix/lifecycle-shutdown-cleanup on Apr 28, 2026


@aksOps aksOps commented Apr 28, 2026


Summary

Three lifecycle fixes surfaced by the codex round-2 production-readiness review. All three are bounded changes with opt-in defaults and no behaviour-breaking shifts for existing deployments.

  • DLQ metrics goroutine leak: the Size()/DiskBytes() poller had no exit condition. After dlq.Stop() ran (LIFO step 4) it kept polling the closed DLQ and raced the file-handle close in repo.Close() (step 6). The goroutine now selects on appCtx.Done() and is registered with bootWG, so it exits before dlq.Stop().
  • AI worker shutdown stall: analyzeLog was using context.Background(), so once aiService.Stop() was called the workers held the goroutine open until each in-flight 30s LLM call returned naturally. Workers now derive their context from a shutdown-aware aiCtx (cancel-derived from appCtx) via SetParentContext. aiCancel() runs immediately before aiService.Stop() in shutdown step 2, so in-flight calls abort within milliseconds instead of holding shutdown hostage for up to 30s × workerPool.
  • DLQ replay-per-tick cap: DLQ_MAX_REPLAY_PER_TICK (default 100) limits how many files are replayed per tick. Without it, an outage that filled the DLQ with 10k files would replay every file in the first post-restart tick, hammering the just-restarted DB and exhausting connection-pool capacity. Backoff-skipped files do not count toward the cap (they cost nothing). Setting it to 0 disables the cap (legacy behaviour).
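
The first two fixes follow the same pattern: give every long-lived goroutine a cancellable context and wait for it before tearing down what it depends on. The following is a minimal sketch of that ordering, not the code from this PR. The names pollMetrics, slowCall, and runShutdown are hypothetical, and slowCall merely stands in for a long LLM request that honours cancellation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// pollMetrics calls sample on every tick until ctx is cancelled.
// sample is a stand-in for the Size()/DiskBytes() reads.
func pollMetrics(ctx context.Context, wg *sync.WaitGroup, interval time.Duration, sample func()) {
	defer wg.Done()
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return // exit condition: stop polling once shutdown begins
		case <-ticker.C:
			sample()
		}
	}
}

// slowCall simulates a long request that returns early when ctx is cancelled.
func slowCall(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runShutdown starts one poller and one worker, then shuts them down in order.
// It reports whether the worker saw context.Canceled and how long shutdown took.
func runShutdown() (workerCanceled bool, elapsed time.Duration) {
	appCtx, appCancel := context.WithCancel(context.Background())
	defer appCancel()
	aiCtx, aiCancel := context.WithCancel(appCtx) // cancel-derived from appCtx
	defer aiCancel()

	var bootWG sync.WaitGroup
	bootWG.Add(1)
	go pollMetrics(appCtx, &bootWG, 5*time.Millisecond, func() {})

	workerErr := make(chan error, 1)
	go func() { workerErr <- slowCall(aiCtx, 30*time.Second) }()

	start := time.Now()
	aiCancel()         // abort the in-flight call before stopping the service
	err := <-workerErr // returns promptly instead of after 30s
	appCancel()        // signal the poller to exit
	bootWG.Wait()      // only after this is it safe to stop the queue and close storage
	return errors.Is(err, context.Canceled), time.Since(start)
}

func main() {
	canceled, elapsed := runShutdown()
	fmt.Println("worker cancelled:", canceled, "shutdown under 1s:", elapsed < time.Second)
}
```

Waiting on the WaitGroup before closing shared resources is what removes the race: cancellation alone only requests an exit, it does not confirm one.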

Files

  • internal/ai/service.go: parentCtx field + SetParentContext. Workers fall back to context.Background() when the setter isn't called (preserves embedded-caller behaviour).
  • internal/queue/dlq.go: maxReplayPerTick field + SetMaxReplayPerTick(n) (clamps negative to 0). processFiles reads the cap once under the mutex, then breaks when actual replayFn invocations hit the cap.
  • internal/queue/dlq_replay_cap_test.go: three race-safe tests: bounded attempts (cap=10/total=50), unlimited default (cap=0/total=25 → 25 attempts), negative input clamps to unlimited.
  • internal/config/config.go: DLQMaxReplayPerTick field, DLQ_MAX_REPLAY_PER_TICK env, default 100.
  • main.go: DLQ metrics goroutine wired to appCtx/bootWG, AI service aiCtx/SetParentContext/aiCancel plumbing, dlq.SetMaxReplayPerTick(cfg.DLQMaxReplayPerTick) after construction.
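
The replay cap described for dlq.go can be illustrated roughly as follows. This is a simplified sketch under assumptions, not the actual processFiles implementation: pendingFile, clampCap, replayTick, and makeFiles are hypothetical names, and locking, backoff bookkeeping, and error handling are left out.

```go
package main

import "fmt"

// pendingFile is a hypothetical stand-in for one DLQ entry.
type pendingFile struct {
	name      string
	inBackoff bool // true when the file's next retry time has not arrived yet
}

// clampCap mirrors the described setter behaviour: negative values become 0,
// and 0 means "no cap".
func clampCap(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

// replayTick invokes replay for eligible files and stops once maxPerTick
// invocations have been made. It returns the number of invocations.
func replayTick(files []pendingFile, maxPerTick int, replay func(pendingFile) error) int {
	attempts := 0
	for _, f := range files {
		if maxPerTick > 0 && attempts >= maxPerTick {
			break // cap reached; remaining files wait for the next tick
		}
		if f.inBackoff {
			continue // skipped files do not count toward the cap
		}
		_ = replay(f) // failure handling is out of scope for this sketch
		attempts++
	}
	return attempts
}

// makeFiles builds n eligible files for the demo.
func makeFiles(n int) []pendingFile {
	files := make([]pendingFile, n)
	for i := range files {
		files[i] = pendingFile{name: fmt.Sprintf("file-%d", i)}
	}
	return files
}

func main() {
	noop := func(pendingFile) error { return nil }
	fmt.Println(replayTick(makeFiles(50), clampCap(10), noop)) // 10
	fmt.Println(replayTick(makeFiles(25), clampCap(0), noop))  // 25
	fmt.Println(replayTick(makeFiles(25), clampCap(-1), noop)) // 25
}
```

Counting only real replay invocations keeps the cap tied to database load: a file that is skipped for backoff never touches the connection pool, so it should not consume budget.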

Test plan

  • go test ./internal/queue/ ./internal/ai/ ./internal/config/ -race — 45 passed
  • go test ./... (full suite, no race) — 459 passed in 27 packages
  • go vet ./... — clean
  • go build ./... — clean
  • CI green (security stack + SonarCloud) before merge

🤖 Generated with Claude Code

…s + DLQ replay cap

Three lifecycle fixes from codex round-2 production-readiness review:

1. DLQ metrics goroutine now exits on appCtx cancellation and is
   tracked in bootWG so it stops before dlq.Stop()/repo.Close() — was
   previously a leaked ticker that kept polling Size()/DiskBytes() on
   the closed DLQ, racing the file-handle close.

2. AI service workers now derive their LLM-call context from a
   shutdown-aware aiCtx (cancel-derived from appCtx) via
   SetParentContext. aiCancel() is invoked before aiService.Stop() so
   in-flight 30s LLM calls are cancelled immediately rather than
   blocking shutdown for up to 30s × workerPool.

3. DLQ replay worker now caps replayFn invocations per tick via
   DLQ_MAX_REPLAY_PER_TICK (default 100). Without the cap, an outage
   that filled the DLQ with 10k files would replay every file in the
   first post-restart tick, hammering the just-restarted DB and
   exhausting connection-pool capacity. Backoff-skipped files don't
   count toward the cap — they cost nothing.

Tests: 3 new tests in internal/queue/ exercising cap-bounded,
unlimited (legacy default), and negative-input clamping behaviour. All
pass under -race. Full suite: 459/459 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@aksOps aksOps merged commit c129ae6 into main Apr 28, 2026
17 checks passed
@aksOps aksOps deleted the fix/lifecycle-shutdown-cleanup branch April 28, 2026 08:08