fix(lifecycle): tighten shutdown ordering for DLQ metrics + AI workers + DLQ replay cap #68
Merged
…s + DLQ replay cap

Three lifecycle fixes from codex round-2 production-readiness review:

1. DLQ metrics goroutine now exits on appCtx cancellation and is tracked in bootWG so it stops before dlq.Stop()/repo.Close(). It was previously a leaked ticker that kept polling Size()/DiskBytes() on the closed DLQ, racing the file-handle close.
2. AI service workers now derive their LLM-call context from a shutdown-aware aiCtx (cancel-derived from appCtx) via SetParentContext. aiCancel() is invoked before aiService.Stop() so in-flight 30s LLM calls are cancelled immediately rather than blocking shutdown for up to 30s × workerPool.
3. DLQ replay worker now caps replayFn invocations per tick via DLQ_MAX_REPLAY_PER_TICK (default 100). Without the cap, an outage that filled the DLQ with 10k files would replay every file in the first post-restart tick, hammering the just-restarted DB and exhausting connection-pool capacity. Backoff-skipped files don't count toward the cap because they cost nothing.

Tests: 3 new tests in internal/queue/ exercising cap-bounded, unlimited (legacy default), and negative-input clamping behaviour. All pass under -race. Full suite: 459/459 green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Summary
Three lifecycle fixes surfaced by the codex round-2 production-readiness review. All three are bounded changes with opt-in defaults and no behaviour-breaking shifts for existing deployments.
- The DLQ metrics `Size()`/`DiskBytes()` poller had no exit condition. After `dlq.Stop()` ran (LIFO step 4) it kept polling on the closed DLQ and raced the file-handle close in `repo.Close()` (step 6). The goroutine now selects on `appCtx.Done()` and is registered with `bootWG` so it exits before `dlq.Stop()`.
- `analyzeLog` was using `context.Background()`, so once `aiService.Stop()` was called the workers held the goroutine open until each in-flight 30s LLM call returned naturally. Workers now derive their context from a shutdown-aware `aiCtx` (cancel-derived from `appCtx`) via `SetParentContext`. `aiCancel()` runs immediately before `aiService.Stop()` in shutdown step 2, so in-flight calls abort within milliseconds instead of holding shutdown hostage for up to 30s × workerPool.
- The DLQ replay worker now caps `replayFn` invocations per tick via `DLQ_MAX_REPLAY_PER_TICK` (default `100`). Without it, an outage that filled the DLQ with 10k files would replay every file in the first post-restart tick, hammering the just-restarted DB and exhausting connection-pool capacity. Backoff-skipped files do not count toward the cap (they cost nothing). `0` disables the cap (legacy behaviour).

Files
- `internal/ai/service.go`: `parentCtx` field + `SetParentContext`. Workers fall back to `context.Background()` when the setter isn't called (preserves embedded-caller behaviour).
- `internal/queue/dlq.go`: `maxReplayPerTick` field + `SetMaxReplayPerTick(n)` (clamps negative to 0). `processFiles` reads the cap once under the mutex, then breaks when actual `replayFn` invocations hit the cap.
- `internal/queue/dlq_replay_cap_test.go`: three race-safe tests covering bounded attempts (cap=10/total=50), unlimited default (cap=0/total=25 → 25 attempts), and negative input clamping to unlimited.
- `internal/config/config.go`: `DLQMaxReplayPerTick` field, `DLQ_MAX_REPLAY_PER_TICK` env, default 100.
- `main.go`: DLQ metrics goroutine wired to `appCtx`/`bootWG`, AI service `aiCtx`/`SetParentContext`/`aiCancel` plumbing, `dlq.SetMaxReplayPerTick(cfg.DLQMaxReplayPerTick)` after construction.

Test plan
- `go test ./internal/queue/ ./internal/ai/ ./internal/config/ -race`: 45 passed
- `go test ./...` (full suite, no race): 459 passed in 27 packages
- `go vet ./...`: clean
- `go build ./...`: clean

🤖 Generated with Claude Code
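
For readers who want the replay-cap rule in isolation, here is a minimal sketch of the behaviour described for `processFiles` and `SetMaxReplayPerTick`. It is not the code in `internal/queue/dlq.go`: the `entry` type and the `clampCap`, `replayTick` and `makeEntries` helpers are hypothetical, and the mutex around the cap read is omitted. The three calls in `main` mirror the cases listed for the new tests.

```go
package main

import "fmt"

// entry is a hypothetical stand-in for one DLQ file.
type entry struct {
	name      string
	inBackoff bool // true when the file's retry time has not arrived yet
}

// clampCap mirrors the described setter behaviour: negative values become 0,
// and 0 means "no cap".
func clampCap(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

// replayTick walks the entries once and returns how many times replay was
// invoked. Entries still in backoff are skipped and do not count toward
// maxPerTick.
func replayTick(entries []entry, maxPerTick int, replay func(entry)) int {
	maxPerTick = clampCap(maxPerTick)
	attempts := 0
	for _, e := range entries {
		if maxPerTick > 0 && attempts >= maxPerTick {
			break // cap reached; remaining files wait for the next tick
		}
		if e.inBackoff {
			continue
		}
		replay(e)
		attempts++
	}
	return attempts
}

// makeEntries builds n replayable entries for the demo.
func makeEntries(n int) []entry {
	out := make([]entry, 0, n)
	for i := 0; i < n; i++ {
		out = append(out, entry{name: fmt.Sprintf("file-%d", i)})
	}
	return out
}

func main() {
	noop := func(entry) {}
	fmt.Println(replayTick(makeEntries(50), 10, noop)) // prints 10
	fmt.Println(replayTick(makeEntries(25), 0, noop))  // prints 25
	fmt.Println(replayTick(makeEntries(25), -5, noop)) // prints 25
}
```

Counting only real `replay` invocations keeps the cap tied to the work that actually reaches the database, which is the load the cap is meant to bound.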