fix(robustness): atomic batch writes + DLQ fsync (P0)#59
Merged
Conversation
P0 robustness items from the Round-2 brainstorm. Two of the four originally proposed items are deferred with rationale below. **1. Pipeline orphan FK rows (P0).** A worker panic between BatchCreateTraces and BatchCreateSpans previously left a parent trace row with no children (or vice versa). New Repository.BatchCreateAll persists Trace→Span→Log inside a single db.Transaction; any failure (or panic via the worker's defer recover) rolls back the entire batch atomically. Trace insert preserves the existing ON CONFLICT DO NOTHING idempotency. Behavior change: trace insert errors now abort the batch instead of being "tolerated". The previous tolerance was a fragile workaround — traces are idempotent on retry, so a DLQ replay re-attempts cleanly. Test TestPipeline_FailedTracesContinuesToSpans renamed/rewritten to TestPipeline_FailedTracesAbortsBatch reflecting the new contract. TestPipeline_FailedSpansSkipsLogs (orphan-log invariant) preserved. **2. DLQ Enqueue f.Sync() (P0).** Adds f.Sync() before f.Close() so a host crash between Write and Close can't leave a torn JSON file that permanently consumes a retry slot until DLQ_MAX_RETRIES evicts it. (Edit landed on the prior checkpoint commit on this branch.) **Deferred from the brainstorm with rationale:** - "DLQ replay poisoned-pill via per-row ON CONFLICT DO NOTHING" — Spans, Logs, and MetricBuckets have no unique-index beyond auto-incrementing id, so ON CONFLICT DO NOTHING is a no-op there. Real fix needs a schema migration to add uniqueIndex on (tenant_id, trace_id, span_id) for spans (plus a pre-migration dedup pass on existing rows). Out of scope for P0; needs its own design pass. - "GraphRAG refresh shadow-swap to fix multi-second write-lock stall" — refresh.go does many tiny Upsert* calls (microseconds each, lock briefly per call), not a single bulk write-lock window. The "multi-second stall" framing didn't survive code reading; actual contention is distributed, not bursty. Demoted to P2 optimization, deferred. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.


Summary
P0 robustness items surfaced by a structured two-voice brainstorm with codex (gpt-5-codex). Two passes: codex critiqued my opening positions, I pushed back on three of codex's downgrades, both sides did file-grounded verification of every claim. Of the original 4 P0 items, 2 ship here and 2 are deferred with documented rationale (the dialogue revealed they were based on flawed mental models of the actual code).
Ships
Repository.BatchCreateAllpersists Trace→Span→Log inside a singledb.Transaction. A panic between twoBatchCreate*calls used to leak orphan parent rows; now the entire batch rolls back atomically. Trace insert preserves the existingON CONFLICT DO NOTHINGidempotency, so DLQ replay re-attempts cleanly.f.Sync()— adds fsync before close so a host crash between Write and Close can't leave a torn JSON file that permanently consumes a retry slot untilDLQ_MAX_RETRIESevicts it.Behavior change worth a second look
TestPipeline_FailedTracesContinuesToSpansis renamed/rewritten asTestPipeline_FailedTracesAbortsBatch. Trace insert errors now abort the batch instead of being "tolerated". The previous tolerance was a fragile workaround — traces are idempotent on retry, so a DLQ replay re-attempts cleanly.TestPipeline_FailedSpansSkipsLogs(orphan-log invariant) preserved.Deferred (documented, not silent)
ON CONFLICT DO NOTHING—Span,Log,MetricBuckethave no unique-index beyond auto-incrementid, so the clause is a no-op there. Real fix needs a schema migration (uniqueIndexon(tenant_id, trace_id, span_id)for spans, plus a pre-migration dedup on existing rows). Out of scope for P0; needs its own design pass.rebuildFromDBForTenantdoes many tinyUpsert*calls (microseconds each, lock briefly per call), not a single bulk write-lock. Contention is distributed, not bursty. Demoted to a P2 optimization, deferred to a future PR if metrics show it's needed.Skipped at the brainstorm stage (verification rejected each)
defer release()insiderunWithTimeoutalready covers it)New()beforeStart()spawns workers)flush())ctxalready propagates; gap is at handler middleware)Test plan
go vet ./...cleango build ./...succeedsgo test ./... -race -count=1— 404 tests pass across 27 packages (pre-existing flake onTestMCP_TenantIsolation_AllGraphRAGToolsunder load surfaced once, passed on rerun and in isolation; flagged as a follow-up to investigate)TestPipeline_FailedTracesAbortsBatchandTestPipeline_FailedSpansSkipsLogsboth pass under-race🤖 Generated with Claude Code