
fix(storage): make span ingest idempotent for DLQ replay #62

Merged
aksOps merged 1 commit into main from fix/span-replay-idempotency
Apr 28, 2026

aksOps commented Apr 28, 2026

Summary

Closes the deferred P0 from the robustness brainstorm: span ingest is now idempotent on the (tenant_id, trace_id, span_id) key, so DLQ replay after a partial-batch failure no longer double-counts spans in the relational DB or downstream GraphRAG.

  • New composite `uniqueIndex idx_spans_tenant_trace_span` on the `spans` table
  • `BatchCreateSpans` and the spans path of `BatchCreateAll` now use `OnConflict.DoNothing` (on MySQL: `INSERT IGNORE`), the same pattern already used for traces
  • Pre-migration dedupe in `migrate_spans.go` collapses any pre-existing duplicates before the unique index is created, so upgrades from older deployments don't abort startup
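
To illustrate the conflict-do-nothing behaviour described above, here is a minimal sketch that uses only the standard library. It is not the PR's code: the `Span` fields, the `memStore` type, and the lowercase `batchCreateSpans` method are hypothetical stand-ins, and an in-memory map plays the role of the unique index. The struct tags show how a composite unique index is commonly declared in GORM, on the assumption that the real model does something similar.

```go
package main

import "fmt"

// Span is a hypothetical, trimmed-down model. The tags show the usual GORM
// way to declare a composite unique index; the real struct is not shown in
// the PR, so treat the field names and types as assumptions.
type Span struct {
	TenantID string `gorm:"uniqueIndex:idx_spans_tenant_trace_span"`
	TraceID  string `gorm:"uniqueIndex:idx_spans_tenant_trace_span"`
	SpanID   string `gorm:"uniqueIndex:idx_spans_tenant_trace_span"`
	Name     string
}

// spanKey mirrors the (tenant_id, trace_id, span_id) composite key.
type spanKey struct{ tenantID, traceID, spanID string }

// memStore stands in for the spans table; seen plays the unique index.
type memStore struct {
	seen map[spanKey]struct{}
	rows []Span
}

func newMemStore() *memStore {
	return &memStore{seen: make(map[spanKey]struct{})}
}

// batchCreateSpans inserts every span whose key is not yet present and
// silently skips the rest, emulating ON CONFLICT DO NOTHING. It returns
// the number of rows actually inserted.
func (s *memStore) batchCreateSpans(batch []Span) int {
	inserted := 0
	for _, sp := range batch {
		k := spanKey{sp.TenantID, sp.TraceID, sp.SpanID}
		if _, dup := s.seen[k]; dup {
			continue // conflict on the composite key: do nothing
		}
		s.seen[k] = struct{}{}
		s.rows = append(s.rows, sp)
		inserted++
	}
	return inserted
}

func main() {
	store := newMemStore()
	batch := []Span{
		{TenantID: "t1", TraceID: "tr1", SpanID: "s1", Name: "GET /"},
		{TenantID: "t1", TraceID: "tr1", SpanID: "s2", Name: "db.query"},
		// Same trace and span IDs under another tenant: a distinct key.
		{TenantID: "t2", TraceID: "tr1", SpanID: "s1", Name: "GET /"},
	}
	fmt.Println(store.batchCreateSpans(batch)) // first delivery: 3
	fmt.Println(store.batchCreateSpans(batch)) // simulated DLQ replay: 0
	fmt.Println(len(store.rows))               // still 3 rows
}
```

Because `tenant_id` is part of the key, two tenants that happen to share trace and span IDs do not collide, which is the cross-tenant isolation the new tests check.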

Logs remain non-idempotent: OTLP logs lack a stable identifier, and the natural composite key would be expensive without a clear win. This is called out explicitly in `BatchCreateAll`'s doc comment as separate future work.

Test plan

  • `go test -race -count=1 ./...` → 414 passed (was 407, +7 new tests)
  • New tests cover: duplicate-insert no-op, cross-tenant key isolation, BatchCreateAll replay idempotency, dedupe migration on pre-existing dupes, no-op on fresh DB, no-op once unique index exists, AutoMigrate creates the unique index
  • Test helper `seedTrace` updated to give each span a distinct SpanID — without it, tests that seed multiple spans per trace would silently collapse under the new constraint
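
The test-helper change can be pictured with a small sketch. The real `seedTrace` is not shown in the PR, so `seedSpans`, `distinctKeys`, and the ID format below are hypothetical; the point is only that each seeded span needs its own `SpanID`, otherwise every span after the first would be dropped as a duplicate.

```go
package main

import "fmt"

// Span carries only the three key columns; a hypothetical stand-in.
type Span struct {
	TenantID string
	TraceID  string
	SpanID   string
}

// seedSpans builds n spans for one trace, deriving a distinct SpanID from
// the loop index so no two seeded spans share a composite key.
func seedSpans(tenantID, traceID string, n int) []Span {
	spans := make([]Span, 0, n)
	for i := 0; i < n; i++ {
		spans = append(spans, Span{
			TenantID: tenantID,
			TraceID:  traceID,
			SpanID:   fmt.Sprintf("%s-span-%03d", traceID, i),
		})
	}
	return spans
}

// distinctKeys counts unique (tenant, trace, span) keys, i.e. how many
// rows would survive under the unique constraint.
func distinctKeys(spans []Span) int {
	seen := make(map[Span]struct{}, len(spans))
	for _, s := range spans {
		seen[s] = struct{}{}
	}
	return len(seen)
}

func main() {
	spans := seedSpans("t1", "tr1", 3)
	fmt.Println(len(spans), distinctKeys(spans)) // 3 3
	fmt.Println(spans[2].SpanID)                 // tr1-span-002
}
```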

Out of scope

  • Log idempotency (separate design pass — natural unique key is non-obvious)

🤖 Generated with Claude Code

Adds composite uniqueIndex idx_spans_tenant_trace_span on
(tenant_id, trace_id, span_id) so a duplicate span ingest — most
commonly a DLQ replay after a partial-batch failure — collapses to
a no-op rather than producing double-counted spans in the relational
DB and downstream GraphRAG.

Changes:
- Span model: composite uniqueIndex on (tenant_id, trace_id, span_id)
- BatchCreateSpans + BatchCreateAll spans path: OnConflict.DoNothing
  (MySQL takes INSERT IGNORE) — mirrors existing trace idempotency
- New migrate_spans.go: dedupes pre-existing duplicate spans BEFORE
  AutoMigrate adds the unique index, so upgrades from pre-RAN-65
  deployments don't fail on legacy duplicates
- Test helper: distinct SpanIDs per seeded span so tests still create
  the expected count
- 7 new tests covering: duplicate-insert no-op, cross-tenant key
  isolation, BatchCreateAll replay idempotency, dedupe migration on
  pre-existing dupes, no-op on fresh DB, no-op once unique index
  exists, AutoMigrate creates the unique index

Logs remain non-idempotent (OTLP logs lack a stable identifier);
called out explicitly in BatchCreateAll's doc comment as separate
future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
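
The commit message says `migrate_spans.go` collapses pre-existing duplicates before the unique index is created. A minimal sketch of that idea, run over an in-memory slice instead of a database, might look like the following. `spanRow`, `keyOf`, and `dedupeSpans` are hypothetical names, and keeping the row with the lowest ID is an assumption: the PR does not say which duplicate survives.

```go
package main

import "fmt"

// spanRow is a hypothetical stand-in for one row of the spans table.
type spanRow struct {
	ID       int // assumed to be a unique primary key
	TenantID string
	TraceID  string
	SpanID   string
}

// spanKey mirrors the (tenant_id, trace_id, span_id) composite key.
type spanKey struct{ tenantID, traceID, spanID string }

func keyOf(r spanRow) spanKey {
	return spanKey{r.TenantID, r.TraceID, r.SpanID}
}

// dedupeSpans keeps one row per composite key (the lowest ID, by
// assumption) and reports how many rows were dropped. Input order is
// preserved for the surviving rows.
func dedupeSpans(rows []spanRow) (kept []spanRow, removed int) {
	lowest := make(map[spanKey]int, len(rows))
	for _, r := range rows {
		if id, ok := lowest[keyOf(r)]; !ok || r.ID < id {
			lowest[keyOf(r)] = r.ID
		}
	}
	kept = make([]spanRow, 0, len(lowest))
	for _, r := range rows {
		if lowest[keyOf(r)] == r.ID {
			kept = append(kept, r)
		}
	}
	return kept, len(rows) - len(kept)
}

func main() {
	legacy := []spanRow{
		{1, "t1", "tr1", "s1"},
		{2, "t1", "tr1", "s1"}, // duplicate left by an earlier replay
		{3, "t2", "tr1", "s1"}, // other tenant: not a duplicate
	}
	kept, removed := dedupeSpans(legacy)
	fmt.Println(len(kept), removed) // 2 1

	// A second pass finds nothing to remove, so rerunning is harmless.
	_, again := dedupeSpans(kept)
	fmt.Println(again) // 0
}
```

The ordering matters: creating a unique index on a table that still contains duplicate keys fails, so the dedupe has to finish before the index is added.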
aksOps merged commit 83fdbf7 into main Apr 28, 2026
17 checks passed
aksOps deleted the fix/span-replay-idempotency branch April 28, 2026 06:23