Skip to content

fix(robustness): P1+P2 follow-ups from brainstorm#61

Merged
aksOps merged 4 commits into
mainfrom
fix/robustness-p1-p2
Apr 28, 2026
Merged

fix(robustness): P1+P2 follow-ups from brainstorm#61
aksOps merged 4 commits into
mainfrom
fix/robustness-p1-p2

Conversation

@aksOps

@aksOps aksOps commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes the four actionable P1+P2 items deferred from the codex round-2 brainstorm. The fifth (DLQ replay idempotency for spans/logs) is still deferred — it needs a schema migration plus pre-migration dedup, not a code-only fix.

Severity Item Files LOC
P1 Tenant pipeline starvation fairness — per-tenant in-flight cap `pipeline.go`, `config.go`, `main.go`, `pipeline_test.go` ~150 (incl. tests)
P2 `/ready` saturation probes (DLQ disk + pipeline fullness) `health_handlers.go`, `server.go`, `main.go` ~55
P2 Retention adaptive pacing `retention.go` ~50
P2 GraphSnapshot row-count cap `snapshot.go` ~30

Per-tenant cap details

  • Healthy submissions from a tenant already at the cap drop at `Submit()` with reason `tenant_backpressure`. New `TenantDropped()` counter, separate from `DroppedHealthy` (queue-wide soft drop) and `RejectedFull` (hard reject).
  • Priority batches (errors / slow traces) bypass the cap — diagnostic data must always land.
  • Slot reserved at `Submit`, released in `process()`'s deferred cleanup so panics also release. Channel-full path undoes the reservation before returning `ErrQueueFull`.
  • Wired via `INGEST_PIPELINE_PER_TENANT_CAP` (default 0 = disabled). Single-tenant deployments see no behavior change. Multi-tenant deployments should set this to roughly `Capacity/N × 2`.

Skipped from the brainstorm with rationale

  • P1 pipeline drain-path metric blind spot. Verification shows `Stop()`'s drain processes batches via `process()`; there are no shutdown-phase drops to account for. False premise; nothing to fix.

Test plan

  • `go vet ./...` clean
  • `go build ./...` succeeds
  • `go test ./... -race -count=1` — 407 pass across 27 packages (3 new tests for the per-tenant cap contract)
  • commit signed (ED25519)

🤖 Generated with Claude Code

Closes the four actionable follow-ups deferred from the codex round-2
brainstorm. The fifth ("DLQ replay idempotency for spans/logs") is still
deferred — it requires a schema migration to add unique indexes plus a
pre-migration dedup pass, not a code-only fix.

P1 — tenant pipeline starvation fairness (`internal/ingest/pipeline.go`,
`config.go`, `main.go`).
- Added per-tenant in-flight cap to Pipeline. When set, healthy
  submissions from a tenant already at the cap are dropped at Submit()
  with reason "tenant_backpressure" (separate counter from
  DroppedHealthy / RejectedFull). Priority batches (errors / slow
  traces) bypass — diagnostic data must always land.
- Slot is reserved at Submit and released in process()'s deferred
  cleanup, so panics still release. Failed channel sends in Submit
  undo the reservation before returning ErrQueueFull.
- Wired via INGEST_PIPELINE_PER_TENANT_CAP (default 0 = disabled, no
  behavior change for single-tenant deployments). Multi-tenant
  deployments should set this to ~Capacity/N×2 where N is the number
  of concurrently-active tenants.

P1 — pipeline drain-path metric blind spot: skipped. The brainstorm's
framing was that Stop()'s drain bypasses observeDrop accounting, but
verification shows the drain path PROCESSES (not drops) batches via
process(), and there are no shutdown-phase drops to account for.

P2 — /ready saturation probes (`internal/api/health_handlers.go`,
`server.go`, `main.go`).
- Server gains two callback-shaped probes (DLQ disk fullness, ingest
  pipeline fullness). Probes return a fraction in [0,1]; /ready flips
  to 503 above 0.95 so orchestrators stop sending traffic before the
  pipeline starts hard-rejecting (gRPC RESOURCE_EXHAUSTED / HTTP 429)
  or DLQ starts FIFO-evicting.
- Wired in main.go using closures that capture dlq + ingestPipeline.
  Probes are nil-tolerant on the server side; nil = "skipped" rather
  than fatal.

P2 — retention adaptive pacing (`internal/storage/retention.go`).
- New adaptPurgeSleep on RetentionScheduler. After each purge pass,
  measures wall-clock time vs purgeInterval. >50% → double the
  inter-batch sleep (capped at 100ms). <10% → halve it (floored at
  1ms). Single-writer (the retention loop), so no synchronization
  needed; purge methods read the value once at the call boundary.

P2 — GraphSnapshot row cap (`internal/graphrag/snapshot.go`).
- pruneOldSnapshots now enforces a 100k row backstop after the by-age
  delete. Steady-state at 15-min cadence × 100 tenants is ~67k rows;
  100k is enough headroom that the backstop never triggers under
  normal operation but bounds disk if write rate spikes.

Tests: three new pipeline tests lock in the per-tenant cap contract
(drops past cap, priority bypasses, slot released after process). Full
suite: 407 tests pass under `-race -count=1`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/ingest/pipeline.go Fixed
aksOps and others added 2 commits April 28, 2026 05:35
CodeQL flagged make(chan *Batch, cfg.Capacity) at pipeline.go:171
as go/uncontrolled-allocation-size — cfg.Capacity flows from the
INGEST_PIPELINE_QUEUE_SIZE env var without an upper bound, so a
typo like 10_000_000_000 would OOM the process at startup.

Add defensive caps in NewPipeline (maxPipelineCapacity = 1M,
maxPipelineWorkers = 256). Clamping at the constructor satisfies
the CodeQL taint-tracking and is the right defense regardless —
both ceilings are well above any reasonable deployment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In-place reassignment of cfg.Capacity/Workers wasn't recognized by
CodeQL as a sanitizer for go/uncontrolled-allocation-size. Reroute
through local variables clamped via min(); make() consumes the
local, which CodeQL's taint-tracking accepts as bounded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread internal/ingest/pipeline.go Dismissed
CodeQL's BoundedFlowSource taint analysis recognizes explicit if/else
comparison guards as sanitizers but not the min() builtin (Go 1.21+).
Switching to `if x > MAX { x = MAX }` is the canonical pattern from
go/uncontrolled-allocation-size docs and should clear the alert without
changing runtime behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

@aksOps aksOps merged commit f86d391 into main Apr 28, 2026
17 checks passed
@aksOps aksOps deleted the fix/robustness-p1-p2 branch April 28, 2026 05:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants