
feat(ingest): async ingest pipeline with hybrid backpressure #50

Merged: aksOps merged 2 commits into main from feat/ingest-async-pipeline on Apr 27, 2026

aksOps commented on Apr 27, 2026

Summary

Phase 1 of the 150–200 component robustness push. Decouples OTLP Export() from synchronous DB writes via a bounded async pipeline, making the ingest path resilient under burst load.

  • New internal/ingest/pipeline.go — bounded queue + worker pool with hybrid backpressure:
    • <90% queue full — accept everything
    • 90%–100% — silent drop of healthy batches; errors / slow traces always pass
    • 100% — gRPC RESOURCE_EXHAUSTED / HTTP 429 so OTLP clients back off cleanly
  • Two new Prometheus instruments — otelcontext_ingest_pipeline_queue_depth{signal} and otelcontext_ingest_pipeline_dropped_total{signal,reason}
  • Three new env vars — INGEST_ASYNC_ENABLED (default true), INGEST_PIPELINE_QUEUE_SIZE (default 50000), INGEST_PIPELINE_WORKERS (default 8)
  • INGEST_ASYNC_ENABLED=false is the kill switch — reverts to the legacy synchronous write path inside Export() bit-for-bit
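
The three-tier admission policy in the first bullet can be sketched as a pure function. This is a minimal illustration, not the code in internal/ingest/pipeline.go: `admit` and its signature are hypothetical, and only `ErrQueueFull` is a name taken from this PR's commit messages.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQueueFull is the sentinel named in this PR; everything else is illustrative.
var ErrQueueFull = errors.New("ingest pipeline: queue full")

// admit reports whether a batch should be enqueued given the current queue
// depth and capacity. A (false, nil) result is a silent soft drop.
func admit(depth, capacity int, priority bool) (bool, error) {
	switch {
	case depth >= capacity:
		// Hard limit: rejected even for priority batches.
		return false, ErrQueueFull
	case depth*10 >= capacity*9 && !priority:
		// Soft band (90% up to full): healthy batches are dropped silently.
		return false, nil
	default:
		return true, nil
	}
}

func main() {
	const capacity = 50000 // default INGEST_PIPELINE_QUEUE_SIZE
	for _, depth := range []int{1000, 45000, 50000} {
		for _, priority := range []bool{false, true} {
			ok, err := admit(depth, capacity, priority)
			fmt.Printf("depth=%d priority=%t enqueue=%t err=%v\n", depth, priority, ok, err)
		}
	}
}
```

Integer arithmetic (`depth*10 >= capacity*9`) keeps the 90% boundary exact without floating-point comparison; whether the real pipeline computes the threshold this way is an assumption.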

Architecture preserved

  • Trace → Span → Log FK ordering held end-to-end (the Batch is the unit of work; the worker runs the same insert sequence the sync path used)
  • Existing sampler/callback/tenant flows unchanged
  • TSDB metrics path bypasses the pipeline (already async via in-mem aggregator)
  • Shutdown LIFO updated: gRPC GracefulStop → pipeline Stop (drains) → DLQ → retention → DB close
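
As a rough sketch of the FK-ordering bullet above, a worker can handle one Batch by writing traces, then spans, then logs. The partial-failure rules here are inferred from the test names in this PR (failed-spans-skip-logs, failed-traces-continue-to-spans); the `Batch` fields, the `Store` interface, and `process` are hypothetical stand-ins, not the repository's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Batch is a hypothetical stand-in for the PR's unit of work.
type Batch struct {
	Traces, Spans, Logs []string
}

// Store is a hypothetical persistence interface, not the real Repository.
type Store interface {
	InsertTraces([]string) error
	InsertSpans([]string) error
	InsertLogs([]string) error
}

// process writes one batch in Trace -> Span -> Log order and returns the
// names of the stages that were persisted.
func process(s Store, b *Batch) []string {
	if b == nil {
		return nil
	}
	var done []string
	// A failed trace write does not stop the span write.
	if err := s.InsertTraces(b.Traces); err == nil {
		done = append(done, "traces")
	}
	// A failed span write skips the log write.
	if err := s.InsertSpans(b.Spans); err != nil {
		return done
	}
	done = append(done, "spans")
	if err := s.InsertLogs(b.Logs); err == nil {
		done = append(done, "logs")
	}
	return done
}

// fakeStore fails the single stage named in failOn.
type fakeStore struct{ failOn string }

func (f fakeStore) fail(stage string) error {
	if f.failOn == stage {
		return errors.New(stage + " insert failed")
	}
	return nil
}

func (f fakeStore) InsertTraces([]string) error { return f.fail("traces") }
func (f fakeStore) InsertSpans([]string) error  { return f.fail("spans") }
func (f fakeStore) InsertLogs([]string) error   { return f.fail("logs") }

func main() {
	b := &Batch{Traces: []string{"t1"}, Spans: []string{"s1"}, Logs: []string{"l1"}}
	for _, failOn := range []string{"", "traces", "spans"} {
		fmt.Printf("failOn=%q persisted=%v\n", failOn, process(fakeStore{failOn: failOn}, b))
	}
}
```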

Test plan

  • go build ./... clean
  • go vet ./... clean
  • go test -race ./... — all 12 packages pass
  • 19 pipeline tests pass under -race (15 unit + 4 e2e); the unit tests cover: nil/empty batch, soft-threshold drop, priority bypass, hard-capacity rejection, FK ordering, callback sequencing, failed-spans-skip-logs, failed-traces-continue-to-spans, drain-on-Stop, idempotent Stop, concurrent submit, default-config fallback, callback panic recovery, gRPC status code mapping
  • 4 e2e tests against in-memory SQLite covering: traces persist through pipeline, logs persist through pipeline, hard-capacity returns codes.ResourceExhausted, priority batches bypass soft backpressure end-to-end
  • Existing OTLP HTTP e2e tests (TestOTLPHTTPEndToEnd) continue to pass — confirms the legacy synchronous fallback is intact

Backwards compatibility

  • Default behavior is async (new) — operators who want the prior sync path set INGEST_ASYNC_ENABLED=false
  • Intake metrics (GRPCBatchSize, IngestionRate) fire on receipt rather than on persist, so net-persisted is computed as ingestion_total - ingest_pipeline_dropped_total. Documented in docs/OPERATIONS.md.

Follow-ups (separate PRs per phase plan)

  • Phase 2: per-tenant cardinality fairness
  • Phase 3a: SQLite FTS5 + BM25 for log search (default)
  • Phase 3b: Postgres partitioning as opt-in adapter
  • Phase 4: HTTP OTLP backpressure parity (HTTP 429 + Retry-After)
  • Phase 5: DROP-PARTITION retention
  • Phase 6: MCP HTTP streamable robustness for frequent queries

🤖 Generated with Claude Code

aksOps and others added 2 commits April 27, 2026 16:00
Introduces the Pipeline type that decouples OTLP Export() from
synchronous DB writes. Builds the foundation for Phase 1 of the
robustness work — wiring into TraceServer/LogsServer.Export comes in
the next commit.

Backpressure policy is hybrid:
  - <90% queue fullness    → all batches enqueue
  - 90%-100% fullness      → healthy batches dropped (silent), priority
                              (errors / slow traces) always enqueue
  - 100% (channel full)    → ErrQueueFull returned to the caller, even
                              for priority batches; callers map this to
                              gRPC RESOURCE_EXHAUSTED / HTTP 429 so the
                              client backs off rather than retrying tighter

The unit of work is a Batch — one OTLP Export call's persistable output
packaged together. This preserves the Trace→Span→Log FK ordering the
synchronous path enforces, and lets the worker run the same insert
sequence without rewriting trace upsert logic.

Two new Prometheus instruments surface the new behavior:
  - otelcontext_ingest_pipeline_queue_depth{signal} — gauge
  - otelcontext_ingest_pipeline_dropped_total{signal,reason} — counter
    where reason ∈ {soft_backpressure, queue_full}

Test coverage includes nil/empty batch handling, soft-threshold drop,
priority bypass, hard-capacity rejection, FK ordering, callback
sequencing, partial-failure isolation (failed spans skip logs, failed
traces continue to spans), graceful drain on Stop, idempotent Stop,
race-detector-safe concurrent submit, default-config fallback, and
panic recovery in callbacks. All 14 pipeline tests pass under -race;
full suite (12 packages) green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Plumbs the Pipeline introduced in the prior commit into TraceServer.Export
and LogsServer.Export. Default behavior is async; INGEST_ASYNC_ENABLED=false
falls back to the legacy synchronous DB-write path bit-for-bit so operators
have a kill switch.

Wiring details:
  - TraceServer / LogsServer gain a *Pipeline field and SetPipeline() setter,
    matching the existing SetSampler / SetCallback pattern.
  - The parser tracks per-batch HasError / HasSlow flags during the existing
    per-resource goroutine fan-out; the merge step ORs them across goroutines
    so the pipeline's priority lane sees the right protection bit.
  - Intake metrics (GRPCBatchSize, IngestionRate) fire BEFORE the persist
    decision so dashboards reflect what was received. Net persisted is
    derivable as ingestion_total - ingest_pipeline_dropped_total.
  - On ErrQueueFull, Export returns gRPC RESOURCE_EXHAUSTED via google.golang.org/grpc/status,
    which OTLP clients treat as a signal to retry with backoff rather than in a tighter loop.
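
The HasError / HasSlow merge described in the wiring details might look like the sketch below: each per-resource goroutine computes its own flags, and the merge step ORs them. The `span` and `flags` types and the `slowThreshold` value are assumptions for illustration; the PR does not state how the parser defines a slow trace.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slowThreshold is an assumed cutoff; the PR does not state the real value.
const slowThreshold = 500 * time.Millisecond

// span is a hypothetical, reduced view of a parsed span.
type span struct {
	IsError  bool
	Duration time.Duration
}

type flags struct{ HasError, HasSlow bool }

// scan computes the flags for one resource's spans.
func scan(spans []span) flags {
	var f flags
	for _, s := range spans {
		f.HasError = f.HasError || s.IsError
		f.HasSlow = f.HasSlow || s.Duration >= slowThreshold
	}
	return f
}

// mergeFlags scans each resource in its own goroutine, then ORs the results.
func mergeFlags(resources [][]span) flags {
	partial := make([]flags, len(resources))
	var wg sync.WaitGroup
	for i, spans := range resources {
		wg.Add(1)
		go func(i int, spans []span) {
			defer wg.Done()
			partial[i] = scan(spans) // each goroutine owns one slot, so no lock is needed
		}(i, spans)
	}
	wg.Wait()

	var merged flags
	for _, p := range partial {
		merged.HasError = merged.HasError || p.HasError
		merged.HasSlow = merged.HasSlow || p.HasSlow
	}
	return merged
}

func main() {
	resources := [][]span{
		{{Duration: 20 * time.Millisecond}},
		{{IsError: true, Duration: 5 * time.Millisecond}},
	}
	fmt.Printf("%+v\n", mergeFlags(resources))
}
```

Because a single error or slow span anywhere in the batch sets the merged bit, the whole batch is treated as priority, which matches the Batch being the unit of work.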

Three new env vars under "Async ingest pipeline":
  - INGEST_ASYNC_ENABLED=true (default)
  - INGEST_PIPELINE_QUEUE_SIZE=50000 (default)
  - INGEST_PIPELINE_WORKERS=8 (default)
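
A minimal sketch of reading these three variables with their documented defaults follows. The `Config` struct and `loadConfig` are hypothetical, and falling back to the default on an unset, unparsable, or non-positive value is an assumption; the PR only says a default-config fallback exists.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// Config mirrors the three tunables; the struct and field names are hypothetical.
type Config struct {
	AsyncEnabled bool
	QueueSize    int
	Workers      int
}

// loadConfig reads the env vars through lookup and keeps the documented
// default when a value is unset or invalid (assumed fallback rule).
func loadConfig(lookup func(string) string) Config {
	cfg := Config{AsyncEnabled: true, QueueSize: 50000, Workers: 8}
	if v, err := strconv.ParseBool(lookup("INGEST_ASYNC_ENABLED")); err == nil {
		cfg.AsyncEnabled = v
	}
	if n, err := strconv.Atoi(lookup("INGEST_PIPELINE_QUEUE_SIZE")); err == nil && n > 0 {
		cfg.QueueSize = n
	}
	if n, err := strconv.Atoi(lookup("INGEST_PIPELINE_WORKERS")); err == nil && n > 0 {
		cfg.Workers = n
	}
	return cfg
}

func main() {
	fmt.Printf("%+v\n", loadConfig(os.Getenv))
}
```

Passing the lookup function in, rather than calling os.Getenv directly, keeps the loader testable without mutating the process environment.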

main.go: the pipeline is constructed when enabled and started with
context.Background (workers exit only via the Stop() drain). In the shutdown
LIFO, Stop() runs after gRPC GracefulStop but before DLQ.Stop, so in-flight
batches drain to the DB before the DLQ and DB shut down.
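
The drain-on-Stop and idempotent-Stop behavior can be sketched with a closed channel, a WaitGroup, and sync.Once. This is a simplified stand-in under stated assumptions: the `Pipeline` here carries ints instead of Batches, and `Start`, `Submit`, and `runDemo` are hypothetical names rather than the PR's API.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Pipeline is a simplified, hypothetical stand-in for the PR's type.
type Pipeline struct {
	queue    chan int
	wg       sync.WaitGroup
	stopOnce sync.Once
}

// Start launches workers that run until the queue is closed and empty.
func Start(size, workers int, handle func(int)) *Pipeline {
	p := &Pipeline{queue: make(chan int, size)}
	for i := 0; i < workers; i++ {
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for b := range p.queue { // exits only after close and drain
				handle(b)
			}
		}()
	}
	return p
}

// Submit enqueues without blocking. It must not be called after Stop, which
// is why the producer (the gRPC server in this PR) is stopped first.
func (p *Pipeline) Submit(b int) bool {
	select {
	case p.queue <- b:
		return true
	default:
		return false
	}
}

// Stop closes the queue and waits for workers to drain it; repeat calls are no-ops.
func (p *Pipeline) Stop() {
	p.stopOnce.Do(func() {
		close(p.queue)
		p.wg.Wait()
	})
}

// runDemo submits n batches, stops twice, and reports accepted vs processed.
func runDemo(n, workers int) (accepted, processed int64) {
	p := Start(n, workers, func(int) { atomic.AddInt64(&processed, 1) })
	for i := 0; i < n; i++ {
		if p.Submit(i) {
			accepted++
		}
	}
	p.Stop()
	p.Stop() // second call returns immediately
	return accepted, atomic.LoadInt64(&processed)
}

func main() {
	accepted, processed := runDemo(1000, 4)
	fmt.Printf("accepted=%d processed=%d\n", accepted, processed)
}
```

Sending on a closed channel panics in Go, so this design only works if nothing submits after Stop. That is consistent with the ordering in the commit message, where gRPC GracefulStop completes before the pipeline is stopped.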

E2E coverage adds 4 tests (pipeline_e2e_test.go) running against an in-memory
SQLite Repository:
  - traces persist through the async path
  - logs persist through the async path
  - hard-capacity overflow returns gRPC codes.ResourceExhausted
  - priority batches (error spans) bypass soft backpressure end-to-end

Docs: CLAUDE.md ingestion paths section now mentions the pipeline; the env-var
list covers all three new tunables. OPERATIONS.md adds three new alert rules
covering soft drops, hard rejections, and queue-depth headroom.

All 12 packages pass under -race; pipeline tests now total 19 (15 unit + 4 e2e)
and cover nil/empty batches, soft/hard backpressure, FK ordering, callback
sequencing, partial-failure isolation, drain-on-Stop, idempotent Stop,
concurrent submit, default-config fallback, callback panic recovery, and
gRPC status code mapping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps merged commit 17f70dc into main on Apr 27, 2026
17 checks passed
aksOps deleted the feat/ingest-async-pipeline branch on April 27, 2026 16:10