Skip to content

Backend robustness for 100-200 services#24

Merged
aksOps merged 20 commits into
mainfrom
feat/backend-robustness-100-200-services
Apr 23, 2026
Merged

Backend robustness for 100-200 services#24
aksOps merged 20 commits into
mainfrom
feat/backend-robustness-100-200-services

Conversation

@aksOps

@aksOps aksOps commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Hardens OtelContext's ingest, storage, and observability paths for 100-200 service deployments
  • Adds per-dimension metrics, driver-aware p99, parallel retention, gRPC limits, DLQ eviction tracking, and investigation deduplication
  • Ships a programmatic 200-producer load simulator (make loadtest)

Tasks (12)

# Area Change
1 GraphRAG Propagate real span Status (was always UNSET)
2 GraphRAG Event drop metric + GRAPHRAG_WORKER_COUNT, GRAPHRAG_EVENT_QUEUE_SIZE
3 API Rate limiter skips OTLP /v1/*
4 Ingest gRPC MaxRecvMsgSize, MaxConcurrentStreams, keepalive caps
5 Storage Parallel retention purges with panic-guarded goroutines + rows_behind gauge
6 Storage SQLite refused in APP_ENV=production unless OTELCONTEXT_ALLOW_SQLITE_PROD=true
7 Telemetry DB pool stats (in_use, idle, wait_count, etc.) sampled every 5s
8 DLQ dlq_evicted_total + dlq_evicted_bytes_total counters
9 GraphRAG 5-minute cooldown on investigation inserts to prevent spam
10 Storage Driver-switched dashboard p99 — Postgres percentile_disc, MySQL OFFSET, SQLite capped
11 Test 200-service OTLP load simulator under //go:build loadtest
12 Docs OPERATIONS.md + CLAUDE.md updates for the new knobs and metrics

New metrics

All new metrics use the Prometheus-idiomatic otelcontext_* prefix (see comment in internal/telemetry/metrics.go explaining the split from legacy OtelContext_*):

  • otelcontext_graphrag_events_dropped_total{signal}
  • otelcontext_retention_rows_behind{kind}
  • otelcontext_db_pool_{max_open,in_use,idle,wait_count,wait_duration_seconds}
  • otelcontext_dlq_evicted_total, otelcontext_dlq_evicted_bytes_total
  • otelcontext_dashboard_p99_row_cap_hits_total

Test plan

  • go build ./... clean
  • go vet ./... clean
  • CGO_ENABLED=1 go test -race -timeout 180s ./... — 10 packages, 0 failures
  • go test -tags loadtest ./test/loadsim/... — 3 unit tests pass
  • make loadtest-build produces bin/loadsim
  • Run ./bin/loadsim for 60s against a fresh backend and confirm healthy markers from OPERATIONS.md (manual, pre-merge sanity)
  • Deploy to staging, watch retention_consecutive_failures, graphrag_events_dropped_total, db_pool_in_use for one hour

Follow-ups (tracked separately, not blocking)

  • Canonicalize investigation cooldown key (lower/trim before hashing)
  • Remove dead test helpers verifyP99Index (Task 10), package-level randomDuration (Task 11)
  • StartRuntimeMetrics goroutine stop channel (pre-existing)
  • Add .env.example (pre-existing gap)

🤖 Generated with Claude Code

aksOps and others added 20 commits April 23, 2026 15:10
The ingestion callback hardcoded Status: "OK" for every span, so error
chains, RCA, anomaly detection, and impact analysis were all blind to
real failures — the in-memory graph reported green regardless of wire
truth, while the DB held the correct trace status.

Root cause required more than the callback fix: storage.Span had no
Status column, so per-span status was only available transiently at
ingest time. Changes:

- Add Status column to storage.Span (OTLP status code, indexed).
- Ingest writes span.Status = statusStr so the persisted row matches
  the trace-level status that was already being captured.
- OnSpanIngested now forwards span.Status (falling back to
  STATUS_CODE_UNSET if unset) instead of the hardcoded "OK".
- refresh.rebuildFromDB SELECT now includes the status column so the
  periodic DB rebuild produces the same ErrorCount a live ingest would.

Two new tests cover both paths: the event-loop callback and the
DB-rebuild refresh loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace silent drops in OnSpanIngested/OnLogIngested/OnMetricIngested with
an atomic counter + Prometheus metric
otelcontext_graphrag_events_dropped_total{signal}.
Honor GRAPHRAG_WORKER_COUNT and GRAPHRAG_EVENT_QUEUE_SIZE envs so operators
can tune capacity without code changes. Start now uses the configured
worker count rather than the hardcoded defaultWorkerCount constant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The global 100 RPS/IP default throttled real OTLP HTTP collectors. Add
MiddlewareExcept so /v1/{traces,logs,metrics} bypass the per-IP bucket
while /api/* still enforces the limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a block comment at the exemption site in main.go explaining why OTLP
ingestion paths bypass the per-IP limiter, and why this is acceptable
despite the APIKeyGate running downstream of the limiter (header-only
auth, bounded CPU per unauthenticated request). Include a TODO for a
separate higher-ceiling OTLP-specific limiter if the trade-off becomes
a concern.

Also expand the top-of-function comment on TestRateLimiter_ExemptsOTLPPaths
to explain why the test exists — locking the exemption in so a future
refactor cannot silently re-enable throttling on /v1/* and regress
legitimate ingestion traffic.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTLP batches at production volume routinely exceed the 4 MiB gRPC default.
Set 16 MiB default recv, 1000-stream concurrency cap, and keepalive
(60s ping / 10s timeout / 10m idle / 2h max age). Tunable via
GRPC_MAX_RECV_MB and GRPC_MAX_CONCURRENT_STREAMS.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-review follow-up for Task 4 — a bad env value could push MaxRecvMsgSize
past RAM (10 GiB allocation) or wrap MaxConcurrentStreams cast. Bound to
1..256 MiB and 1..1M respectively via Validate().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run logs/traces/metric_buckets purges concurrently on Postgres/MySQL (still
serial on SQLite — single-writer lock). Make batch size and inter-batch
sleep configurable via RETENTION_BATCH_SIZE (default 50000) and
RETENTION_BATCH_SLEEP_MS (default 1). Expose
otelcontext_retention_rows_behind{table,driver} so operators see when
purge cannot keep pace with ingest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-review follow-up for Task 5. Without defer+recover, a panic inside
Purge*Batched would leave the main loop blocked on <-results; the outer
running.CompareAndSwap guard would then skip every subsequent hourly tick.
Extract a runGuarded helper that recovers and still forwards a failure
result so the scheduler stays live.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLite's single-writer lock caps ingestion at ~5 services. Require
OTELCONTEXT_ALLOW_SQLITE_PROD=true to opt in, else fail at startup.
Dev/test environments print a capability-ceiling warning but start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Export otelcontext_db_pool_{open_connections,in_use,idle,wait_count,wait_duration_seconds}
so operators can see whether DB_MAX_OPEN_CONNS is sized correctly. Sampled
every 5s from sql.DB.Stats(). WaitCount/WaitDuration are cumulative values
— compute rate() over them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
go mod tidy promotes github.com/glebarez/go-sqlite and
github.com/prometheus/client_model from indirect to direct after the
new test file in internal/telemetry references them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DLQ silently discarded oldest batches once MaxFiles/MaxDiskMB was exceeded.
Add otelcontext_dlq_evicted_{total,bytes_total} counters plus a
rate-limited warn log (one per enforceLimits call) so extended DB
outages produce an observable signal instead of silent data loss.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without a cooldown, a single stuck service produces one investigation
insert every anomaly tick (default 10s). Add an in-memory sliding-window
guard keyed by (trigger_service, root_service, root_operation) with a
janitor on the refresh tick to bound map size. Expose
InvestigationInsertCount() for tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Postgres uses percentile_disc, MySQL uses OFFSET, SQLite keeps
in-memory sort but caps at 200k rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four fixes from code review on 65f3742:

Critical 1 (Postgres row leak): Switched the Postgres branch of
p99DurationForQuery from .Row() (*sql.Row, no Close) to .Rows() so the
underlying *sql.Rows is deterministically closed via defer. Prevents
connection leak on sustained traffic.

Critical 2 (MySQL tenant filter — VERIFIED, no code change needed):
Added TestP99_MySQLBranch_PreservesTenantFilter which seeds 10 rows for
tenant "a" (durations 1k..10k) and 10 rows for tenant "b" (durations
100k..1M), then calls GetDashboardStats scoped to tenant "a" on the
MySQL branch. Asserts P99Latency == 10000. Test PASSES — GORM does
preserve the parent WHERE clause through Session(&gorm.Session{}) +
Model(&Trace{}). Reviewer's claim is disproven; no cross-tenant leak.

Important 1 (hot-path log spam): Replaced slog.Warn in the SQLite cap
branch with a Prometheus counter (otelcontext_dashboard_p99_row_cap_hits_total)
registered under Metrics.DashboardP99RowCapHitsTotal. Kept a low-volume
slog.Debug for dev observability. Counter is nil-guarded.

Important 2 (context cancellation): Changed helper signature to
p99DurationForQuery(ctx context.Context, session *gorm.DB). All sub-sessions
now use Session(&gorm.Session{Context: ctx}) so client disconnects and
request timeouts propagate to the driver. Call site in GetDashboardStats
updated. Exported GetDashboardStats signature unchanged.

Tests: 134 storage + 3 telemetry pass; go vet clean; go build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds test/loadsim — a single Go binary (build tag: loadtest) that spins
up 200 concurrent simulated services as goroutines and drives 50 spans/s
per producer over 60s against the OTLP gRPC endpoint. Features: linear
warmup stagger, ticker-based rate limiter (no new deps), 5% error rate,
parent/child trace relationships every 10th span, SIGINT/SIGTERM graceful
shutdown with exporter flush, and a progress+summary reporter.

Includes 3 unit tests (TestServiceName, TestSpanFactory, TestRateLimiter)
and a make loadtest / make loadtest-build target.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Per-producer *rand.Rand seeded from time+idx — removes the 200-goroutine
  mutex contention on the global math/rand RNG in the hot path.
- Labeled break in the warmup stagger loop so SIGINT during ramp-up
  actually halts further producer launches (bare break only exited the
  enclosing select).
- Drop dead google.golang.org/grpc/metadata import and the unused
  metadata.Pairs allocation — WithHeaders already sets x-tenant-id.
- Drop redundant coord.producers slice (duplicated the local producers
  slice; nothing read it).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Making sqliteP99RowCap a var lets the cap-trigger test temporarily
override it to 200 instead of seeding 200k+5k rows under -race, where
SQLite batch inserts serialize through the race detector and time out
the 180s test budget.

Production default unchanged at 200_000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add alert thresholds for the new metrics (graphrag events dropped,
retention rows behind, db pool stats, dlq evicted, dashboard p99 cap
hits). Document the new config knobs (GraphRAG workers, gRPC caps,
retention batch pacing). Add a Scale & Load Testing section covering
the 200-service simulator and healthy-run markers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix OPERATIONS.md env var name: DB_ALLOW_SQLITE_PROD → OTELCONTEXT_ALLOW_SQLITE_PROD
- Add new Tasks 1-11 env vars to CLAUDE.md Configuration section
- Document the OtelContext_* vs otelcontext_* metric prefix split in telemetry/metrics.go

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sonarqubecloud

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
3 Security Hotspots

See analysis details on SonarQube Cloud

Comment thread main.go

grpcOpts := []grpc.ServerOption{
grpc.MaxRecvMsgSize(recvBytes * 1024 * 1024),
grpc.MaxConcurrentStreams(uint32(streams)),
@aksOps aksOps merged commit 1d844a4 into main Apr 23, 2026
8 of 10 checks passed
@aksOps aksOps mentioned this pull request Apr 23, 2026
4 tasks
@aksOps aksOps deleted the feat/backend-robustness-100-200-services branch April 26, 2026 05:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants