Backend robustness for 100-200 services by aksOps · Pull Request #24 · RandomCodeSpace/otelcontext

aksOps · 2026-04-23T18:58:46Z

Summary

Hardens OtelContext's ingest, storage, and observability paths for 100-200 service deployments
Adds per-dimension metrics, driver-aware p99, parallel retention, gRPC limits, DLQ eviction tracking, and investigation deduplication
Ships a programmatic 200-producer load simulator (make loadtest)

Tasks (12)

| # | Area | Change |
|---|------|--------|
| 1 | GraphRAG | Propagate real span Status (was always UNSET) |
| 2 | GraphRAG | Event drop metric + `GRAPHRAG_WORKER_COUNT`, `GRAPHRAG_EVENT_QUEUE_SIZE` |
| 3 | API | Rate limiter skips OTLP `/v1/*` |
| 4 | Ingest | gRPC `MaxRecvMsgSize`, `MaxConcurrentStreams`, keepalive caps |
| 5 | Storage | Parallel retention purges with panic-guarded goroutines + `rows_behind` gauge |
| 6 | Storage | SQLite refused in `APP_ENV=production` unless `OTELCONTEXT_ALLOW_SQLITE_PROD=true` |
| 7 | Telemetry | DB pool stats (`in_use`, `idle`, `wait_count`, etc.) sampled every 5s |
| 8 | DLQ | `dlq_evicted_total` + `dlq_evicted_bytes_total` counters |
| 9 | GraphRAG | 5-minute cooldown on investigation inserts to prevent spam |
| 10 | Storage | Driver-switched dashboard p99: Postgres `percentile_disc`, MySQL OFFSET, SQLite capped |
| 11 | Test | 200-service OTLP load simulator under `//go:build loadtest` |
| 12 | Docs | `OPERATIONS.md` + `CLAUDE.md` updates for the new knobs and metrics |

New metrics

All new metrics use the Prometheus-idiomatic otelcontext_* prefix (see comment in internal/telemetry/metrics.go explaining the split from legacy OtelContext_*):

otelcontext_graphrag_events_dropped_total{signal}
otelcontext_retention_rows_behind{kind}
otelcontext_db_pool_{max_open,in_use,idle,wait_count,wait_duration_seconds}
otelcontext_dlq_evicted_total, otelcontext_dlq_evicted_bytes_total
otelcontext_dashboard_p99_row_cap_hits_total

Test plan

go build ./... clean
go vet ./... clean
CGO_ENABLED=1 go test -race -timeout 180s ./... — 10 packages, 0 failures
go test -tags loadtest ./test/loadsim/... — 3 unit tests pass
make loadtest-build produces bin/loadsim
Run ./bin/loadsim for 60s against a fresh backend and confirm healthy markers from OPERATIONS.md (manual, pre-merge sanity)
Deploy to staging, watch retention_consecutive_failures, graphrag_events_dropped_total, db_pool_in_use for one hour

Follow-ups (tracked separately, not blocking)

Canonicalize investigation cooldown key (lower/trim before hashing)
Remove dead test helpers verifyP99Index (Task 10), package-level randomDuration (Task 11)
StartRuntimeMetrics goroutine stop channel (pre-existing)
Add .env.example (pre-existing gap)

🤖 Generated with Claude Code

The ingestion callback hardcoded Status: "OK" for every span, so error chains, RCA, anomaly detection, and impact analysis were all blind to real failures: the in-memory graph reported green regardless of wire truth, while the DB held the correct trace status. Fixing the root cause required more than the callback change, because storage.Span had no Status column and per-span status was only available transiently at ingest time.

Changes:
- Add Status column to storage.Span (OTLP status code, indexed).
- Ingest writes span.Status = statusStr so the persisted row matches the trace-level status that was already being captured.
- OnSpanIngested now forwards span.Status (falling back to STATUS_CODE_UNSET if unset) instead of the hardcoded "OK".
- refresh.rebuildFromDB SELECT now includes the status column so the periodic DB rebuild produces the same ErrorCount a live ingest would.

Two new tests cover both paths: the event-loop callback and the DB-rebuild refresh loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace silent drops in OnSpanIngested/OnLogIngested/OnMetricIngested with an atomic counter + Prometheus metric otelcontext_graphrag_events_dropped_total{signal}. Honor GRAPHRAG_WORKER_COUNT and GRAPHRAG_EVENT_QUEUE_SIZE envs so operators can tune capacity without code changes. Start now uses the configured worker count rather than the hardcoded defaultWorkerCount constant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
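The drop-counting pattern described above can be sketched as a non-blocking channel send paired with an atomic counter. This is a minimal illustration only: the `event` and `queue` types and the `offer` method are hypothetical names, not the project's actual API, and the Prometheus counter is left out to keep the sketch dependency-free.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// event is a hypothetical stand-in for a GraphRAG ingest event.
type event struct{ signal string }

// queue wraps a bounded channel and counts events that could not be enqueued.
type queue struct {
	ch      chan event
	dropped atomic.Int64
}

// newQueue builds a queue with the given capacity (the role the source
// assigns to GRAPHRAG_EVENT_QUEUE_SIZE).
func newQueue(size int) *queue {
	return &queue{ch: make(chan event, size)}
}

// offer attempts a non-blocking send. When the channel is full it records
// the drop instead of discarding the event silently.
func (q *queue) offer(e event) bool {
	select {
	case q.ch <- e:
		return true
	default:
		q.dropped.Add(1)
		return false
	}
}

func main() {
	q := newQueue(2)
	for i := 0; i < 5; i++ {
		q.offer(event{signal: "span"})
	}
	fmt.Println("queued:", len(q.ch), "dropped:", q.dropped.Load())
}
```

In a real deployment the atomic value would presumably back the `otelcontext_graphrag_events_dropped_total{signal}` counter, with one series per signal type.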

The global 100 RPS/IP default throttled real OTLP HTTP collectors. Add MiddlewareExcept so /v1/{traces,logs,metrics} bypass the per-IP bucket while /api/* still enforces the limit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
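A path-exemption wrapper of this kind might look like the sketch below. It assumes the limiter is an ordinary `func(http.Handler) http.Handler` middleware; `exceptPrefixes`, `denyAll`, `okHandler`, and `statusFor` are hypothetical names for illustration and do not reflect the real `MiddlewareExcept` signature.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// exceptPrefixes wraps a limiting middleware so that requests whose path
// starts with one of the prefixes skip the limiter entirely.
func exceptPrefixes(limit func(http.Handler) http.Handler, prefixes ...string) func(http.Handler) http.Handler {
	return func(next http.Handler) http.Handler {
		limited := limit(next)
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			for _, p := range prefixes {
				if strings.HasPrefix(r.URL.Path, p) {
					next.ServeHTTP(w, r) // exempt path: bypass the per-IP bucket
					return
				}
			}
			limited.ServeHTTP(w, r)
		})
	}
}

// denyAll stands in for an exhausted per-IP bucket: it rejects every request.
func denyAll(http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusTooManyRequests)
	})
}

// okHandler is the downstream handler used in the demonstration.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// statusFor sends a GET for path through h and returns the response code.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	h := exceptPrefixes(denyAll, "/v1/")(okHandler)
	fmt.Println(statusFor(h, "/v1/traces"), statusFor(h, "/api/stats"))
}
```

Even with the stand-in limiter rejecting everything, `/v1/traces` reaches the handler while `/api/stats` is still throttled.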

Add a block comment at the exemption site in main.go explaining why OTLP ingestion paths bypass the per-IP limiter, and why this is acceptable despite the APIKeyGate running downstream of the limiter (header-only auth, bounded CPU per unauthenticated request). Include a TODO for a separate higher-ceiling OTLP-specific limiter if the trade-off becomes a concern. Also expand the top-of-function comment on TestRateLimiter_ExemptsOTLPPaths to explain why the test exists — locking the exemption in so a future refactor cannot silently re-enable throttling on /v1/* and regress legitimate ingestion traffic. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OTLP batches at production volume routinely exceed the 4 MiB gRPC default. Set 16 MiB default recv, 1000-stream concurrency cap, and keepalive (60s ping / 10s timeout / 10m idle / 2h max age). Tunable via GRPC_MAX_RECV_MB and GRPC_MAX_CONCURRENT_STREAMS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code-review follow-up for Task 4 — a bad env value could push MaxRecvMsgSize past RAM (10 GiB allocation) or wrap MaxConcurrentStreams cast. Bound to 1..256 MiB and 1..1M respectively via Validate(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
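The bounds check can be sketched as validate-then-convert, so the multiplication and the `uint32` cast only ever see values already known to be in range. The `grpcLimits` struct and its methods are hypothetical; only the env names and the 1..256 MiB and 1..1M bounds come from the commit messages.

```go
package main

import "fmt"

const (
	maxRecvMiB = 256       // upper bound for GRPC_MAX_RECV_MB
	maxStreams = 1_000_000 // upper bound for GRPC_MAX_CONCURRENT_STREAMS
)

// grpcLimits holds the raw integers parsed from the environment.
// The type is illustrative, not the project's config struct.
type grpcLimits struct {
	RecvMiB int
	Streams int
}

// validate rejects values outside the documented bounds.
func (l grpcLimits) validate() error {
	if l.RecvMiB < 1 || l.RecvMiB > maxRecvMiB {
		return fmt.Errorf("GRPC_MAX_RECV_MB=%d out of range 1..%d", l.RecvMiB, maxRecvMiB)
	}
	if l.Streams < 1 || l.Streams > maxStreams {
		return fmt.Errorf("GRPC_MAX_CONCURRENT_STREAMS=%d out of range 1..%d", l.Streams, maxStreams)
	}
	return nil
}

// resolved converts validated limits into the byte count and uint32 that
// grpc.MaxRecvMsgSize and grpc.MaxConcurrentStreams expect.
func (l grpcLimits) resolved() (recvBytes int, streams uint32, err error) {
	if err := l.validate(); err != nil {
		return 0, 0, err
	}
	// Safe: RecvMiB <= 256 and Streams <= 1e6, so neither expression can overflow.
	return l.RecvMiB * 1024 * 1024, uint32(l.Streams), nil
}

func main() {
	b, s, err := grpcLimits{RecvMiB: 16, Streams: 1000}.resolved()
	fmt.Println(b, s, err)

	_, _, err = grpcLimits{RecvMiB: 10240, Streams: 1000}.resolved()
	fmt.Println(err)
}
```

Failing at startup on an out-of-range value is the design choice here: a rejected config is easier to diagnose than a server that silently accepts a 10 GiB receive limit.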

Run logs/traces/metric_buckets purges concurrently on Postgres/MySQL (still serial on SQLite — single-writer lock). Make batch size and inter-batch sleep configurable via RETENTION_BATCH_SIZE (default 50000) and RETENTION_BATCH_SLEEP_MS (default 1). Expose otelcontext_retention_rows_behind{table,driver} so operators see when purge cannot keep pace with ingest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Code-review follow-up for Task 5. Without defer+recover, a panic inside Purge*Batched would leave the main loop blocked on <-results; the outer running.CompareAndSwap guard would then skip every subsequent hourly tick. Extract a runGuarded helper that recovers and still forwards a failure result so the scheduler stays live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
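A guard of the kind described might be shaped as follows. This is a sketch under assumptions: `purgeResult`, `runAll`, and the exact `runGuarded` signature are illustrative, and the purge functions are stubs. The point is that exactly one result is sent per job, even when the job panics, so the collector never blocks on `<-results`.

```go
package main

import "fmt"

// purgeResult is a hypothetical result record for one purge job.
type purgeResult struct {
	kind string
	err  error
}

// runGuarded runs fn and always sends exactly one result. A panic inside
// fn is recovered and reported as an error instead of killing the sender.
func runGuarded(kind string, fn func() error, results chan<- purgeResult) {
	defer func() {
		if r := recover(); r != nil {
			results <- purgeResult{kind: kind, err: fmt.Errorf("panic in %s purge: %v", kind, r)}
		}
	}()
	err := fn()
	results <- purgeResult{kind: kind, err: err}
}

// runAll launches every job concurrently and waits for one result per job.
func runAll(jobs map[string]func() error) map[string]error {
	results := make(chan purgeResult, len(jobs))
	for kind, fn := range jobs {
		go runGuarded(kind, fn, results)
	}
	out := make(map[string]error, len(jobs))
	for range jobs {
		r := <-results
		out[r.kind] = r.err
	}
	return out
}

func main() {
	out := runAll(map[string]func() error{
		"logs":   func() error { return nil },
		"traces": func() error { panic("boom") },
	})
	fmt.Println(out["logs"] == nil, out["traces"] != nil)
}
```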

SQLite's single-writer lock caps ingestion at ~5 services. Require OTELCONTEXT_ALLOW_SQLITE_PROD=true to opt in, else fail at startup. Dev/test environments print a capability-ceiling warning but start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
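The startup guard amounts to a small predicate over the driver and two environment values. The sketch below assumes the driver is identified by the string "sqlite" and that `APP_ENV=production` is the production marker; `checkSQLiteAllowed` is a hypothetical helper name.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// checkSQLiteAllowed returns an error when SQLite is selected in production
// without the explicit opt-in. It takes plain values so it can be tested
// without touching the process environment.
func checkSQLiteAllowed(driver, appEnv, allowProd string) error {
	if driver != "sqlite" { // assumed driver identifier
		return nil
	}
	if appEnv == "production" && allowProd != "true" {
		return errors.New("sqlite refused in production; set OTELCONTEXT_ALLOW_SQLITE_PROD=true to opt in")
	}
	return nil
}

func main() {
	err := checkSQLiteAllowed("sqlite", os.Getenv("APP_ENV"), os.Getenv("OTELCONTEXT_ALLOW_SQLITE_PROD"))
	if err != nil {
		fmt.Println("startup aborted:", err)
		return
	}
	fmt.Println("sqlite permitted in this environment")
}
```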

Export otelcontext_db_pool_{open_connections,in_use,idle,wait_count,wait_duration_seconds} so operators can see whether DB_MAX_OPEN_CONNS is sized correctly. Sampled every 5s from sql.DB.Stats(). WaitCount/WaitDuration are cumulative values — compute rate() over them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
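For orientation, the fields that `sql.DB.Stats()` exposes map onto gauge values roughly as below. `poolGauges` and `demoStats` are hypothetical helpers, and the map keys simply mirror the suffixes named in this PR; the real registration code in internal/telemetry may differ.

```go
package main

import (
	"database/sql"
	"fmt"
	"time"
)

// poolGauges flattens a sql.DBStats snapshot into name/value pairs.
// WaitCount and WaitDuration are cumulative since the pool was opened,
// so dashboards should apply rate() to them rather than read them raw.
func poolGauges(s sql.DBStats) map[string]float64 {
	return map[string]float64{
		"max_open":              float64(s.MaxOpenConnections),
		"open_connections":      float64(s.OpenConnections),
		"in_use":                float64(s.InUse),
		"idle":                  float64(s.Idle),
		"wait_count":            float64(s.WaitCount),
		"wait_duration_seconds": s.WaitDuration.Seconds(),
	}
}

// demoStats returns a fixed snapshot so the sketch runs without a database.
func demoStats() sql.DBStats {
	return sql.DBStats{
		MaxOpenConnections: 50,
		OpenConnections:    5,
		InUse:              3,
		Idle:               2,
		WaitCount:          7,
		WaitDuration:       1500 * time.Millisecond,
	}
}

func main() {
	g := poolGauges(demoStats())
	fmt.Println(g["in_use"], g["idle"], g["wait_duration_seconds"])
}
```

In the running service the snapshot would come from `db.Stats()` inside a 5-second ticker loop rather than from a fixed value.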

go mod tidy promotes github.com/glebarez/go-sqlite and github.com/prometheus/client_model from indirect to direct after the new test file in internal/telemetry references them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DLQ silently discarded oldest batches once MaxFiles/MaxDiskMB was exceeded. Add otelcontext_dlq_evicted_{total,bytes_total} counters plus a rate-limited warn log (one per enforceLimits call) so extended DB outages produce an observable signal instead of silent data loss. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Without a cooldown, a single stuck service produces one investigation insert every anomaly tick (default 10s). Add an in-memory sliding-window guard keyed by (trigger_service, root_service, root_operation) with a janitor on the refresh tick to bound map size. Expose InvestigationInsertCount() for tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
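A cooldown guard along these lines can be sketched with a mutex-protected map from key to last-accepted time, plus a sweep that the janitor would call on each refresh tick. All names here (`cooldownKey`, `cooldown`, `allow`, `sweep`, `at`) are hypothetical, and this version implements a simple "time since last accepted insert" check, which may differ in detail from the sliding window in the actual change.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const window = 5 * time.Minute // cooldown length named in the task table

// cooldownKey mirrors the (trigger_service, root_service, root_operation) tuple.
type cooldownKey struct{ trigger, rootService, rootOp string }

type cooldown struct {
	mu     sync.Mutex
	window time.Duration
	last   map[cooldownKey]time.Time
}

func newCooldown(w time.Duration) *cooldown {
	return &cooldown{window: w, last: make(map[cooldownKey]time.Time)}
}

// allow reports whether an insert for k may proceed at time now,
// and records the insert time when it does.
func (c *cooldown) allow(k cooldownKey, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.last[k]; ok && now.Sub(t) < c.window {
		return false
	}
	c.last[k] = now
	return true
}

// sweep deletes entries whose cooldown has expired and returns how many
// were removed, which keeps the map bounded.
func (c *cooldown) sweep(now time.Time) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	removed := 0
	for k, t := range c.last {
		if now.Sub(t) >= c.window {
			delete(c.last, k)
			removed++
		}
	}
	return removed
}

// at builds a deterministic timestamp for the demonstration.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	c := newCooldown(window)
	k := cooldownKey{"checkout", "payments", "POST /charge"}
	fmt.Println(c.allow(k, at(0)), c.allow(k, at(10)), c.allow(k, at(300)))
	fmt.Println("swept:", c.sweep(at(600)))
}
```

With a 10s anomaly tick, the second call (10 seconds later) is suppressed and the third (5 minutes later) is accepted again.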

Postgres uses percentile_disc, MySQL uses OFFSET, SQLite keeps in-memory sort but caps at 200k rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
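To make the SQLite branch concrete, here is a sketch of an in-memory p99 with a row cap. It assumes a nearest-rank definition of the percentile and assumes the cap simply truncates the input; neither detail is confirmed by the commit message, and `p99Capped` is a hypothetical name.

```go
package main

import (
	"fmt"
	"sort"
)

// p99Capped returns the nearest-rank 99th percentile over at most rowCap
// values, and reports whether the cap truncated the input.
func p99Capped(durations []int64, rowCap int) (p99 int64, capped bool) {
	if len(durations) == 0 || rowCap < 1 {
		return 0, false
	}
	if len(durations) > rowCap {
		durations = durations[:rowCap]
		capped = true
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]int64(nil), durations...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Nearest rank: ceil(0.99 * n), computed in integers to avoid float error.
	rank := (99*len(sorted) + 99) / 100
	return sorted[rank-1], capped
}

func main() {
	var d []int64
	for v := int64(1000); v <= 10000; v += 1000 {
		d = append(d, v)
	}
	fmt.Println(p99Capped(d, 200000))
}
```

The `capped` flag is where a counter such as `otelcontext_dashboard_p99_row_cap_hits_total` would be incremented.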

Four fixes from code review on 65f3742:

- Critical 1 (Postgres row leak): Switched the Postgres branch of p99DurationForQuery from .Row() (*sql.Row, no Close) to .Rows() so the underlying *sql.Rows is deterministically closed via defer. Prevents connection leak on sustained traffic.
- Critical 2 (MySQL tenant filter; VERIFIED, no code change needed): Added TestP99_MySQLBranch_PreservesTenantFilter which seeds 10 rows for tenant "a" (durations 1k..10k) and 10 rows for tenant "b" (durations 100k..1M), then calls GetDashboardStats scoped to tenant "a" on the MySQL branch. Asserts P99Latency == 10000. Test PASSES: GORM does preserve the parent WHERE clause through Session(&gorm.Session{}) + Model(&Trace{}). Reviewer's claim is disproven; no cross-tenant leak.
- Important 1 (hot-path log spam): Replaced slog.Warn in the SQLite cap branch with a Prometheus counter (otelcontext_dashboard_p99_row_cap_hits_total) registered under Metrics.DashboardP99RowCapHitsTotal. Kept a low-volume slog.Debug for dev observability. Counter is nil-guarded.
- Important 2 (context cancellation): Changed helper signature to p99DurationForQuery(ctx context.Context, session *gorm.DB). All sub-sessions now use Session(&gorm.Session{Context: ctx}) so client disconnects and request timeouts propagate to the driver. Call site in GetDashboardStats updated. Exported GetDashboardStats signature unchanged.

Tests: 134 storage + 3 telemetry pass; go vet clean; go build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds test/loadsim — a single Go binary (build tag: loadtest) that spins up 200 concurrent simulated services as goroutines and drives 50 spans/s per producer over 60s against the OTLP gRPC endpoint. Features: linear warmup stagger, ticker-based rate limiter (no new deps), 5% error rate, parent/child trace relationships every 10th span, SIGINT/SIGTERM graceful shutdown with exporter flush, and a progress+summary reporter. Includes 3 unit tests (TestServiceName, TestSpanFactory, TestRateLimiter) and a make loadtest / make loadtest-build target. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Per-producer *rand.Rand seeded from time+idx, which removes the 200-goroutine mutex contention on the global math/rand RNG in the hot path.
- Labeled break in the warmup stagger loop so SIGINT during ramp-up actually halts further producer launches (bare break only exited the enclosing select).
- Drop dead google.golang.org/grpc/metadata import and the unused metadata.Pairs allocation; WithHeaders already sets x-tenant-id.
- Drop redundant coord.producers slice (duplicated the local producers slice; nothing read it).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Making sqliteP99RowCap a var lets the cap-trigger test temporarily override it to 200 instead of seeding 200k+5k rows under -race, where SQLite batch inserts serialize through the race detector and time out the 180s test budget. Production default unchanged at 200_000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add alert thresholds for the new metrics (graphrag events dropped, retention rows behind, db pool stats, dlq evicted, dashboard p99 cap hits). Document the new config knobs (GraphRAG workers, gRPC caps, retention batch pacing). Add a Scale & Load Testing section covering the 200-service simulator and healthy-run markers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Fix OPERATIONS.md env var name: DB_ALLOW_SQLITE_PROD → OTELCONTEXT_ALLOW_SQLITE_PROD
- Add new Tasks 1-11 env vars to CLAUDE.md Configuration section
- Document the OtelContext_* vs otelcontext_* metric prefix split in telemetry/metrics.go

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-04-23T18:59:15Z

Quality Gate failed

Failed conditions
3 Security Hotspots

See analysis details on SonarQube Cloud

aksOps and others added 20 commits April 23, 2026 15:10

perf(storage): driver-switched p99 computation

65f3742

Postgres uses percentile_disc, MySQL uses OFFSET, SQLite keeps in-memory sort but caps at 200k rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-advanced-security AI found potential problems Apr 23, 2026

Comment thread main.go

	grpcOpts := []grpc.ServerOption{
		grpc.MaxRecvMsgSize(recvBytes * 1024 * 1024),
		grpc.MaxConcurrentStreams(uint32(streams)),

aksOps merged commit 1d844a4 into main Apr 23, 2026
8 of 10 checks passed

aksOps mentioned this pull request Apr 23, 2026

Post-robustness follow-ups #25

Merged

4 tasks

aksOps deleted the feat/backend-robustness-100-200-services branch April 26, 2026 05:47
