Backend robustness for 100-200 services#24
Merged
Conversation
The ingestion callback hardcoded Status: "OK" for every span, so error chains, RCA, anomaly detection, and impact analysis were all blind to real failures — the in-memory graph reported green regardless of wire truth, while the DB held the correct trace status. Root cause required more than the callback fix: storage.Span had no Status column, so per-span status was only available transiently at ingest time. Changes: - Add Status column to storage.Span (OTLP status code, indexed). - Ingest writes span.Status = statusStr so the persisted row matches the trace-level status that was already being captured. - OnSpanIngested now forwards span.Status (falling back to STATUS_CODE_UNSET if unset) instead of the hardcoded "OK". - refresh.rebuildFromDB SELECT now includes the status column so the periodic DB rebuild produces the same ErrorCount a live ingest would. Two new tests cover both paths: the event-loop callback and the DB-rebuild refresh loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace silent drops in OnSpanIngested/OnLogIngested/OnMetricIngested with
an atomic counter + Prometheus metric
otelcontext_graphrag_events_dropped_total{signal}.
Honor GRAPHRAG_WORKER_COUNT and GRAPHRAG_EVENT_QUEUE_SIZE envs so operators
can tune capacity without code changes. Start now uses the configured
worker count rather than the hardcoded defaultWorkerCount constant.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The global 100 RPS/IP default throttled real OTLP HTTP collectors. Add
MiddlewareExcept so /v1/{traces,logs,metrics} bypass the per-IP bucket
while /api/* still enforces the limit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a block comment at the exemption site in main.go explaining why OTLP ingestion paths bypass the per-IP limiter, and why this is acceptable despite the APIKeyGate running downstream of the limiter (header-only auth, bounded CPU per unauthenticated request). Include a TODO for a separate higher-ceiling OTLP-specific limiter if the trade-off becomes a concern. Also expand the top-of-function comment on TestRateLimiter_ExemptsOTLPPaths to explain why the test exists — locking the exemption in so a future refactor cannot silently re-enable throttling on /v1/* and regress legitimate ingestion traffic. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OTLP batches at production volume routinely exceed the 4 MiB gRPC default. Set 16 MiB default recv, 1000-stream concurrency cap, and keepalive (60s ping / 10s timeout / 10m idle / 2h max age). Tunable via GRPC_MAX_RECV_MB and GRPC_MAX_CONCURRENT_STREAMS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-review follow-up for Task 4 — a bad env value could push MaxRecvMsgSize past RAM (10 GiB allocation) or wrap MaxConcurrentStreams cast. Bound to 1..256 MiB and 1..1M respectively via Validate(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run logs/traces/metric_buckets purges concurrently on Postgres/MySQL (still
serial on SQLite — single-writer lock). Make batch size and inter-batch
sleep configurable via RETENTION_BATCH_SIZE (default 50000) and
RETENTION_BATCH_SLEEP_MS (default 1). Expose
otelcontext_retention_rows_behind{table,driver} so operators see when
purge cannot keep pace with ingest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Code-review follow-up for Task 5. Without defer+recover, a panic inside Purge*Batched would leave the main loop blocked on <-results; the outer running.CompareAndSwap guard would then skip every subsequent hourly tick. Extract a runGuarded helper that recovers and still forwards a failure result so the scheduler stays live. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SQLite's single-writer lock caps ingestion at ~5 services. Require OTELCONTEXT_ALLOW_SQLITE_PROD=true to opt in, else fail at startup. Dev/test environments print a capability-ceiling warning but start. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Export otelcontext_db_pool_{open_connections,in_use,idle,wait_count,wait_duration_seconds}
so operators can see whether DB_MAX_OPEN_CONNS is sized correctly. Sampled
every 5s from sql.DB.Stats(). WaitCount/WaitDuration are cumulative values
— compute rate() over them.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
go mod tidy promotes github.com/glebarez/go-sqlite and github.com/prometheus/client_model from indirect to direct after the new test file in internal/telemetry references them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DLQ silently discarded oldest batches once MaxFiles/MaxDiskMB was exceeded.
Add otelcontext_dlq_evicted_{total,bytes_total} counters plus a
rate-limited warn log (one per enforceLimits call) so extended DB
outages produce an observable signal instead of silent data loss.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Without a cooldown, a single stuck service produces one investigation insert every anomaly tick (default 10s). Add an in-memory sliding-window guard keyed by (trigger_service, root_service, root_operation) with a janitor on the refresh tick to bound map size. Expose InvestigationInsertCount() for tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Postgres uses percentile_disc, MySQL uses OFFSET, SQLite keeps in-memory sort but caps at 200k rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four fixes from code review on 65f3742: Critical 1 (Postgres row leak): Switched the Postgres branch of p99DurationForQuery from .Row() (*sql.Row, no Close) to .Rows() so the underlying *sql.Rows is deterministically closed via defer. Prevents connection leak on sustained traffic. Critical 2 (MySQL tenant filter — VERIFIED, no code change needed): Added TestP99_MySQLBranch_PreservesTenantFilter which seeds 10 rows for tenant "a" (durations 1k..10k) and 10 rows for tenant "b" (durations 100k..1M), then calls GetDashboardStats scoped to tenant "a" on the MySQL branch. Asserts P99Latency == 10000. Test PASSES — GORM does preserve the parent WHERE clause through Session(&gorm.Session{}) + Model(&Trace{}). Reviewer's claim is disproven; no cross-tenant leak. Important 1 (hot-path log spam): Replaced slog.Warn in the SQLite cap branch with a Prometheus counter (otelcontext_dashboard_p99_row_cap_hits_total) registered under Metrics.DashboardP99RowCapHitsTotal. Kept a low-volume slog.Debug for dev observability. Counter is nil-guarded. Important 2 (context cancellation): Changed helper signature to p99DurationForQuery(ctx context.Context, session *gorm.DB). All sub-sessions now use Session(&gorm.Session{Context: ctx}) so client disconnects and request timeouts propagate to the driver. Call site in GetDashboardStats updated. Exported GetDashboardStats signature unchanged. Tests: 134 storage + 3 telemetry pass; go vet clean; go build clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds test/loadsim — a single Go binary (build tag: loadtest) that spins up 200 concurrent simulated services as goroutines and drives 50 spans/s per producer over 60s against the OTLP gRPC endpoint. Features: linear warmup stagger, ticker-based rate limiter (no new deps), 5% error rate, parent/child trace relationships every 10th span, SIGINT/SIGTERM graceful shutdown with exporter flush, and a progress+summary reporter. Includes 3 unit tests (TestServiceName, TestSpanFactory, TestRateLimiter) and a make loadtest / make loadtest-build target. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Per-producer *rand.Rand seeded from time+idx — removes the 200-goroutine mutex contention on the global math/rand RNG in the hot path. - Labeled break in the warmup stagger loop so SIGINT during ramp-up actually halts further producer launches (bare break only exited the enclosing select). - Drop dead google.golang.org/grpc/metadata import and the unused metadata.Pairs allocation — WithHeaders already sets x-tenant-id. - Drop redundant coord.producers slice (duplicated the local producers slice; nothing read it). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Making sqliteP99RowCap a var lets the cap-trigger test temporarily override it to 200 instead of seeding 200k+5k rows under -race, where SQLite batch inserts serialize through the race detector and time out the 180s test budget. Production default unchanged at 200_000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add alert thresholds for the new metrics (graphrag events dropped, retention rows behind, db pool stats, dlq evicted, dashboard p99 cap hits). Document the new config knobs (GraphRAG workers, gRPC caps, retention batch pacing). Add a Scale & Load Testing section covering the 200-service simulator and healthy-run markers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Fix OPERATIONS.md env var name: DB_ALLOW_SQLITE_PROD → OTELCONTEXT_ALLOW_SQLITE_PROD - Add new Tasks 1-11 env vars to CLAUDE.md Configuration section - Document the OtelContext_* vs otelcontext_* metric prefix split in telemetry/metrics.go Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
|
|
||
| grpcOpts := []grpc.ServerOption{ | ||
| grpc.MaxRecvMsgSize(recvBytes * 1024 * 1024), | ||
| grpc.MaxConcurrentStreams(uint32(streams)), |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.


Summary
make loadtest)Tasks (12)
GRAPHRAG_WORKER_COUNT,GRAPHRAG_EVENT_QUEUE_SIZE/v1/*MaxRecvMsgSize,MaxConcurrentStreams, keepalive capsrows_behindgaugeAPP_ENV=productionunlessOTELCONTEXT_ALLOW_SQLITE_PROD=truein_use,idle,wait_count, etc.) sampled every 5sdlq_evicted_total+dlq_evicted_bytes_totalcounterspercentile_disc, MySQL OFFSET, SQLite capped//go:build loadtestOPERATIONS.md+CLAUDE.mdupdates for the new knobs and metricsNew metrics
All new metrics use the Prometheus-idiomatic
otelcontext_*prefix (see comment ininternal/telemetry/metrics.goexplaining the split from legacyOtelContext_*):otelcontext_graphrag_events_dropped_total{signal}otelcontext_retention_rows_behind{kind}otelcontext_db_pool_{max_open,in_use,idle,wait_count,wait_duration_seconds}otelcontext_dlq_evicted_total,otelcontext_dlq_evicted_bytes_totalotelcontext_dashboard_p99_row_cap_hits_totalTest plan
go build ./...cleango vet ./...cleanCGO_ENABLED=1 go test -race -timeout 180s ./...— 10 packages, 0 failuresgo test -tags loadtest ./test/loadsim/...— 3 unit tests passmake loadtest-buildproducesbin/loadsim./bin/loadsimfor 60s against a fresh backend and confirm healthy markers from OPERATIONS.md (manual, pre-merge sanity)retention_consecutive_failures,graphrag_events_dropped_total,db_pool_in_usefor one hourFollow-ups (tracked separately, not blocking)
verifyP99Index(Task 10), package-levelrandomDuration(Task 11)StartRuntimeMetricsgoroutine stop channel (pre-existing).env.example(pre-existing gap)🤖 Generated with Claude Code