fix: SQLite survival hardening — anomaly-store memory blowup + security/CI quick-wins#97

Open
aksOps wants to merge 2 commits into main from fix/sqlite-survival-hardening

@aksOps aksOps commented Jun 4, 2026

Contributor

Summary

Hardens the SQLite-survival profile after a 15-minute, 120-service soak surfaced a serious in-memory blowup, plus a batch of security / CI-unblock / reliability fixes. Two atomic commits.

fix(graphrag): bound anomaly-store memory + GOMEMLIMIT safety net

A 15-min SQLite soak at 120 services drove RSS to ~1.8 GB and climbing. Heap profiling (gc=1, true live set) attributed 84% of the live heap to AnomalyStore PRECEDED_BY edges: the 10s detector minted a new anomaly node every tick per erroring service (anom_<svc>_err_<UnixNano>), and correlateWithRecent then created O(N²) edges among them — unbounded until the 24h TTL.

  • Stable per-(service,type) anomaly IDs so detection upserts one evolving node instead of one-per-tick. Bounds both the node map and the edge mesh: AnomalyStore 272 MB → 2.6 MB; peak RSS 1.8 GB → 292 MB, now flat over the full 15 min.
  • applyMemoryLimit() — sets a soft GOMEMLIMIT at startup (honors an explicit env value, else 75% of the detected cgroup v2/v1 → /proc/meminfo budget) so the GC paces against a ceiling instead of letting next_gc run away. Defense-in-depth; stdlib-only.
  • Regression + unit tests.

fix: security, CI-unblock, and reliability quick-wins

  • security(api): close a cross-tenant read — TenantMiddleware no longer overwrites an auth-pinned tenant (per-tenant key could be escaped via X-Tenant-ID).
  • fix(ingest): correct the token-bucket sampler math (old cost 1/rate > cap for rate<1.0 → ~100% of healthy spans dropped; SQLite default 0.05 persisted almost no baseline traces).
  • fix(api): clamp limit/offset on /api/logs & /api/traces (negative limit → GORM unlimited = DoS).
  • fix(ingest): sanitize X-Tenant-ID on the HTTP OTLP path (gRPC parity).
  • fix(mcp): don't cache error tool results; enforce the response byte cap in resourceResult.
  • fix(ui): correct the red ServiceSidePanel test (split DS markup), mount ErrorBoundary, derive the connected badge from ws.status.
  • chore: bump the go directive to 1.25.11 to unblock the OSV-Scanner CI gate.
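To make the sampler bug concrete, here is a hedged sketch rather than the project's implementation. It assumes a per-span refill and a bucket capacity of one token (inferred from "cost 1/rate > cap for rate<1.0"); the `Bucket` type, the micro-token integer units, and the "fixed" variant are illustrative assumptions, and the real fix may differ.

```go
package main

import (
	"fmt"
	"math"
)

// unit is one token expressed in micro-tokens, so the math stays in integers.
const unit = 1_000_000

// Bucket is a hypothetical per-span token bucket: every observed span adds
// refill, and a span is kept only if cost can be paid.
type Bucket struct {
	tokens, capacity, refill, cost int64
}

// Allow refills, caps at capacity, then tries to pay cost.
func (b *Bucket) Allow() bool {
	b.tokens += b.refill
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	if b.tokens < b.cost {
		return false
	}
	b.tokens -= b.cost
	return true
}

// brokenBucket models the described bug: cost = 1/rate tokens, capacity = 1
// token. For any rate < 1.0 the cost can never be reached.
func brokenBucket(rate float64) *Bucket {
	return &Bucket{capacity: unit, refill: unit, cost: int64(math.Round(unit / rate))}
}

// fixedBucket is one possible correction: each span earns `rate` tokens and
// keeping a span costs exactly one token, which fits within the capacity.
func fixedBucket(rate float64) *Bucket {
	return &Bucket{capacity: unit, refill: int64(math.Round(rate * unit)), cost: unit}
}

// keptOf counts how many of n spans the bucket keeps.
func keptOf(b *Bucket, n int) int {
	kept := 0
	for i := 0; i < n; i++ {
		if b.Allow() {
			kept++
		}
	}
	return kept
}

func main() {
	fmt.Println("broken kept:", keptOf(brokenBucket(0.05), 1000))
	fmt.Println("fixed kept: ", keptOf(fixedBucket(0.05), 1000))
}
```

Under these assumptions the broken variant keeps none of 1000 healthy spans at rate 0.05, while the corrected variant keeps 50, which is the intended 5%.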

Validation

  • 3× 15-min soaks (120 svc × ~550 spans/s) + heap-profile attribution.
  • Final run: peak heap 131 MB / RSS 292 MB (was 1597/1794 MB), flat; PRAGMA integrity_check = ok; 0 drops/429s; 0 ERROR/panic; clean shutdown; goroutines/fds recover to baseline; 30,285 spans / 120 services persisted.
  • go build ./..., go vet ./..., gofmt, go test ./... (pass), golangci-lint (clean on changed files), osv-scanner (green).

Storage note (7-day rolling retention)

At the tested profile: ~3.3 GB/day, which is ~25 GB steady-state on disk for 120 services. ~1.2 KB/persisted-span (incl. trace row + indexes + any error-log); ~50% data / 50% indexes. Dominated by the synthetic 4.2% error rate (errors are always kept); realistic <1% error rates land closer to ~10–20 GB. Beyond that band, Postgres is the recommended path (per the existing main.go warning).
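As a back-of-envelope cross-check, the storage figures can be roughly reproduced from the validation numbers. This sketch assumes the 30,285 persisted spans came from a single 15-minute run and treats 1.2 KB as 1.2 × 1024 bytes; `dailyBytes` is an illustrative helper, not project code.

```go
package main

import "fmt"

// dailyBytes extrapolates on-disk growth per day from a soak run:
// persisted spans per second, times seconds per day, times bytes per span.
func dailyBytes(persistedSpans, soakSeconds, bytesPerSpan float64) float64 {
	return persistedSpans / soakSeconds * 86400 * bytesPerSpan
}

func main() {
	const (
		persistedSpans = 30285.0    // from the final soak run
		soakSeconds    = 15 * 60.0  // assumed: one 15-minute run
		bytesPerSpan   = 1.2 * 1024 // assumed: "~1.2 KB" read as KiB
		retentionDays  = 7.0
		gib            = 1 << 30
	)
	perDay := dailyBytes(persistedSpans, soakSeconds, bytesPerSpan)
	fmt.Printf("per day:      %.1f GiB\n", perDay/gib)
	fmt.Printf("steady state: %.1f GiB\n", perDay*retentionDays/gib)
}
```

This lands at roughly 3.3 GiB/day and a little over 23 GiB for a 7-day window, in the same range as the quoted ~25 GB; the gap is within what rounding of the per-span size would explain.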

Known limitations / follow-ups

  • Load was generator-limited to ~550 spans/s (single loadsim); the peak-ingest drop/429 shedding path was not exercised — a multi-process / 200-service stress would characterize the ceiling.
  • Pre-existing main.go:625 G115 lint finding is untouched.

🤖 Generated with Claude Code

aksOps and others added 2 commits June 4, 2026 17:20
- security(api): close cross-tenant read caused by middleware ordering —
  TenantMiddleware now passes through when auth already pinned a tenant
  (HasTenantContext), so a per-tenant key can't be escaped via X-Tenant-ID
- fix(ingest): correct token-bucket sampler math; the old cost (1/rate)
  exceeded the cap for rate<1.0 so ~100% of healthy spans were dropped
  (SQLite default 0.05 persisted almost no baseline traces)
- fix(api): clamp limit/offset on /api/logs and /api/traces (negative limit
  was passed to GORM as unlimited — heap/DB DoS)
- fix(ingest): sanitize X-Tenant-ID on the HTTP OTLP path (gRPC parity)
- fix(mcp): don't cache error tool results; enforce the response byte cap
  in resourceResult (trace_graph DB fallback was uncapped)
- fix(ui): correct ServiceSidePanel test for split design-system markup,
  mount ErrorBoundary, derive connected badge from ws.status
- chore: bump go directive to 1.25.11 to unblock the OSV-Scanner CI gate
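The limit/offset clamp can be sketched as below. The default of 100 and the ceiling of 1000 are assumed values for illustration, and `clampPage` is a hypothetical helper name; the point is only that a negative or unparsable `limit` never reaches the query builder, where GORM treats `Limit(-1)` as "no limit".

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	defaultLimit = 100  // assumed default page size
	maxLimit     = 1000 // assumed hard ceiling
)

// clampPage parses raw query-string values and forces them into a safe range:
// limit in [1, maxLimit] (falling back to defaultLimit), offset >= 0.
func clampPage(rawLimit, rawOffset string) (int, int) {
	limit, err := strconv.Atoi(rawLimit)
	if err != nil || limit <= 0 {
		limit = defaultLimit
	}
	if limit > maxLimit {
		limit = maxLimit
	}
	offset, err := strconv.Atoi(rawOffset)
	if err != nil || offset < 0 {
		offset = 0
	}
	return limit, offset
}

func main() {
	for _, q := range [][2]string{{"-1", "0"}, {"999999", "-5"}, {"", ""}, {"50", "200"}} {
		limit, offset := clampPage(q[0], q[1])
		fmt.Printf("limit=%q offset=%q -> %d, %d\n", q[0], q[1], limit, offset)
	}
}
```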

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A 15-min SQLite soak at 120 services drove RSS to ~1.8 GB and climbing.
Heap profiling (gc=1) attributed 84% of the live heap to AnomalyStore
PRECEDED_BY edges: the 10s detector minted a NEW anomaly node every tick
per erroring service (UnixNano-suffixed ID), and correlateWithRecent then
created O(N^2) edges among them — unbounded until the 24h TTL.

- fix: stable per-(service,type) anomaly IDs so detection UPSERTS one
  evolving node instead of one-per-tick; this bounds both the node map and
  the edge mesh (AnomalyStore 272 MB -> 2.6 MB; peak RSS 1.8 GB -> 292 MB,
  now flat over the full 15 min). + regression test.
- feat: applyMemoryLimit() sets a soft GOMEMLIMIT at startup — honors an
  explicit env value, else 75% of the detected cgroup/host budget — so the
  GC paces against a ceiling instead of letting next_gc run away. Defense
  in depth; cgroup v2/v1 + /proc/meminfo detection, stdlib-only. + tests.
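A hedged, stdlib-only sketch of what such a startup hook could look like. Only the name `applyMemoryLimit` and the 75% / cgroup v2, v1, then /proc/meminfo order come from the description above; the helper names, file paths, and the unlimited-sentinel threshold are assumptions, and the PR's real code may differ.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// parseCgroupLimit parses memory.max (v2) or memory.limit_in_bytes (v1).
// "max" and very large sentinel values are treated as "no limit".
func parseCgroupLimit(raw string) (int64, bool) {
	s := strings.TrimSpace(raw)
	if s == "" || s == "max" {
		return 0, false
	}
	n, err := strconv.ParseInt(s, 10, 64)
	if err != nil || n <= 0 || n >= 1<<60 {
		return 0, false
	}
	return n, true
}

// parseMemTotal extracts MemTotal (reported in kB) from /proc/meminfo content.
func parseMemTotal(meminfo string) (int64, bool) {
	for _, line := range strings.Split(meminfo, "\n") {
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseInt(fields[1], 10, 64)
			if err != nil {
				return 0, false
			}
			return kb * 1024, true
		}
	}
	return 0, false
}

// detectBudget tries cgroup v2, then cgroup v1, then host memory.
func detectBudget() (int64, bool) {
	paths := []string{
		"/sys/fs/cgroup/memory.max",                   // cgroup v2 (assumed mount point)
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1 (assumed mount point)
	}
	for _, p := range paths {
		if b, err := os.ReadFile(p); err == nil {
			if n, ok := parseCgroupLimit(string(b)); ok {
				return n, true
			}
		}
	}
	if b, err := os.ReadFile("/proc/meminfo"); err == nil {
		return parseMemTotal(string(b))
	}
	return 0, false
}

// applyMemoryLimit returns the soft limit in effect after the call.
func applyMemoryLimit() int64 {
	// The Go runtime already applies GOMEMLIMIT from the environment, so an
	// explicit value is left alone; a negative argument only reads the limit.
	if os.Getenv("GOMEMLIMIT") != "" {
		return debug.SetMemoryLimit(-1)
	}
	budget, ok := detectBudget()
	if !ok {
		return debug.SetMemoryLimit(-1)
	}
	limit := budget / 4 * 3 // 75% of the detected budget
	debug.SetMemoryLimit(limit)
	return limit
}

func main() {
	fmt.Println("soft memory limit (bytes):", applyMemoryLimit())
}
```

The limit is soft: the GC works harder as the heap approaches it but the process is not killed for exceeding it, which is why this is a safety net alongside the anomaly-store fix and not a substitute for it.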

Validation: 3x 15-min soaks + heap profile; integrity ok, 0 drops/429s,
0 ERROR/panic, clean shutdown, goroutines/fds recover, 30k spans/120 svcs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
sonarqubecloud Bot commented Jun 4, 2026