feat(mcp): HTTP streamable robustness for frequent queries #55

aksOps merged 1 commit into main from feat/mcp-robustness on Apr 27, 2026

@aksOps aksOps commented Apr 27, 2026

Summary

Four robustness measures so frequent agent polling doesn't cripple the MCP server:

  1. Concurrency limit — counting semaphore (default 32) gates tools/call. Past the cap → JSON-RPC -32000 server-overloaded.
  2. Per-call timeout — default 30s deadline. Past it → JSON-RPC -32001 call-timeout, slot is freed.
  3. TTL result cache — 5s default for the cheap in-memory GraphRAG tools (get_service_map, impact_analysis, root_cause_analysis, get_anomaly_timeline, get_service_health). Keyed by (tenant, tool, args) with tenant-isolation and arg-order normalization.
  4. SSE keep-alive: a `: keep-alive\n\n` comment every 25s so reverse proxies (nginx/Envoy/Istio) don't time out idle MCP streams.
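
Measures 1 and 2 can be sketched as a single wrapper around the tool handler. This is a minimal illustration, not the code from this PR: the `gate` type, `newGate`, `call`, and `demo` are hypothetical names, and the sketch assumes handlers honor context cancellation. Only the error codes -32000 and -32001 and the "0 disables" convention come from the description above.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// JSON-RPC error codes named in the summary above.
const (
	codeOverloaded = -32000 // server overloaded
	codeTimeout    = -32001 // call timeout
	codeInternal   = -32603 // standard JSON-RPC "internal error"
)

// gate is a hypothetical wrapper around a tools/call handler.
type gate struct {
	slots   chan struct{} // counting semaphore; nil means no cap
	timeout time.Duration // 0 means no deadline
}

func newGate(maxConcurrent int, timeout time.Duration) *gate {
	g := &gate{timeout: timeout}
	if maxConcurrent > 0 {
		g.slots = make(chan struct{}, maxConcurrent)
	}
	return g
}

// call runs fn under the cap and the deadline. It returns a JSON-RPC
// error code, or 0 on success.
func (g *gate) call(ctx context.Context, fn func(context.Context) error) int {
	if g.slots != nil {
		select {
		case g.slots <- struct{}{}: // acquired a slot
			defer func() { <-g.slots }() // freed on every return path
		default: // cap reached: reject at once instead of queueing
			return codeOverloaded
		}
	}
	if g.timeout > 0 {
		var cancel context.CancelFunc
		ctx, cancel = context.WithTimeout(ctx, g.timeout)
		defer cancel()
	}
	if err := fn(ctx); err != nil {
		if errors.Is(err, context.DeadlineExceeded) {
			return codeTimeout
		}
		return codeInternal
	}
	return 0
}

// demo exercises both rejection paths with a cap of 1 and a 10ms deadline.
func demo() (overload, timeout int) {
	g := newGate(1, 10*time.Millisecond)
	started, release, done := make(chan struct{}), make(chan struct{}), make(chan struct{})
	go func() {
		g.call(context.Background(), func(context.Context) error {
			close(started)
			<-release // hold the only slot until told to finish
			return nil
		})
		close(done)
	}()
	<-started
	overload = g.call(context.Background(), func(context.Context) error { return nil })
	close(release)
	<-done
	timeout = g.call(context.Background(), func(ctx context.Context) error {
		<-ctx.Done() // a handler that honors cancellation
		return ctx.Err()
	})
	return overload, timeout
}
```

The non-blocking `select` is the design choice that matters here: a rejected caller gets an immediate error it can back off on, rather than piling up behind the semaphore.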

Why

Phase 6 of the 7-day-retention robustness initiative, addressing the user's explicit ask: "MCP HTTP streamable needs to be robust for frequent queries". Without these measures, sustained agent-side polling at more than 32 concurrent calls degrades GraphRAG, slow tool handlers hold their slots indefinitely, and idle SSE connections die behind reverse proxies.

Configuration (env, all opt-out via 0)

| Setting | Default |
| --- | --- |
| MCP_MAX_CONCURRENT | 32 |
| MCP_CALL_TIMEOUT_MS | 30000 |
| MCP_CACHE_TTL_MS | 5000 |
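
A minimal sketch of how these settings might be read, assuming unset variables fall back to the defaults and an explicit 0 is passed through as "disabled". The `mcpConfig` type and `loadMCPConfig` function are hypothetical names for illustration; the PR's actual config code is not shown on this page.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// mcpConfig is a hypothetical holder for the three settings above.
type mcpConfig struct {
	MaxConcurrent int           // 0 disables the concurrency cap
	CallTimeout   time.Duration // 0 disables the per-call deadline
	CacheTTL      time.Duration // 0 disables the result cache
}

// loadMCPConfig reads the settings through getenv (os.Getenv in a real
// process; injected here so the function is easy to test).
func loadMCPConfig(getenv func(string) string) (mcpConfig, error) {
	readInt := func(name string, def int) (int, error) {
		raw := getenv(name)
		if raw == "" {
			return def, nil // unset: use the documented default
		}
		n, err := strconv.Atoi(raw)
		if err != nil {
			return 0, fmt.Errorf("%s: %w", name, err)
		}
		return n, nil
	}

	maxConcurrent, err := readInt("MCP_MAX_CONCURRENT", 32)
	if err != nil {
		return mcpConfig{}, err
	}
	timeoutMS, err := readInt("MCP_CALL_TIMEOUT_MS", 30000)
	if err != nil {
		return mcpConfig{}, err
	}
	ttlMS, err := readInt("MCP_CACHE_TTL_MS", 5000)
	if err != nil {
		return mcpConfig{}, err
	}
	return mcpConfig{
		MaxConcurrent: maxConcurrent,
		CallTimeout:   time.Duration(timeoutMS) * time.Millisecond,
		CacheTTL:      time.Duration(ttlMS) * time.Millisecond,
	}, nil
}
```

Distinguishing "unset" from "0" is what makes the opt-out work: an empty variable keeps the default, while `MCP_CACHE_TTL_MS=0` turns the cache off.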

Test plan

  • go test ./... -race -short -count=1 — full suite green
  • TestRobustness_ConcurrencyLimit_OverloadsBeyondCap — verifies -32000 and counter
  • TestRobustness_ConcurrencyLimit_NoCapWhenDisabled — verifies SetCallLimit(0) is no-op
  • TestRobustness_CallTimeout_AbortsLongRunningCall — verifies deadline ctx fires
  • TestRobustness_CacheHit_ServesFromCache — verifies pre-seeded result returns and CacheHits increments
  • TestRobustness_CacheKey_TenantIsolated — same (tool, args) across tenants → distinct keys
  • TestRobustness_CacheKey_StableAcrossArgOrder — JSON map order does not affect the key
  • TestRobustness_NonWhitelistedToolNotCached — cache whitelist is enforced
  • TestRobustness_CacheTTLDisabled: SetCacheTTL(0) truly disables the cache
  • TestRobustness_SSEHeartbeat_KeepsConnectionAlive — verifies the initial `event: endpoint` frame and the content-type
  • TestRobustness_StatsCounters_Increment — verifies counter movement on overload
  • golangci-lint run --new-from-rev=origin/main — clean
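
The two cache-key properties tested above (tenant isolation and arg-order stability) could be obtained along the following lines. This is a sketch under assumptions, not the PR's implementation: `cacheKey` is a hypothetical helper, and hashing with SHA-256 is one choice among several. It relies on the documented behavior of `encoding/json` that map keys are written in sorted order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// cacheKey is a hypothetical helper that derives a cache key from
// (tenant, tool, args), where rawArgs is the JSON arguments object
// exactly as the client sent it.
func cacheKey(tenant, tool string, rawArgs []byte) (string, error) {
	var args any
	if len(rawArgs) > 0 {
		if err := json.Unmarshal(rawArgs, &args); err != nil {
			return "", err
		}
	}
	// json.Marshal sorts map keys, so re-encoding the decoded value
	// gives the same bytes whatever key order the client used.
	canonical, err := json.Marshal(args)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, part := range [][]byte{[]byte(tenant), []byte(tool), canonical} {
		h.Write(part)
		h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") differ
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}
```

Because the tenant is part of the hashed input, the same (tool, args) pair issued by two tenants never maps to the same entry.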

🤖 Generated with Claude Code

Adds four robustness measures to the MCP server so frequent agent-side
polling doesn't cripple it under load:

1. Concurrency limit. Counting semaphore (default 32 in-flight) gates
   tools/call. Beyond the cap, callers receive JSON-RPC error -32000
   "server overloaded" so well-behaved clients back off.

2. Per-call timeout. Default 30s deadline applied to every tools/call.
   Past it the handler returns JSON-RPC error -32001 "call timeout"
   and frees its slot.

3. Result cache. A small TTL cache (default 5s) memoizes the cheap
   in-memory GraphRAG tools (get_service_map, impact_analysis,
   root_cause_analysis, get_anomaly_timeline, get_service_health),
   keyed by (tenant, tool, args). Cache keys are tenant-scoped so two
   tenants don't collide; arg-map order is normalized so the same
   query hits regardless of client serialization quirks.

4. SSE keep-alive. The GET /mcp stream now emits a `: keep-alive\n\n`
   comment every 25s so reverse proxies (nginx/Envoy/Istio) don't time
   out idle MCP connections. Without this, low-update-rate workloads
   reliably hit "connection reset" mid-session.
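
A keep-alive loop of the kind described in item 4 might look like the sketch below. The names `writeKeepAlives`, `frameRecorder`, and `demoKeepAlive` are hypothetical, and the sketch assumes a single goroutine owns the stream; a real handler that also writes events would need to serialize those writes with the heartbeat. Lines starting with `:` are comments in the SSE format, so clients ignore them.

```go
package main

import (
	"context"
	"io"
	"time"
)

// writeKeepAlives writes an SSE comment frame to w every interval until
// ctx is done or a write fails. In an HTTP handler, w would be the
// http.ResponseWriter and flush its http.Flusher's Flush method.
func writeKeepAlives(ctx context.Context, w io.Writer, flush func(), interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err() // client went away or server is shutting down
		case <-ticker.C:
			if _, err := io.WriteString(w, ": keep-alive\n\n"); err != nil {
				return err
			}
			if flush != nil {
				flush() // push the frame past any response buffering
			}
		}
	}
}

// frameRecorder captures each write so the demo can inspect the frames.
type frameRecorder struct{ frames []string }

func (r *frameRecorder) Write(p []byte) (int, error) {
	r.frames = append(r.frames, string(p))
	return len(p), nil
}

// demoKeepAlive runs the loop for 100ms with a 10ms interval.
func demoKeepAlive() []string {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	var rec frameRecorder
	_ = writeKeepAlives(ctx, &rec, nil, 10*time.Millisecond)
	return rec.frames
}
```

The 25s interval in the commit sits below the 60s idle timeout that many proxies use by default, though the right value depends on the proxy configuration in front of the server.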

New env vars (all opt-out via 0): MCP_MAX_CONCURRENT (32),
MCP_CALL_TIMEOUT_MS (30000), MCP_CACHE_TTL_MS (5000).

Tests: 11 unit tests covering overload rejection, no-cap path, timeout
abort, cache hit/miss, tenant isolation, arg-order stability, TTL
disable, SSE event-shape, and Stats counters. Docs updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@aksOps aksOps merged commit 050525e into main Apr 27, 2026
17 checks passed
@aksOps aksOps deleted the feat/mcp-robustness branch April 27, 2026 17:05
aksOps added a commit that referenced this pull request Apr 28, 2026
Five small follow-ups from the second-pass review of PRs #49-#55:

- tsdb: fire cardinality-overflow callback AFTER releasing the
  Aggregator mutex. The callback is currently a Prometheus
  increment (atomic) but holding mu across an external function
  call is a footgun for any future hook. Capture the tenant
  under lock; invoke after Unlock.
- storage: use errors.Is(err, sql.ErrNoRows) in pgLogsRelkind
  instead of strings.Contains(err.Error(), "no rows"). Robust
  against driver wrapping.
- storage: convert Repository.logsPartitioned from plain bool
  to atomic.Bool. Removes the memory-model fragility of "the
  writer ran first" — read by retention.go from a separate
  goroutine.
- config: reject negative MCP_MAX_CONCURRENT / MCP_CALL_TIMEOUT_MS
  / MCP_CACHE_TTL_MS at Validate(). 0 stays the documented
  "disable" sentinel; negatives are typos that should fail loud.
- mcp: upgrade SetCallLimit doc to flag it startup-only — runtime
  resize leaks a slot in the old channel.
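
The second item (errors.Is instead of string matching) can be illustrated with a small sketch. The helper names and the `notFoundError` wrapper are hypothetical and stand in for whatever a driver or repository layer might do; only `errors.Is` and `sql.ErrNoRows` are real library identifiers.

```go
package main

import (
	"database/sql"
	"errors"
	"strings"
)

// notFoundError is a hypothetical wrapper that replaces the message
// text but still exposes the original cause through Unwrap.
type notFoundError struct{ cause error }

func (e notFoundError) Error() string { return "record not found" }
func (e notFoundError) Unwrap() error { return e.cause }

// isNoRowsByString is the fragile check: it depends on message text.
func isNoRowsByString(err error) bool {
	return strings.Contains(err.Error(), "no rows")
}

// isNoRows is the robust check: it walks the Unwrap chain.
func isNoRows(err error) bool {
	return errors.Is(err, sql.ErrNoRows)
}

// examples returns one wrapped sql.ErrNoRows and one unrelated error
// whose message happens to contain "no rows".
func examples() (wrapped, unrelated error) {
	return notFoundError{cause: sql.ErrNoRows},
		errors.New("partition has no rows to vacuum")
}
```

With these inputs the string check misses the wrapped error and wrongly matches the unrelated one, while `errors.Is` gets both right.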

Skipped (with rationale, not silently dropped):
- M1 Submit TOCTOU on closed pipeline — cosmetic only, current
  ordering is documented.
- M2 ring/onIngest setter races — would require API change to
  fix properly; benign during normal startup-only usage.
- M4 FTS5 trigger throughput — needs a bulk-rebuild path, not
  a one-line tweak.
- M5 isQueueFull scope — hypothetical concern with no observed
  symptom; revisit only if metrics show drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>