feat(mcp): HTTP streamable robustness for frequent queries by aksOps · Pull Request #55 · RandomCodeSpace/otelcontext

aksOps · 2026-04-27T17:03:55Z

Summary

Four robustness measures so frequent agent polling doesn't cripple the MCP server:

Concurrency limit — counting semaphore (default 32) gates tools/call. Past the cap → JSON-RPC -32000 server-overloaded.
Per-call timeout — default 30s deadline. Past it → JSON-RPC -32001 call-timeout, slot is freed.
TTL result cache — 5s default for the cheap in-memory GraphRAG tools (get_service_map, impact_analysis, root_cause_analysis, get_anomaly_timeline, get_service_health). Keyed by (tenant, tool, args) with tenant-isolation and arg-order normalization.
SSE keep-alive — : keep-alive\n\n comment every 25s so reverse proxies (nginx/Envoy/Istio) don't time out idle MCP streams.

Why

Phase 6 of the 7-day-retention robustness initiative, addressing the user's explicit ask: "MCP HTTP streamable needs to be robust for frequent queries". Without these measures, sustained agent-side polling at more than 32 concurrent calls degrades GraphRAG, slow tool handlers hold their slots indefinitely, and idle SSE connections die behind reverse proxies.

Configuration (env, all opt-out via 0)

Setting	Default
`MCP_MAX_CONCURRENT`	`32`
`MCP_CALL_TIMEOUT_MS`	`30000`
`MCP_CACHE_TTL_MS`	`5000`

Test plan

🤖 Generated with Claude Code

Adds four robustness measures to the MCP server so frequent agent-side polling doesn't cripple it under load:

1. Concurrency limit. Counting semaphore (default 32 in-flight) gates tools/call. Beyond the cap, callers receive JSON-RPC error -32000 "server overloaded" so well-behaved clients back off.
2. Per-call timeout. Default 30s deadline applied to every tools/call. Past it the handler returns JSON-RPC error -32001 "call timeout" and frees its slot.
3. Result cache. A small TTL cache (default 5s) memoizes the cheap in-memory GraphRAG tools (get_service_map, impact_analysis, root_cause_analysis, get_anomaly_timeline, get_service_health), keyed by (tenant, tool, args). Cache keys are tenant-scoped so two tenants don't collide; arg-map order is normalized so the same query hits regardless of client serialization quirks.
4. SSE keep-alive. The GET /mcp stream now emits a `: keep-alive\n\n` comment every 25s so reverse proxies (nginx/Envoy/Istio) don't time out idle MCP connections. Without this, low-update-rate workloads reliably hit "connection reset" mid-session.

New env vars (all opt-out via 0): MCP_MAX_CONCURRENT (32), MCP_CALL_TIMEOUT_MS (30000), MCP_CACHE_TTL_MS (5000).

Tests: 11 unit tests covering overload rejection, no-cap path, timeout abort, cache hit/miss, tenant isolation, arg-order stability, TTL disable, SSE event-shape, and Stats counters.

Docs updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-04-27T17:04:19Z

Quality Gate passed

Issues
1 New issue
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Five small follow-ups from the second-pass review of PRs #49–#55:

- tsdb: fire cardinality-overflow callback AFTER releasing the Aggregator mutex. The callback is currently a Prometheus increment (atomic) but holding mu across an external function call is a footgun for any future hook. Capture the tenant under lock; invoke after Unlock.
- storage: use errors.Is(err, sql.ErrNoRows) in pgLogsRelkind instead of strings.Contains(err.Error(), "no rows"). Robust against driver wrapping.
- storage: convert Repository.logsPartitioned from plain bool to atomic.Bool. Removes the memory-model fragility of "the writer ran first"; the field is read by retention.go from a separate goroutine.
- config: reject negative MCP_MAX_CONCURRENT / MCP_CALL_TIMEOUT_MS / MCP_CACHE_TTL_MS at Validate(). 0 stays the documented "disable" sentinel; negatives are typos that should fail loud.
- mcp: upgrade SetCallLimit doc to flag it startup-only, since a runtime resize leaks a slot in the old channel.

Skipped (with rationale, not silently dropped):

- M1 Submit TOCTOU on closed pipeline: cosmetic only, current ordering is documented.
- M2 ring/onIngest setter races: would require API change to fix properly; benign during normal startup-only usage.
- M4 FTS5 trigger throughput: needs a bulk-rebuild path, not a one-line tweak.
- M5 isQueueFull scope: hypothetical concern with no observed symptom; revisit only if metrics show drift.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

aksOps merged commit 050525e into main Apr 27, 2026
17 checks passed

aksOps deleted the feat/mcp-robustness branch April 27, 2026 17:05

This was referenced Apr 28, 2026

fix(post-review): H1-H4 + C1 from deep code review #56

Merged

fix(post-review): M/L cleanup from deep code review #57

Merged
