Skip to content

feat(mcp,vectordb): tenant ctx + vector tenant isolation (RAN-39 + RAN-20)#31

Merged
aksOps merged 3 commits into
mainfrom
ran39-mcp-tenant-isolation
Apr 25, 2026
Merged

feat(mcp,vectordb): tenant ctx + vector tenant isolation (RAN-39 + RAN-20)#31
aksOps merged 3 commits into
mainfrom
ran39-mcp-tenant-isolation

Conversation

@aksOps

@aksOps aksOps commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes RAN-39 and RAN-20 together because the two are interdependent: vectordb tenant isolation (RAN-20) relies on the MCP request-tenant ctx wiring (RAN-39) for find_similar_logs to receive the tenant, and the RAN-39 commit had to extend vectordb.Index.Search with a tenant arg to keep the build green at the call sites.

RAN-39 — MCP tenant ctx + merge-gate isolation test

  • get_system_graph and get_service_health accept ctx and route through GraphRAG.ServiceMap / AllServiceEdges so they pick up RAN-37's per-tenant in-memory partitioning. Legacy svcGraph remains as a fallback only when GraphRAG isn't wired (boot windows / future tests).
  • internal/mcp/tenant_isolation_test.go — stands up an in-process MCP server (httptest) wired to GraphRAG over in-memory SQLite, seeds three tenants (acme, beta, default) with overlapping service_name / trace_id / span_id / Drain template / log body / snapshot, and exercises every GraphRAG-backed tool three times each (X-Tenant-ID acme, beta, no header). Each response is scanned for the caller's own marker AND for any other seeded tenant's marker — proves no cross-tenant leak.
  • graphrag.RegisterAnomaly(tenant, AnomalyNode) — symmetric public API used by the test to seed per-tenant anomalies without depending on the throttled detector loop.

RAN-20 — vectordb tenant isolation

  • vectordb.Index.Add and Search now require a tenant argument; LogVector and SearchResult carry the tenant tag. Search filters by doc.Tenant so confidentiality is enforced at query time.
  • Tenant-aware FIFO eviction. When at cap, Add drops up to maxSize/10 of the inserting tenant's oldest rows so a noisy tenant cannot evict another tenant's warm rows (availability isolation; confidentiality is still enforced separately by doc.Tenant filtering in Search). Brand-new tenants on a full index drop one globally-oldest row as a one-time fairness fallback. This satisfies TechLead's "FIFO eviction of tenant A cannot evict tenant B's entries" assertion.
  • Cross-tenant startup hydration. repo.ListRecentHighSeverityLogsAllTenants — admin-scoped, explicitly cross-tenant, used only by the vector index's hydration goroutine. Closes the silent failure where GetLogsV2-based hydration only repopulated the default tenant's index on restart.
  • internal/api/similar_handler.go resolves tenant via storage.TenantFromContext(r.Context()) and passes it to Search.
  • internal/mcp/tools.go::toolFindSimilarLogs resolves via mcpCtx(ctx) + storage.TenantFromContext.
  • internal/graphrag/clustering.go::SimilarErrors resolves tenant from ctx and passes it to Search.
  • main.go live logHandler and hydration both pass l.TenantID into vectorIdx.Add.

Tests

  • internal/vectordb/index_test.go — 5 tests:
    • TestTenantIsolation_Search — confidentiality bar
    • TestUnknownTenantReturnsEmpty
    • TestEmptyTenantCoercedToDefault
    • TestFIFOEvictionFairness ← TechLead's third assertion
    • TestConcurrentTenantAddSearch — race-detector clean
  • internal/api/similar_handler_test.go — two-tenant integration through real TenantMiddleware, asserts zero cross-tenant LogID leakage on /api/logs/similar.
  • internal/mcp/tools_ran20_test.go — two-tenant integration on find_similar_logs, plus no-tenant-fallback-to-default assertion.
  • internal/mcp/tenant_isolation_test.go — full MCP-surface merge-gate test (three-tenant fan-out across every GraphRAG-backed tool).

Test plan

  • go vet ./... clean
  • go build ./... clean
  • go test ./... -count=1 clean (full suite)
  • go test -race -count=1 ./internal/{vectordb,api,mcp,graphrag,storage}/... clean

Why one PR (not two)

  • The vectordb signature change (Search(tenant, query, k)) lands in the RAN-39 commit out of necessity: RAN-37 already added tenant args at the call sites in graphrag/clustering.go and mcp/tools.go, leaving vectordb as the missing piece. Splitting RAN-39 from the vectordb signature would leave a non-buildable intermediate state.
  • All RAN-20 follow-up work (per-doc tenant tag, FIFO fairness, hydration helper, full test suite) sits on top of that foundation and is logically distinct but mechanically inseparable from RAN-39 in this PR's diff.
  • PR description splits the changes by issue so reviewers can audit each ticket independently.

🤖 Generated with Claude Code

aksOps and others added 3 commits April 25, 2026 16:58
…n test (RAN-39)

Threads the tenant resolved by the MCP transport (X-Tenant-ID header → ctx)
into every GraphRAG-backed tool handler and adds the merge-gate integration
test that asserts cross-tenant isolation for the full GraphRAG-backed MCP
surface.

mcp/tools.go
- get_system_graph and get_service_health now accept ctx and route through
  GraphRAG.ServiceMap / AllServiceEdges so they pick up RAN-37's per-tenant
  in-memory partitioning. Legacy svcGraph remains as a fallback path only
  when GraphRAG isn't wired (boot windows, future test harnesses).
- All other GraphRAG handlers were already ctx-threaded after RAN-37/38;
  no behavior change for those.

vectordb/index.go (+ main.go, api/similar_handler.go)
- vectordb.Index.Add and Search now take a tenant string; LogVector and
  SearchResult carry the tenant tag and Search filters by it. RAN-37
  already added tenant args at the call sites in graphrag/clustering.go
  and mcp/tools.go but the matching vectordb signature change had not
  landed, leaving the branch unbuildable. This closes that gap with the
  smallest surgical change; the broader vectordb rework remains RAN-20.
- main.go now passes l.TenantID into vectorIdx.Add on hydration and the
  live ingest hook; api/similar_handler resolves tenant from the request
  context before searching.

graphrag/builder.go
- New RegisterAnomaly(tenant, AnomalyNode) — small public API symmetric
  with PersistInvestigation, used by the new isolation test to seed
  per-tenant anomalies without depending on the throttled detector loop.

mcp/tenant_isolation_test.go
- Stands up an in-process MCP server (httptest) wired to GraphRAG over
  in-memory SQLite, seeds three tenants (acme, beta, default) with
  overlapping service_name / trace_id / span_id / Drain template / log
  body / snapshot, and exercises every GraphRAG-backed tool — get_service_map,
  get_service_health, get_error_chains, trace_graph, impact_analysis,
  root_cause_analysis, correlated_signals, get_anomaly_timeline,
  get_investigations, get_investigation (own + cross-tenant id-guess),
  get_graph_snapshot, find_similar_logs, get_system_graph — three times
  each (X-Tenant-ID acme, X-Tenant-ID beta, no header → DefaultTenantID).
  Each response is scanned for the caller's own tenant marker and for any
  other seeded tenant's marker (service name, log body, op name, anomaly
  evidence, snapshot id) to prove no cross-tenant leak.

Verified: go vet ./... clean; go test ./... clean; go test -race
./internal/{mcp,graphrag}/... clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…isolation

Reviewer (cf5145d8) requested three changes on commit c839460. Addresses
each with verified failure-on-regression checks.

#1 — Cross-tenant boot hydration of vector index
Previously main.go hydrated the index via repo.GetLogsV2(appCtx, ...) which
is tenant-scoped to whatever tenant ctx carries — appCtx has none, so only
the default tenant's rows reloaded after restart and find_similar_logs was
cold for every other tenant until fresh ERROR logs landed. The new
tenant-aware vectorIdx.Add(..., l.TenantID, ...) didn't fix it because the
non-default rows were never fetched.

- internal/storage/log_repo.go: new ListRecentHighSeverityLogsAllTenants
  — explicitly cross-tenant administrative read used only by hydration.
  Each row carries its own TenantID; fan-out happens in the caller.
- main.go: hydration now uses the new method so every tenant's warm
  index survives a restart.
- internal/vectordb/index.go: tenant-aware FIFO eviction. At cap, drop
  up to maxSize/10 of the inserting tenant's oldest rows so a noisy
  tenant cannot evict another tenant's warm rows (availability isolation;
  confidentiality is still enforced by doc.Tenant filtering in Search).
  Brand-new tenants drop one globally-oldest row to claim a first slot.

#2 — root_cause_analysis assertion was vacuous
internal/mcp/tenant_isolation_test.go:434 passed an empty ownMarker to
assertNoLeak, so the merge gate would still pass if the tool regressed
to returning [] for every tenant. Now passes ownService (RankedCause
carries Service so it must appear in a non-empty response). Verified by
sabotaging the handler to return a nil slice — the assertion fails with
'expected own marker "acme-orders" in response, body=null' as designed,
then reverted.

#3 — Drain cluster-id test didn't actually compare cluster IDs
The previous test reused seedTenant (per-tenant service names) and only
scanned response text, so a regression that surfaced the same cluster row
across tenants would still pass. Rewritten to:

- Use one shared service name across both tenants so Drain produces
  colliding (service, templateID) keys — the SignalStore partition is the
  only thing keeping rows separate.
- Inspect the actual []graphrag.LogClusterNode returned by
  CorrelatedSignals (not just response text), checking Template +
  SampleLog content for own-marker presence and foreign-marker absence.
- Log the per-tenant cluster IDs so future refactors that change the ID
  scheme leave a visible audit trail.
- End-to-end probe via the MCP HTTP surface remains, asserting the same
  isolation reaches clients.

RAN-20 supporting tests
internal/vectordb/index_test.go, internal/api/similar_handler_test.go,
internal/mcp/tools_ran20_test.go cover vectordb tenant scoping at three
layers (in-memory, REST handler, MCP tool). They were sitting untracked
on the branch from the parallel RAN-20 work; bundling them so the
follow-up vectordb behavior added here is covered.

Verified:
- go vet ./... clean
- go test ./... clean (full repo)
- go test -race ./internal/{mcp,graphrag,vectordb,api}/... clean

Co-Authored-By: Paperclip <noreply@paperclip.ing>
…es (RAN-20)

The vectordb.Index.Search signature went tenant-aware in the prior commit
but two call sites still used the legacy 2-arg form, leaving the branch
unbuildable:

  - internal/mcp/tools.go: toolFindSimilarLogs had a TODO(RAN-20) marker
    and `_ = ctx`. Now resolves tenant via storage.TenantFromContext on
    mcpCtx(ctx) and passes it to Search.
  - internal/graphrag/clustering.go: SimilarErrors had the same TODO and
    `_ = ctx`. Now resolves tenant from the same ctx and passes it
    through.

Both surfaces share the storage tenant-context idiom used everywhere
else, so coercion rules (empty → DefaultTenantID) stay consistent.

Verified: go build ./... clean, go vet ./... clean, go test -race
./internal/{vectordb,api,mcp,graphrag,storage}/... clean.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@aksOps aksOps enabled auto-merge (squash) April 25, 2026 17:04
@sonarqubecloud

Copy link
Copy Markdown

@aksOps aksOps merged commit fbdb2b2 into main Apr 25, 2026
11 checks passed
@aksOps aksOps deleted the ran39-mcp-tenant-isolation branch April 25, 2026 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant