Skip to content

feat(graphrag): tenant-partition in-memory stores + query ctx (RAN-37)#27

Merged
aksOps merged 2 commits into
mainfrom
ran37-graphrag-tenant-partitioned-stores
Apr 25, 2026
Merged

feat(graphrag): tenant-partition in-memory stores + query ctx (RAN-37)#27
aksOps merged 2 commits into
mainfrom
ran37-graphrag-tenant-partitioned-stores

Conversation

@aksOps

@aksOps aksOps commented Apr 24, 2026

Copy link
Copy Markdown
Contributor

Summary

Subtask A of the RAN-19 fix — in-memory and query surface only (persisted tables + merge-gate integration test remain with RAN-38 and RAN-39).

  • Every GraphRAG in-memory store is now partitioned per tenant. GraphRAG drops the four direct-store fields and holds tenants map[string]*tenantStores guarded by its own RWMutex; slices are lazily materialised via storesFor(ctx) / storesForTenant(tenant), with the default tenant seeded at New() so background loops have a baseline.
  • spanEvent / logEvent / metricEvent each carry Tenant string; OnSpanIngested / OnLogIngested / OnMetricIngested populate it from the existing storage.Span.TenantID / storage.Log.TenantID / tsdb.RawMetric.TenantID fields — callback signatures intentionally unchanged (TSDB and ingest wiring untouched).
  • Every public query method takes ctx context.Context as the first argument and routes through storesFor(ctx). Two narrow helpers — AllServiceEdges(ctx) and AnomaliesForService(ctx, service, since) — let external handlers read store data without reaching into the composite.
  • Background loops (refresh, snapshot, anomaly) iterate a stable snapshot of tenants each tick, wrap ctx with storage.WithTenantContext, and act per tenant. rebuildAllTenantsFromDB enumerates tenants from the spans table so tenants without live callbacks are still recovered on startup.
  • Drain log-template miner stays shared (templates describe shape, not content); per-tenant SignalStore maps keep cluster nodes isolated.
  • MCP tool handlers (tools.go) and the API handleGetSystemGraph thread ctx into queries; the 10s system-graph cache is tenant-keyed so cross-tenant cache hits are impossible.
  • PersistInvestigation accepts tenant as its first arg so RAN-38 can wire the column without a signature break. Table schema itself is untouched.

Acceptance

  • grep -i tenant internal/graphrag/ hits every file that holds per-tenant maps or accepts queries; schema.go (types only) and drain.go (shared log-shape miner) correctly do not.
  • go build ./..., go vet ./..., and go test ./... all green.
  • No reachable code path reads from a map[...]*Node without first going through storesFor(ctx) — verified by grep and the new TestGraphRAG_TenantIsolation_InMemoryStores regression.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./...
  • go test -race ./internal/graphrag/...
  • New TestGraphRAG_TenantIsolation_InMemoryStores — two tenants ingest overlapping data; each tenant's ServiceMap / DependencyChain returns only its own services and traces.
  • New TestGraphRAG_StoresFor_EmptyContextCollapsesToDefault — empty-context lookups resolve to DefaultTenantID.

Out of scope

  • Persisted tables (investigations, graph_snapshots, drain_templates) — Subtask B (RAN-38) owns the tenant_id column migration.
  • MCP handler threading beyond compile-for-correctness and end-to-end two-tenant integration test — Subtask C (RAN-39), which is the merge gate for RAN-19.
  • internal/graph/ legacy package — unchanged per CLAUDE.md.

Refs RAN-19 RAN-21

Introduces a tenant dimension to every GraphRAG in-memory store and the
query surface, so every ingest and query lands in the calling tenant's
slice of the graph.

Architecture
- New tenantStores composite bundles one tenant's ServiceStore,
  TraceStore, SignalStore, AnomalyStore. GraphRAG drops the four
  direct-store fields and instead holds tenants map[string]*tenantStores
  guarded by its own RWMutex; slices are lazy-created via
  storesFor(ctx) / storesForTenant(tenant). The default tenant is
  seeded at New() so background loops have a baseline to iterate.
- spanEvent / logEvent / metricEvent each carry Tenant string;
  OnSpanIngested / OnLogIngested / OnMetricIngested derive it from
  storage.Span / storage.Log / tsdb.RawMetric TenantID fields. Callback
  signatures are intentionally unchanged — tsdb.RawMetric already
  carries TenantID from ingest/otlp.go, so no second argument or struct
  change is needed for the metric path.
- Every public query method takes ctx as its first argument and routes
  through storesFor(ctx): ErrorChain, ImpactAnalysis, RootCauseAnalysis,
  DependencyChain, CorrelatedSignals, ShortestPath, AnomalyTimeline,
  ServiceMap, SimilarErrors. Two narrow helpers (AllServiceEdges,
  AnomaliesForService) let handlers read store data without reaching
  into the composite directly.
- Background loops (refresh, snapshot, anomaly) iterate a stable
  snapshot of tenants on each tick, wrap ctx with
  storage.WithTenantContext(ctx, tenant), and act on each slice under
  its own per-store lock. rebuildAllTenantsFromDB enumerates tenants
  from the spans table so historical data from tenants without live
  callbacks is still recovered.
- The Drain log-template miner stays shared across tenants — templates
  describe log shape, not content — while per-tenant SignalStore maps
  keep cluster nodes isolated. Persisted tables (investigations,
  graph_snapshots, drain_templates) intentionally unchanged; Subtask B
  (RAN-38) owns the tenant_id column migration. PersistInvestigation
  now accepts tenant as its first arg so the column can be wired in B
  without a signature break.

External callers updated
- internal/mcp/tools.go: all GraphRAG-backed tools take ctx and thread
  it through the query call; AnomalyStore direct access replaced with
  AnomaliesForService.
- internal/api/graph_handler.go: handleGetSystemGraph uses
  r.Context() throughout, the 10s response cache is tenant-keyed so
  cross-tenant cache hits are impossible, and ServiceStore direct
  access replaced with AllServiceEdges.

Tests
- builder_test + investigation_cooldown_test updated to new signatures.
- New tenant_isolation_test covers the core RAN-37 invariant: two
  tenants ingesting overlapping data see only their own services /
  traces, and empty-context lookups collapse to DefaultTenantID.
- go build ./... && go vet ./... && go test ./... all green;
  go test -race ./internal/graphrag/... clean.

Scope
- Out of scope: persisted-table tenant columns (Subtask B, RAN-38),
  MCP-level tenant-isolation integration test (Subtask C, RAN-39 —
  the merge gate for RAN-19). internal/graph/ legacy package untouched.

Refs RAN-19 RAN-21

Co-Authored-By: Paperclip <noreply@paperclip.ing>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 901a4aee9a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Table("spans").
Select("span_id, parent_span_id, service_name, operation_name, duration, trace_id, status, start_time").
Where("start_time > ?", since).
Where("start_time > ? AND tenant_id = ?", since, tenant).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include empty tenant_id rows in default rebuild query

The per-tenant rebuild now filters with tenant_id = ?, which causes the default-tenant pass to miss historical spans whose tenant_id is NULL/empty (common for pre-backfill data). That means restart recovery can produce an empty in-memory graph even though recent spans exist. Since rebuildAllTenantsFromDB already normalizes empty tenant IDs to DefaultTenantID, this query should also include empty/NULL tenant rows when rebuilding the default tenant.

Useful? React with 👍 / 👎.

// anomaly loop) can re-enter ImpactAnalysis on the correct tenant slice. The
// `investigations` table itself does not yet carry a tenant_id column —
// Subtask B (RAN-38) owns that migration.
func (g *GraphRAG) PersistInvestigation(tenant, triggerService string, chains []ErrorChainResult, anomalies []*AnomalyNode) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Scope investigation cooldown key by tenant

After this change, PersistInvestigation is invoked per tenant, but the cooldown dedup key still excludes tenant identity and only uses trigger/root fields. In a multi-tenant setup, two tenants with the same service/root cause within the cooldown window will suppress each other, dropping legitimate investigations for the later tenant. Include tenant in the dedup key (or namespace the key before calling allow) to prevent cross-tenant suppression.

Useful? React with 👍 / 👎.

@aksOps

aksOps commented Apr 24, 2026

Copy link
Copy Markdown
Contributor Author

Heads-up while reviewing this PR: the current branch tip doesn't compile in isolation.

`internal/graphrag/clustering.go:87` calls `g.vectorIdx.Search(tenant, query, k*2)`, but `internal/vectordb/index.Search` is still the single-arg `(query, k)` signature on this branch. The follow-on vectordb tenant-partitioning changes (which appear to be the intended companion to this PR) seem to have been left out of the commit.

Repro:

```
go build ./...

internal/graphrag/clustering.go:87:47: too many arguments in call to g.vectorIdx.Search

```

I needed a buildable base to land RAN-38 (#28) on top of, so my local working tree carries the missing vectordb changes; my PR explicitly targets this branch and assumes RAN-37 lands first with those changes folded in. Flagging in case the omission was unintentional — happy to share the WIP I've been carrying if it helps unblock a quick fix here.

The PR-27 build failed CI because internal/graphrag/clustering.go and
internal/mcp/tools.go both called the 3-arg vectordb.Index.Search(tenant,
query, k) signature, but that signature lives on RAN-20's vectordb
tenant-isolation work and is not yet on main. The 3-arg form leaked in
from a stacked branch during the original RAN-37 cut.

Restore the 2-arg Search(query, k) call in both sites and leave a
TODO(RAN-20) so the proper vector-side tenant scoping lands with the
RAN-20 follow-up. RAN-37's in-memory tenant invariants are unaffected:
SimilarErrors still narrows results by the per-tenant SignalStore's
EMITTED_BY edges, so cross-tenant hits cannot surface even while the
underlying vector index is shared.

go build / vet / test ./... and -race on graphrag + api all green.

Co-Authored-By: Paperclip <noreply@paperclip.ing>
@sonarqubecloud

Copy link
Copy Markdown

@aksOps aksOps merged commit 33880e6 into main Apr 25, 2026
11 checks passed
@aksOps aksOps deleted the ran37-graphrag-tenant-partitioned-stores branch April 25, 2026 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant