fix(api): source services dropdown from GraphRAG, not traces table #69

Merged: aksOps merged 2 commits into main from fix/services-dropdown-from-graphrag on Apr 28, 2026
@aksOps aksOps commented Apr 28, 2026

Summary

UI showed 6 services on the metadata dropdown but 7 on the system map. Reproduced live with the chaos simulator hitting all 7 test services — `shipping-service` was the consistent loser; `user-service` flapped (sometimes present, sometimes not).

Root cause

`/api/metadata/services` ran `SELECT DISTINCT service_name FROM traces`. `TraceServer.Export` (`internal/ingest/otlp.go:378-385`) appends a `storage.Trace{ServiceName: serviceName, ...}` row for every span — root or child — not just root spans. The table has `uniqueIndex idx_traces_tenant_trace_id` on `(tenant_id, trace_id)`, so only one row per trace_id survives. Whichever span won the insert race set the `service_name`. Shipping-service is the deepest hop in the fan-out, so by the time its OTLP batch arrived, the trace_id was already claimed.
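
The race described above can be reproduced in miniature. The following is a minimal sketch, not the project's code: `tracesTable`, `insert`, and `distinctServices` are hypothetical names, the table is an in-memory map standing in for the real database, and a losing insert is modelled as simply dropped (how the real insert reports the conflict is not stated here).

```go
package main

import (
	"fmt"
	"sort"
)

// traceKey mirrors the unique index on (tenant_id, trace_id).
type traceKey struct {
	TenantID string
	TraceID  string
}

// tracesTable is a hypothetical in-memory stand-in for the traces table.
type tracesTable struct {
	rows map[traceKey]string // key -> service_name of the row that won
}

func newTracesTable() *tracesTable {
	return &tracesTable{rows: make(map[traceKey]string)}
}

// insert keeps the first row for a key and drops later ones, modelling
// an insert that loses to the unique index. It reports whether it won.
func (t *tracesTable) insert(tenantID, traceID, serviceName string) bool {
	k := traceKey{TenantID: tenantID, TraceID: traceID}
	if _, taken := t.rows[k]; taken {
		return false
	}
	t.rows[k] = serviceName
	return true
}

// distinctServices models SELECT DISTINCT service_name FROM traces.
func (t *tracesTable) distinctServices() []string {
	seen := make(map[string]bool)
	for _, svc := range t.rows {
		seen[svc] = true
	}
	out := make([]string, 0, len(seen))
	for svc := range seen {
		out = append(out, svc)
	}
	sort.Strings(out)
	return out
}

func main() {
	tbl := newTracesTable()
	// One trace fanning out order -> payment -> shipping. Each service
	// exports its own span, and the batches arrive in call order.
	for _, svc := range []string{"order-service", "payment-service", "shipping-service"} {
		tbl.insert("tenant-a", "trace-1", svc)
	}
	fmt.Println(tbl.distinctServices()) // prints [order-service]
}
```

Three services emitted spans, but only the one whose row claimed the `trace_id` first is visible to the `DISTINCT` query.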

Fix

`handleGetServices` now reads from `graphrag.ServiceNames(ctx)` — the in-memory `ServiceStore` populated by `OnSpanIngested → UpsertService`. Every span (root or child) registers, so deep callees can't be dropped. This is the same source `/api/system/graph` already uses, so the dropdown matches the system map by construction.

  • No DB query. GraphRAG's existing 60s refresh loop is responsible for cold-start population — no need to duplicate the work in this handler.
  • Cold-start (first ~60s after restart) returns `[]`. The encoder emits `[]` not `null` so the UI stays on a valid array on first paint.
  • `Repository.GetServices` is left in place (no caller changes) — its query is fine for any consumer that genuinely wants the legacy traces-table semantic.

Files

  • `internal/graphrag/queries.go` — adds `func (g *GraphRAG) ServiceNames(ctx context.Context) []string` (sorted, tenant-scoped, mirrors the existing `AllServiceEdges` pattern).
  • `internal/api/metrics_handlers.go` — `handleGetServices` reads from GraphRAG; nil-safe when `s.graphRAG == nil`.
  • `internal/graphrag/service_names_test.go` — two tests:
    • 3-deep fan-out where the grandchild service must still appear (exact bug condition)
    • empty store returns `[]` not `nil`
    • tenant scoping covered in the same test (separate tenants don't leak)

Verification

Live, with the chaos simulator running:
```
$ curl -s http://localhost:37778/api/metadata/services
["auth-service","inventory-service","notification-service","order-service",
"payment-service","shipping-service","user-service"]
```
Now matches `/api/system/graph` exactly — all 7 services visible.

Test plan

  • `go test ./internal/graphrag/ ./internal/api/ -race` — 92 passed
  • `go build ./...` — clean
  • Live: dropdown shows all 7 services, parity with system graph
  • CI green before merge

🤖 Generated with Claude Code

aksOps and others added 2 commits April 28, 2026 09:53
…e traces table

/api/metadata/services queried `SELECT DISTINCT service_name FROM traces`,
which silently dropped any service that only ever appeared as a callee
deep in a fan-out. Root cause: TraceServer.Export inserts a Trace row for
every span (not just root spans), the table has uniqueIndex(tenant_id,
trace_id), so the deepest-fan-out span loses the insert race for its
trace_id and its service_name never lands. Result on the simulator: 7
services emit telemetry but only 6 appear in the dropdown — shipping-service
(deepest hop) is invisible while user-service occasionally wins the race.

Fix: read the dropdown from the in-memory GraphRAG ServiceStore, which
sees every UpsertService call regardless of span depth. This is the same
source /api/system/graph already uses, so the dropdown now matches the
system map exactly. No DB query — GraphRAG's own 60s refresh loop is
responsible for cold-start population.

Cold-start (first ~60s after restart) returns []; the JSON encoder emits
`[]` not `null` so the UI dropdown stays a valid array on first paint.

Tests in internal/graphrag/service_names_test.go cover:
- a 3-deep fan-out where the grandchild service must still appear
  (the exact bug condition)
- tenant scoping (no leak across tenants)
- empty store returns [] not nil

Live verification: with the simulator hammering 9001-9007 and otelcontext
running on HTTP_PORT=37778, /api/metadata/services now returns all 7
services and matches /api/system/graph exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…upport

The simulator's build and process-start steps used Windows-style
\$TmpDir\\$svc.exe path concatenation, producing files with literal
backslashes when run via pwsh on Linux. Replaced with Join-Path so the
script works on both Windows and Linux without behaviour change.

This is the path used to run the simulator that exposed the
services-dropdown bug fixed in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aksOps aksOps merged commit 7a6ac1c into main Apr 28, 2026
17 checks passed
@aksOps aksOps deleted the fix/services-dropdown-from-graphrag branch April 28, 2026 09:56