fix(api): source services dropdown from GraphRAG, not traces table #69
Merged
…e traces table

/api/metadata/services queried `SELECT DISTINCT service_name FROM traces`, which silently dropped any service that only ever appeared as a callee deep in a fan-out.

Root cause: TraceServer.Export inserts a Trace row for every span (not just root spans), the table has uniqueIndex(tenant_id, trace_id), so the deepest-fan-out span loses the insert race for its trace_id and its service_name never lands.

Result on the simulator: 7 services emit telemetry but only 6 appear in the dropdown: shipping-service (deepest hop) is invisible while user-service occasionally wins the race.

Fix: read the dropdown from the in-memory GraphRAG ServiceStore, which sees every UpsertService call regardless of span depth. This is the same source /api/system/graph already uses, so the dropdown now matches the system map exactly. No DB query: GraphRAG's own 60s refresh loop is responsible for cold-start population.

Cold-start (first ~60s after restart) returns []; the JSON encoder emits `[]` not `null` so the UI dropdown stays a valid array on first paint.

Tests in internal/graphrag/service_names_test.go cover:
- a 3-deep fan-out where the grandchild service must still appear (the exact bug condition)
- tenant scoping (no leak across tenants)
- empty store returns [] not nil

Live verification: with the simulator hammering 9001-9007 and otelcontext running on HTTP_PORT=37778, /api/metadata/services now returns all 7 services and matches /api/system/graph exactly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…upport

The simulator's build and process-start steps used Windows-style \$TmpDir\\$svc.exe path concatenation, producing files with literal backslashes when run via pwsh on Linux. Replaced with Join-Path so the script works on both Windows and Linux without behaviour change.

This is the path used to run the simulator that exposed the services-dropdown bug fixed in this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>



Summary
The UI showed 6 services in the metadata dropdown but 7 on the system map. Reproduced live with the chaos simulator hitting all 7 test services: `shipping-service` was the consistent loser, while `user-service` flapped (sometimes present, sometimes not).
Root cause
`/api/metadata/services` ran `SELECT DISTINCT service_name FROM traces`. `TraceServer.Export` (`internal/ingest/otlp.go:378-385`) appends a `storage.Trace{ServiceName: serviceName, ...}` row for every span, root or child. The table has `uniqueIndex idx_traces_tenant_trace_id` on `(tenant_id, trace_id)`, so only one row per trace_id survives, and whichever span won the insert race set the `service_name`. A service that never wins the race for any trace therefore never shows up in the `DISTINCT` result. `shipping-service` is the deepest hop in the fan-out, so by the time its OTLP batch arrived, the trace_id was already claimed.
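The failure mode can be reproduced without a database. The following is a minimal sketch, not the project's code: `span`, `traceKey`, and `distinctServicesFromTraces` are hypothetical names, and it assumes that a conflicting insert on the unique index is simply discarded (first writer wins), which is how the description above reads.

```go
package main

import (
	"fmt"
	"sort"
)

// span is a hypothetical, simplified stand-in for one ingested span.
type span struct {
	TenantID, TraceID, ServiceName string
}

// traceKey mirrors the (tenant_id, trace_id) unique index.
type traceKey struct {
	TenantID, TraceID string
}

// distinctServicesFromTraces models one insert per span into a table with a
// unique (tenant_id, trace_id) index, then SELECT DISTINCT service_name over
// the rows that survived. Assumption: a conflicting insert is dropped.
func distinctServicesFromTraces(spans []span) []string {
	rows := make(map[traceKey]string)
	for _, s := range spans {
		k := traceKey{s.TenantID, s.TraceID}
		if _, claimed := rows[k]; claimed {
			continue // trace_id already claimed: this span's service_name is lost
		}
		rows[k] = s.ServiceName
	}

	seen := make(map[string]struct{})
	for _, name := range rows {
		seen[name] = struct{}{}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	// Spans arrive in call order, so the deepest hop is always last.
	spans := []span{
		{"t1", "trace-a", "order-service"},
		{"t1", "trace-a", "payment-service"},
		{"t1", "trace-a", "shipping-service"},
		{"t1", "trace-b", "payment-service"},
		{"t1", "trace-b", "shipping-service"},
	}
	fmt.Println(distinctServicesFromTraces(spans)) // [order-service payment-service]
}
```

In this sketch `shipping-service` emits two spans but owns no row, so it is absent from the result, which is the same shape as the 6-of-7 dropdown.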
Fix
`handleGetServices` now reads from `graphrag.ServiceNames(ctx)` — the in-memory `ServiceStore` populated by `OnSpanIngested → UpsertService`. Every span (root or child) registers, so deep callees can't be dropped. This is the same source `/api/system/graph` already uses, so the dropdown matches the system map by construction.
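For contrast, here is a rough sketch of a per-tenant in-memory store with the properties the fix relies on. It is illustrative only: `serviceStore` and its fields are hypothetical, the real `graphrag.ServiceNames(ctx)` takes a context rather than an explicit tenant ID, and the sorted output is an assumption. The sketch also shows why an empty store should return an empty slice rather than `nil`: Go's `encoding/json` encodes a nil slice as `null` and an empty non-nil slice as `[]`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"sync"
)

// serviceStore is a hypothetical stand-in for the in-memory service store.
type serviceStore struct {
	mu       sync.RWMutex
	byTenant map[string]map[string]struct{}
}

func newServiceStore() *serviceStore {
	return &serviceStore{byTenant: make(map[string]map[string]struct{})}
}

// UpsertService records a service for a tenant. It is idempotent and has no
// per-trace uniqueness constraint, so span depth and arrival order are irrelevant.
func (s *serviceStore) UpsertService(tenantID, name string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	set, ok := s.byTenant[tenantID]
	if !ok {
		set = make(map[string]struct{})
		s.byTenant[tenantID] = set
	}
	set[name] = struct{}{}
}

// ServiceNames returns the tenant's services, sorted. The result is always a
// non-nil slice so that it JSON-encodes as [] when the store is empty.
func (s *serviceStore) ServiceNames(tenantID string) []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	set := s.byTenant[tenantID] // nil map for unknown tenants; ranging over it is safe
	names := make([]string, 0, len(set))
	for name := range set {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	store := newServiceStore()
	// Every span registers its service, including the deepest callee.
	for _, svc := range []string{"order-service", "payment-service", "shipping-service"} {
		store.UpsertService("t1", svc)
	}

	populated, _ := json.Marshal(store.ServiceNames("t1"))
	fmt.Println(string(populated)) // ["order-service","payment-service","shipping-service"]

	empty, _ := json.Marshal(store.ServiceNames("t2"))
	fmt.Println(string(empty)) // []
}
```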
Files
Verification
Live, with the chaos simulator running:
```
$ curl -s http://localhost:37778/api/metadata/services
["auth-service","inventory-service","notification-service","order-service",
"payment-service","shipping-service","user-service"]
```
Now matches `/api/system/graph` exactly — all 7 services visible.
Test plan
🤖 Generated with Claude Code