feat(spotlight): scoped baseline for Service Detail + Errors #53
Merged
The Service Detail and Errors-page Spotlight panels were comparing the wrong things. Selection was "errors of this service" but baseline was "everything else in the time window", which includes traffic from every OTHER service. So the top-ranked attributes ended up being whatever made this service different from other services (rpc.method=Charge ranks high on payment because no other service does Charge), but that's not the question the user is asking. They want to know what changed when this service started failing, which the comparison can't surface when the baseline is contaminated.

- spotlightAttrDiff() accepts an optional scopeKql parameter. When set, BOTH selection and baseline are restricted to spans matching the scope. The differential becomes "what's different about my selection vs the REST OF THE SCOPE" instead of "vs the rest of the window."
- getSpotlightDiff(): positional topPerAttr / onAttr collapsed into a SpotlightDiffOptions object so future additions don't keep expanding the signature.
- SpotlightSection accepts scopeKql and plumbs it through.
- Service Detail: scope = service.name == X, selection = status.code == "2". Compares failing spans of THIS SERVICE against healthy ones.
- Per-op expansion on Service Detail: scope = service + operation, selection = status.code == "2". Compares failing calls of this op against successful ones.
- Errors page expansion: same scope+selection as per-op.
- Captions rewritten: "failing vs healthy calls of this operation" etc.

SPOTLIGHT_ATTRIBUTES gets 5 new "where did this come from" entries: peer.service, net.peer.name, net.peer.port, error.type, exception.type. These tend to dominate the ranking when investigating why a service is failing.

Validated on staging with paymentFailure 50% active:

- Before: Service-Detail Spotlight on payment showed 5–6 attrs that were just "payment is a gRPC service" noise.
- After: ONE attribute surfaces, rpc.grpc.status_code (score 6.68). That's the actual root-cause differential between failing and healthy Charge calls.
- Errors page: title now reads "Spotlight — failing vs healthy calls of payment / oteldemo.PaymentService/Charge" and the rpc.grpc.status_code differential is surfaced cleanly.

Pre-merge: tsc clean, lint 0 errors, 104/104 unit tests (3 new for the scope plumbing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coccyx added a commit that referenced this pull request May 29, 2026
Faceted trace search + Spotlight + Settings page reorganization.

- Faceted-nav data layer + UI primitives (#46, #47)
- Search-page integration with Spotlight rail (#48, #49)
- Spotlight on Errors page + Service Detail (#50, #51, #53)
- Small-multiples / readable-card / rate-bar Spotlight redesigns driven by manual validation feedback (#52, #54, #55)
- Settings page reorganization with Setup status card, sticky nav, and grouped sections (#56)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
The Service Detail and Errors-page Spotlight panels were comparing the wrong things, surfaced during manual validation on the `paymentFailure 50%` scenario.

Old behavior: selection = error spans of this service / op; baseline = everything else in the time window. That baseline includes traffic from every other service, so the top-ranked attributes ended up being whatever made this service different from other services (`rpc.method=Charge` ranks high on payment because no other service does Charge). That's not the question the user is asking: they want to know what changed when this service started failing, and the comparison can't surface it when the baseline is contaminated.

New behavior: a `scopeKql` parameter restricts BOTH selection and baseline to the same scope. The differential becomes "failing spans of this scope vs healthy spans of this scope."

What changed
KQL + data layer
- `spotlightAttrDiff(attrName, selectionKql, top, scopeKql?)`: new optional scope param. Injects a `where` clause BEFORE the `extend attr_value` step so both `countif(sel_match==true)` and `countif(sel_match==false)` are constrained to the scope.
- `getSpotlightDiff(attrs, selectionKql, earliest, latest, options)`: positional `topPerAttr` / `onAttr` collapsed into a `SpotlightDiffOptions` object so future additions don't keep expanding the signature. Carries `scopeKql`, `topPerAttr`, `onAttr`.
- `SpotlightSection`: new `scopeKql` prop plumbed through to `getSpotlightDiff`.

Call sites
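| Call site | Scope | Selection |
| --- | --- | --- |
| Service Detail | `service.name == X` | `status.code == "2"` |
| Per-op expansion on Service Detail | service + operation | `status.code == "2"` |
| Errors page expansion | service + operation | `status.code == "2"` |

Attribute list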
`SPOTLIGHT_ATTRIBUTES` gets 5 new "where did this come from?" entries: `peer.service`, `net.peer.name`, `net.peer.port`, `error.type`, `exception.type`.

Captions
Rewritten to match the new comparison — e.g. "Comparing failing calls of this operation to its successful ones. Attributes with asymmetric charts are the ones that changed when this error fired — the most likely root-cause signals."
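
To make the scoped comparison from "KQL + data layer" concrete, here is a minimal TypeScript sketch that runs the same idea over an in-memory array instead of KQL. Everything in it is an assumption made for illustration: the `Span`, `DiffOptions`, and `attrDiff` names, the sample spans, and especially the score (a plain absolute difference in value share, not the score Spotlight actually computes). It is not the PR's implementation; it only shows why filtering by scope before splitting selection from baseline removes "this is the payment service" noise.

```typescript
// Illustrative only: an in-memory stand-in for the KQL-backed Spotlight diff.
// All names here (Span, DiffOptions, attrDiff, the sample data) are hypothetical.
type Span = { [attr: string]: string | undefined };
type Predicate = (span: Span) => boolean;

// Optional knobs travel in one object, echoing the SpotlightDiffOptions idea.
interface DiffOptions {
  scope?: Predicate;   // plays the role of scopeKql
  topPerAttr?: number; // max values reported for the attribute
}

interface ValueDiff {
  value: string;
  selShare: number;  // share of selection spans carrying this value
  baseShare: number; // share of baseline spans carrying this value
  score: number;     // assumed metric: |selShare - baseShare|
}

function attrDiff(
  spans: Span[],
  attr: string,
  selection: Predicate,
  options: DiffOptions = {},
): ValueDiff[] {
  const { scope, topPerAttr = 5 } = options;
  // Apply the scope first so selection AND baseline share one population.
  const scoped = scope ? spans.filter(scope) : spans;
  const sel = scoped.filter(selection);
  const base = scoped.filter((s) => !selection(s));

  const share = (group: Span[], value: string): number =>
    group.length === 0
      ? 0
      : group.filter((s) => s[attr] === value).length / group.length;

  // Collect the distinct values of the attribute within the scoped population.
  const values = new Set<string>();
  for (const s of scoped) {
    const v = s[attr];
    if (v !== undefined) values.add(v);
  }

  return Array.from(values)
    .map((value) => {
      const selShare = share(sel, value);
      const baseShare = share(base, value);
      return { value, selShare, baseShare, score: Math.abs(selShare - baseShare) };
    })
    .sort((a, b) => b.score - a.score)
    .slice(0, topPerAttr);
}

// Made-up sample data: 2 failing + 2 healthy payment spans, 4 healthy cart spans.
const span = (service: string, method: string, status: string, grpc: string): Span => ({
  "service.name": service,
  "rpc.method": method,
  "status.code": status,
  "rpc.grpc.status_code": grpc,
});

const spans: Span[] = [
  span("payment", "Charge", "2", "13"),
  span("payment", "Charge", "2", "13"),
  span("payment", "Charge", "0", "0"),
  span("payment", "Charge", "0", "0"),
  span("cart", "GetCart", "0", "0"),
  span("cart", "GetCart", "0", "0"),
  span("cart", "GetCart", "0", "0"),
  span("cart", "GetCart", "0", "0"),
];

const isPayment: Predicate = (s) => s["service.name"] === "payment";
const isError: Predicate = (s) => s["status.code"] === "2";

// Old shape: selection = errors of this service, baseline = everything else.
const unscoped = attrDiff(spans, "rpc.method", (s) => isPayment(s) && isError(s));
// New shape: scope = this service, selection = failing spans within that scope.
const scopedMethod = attrDiff(spans, "rpc.method", isError, { scope: isPayment });
const scopedGrpc = attrDiff(spans, "rpc.grpc.status_code", isError, { scope: isPayment });

console.log({ unscoped, scopedMethod, scopedGrpc });
```

In this toy data, `rpc.method` gets a non-zero score in the unscoped call only because cart traffic sits in the baseline. With the scope applied, `Charge` is equally common among failing and healthy payment spans, so its score is 0, while `rpc.grpc.status_code` still separates the two groups.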
Validation
`paymentFailure 50%` active on staging.

- Before this PR: Service-Detail Spotlight on payment showed 5–6 attrs that were just "payment is a gRPC service" noise (`rpc.method=Charge`, `k8s.pod.name=payment-pod-X`, etc.), all uniform between sel and base, no actionable signal.
- After this PR: ONE attribute surfaces, `rpc.grpc.status_code` (score 6.68). That's the actual root-cause differential between failing and healthy Charge calls.
- Errors page: title reads "Spotlight — failing vs healthy calls of payment / oteldemo.PaymentService/Charge" and `rpc.grpc.status_code` is the surfaced differential (see the screenshot below).

Test plan
- `npx tsc --noEmit`: clean
- `npm run lint`: 0 errors
- `npm test`: 104/104 passing (3 new for scope plumbing)
- `npm run deploy`: packed + uploaded + provisioned on staging

Validate on staging
- `paymentFailure 50%` is still on.
- Spotlight shows `rpc.grpc.status_code` (score ~6.68) and not the noisy `rpc.method` / `k8s.pod.name` attrs that previously dominated.

Known limitations / follow-ups
- Some of the new attributes (`peer.service`, `exception.type`) aren't populated in the OTel demo, so they get dropped by the engine. No-op until upstream emits them.

Session log
`docs/sessions/2026-05-29-spotlight-scoped.md`

🤖 Generated with Claude Code