feat(spotlight): scoped baseline for Service Detail + Errors #53

Merged
coccyx merged 1 commit into master from feat/spotlight-scoped-baseline
May 29, 2026

coccyx commented May 29, 2026

Summary

The Service Detail and Errors-page Spotlight panels were comparing the wrong things. The problem surfaced during manual validation of the paymentFailure 50% scenario.

Old behavior: selection = error spans of this service / op; baseline = everything else in the time window. That baseline includes traffic from every other service, so the top-ranked attributes ended up being whatever made this service different from other services (rpc.method=Charge ranks high on payment because no other service does Charge). That's not the question the user is asking — they want to know what changed when this service started failing, and the comparison can't surface it when the baseline is contaminated.

New behavior: a scopeKql parameter restricts BOTH selection and baseline to the same scope. The differential becomes "failing spans of this scope vs healthy spans of this scope."
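The effect of scoping can be seen on a toy dataset. The sketch below is illustrative only: the `Span` shape, the helper names (`share`, `split`, `gap`), and the sample spans are all made up here and are not the app's data model or scoring. It compares how often an attribute value appears in the selection versus the baseline, first against the whole window and then against the service scope.

```typescript
// Toy model only: the Span shape and the sample data are assumptions for illustration.
interface Span {
  service: string;
  isError: boolean;
  attrs: Record<string, string>;
}

type Parts = { selection: Span[]; baseline: Span[] };

// Fraction of `spans` whose attribute `name` equals `value` (0 for an empty set).
function share(spans: Span[], name: string, value: string): number {
  if (spans.length === 0) return 0;
  return spans.filter((s) => s.attrs[name] === value).length / spans.length;
}

// Selection = spans matching `selection`; baseline = everything else.
// When `scope` is given, both sides are restricted to it first.
function split(
  spans: Span[],
  selection: (s: Span) => boolean,
  scope?: (s: Span) => boolean,
): Parts {
  const pool = scope ? spans.filter(scope) : spans;
  return {
    selection: pool.filter(selection),
    baseline: pool.filter((s) => !selection(s)),
  };
}

// Difference in share between selection and baseline for one attribute value.
function gap(parts: Parts, name: string, value: string): number {
  return share(parts.selection, name, value) - share(parts.baseline, name, value);
}

const mk = (service: string, isError: boolean, method: string, grpc: string): Span => ({
  service,
  isError,
  attrs: { "rpc.method": method, "rpc.grpc.status_code": grpc },
});

// Two failing and two healthy payment spans, plus four healthy spans from another service.
const spans: Span[] = [
  mk("payment", true, "Charge", "2"),
  mk("payment", true, "Charge", "2"),
  mk("payment", false, "Charge", "0"),
  mk("payment", false, "Charge", "0"),
  mk("cart", false, "GetCart", "0"),
  mk("cart", false, "GetCart", "0"),
  mk("cart", false, "GetCart", "0"),
  mk("cart", false, "GetCart", "0"),
];

const failingPayment = (s: Span) => s.service === "payment" && s.isError;
const paymentScope = (s: Span) => s.service === "payment";

const unscoped = split(spans, failingPayment);
const scoped = split(spans, failingPayment, paymentScope);

console.log("unscoped rpc.method gap:", gap(unscoped, "rpc.method", "Charge"));
console.log("scoped rpc.method gap:", gap(scoped, "rpc.method", "Charge"));
console.log("scoped status_code gap:", gap(scoped, "rpc.grpc.status_code", "2"));
```

In this toy data, `rpc.method=Charge` shows a large gap only while the baseline includes the other service. Once both sides are scoped to payment the gap disappears, and `rpc.grpc.status_code` is the one value that still separates failing from healthy spans.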

What changed

KQL + data layer

  • spotlightAttrDiff(attrName, selectionKql, top, scopeKql?) — new optional scope param. Injects a where clause BEFORE the extend attr_value step so both countif(sel_match==true) and countif(sel_match==false) are constrained to the scope.
  • getSpotlightDiff(attrs, selectionKql, earliest, latest, options) — positional topPerAttr / onAttr collapsed into a SpotlightDiffOptions object so future additions don't keep expanding the signature. Carries scopeKql, topPerAttr, onAttr.
  • SpotlightSection — new scopeKql prop plumbed through to getSpotlightDiff.
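A minimal sketch of how the scope clause and the options object might fit together follows. It is not the PR's code. Only `scopeKql`, `topPerAttr`, `onAttr`, `attr_value`, `sel_match`, and the two `countif` expressions come from the description above. The function names `buildAttrDiffKql` and `buildAllAttrDiffs`, the `attributes[...]` column access, the `sel_count` / `base_count` column names, the `top` stage, the field types, and the default of 5 are assumptions, and the source dataset is omitted.

```typescript
// Option bag named in the PR; the field types here are guesses.
interface SpotlightDiffOptions {
  scopeKql?: string;
  topPerAttr?: number;
  onAttr?: (attrName: string) => void;
}

// Builds the pipeline stages for one attribute. The scope `where` is emitted
// first, so every later stage (including both countif calls) sees only
// in-scope spans.
function buildAttrDiffKql(
  attrName: string,
  selectionKql: string,
  top: number,
  scopeKql?: string,
): string {
  const stages: string[] = [];
  if (scopeKql) {
    stages.push(`| where ${scopeKql}`);
  }
  stages.push(
    // Assumed column layout: span attributes live in a dynamic `attributes` column.
    `| extend attr_value = tostring(attributes["${attrName}"])`,
    `| extend sel_match = (${selectionKql})`,
    `| summarize sel_count = countif(sel_match == true), base_count = countif(sel_match == false) by attr_value`,
    `| top ${top} by sel_count desc`,
  );
  return stages.join("\n");
}

// Builds one query per attribute, reading everything optional from `options`.
function buildAllAttrDiffs(
  attrs: string[],
  selectionKql: string,
  options: SpotlightDiffOptions = {},
): string[] {
  const top = options.topPerAttr ?? 5; // hypothetical default
  return attrs.map((attr) => {
    options.onAttr?.(attr);
    return buildAttrDiffKql(attr, selectionKql, top, options.scopeKql);
  });
}

console.log(
  buildAttrDiffKql("rpc.grpc.status_code", 'status.code == "2"', 5, 'service.name == "payment"'),
);
```

Placing the `where` ahead of the `extend` steps is what keeps the baseline count from picking up out-of-scope spans. Filtering after the `summarize` would be too late, because the counts would already include them.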

Call sites

| Surface | Scope | Selection |
| --- | --- | --- |
| Traces rail | none (unchanged) | user filter |
| Service Detail | service.name == X | status.code == "2" |
| Per-op expansion | service + operation | status.code == "2" |
| Errors page expansion | service + operation | status.code == "2" |
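The mapping from surface to scope and selection could be expressed as a single lookup, sketched below under stated assumptions. The names `Surface`, `SurfaceContext`, `SpotlightQuery`, `kqlString`, and `spotlightQueryFor` are hypothetical, and the field `name` used for the operation is a guess, since the PR only says "service + operation".

```typescript
type Surface = "tracesRail" | "serviceDetail" | "perOpExpansion" | "errorsPageExpansion";

interface SurfaceContext {
  service?: string;
  operation?: string;
  userFilterKql?: string;
}

interface SpotlightQuery {
  scopeKql?: string; // undefined means "no scope": baseline is the rest of the window
  selectionKql: string;
}

// Selection used by every scoped surface in the PR.
const ERROR_SELECTION = 'status.code == "2"';

// Quotes a value as a double-quoted string literal, escaping backslashes and quotes.
function kqlString(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

function spotlightQueryFor(surface: Surface, ctx: SurfaceContext): SpotlightQuery {
  if (surface === "tracesRail") {
    // Unchanged by the PR: no scope, selection is whatever the user filtered on.
    return { selectionKql: ctx.userFilterKql ?? "" };
  }
  const serviceScope = `service.name == ${kqlString(ctx.service ?? "")}`;
  if (surface === "serviceDetail") {
    return { scopeKql: serviceScope, selectionKql: ERROR_SELECTION };
  }
  // Per-op and Errors-page expansions share the same scope and selection.
  // ASSUMPTION: the operation is matched on a field called `name`.
  const opScope = `${serviceScope} and name == ${kqlString(ctx.operation ?? "")}`;
  return { scopeKql: opScope, selectionKql: ERROR_SELECTION };
}

console.log(spotlightQueryFor("serviceDetail", { service: "payment" }));
```

Keeping the per-op and Errors-page cases on one code path mirrors the description above, where both use the same scope and selection.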

Attribute list

SPOTLIGHT_ATTRIBUTES gets 5 new "where did this come from?" entries:
peer.service, net.peer.name, net.peer.port, error.type, exception.type.

Captions

Rewritten to match the new comparison — e.g. "Comparing failing calls of this operation to its successful ones. Attributes with asymmetric charts are the ones that changed when this error fired — the most likely root-cause signals."

Validation

paymentFailure 50% active on staging.

Before this PR: Service-Detail Spotlight on payment showed 5–6 attrs that were just "payment is a gRPC service" noise (rpc.method=Charge, k8s.pod.name=payment-pod-X, etc.) — all uniform between sel and base, no actionable signal.

After this PR: ONE attribute surfaces — rpc.grpc.status_code (score 6.68). That's the actual root-cause differential between failing and healthy Charge calls.

Errors page: title reads "Spotlight — failing vs healthy calls of payment / oteldemo.PaymentService/Charge" and rpc.grpc.status_code is the surfaced differential — see the screenshot below.

[Screenshot: errors]

[Screenshot: svc]

Test plan

  • npx tsc --noEmit — clean
  • npm run lint — 0 errors
  • npm test — 104/104 passing (3 new for scope plumbing)
  • npm run deploy — packed + uploaded + provisioned on staging
  • Playwright captures succeeded for both Service Detail and Errors page

Validate on staging

paymentFailure 50% is still on.

  1. Open Services → payment. Spotlight section should surface rpc.grpc.status_code (score ~6.68) and not the noisy rpc.method/k8s.pod.name attrs that previously dominated.
  2. Open Errors → expand the payment Charge row. Title should read "failing vs healthy calls of payment / oteldemo.PaymentService/Charge".
  3. Click any value in the per-op differential — drills to filtered Traces.

Known limitations / follow-ups

  • "Fewer attributes" is the right outcome but can look like the panel is empty. Empty-state copy could distinguish "nothing differs" from "still loading."
  • Status Mix chart on Service Detail still shows zero errors for gRPC services (separate bug — slices by HTTP status code class only). Follow-up PR.
  • Some newly added attributes (peer.service, exception.type) aren't populated in the OTel demo, so they get dropped by the engine. No-op until upstream emits them.
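For the first follow-up, one possible shape is to model loading explicitly so the panel can tell "still loading" apart from "nothing differs". This is a sketch of an idea, not something in this PR: the `SpotlightState` type, the `emptyStateCopy` function, and both copy strings are hypothetical.

```typescript
// Hypothetical panel state: loading is a distinct case from "loaded with zero rows".
type SpotlightState =
  | { status: "loading" }
  | { status: "loaded"; rankedAttrs: string[] };

// Returns the message to show in place of the charts, or null when there are charts to render.
function emptyStateCopy(state: SpotlightState): string | null {
  if (state.status === "loading") {
    return "Comparing failing and healthy spans...";
  }
  if (state.rankedAttrs.length === 0) {
    return "No attribute differs between failing and healthy spans in this scope.";
  }
  return null;
}

console.log(emptyStateCopy({ status: "loading" }));
console.log(emptyStateCopy({ status: "loaded", rankedAttrs: [] }));
console.log(emptyStateCopy({ status: "loaded", rankedAttrs: ["rpc.grpc.status_code"] }));
```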

Session log

docs/sessions/2026-05-29-spotlight-scoped.md

🤖 Generated with Claude Code

The Service Detail and Errors-page Spotlight panels were comparing
the wrong things. Selection was "errors of this service" but
baseline was "everything else in the time window" — which includes
traffic from every OTHER service. So the top-ranked attributes
ended up being whatever made this service different from other
services (rpc.method=Charge ranks high on payment because no other
service does Charge) — but that's not the question the user is
asking. They want to know what changed when this service started
failing, which the comparison can't surface when the baseline is
contaminated.

- spotlightAttrDiff() accepts an optional scopeKql parameter. When
  set, BOTH selection and baseline are restricted to spans matching
  the scope. The differential becomes "what's different about my
  selection vs the REST OF THE SCOPE" instead of "vs the rest of
  the window."
- getSpotlightDiff() — positional topPerAttr / onAttr collapsed
  into a SpotlightDiffOptions object so future additions don't keep
  expanding the signature.
- SpotlightSection accepts scopeKql and plumbs it through.
- Service Detail: scope = service.name == X, selection =
  status.code == "2". Compares failing spans of THIS SERVICE
  against healthy ones.
- Per-op expansion on Service Detail: scope = service + operation,
  selection = status.code == "2". Compares failing calls of this
  op against successful ones.
- Errors page expansion: same scope+selection as per-op.
- Captions rewritten — "failing vs healthy calls of this
  operation" etc.

SPOTLIGHT_ATTRIBUTES gets 5 new "where did this come from" entries:
peer.service, net.peer.name, net.peer.port, error.type,
exception.type. These tend to dominate the ranking when
investigating why a service is failing.

Validated on staging with paymentFailure 50% active:
- Before: Service-Detail Spotlight on payment showed 5–6 attrs
  that were just "payment is a gRPC service" noise.
- After: ONE attribute surfaces — rpc.grpc.status_code (score
  6.68). That's the actual root-cause differential between failing
  and healthy Charge calls.
- Errors page: title now reads "Spotlight — failing vs healthy
  calls of payment / oteldemo.PaymentService/Charge" and the
  rpc.grpc.status_code differential is surfaced cleanly.

Pre-merge: tsc clean, lint 0 errors, 104/104 unit tests
(3 new for the scope plumbing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
coccyx merged commit 7b1f2f1 into master on May 29, 2026
3 checks passed
coccyx deleted the feat/spotlight-scoped-baseline branch on May 29, 2026 20:55
coccyx added a commit that referenced this pull request May 29, 2026
Faceted trace search + Spotlight + Settings page reorganization.

- Faceted-nav data layer + UI primitives (#46, #47)
- Search-page integration with Spotlight rail (#48, #49)
- Spotlight on Errors page + Service Detail (#50, #51, #53)
- Small-multiples / readable-card / rate-bar Spotlight redesigns
  driven by manual validation feedback (#52, #54, #55)
- Settings page reorganization with Setup status card, sticky
  nav, and grouped sections (#56)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>