feat(spotlight): scoped baseline for Service Detail + Errors by coccyx · Pull Request #53 · criblio/apm

coccyx · 2026-05-29T16:58:03Z

Summary

The Service Detail and Errors-page Spotlight panels were comparing the wrong things. The problem surfaced during manual validation on the paymentFailure 50% scenario.

Old behavior: selection = error spans of this service / op; baseline = everything else in the time window. That baseline includes traffic from every other service, so the top-ranked attributes ended up being whatever made this service different from other services (rpc.method=Charge ranks high on payment because no other service does Charge). That's not the question the user is asking — they want to know what changed when this service started failing, and the comparison can't surface it when the baseline is contaminated.

New behavior: a scopeKql parameter restricts BOTH selection and baseline to the same scope. The differential becomes "failing spans of this scope vs healthy spans of this scope."
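
To make the difference concrete, here is a minimal in-memory sketch of the two comparisons. It is not the code in this PR (the real work happens in KQL); the `Span` shape, `splitCounts`, and the sample data are all hypothetical and exist only to illustrate why an unscoped baseline makes rpc.method=Charge look interesting.

```typescript
// Hypothetical span shape; the real schema is not shown in this PR.
interface Span {
  service: string;
  statusCode: string; // "2" = error, matching the selection used in this PR
  attrs: Record<string, string>;
}

type Predicate = (s: Span) => boolean;
type Counts = Map<string, { sel: number; base: number }>;

// Count how often each value of `attr` appears in the selection vs the baseline.
// When `scope` is given, spans outside it are dropped before the split, so the
// baseline becomes "rest of the scope" instead of "rest of the window".
function splitCounts(spans: Span[], attr: string, selection: Predicate, scope?: Predicate): Counts {
  const out: Counts = new Map();
  for (const s of spans) {
    if (scope && !scope(s)) continue;
    const value = s.attrs[attr];
    if (value === undefined) continue;
    const row = out.get(value) ?? { sel: 0, base: 0 };
    if (selection(s)) row.sel += 1;
    else row.base += 1;
    out.set(value, row);
  }
  return out;
}

// Made-up sample: payment fails half the time, cart is healthy.
const mk = (service: string, statusCode: string, method: string): Span => ({
  service,
  statusCode,
  attrs: { "rpc.method": method },
});
const spans: Span[] = [
  mk("payment", "2", "Charge"),
  mk("payment", "2", "Charge"),
  mk("payment", "1", "Charge"),
  mk("payment", "1", "Charge"),
  mk("cart", "1", "GetCart"),
  mk("cart", "1", "GetCart"),
  mk("cart", "1", "GetCart"),
];

// Old behavior: selection = payment errors, baseline = everything else.
const unscoped = splitCounts(spans, "rpc.method", (s) => s.service === "payment" && s.statusCode === "2");

// New behavior: both sides restricted to payment; selection = errors.
const scoped = splitCounts(
  spans,
  "rpc.method",
  (s) => s.statusCode === "2",
  (s) => s.service === "payment",
);
```

In the unscoped result, Charge makes up all of the selection but only 2 of 5 baseline spans, so it looks like a signal. In the scoped result, Charge is the only value on both sides, so it correctly drops out as uninformative.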

What changed

KQL + data layer

spotlightAttrDiff(attrName, selectionKql, top, scopeKql?) — new optional scope param. Injects a where clause BEFORE the extend attr_value step so both countif(sel_match==true) and countif(sel_match==false) are constrained to the scope.
getSpotlightDiff(attrs, selectionKql, earliest, latest, options) — positional topPerAttr / onAttr collapsed into a SpotlightDiffOptions object so future additions don't keep expanding the signature. Carries scopeKql, topPerAttr, onAttr.
SpotlightSection — new scopeKql prop plumbed through to getSpotlightDiff.
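
The ordering described above (scope filter first, then the attr_value extend, then the two countif aggregates) could look roughly like the sketch below. This is an assumption-heavy illustration, not the query in the PR: the function name `sketchAttrDiffPipeline`, the `attributes[...]` column access, the final `top` stage, and the field types on `SpotlightDiffOptions` are all guesses. Only the parameter names and the stage order come from the description.

```typescript
// Options object named in this PR; the field types here are assumptions.
interface SpotlightDiffOptions {
  scopeKql?: string;
  topPerAttr?: number;
  onAttr?: (attrName: string) => void; // callback shape is a guess
}

// Hypothetical builder that mirrors the described stage order.
// Returns only the pipeline stages; the data source is left to the caller.
function sketchAttrDiffPipeline(
  attrName: string,
  selectionKql: string,
  top: number,
  scopeKql?: string,
): string {
  const stages: string[] = [];

  // Scope goes first so that BOTH countif() branches below see only in-scope spans.
  if (scopeKql) stages.push(`where ${scopeKql}`);

  // Assumed column layout: span attributes live in a dynamic `attributes` column.
  stages.push(`extend attr_value = tostring(attributes["${attrName}"])`);
  stages.push(`extend sel_match = (${selectionKql})`);
  stages.push(
    "summarize sel = countif(sel_match == true), base = countif(sel_match == false) by attr_value",
  );
  stages.push(`top ${top} by sel desc`);

  return stages.map((stage) => `| ${stage}`).join("\n");
}

// Example: Service Detail on the payment service.
const examplePipeline = sketchAttrDiffPipeline(
  "rpc.grpc.status_code",
  'status.code == "2"',
  5,
  'service.name == "payment"',
);
console.log(examplePipeline);
```

Putting the `where` ahead of the `extend` means an omitted scope leaves the pipeline identical to the old unscoped form, which is consistent with the Traces rail being unchanged.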

Call sites

Surface	Scope	Selection
Traces rail	(none, unchanged)	user filter
Service Detail	`service.name == X`	`status.code == "2"`
Per-op expansion	`service + operation`	`status.code == "2"`
Errors page expansion	`service + operation`	`status.code == "2"`
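
As a rough illustration of the table, the mapping from surface to scope and selection could be expressed as below. Everything here is hypothetical: `queryFor`, the `Surface` names, and in particular the `name` field used for the operation, since the PR only says the scope is "service + operation" and does not show the actual field.

```typescript
type Surface = "traces-rail" | "service-detail" | "per-op" | "errors-expansion";

interface SpotlightQuery {
  scopeKql?: string; // omitted on the Traces rail, which stays unscoped
  selectionKql: string;
}

interface SurfaceContext {
  service?: string;
  operation?: string;
  userFilter?: string;
}

const ERROR_SELECTION = 'status.code == "2"';

// Quote a value as a KQL string literal, escaping backslashes and double quotes.
function quote(value: string): string {
  return `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;
}

function queryFor(surface: Surface, ctx: SurfaceContext): SpotlightQuery {
  const serviceScope = `service.name == ${quote(ctx.service ?? "")}`;
  switch (surface) {
    case "traces-rail":
      // No scope: selection is whatever the user filtered on.
      return { selectionKql: ctx.userFilter ?? "" };
    case "service-detail":
      return { scopeKql: serviceScope, selectionKql: ERROR_SELECTION };
    case "per-op":
    case "errors-expansion":
      // ASSUMPTION: the operation is matched on a `name` field.
      return {
        scopeKql: `${serviceScope} and name == ${quote(ctx.operation ?? "")}`,
        selectionKql: ERROR_SELECTION,
      };
  }
}
```

Per-op and Errors-page expansion share one branch because the table gives them the same scope and selection.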

Attribute list

SPOTLIGHT_ATTRIBUTES gets 5 new "where did this come from?" entries:
peer.service, net.peer.name, net.peer.port, error.type, exception.type.

Captions

Rewritten to match the new comparison — e.g. "Comparing failing calls of this operation to its successful ones. Attributes with asymmetric charts are the ones that changed when this error fired — the most likely root-cause signals."

Validation

paymentFailure 50% active on staging.

Before this PR: Service-Detail Spotlight on payment showed 5–6 attrs that were just "payment is a gRPC service" noise (rpc.method=Charge, k8s.pod.name=payment-pod-X, etc.) — all uniform between sel and base, no actionable signal.

After this PR: ONE attribute surfaces — rpc.grpc.status_code (score 6.68). That's the actual root-cause differential between failing and healthy Charge calls.

Errors page: title reads "Spotlight — failing vs healthy calls of payment / oteldemo.PaymentService/Charge" and rpc.grpc.status_code is the surfaced differential — see the screenshot below.

Test plan

npx tsc --noEmit — clean
npm run lint — 0 errors
npm test — 104/104 passing (3 new for scope plumbing)
npm run deploy — packed + uploaded + provisioned on staging
Playwright captures succeeded for both Service Detail and Errors page

Validate on staging

paymentFailure 50% is still on.

Open Services → payment. Spotlight section should surface rpc.grpc.status_code (score ~6.68) and not the noisy rpc.method/k8s.pod.name attrs that previously dominated.
Open Errors → expand the payment Charge row. Title should read "failing vs healthy calls of payment / oteldemo.PaymentService/Charge".
Click any value in the per-op differential — drills to filtered Traces.

Known limitations / follow-ups

"Fewer attributes" is the right outcome but can look like the panel is empty. Empty-state copy could distinguish "nothing differs" from "still loading."
Status Mix chart on Service Detail still shows zero errors for gRPC services (separate bug — slices by HTTP status code class only). Follow-up PR.
Some newly added attributes (peer.service, exception.type) aren't populated in the OTel demo, so they get dropped by the engine. No-op until upstream emits them.
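
For the first limitation, one possible shape for the empty-state follow-up is sketched below. It is not part of this PR: the `SpotlightState` type, the `emptyStateCopy` helper, and the copy strings are hypothetical and only show how "still loading" and "nothing differs" could be kept apart.

```typescript
// Hypothetical panel state; the real component state is not shown in this PR.
type SpotlightState =
  | { status: "loading" }
  | { status: "ready"; attrCount: number };

// Returns the empty-state message to show, or null when there are cards to render.
function emptyStateCopy(state: SpotlightState): string | null {
  if (state.status === "loading") {
    return "Comparing failing and healthy spans...";
  }
  if (state.attrCount === 0) {
    return "No attribute differs between failing and healthy spans in this scope.";
  }
  return null;
}
```

Modeling the state as a discriminated union keeps the "zero attributes" case reachable only after the query has finished, so the two messages cannot be confused.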

Session log

docs/sessions/2026-05-29-spotlight-scoped.md

🤖 Generated with Claude Code

The Service Detail and Errors-page Spotlight panels were comparing the wrong things. Selection was "errors of this service" but baseline was "everything else in the time window", which includes traffic from every OTHER service. So the top-ranked attributes ended up being whatever made this service different from other services (rpc.method=Charge ranks high on payment because no other service does Charge), but that's not the question the user is asking. They want to know what changed when this service started failing, which the comparison can't surface when the baseline is contaminated.

- spotlightAttrDiff() accepts an optional scopeKql parameter. When set, BOTH selection and baseline are restricted to spans matching the scope. The differential becomes "what's different about my selection vs the REST OF THE SCOPE" instead of "vs the rest of the window."
- getSpotlightDiff(): positional topPerAttr / onAttr collapsed into a SpotlightDiffOptions object so future additions don't keep expanding the signature.
- SpotlightSection accepts scopeKql and plumbs it through.
- Service Detail: scope = service.name == X, selection = status.code == "2". Compares failing spans of THIS SERVICE against healthy ones.
- Per-op expansion on Service Detail: scope = service + operation, selection = status.code == "2". Compares failing calls of this op against successful ones.
- Errors page expansion: same scope+selection as per-op.
- Captions rewritten: "failing vs healthy calls of this operation" etc.

SPOTLIGHT_ATTRIBUTES gets 5 new "where did this come from" entries: peer.service, net.peer.name, net.peer.port, error.type, exception.type. These tend to dominate the ranking when investigating why a service is failing.

Validated on staging with paymentFailure 50% active:

- Before: Service-Detail Spotlight on payment showed 5–6 attrs that were just "payment is a gRPC service" noise.
- After: ONE attribute surfaces, rpc.grpc.status_code (score 6.68). That's the actual root-cause differential between failing and healthy Charge calls.
- Errors page: title now reads "Spotlight — failing vs healthy calls of payment / oteldemo.PaymentService/Charge" and the rpc.grpc.status_code differential is surfaced cleanly.

Pre-merge: tsc clean, lint 0 errors, 104/104 unit tests (3 new for the scope plumbing).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Faceted trace search + Spotlight + Settings page reorganization.

- Faceted-nav data layer + UI primitives (#46, #47)
- Search-page integration with Spotlight rail (#48, #49)
- Spotlight on Errors page + Service Detail (#50, #51, #53)
- Small-multiples / readable-card / rate-bar Spotlight redesigns driven by manual validation feedback (#52, #54, #55)
- Settings page reorganization with Setup status card, sticky nav, and grouped sections (#56)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coccyx mentioned this pull request May 29, 2026

feat(spotlight): readable cards with TL;DR + inline values #54

Closed

5 tasks

coccyx merged commit 7b1f2f1 into master May 29, 2026
3 checks passed

coccyx deleted the feat/spotlight-scoped-baseline branch May 29, 2026 20:55

coccyx merged 1 commit into master from feat/spotlight-scoped-baseline
