feat(spotlight): pivot to error-rate-per-value + drop tautological attrs by coccyx · Pull Request #55 · criblio/apm

coccyx · 2026-05-29T17:54:21Z

Summary

Three iterations of share-differential visualizations all got the same feedback: "I don't understand what this is showing me." The Operations table reads "Charge: 14% error rate" — one number per row, instantly. Spotlight was making the user combine three numbers (sel share / base share / diff) to reach the same insight. Wrong primitive.

Pivot Spotlight to use the same primitive as the Operations table, generalized to every attribute.

For each attribute value: render a horizontal bar whose width is the error rate (or selection rate in unscoped contexts). Above the bar: value name + percentage. Below: "N total · X errors." Click Search → drill into matching spans.
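Here is a minimal sketch of how one row's bar width and caption could be derived from per-value counts. The names `ValueCounts`, `rateOf`, `barWidth`, and `formatRow` are hypothetical. The `noun` parameter stands in for the `selectionNoun` prop, and the caption format mirrors the "500: 50% (1,610 total · 805 errors)" rows quoted in the commit notes.

```typescript
// Hypothetical shape of one attribute value's counts.
interface ValueCounts {
  value: string; // attribute value, e.g. "500"
  selN: number;  // spans with this value that are in the selection
  total: number; // all spans with this value (selection + baseline)
}

// Selection rate for one value; an empty row counts as 0%.
function rateOf(c: ValueCounts): number {
  return c.total === 0 ? 0 : c.selN / c.total;
}

// CSS width for the bar: the bar's width is the rate itself.
function barWidth(c: ValueCounts): string {
  return `${(rateOf(c) * 100).toFixed(1)}%`;
}

// Caption text; `noun` plays the role of selectionNoun ("errors", "matching").
function formatRow(c: ValueCounts, noun: string): string {
  const pct = Math.round(rateOf(c) * 100);
  const n = (x: number): string => x.toLocaleString("en-US");
  return `${c.value}: ${pct}% (${n(c.total)} total · ${n(c.selN)} ${noun})`;
}
```

For example, `formatRow({ value: "500", selN: 805, total: 1610 }, "errors")` yields `500: 50% (1,610 total · 805 errors)`, and `barWidth` of the same row is `50.0%`.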

Also drops response-side attributes (rpc.grpc.status_code, http.response.status_code, response_flags, error.type, exception.type) that 1:1 correlate with the error selection and just produce uninformative 100% bars.
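A sketch of the exclusion step follows, assuming it is a plain static filter over attribute names. The attribute names come from this PR, including `http.status_code` from the commit notes. The constant `TAUTOLOGICAL_ATTRS` and the helper `withoutTautologies` are hypothetical.

```typescript
// Response-side attributes named in this PR as 1:1 correlated with the
// error selection. Constant and helper names are hypothetical.
const TAUTOLOGICAL_ATTRS: ReadonlySet<string> = new Set([
  "rpc.grpc.status_code",
  "http.response.status_code",
  "http.status_code",
  "response_flags",
  "error.type",
  "exception.type",
]);

// Keep only attributes that can explain an error rather than restate it.
function withoutTautologies(attrs: readonly string[]): string[] {
  return attrs.filter((attr) => !TAUTOLOGICAL_ATTRS.has(attr));
}
```

A fixed list fits better here than a data-driven check such as "every value is 0% or 100%". The four app.product.id rows shown under "What you'll see" are also all either 100% or 0%, so that kind of check would throw away the smoking gun.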

Also drops the curated SVC_SPOTLIGHT_ATTRS shortlist on Service Detail. The concurrency-cap-at-4 semaphore in api/search.ts already keeps Spotlight under the cluster's 20-concurrent-job ceiling; the curated list was just leaving real signal on the floor.
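A counting semaphore is one common way to implement a concurrency cap like this. The sketch below only illustrates the idea and is not the code in api/search.ts. The `Semaphore` class and its `run()` method are hypothetical names.

```typescript
// Illustrative counting semaphore: at most `limit` tasks run at once and the
// rest wait in FIFO order. Not the implementation in api/search.ts.
class Semaphore {
  private readonly limit: number;
  private active = 0;
  private readonly waiters: Array<() => void> = [];

  constructor(limit: number) {
    this.limit = limit;
  }

  // Resolves once the caller holds a slot.
  private async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    // No free slot: wait until release() hands one over.
    await new Promise<void>((resolve) => this.waiters.push(resolve));
  }

  // Hands the slot straight to the next waiter, or frees it.
  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      next(); // `active` is unchanged: the slot just changes owner
    } else {
      this.active--;
    }
  }

  // Runs `task` while holding a slot; the slot is released even on failure.
  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}
```

With `new Semaphore(4)`, each per-attribute search is wrapped in `run()`. Scanning the full attribute list then only makes the queue longer. It never raises the number of jobs in flight above four.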

What you'll see

Expanding a product-catalog / GetProduct row on the Errors page, with the productCatalogFailure flag on:

Spotlight — error rate per attribute for product-catalog / GetProduct
app.product.id (overall 23% errors)

OLJCESPC7Z [████████████████] 100% — 642 total · 642 errors [Search →]
12345 [████████████████] 100% — 16 total · 16 errors [Search →]
66VCHSJNUP [ ] 0% — 559 total · 0 errors [Search →]
0PUK6V6EV0 [ ] 0% — 374 total · 0 errors [Search →]

The smoking gun (the product ID targeted by the flag) is immediately visible.

Architecture

Engine rewritten: selectionRate = selN / total per value; attributes are ranked by the max over values of |rate - overall| * log1p(total).
New selectionNoun prop on SpotlightPanel / SpotlightSection — "errors" on Service Detail / Errors, "matching" on Traces.
Added name and kind as queryable top-level span columns via attrValueExpr(). The OTel span operation name lives in the bare name column, not under attributes['name'].
SpotlightHistogram component deleted — the chart was the wrong primitive.
Response-side tautological attrs removed from SPOTLIGHT_ATTRIBUTES.
SVC_SPOTLIGHT_ATTRS removed; Service Detail uses the full broad list.
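The engine change above can be sketched as one pure function per attribute. This is a sketch under stated assumptions, not the PR's engine. All type and function names are hypothetical. `total` inside `log1p` is read as the per-value total. "Uniform-attribute filtering" is read as "every value has the same rate". Rows are sorted by rate, then by volume, to match the order in the example output.

```typescript
// Per-value counts for one attribute (hypothetical shape).
interface ValueCount {
  value: string;
  selN: number;  // spans with this value that ARE in the selection
  baseN: number; // spans with this value that are NOT in the selection
}

interface RateRow {
  value: string;
  selN: number;
  total: number;            // selN + baseN
  rate: number;             // selectionRate = selN / total
  tone: "accent" | "muted"; // accent only when strictly above the overall rate
}

interface AttributeResult {
  overall: number; // selection rate across all values of the attribute
  rows: RateRow[];
  score: number;   // used to rank attributes against each other
}

// Returns null when the attribute carries no signal: no data, or every
// value has the same rate (our reading of "uniform").
function scoreAttribute(counts: readonly ValueCount[]): AttributeResult | null {
  const live = counts.filter((c) => c.selN + c.baseN > 0);
  const selAll = live.reduce((sum, c) => sum + c.selN, 0);
  const totalAll = live.reduce((sum, c) => sum + c.selN + c.baseN, 0);
  if (totalAll === 0) return null;
  const overall = selAll / totalAll;

  const rows = live.map((c): RateRow => {
    const total = c.selN + c.baseN;
    const rate = c.selN / total;
    return { value: c.value, selN: c.selN, total, rate, tone: rate > overall ? "accent" : "muted" };
  });

  // Uniform attribute (including single-valued ones): nothing stands out.
  const rates = rows.map((r) => r.rate);
  if (Math.max(...rates) - Math.min(...rates) < 1e-12) return null;

  // L-infinity over rows, each deviation weighted by log1p(row total).
  const score = Math.max(...rows.map((r) => Math.abs(r.rate - overall) * Math.log1p(r.total)));

  // Highest rate first; larger totals break ties.
  rows.sort((a, b) => b.rate - a.rate || b.total - a.total);
  return { overall, rows, score };
}
```

The log1p weight damps volume without crushing small rows. A 16-span value at 100% still scores well above zero, but it cannot outrank a 642-span value with the same rate.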

Tests

Engine: 11 new test cases for the rate metric + variance score + uniform-attribute filtering + tautology handling.
Queries: new test for top-level column resolution (name, kind).
Attribute list: explicit asserts that the tautological attrs are excluded.
107/107 total passing.

Test plan

npx tsc --noEmit — clean
npm run lint — 0 errors
npm test — 107/107 passing
npm run deploy — packed + uploaded + provisioned on staging
Manual validation with paymentFailure + productCatalogFailure scenarios

🤖 Generated with Claude Code

After the scoped-baseline fix landed, manual feedback was: "I see one thing, rpc.grpc.status_code, and I still don't understand how to read this or how it helps me find things." The chart-only design hid the actual information (value names, counts, percentages) behind a click. Visual asymmetry told users that something differed without telling them what, or what to do about it.

Each attribute card now leads with words and numbers; the chart is a scanning aid, not the primary readout.
- TL;DR headline sentence above the chart. It picks the strongest single value (largest |diff|) and writes it out: "Selection over-represents `13` by +100.0 pp (144 sel vs 0 base)." It reads cold, without translating bars.
- Inline value rows below the chart (top 3 by default), with per-row counts, percentages, and a dedicated "Search →" button that drills into matching spans. No click is required to see the substance.
- The per-value Search button is explicit and single-purpose, so the action is unambiguous.
- Plain-English legend at the panel bottom: "selection (the spans you're investigating)" / "baseline (what they're being compared against)". The earlier "sel" / "base" jargon assumed background the novice user doesn't have.
- "Show N more values" toggle for the long tail (>3 values).

Same engine, same scoped-baseline data; purely a readability change. SpotlightSection (Errors, Service Detail) inherits it.

Validated on staging with paymentFailure 50%:
- The rpc.grpc.status_code differential now reads "Selection under-represents `0` by -100.0 pp (0 sel vs 905 base)", with the two value rows visible inline (`0` -100pp, `13` +100pp), each with its Search button.

Pre-merge: tsc clean, lint 0 errors, 104/104 unit tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
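The headline in that earlier design can be sketched as "pick the value with the largest share difference, in percentage points, and phrase it". The names `ShareCounts` and `headline` are hypothetical. Shares are assumed to be each value's count divided by the selection or baseline total, and ties go to the first value encountered.

```typescript
// Hypothetical per-value counts for the share-differential design.
interface ShareCounts {
  value: string;
  sel: number;  // selection spans with this value
  base: number; // baseline spans with this value
}

// Builds the TL;DR sentence for the value with the largest |share diff|.
function headline(counts: readonly ShareCounts[]): string | null {
  const selTotal = counts.reduce((s, c) => s + c.sel, 0);
  const baseTotal = counts.reduce((s, c) => s + c.base, 0);
  if (selTotal === 0 || baseTotal === 0) return null;

  let best: ShareCounts | undefined;
  let bestDiff = 0;
  for (const c of counts) {
    // Difference in share, expressed in percentage points.
    const diffPp = (c.sel / selTotal - c.base / baseTotal) * 100;
    if (best === undefined || Math.abs(diffPp) > Math.abs(bestDiff)) {
      best = c;
      bestDiff = diffPp;
    }
  }
  if (best === undefined) return null;

  const verb = bestDiff >= 0 ? "over-represents" : "under-represents";
  const sign = bestDiff >= 0 ? "+" : ""; // negative values carry their own "-"
  return `Selection ${verb} \`${best.value}\` by ${sign}${bestDiff.toFixed(1)} pp (${best.sel} sel vs ${best.base} base).`;
}
```

This sentence still makes the reader reconcile two shares, which is the friction that motivated the later switch to a single rate per value.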

After three iterations of share-differential visualizations, the feedback stayed the same: "I don't understand what this is showing me or how to use it to find things." The Operations table on Service Detail shows "PaymentService/Charge: 14% error rate": one number per row, instantly readable. Spotlight was making the user mentally combine sel share / base share / diff to recover the same insight. Wrong primitive.

Pivot Spotlight to use the SAME primitive as the Operations table, generalized to every attribute: per-value selection rate. For each attribute value:
- selN: spans with this value that ARE in the selection
- baseN: spans with this value that are NOT in the selection
- total: selN + baseN
- selectionRate: selN / total (the headline metric)

The card renders one row per value: the value label, a horizontal bar whose width IS the rate, the percentage, an "N total · X errors" caption, and a per-value Search → button. Above-average rows get an accent tint; below-average rows are muted. No histograms, no overlaid distributions, no sel/base/diff jargon.

Selection wording is context-aware via the new `selectionNoun` prop: "98% errors" on the Service Detail / Errors pages instead of the generic "98% selection rate."

Score = max over rows of |row.rate - overall.rate| * log1p(total). The L∞ norm with volume weighting captures the case where a single tiny-but-extreme value is the real signal: the "one pod is broken" pattern that volume-weighted variance under-counted.

Added `name` (and `kind`) as queryable attributes via a new attrValueExpr() helper. The OTel span operation name lives in a top-level column, not under attributes['name']; without this fix Spotlight was blind to "which operation is failing," the most natural Service Detail differentiator.

Result on staging with paymentFailure 50%:
- Expanding frontend / POST /api/checkout on the Errors page shows http.status_code with rows "500: 50% (1,610 total · 805 errors)" and "200: 0% (1,636 total · 0 errors)"; bars + percentages make the insight immediate.
- Service Detail on payment shows `name` with the failing op (PaymentService/Charge) called out with its error rate.

Engine rewritten; SpotlightHistogram component removed (the chart was the wrong primitive; bars are direct). 11 new engine test cases cover the rate metric + variance score; the existing 95 still pass. Total: 106/106.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
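The `name`/`kind` fix amounts to routing a few keys to bare span columns and sending everything else through the attributes map. Here is a hypothetical sketch. `attrValueExprSketch` is not the PR's helper, and the real signature and query dialect are not shown here. The bracket syntax only mirrors the attributes['name'] form quoted in the commit text.

```typescript
// Keys that live as top-level span columns rather than in attributes.
// Only name and kind are mentioned in this PR.
const TOP_LEVEL_SPAN_COLUMNS: ReadonlySet<string> = new Set(["name", "kind"]);

// Hypothetical: returns the query expression that reads an attribute's value.
// Assumes attribute keys contain no single quotes (no escaping is done).
function attrValueExprSketch(attr: string): string {
  if (TOP_LEVEL_SPAN_COLUMNS.has(attr)) {
    return attr; // bare column, e.g. the OTel span operation name
  }
  return `attributes['${attr}']`;
}
```

Without the special case, a lookup for `name` would resolve to attributes['name'], which the text says is not where the operation name is stored.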

Drop response-side attrs that 1:1 correlate with the error selection and produce uninformative 100% bars:
- rpc.grpc.status_code
- http.response.status_code, http.status_code
- response_flags, error.type, exception.type

Drop the SVC_SPOTLIGHT_ATTRS curated list on Service Detail. The original rationale was avoiding the cluster's 20-concurrent-job ceiling, but the concurrency-cap-at-4 semaphore in api/search.ts already handles that. The curated list was leaving real signal on the floor: input-side attributes like app.product.id that surface the actual cause of scenario-driven failures. Service Detail (page-level + per-op expansion) now scans the same SPOTLIGHT_ATTRIBUTES set as Errors and Traces.

Tests updated: the SPOTLIGHT_ATTRIBUTES content asserts now check for request-shape attrs and explicitly assert that the tautological attrs are excluded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Item #2 (Faceted trace search) is now DONE: folded into the v0.9.0 Faceted-search + Spotlight thread (PRs #46–#55) with a short summary block under Completed.
- Item #13 (Settings page cleanup) is DONE: PR #56 (this branch).
- Renumbered items 3–12 down by one to close the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (ROADMAP #13) (#56)

* feat(settings): reorganize page into grouped sections with sticky nav

ROADMAP item #13. Pure UX rearrangement: no new features, no schema or query changes.

The page had grown to 8 sections in chronological-by-landing order rather than mental-model order. Setup actions were buried at the bottom, so first-time installers had to scroll past 5 tuning sections to find provisioning. There was no section nav and no setup-status summary. Trace originators (a read-only audit table) was mixed in with the config-action sections. Noise filters and Error filtering were separated by Dataset even though they are the same kind of "shape what counts as an error" tuning.

What's now in place:
- Setup status card at the top: two checkmark rows (Scheduled searches / Dataset acceleration) showing live provisioning state. Green left border when both are OK; otherwise it shows what's missing with a "Jump to X" anchor button. Reuses the same planOnly + getDatasetStatus checks the global ProvisioningBanners use.
- Two-column layout: a sticky left-rail section nav (with IntersectionObserver-driven active-link highlight) and the content cards on the right. Collapses to a single column below 960px.
- Sections grouped by purpose with group headings:
  - Setup: Provisioning, Dataset acceleration (TOP).
  - Workspace: Dataset, Detection cadence, Notification targets.
  - Filtering & heuristics: Noise filters, Error filtering.
  - Diagnostics: Trace originators (collapsed by default).
- Setup panels moved from the very bottom to the top; the duplicate at the bottom is removed.
- Trace originators collapsed by default, since operators rarely look at the classification audit; the section title is now a toggle.

New files: SettingsSetupStatus.tsx (+css), SettingsNav.tsx (+css).

Pre-merge: tsc clean, lint 0 errors, 107/107 unit tests, build succeeds, deployed + visually validated on staging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(roadmap): mark Faceted search + Settings cleanup as shipped

- Item #2 (Faceted trace search) is now DONE: folded into the v0.9.0 Faceted-search + Spotlight thread (PRs #46–#55) with a short summary block under Completed.
- Item #13 (Settings page cleanup) is DONE: PR #56 (this branch).
- Renumbered items 3–12 down by one to close the gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
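The Setup status card's decision reduces to a small pure function. It either shows the green "all OK" state or lists each missing step with a "Jump to X" target. This sketch is hypothetical. The field names, labels, and anchor ids are illustrative, and the real card derives its two booleans from the planOnly + getDatasetStatus checks, which are not modeled here.

```typescript
// Hypothetical inputs: one boolean per checkmark row on the card.
interface SetupChecks {
  scheduledSearches: boolean;
  datasetAcceleration: boolean;
}

interface MissingStep {
  label: string;  // row label shown on the card
  anchor: string; // hypothetical id of the section a "Jump to" button targets
}

interface SetupStatus {
  allOk: boolean; // true: render the green left border
  missing: MissingStep[];
}

function setupStatus(checks: SetupChecks): SetupStatus {
  const rows = [
    { label: "Scheduled searches", anchor: "#provisioning", ok: checks.scheduledSearches },
    { label: "Dataset acceleration", anchor: "#dataset-acceleration", ok: checks.datasetAcceleration },
  ];
  const missing = rows
    .filter((r) => !r.ok)
    .map(({ label, anchor }): MissingStep => ({ label, anchor }));
  return { allOk: missing.length === 0, missing };
}
```

Keeping this logic free of React and DOM code makes it unit-testable on its own. The component then only maps `missing` to buttons.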

Faceted trace search + Spotlight + Settings page reorganization.
- Faceted-nav data layer + UI primitives (#46, #47)
- Search-page integration with Spotlight rail (#48, #49)
- Spotlight on Errors page + Service Detail (#50, #51, #53)
- Small-multiples / readable-card / rate-bar Spotlight redesigns driven by manual validation feedback (#52, #54, #55)
- Settings page reorganization with Setup status card, sticky nav, and grouped sections (#56)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coccyx changed the base branch from feat/spotlight-readable-results to master May 29, 2026 20:55

coccyx and others added 3 commits May 29, 2026 13:56

coccyx force-pushed the feat/spotlight-error-rate branch from d721984 to be74fdb May 29, 2026 20:56

coccyx changed the title from "feat(spotlight): pivot to error-rate-per-value (same primitive as Operations table)" to "feat(spotlight): pivot to error-rate-per-value + drop tautological attrs" May 29, 2026

coccyx merged commit 11fc7f2 into master May 29, 2026
3 checks passed

coccyx deleted the feat/spotlight-error-rate branch May 29, 2026 20:57

coccyx merged 3 commits into master from feat/spotlight-error-rate
