fix(alerts): use stable alert_id per service so resolutions emit by coccyx · Pull Request #58 · criblio/apm

coccyx · 2026-05-30T04:25:05Z

Symptom

Alert Timeline / Detected Issues showed every firing alert as ongoing forever. After turning off all flagd scenarios, alerts didn't clear; querying the `criblapm_alert` history feed for the last 24h returned 30 `firing` events and zero `resolved` events.

Root cause

The alert state machine produced firing events but the resolving→ok walk never ran, so no `resolved` events were emitted. Reason: the alert_id was built from `(signal_type, svc)` — e.g. `auto:error_rate:payment`. When the service recovers, `signal_type` rolls back to `"none"`, so the next eval cycle uses a different alert_id (`auto:none:payment`), gets no match in the `criblapm_alert_states` lookup, and defaults `prev_status="ok"`. The state machine sees "good cycle from ok" instead of "good cycle from firing" — never transitions through resolving.
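The key mismatch can be reproduced in a few lines. This is a minimal sketch, not the app's real query: the `Map` stands in for the `criblapm_alert_states` lookup, and `rotatingAlertId` and `prevStatus` are hypothetical helper names. Only the id format and the `"ok"` default come from the description above.

```typescript
// Hypothetical names throughout; only the id format and the "ok" default
// are taken from the PR description.
type Status = "ok" | "firing" | "resolving";

// Old scheme: the signal type is part of the key.
function rotatingAlertId(signalType: string, svc: string): string {
  return `auto:${signalType}:${svc}`;
}

// Stand-in for the criblapm_alert_states lookup.
const states = new Map<string, Status>();

function prevStatus(alertId: string): Status {
  // A lookup miss defaults to "ok".
  return states.get(alertId) ?? "ok";
}

// Cycle 1: payment is failing, so its row is stored under the error_rate key.
const firingId = rotatingAlertId("error_rate", "payment");
states.set(firingId, "firing");

// Cycle 2: payment recovered, signal_type is "none", and the key changes.
const recoveredId = rotatingAlertId("none", "payment");
const seenOnRecovery = prevStatus(recoveredId);

console.log(firingId, recoveredId, seenOnRecovery);
// auto:error_rate:payment auto:none:payment ok
```

The recovery cycle reads `"ok"` rather than `"firing"`, so nothing tells the state machine that an alert was ever open for this service.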

Compounding it: the state export uses `mode=overwrite`, so the old `auto:error_rate:payment` row gets wiped before any subsequent eval could recover it.
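A rough model of why the orphaned row cannot be recovered later: here `mode=overwrite` is assumed to mean "the lookup becomes exactly the rows produced by the current cycle". `StateRow` and `exportOverwrite` are hypothetical names used only for illustration.

```typescript
// Sketch under the assumption that overwrite replaces the whole lookup
// with the current cycle's rows. All names are illustrative.
type Status = "ok" | "firing" | "resolving";

interface StateRow {
  alertId: string;
  status: Status;
}

// The previous contents are ignored entirely: that is the overwrite behaviour.
function exportOverwrite(_previous: StateRow[], current: StateRow[]): StateRow[] {
  return [...current];
}

// Cycle 1 writes the firing row under the error_rate key.
const afterCycle1 = exportOverwrite([], [
  { alertId: "auto:error_rate:payment", status: "firing" },
]);

// Cycle 2 (service recovered) only produces a row under the "none" key.
const afterCycle2 = exportOverwrite(afterCycle1, [
  { alertId: "auto:none:payment", status: "ok" },
]);

const orphanSurvives = afterCycle2.some(
  (row) => row.alertId === "auto:error_rate:payment",
);
console.log(orphanSurvives); // false
```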

(Original idea was "Option B" — union the eval with leftover rows from the lookup and walk them through. Discarded after MCP testing showed the overwrite-mode export deletes the orphan before the next cycle can find it.)

Fix

`alert_id` is now a stable per-service key:

```diff
- | extend alert_id=strcat("auto:", signal_type, ":", svc)
+ | extend alert_id=strcat("auto:health:", svc)
```

`signal_type` is still recorded as a column for UI display, but it doesn't participate in the key. The same service can't simultaneously fire on multiple service-level signals anyway, so coalescing all health alerts under one stable id matches how users think: "payment is unhealthy", where the type of unhealth is metadata, not identity.
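With a stable key, the same lookup row is read back on every cycle, so the walk back to `ok` can complete. The sketch below assumes a simplified three-state machine (`ok`, `firing`, `resolving`) that emits `firing` on the `ok` to `firing` edge and `resolved` on the `resolving` to `ok` edge. The real transition rules live in the saved search and may differ; `healthAlertId` and `step` are hypothetical names.

```typescript
// Sketch only: the transition rules and all names are assumptions,
// not the app's real query.
type Status = "ok" | "firing" | "resolving";
type Transition = "firing" | "resolved" | "";

// New scheme from this PR: one stable key per service.
function healthAlertId(svc: string): string {
  return `auto:health:${svc}`;
}

// Assumed simplified walk: ok -> firing -> resolving -> ok.
function step(
  prev: Status,
  healthy: boolean,
): { next: Status; transitionedTo: Transition } {
  if (!healthy) {
    // Only a fresh ok -> firing edge emits a "firing" event.
    return { next: "firing", transitionedTo: prev === "ok" ? "firing" : "" };
  }
  if (prev === "firing") return { next: "resolving", transitionedTo: "" };
  if (prev === "resolving") return { next: "ok", transitionedTo: "resolved" };
  return { next: "ok", transitionedTo: "" };
}

const states = new Map<string, Status>();
const emitted: Transition[] = [];

// Four eval cycles for "payment": two bad, then two good.
for (const signalType of ["error_rate", "error_rate", "none", "none"]) {
  const id = healthAlertId("payment"); // same key regardless of signalType
  const prev = states.get(id) ?? "ok";
  const { next, transitionedTo } = step(prev, signalType === "none");
  states.set(id, next);
  if (transitionedTo !== "") emitted.push(transitionedTo);
}

console.log(emitted); // [ 'firing', 'resolved' ]
```

Because the row is rewritten under the same id each cycle, an overwrite-style export no longer drops the state the next cycle needs.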

Latency alerts keep their compound id `auto:latency:svc:op` — those need per-op granularity.
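For contrast, a sketch of the latency key, which keeps the operation in the id so that two operations on the same service are tracked independently. `latencyAlertId` is a hypothetical helper and the operation names are made up; only the `auto:latency:svc:op` shape comes from the text above.

```typescript
// Hypothetical helper; only the "auto:latency:svc:op" shape is from the PR.
function latencyAlertId(svc: string, op: string): string {
  return `auto:latency:${svc}:${op}`;
}

// Illustrative operation names, not taken from the real service.
const chargeId = latencyAlertId("payment", "charge");
const refundId = latencyAlertId("payment", "refund");

console.log(chargeId); // auto:latency:payment:charge
console.log(chargeId === refundId); // false
```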

Effects

- New firing/resolved cycles now emit both transitions cleanly.
- Old phantom firing events already in the `otel` dataset (under the rotating id scheme) will age out of the Alert Timeline lookback as their start times fall outside the visible window. Lookup orphans are inert (no eval will ever produce them again).
- Server-side verified: queried saved-search content via MCP, confirmed the rendered query now reads `alert_id=strcat("auto:health:", svc)`.

Test plan

- `npx tsc --noEmit`: clean
- `npm run lint`: 0 errors
- `npm test`: 107/107 passing
- `npm run deploy`: packed + uploaded + provisioned; saved-search updated
- MCP probe of the deployed saved-search confirms new alert_id pattern is in effect

🤖 Generated with Claude Code

Same methodology as the 2026-05-19 baseline:
- 100 most-recent scheduled job runs via cribl_getSearchJobs
- tests/baseline-ui-timing.spec.ts, one consecutive run

Headlines:
- Two previously-failing searches recovered cleanly:
  - criblapm__attr_catalog (0/2 → 2/2)
  - criblapm__home_alerts_prev (0/3 → 3/3); alerts no longer silently running against stale criblapm_alert_prev data.
- Queue waits collapsed 70-94% across the board. op_baselines was 184 s p50; now 14.8 s.
- UI page latencies dropped 51-72% on the live-query pages. Services list: 35.3 s → 10.0 s. Service Detail: 33.2 s → 9.4 s. Errors: 19.0 s → 9.3 s. Cache-hitting pages (Home, System Architecture) had less room and moved <10%.

Full tables appended to the perf-baseline session log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Alert Timeline showed every firing alert as ongoing forever because the alert state machine produced "firing" events to the otel dataset but never the matching "resolved" event.

Root cause: the alert_id was built from (signal_type, svc), e.g. "auto:error_rate:payment". When the service recovers, signal_type rolls back to "none", so the new alert_id "auto:none:payment" points at a fresh row in the criblapm_alert_states lookup. The state machine sees prev_status="ok" (no entry), not "firing", so the resolving→ok walk never fires and transitioned_to stays "".

Compounding it: the state export uses `mode=overwrite`, so the old "auto:error_rate:payment" row is wiped every cycle. There's no orphan-recovery hook to walk a stale row toward "ok".

Fix: alert_id now uses a stable per-service key ("auto:health:svc"). signal_type is still recorded as a column for UI display, but it doesn't participate in the key. Same service can't simultaneously fire on multiple signals anyway, so coalescing all health-related alerts under one stable id matches how users think about it ("payment is unhealthy": the type of unhealth is metadata, not identity).

Latency alerts keep their (svc, op) compound id since those need per-op granularity.

Effects:
- New firing/resolved cycles now emit both transitions cleanly.
- Old phantom firing events in the otel dataset (already emitted under the rotating-id scheme) will age out of the Alert Timeline lookback as they fall outside the visible window; the lookup orphans are inert.
- Verified server-side: queried saved-search content via MCP, confirmed the rendered query now reads `alert_id=strcat("auto:health:", svc)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coccyx and others added 2 commits May 29, 2026 19:53

coccyx merged commit 195ab0d into master May 30, 2026
3 checks passed

coccyx deleted the fix/alert-resolve-orphans branch May 30, 2026 21:20

coccyx merged 2 commits into master from fix/alert-resolve-orphans
