fix(alerts): use stable alert_id per service so resolutions emit #58

Merged: coccyx merged 2 commits into master from fix/alert-resolve-orphans on May 30, 2026
coccyx commented May 30, 2026

Symptom

Alert Timeline / Detected Issues showed every firing alert as ongoing forever. After turning off all flagd scenarios, alerts didn't clear; querying the `criblapm_alert` history feed for the last 24h returned 30 `firing` events and zero `resolved` events.

Root cause

The alert state machine produced firing events, but the resolving→ok walk never ran, so no `resolved` events were emitted. The cause is that the alert_id was built from `(signal_type, svc)`, e.g. `auto:error_rate:payment`. When the service recovers, `signal_type` rolls back to `"none"`, so the next eval cycle uses a different alert_id (`auto:none:payment`), finds no match in the `criblapm_alert_states` lookup, and defaults to `prev_status="ok"`. The state machine then sees "good cycle from ok" instead of "good cycle from firing", so it never transitions through resolving.
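To make the lookup miss concrete, here is a minimal Python sketch of the old key scheme. It is an illustration only: the `states` dict is a hypothetical stand-in for the `criblapm_alert_states` lookup, and `rotating_alert_id` is a made-up helper that mirrors the old `strcat` expression rather than the actual query.

```python
def rotating_alert_id(signal_type: str, svc: str) -> str:
    """Old scheme: signal_type is part of the key (mirrors the old strcat)."""
    return f"auto:{signal_type}:{svc}"


# Hypothetical stand-in for the criblapm_alert_states lookup: alert_id -> status.
firing_id = rotating_alert_id("error_rate", "payment")
states = {firing_id: "firing"}

# Next eval cycle: the service has recovered, so signal_type is "none".
recovered_id = rotating_alert_id("none", "payment")

# The recovered id is a different key, so the lookup misses and
# prev_status falls back to the default "ok" instead of "firing".
prev_status = states.get(recovered_id, "ok")

print(firing_id, recovered_id, prev_status)
```

The firing row is still present under its old key, but nothing in the recovered cycle ever asks for it.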

Compounding it: the state export uses `mode=overwrite`, so the old `auto:error_rate:payment` row gets wiped before any subsequent eval could recover it.

(Original idea was "Option B" — union the eval with leftover rows from the lookup and walk them through. Discarded after MCP testing showed the overwrite-mode export deletes the orphan before the next cycle can find it.)

Fix

`alert_id` is now a stable per-service key:

```diff
- | extend alert_id=strcat("auto:", signal_type, ":", svc)
+ | extend alert_id=strcat("auto:health:", svc)
```

`signal_type` is still recorded as a column for UI display, but it no longer participates in the key. The same service can't simultaneously fire on multiple service-level signals anyway, so coalescing all health alerts under one stable id matches how users think ("payment is unhealthy": the type of unhealth is metadata, not identity).

Latency alerts keep their compound id `auto:latency:svc:op` — those need per-op granularity.
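The difference between the two key schemes can be sketched with a small simulation. This is not the saved search itself: the transition rules in `step` (one good cycle moves firing to resolving, the next moves resolving to ok and emits `resolved`) are an assumption made for illustration, and `run`, `rotating_id`, and `stable_id` are hypothetical names. The dict reassignment stands in for the `mode=overwrite` export.

```python
from typing import Callable, Dict, List, Tuple


def rotating_id(signal_type: str, svc: str) -> str:
    """Old scheme: the key changes whenever signal_type changes."""
    return f"auto:{signal_type}:{svc}"


def stable_id(signal_type: str, svc: str) -> str:
    """New scheme: signal_type is ignored, one health key per service."""
    return f"auto:health:{svc}"


def step(prev_status: str, is_bad: bool) -> Tuple[str, str]:
    """Assumed transitions; returns (new_status, emitted_event or "")."""
    if is_bad:
        # Emit "firing" only on the transition into firing.
        return "firing", ("" if prev_status == "firing" else "firing")
    if prev_status == "firing":
        return "resolving", ""
    if prev_status == "resolving":
        return "ok", "resolved"
    return "ok", ""


def run(make_id: Callable[[str, str], str], svc: str, signals: List[str]) -> List[str]:
    states: Dict[str, str] = {}  # stand-in for the alert-states lookup
    events: List[str] = []
    for signal_type in signals:
        alert_id = make_id(signal_type, svc)
        prev_status = states.get(alert_id, "ok")  # a miss defaults to "ok"
        status, event = step(prev_status, signal_type != "none")
        if event:
            events.append(event)
        # Overwrite-style export: only this cycle's row survives.
        states = {alert_id: status}
    return events


signals = ["error_rate", "none", "none"]
old_events = run(rotating_id, "payment", signals)
new_events = run(stable_id, "payment", signals)
print(old_events, new_events)
```

Under these assumptions the rotating key produces a `firing` event with no matching `resolved`, while the stable key walks firing, resolving, ok and emits both.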

Effects

  • New firing/resolved cycles now emit both transitions cleanly.
  • Old phantom firing events already in the `otel` dataset (under the rotating id scheme) will age out of the Alert Timeline lookback as their start times fall outside the visible window. Lookup orphans are inert (no eval will ever produce them again).
  • Server-side verified: queried saved-search content via MCP, confirmed the rendered query now reads `alert_id=strcat("auto:health:", svc)`.

Test plan

  • `npx tsc --noEmit` — clean
  • `npm run lint` — 0 errors
  • `npm test` — 107/107 passing
  • `npm run deploy` — packed + uploaded + provisioned; saved-search updated
  • MCP probe of the deployed saved-search confirms new alert_id pattern is in effect

🤖 Generated with Claude Code

coccyx and others added 2 commits May 29, 2026 19:53
Same methodology as the 2026-05-19 baseline:
- 100 most-recent scheduled job runs via cribl_getSearchJobs
- tests/baseline-ui-timing.spec.ts, one consecutive run

Headlines:

- Two previously-failing searches recovered cleanly:
  - criblapm__attr_catalog (0/2 → 2/2)
  - criblapm__home_alerts_prev (0/3 → 3/3); alerts no longer
    silently running against stale criblapm_alert_prev data.

- Queue waits collapsed 70-94% across the board. op_baselines
  was 184 s p50; now 14.8 s.

- UI page latencies dropped 51-72% on the live-query pages.
  Services list: 35.3 s → 10.0 s. Service Detail: 33.2 s →
  9.4 s. Errors: 19.0 s → 9.3 s. Cache-hitting pages
  (Home, System Architecture) had less room and moved <10%.

Full tables appended to the perf-baseline session log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Alert Timeline showed every firing alert as ongoing forever
because the alert state machine produced "firing" events to the
otel dataset but never the matching "resolved" event.

Root cause: the alert_id was built from (signal_type, svc) — e.g.
"auto:error_rate:payment". When the service recovers, signal_type
rolls back to "none", so the new alert_id "auto:none:payment"
points at a fresh row in the criblapm_alert_states lookup. The
state machine sees prev_status="ok" (no entry), not "firing", so
the resolving→ok walk never fires and transitioned_to stays "".

Compounding it: the state export uses `mode=overwrite`, so the
old "auto:error_rate:payment" row is wiped every cycle. There's
no orphan-recovery hook to walk a stale row toward "ok".

Fix: alert_id now uses a stable per-service key
("auto:health:svc"). signal_type is still recorded as a column
for UI display, but it doesn't participate in the key. Same
service can't simultaneously fire on multiple signals anyway, so
coalescing all health-related alerts under one stable id matches
how users think about it ("payment is unhealthy" — the type of
unhealth is metadata, not identity).

Latency alerts keep their (svc, op) compound id since those need
per-op granularity.

Effects:
- New firing/resolved cycles now emit both transitions cleanly.
- Old phantom firing events in the otel dataset (already
  emitted under the rotating-id scheme) will age out of the
  Alert Timeline lookback as they fall outside the visible
  window; the lookup orphans are inert.
- Verified server-side: queried saved-search content via MCP,
  confirmed the rendered query now reads
  `alert_id=strcat("auto:health:", svc)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coccyx coccyx merged commit 195ab0d into master May 30, 2026
3 checks passed
@coccyx coccyx deleted the fix/alert-resolve-orphans branch May 30, 2026 21:20