fix(alerts): use stable alert_id per service so resolutions emit #58
Merged
Same methodology as the 2026-05-19 baseline:
- 100 most-recent scheduled job runs via cribl_getSearchJobs
- tests/baseline-ui-timing.spec.ts, one consecutive run
Headlines:
- Two previously-failing searches recovered cleanly:
  - criblapm__attr_catalog (0/2 → 2/2)
  - criblapm__home_alerts_prev (0/3 → 3/3); alerts no longer
    silently running against stale criblapm_alert_prev data.
- Queue waits collapsed 70-94% across the board. op_baselines
  was 184 s p50; now 14.8 s.
- UI page latencies dropped 51-72% on the live-query pages.
  Services list: 35.3 s → 10.0 s. Service Detail: 33.2 s →
  9.4 s. Errors: 19.0 s → 9.3 s. Cache-hitting pages
  (Home, System Architecture) had less room and moved <10%.
Full tables appended to the perf-baseline session log.
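
The quoted percentages follow directly from the before/after figures above. As a quick sanity check, here is a minimal Python sketch; the helper name `pct_drop` is made up for illustration and is not part of the test suite.

```python
def pct_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

# Before/after values in seconds, copied from the headlines above.
measurements = {
    "op_baselines queue wait (p50)": (184.0, 14.8),
    "Services list": (35.3, 10.0),
    "Service Detail": (33.2, 9.4),
    "Errors": (19.0, 9.3),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {pct_drop(before, after):.1f}% lower")
```

The three page figures land inside the stated 51-72% band, and the op_baselines queue wait inside the 70-94% band.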
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Alert Timeline showed every firing alert as ongoing forever
because the alert state machine produced "firing" events to the
otel dataset but never the matching "resolved" event.
Root cause: the alert_id was built from (signal_type, svc) — e.g.
"auto:error_rate:payment". When the service recovers, signal_type
rolls back to "none", so the new alert_id "auto:none:payment"
points at a fresh row in the criblapm_alert_states lookup. The
state machine sees prev_status="ok" (no entry), not "firing", so
the resolving→ok walk never fires and transitioned_to stays "".
Compounding it: the state export uses `mode=overwrite`, so the
old "auto:error_rate:payment" row is wiped every cycle. There's
no orphan-recovery hook to walk a stale row toward "ok".
Fix: alert_id now uses a stable per-service key
("auto:health:<svc>"). signal_type is still recorded as a column
for UI display, but it doesn't participate in the key. Same
service can't simultaneously fire on multiple signals anyway, so
coalescing all health-related alerts under one stable id matches
how users think about it ("payment is unhealthy" — the type of
unhealth is metadata, not identity).
Latency alerts keep their (svc, op) compound id since those need
per-op granularity.
Effects:
- New firing/resolved cycles now emit both transitions cleanly.
- Old phantom firing events in the otel dataset (already
emitted under the rotating-id scheme) will age out of the
Alert Timeline lookback as they fall outside the visible
window; the lookup orphans are inert.
- Verified server-side: queried saved-search content via MCP,
confirmed the rendered query now reads
`alert_id=strcat("auto:health:", svc)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom
Alert Timeline / Detected Issues showed every firing alert as ongoing forever. After turning off all flagd scenarios, alerts didn't clear; querying the `criblapm_alert` history feed for the last 24h returned 30 `firing` events and zero `resolved` events.
Root cause
The alert state machine produced firing events, but the resolving→ok walk never ran, so no `resolved` events were emitted. Reason: the alert_id was built from `(signal_type, svc)`, e.g. `auto:error_rate:payment`. When the service recovers, `signal_type` rolls back to `"none"`, so the next eval cycle uses a different alert_id (`auto:none:payment`), gets no match in the `criblapm_alert_states` lookup, and defaults to `prev_status="ok"`. The state machine sees "good cycle from ok" instead of "good cycle from firing", so it never transitions through resolving.
Compounding it: the state export uses `mode=overwrite`, so the old `auto:error_rate:payment` row gets wiped before any subsequent eval could recover it.
(Original idea was "Option B" — union the eval with leftover rows from the lookup and walk them through. Discarded after MCP testing showed the overwrite-mode export deletes the orphan before the next cycle can find it.)
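
The failure mode is easier to see in a toy model. The Python sketch below is not the saved search itself: the function names, the in-memory dict standing in for the `criblapm_alert_states` lookup, and the rule that `resolved` is emitted on the resolving→ok step are all assumptions made for illustration. It only aims to reproduce the two behaviours described above, a key that rotates with `signal_type` and an export that overwrites the whole lookup each cycle.

```python
def alert_id(svc: str, signal_type: str, stable: bool) -> str:
    # Rotating scheme: the id changes whenever signal_type changes.
    # Stable scheme: one id per service, regardless of signal_type.
    return f"auto:health:{svc}" if stable else f"auto:{signal_type}:{svc}"

def run_cycles(signals, svc="payment", stable=False):
    """Replay one eval cycle per entry in `signals` ("none" = healthy).

    Returns the list of transition events emitted along the way.
    """
    states = {}   # hypothetical stand-in for the alert-states lookup
    events = []
    for signal_type in signals:
        aid = alert_id(svc, signal_type, stable)
        prev = states.get(aid, "ok")      # no matching row defaults to "ok"
        if signal_type != "none":
            status = "firing"
        elif prev == "firing":
            status = "resolving"          # first good cycle after firing
        else:
            status = "ok"
        if status == "firing" and prev != "firing":
            events.append("firing")
        elif prev == "resolving" and status == "ok":
            events.append("resolved")     # assumed emission point
        # mode=overwrite: the export replaces the lookup with this cycle's rows,
        # so any row written under a different id is gone next cycle.
        states = {aid: status}
    return events

signals = ["error_rate", "error_rate", "none", "none"]
print(run_cycles(signals, stable=False))  # rotating id
print(run_cycles(signals, stable=True))   # stable id
```

With the rotating id, the recovery cycles look up `auto:none:payment`, find nothing, and start from "ok", so only the `firing` event is emitted. With the stable id, the same input walks firing → resolving → ok and emits both events.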
Fix
`alert_id` is now a stable per-service key, `strcat("auto:health:", svc)`:
```diff
```
`signal_type` is still recorded as a column for UI display, but it doesn't participate in the key. The same service can't simultaneously fire on multiple service-level signals anyway, so coalescing all health alerts under one stable id matches how users think about it ("payment is unhealthy": the type of unhealth is metadata, not identity).
Latency alerts keep their compound id `auto:latency:svc:op` — those need per-op granularity.
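
As a rough illustration of the two key shapes, here is a small Python sketch. The helper names and the `charge` operation are hypothetical; the real ids are built inside the saved-search query, not in Python.

```python
def health_alert_id(svc: str) -> str:
    # Service-level health alerts: signal_type is deliberately absent,
    # so the id stays the same while the service moves between signals.
    return f"auto:health:{svc}"

def latency_alert_id(svc: str, op: str) -> str:
    # Latency alerts stay keyed per (service, operation).
    return f"auto:latency:{svc}:{op}"

# signal_type travels alongside the id as metadata, not as part of it.
row = {
    "alert_id": health_alert_id("payment"),
    "signal_type": "error_rate",
}
print(row["alert_id"])
print(latency_alert_id("payment", "charge"))
```

Keeping `signal_type` out of the health key means a recovery cycle looks up the same row that the firing cycle wrote, while two slow operations on one service still get separate latency alerts.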
Effects
Test plan
🤖 Generated with Claude Code