[BUG] Webhook (Teams/Slack) alerts re-fire after an app restart - #981 restart-dedup was only applied to the email channel (#1145)

@gotqn

Component

Lite

Performance Monitor Version

3.0.0

SQL Server Version

SQL Server 2022

Windows Version

Windows Server 2019

Describe the Bug

Restarting Lite re-sends a deadlock/blocking webhook alert that was already delivered before the restart, producing a duplicate Teams/Slack message with an identical fingerprint (same Dedup Key, same Occurrences). The restart-dedup added in #981 protects the email channel but was never applied to the webhook channel, and the edge-trigger watermark from #1091 is in-memory only — so a restart clears both guards that would normally suppress the re-send.

Rare in practice (you don't normally restart the app), but it can leak duplicate signals into downstream automation (ticketing, paging), so filing for completeness.

Steps to Reproduce

  1. Generate several deadlocks over the same object set so an alert fires. (I got a card with Occurrences: 4, Dedup Key: 37486afbc3…07bd3a15, at 16:10 local.)
  2. Let the alert deliver to Teams.
  3. Close and reopen Lite (the deadlocks are still within the 1-hour lookback window).
  4. On the first post-restart sweep, the same card is posted again (16:27 local) — identical Dedup Key and Occurrences: 4.

Expected Behavior

An alert already delivered shortly before a restart should not be re-delivered after the restart - i.e. the same guarantee #981 gives email.

Actual Behavior

The webhook is re-posted with the identical fingerprint.

Additional Context

Root cause
Two independent guards normally prevent the re-send; a restart clears both, and neither is rebuilt from persisted state for the webhook channel.

  1. Edge-trigger watermark is in-memory (#1091, "[BUG] Deadlock alerts re-fire on the rolling 1-hour count every cooldown"). _lastAlertedDeadlockCount / _lastAlertedBlockingCount are plain in-memory dictionaries on the window with no load/save (Lite/MainWindow.xaml.cs:98-99), so on restart the watermark defaults to 0. The alert check re-reads the last hour of deadlocks (GetRecentDeadlocksAsync(hoursBack: 1)), and RollingCountAlertGate.Evaluate(currentCount: 4, threshold: 1, watermark: 0, …) returns Fire = true because 4 > 0 (Lite/Services/RollingCountAlertGate.cs:55-87). In a live session the watermark would already be 4, so 4 > 4 is false and it stays quiet.

  2. Webhook cooldown is in-memory and never seeded from history. WebhookAlertService._cooldowns is a ConcurrentDictionary keyed by webhook:{serverId}:{metricName} and populated only after a successful send (PerformanceMonitor.Notifications/WebhookAlertService.cs:40, 71-92). There is no seed-from-history path. Contrast the email path, which seeds its cooldown from config_alert_log on first use, with a comment explicitly citing this exact scenario:

"Seed the in-memory cooldown from the alert log the first time this key is seen, so an alert email sent shortly before an app restart is not immediately re-sent afterward (#981)." — EmailSendCore.cs:92-99 / DuckDbAlertHistoryStore.GetLastEmailSentUtcAsync (Lite/Services/DuckDbAlertHistoryStore.cs:112-149)

So #981 closed this for email but left the webhook channel open. For a webhook-only deployment there is zero restart protection.
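
For illustration only, here is a minimal Python sketch of the seed-on-first-use pattern that the email comment describes, applied to a generic cooldown. It is not the project's C# code: `SeededCooldown`, `load_last_sent`, the key values, and the 3600-second cooldown are all hypothetical, and it assumes a last-sent timestamp for the webhook channel is available from persisted history, which the source does not confirm.

```python
class SeededCooldown:
    """Cooldown that consults persisted history the first time a key is seen."""

    def __init__(self, cooldown_seconds, load_last_sent):
        self._cooldown_seconds = cooldown_seconds
        # Hypothetical lookup: key -> last-sent timestamp in seconds, or None.
        self._load_last_sent = load_last_sent
        self._last_sent = {}

    def is_suppressed(self, key, now):
        if key not in self._last_sent:
            # First sight of this key in this process: seed from history so a
            # send made shortly before a restart still counts.
            self._last_sent[key] = self._load_last_sent(key)
        last = self._last_sent[key]
        return last is not None and now - last < self._cooldown_seconds

    def record_sent(self, key, now):
        self._last_sent[key] = now


# Pretend the alert log recorded a deadlock webhook at t=0, before the restart.
alert_log = {"webhook:server-1:deadlocks": 0}  # hypothetical key and timestamp
cooldown = SeededCooldown(3600, alert_log.get)  # cooldown length is assumed

# 17 minutes (1020 s) after the original send, in a fresh process.
suppressed = cooldown.is_suppressed("webhook:server-1:deadlocks", now=1020)
unrelated = cooldown.is_suppressed("webhook:server-1:blocking", now=1020)
```

With seeding in place, the key that was sent before the restart is suppressed, while a key with no history is not.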

Both gates must fail together: the watermark (0) is what makes Fire = true; the empty webhook cooldown is what lets the resulting post through.
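
The interaction of the two guards across a restart can be modelled with a short Python sketch. This is a simplified model under stated assumptions, not the actual RollingCountAlertGate or WebhookAlertService logic: `AppSession`, `sweep`, the `current_count >= threshold` check, the 3600-second cooldown, and the key value are hypothetical. Only the comparison against the watermark (4 > 0 fires, 4 > 4 does not) and the 16:10 / 16:27 timing come from the report.

```python
from dataclasses import dataclass, field


@dataclass
class AppSession:
    """Hypothetical model of one process lifetime; all state is in-memory."""
    watermark: int = 0                             # stands in for _lastAlertedDeadlockCount
    cooldowns: dict = field(default_factory=dict)  # stands in for _cooldowns
    cooldown_seconds: int = 3600                   # assumed value

    def sweep(self, current_count, threshold, now, key):
        """Return True if a webhook would be posted on this sweep."""
        # Guard 1: edge trigger. Fire only when the rolling count exceeds
        # the count that was last alerted on.
        if not (current_count >= threshold and current_count > self.watermark):
            return False
        # Guard 2: per-key cooldown, populated only after a send.
        last_sent = self.cooldowns.get(key)
        if last_sent is not None and now - last_sent < self.cooldown_seconds:
            return False
        self.watermark = current_count
        self.cooldowns[key] = now
        return True


key = "webhook:server-1:deadlocks"  # hypothetical serverId / metricName

before_restart = AppSession()
first_send = before_restart.sweep(4, 1, now=0, key=key)       # 16:10, fires
quiet_sweep = before_restart.sweep(4, 1, now=300, key=key)    # 4 > 4 is false

after_restart = AppSession()  # restart: watermark and cooldowns start empty
duplicate = after_restart.sweep(4, 1, now=1020, key=key)      # 16:27, fires again
```

In this model `first_send` and `duplicate` are both true while `quiet_sweep` is false; persisting either guard's state across the restart would be enough to suppress the second post.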

Why the duplicate is byte-identical
There is no fingerprint-based suppression at send time — the Dedup Key / incident data is used only for rendering and per-event delivery shaping, never to ask "did I already send fingerprint X?" (that's delegated to downstream consumers). So the re-send carries the same Dedup Key and Occurrences verbatim.

Downstream impact
A consumer that dedups on the fingerprint against open tickets will catch the duplicate but re-add the occurrence count (e.g. a ticket showing 4 jumps to 8). A consumer that has already closed the ticket will open a new one. Either way the duplicate signal escapes.
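
Both downstream outcomes can be shown with a small Python sketch of a consumer that deduplicates on the fingerprint. `TicketConsumer` and the placeholder key are hypothetical; real ticketing or paging integrations will differ.

```python
class TicketConsumer:
    """Hypothetical downstream consumer that dedups on the alert fingerprint."""

    def __init__(self):
        self.open_tickets = {}   # dedup key -> accumulated occurrence count
        self.tickets_opened = 0

    def receive(self, dedup_key, occurrences):
        if dedup_key in self.open_tickets:
            # Known fingerprint on an open ticket: add the reported occurrences.
            self.open_tickets[dedup_key] += occurrences
        else:
            self.open_tickets[dedup_key] = occurrences
            self.tickets_opened += 1

    def close(self, dedup_key):
        self.open_tickets.pop(dedup_key, None)


key = "example-dedup-key"  # placeholder, not the real Dedup Key

# Case 1: ticket still open when the duplicate arrives.
still_open = TicketConsumer()
still_open.receive(key, 4)   # original alert
still_open.receive(key, 4)   # post-restart duplicate
double_counted = still_open.open_tickets[key]

# Case 2: ticket already closed when the duplicate arrives.
already_closed = TicketConsumer()
already_closed.receive(key, 4)
already_closed.close(key)
already_closed.receive(key, 4)
reopened_count = already_closed.tickets_opened
```

In the first case the ticket's count goes from 4 to 8 for the same four deadlocks; in the second, a second ticket is opened for an incident that was already handled.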
