Merged
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover
- **Dashboard: blocking/deadlock XE sessions self-heal, Azure SQL DB sessions are actually created, and a missing session raises a Capture Down alert** — same silent-failure family as [#1086], worse on the Dashboard side. (1) The server-scoped sessions were created once at install and never re-ensured: if later stopped or dropped, `collect.blocked_process_xml_collector` and `collect.deadlock_xml_collector` swallowed the missing-session error and logged `SUCCESS` with zero rows forever. Both procs now ensure (create/start) the session at the top of every run. (2) On Azure SQL DB, the code comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing anywhere created them, so blocking/deadlock capture was 100% non-functional on Azure SQL DB; the procs now create and start them (`database_xml_deadlock_report` for deadlocks — the Azure read also filtered on the wrong event name and would have returned nothing even with a session present). (3) Honest logging: when the session is genuinely absent and can't be created (typically missing `ALTER ANY EVENT SESSION` on-prem / `CREATE ANY DATABASE EVENT SESSION` on Azure SQL DB), the run logs `SESSION_MISSING` with the real error instead of `SUCCESS`. (4) The alert engine reads that status and raises a **Capture Down** alert through the standard pipeline — snoozable tray notification, email, webhook, alert history, cooldown, and mute — with a **Capture Restored** clear when the session comes back. Note: on Azure SQL DB the blocked-process *threshold* cannot be set via `sp_configure` and Microsoft documents no default, so the blocked-process session may exist yet capture nothing there; deadlock capture has no such dependency

- **Lite and Dashboard UI no longer goes blank or disappears after sleep/wake** ([#1050]) — closing a laptop lid (or locking the screen) and then resuming could leave the app running with no usable window: notifications kept firing but the window was gone from the desktop and taskbar, and relaunching showed an empty window until a full exit/restart. Two causes, both fixed. (1) WPF's GPU render thread can lose its rendering surface across a sleep/wake or RDP reconnect and never recover, leaving a live-but-blank window; both apps now use software rendering (`RenderOptions.ProcessRenderMode = SoftwareOnly`) to remove the GPU dependency — charts are unaffected because ScottPlot already renders via SkiaSharp. (2) When Windows turned the sleep-driven minimize into a hidden window, the minimize-to-tray logic left it hidden with no automatic way back; a new shared resume guard now restores the window from the tray on resume/unlock if it was visible beforehand (a window the user deliberately sent to the tray is left alone)
- **"Silence All Alerts" now suppresses email too** ([#1035]) — right-clicking a monitored instance and choosing *Silence All Alerts* hid tray notifications and Alerts-tab badges, but two email paths ignored the silenced state and kept sending: connection up/down emails (*Server Unreachable* / *Server Restored*) and analysis-finding emails (the narrative findings from the analysis engine, which include CPU/memory/blocking stories). Only the threshold-alert path (High CPU, blocking, deadlocks, etc.) honored silencing. Both gaps are closed — a silenced server now produces no tray, email, or alert-history row from any path. The analysis path was the likely source of the reporter's "High CPU" email, since the threshold-based High CPU alert was already suppressed. The shared `AnalysisNotificationService` (used by Lite too) gains an optional per-server silence predicate; Lite has no silencing feature and passes none
- **Dashboard time labels are now consistently 24-hour** ([#1012]) — the time-range header at the top of each tab (e.g. *"Original: May 28, 11:30 PM – May 29, 1:30 AM (PST)"*) and the Query Performance heatmap x-axis tick labels used `h:mm tt`, while every other timestamp in the app (footer "Last refresh", DataGrid columns, slicer, tooltips, logs) already used 24-hour `HH:mm`/`HH:mm:ss`. The AM/PM marker was also being truncated in the column shown by the reporter. Normalized the four outliers to `HH:mm` to match the rest of the app. The Lite heatmap had the same `h:mm tt` straggler — fixed alongside
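
For reference, the difference between the two format strings named in the #1012 entry can be seen with a minimal C# sketch. The sample timestamp is taken from the changelog's own example; using `CultureInfo.InvariantCulture` is an assumption made here so the output is stable, not necessarily what the app does.

```csharp
using System;
using System.Globalization;

// Sample value from the changelog example ("May 28, 11:30 PM").
var t = new DateTime(2026, 5, 28, 23, 30, 0);

// Assumption: invariant culture, so the AM/PM designator is always "PM".
var culture = CultureInfo.InvariantCulture;

Console.WriteLine(t.ToString("h:mm tt", culture)); // 12-hour with designator: 11:30 PM
Console.WriteLine(t.ToString("HH:mm", culture));   // 24-hour, zero-padded:    23:30
```

The 24-hour form has a fixed width and no designator, so there is nothing to truncate in a narrow column.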
@@ -41,6 +44,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
[#1012]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1012
[#1035]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1035
[#1050]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1050
[#1086]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1086

## [2.11.0] - 2026-05-19

57 changes: 57 additions & 0 deletions Dashboard/MainWindow.xaml.cs
@@ -99,6 +99,8 @@ public partial class MainWindow : Window
private readonly ConcurrentDictionary<string, bool> _activeTempDbSpaceAlert = new();
private readonly ConcurrentDictionary<string, DateTime> _lastLongRunningJobAlert = new();
private readonly ConcurrentDictionary<string, bool> _activeLongRunningJobAlert = new();
private readonly ConcurrentDictionary<string, DateTime> _lastCaptureDownAlert = new();
private readonly ConcurrentDictionary<string, bool> _activeCaptureDownAlert = new();
private readonly ConcurrentDictionary<string, long> _previousDeadlockCounts = new();

private const double ExpandedWidth = 250;
@@ -1611,6 +1613,61 @@ await _emailAlertService.TrySendAlertEmailAsync(
"0", prefs.DeadlockThreshold.ToString(), true, "tray");
}

/* Capture Down alerts — the blocking/deadlock XE session is missing and the
collector couldn't create it, so capture is silently non-functional (#1086).
Gated on the blocking/deadlock notification prefs: if the user wants those
alerts, they need to know when the data feeding them stops existing. */
bool captureDown = (prefs.NotifyOnBlocking || prefs.NotifyOnDeadlock)
    && health.MissingCaptureSessions.Count > 0;

if (captureDown)
{
    _activeCaptureDownAlert[serverId] = true;
    if (!_lastCaptureDownAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown)
    {
        var muteCtx = new AlertMuteContext { ServerName = serverName, MetricName = "Capture Down" };
        bool isMuted = _muteRuleService.IsAlertMuted(muteCtx);
        _lastCaptureDownAlert[serverId] = now;

        var captureList = string.Join(" and ", health.MissingCaptureSessions);
        var detailText = $"The {captureList} Extended Events session(s) are missing and could not be created. " +
            "Blocking/deadlock data is NOT being captured. " +
            "Check the collection log for the SESSION_MISSING error detail (usually a permissions problem: " +
            "ALTER ANY EVENT SESSION on-prem, CREATE ANY DATABASE EVENT SESSION on Azure SQL DB).";

        if (!isMuted)
        {
            _notificationService?.ShowSnoozableNotification(
                "Capture Down",
                $"{serverName}: {captureList} capture is not running — XE session missing",
                NotificationType.Error,
                serverName,
                "Capture Down",
                _muteRuleService);
        }

        _emailAlertService.RecordAlert(serverId, serverName, "Capture Down",
            captureList, "session running", !isMuted, isMuted ? "muted" : "tray", muted: isMuted, detailText: detailText);

        if (!isMuted)
        {
            await _emailAlertService.TrySendAlertEmailAsync(
                "Capture Down",
                serverName,
                captureList,
                "session running",
                serverId);
        }
    }
}
else if (_activeCaptureDownAlert.TryRemove(serverId, out var wasCaptureDown) && wasCaptureDown)
{
    _notificationService?.ShowNotification("Capture Restored",
        $"{serverName}: Blocking/deadlock capture is running again");
    _emailAlertService.RecordAlert(serverId, serverName, "Capture Restored",
        "running", "session running", true, "tray");
}

/* High CPU alerts — evaluator picks Total or SQL based on prefs.CpuAlertMode */
int? alertCpuValue = prefs.CpuAlertMode == CpuAlertMode.Total
? health.TotalCpuPercent
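
The Capture Down / Capture Restored flow in this hunk is a small per-server state machine: raise on the first failing cycle, stay quiet inside the cooldown, and raise a single clear when the condition goes away. A minimal sketch of just that logic follows, with notification, mute, and email handling left out. `CaptureDownTracker` and `CaptureAlertAction` are hypothetical names used for illustration and are not part of the PR.

```csharp
using System;
using System.Collections.Concurrent;

var tracker = new CaptureDownTracker(TimeSpan.FromMinutes(5));
var t0 = new DateTime(2026, 1, 1, 12, 0, 0);

Console.WriteLine(tracker.Evaluate("srv", true, t0));                // RaiseDown
Console.WriteLine(tracker.Evaluate("srv", true, t0.AddMinutes(1)));  // None (inside cooldown)
Console.WriteLine(tracker.Evaluate("srv", true, t0.AddMinutes(5)));  // RaiseDown (cooldown elapsed)
Console.WriteLine(tracker.Evaluate("srv", false, t0.AddMinutes(6))); // RaiseRestored
Console.WriteLine(tracker.Evaluate("srv", false, t0.AddMinutes(7))); // None (already cleared)

// Hypothetical result type: what the caller should do this cycle.
public enum CaptureAlertAction { None, RaiseDown, RaiseRestored }

// Hypothetical stand-in for the two dictionaries added to MainWindow.
public sealed class CaptureDownTracker
{
    private readonly ConcurrentDictionary<string, DateTime> _lastAlert = new();
    private readonly ConcurrentDictionary<string, bool> _active = new();
    private readonly TimeSpan _cooldown;

    public CaptureDownTracker(TimeSpan cooldown) => _cooldown = cooldown;

    public CaptureAlertAction Evaluate(string serverId, bool captureDown, DateTime now)
    {
        if (captureDown)
        {
            // Remember that this server is in the "down" state so a clear can fire later.
            _active[serverId] = true;

            // Alert on the first occurrence, then again only after the cooldown has elapsed.
            if (!_lastAlert.TryGetValue(serverId, out var last) || now - last >= _cooldown)
            {
                _lastAlert[serverId] = now;
                return CaptureAlertAction.RaiseDown;
            }

            return CaptureAlertAction.None;
        }

        // TryRemove succeeds only once per outage, so the clear fires exactly once.
        return _active.TryRemove(serverId, out var wasDown) && wasDown
            ? CaptureAlertAction.RaiseRestored
            : CaptureAlertAction.None;
    }
}
```

As in the hunk, the last-alert timestamp is not reset on restore, so a session that fails again inside the cooldown window would not re-alert until the window elapses.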
7 changes: 7 additions & 0 deletions Dashboard/Models/AlertHealthResult.cs
@@ -37,6 +37,13 @@ public class AlertHealthResult
public List<AnomalousJobInfo> AnomalousJobs { get; set; } = new();
public bool IsOnline { get; set; } = true;

/// <summary>
/// Capture types ("Blocking", "Deadlock") whose XE session is missing —
/// the collector's latest collection_log status is SESSION_MISSING (#1086).
/// Empty when both sessions are healthy.
/// </summary>
public List<string> MissingCaptureSessions { get; set; } = new();

/// <summary>
/// Total CPU = SQL + Other.
/// </summary>
52 changes: 50 additions & 2 deletions Dashboard/Services/DatabaseService.NocHealth.cs
@@ -152,10 +152,11 @@ public async Task<AlertHealthResult> GetAlertHealthAsync(
var longRunningTask = GetLongRunningQueriesAsync(connection, longRunningQueryThresholdMinutes, longRunningQueryMaxResults, excludeSpServerDiagnostics, excludeWaitFor, excludeBackups, excludeMiscWaits);
var tempDbTask = GetTempDbSpaceAsync(connection);
var anomalousJobTask = GetAnomalousJobsAsync(connection, longRunningJobMultiplier);
var missingCaptureTask = GetMissingCaptureSessionsAsync(connection);

var allTasks = filteredDeadlockTask != null
- ? new Task[] { cpuTask, blockingTask, deadlockTask, filteredDeadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask }
- : new Task[] { cpuTask, blockingTask, deadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask };
+ ? new Task[] { cpuTask, blockingTask, deadlockTask, filteredDeadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask, missingCaptureTask }
+ : new Task[] { cpuTask, blockingTask, deadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask, missingCaptureTask };
await Task.WhenAll(allTasks);

var cpuResult = await cpuTask;
@@ -173,6 +174,7 @@ public async Task<AlertHealthResult> GetAlertHealthAsync(
result.LongRunningQueries = await longRunningTask;
result.TempDbSpace = await tempDbTask;
result.AnomalousJobs = await anomalousJobTask;
result.MissingCaptureSessions = await missingCaptureTask;
}
catch (Exception ex)
{
@@ -183,6 +185,52 @@ public async Task<AlertHealthResult> GetAlertHealthAsync(
return result;
}

/// <summary>
/// Returns capture types ("Blocking", "Deadlock") whose collector most recently
/// logged SESSION_MISSING — the XE session is absent and couldn't be created,
/// so capture is non-functional even though reads "succeed" with zero rows (#1086).
/// </summary>
private async Task<List<string>> GetMissingCaptureSessionsAsync(SqlConnection connection)
{
    const string query = @"SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

SELECT
    x.collector_name
FROM
(
    SELECT
        cl.collector_name,
        cl.collection_status,
        n = ROW_NUMBER() OVER (PARTITION BY cl.collector_name ORDER BY cl.log_id DESC)
    FROM config.collection_log AS cl
    WHERE cl.collector_name IN (N'blocked_process_xml_collector', N'deadlock_xml_collector')
) AS x
WHERE x.n = 1
AND   x.collection_status = N'SESSION_MISSING'
OPTION(RECOMPILE);";

    var missing = new List<string>();

    try
    {
        using var cmd = new SqlCommand(query, connection);
        cmd.CommandTimeout = 10;
        using var reader = await cmd.ExecuteReaderAsync();

        while (await reader.ReadAsync())
        {
            var collectorName = reader.GetString(0);
            missing.Add(collectorName == "blocked_process_xml_collector" ? "Blocking" : "Deadlock");
        }
    }
    catch (Exception ex)
    {
        Logger.Warning($"Failed to check capture session status: {ex.Message}");
    }

    return missing;
}

/// <summary>
/// Returns blocking values directly (without writing to a ServerHealthStatus).
/// Used by GetAlertHealthAsync for lightweight alert checks.
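
The query in `GetMissingCaptureSessionsAsync` is a "latest row per group" check: only the most recent log entry for each collector decides whether its session counts as missing. The same rule can be sketched in memory with LINQ. `LogRow`, `CaptureHealth`, and the sample rows (including `query_stats_collector`) are hypothetical and exist only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<LogRow>
{
    new(1, "blocked_process_xml_collector", "SESSION_MISSING"),
    new(2, "deadlock_xml_collector", "SESSION_MISSING"),
    new(3, "blocked_process_xml_collector", "SUCCESS"),   // later run recovered
    new(4, "query_stats_collector", "SESSION_MISSING"),   // not a capture collector, ignored
};

Console.WriteLine(string.Join(", ", CaptureHealth.MissingCaptureTypes(log))); // Deadlock

// Hypothetical in-memory shape of a config.collection_log row.
public sealed record LogRow(long LogId, string CollectorName, string CollectionStatus);

public static class CaptureHealth
{
    // Same collector-to-label mapping the C# method applies to the reader results.
    private static readonly Dictionary<string, string> CaptureTypes = new()
    {
        ["blocked_process_xml_collector"] = "Blocking",
        ["deadlock_xml_collector"] = "Deadlock",
    };

    public static List<string> MissingCaptureTypes(IEnumerable<LogRow> rows) =>
        rows.Where(r => CaptureTypes.ContainsKey(r.CollectorName))         // WHERE collector_name IN (...)
            .GroupBy(r => r.CollectorName)                                 // PARTITION BY collector_name
            .Select(g => g.OrderByDescending(r => r.LogId).First())        // ROW_NUMBER() ... = 1
            .Where(r => r.CollectionStatus == "SESSION_MISSING")           // latest status only
            .Select(r => CaptureTypes[r.CollectorName])
            .ToList();
}
```

Filtering on status after picking the latest row is what lets a recovered session drop out of the list; filtering first would keep reporting an old `SESSION_MISSING` row indefinitely.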
116 changes: 116 additions & 0 deletions Lite.Tests/XeSessionHealthTests.cs
@@ -0,0 +1,116 @@
/*
* Copyright (c) 2026 Erik Darling, Darling Data LLC
*
* This file is part of the SQL Server Performance Monitor Lite.
*
* Licensed under the MIT License. See LICENSE file in the project root for full license information.
*/

using System.Linq;
using PerformanceMonitorLite.Services;
using Xunit;

namespace PerformanceMonitorLite.Tests;

/// <summary>
/// Regression tests for #1086: a missing/uncreatable XE session must surface in
/// collector health (so the UI stops showing OK while blocking/deadlock capture
/// is dead), and must clear again once a collection cycle succeeds (self-heal).
///
/// RecordCollectorResult and GetHealthSummary only touch the in-memory health
/// dictionary, so the service is constructed with null dependencies.
/// </summary>
public class XeSessionHealthTests
{
    private const int ServerId = 42;
    private const string Collector = "blocked_process_report";

    private static RemoteCollectorService CreateService() =>
        new(duckDb: null!, serverManager: null!, scheduleManager: null!);

    [Fact]
    public void XeFailure_Classified_As_Error_Surfaces_In_Both_Lists()
    {
        var service = CreateService();

        service.RecordCollectorResult(ServerId, Collector, "ERROR",
            "Failed to ensure blocked process XE session: boom", xeSessionUnavailable: true);

        var summary = service.GetHealthSummary(ServerId);

        var failure = Assert.Single(summary.XeSessionFailures);
        Assert.True(failure.XeSessionUnavailable);
        Assert.Equal(Collector, failure.CollectorName);
        Assert.Contains("boom", failure.XeSessionMessage);

        /* ERROR increments ConsecutiveErrors, so it also shows as an erroring collector */
        Assert.Equal(1, summary.ErroringCollectors);
    }

    [Fact]
    public void XeFailure_Classified_As_Permissions_Still_Surfaces()
    {
        var service = CreateService();

        /* PERMISSIONS deliberately does not increment ConsecutiveErrors, which is
           exactly why XeSessionFailures must be tracked separately — without it
           this failure would be invisible and the status bar would show OK */
        service.RecordCollectorResult(ServerId, Collector, "PERMISSIONS",
            "ALTER ANY EVENT SESSION permission denied", xeSessionUnavailable: true);

        var summary = service.GetHealthSummary(ServerId);

        Assert.Equal(0, summary.ErroringCollectors);
        var failure = Assert.Single(summary.XeSessionFailures);
        Assert.True(failure.XeSessionUnavailable);
        Assert.True(failure.IsPermissionRestricted);
    }

    [Fact]
    public void Subsequent_Success_Clears_The_Flag()
    {
        var service = CreateService();

        service.RecordCollectorResult(ServerId, Collector, "ERROR",
            "Failed to ensure deadlock XE session: boom", xeSessionUnavailable: true);
        service.RecordCollectorResult(ServerId, Collector, "SUCCESS");

        var summary = service.GetHealthSummary(ServerId);

        Assert.Empty(summary.XeSessionFailures);
        Assert.Equal(0, summary.ErroringCollectors);

        var entry = summary.Errors.SingleOrDefault(e => e.CollectorName == Collector);
        Assert.Null(entry);
    }

    [Fact]
    public void Success_Clears_Message_Too()
    {
        var service = CreateService();

        service.RecordCollectorResult(ServerId, Collector, "ERROR", "boom", xeSessionUnavailable: true);
        service.RecordCollectorResult(ServerId, Collector, "SUCCESS");
        service.RecordCollectorResult(ServerId, Collector, "ERROR", "unrelated query failure");

        var summary = service.GetHealthSummary(ServerId);

        /* A later non-XE failure must not resurrect the stale XE message */
        Assert.Empty(summary.XeSessionFailures);
        var erroring = Assert.Single(summary.Errors);
        Assert.False(erroring.XeSessionUnavailable);
        Assert.Null(erroring.XeSessionMessage);
    }

    [Fact]
    public void Failures_Are_Scoped_Per_Server()
    {
        var service = CreateService();

        service.RecordCollectorResult(ServerId, Collector, "ERROR", "boom", xeSessionUnavailable: true);
        service.RecordCollectorResult(ServerId + 1, Collector, "SUCCESS");

        Assert.Single(service.GetHealthSummary(ServerId).XeSessionFailures);
        Assert.Empty(service.GetHealthSummary(ServerId + 1).XeSessionFailures);
    }
}
}
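
The tests above pin down a contract without showing the service itself. As a reading aid, here is one way an in-memory health store could satisfy that contract. It is a sketch inferred only from the assertions, not the actual `RemoteCollectorService`; `HealthStore`, `Entry`, and `Summary` are hypothetical names, and the choice to leave the error counter untouched on `PERMISSIONS` is an assumption.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

var health = new HealthStore();
health.Record(42, "blocked_process_report", "PERMISSIONS", "permission denied", xeSessionUnavailable: true);

var s = health.Summarize(42);
Console.WriteLine($"{s.ErroringCollectors} erroring, {s.XeSessionFailures.Count} XE failure(s)"); // 0 erroring, 1 XE failure(s)

health.Record(42, "blocked_process_report", "SUCCESS");
Console.WriteLine(health.Summarize(42).XeSessionFailures.Count); // 0

// Hypothetical per-collector health entry.
public sealed class Entry
{
    public string CollectorName { get; init; } = "";
    public int ConsecutiveErrors { get; set; }
    public bool IsPermissionRestricted { get; set; }
    public bool XeSessionUnavailable { get; set; }
    public string? XeSessionMessage { get; set; }
}

public sealed record Summary(int ErroringCollectors, List<Entry> XeSessionFailures);

public sealed class HealthStore
{
    private readonly Dictionary<(int ServerId, string Collector), Entry> _entries = new();

    public void Record(int serverId, string collector, string status,
        string? message = null, bool xeSessionUnavailable = false)
    {
        var key = (serverId, collector);
        if (!_entries.TryGetValue(key, out var e))
            _entries[key] = e = new Entry { CollectorName = collector };

        switch (status)
        {
            case "ERROR":
                e.ConsecutiveErrors++;
                e.IsPermissionRestricted = false;
                break;
            case "PERMISSIONS":
                // Assumption: permission failures leave the error counter alone.
                e.IsPermissionRestricted = true;
                break;
            default:
                e.ConsecutiveErrors = 0;
                e.IsPermissionRestricted = false;
                break;
        }

        // XE state always mirrors the latest run, so a later non-XE failure
        // cannot bring back a stale XE message.
        e.XeSessionUnavailable = xeSessionUnavailable;
        e.XeSessionMessage = xeSessionUnavailable ? message : null;
    }

    public Summary Summarize(int serverId)
    {
        // Entries are keyed by server, so one server's failure never leaks into another's summary.
        var mine = _entries.Where(kv => kv.Key.ServerId == serverId).Select(kv => kv.Value).ToList();
        return new Summary(
            mine.Count(e => e.ConsecutiveErrors > 0),
            mine.Where(e => e.XeSessionUnavailable).ToList());
    }
}
```

Tracking the XE flag separately from `ConsecutiveErrors` is the point the permissions test makes: a failure class that does not count as "erroring" still has to be visible somewhere.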