diff --git a/CHANGELOG.md b/CHANGELOG.md
index c83a6519..57fd970a 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Fixed
 
+- **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover
+- **Dashboard: blocking/deadlock XE sessions self-heal, Azure SQL DB sessions are actually created, and a missing session raises a Capture Down alert** — same silent-failure family as [#1086], worse on the Dashboard side. (1) The server-scoped sessions were created once at install and never re-ensured: if later stopped or dropped, `collect.blocked_process_xml_collector` and `collect.deadlock_xml_collector` swallowed the missing-session error and logged `SUCCESS` with zero rows forever. Both procs now ensure (create/start) the session at the top of every run. (2) On Azure SQL DB, the code comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing anywhere created them, so blocking/deadlock capture was 100% non-functional on Azure SQL DB; the procs now create and start them (`database_xml_deadlock_report` for deadlocks — the Azure read also filtered on the wrong event name and would have returned nothing even with a session present). (3) Honest logging: when the session is genuinely absent and can't be created (typically missing `ALTER ANY EVENT SESSION` on-prem / `CREATE ANY DATABASE EVENT SESSION` on Azure SQL DB), the run logs `SESSION_MISSING` with the real error instead of `SUCCESS`. (4) The alert engine reads that status and raises a **Capture Down** alert through the standard pipeline — snoozable tray notification, email, webhook, alert history, cooldown, and mute — with a **Capture Restored** clear when the session comes back. Note: on Azure SQL DB the blocked-process *threshold* cannot be set via `sp_configure` and Microsoft documents no default, so the blocked-process session may exist yet capture nothing there; deadlock capture has no such dependency
+
 - **Lite and Dashboard UI no longer goes blank or disappears after sleep/wake** ([#1050]) — closing a laptop lid (or locking the screen) and then resuming could leave the app running with no usable window: notifications kept firing but the window was gone from the desktop and taskbar, and relaunching showed an empty window until a full exit/restart. Two causes, both fixed. (1) WPF's GPU render thread can lose its rendering surface across a sleep/wake or RDP reconnect and never recover, leaving a live-but-blank window; both apps now use software rendering (`RenderOptions.ProcessRenderMode = SoftwareOnly`) to remove the GPU dependency — charts are unaffected because ScottPlot already renders via SkiaSharp. (2) When Windows turned the sleep-driven minimize into a hidden window, the minimize-to-tray logic left it hidden with no automatic way back; a new shared resume guard now restores the window from the tray on resume/unlock if it was visible beforehand (a window the user deliberately sent to the tray is left alone)
 - **"Silence All Alerts" now suppresses email too** ([#1035]) — right-clicking a monitored instance and choosing *Silence All Alerts* hid tray notifications and Alerts-tab badges, but two email paths ignored the silenced state and kept sending: connection up/down emails (*Server Unreachable* / *Server Restored*) and analysis-finding emails (the narrative findings from the analysis engine, which include CPU/memory/blocking stories). Only the threshold-alert path (High CPU, blocking, deadlocks, etc.) honored silencing. Both gaps are closed — a silenced server now produces no tray, email, or alert-history row from any path. The analysis path was the likely source of the reporter's "High CPU" email, since the threshold-based High CPU alert was already suppressed. The shared `AnalysisNotificationService` (used by Lite too) gains an optional per-server silence predicate; Lite has no silencing feature and passes none
 - **Dashboard time labels are now consistently 24-hour** ([#1012]) — the time-range header at the top of each tab (e.g. *"Original: May 28, 11:30 PM – May 29, 1:30 AM (PST)"*) and the Query Performance heatmap x-axis tick labels used `h:mm tt`, while every other timestamp in the app (footer "Last refresh", DataGrid columns, slicer, tooltips, logs) already used 24-hour `HH:mm`/`HH:mm:ss`. The AM/PM marker was also being truncated in the column shown by the reporter. Normalized the four outliers to `HH:mm` to match the rest of the app. The Lite heatmap had the same `h:mm tt` straggler — fixed alongside
@@ -41,6 +44,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 [#1012]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1012
 [#1035]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1035
 [#1050]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1050
+[#1086]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1086
 
 ## [2.11.0] - 2026-05-19
 
diff --git a/Dashboard/MainWindow.xaml.cs b/Dashboard/MainWindow.xaml.cs
index 42a78c4e..138a812a 100644
--- a/Dashboard/MainWindow.xaml.cs
+++ b/Dashboard/MainWindow.xaml.cs
@@ -99,6 +99,8 @@ public partial class MainWindow : Window
         private readonly ConcurrentDictionary _activeTempDbSpaceAlert = new();
         private readonly ConcurrentDictionary _lastLongRunningJobAlert = new();
         private readonly ConcurrentDictionary _activeLongRunningJobAlert = new();
+        private readonly ConcurrentDictionary _lastCaptureDownAlert = new();
+        private readonly ConcurrentDictionary _activeCaptureDownAlert = new();
         private readonly ConcurrentDictionary _previousDeadlockCounts = new();
 
         private const double ExpandedWidth = 250;
@@ -1611,6 +1613,61 @@ await _emailAlertService.TrySendAlertEmailAsync(
                         "0", prefs.DeadlockThreshold.ToString(), true, "tray");
             }
 
+            /* Capture Down alerts — the blocking/deadlock XE session is missing and the
+               collector couldn't create it, so capture is silently non-functional (#1086).
+               Gated on the blocking/deadlock notification prefs: if the user wants those
+               alerts, they need to know when the data feeding them stops existing. */
+            bool captureDown = (prefs.NotifyOnBlocking || prefs.NotifyOnDeadlock)
+                && health.MissingCaptureSessions.Count > 0;
+
+            if (captureDown)
+            {
+                _activeCaptureDownAlert[serverId] = true;
+                if (!_lastCaptureDownAlert.TryGetValue(serverId, out var lastAlert) || (now - lastAlert) >= alertCooldown)
+                {
+                    var muteCtx = new AlertMuteContext { ServerName = serverName, MetricName = "Capture Down" };
+                    bool isMuted = _muteRuleService.IsAlertMuted(muteCtx);
+                    _lastCaptureDownAlert[serverId] = now;
+
+                    var captureList = string.Join(" and ", health.MissingCaptureSessions);
+                    var detailText = $"The {captureList} Extended Events session(s) are missing and could not be created. "
+                        + "Blocking/deadlock data is NOT being captured. "
+                        + "Check the collection log for the SESSION_MISSING error detail (usually a permissions problem: "
+                        + "ALTER ANY EVENT SESSION on-prem, CREATE ANY DATABASE EVENT SESSION on Azure SQL DB).";
+
+                    if (!isMuted)
+                    {
+                        _notificationService?.ShowSnoozableNotification(
+                            "Capture Down",
+                            $"{serverName}: {captureList} capture is not running — XE session missing",
+                            NotificationType.Error,
+                            serverName,
+                            "Capture Down",
+                            _muteRuleService);
+                    }
+
+                    _emailAlertService.RecordAlert(serverId, serverName, "Capture Down",
+                        captureList, "session running", !isMuted, isMuted ? "muted" : "tray", muted: isMuted, detailText: detailText);
+
+                    if (!isMuted)
+                    {
+                        await _emailAlertService.TrySendAlertEmailAsync(
+                            "Capture Down",
+                            serverName,
+                            captureList,
+                            "session running",
+                            serverId);
+                    }
+                }
+            }
+            else if (_activeCaptureDownAlert.TryRemove(serverId, out var wasCaptureDown) && wasCaptureDown)
+            {
+                _notificationService?.ShowNotification("Capture Restored",
+                    $"{serverName}: Blocking/deadlock capture is running again");
+                _emailAlertService.RecordAlert(serverId, serverName, "Capture Restored",
+                    "running", "session running", true, "tray");
+            }
+
             /* High CPU alerts — evaluator picks Total or SQL based on prefs.CpuAlertMode */
             int? alertCpuValue = prefs.CpuAlertMode == CpuAlertMode.Total
                 ? health.TotalCpuPercent
diff --git a/Dashboard/Models/AlertHealthResult.cs b/Dashboard/Models/AlertHealthResult.cs
index ce50d4d4..2fc92010 100644
--- a/Dashboard/Models/AlertHealthResult.cs
+++ b/Dashboard/Models/AlertHealthResult.cs
@@ -37,6 +37,13 @@ public class AlertHealthResult
         public List AnomalousJobs { get; set; } = new();
         public bool IsOnline { get; set; } = true;
 
+        ///
+        /// Capture types ("Blocking", "Deadlock") whose XE session is missing —
+        /// the collector's latest collection_log status is SESSION_MISSING (#1086).
+        /// Empty when both sessions are healthy.
+        ///
+        public List MissingCaptureSessions { get; set; } = new();
+
         ///
         /// Total CPU = SQL + Other.
         ///
diff --git a/Dashboard/Services/DatabaseService.NocHealth.cs b/Dashboard/Services/DatabaseService.NocHealth.cs
index cf7c1e48..1203b7f5 100644
--- a/Dashboard/Services/DatabaseService.NocHealth.cs
+++ b/Dashboard/Services/DatabaseService.NocHealth.cs
@@ -152,10 +152,11 @@ public async Task GetAlertHealthAsync(
                 var longRunningTask = GetLongRunningQueriesAsync(connection, longRunningQueryThresholdMinutes, longRunningQueryMaxResults, excludeSpServerDiagnostics, excludeWaitFor, excludeBackups, excludeMiscWaits);
                 var tempDbTask = GetTempDbSpaceAsync(connection);
                 var anomalousJobTask = GetAnomalousJobsAsync(connection, longRunningJobMultiplier);
+                var missingCaptureTask = GetMissingCaptureSessionsAsync(connection);
 
                 var allTasks = filteredDeadlockTask != null
-                    ? new Task[] { cpuTask, blockingTask, deadlockTask, filteredDeadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask }
-                    : new Task[] { cpuTask, blockingTask, deadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask };
+                    ? new Task[] { cpuTask, blockingTask, deadlockTask, filteredDeadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask, missingCaptureTask }
+                    : new Task[] { cpuTask, blockingTask, deadlockTask, poisonWaitTask, longRunningTask, tempDbTask, anomalousJobTask, missingCaptureTask };
                 await Task.WhenAll(allTasks);
 
                 var cpuResult = await cpuTask;
@@ -173,6 +174,7 @@ public async Task GetAlertHealthAsync(
                 result.LongRunningQueries = await longRunningTask;
                 result.TempDbSpace = await tempDbTask;
                 result.AnomalousJobs = await anomalousJobTask;
+                result.MissingCaptureSessions = await missingCaptureTask;
             }
             catch (Exception ex)
             {
@@ -183,6 +185,52 @@ public async Task GetAlertHealthAsync(
             return result;
         }
 
+        ///
+        /// Returns capture types ("Blocking", "Deadlock") whose collector most recently
+        /// logged SESSION_MISSING — the XE session is absent and couldn't be created,
+        /// so capture is non-functional even though reads "succeed" with zero rows (#1086).
+        ///
+        private async Task> GetMissingCaptureSessionsAsync(SqlConnection connection)
+        {
+            const string query = @"SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
+
+            SELECT
+                x.collector_name
+            FROM
+            (
+                SELECT
+                    cl.collector_name,
+                    cl.collection_status,
+                    n = ROW_NUMBER() OVER (PARTITION BY cl.collector_name ORDER BY cl.log_id DESC)
+                FROM config.collection_log AS cl
+                WHERE cl.collector_name IN (N'blocked_process_xml_collector', N'deadlock_xml_collector')
+            ) AS x
+            WHERE x.n = 1
+            AND   x.collection_status = N'SESSION_MISSING'
+            OPTION(RECOMPILE);";
+
+            var missing = new List();
+
+            try
+            {
+                using var cmd = new SqlCommand(query, connection);
+                cmd.CommandTimeout = 10;
+                using var reader = await cmd.ExecuteReaderAsync();
+
+                while (await reader.ReadAsync())
+                {
+                    var collectorName = reader.GetString(0);
+                    missing.Add(collectorName == "blocked_process_xml_collector" ? "Blocking" : "Deadlock");
+                }
+            }
+            catch (Exception ex)
+            {
+                Logger.Warning($"Failed to check capture session status: {ex.Message}");
+            }
+
+            return missing;
+        }
+
         ///
         /// Returns blocking values directly (without writing to a ServerHealthStatus).
         /// Used by GetAlertHealthAsync for lightweight alert checks.
diff --git a/Lite.Tests/XeSessionHealthTests.cs b/Lite.Tests/XeSessionHealthTests.cs
new file mode 100644
index 00000000..563751d6
--- /dev/null
+++ b/Lite.Tests/XeSessionHealthTests.cs
@@ -0,0 +1,116 @@
+/*
+ * Copyright (c) 2026 Erik Darling, Darling Data LLC
+ *
+ * This file is part of the SQL Server Performance Monitor Lite.
+ *
+ * Licensed under the MIT License. See LICENSE file in the project root for full license information.
+ */
+
+using System.Linq;
+using PerformanceMonitorLite.Services;
+using Xunit;
+
+namespace PerformanceMonitorLite.Tests;
+
+///
+/// Regression tests for #1086: a missing/uncreatable XE session must surface in
+/// collector health (so the UI stops showing OK while blocking/deadlock capture
+/// is dead), and must clear again once a collection cycle succeeds (self-heal).
+///
+/// RecordCollectorResult and GetHealthSummary only touch the in-memory health
+/// dictionary, so the service is constructed with null dependencies.
+///
+public class XeSessionHealthTests
+{
+    private const int ServerId = 42;
+    private const string Collector = "blocked_process_report";
+
+    private static RemoteCollectorService CreateService() =>
+        new(duckDb: null!, serverManager: null!, scheduleManager: null!);
+
+    [Fact]
+    public void XeFailure_Classified_As_Error_Surfaces_In_Both_Lists()
+    {
+        var service = CreateService();
+
+        service.RecordCollectorResult(ServerId, Collector, "ERROR",
+            "Failed to ensure blocked process XE session: boom", xeSessionUnavailable: true);
+
+        var summary = service.GetHealthSummary(ServerId);
+
+        var failure = Assert.Single(summary.XeSessionFailures);
+        Assert.True(failure.XeSessionUnavailable);
+        Assert.Equal(Collector, failure.CollectorName);
+        Assert.Contains("boom", failure.XeSessionMessage);
+
+        /* ERROR increments ConsecutiveErrors, so it also shows as an erroring collector */
+        Assert.Equal(1, summary.ErroringCollectors);
+    }
+
+    [Fact]
+    public void XeFailure_Classified_As_Permissions_Still_Surfaces()
+    {
+        var service = CreateService();
+
+        /* PERMISSIONS deliberately does not increment ConsecutiveErrors, which is
+           exactly why XeSessionFailures must be tracked separately — without it
+           this failure would be invisible and the status bar would show OK */
+        service.RecordCollectorResult(ServerId, Collector, "PERMISSIONS",
+            "ALTER ANY EVENT SESSION permission denied", xeSessionUnavailable: true);
+
+        var summary = service.GetHealthSummary(ServerId);
+
+        Assert.Equal(0, summary.ErroringCollectors);
+        var failure = Assert.Single(summary.XeSessionFailures);
+        Assert.True(failure.XeSessionUnavailable);
+        Assert.True(failure.IsPermissionRestricted);
+    }
+
+    [Fact]
+    public void Subsequent_Success_Clears_The_Flag()
+    {
+        var service = CreateService();
+
+        service.RecordCollectorResult(ServerId, Collector, "ERROR",
+            "Failed to ensure deadlock XE session: boom", xeSessionUnavailable: true);
+        service.RecordCollectorResult(ServerId, Collector, "SUCCESS");
+
+        var summary = service.GetHealthSummary(ServerId);
+
+        Assert.Empty(summary.XeSessionFailures);
+        Assert.Equal(0, summary.ErroringCollectors);
+
+        var entry = summary.Errors.SingleOrDefault(e => e.CollectorName == Collector);
+        Assert.Null(entry);
+    }
+
+    [Fact]
+    public void Success_Clears_Message_Too()
+    {
+        var service = CreateService();
+
+        service.RecordCollectorResult(ServerId, Collector, "ERROR", "boom", xeSessionUnavailable: true);
+        service.RecordCollectorResult(ServerId, Collector, "SUCCESS");
+        service.RecordCollectorResult(ServerId, Collector, "ERROR", "unrelated query failure");
+
+        var summary = service.GetHealthSummary(ServerId);
+
+        /* A later non-XE failure must not resurrect the stale XE message */
+        Assert.Empty(summary.XeSessionFailures);
+        var erroring = Assert.Single(summary.Errors);
+        Assert.False(erroring.XeSessionUnavailable);
+        Assert.Null(erroring.XeSessionMessage);
+    }
+
+    [Fact]
+    public void Failures_Are_Scoped_Per_Server()
+    {
+        var service = CreateService();
+
+        service.RecordCollectorResult(ServerId, Collector, "ERROR", "boom", xeSessionUnavailable: true);
+        service.RecordCollectorResult(ServerId + 1, Collector, "SUCCESS");
+
+        Assert.Single(service.GetHealthSummary(ServerId).XeSessionFailures);
+        Assert.Empty(service.GetHealthSummary(ServerId + 1).XeSessionFailures);
+    }
+}
diff --git a/Lite/MainWindow.xaml.cs b/Lite/MainWindow.xaml.cs
index 2390f333..ce41bfc9 100644
--- a/Lite/MainWindow.xaml.cs
+++ b/Lite/MainWindow.xaml.cs
@@ -44,6 +44,7 @@ public partial class MainWindow : Window
     private readonly Dictionary AlertCounts, Action ApplyTimeRange, Func ManualRefresh)> _tabEventHandlers = new();
     private readonly Dictionary _previousConnectionStates = new();
     private readonly Dictionary _previousCollectorErrorStates = new();
+    private readonly Dictionary _previousXeSessionFailureStates = new();
     private readonly Dictionary _lastCpuAlert = new();
     private readonly Dictionary _lastBlockingAlert = new();
     private readonly Dictionary _lastDeadlockAlert = new();
@@ -464,6 +465,17 @@ private void UpdateCollectorHealth()
                 string.Join("\n", health.Errors.Select(e =>
                     $"{e.CollectorName}: {e.ConsecutiveErrors}x consecutive - {e.LastErrorMessage}"));
         }
+        else if (health.XeSessionFailures.Count > 0)
+        {
+            /* XE session couldn't be created (#1086). Permission failures don't
+               increment ConsecutiveErrors, so without this branch the status bar
+               would show OK while blocking/deadlock capture is dead. */
+            var names = string.Join(", ", health.XeSessionFailures.Select(e => e.CollectorName));
+            CollectorHealthText.Text = $"Capture down: {names}";
+            CollectorHealthText.Foreground = System.Windows.Media.Brushes.OrangeRed;
+            CollectorHealthText.ToolTip = string.Join("\n", health.XeSessionFailures.Select(e =>
+                $"{e.CollectorName}: {e.XeSessionMessage}"));
+        }
         else
         {
             CollectorHealthText.Text = $"Collectors: {health.TotalCollectors} OK";
@@ -1290,8 +1302,10 @@ private void CheckConnectionsAndNotify()
             if (status?.IsOnline == null) continue;
 
             bool isOnline = status.IsOnline == true;
-            bool hasErrors = _collectorService != null && isOnline
-                && _collectorService.GetHealthSummary(server).ErroringCollectors > 0;
+            var healthSummary = _collectorService != null && isOnline
+                ? _collectorService.GetHealthSummary(server)
+                : null;
+            bool hasErrors = healthSummary?.ErroringCollectors > 0;
             server.HasCollectorErrors = hasErrors;
 
             if (_previousConnectionStates.TryGetValue(server.Id, out var wasOnline))
@@ -1328,6 +1342,25 @@ private void CheckConnectionsAndNotify()
             if (_previousCollectorErrorStates.TryGetValue(server.Id, out var prevHasErrors) && prevHasErrors != hasErrors)
                 needsRefresh = true;
 
+            /* One-time balloon when blocking/deadlock capture can't start because the
+               XE session couldn't be created (#1086). Edge-triggered on the false→true
+               transition so it doesn't re-fire every poll while the condition persists. */
+            bool xeSessionDown = healthSummary?.XeSessionFailures.Count > 0;
+            _previousXeSessionFailureStates.TryGetValue(server.Id, out var wasXeSessionDown);
+
+            if (App.AlertsEnabled && xeSessionDown && !wasXeSessionDown)
+            {
+                var captures = string.Join(" and ", healthSummary!.XeSessionFailures
+                    .Select(f => f.CollectorName == "blocked_process_report" ? "blocking" : "deadlock"));
+                var reason = healthSummary.XeSessionFailures[0].XeSessionMessage ?? "unknown error";
+
+                _trayService?.ShowNotification(
+                    "Capture Not Running",
+                    $"{server.DisplayNameWithIntent}: {captures} capture can't start — {reason}",
+                    Hardcodet.Wpf.TaskbarNotification.BalloonIcon.Warning);
+            }
+
+            _previousXeSessionFailureStates[server.Id] = xeSessionDown;
             _previousConnectionStates[server.Id] = isOnline;
             _previousCollectorErrorStates[server.Id] = hasErrors;
         }
diff --git a/Lite/Services/RemoteCollectorService.BlockedProcessReport.cs b/Lite/Services/RemoteCollectorService.BlockedProcessReport.cs
index 5ed567f6..5894403a 100644
--- a/Lite/Services/RemoteCollectorService.BlockedProcessReport.cs
+++ b/Lite/Services/RemoteCollectorService.BlockedProcessReport.cs
@@ -60,6 +60,10 @@ public async Task EnsureBlockedProcessXeSessionAsync(ServerConnection server, in
             _logger?.LogWarning("Failed to ensure blocked process XE session on '{Server}': {Message}",
                 server.DisplayName, ex.Message);
             AppLogger.Error("XeSession", $"[{server.DisplayName}] Failed to ensure blocked process XE session: {ex.Message}");
+
+            /* Propagate so RunCollectorAsync marks the collector unhealthy instead
+               of letting a zero-row ring-buffer read record SUCCESS (#1086) */
+            throw new XeSessionEnsureException("blocked process", ex);
         }
     }
 
@@ -149,6 +153,7 @@ LEFT JOIN sys.dm_xe_sessions AS dxs
                 _logger?.LogWarning("Failed to start blocked process XE session on '{Server}': {Message}",
                     server.DisplayName, ex.Message);
                 AppLogger.Error("XeSession", $"[{server.DisplayName}] Failed to start blocked process XE session: {ex.Message}");
+                throw;
             }
         }
         else
@@ -187,6 +192,7 @@ ADD TARGET package0.ring_buffer
                 _logger?.LogWarning("Failed to create blocked process XE session on '{Server}': {Message}",
                     server.DisplayName, ex.Message);
                 AppLogger.Error("XeSession", $"[{server.DisplayName}] Failed to create blocked process XE session: {ex.Message}");
+                throw;
             }
         }
 
@@ -243,7 +249,8 @@ ADD TARGET package0.ring_buffer
 )
 WITH
 (
-    MAX_DISPATCH_LATENCY = 5 SECONDS
+    MAX_DISPATCH_LATENCY = 5 SECONDS,
+    STARTUP_STATE = ON
 );
 
 ALTER EVENT SESSION [{BlockedProcessXeSessionName}] ON DATABASE STATE = START;", connection))
diff --git a/Lite/Services/RemoteCollectorService.Deadlocks.cs b/Lite/Services/RemoteCollectorService.Deadlocks.cs
index 68bd656e..11fcefbe 100644
--- a/Lite/Services/RemoteCollectorService.Deadlocks.cs
+++ b/Lite/Services/RemoteCollectorService.Deadlocks.cs
@@ -60,6 +60,10 @@ public async Task EnsureDeadlockXeSessionAsync(ServerConnection server, int engi
             _logger?.LogWarning("Failed to ensure deadlock XE session on '{Server}': {Message}",
                 server.DisplayName, ex.Message);
             AppLogger.Error("XeSession", $"[{server.DisplayName}] Failed to ensure deadlock XE session: {ex.Message}");
+
+            /* Propagate so RunCollectorAsync marks the collector unhealthy instead
+               of letting a zero-row ring-buffer read record SUCCESS (#1086) */
+            throw new XeSessionEnsureException("deadlock", ex);
         }
     }
 
@@ -102,6 +106,7 @@ LEFT JOIN sys.dm_xe_sessions AS dxs
                 _logger?.LogWarning("Failed to start deadlock XE session on '{Server}': {Message}",
                     server.DisplayName, ex.Message);
                 AppLogger.Error("XeSession", $"[{server.DisplayName}] Failed to start deadlock XE session: {ex.Message}");
+                throw;
             }
         }
         else
@@ -143,6 +148,7 @@ ADD TARGET package0.ring_buffer
                 _logger?.LogWarning("Failed to create deadlock XE session on '{Server}': {Message}",
                     server.DisplayName, ex.Message);
                 AppLogger.Error("XeSession", $"[{server.DisplayName}] Failed to create deadlock XE session: {ex.Message}");
+                throw;
             }
         }
 
@@ -239,7 +245,8 @@ ADD TARGET package0.ring_buffer
 WITH
 (
     MAX_DISPATCH_LATENCY = 5 SECONDS,
-    EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS
+    EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
+    STARTUP_STATE = ON
 );
 
 ALTER EVENT SESSION [{DeadlockXeSessionName}] ON DATABASE STATE = START;", connection))
diff --git a/Lite/Services/RemoteCollectorService.cs b/Lite/Services/RemoteCollectorService.cs
index ff36b844..2618522e 100644
--- a/Lite/Services/RemoteCollectorService.cs
+++ b/Lite/Services/RemoteCollectorService.cs
@@ -49,6 +49,34 @@ public class CollectorHealthEntry
      * permissions get granted later, the next launch retries once.
      */
     public bool IsPermissionRestricted { get; set; }
+
+    /*
+     * Set when the collector's Extended Events session could not be
+     * created or started (issue #1086). Distinct from a query failure:
+     * capture is non-functional even though reads would "succeed" with
+     * zero rows. Cleared on the next successful run. MainWindow raises
+     * a one-time tray notification on the false→true transition.
+     */
+    public bool XeSessionUnavailable { get; set; }
+    public string? XeSessionMessage { get; set; }
+}
+
+///
+/// Thrown when an Extended Events session required by a collector cannot
+/// be created or started. Raised before the collect query runs so a missing
+/// session can never be masked by a zero-row "successful" read (issue #1086).
+///
+public class XeSessionEnsureException : Exception
+{
+    public string SessionKind { get; }
+
+    public XeSessionEnsureException(string sessionKind, SqlException inner)
+        : base($"Failed to ensure {sessionKind} XE session: {inner.Message}", inner)
+    {
+        SessionKind = sessionKind;
+    }
+
+    public new SqlException InnerException => (SqlException)base.InnerException!;
 }
 
 ///
@@ -60,6 +88,14 @@ public class CollectorHealthSummary
     public int ErroringCollectors { get; set; }
     public int LoggingFailures { get; set; }
     public List Errors { get; set; } = new();
+
+    /*
+     * Collectors whose XE session couldn't be created/started (#1086).
+     * Tracked separately from Errors because a PERMISSIONS-classified
+     * failure deliberately does not increment ConsecutiveErrors, so it
+     * would otherwise be invisible here.
+     */
+    public List XeSessionFailures { get; set; } = new();
 }
 
 public partial class RemoteCollectorService
@@ -175,6 +211,11 @@ public CollectorHealthSummary GetHealthSummary(int? serverId = null)
                     summary.ErroringCollectors++;
                     summary.Errors.Add(entry);
                 }
+
+                if (entry.XeSessionUnavailable)
+                {
+                    summary.XeSessionFailures.Add(entry);
+                }
             }
 
             return summary;
@@ -214,7 +255,7 @@ public void ClearHealthExcept(HashSet activeServerIds)
     ///
     /// Records a collector execution result for health tracking.
     ///
-    private void RecordCollectorResult(int serverId, string collectorName, string status, string? errorMessage = null)
+    internal void RecordCollectorResult(int serverId, string collectorName, string status, string? errorMessage = null, bool xeSessionUnavailable = false)
     {
         lock (_healthLock)
         {
@@ -225,6 +266,9 @@ private void RecordCollectorResult(int serverId, string collectorName, string st
                 _collectorHealth[key] = entry;
             }
 
+            entry.XeSessionUnavailable = xeSessionUnavailable;
+            entry.XeSessionMessage = xeSessionUnavailable ? errorMessage : null;
+
             if (status == "SUCCESS")
             {
                 entry.LastSuccessTime = DateTime.UtcNow;
@@ -344,11 +388,9 @@ public async Task RunAllCollectorsForServerAsync(ServerConnection server, Cancel
             .Where(s => s.Enabled)
             .ToList();
 
-        /* Ensure XE sessions are set up before collecting */
+        /* XE session setup happens inside RunCollectorAsync so the background
+           collection loop also ensures/retries it, not just tab-open (#1086) */
         var serverStatus = _serverManager.GetConnectionStatus(server.Id);
-        var engineEdition = serverStatus.SqlEngineEdition;
-        await EnsureBlockedProcessXeSessionAsync(server, engineEdition, cancellationToken);
-        await EnsureDeadlockXeSessionAsync(server, engineEdition, cancellationToken);
 
         /* Persist edition/version to DuckDB for the analysis engine */
         await PersistServerMetadataAsync(server, serverStatus);
@@ -380,6 +422,7 @@ public async Task RunCollectorAsync(ServerConnection server, string collectorNam
         var status = "SUCCESS";
         string? errorMessage = null;
         int rowsCollected = 0;
+        bool xeSessionUnavailable = false;
 
         try
         {
@@ -418,6 +461,20 @@ public async Task RunCollectorAsync(ServerConnection server, string collectorNam
             _logger?.LogDebug("Running collector '{Collector}' for server '{Server}'",
                 collectorName, server.DisplayName);
 
+            /* Ensure the backing XE session exists before reading its ring buffer.
+               Runs on every cycle (cheap existence check when already present) so a
+               failed first attempt self-heals instead of staying broken until a manual
+               tab re-open (#1086). Throws XeSessionEnsureException on failure so the
+               zero-row read below can never record a misleading SUCCESS. */
+            if (collectorName == "blocked_process_report")
+            {
+                await EnsureBlockedProcessXeSessionAsync(server, engineEdition, cancellationToken);
+            }
+            else if (collectorName == "deadlocks")
+            {
+                await EnsureDeadlockXeSessionAsync(server, engineEdition, cancellationToken);
+            }
+
             rowsCollected = collectorName switch
             {
                 "wait_stats" => await CollectWaitStatsAsync(server, cancellationToken),
@@ -452,6 +509,21 @@ public async Task RunCollectorAsync(ServerConnection server, string collectorNam
             var elapsed = (int)(DateTime.UtcNow - startTime).TotalMilliseconds;
             AppLogger.Info("Collector", $" [{server.DisplayName}] {collectorName} => {rowsCollected} rows in {elapsed}ms (sql:{_lastSqlMs}ms, duck:{_lastDuckDbMs}ms)");
         }
+        catch (XeSessionEnsureException ex)
+        {
+            /* XE session couldn't be created/started — capture is dead even though
+               the ring-buffer read would "succeed" with zero rows. Classify like a
+               direct SQL failure so the health indicator stops showing OK (#1086). */
+            var sqlError = ex.InnerException;
+            errorMessage = ex.Message;
+            status = (sqlError.Number == 229 || sqlError.Number == 297 || sqlError.Number == 300)
+                ? "PERMISSIONS"
+                : "ERROR";
+            xeSessionUnavailable = true;
+            AppLogger.Error("Collector", $" [{server.DisplayName}] {collectorName} {ex.Message}");
+            _logger?.LogWarning("Collector '{Collector}' XE session unavailable for server '{Server}': {Message}",
+                collectorName, server.DisplayName, ex.Message);
+        }
         catch (SqlException ex)
         {
             status = "ERROR";
@@ -503,7 +575,7 @@ public async Task RunCollectorAsync(ServerConnection server, string collectorNam
         }
 
         // Track collector health
-        RecordCollectorResult(GetServerId(server), collectorName, status, errorMessage);
+        RecordCollectorResult(GetServerId(server), collectorName, status, errorMessage, xeSessionUnavailable);
 
         // Log the collection attempt
         await LogCollectionAsync(GetServerId(server), server.DisplayName, collectorName, startTime, status, errorMessage, rowsCollected, _lastSqlMs, _lastDuckDbMs);
diff --git a/install/21_setup_blocked_process_xe.sql b/install/21_setup_blocked_process_xe.sql
index 70106071..cf65dd21 100644
--- a/install/21_setup_blocked_process_xe.sql
+++ b/install/21_setup_blocked_process_xe.sql
@@ -8,8 +8,11 @@ Creates ring_buffer sessions that work across:
 - Azure SQL Managed Instance
 - AWS RDS for SQL Server
 
-Note: Azure SQL DB requires database-scoped sessions which are handled
-by the collection procedures in scripts 22 and 24.
+Note: Azure SQL DB requires database-scoped sessions. The collection
+procedures (scripts 22 and 24) create and start those at the top of every
+run (#1086); they also re-create/re-start the server-scoped sessions below
+if they are later dropped or stopped, so this script is the initial setup,
+not the only creation path.
 */
 
 SET ANSI_NULLS ON;
diff --git a/install/22_collect_blocked_processes.sql b/install/22_collect_blocked_processes.sql
index 11fd3a91..a0241401 100644
--- a/install/22_collect_blocked_processes.sql
+++ b/install/22_collect_blocked_processes.sql
@@ -10,7 +10,16 @@ Supports:
 - On-premises SQL Server (server-scoped session)
 - Azure SQL Managed Instance (server-scoped session)
 - AWS RDS for SQL Server (server-scoped session)
-- Azure SQL DB (database-scoped session, auto-created)
+- Azure SQL DB (database-scoped session)
+
+The XE session is ensured (created/started) at the top of every run (#1086),
+so a session that was dropped, stopped, or never created self-heals on the
+next collection cycle. If it still can't be created, the run logs
+SESSION_MISSING instead of a misleading SUCCESS with zero rows.
+
+Note: on Azure SQL DB the blocked process threshold cannot be set via
+sp_configure and Microsoft does not document a default, so the session may
+exist yet capture nothing. Deadlock capture has no such dependency.
 */
 
 SET ANSI_NULLS ON;
@@ -52,6 +61,8 @@ BEGIN
         @cutoff_time datetime2(7) = DATEADD(MINUTE, -@minutes_back, SYSUTCDATETIME()),
         @blocked_threshold_configured integer,
         @is_azure_sql_db bit = 0,
+        @session_missing bit = 0,
+        @ensure_error nvarchar(4000) = N'',
         @sql nvarchar(max) = N'';
 
     BEGIN TRY
@@ -63,6 +74,181 @@ BEGIN
             SET @is_azure_sql_db = 1;
         END;
 
+        /*
+        Ensure the XE session exists and is running before reading it (#1086).
+        Runs before BEGIN TRANSACTION because event session DDL is not allowed
+        inside a user transaction. Dynamic SQL keeps Azure-only catalog views
+        out of statement binding on other platforms.
+        */
+        BEGIN TRY
+            IF @is_azure_sql_db = 1
+            BEGIN
+                SET @sql = N'
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.database_event_sessions AS des
+                    WHERE des.name = @session_name
+                )
+                BEGIN
+                    CREATE EVENT SESSION
+                        [PerformanceMonitor_BlockedProcess]
+                    ON DATABASE
+                    ADD EVENT
+                        sqlserver.blocked_process_report
+                    ADD TARGET
+                        package0.ring_buffer
+                    (
+                        SET max_memory = 4096
+                    )
+                    WITH
+                    (
+                        MAX_DISPATCH_LATENCY = 5 SECONDS,
+                        EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
+                        STARTUP_STATE = ON
+                    );
+                END;
+
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.dm_xe_database_sessions AS dxs
+                    WHERE dxs.name = @session_name
+                )
+                BEGIN
+                    ALTER EVENT SESSION
+                        [PerformanceMonitor_BlockedProcess]
+                    ON DATABASE
+                    STATE = START;
+                END;';
+            END;
+            ELSE
+            BEGIN
+                SET @sql = N'
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.server_event_sessions AS ses
+                    WHERE ses.name = @session_name
+                )
+                BEGIN
+                    CREATE EVENT SESSION
+                        [PerformanceMonitor_BlockedProcess]
+                    ON SERVER
+                    ADD EVENT
+                        sqlserver.blocked_process_report
+                    ADD TARGET
+                        package0.ring_buffer
+                    (
+                        SET max_memory = 4096
+                    )
+                    WITH
+                    (
+                        MAX_MEMORY = 4096KB,
+                        EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
+                        MAX_DISPATCH_LATENCY = 5 SECONDS,
+                        MEMORY_PARTITION_MODE = NONE,
+                        STARTUP_STATE = ON
+                    );
+                END;
+
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.dm_xe_sessions AS dxs
+                    WHERE dxs.name = @session_name
+                )
+                BEGIN
+                    ALTER EVENT SESSION
+                        [PerformanceMonitor_BlockedProcess]
+                    ON SERVER
+                    STATE = START;
+                END;';
+            END;
+
+            EXECUTE sys.sp_executesql
+                @sql,
+                N'@session_name sysname',
+                @session_name;
+        END TRY
+        BEGIN CATCH
+            /*
+            Couldn't create/start (e.g. login lacks ALTER ANY EVENT SESSION,
+            or CREATE ANY DATABASE EVENT SESSION on Azure SQL DB).
+            Verified below — if the session is genuinely absent we log
+            SESSION_MISSING with this message instead of a fake SUCCESS.
+            */
+            SET @ensure_error = ERROR_MESSAGE();
+        END CATCH;
+
+        /*
+        Verify the session is actually running; if not, log SESSION_MISSING and bail
+        */
+        IF @is_azure_sql_db = 1
+        BEGIN
+            SET @sql = N'
+            IF NOT EXISTS
+            (
+                SELECT
+                    1/0
+                FROM sys.dm_xe_database_sessions AS dxs
+                WHERE dxs.name = @session_name
+            )
+            BEGIN
+                SET @session_missing = 1;
+            END;';
+        END;
+        ELSE
+        BEGIN
+            SET @sql = N'
+            IF NOT EXISTS
+            (
+                SELECT
+                    1/0
+                FROM sys.dm_xe_sessions AS dxs
+                WHERE dxs.name = @session_name
+            )
+            BEGIN
+                SET @session_missing = 1;
+            END;';
+        END;
+
+        EXECUTE sys.sp_executesql
+            @sql,
+            N'@session_name sysname, @session_missing bit OUTPUT',
+            @session_name,
+            @session_missing OUTPUT;
+
+        IF @session_missing = 1
+        BEGIN
+            INSERT INTO
+                config.collection_log
+            (
+                collector_name,
+                collection_status,
+                duration_ms,
+                error_message
+            )
+            VALUES
+            (
+                N'blocked_process_xml_collector',
+                N'SESSION_MISSING',
+                DATEDIFF(MILLISECOND, @start_time, SYSDATETIME()),
+                ISNULL(NULLIF(@ensure_error, N''), N'XE session ' + @session_name + N' is not running and could not be created or started')
+            );
+
+            IF @debug = 1
+            BEGIN
+                RAISERROR(N'Blocked process XE session missing and could not be created: %s', 0, 1, @ensure_error) WITH NOWAIT;
+            END;
+
+            RETURN;
+        END;
+
         BEGIN TRANSACTION;
 
         /*
@@ -282,13 +468,16 @@ BEGIN
         END TRY
         BEGIN CATCH
             /*
-            Session doesn't exist or is not accessible
-            This is expected if XE setup hasn't been run
+            The session was verified running above, so a read failure here is a
+            real error — re-raise so the outer CATCH logs ERROR instead of this
+            run recording a misleading SUCCESS (#1086)
             */
             IF @debug = 1
             BEGIN
-                RAISERROR(N'Blocked process session not available: %s', 0, 1, @session_name) WITH NOWAIT;
+                RAISERROR(N'Blocked process ring buffer read failed for session: %s', 0, 1, @session_name) WITH NOWAIT;
             END;
+
+            THROW;
         END CATCH;
 
         /*
diff --git a/install/24_collect_deadlock_xml.sql b/install/24_collect_deadlock_xml.sql
index 38df99b2..6f538371 100644
--- a/install/24_collect_deadlock_xml.sql
+++ b/install/24_collect_deadlock_xml.sql
@@ -10,7 +10,12 @@ Supports:
  - On-premises SQL Server (server-scoped session)
  - Azure SQL Managed Instance (server-scoped session)
  - AWS RDS for SQL Server (server-scoped session)
- - Azure SQL DB (database-scoped session, auto-created)
+ - Azure SQL DB (database-scoped session)
+
+The XE session is ensured (created/started) at the top of every run (#1086),
+so a session that was dropped, stopped, or never created self-heals on the
+next collection cycle. If it still can't be created, the run logs
+SESSION_MISSING instead of a misleading SUCCESS with zero rows.
 */
 
 SET ANSI_NULLS ON;
@@ -51,6 +56,8 @@ BEGIN
         @start_time datetime2(7) = SYSDATETIME(),
         @cutoff_time datetime2(7) = DATEADD(MINUTE, -@minutes_back, SYSUTCDATETIME()),
         @is_azure_sql_db bit = 0,
+        @session_missing bit = 0,
+        @ensure_error nvarchar(4000) = N'',
         @sql nvarchar(max) = N'';
 
     BEGIN TRY
@@ -62,6 +69,181 @@ BEGIN
             SET @is_azure_sql_db = 1;
         END;
 
+        /*
+        Ensure the XE session exists and is running before reading it (#1086).
+        Runs before BEGIN TRANSACTION because event session DDL is not allowed
+        inside a user transaction. Dynamic SQL keeps Azure-only catalog views
+        out of statement binding on other platforms.
+        */
+        BEGIN TRY
+            IF @is_azure_sql_db = 1
+            BEGIN
+                SET @sql = N'
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.database_event_sessions AS des
+                    WHERE des.name = @session_name
+                )
+                BEGIN
+                    CREATE EVENT SESSION
+                        [PerformanceMonitor_Deadlock]
+                    ON DATABASE
+                    ADD EVENT
+                        sqlserver.database_xml_deadlock_report
+                    ADD TARGET
+                        package0.ring_buffer
+                    (
+                        SET max_memory = 4096
+                    )
+                    WITH
+                    (
+                        MAX_DISPATCH_LATENCY = 5 SECONDS,
+                        EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
+                        STARTUP_STATE = ON
+                    );
+                END;
+
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.dm_xe_database_sessions AS dxs
+                    WHERE dxs.name = @session_name
+                )
+                BEGIN
+                    ALTER EVENT SESSION
+                        [PerformanceMonitor_Deadlock]
+                    ON DATABASE
+                    STATE = START;
+                END;';
+            END;
+            ELSE
+            BEGIN
+                SET @sql = N'
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.server_event_sessions AS ses
+                    WHERE ses.name = @session_name
+                )
+                BEGIN
+                    CREATE EVENT SESSION
+                        [PerformanceMonitor_Deadlock]
+                    ON SERVER
+                    ADD EVENT
+                        sqlserver.xml_deadlock_report
+                    ADD TARGET
+                        package0.ring_buffer
+                    (
+                        SET max_memory = 4096
+                    )
+                    WITH
+                    (
+                        MAX_MEMORY = 4096KB,
+                        EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,
+                        MAX_DISPATCH_LATENCY = 5 SECONDS,
+                        MEMORY_PARTITION_MODE = NONE,
+                        STARTUP_STATE = ON
+                    );
+                END;
+
+                IF NOT EXISTS
+                (
+                    SELECT
+                        1/0
+                    FROM sys.dm_xe_sessions AS dxs
+                    WHERE dxs.name = @session_name
+                )
+                BEGIN
+                    ALTER EVENT SESSION
+                        [PerformanceMonitor_Deadlock]
+                    ON SERVER
+                    STATE = START;
+                END;';
+            END;
+
+            EXECUTE sys.sp_executesql
+                @sql,
+                N'@session_name sysname',
+                @session_name;
+        END TRY
+        BEGIN CATCH
+            /*
+            Couldn't create/start (e.g. login lacks ALTER ANY EVENT SESSION,
+            or CREATE ANY DATABASE EVENT SESSION on Azure SQL DB).
+            Verified below — if the session is genuinely absent we log
+            SESSION_MISSING with this message instead of a fake SUCCESS.
+            */
+            SET @ensure_error = ERROR_MESSAGE();
+        END CATCH;
+
+        /*
+        Verify the session is actually running; if not, log SESSION_MISSING and bail
+        */
+        IF @is_azure_sql_db = 1
+        BEGIN
+            SET @sql = N'
+            IF NOT EXISTS
+            (
+                SELECT
+                    1/0
+                FROM sys.dm_xe_database_sessions AS dxs
+                WHERE dxs.name = @session_name
+            )
+            BEGIN
+                SET @session_missing = 1;
+            END;';
+        END;
+        ELSE
+        BEGIN
+            SET @sql = N'
+            IF NOT EXISTS
+            (
+                SELECT
+                    1/0
+                FROM sys.dm_xe_sessions AS dxs
+                WHERE dxs.name = @session_name
+            )
+            BEGIN
+                SET @session_missing = 1;
+            END;';
+        END;
+
+        EXECUTE sys.sp_executesql
+            @sql,
+            N'@session_name sysname, @session_missing bit OUTPUT',
+            @session_name,
+            @session_missing OUTPUT;
+
+        IF @session_missing = 1
+        BEGIN
+            INSERT INTO
+                config.collection_log
+            (
+                collector_name,
+                collection_status,
+                duration_ms,
+                error_message
+            )
+            VALUES
+            (
+                N'deadlock_xml_collector',
+                N'SESSION_MISSING',
+                DATEDIFF(MILLISECOND, @start_time, SYSDATETIME()),
+                ISNULL(NULLIF(@ensure_error, N''), N'XE session ' + @session_name + N' is not running and could not be created or started')
+            );
+
+            IF @debug = 1
+            BEGIN
+                RAISERROR(N'Deadlock XE session missing and could not be created: %s', 0, 1, @ensure_error) WITH NOWAIT;
+            END;
+
+            RETURN;
+        END;
+
         BEGIN TRANSACTION;
 
         /*
@@ -158,7 +340,7 @@ BEGIN
                         rb.ring_buffer
                     FROM @ring_buffer AS rb
                 ) AS rb
-                CROSS APPLY rb.ring_buffer.nodes(''RingBufferTarget/event[@name="xml_deadlock_report"]'') AS q(evt)
+                CROSS APPLY rb.ring_buffer.nodes(''RingBufferTarget/event[@name="database_xml_deadlock_report"]'') AS q(evt)
                 WHERE evt.value(''(@timestamp)[1]'', ''datetime2(7)'') >= @cutoff_time
                 AND NOT EXISTS
                 (
@@ -243,13 +425,16 @@ BEGIN
         END TRY
         BEGIN CATCH
             /*
-            Session doesn't exist or is not accessible
-            This is expected if XE setup hasn't been run
+            The session was verified running above, so a read failure here is a
+            real error — re-raise so the outer CATCH logs ERROR instead of this
+            run recording a misleading SUCCESS (#1086)
             */
             IF @debug = 1
             BEGIN
-                RAISERROR(N'Deadlock session not available: %s', 0, 1, @session_name) WITH NOWAIT;
+                RAISERROR(N'Deadlock ring buffer read failed for session: %s', 0, 1, @session_name) WITH NOWAIT;
             END;
+
+            THROW;
         END CATCH;
 
         /*