From bb9ff1b015d7ec47c4f04929743f0845f4ba9acc Mon Sep 17 00:00:00 2001
From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com>
Date: Mon, 15 Jun 2026 20:58:53 -0400
Subject: [PATCH 1/5] Release prep for v3.0.0

- Bump Version/AssemblyVersion/FileVersion/InformationalVersion to 3.0.0 in Dashboard, Lite, Installer, and Installer.Core. The Installer/Installer.Core AssemblyVersion bump is load-bearing: ScriptProvider.FilterUpgrades only applies an upgrade folder whose ToVersion <= the installer's own assembly version, so without it the 2.11.0-to-3.0.0 folder would be silently skipped.
- Rename upgrades/2.11.0-to-2.12.0 -> upgrades/2.11.0-to-3.0.0.
- Write the [3.0.0] CHANGELOG section: an Important note plus the gaps the release audit flagged (code-review hardening sweep #1093-#1108, the off-UI-thread perf overhaul #1116/#1121, object/index-level collection #1103, the Recommendations/Apply-Fix engine, #1085, #1092/#1122, #1096, #754, #749, #768), and fix the stale "2.11.0 -> 2.12.0" upgrade-script reference.

Validation: Installer unit tests 61/61; Installer integration tests 19/19 against live SQL2022 (idempotency, version detection, adversarial upgrade paths) at the 3.0.0 assembly version.

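The version gate described in the first bullet can be sketched roughly as follows. This is an illustrative Python model, not the installer's C# code: the function names, the `-to-` folder-name parsing, and the `from_version >= installed` lower bound are assumptions made for the example; only the `ToVersion <= installer assembly version` rule comes from the message above.

```python
def parse_version(text):
    """Parse '3.0.0' or '3.0.0.0' into a comparable (major, minor, patch) tuple."""
    return tuple(int(part) for part in text.split(".")[:3])


def filter_upgrades(folder_names, installed_version, installer_version):
    """Return the upgrade folders that would be applied, oldest first.

    Hypothetical model of the gate: a folder named 'X-to-Y' is kept only when
    Y <= the installer's own assembly version (the rule stated in the commit
    message) and X >= the installed version (an assumption for this sketch).
    """
    installed = parse_version(installed_version)
    ceiling = parse_version(installer_version)
    candidates = []
    for name in folder_names:
        from_text, to_text = name.split("-to-")
        candidates.append((parse_version(from_text), parse_version(to_text), name))
    candidates.sort()
    return [
        name
        for from_version, to_version, name in candidates
        if from_version >= installed and to_version <= ceiling
    ]


# With the installer still at 2.11.0.0 the renamed folder is skipped;
# after the bump to 3.0.0.0 it is selected.
print(filter_upgrades(["2.11.0-to-3.0.0"], "2.11.0", "2.11.0.0"))
print(filter_upgrades(["2.11.0-to-3.0.0"], "2.11.0", "3.0.0.0"))
```

Under these assumptions the skip is silent: the filter simply returns an empty list, which is why the assembly-version bump has to ship in the same change as the folder rename.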
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 CHANGELOG.md                                  | 35 +++++++++++++++++--
 Dashboard/Dashboard.csproj                    |  8 ++---
 Installer.Core/Installer.Core.csproj          |  8 ++---
 Installer/PerformanceMonitorInstaller.csproj  |  8 ++---
 Lite/PerformanceMonitorLite.csproj            |  8 ++---
 ..._extend_blocked_process_report_columns.sql |  0
 .../02_make_other_process_cpu_nullable.sql    |  0
 .../03_add_transaction_mutex_ignored_wait.sql |  0
 .../04_add_server_health_columns.sql          |  0
 .../upgrade.txt                               |  0
 10 files changed, 48 insertions(+), 19 deletions(-)
 rename upgrades/{2.11.0-to-2.12.0 => 2.11.0-to-3.0.0}/01_extend_blocked_process_report_columns.sql (100%)
 rename upgrades/{2.11.0-to-2.12.0 => 2.11.0-to-3.0.0}/02_make_other_process_cpu_nullable.sql (100%)
 rename upgrades/{2.11.0-to-2.12.0 => 2.11.0-to-3.0.0}/03_add_transaction_mutex_ignored_wait.sql (100%)
 rename upgrades/{2.11.0-to-2.12.0 => 2.11.0-to-3.0.0}/04_add_server_health_columns.sql (100%)
 rename upgrades/{2.11.0-to-2.12.0 => 2.11.0-to-3.0.0}/upgrade.txt (100%)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index ca8123a9..f467de85 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,7 +5,11 @@ All notable changes to this project will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
-## [Unreleased] +## [3.0.0] - 2026-06-15 + +### Important + +- **Major release — 2.11.0 → 3.0.0, no breaking changes.** This version rolls up a codebase-wide correctness and security hardening pass (the `code-review-*` series spanning the SQL schema, collectors, and views, the installer, the Lite and Dashboard services, and the shared libraries); a major UI-responsiveness overhaul that moves the data path off the WPF dispatcher in both apps; new object- and index-level collection (per-table / per-index size, growth, usage, and locking/contention); the rebuilt Recommendations / Apply Fix engine (advise-and-act, with safe and destructive fixes appliable behind informed two-sided consent); and a batch of smaller fixes and features. Nothing here is a breaking change — existing installations upgrade in place via `upgrades/2.11.0-to-3.0.0/` (typed blocked-process columns, a nullable host-CPU column, the `TRANSACTION_MUTEX` ignored wait, and new server-health columns), and the Dashboard and Lite apps auto-update over the top ### Fixed @@ -23,6 +27,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **Dashboard alert emails no longer re-fire after an app restart** — brings Dashboard `EmailAlertService` to parity with the Lite-side persistence introduced in [#981]. The cooldown is now seeded from the in-memory alert log (loaded from `alert_history.json` on startup) the first time each `{serverId}:{metricName}` key is evaluated - **Analysis-finding notification cooldowns now persist across restarts on both Lite and Dashboard** — the per-finding re-notification cooldown in `AnalysisNotificationService` lived only in memory, so restarting either app cleared it and a finding that had just fired (and entered its `AnalysisNotifyCooldownMinutes` cooldown) could re-notify immediately. 
The cooldown now seeds lazily from the alert log (Lite: `config_alert_log`; Dashboard: `alert_history.json`) on first lookup per finding, mirroring the email-cooldown pattern from #981. Entries past 2× the cooldown window are pruned on each notify cycle so the dictionary stays bounded - **Data Retention job no longer fails with `xp_delete_file` error 22049** ([#972]) — the trace-file cleanup added in v2.11.0 passed a wildcard path to `xp_delete_file`, raising an uncatchable `Msg 22049` that failed the entire `PerformanceMonitor - Data Retention` Agent job on every run once any `Monitor_LongQueries_*.trc` files existed. `xp_delete_file` also cannot delete `.trc` files at all — it only accepts SQL Server backup files and Maintenance Plan report files — so that cleanup step has been removed from `config.data_retention` +- **Codebase-wide correctness and security hardening pass** — a broad review (the `code-review-*` PR series, #1093–#1108) fixed defects across the stack without changing behavior users depend on: + - **Shared libraries** — defects in the extracted `PerformanceMonitor.Analysis` / `.PlanAnalysis` / `.Ui` / `.Common` code + - **Dashboard** — timezone and CPU-path defects + - **Lite** — services, analysis, and UI defects, plus `ArchiveService` data-loss / corruption fixes + - **Installer** — CLI version-detection and failure-handling + - **SQL** — high-impact collector defects, view / analyzer crashes (including a Linux CPU gap), and schema / job / validation defects +- **FinOps no longer recommends downgrading to Standard Edition on a server running Availability Groups** ([#1085]) — an Enterprise instance with no TDE was told to "review whether Standard Edition would meet workload requirements" even when it was running AGs, which Standard supports only in the limited Basic Availability Groups form. 
FinOps now counts advanced (non-basic) AGs via `sys.availability_groups.basic_features` and, when any are present, appends a caveat naming the AG count and Standard's Basic-AG limitations (two replicas, one database per group, no readable secondary), retitles the finding to "review Availability Group requirements before downgrading," and lowers its confidence — the savings estimate is retained. The Dashboard, which previously had no AG awareness at all, was brought to full parity and also gains the [#980] AG-secondary informational note it never received +- **Server-tab alert badge is now clearable** ([#1092]) — the red alert badge on a server tab could previously only be cleared through an undiscoverable right-click menu. Left-clicking the badge now acknowledges and clears it (hand cursor, *"Click to dismiss · Right-click for options"* tooltip), and Alert History **Dismiss All** clears the matching server badge(s) too. A follow-up ([#1122]) closed the last gap: **Dismiss Selected** now also clears the badge for every distinct server represented in the dismissed rows. On the Dashboard, which already had richer auto-resolving badges, this added the missing left-click affordance for parity +- **Long-running-query alert no longer constantly trips on CDC capture jobs** ([#1096]) — the Change Data Capture capture job runs as a continuous SQL Agent session (`sp_MScdc_capture_job` → `sp_cdc_scan`), so its elapsed time permanently exceeded the long-running-query threshold and the alert fired non-stop; none of the four existing `wait_type`-based exclusions caught it. Both apps gain an **Exclude CDC capture jobs** toggle (default on) that identifies the capture session server-side by decoding its Agent `program_name` to a `job_id` and matching `msdb.dbo.cdc_jobs` (`job_type = 'capture'`), falling back to a whole-text match when msdb is unreadable or `cdc_jobs` doesn't yet exist — so it stays CDC-specific and never hides unrelated Agent jobs. 
Dashboard filters the live DMV query inline; Lite computes a per-row `is_cdc_capture` flag in the collector (its snapshots store only statement-level text) and filters on read ### Changed @@ -30,15 +43,24 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`PlanIconMapper` split to break a shared-library WPF dependency** — `ShowPlanParser` calls `PlanIconMapper.GetIconName` to populate `PlanNode.IconName` during parse, but the rest of `PlanIconMapper` is WPF-bound (`GetIcon` returns `BitmapImage`). The pure-data half (the `IconMap` dictionary + the `GetIconName` lookup) is now `IconNameMapper` inside `PerformanceMonitor.PlanAnalysis`. The per-app `PlanIconMapper.GetIcon(string iconName)` is unchanged; the per-app `GetIconName` forwarder is gone (`ShowPlanParser` calls `IconNameMapper.GetIconName` directly, and there were no other callers) - **Analysis engine extracted to shared library `PerformanceMonitor.Analysis`** — the previously duplicated `FactScorer`, `RelationshipGraph`, `InferenceEngine`, `AnalysisModels`, `IFactCollector`, `IPlanFetcher`, and `BlockingChainReconstructor` pairs across `Dashboard/Analysis/` and `Lite/Analysis/` are now one copy referenced by both apps and both test projects via ``. The new library targets `net10.0` (no WPF) so it can be picked up by future non-WPF consumers without a multi-target rewrite. The `blocking-reconstructor-sync-checker` agent is retired (no copies to sync). `BlockingChainReconstructorTests` ported to `Dashboard.Tests` (10 tests) as part of the same change — Dashboard now exercises the same reconstruction coverage as Lite. `AnalysisService` and the DB-bound adapters (`*FactCollector`, `*DrillDownCollector`, `*FindingStore`, `*AnomalyDetector`, `*BaselineProvider`, `*PlanFetcher`) stay per-app because they bind to `DuckDBConnection` vs `SqlConnection`. 
`PlanAnalyzer` and its `planalyzer-sync-checker` are outside this extraction's scope and stay - **Trace files are now bounded at the source** ([#972]) — `collect.trace_management_collector` creates the long-query trace with a rollover file-count cap (`@filecount`, via the new `@max_files` parameter, default 5), so SQL Server itself deletes the oldest `.trc` file as the trace rolls. The scheduled collector also now issues `START` instead of `RESTART`: it keeps one trace running rather than tearing it down and spawning a fresh timestamped trace — and a fresh batch of orphaned files — every cycle -- **Blocked-process reports expose blocker-side fields as typed columns** — `collect.blocking_BlockedProcessReport` now carries `blocking_spid`, `blocking_last_tran_started`, `blocking_status`, `blocked_sql_text`, and `blocking_sql_text` populated at insert time from `blocked_process_report_xml`. Existing rows are backfilled idempotently by the 2.11.0 → 2.12.0 upgrade script +- **Blocked-process reports expose blocker-side fields as typed columns** — `collect.blocking_BlockedProcessReport` now carries `blocking_spid`, `blocking_last_tran_started`, `blocking_status`, `blocked_sql_text`, and `blocking_sql_text` populated at insert time from `blocked_process_report_xml`. Existing rows are backfilled idempotently by the 2.11.0 → 3.0.0 upgrade script - **Blocking-chain reconstruction now reads typed columns from `collect.blocking_BlockedProcessReport`** instead of re-parsing `blocked_process_report_xml` on every analysis cycle — eliminates up to 5000 `XElement.Parse` calls per `BLOCKING_CHAIN` fact collection. 
The Dashboard `BlockedProcessXmlParser` has been deleted; the Lite collection-time parser is unchanged (Lite has no SQL-side staging table and still parses once at collect time) - **Analysis minimum-data threshold lowered to 24 hours** — `Lite/Analysis/AnalysisService.cs` and `Dashboard/Analysis/AnalysisService.cs` now require 24 hours of collected data before analysis runs, down from 72. Validated empirically as sufficient for fraction-of-period calculations, so a fresh install starts producing findings after one day instead of three +- **Major UI-responsiveness overhaul — the data path now runs off the WPF dispatcher in both apps** — DuckDB.NET is synchronous, so in Lite `await _dataService.X()` completed on the calling (UI) thread, and a single DuckDB connection open under load is ~750 ms; the result was multi-hundred-millisecond to multi-second UI freezes on the per-minute pipeline, refreshes, and alert checks. The fix moves the work onto pool threads (`Task.Run`) across the board: Lite's background collect/checkpoint/archive pipeline, the full-refresh fan-out, the 60-second sub-tab refreshes, picker charts, the overview sweep, timeline lanes, connect, and the FinOps and Recommendations reads; the Dashboard's `ServerTab` row materialization and its execution-plan parse/analyze; and — found later by wall-clock thread-time profiling under a HammerDB TPC-C load — the alert-check / overview-sweep DuckDB queries that were still on the dispatcher ([#1121], which cut the worst measured dispatcher stall from ~1.2 s to under 10 ms). Lite also skips the heavy refresh for non-selected (hidden) server tabs, the shared crosshair/hover hot path was made cheaper for both apps, and Dashboard timers gained re-entrancy guards. 
A cluster of long-session memory leaks that progressively degraded responsiveness was fixed alongside ([#1116]): an Alerts-tab `DispatcherTimer` that kept ticking after the tab closed, unbounded per-run alert-key dictionaries, a tray-service handler re-subscribed on every theme change, and plan-viewer controls leaked through a static theme event. (The related sleep/wake blank-window and software-rendering fix is tracked separately under [#1050] above.) Net effect: the UI stays responsive under heavy collection and query load ### Added - **`tools/Remove-OrphanedTraceFiles.ps1`** ([#972]) — one-time cleanup script for `Monitor_LongQueries_*.trc` files left on disk by versions through 2.11.0. Run it on the SQL Server host; it skips files belonging to a running trace and files that are in use - **`FactAdvice` and `FactRemediation` in `PerformanceMonitor.Analysis`** — new shared-library data layer that maps every scorable fact-key to a Headline / Investigation / Remediation advice block, plus a copy-paste-ready `sp_query_store_force_plan` T-SQL generator for `PLAN_REGRESSION` findings (gated to that single fact-key in v1; `PARAMETER_SENSITIVITY` deliberately does not generate plan-force T-SQL because forcing locks in the wrong plan for some parameter values). Drill-down collectors now also project `best_plan_id` (via `MAX(plan_id)` in the plan-dedup CTE) so the generated EXEC carries the integer ID `sp_query_store_force_plan` actually accepts, not just the hash. Lite's `BuildContext` now mirrors Dashboard's — both apps emit a Diagnosis card at `Details[0]` carrying Story / Severity / Notify threshold / Confidence / Facts / Database / Window before the drill-down items. 
The rendering surfaces that consume this data (email HTML, plain-text email, Teams + Slack webhook payloads, in-app Alert Details window) ship in a separate follow-up PR - +- **Object- and index-level collection: sizes, growth, usage, and locking/contention** ([#1103]) — both apps gain a daily collector that snapshots per-table and per-index storage (`sys.dm_db_partition_stats`), index usage (`sys.dm_db_index_usage_stats` — seeks/scans/lookups/updates), and per-object locking/latch/escalation (`sys.dm_db_index_operational_stats` — row-lock waits, page-latch waits, lock-escalation attempts), all from stock DMVs verified stable from SQL Server 2016 through 2025 and on Azure SQL DB / Managed Instance. On-prem and MI iterate user databases (honoring the collector exclusion list); Azure SQL DB uses its single-database branch. Dashboard collects into `collect.index_object_stats` (`install/55_collect_index_object_stats.sql`, scheduled daily with 90-day retention picked up by dynamic retention); Lite collects into DuckDB with archival registered. Three new FinOps sub-tabs in each app — **Object Sizes & Growth** (per-table size plus 7-/30-day growth and daily rate), **Index Usage** (Unused / Write-only / Active classification), and **Locking & Contention** (top-contended indexes) — plus MCP read tools (`get_table_index_sizes`, `get_index_usage`, `get_object_locking`). Because the daily snapshots are cumulative, the new **object-growth** (`ANOMALY_OBJECT_GROWTH`, a table grew >100 MB and ≥20% day-over-day) and **lock-contention** (`ANOMALY_OBJECT_CONTENTION`, an index gained ≥60s of new row-lock wait) alerts are delta-based (the two most recent snapshots, reset-guarded) and flow through the existing anomaly → `AnalysisNotificationService` pipeline. 
Thresholds are fixed constants in this release; making them user-configurable is a follow-up +- **Recommendations / Apply Fix engine (advise-and-act rebuild)** — the analysis engine's advisory output is now a first-class **Recommendations** surface in both apps, alongside Critical Issues. Each finding renders as a card with a plain-language Headline / Investigation / Remediation block (from the `FactAdvice`/`FactRemediation` shared-library data layer) and routes the reader into the relevant in-app view or MCP tool instead of dumping raw DMV queries. Advise-only recommendations include server-config advisories (MAXDOP / cost threshold for parallelism / max server memory), per-database config (autogrowth, percent-growth on large files), server-health facts (Lock Pages in Memory, Instant File Initialization, recent memory dumps), and missing-index / plan-warning recommendations mined from collected plans — missing-index `CREATE` statements are surfaced as copy-paste text. A subset is **appliable in place behind informed, two-sided consent**: always-safe `ALTER DATABASE SET` config fixes, and the destructive **RCSI** (enable read-committed snapshot) and **clear cached plan** (`DBCC FREEPROCCACHE` / unforce) fixes, which gate behind an acknowledge-each-risk dialog that quantifies both the risk of changing and the risk of doing nothing from the finding's own monitoring data. The advice and remediation T-SQL also render across every notification surface — email (HTML and plain text), Teams and Slack webhook payloads, and the in-app Alert Details window — and through the `analyze_server` and `get_analysis_findings` MCP tools +- **Low volume free-space alert** ([#754]) in both apps — a new **Volume Free Space** alert (default on) fires when a monitored server's disk volume drops below a free-space percentage **or** a fixed GB amount (set either threshold to `0` to disable that dimension; if both are set, either breach fires). 
It reads the per-volume size/free data already collected by the database-size collector, evaluates every volume on the server, and fires one alert per server naming the worst (lowest-free) volume with up to five breaching volumes in the context — with the same cooldown, mute, alert-history, tray, and email plumbing as the existing tempdb-space alert. Defaults: 10% / 5 GB. Azure SQL DB has no volume data, so the alert never fires there +- **Failed SQL Agent job alert** ([#749]) in both apps — complements the existing job-duration alerts with a **Failed Agent Job** alert (default on) that issues a live `msdb.dbo.sysjobhistory` query at alert-check time for job-outcome rows (`step_id = 0`, `run_status = 0`) that failed within a configurable look-back window (default 60 minutes). The read degrades gracefully when the login lacks msdb / `SQLAgentReaderRole` access (returns empty, never faults the alert cycle) and is skipped entirely on Azure SQL DB, which has no SQL Agent +- **Installer: optional custom data/log file locations** ([#768]) — two optional CLI flags, `--data-path` and `--log-path` (both `--flag VALUE` and `--flag=VALUE` forms accepted), place the `PerformanceMonitor` database's `.mdf`/`.ldf` on specific server-side volumes at install time; an omitted flag falls back to the instance default path as before. The paths apply only on first creation (the create block is guarded by `IF DB_ID(N'PerformanceMonitor') IS NULL`), and Azure SQL Managed Instance ignores them. 
The path is validated and escaped (control characters and the dangerous filename characters are rejected; single quotes are doubled in both the C# injection layer and the dynamic `CREATE DATABASE`) because a data-file `FILENAME` literal cannot be parameterized + +[#749]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/749 +[#754]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/754 +[#768]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/768 [#972]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/972 [#979]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/979 [#980]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/980 @@ -46,8 +68,15 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 [#1012]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1012 [#1035]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1035 [#1050]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1050 +[#1085]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1085 [#1086]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1086 [#1091]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1091 +[#1092]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1092 +[#1096]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1096 +[#1103]: https://github.com/erikdarlingdata/PerformanceMonitor/issues/1103 +[#1116]: https://github.com/erikdarlingdata/PerformanceMonitor/pull/1116 +[#1121]: https://github.com/erikdarlingdata/PerformanceMonitor/pull/1121 +[#1122]: https://github.com/erikdarlingdata/PerformanceMonitor/pull/1122 ## [2.11.0] - 2026-05-19 diff --git a/Dashboard/Dashboard.csproj b/Dashboard/Dashboard.csproj index 583cffe2..3b8a7a57 100644 --- a/Dashboard/Dashboard.csproj +++ b/Dashboard/Dashboard.csproj @@ -7,10 +7,10 @@ PerformanceMonitorDashboard.Program PerformanceMonitorDashboard SQL Server 
Performance Monitor Dashboard - 2.11.0 - 2.11.0.0 - 2.11.0.0 - 2.11.0 + 3.0.0 + 3.0.0.0 + 3.0.0.0 + 3.0.0 Darling Data, LLC Copyright © 2026 Darling Data, LLC EDD.ico diff --git a/Installer.Core/Installer.Core.csproj b/Installer.Core/Installer.Core.csproj index 89d130d5..adb1ed46 100644 --- a/Installer.Core/Installer.Core.csproj +++ b/Installer.Core/Installer.Core.csproj @@ -7,10 +7,10 @@ Installer.Core Installer.Core SQL Server Performance Monitor Installer Core - 2.11.0 - 2.11.0.0 - 2.11.0.0 - 2.11.0 + 3.0.0 + 3.0.0.0 + 3.0.0.0 + 3.0.0 Darling Data, LLC Copyright (c) 2026 Darling Data, LLC true diff --git a/Installer/PerformanceMonitorInstaller.csproj b/Installer/PerformanceMonitorInstaller.csproj index 4f66ac64..d8aea42b 100644 --- a/Installer/PerformanceMonitorInstaller.csproj +++ b/Installer/PerformanceMonitorInstaller.csproj @@ -20,10 +20,10 @@ PerformanceMonitorInstaller SQL Server Performance Monitor Installer - 2.11.0 - 2.11.0.0 - 2.11.0.0 - 2.11.0 + 3.0.0 + 3.0.0.0 + 3.0.0.0 + 3.0.0 Darling Data, LLC Copyright © 2026 Darling Data, LLC Installation utility for SQL Server Performance Monitor - Supports SQL Server 2016-2025 diff --git a/Lite/PerformanceMonitorLite.csproj b/Lite/PerformanceMonitorLite.csproj index 718b489f..5238b1ea 100644 --- a/Lite/PerformanceMonitorLite.csproj +++ b/Lite/PerformanceMonitorLite.csproj @@ -8,10 +8,10 @@ PerformanceMonitorLite PerformanceMonitorLite SQL Server Performance Monitor Lite - 2.11.0 - 2.11.0.0 - 2.11.0.0 - 2.11.0 + 3.0.0 + 3.0.0.0 + 3.0.0.0 + 3.0.0 Darling Data, LLC Copyright © 2026 Darling Data, LLC Lightweight SQL Server performance monitoring - no installation required on target servers diff --git a/upgrades/2.11.0-to-2.12.0/01_extend_blocked_process_report_columns.sql b/upgrades/2.11.0-to-3.0.0/01_extend_blocked_process_report_columns.sql similarity index 100% rename from upgrades/2.11.0-to-2.12.0/01_extend_blocked_process_report_columns.sql rename to 
upgrades/2.11.0-to-3.0.0/01_extend_blocked_process_report_columns.sql diff --git a/upgrades/2.11.0-to-2.12.0/02_make_other_process_cpu_nullable.sql b/upgrades/2.11.0-to-3.0.0/02_make_other_process_cpu_nullable.sql similarity index 100% rename from upgrades/2.11.0-to-2.12.0/02_make_other_process_cpu_nullable.sql rename to upgrades/2.11.0-to-3.0.0/02_make_other_process_cpu_nullable.sql diff --git a/upgrades/2.11.0-to-2.12.0/03_add_transaction_mutex_ignored_wait.sql b/upgrades/2.11.0-to-3.0.0/03_add_transaction_mutex_ignored_wait.sql similarity index 100% rename from upgrades/2.11.0-to-2.12.0/03_add_transaction_mutex_ignored_wait.sql rename to upgrades/2.11.0-to-3.0.0/03_add_transaction_mutex_ignored_wait.sql diff --git a/upgrades/2.11.0-to-2.12.0/04_add_server_health_columns.sql b/upgrades/2.11.0-to-3.0.0/04_add_server_health_columns.sql similarity index 100% rename from upgrades/2.11.0-to-2.12.0/04_add_server_health_columns.sql rename to upgrades/2.11.0-to-3.0.0/04_add_server_health_columns.sql diff --git a/upgrades/2.11.0-to-2.12.0/upgrade.txt b/upgrades/2.11.0-to-3.0.0/upgrade.txt similarity index 100% rename from upgrades/2.11.0-to-2.12.0/upgrade.txt rename to upgrades/2.11.0-to-3.0.0/upgrade.txt From 36c6842ee327d7605a4f632728f6b01bedf5d9df Mon Sep 17 00:00:00 2001 From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com> Date: Mon, 15 Jun 2026 21:12:00 -0400 Subject: [PATCH 2/5] Release prep: README sync for v3.0.0 (index_object_stats collector + FinOps object/index analysis) Lite collector table/count 24->25 with the new index_object_stats row; Dashboard collector count 32->33; FinOps description gains the per-object table/index size/growth/usage/locking analysis (#1103). 
Co-Authored-By: Claude Opus 4.8 (1M context) --- README.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/README.md b/README.md index 0a288075..0e9e5c23 100644 --- a/README.md +++ b/README.md @@ -56,7 +56,7 @@ All release binaries are digitally signed via [SignPath](https://signpath.io) ## What You Get -🔍 **32 specialized T-SQL collectors** running on configurable schedules with named presets (Off, Aggressive, Balanced, Low-Impact) — wait stats, query performance, blocking chains, deadlock graphs, memory grants, file I/O, tempdb, perfmon counters, FinOps/capacity, and more. Query text and execution plan collection can be disabled per-collector for sensitive environments. Switch presets with a pair of SQL Agent jobs to get quiet-hours / overnight windows without writing any code. +🔍 **33 specialized T-SQL collectors** running on configurable schedules with named presets (Off, Aggressive, Balanced, Low-Impact) — wait stats, query performance, blocking chains, deadlock graphs, memory grants, file I/O, tempdb, perfmon counters, FinOps/capacity, and more. Query text and execution plan collection can be disabled per-collector for sensitive environments. Switch presets with a pair of SQL Agent jobs to get quiet-hours / overnight windows without writing any code. 🚨 **Real-time alerts** for blocking, deadlocks, and high CPU — system tray notifications, styled HTML emails with full XML attachments, and webhook notifications for external integrations @@ -104,7 +104,7 @@ Data starts flowing within 1–5 minutes. That's it. No installation on your ser ### Lite Collectors -24 collectors run on independent, configurable schedules: +25 collectors run on independent, configurable schedules: | Collector | Default | Source | |---|---|---| @@ -128,6 +128,7 @@ Data starts flowing within 1–5 minutes. That's it. 
No installation on your ser | running_jobs | 5 min | `msdb` job history with duration vs avg/p95 | | database_size_stats | 15 min | `sys.master_files` + `FILEPROPERTY` + `dm_os_volume_stats` | | server_properties | 15 min | `SERVERPROPERTY()` hardware and licensing metadata | +| index_object_stats | Daily | `sys.dm_db_partition_stats` + `sys.dm_db_index_usage_stats` + `sys.dm_db_index_operational_stats` | | server_config | On connect | `sys.configurations` | | database_config | On connect | `sys.databases` | | database_scoped_config | On connect | Database-scoped configurations | @@ -254,7 +255,7 @@ ORDER BY collection_time DESC; ### What Gets Installed - **PerformanceMonitor database** with collection tables and reporting views -- **32 collector stored procedures** for gathering metrics (including SQL Agent job monitoring) +- **33 collector stored procedures** for gathering metrics (including SQL Agent job monitoring) - **Configurable collection** — query text and execution plan capture can be disabled per-collector via `config.collection_schedule` (`collect_query`, `collect_plan` columns) for sensitive or high-volume environments - **Delta framework** for calculating per-second rates from cumulative DMVs - **Community dependencies:** sp_WhoIsActive, sp_HealthParser, sp_HumanEventsBlockViewer, sp_BlitzLock @@ -369,7 +370,7 @@ Plus a NOC-style landing page with server health cards (green/yellow/red severit | **Blocking** | Blocking/deadlock trends, blocked process reports, deadlock history | | **Perfmon** | Selectable SQL Server performance counters over time | | **Configuration** | Server configuration, database configuration, scoped configuration, trace flags | -| **FinOps** | Utilization & provisioning analysis, database resource breakdown, storage growth (7d/30d), idle database detection, index analysis via sp_IndexCleanup, application connections, server inventory, cost optimization recommendations (enterprise feature audit, CPU/memory right-sizing, 
compression savings, dormant databases, dev/test detection), column-level filtering on all grids | +| **FinOps** | Utilization & provisioning analysis, database resource breakdown, storage growth (7d/30d), idle database detection, index analysis via sp_IndexCleanup, per-object table/index size, growth, usage, and locking/contention analysis, application connections, server inventory, cost optimization recommendations (enterprise feature audit, CPU/memory right-sizing, compression savings, dormant databases, dev/test detection), column-level filtering on all grids | Both editions feature auto-refresh, configurable time ranges, chart drill-down to Active Queries, right-click CSV export, system tray integration, dark and light themes, and timezone display options (server time, local time, or UTC). From 5892a74d9a600bca295db68313e362ab7df65b71 Mon Sep 17 00:00:00 2001 From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com> Date: Mon, 15 Jun 2026 23:31:44 -0400 Subject: [PATCH 3/5] Fix: blocked-process & deadlock XML processors no longer loop on un-parseable events The second-phase parsers (collect.process_blocked_process_xml -> sp_HumanEventsBlockViewer, collect.process_deadlock_xml -> sp_BlitzLock) only marked a captured event processed when the parse produced >= 1 row. Events that legitimately parse to zero rows -- a self-block or non-lock wait (e.g. a memory-grant RESOURCE_SEMAPHORE wait that tripped blocked_process_threshold, reported as a session blocking itself), or an unparseable deadlock graph -- were never marked, so every cycle re-ran the CPU-intensive parser over the same dead events and re-logged a perpetual NO_RESULTS; the staging table never drained. Both processors now mark events processed after any clean parse run and log SUCCESS. The XACT_STATE()/CATCH paths still roll back and retry genuine failures. 
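The change in marking behavior can be modeled with a small sketch. It is written in Python for illustration only; the real processors are T-SQL procedures, and the `is_processed` field, the `process_batch` helper, and the `ERROR` status are hypothetical names introduced here. Only `SUCCESS` and the old `NO_RESULTS` loop come from the message above.

```python
def process_batch(staged_events, parse):
    """Run the parser over unprocessed events and mark them on any clean run.

    `parse` stands in for the second-phase parser. It may legitimately return
    zero rows (for example, a self-block), which still counts as a clean run.
    """
    pending = [event for event in staged_events if not event["is_processed"]]
    try:
        rows = parse(pending)
    except Exception:
        # Genuine failure: leave the events unmarked so the next cycle retries.
        return "ERROR", []
    # Clean run: mark every event in the batch, even when nothing was parsed.
    for event in pending:
        event["is_processed"] = True
    return "SUCCESS", rows


events = [{"id": 1, "is_processed": False}, {"id": 2, "is_processed": False}]
status, rows = process_batch(events, lambda batch: [])
print(status, rows, [event["is_processed"] for event in events])
```

The design point is that "zero rows" and "failed" are different outcomes: the former drains the staging table, while the latter keeps the events for a retry.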
The blocked-process processor's half-open parse window (event_time < @end_date) also dropped single-timestamp batches (MIN = MAX, the common case when a monitor loop emits all reports at one instant); the upper bound is padded +1s to match the deadlock processor, which already did this. Validated: sql2017's real 11-event self-block backlog drains (SUCCESS, no re-loop); sql2025's 733 real chains + 81k parsed deadlocks confirm the parse path is unchanged. Regression test tools/test_blocked_process_processor.sql (real self-block + two-session samples at a single timestamp) passes all assertions. Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 1 + install/23_process_blocked_process_xml.sql | 92 ++++++++++++------- install/25_process_deadlock_xml.sql | 54 ++++++----- tools/test_blocked_process_processor.sql | 100 +++++++++++++++++++++ 4 files changed, 191 insertions(+), 56 deletions(-) create mode 100644 tools/test_blocked_process_processor.sql diff --git a/CHANGELOG.md b/CHANGELOG.md index f467de85..eb511c0e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -17,6 +17,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **Lite: blocking and deadlock alerts no longer re-fire for the same events every cooldown** ([#1091]) — the overview alert engine treated the blocking and deadlock counts as a level: each check compared the rolling **1-hour** count against the threshold, so a single deadlock (or blocked-process report) kept the count above the threshold for the whole hour it lingered in the window, and the alert re-fired every cooldown (the reporter saw the same "2 deadlocks in the last hour" notification every five minutes for an hour). The Dashboard already edge-triggers off a delta; Lite now does too. Both alerts are gated by a new `RollingCountAlertGate` that fires only when the rolling count climbs above the count recorded at the last fired alert — a genuinely new event. 
The watermark decays as old events age out of the window (so a later rise re-alerts), resets when the window empties, and advances only when an alert actually fires (so an event arriving during a cooldown is reported once the cooldown elapses rather than being swallowed). Covered by `RollingCountAlertGateTests` - **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover - **Dashboard: blocking/deadlock XE sessions self-heal, Azure SQL DB sessions are actually created, and a missing session raises a Capture Down alert** — same silent-failure family as [#1086], worse on the Dashboard side. 
(1) The server-scoped sessions were created once at install and never re-ensured: if later stopped or dropped, `collect.blocked_process_xml_collector` and `collect.deadlock_xml_collector` swallowed the missing-session error and logged `SUCCESS` with zero rows forever. Both procs now ensure (create/start) the session at the top of every run. (2) On Azure SQL DB, the code comments claimed the database-scoped sessions were "auto-created by the collection procedures" — nothing anywhere created them, so blocking/deadlock capture was 100% non-functional on Azure SQL DB; the procs now create and start them (`database_xml_deadlock_report` for deadlocks — the Azure read also filtered on the wrong event name and would have returned nothing even with a session present). (3) Honest logging: when the session is genuinely absent and can't be created (typically missing `ALTER ANY EVENT SESSION` on-prem / `CREATE ANY DATABASE EVENT SESSION` on Azure SQL DB), the run logs `SESSION_MISSING` with the real error instead of `SUCCESS`. (4) The alert engine reads that status and raises a **Capture Down** alert through the standard pipeline — snoozable tray notification, email, webhook, alert history, cooldown, and mute — with a **Capture Restored** clear when the session comes back. Note: on Azure SQL DB the blocked-process *threshold* cannot be set via `sp_configure` and Microsoft documents no default, so the blocked-process session may exist yet capture nothing there; deadlock capture has no such dependency +- **Blocked-process and deadlock XML processors no longer loop on un-parseable events** — the second-phase parsers (`collect.process_blocked_process_xml` → `sp_HumanEventsBlockViewer`, and `collect.process_deadlock_xml` → `sp_BlitzLock`) only marked a captured event processed when the parse produced at least one row. Events that legitimately yield zero — a *self-block* or non-lock wait (e.g. 
a memory-grant `RESOURCE_SEMAPHORE` wait that tripped `blocked_process_threshold`, which SQL Server reports as a session blocked by itself), or a deadlock graph the parser can't reconstruct — were never marked, so every collection cycle re-ran the CPU-intensive parser over the same dead events and re-logged a perpetual `NO_RESULTS` while the staging table never drained. Both processors now mark events processed after any clean parse run and log `SUCCESS`; genuine parse failures still roll back and retry. Separately, the blocked-process processor's parse window was half-open (`event_time < @end_date`), so a batch of reports sharing one timestamp — the common case, since a blocked-process monitor loop emits every report at a single instant — fell outside `[MIN, MAX)` and was silently dropped; the upper bound is now inclusive (matching the deadlock processor). Covered by `tools/test_blocked_process_processor.sql` using real self-block and two-session samples - **Lite and Dashboard UI no longer goes blank or disappears after sleep/wake** ([#1050]) — closing a laptop lid (or locking the screen) and then resuming could leave the app running with no usable window: notifications kept firing but the window was gone from the desktop and taskbar, and relaunching showed an empty window until a full exit/restart. Two causes, both fixed. (1) WPF's GPU render thread can lose its rendering surface across a sleep/wake or RDP reconnect and never recover, leaving a live-but-blank window; both apps now use software rendering (`RenderOptions.ProcessRenderMode = SoftwareOnly`) to remove the GPU dependency — charts are unaffected because ScottPlot already renders via SkiaSharp. 
(2) When Windows turned the sleep-driven minimize into a hidden window, the minimize-to-tray logic left it hidden with no automatic way back; a new shared resume guard now restores the window from the tray on resume/unlock if it was visible beforehand (a window the user deliberately sent to the tray is left alone) - **"Silence All Alerts" now suppresses email too** ([#1035]) — right-clicking a monitored instance and choosing *Silence All Alerts* hid tray notifications and Alerts-tab badges, but two email paths ignored the silenced state and kept sending: connection up/down emails (*Server Unreachable* / *Server Restored*) and analysis-finding emails (the narrative findings from the analysis engine, which include CPU/memory/blocking stories). Only the threshold-alert path (High CPU, blocking, deadlocks, etc.) honored silencing. Both gaps are closed — a silenced server now produces no tray, email, or alert-history row from any path. The analysis path was the likely source of the reporter's "High CPU" email, since the threshold-based High CPU alert was already suppressed. The shared `AnalysisNotificationService` (used by Lite too) gains an optional per-server silence predicate; Lite has no silencing feature and passes none diff --git a/install/23_process_blocked_process_xml.sql b/install/23_process_blocked_process_xml.sql index 4fbd8084..db28a87f 100644 --- a/install/23_process_blocked_process_xml.sql +++ b/install/23_process_blocked_process_xml.sql @@ -145,9 +145,20 @@ BEGIN The proc expects local time inputs and converts to UTC internally Raw table event_time is UTC (from XE @timestamp attribute) */ + /* + Pad the upper bound by one second. 
sp_HumanEventsBlockViewer + filters the source table with a half-open window + (event_time < @end_date), so without the pad the newest + event(s) - and an entire batch sharing a single timestamp + (MIN = MAX, the common case because a blocked-process monitor + loop emits every report at one instant) - fall outside + [MIN, MAX) and are never parsed. The local/UTC basis itself + round-trips correctly: this proc shifts UTC event_time to local + and sp_HumanEventsBlockViewer shifts it back to UTC internally. + */ SELECT @start_date_local = DATEADD(MINUTE, @utc_offset_minutes, @start_date), - @end_date_local = DATEADD(MINUTE, @utc_offset_minutes, @end_date); + @end_date_local = DATEADD(SECOND, 1, DATEADD(MINUTE, @utc_offset_minutes, @end_date)); IF @debug = 1 BEGIN @@ -357,31 +368,46 @@ BEGIN ); END; - /* - Mark raw XML rows as processed - Only mark the rows in the date range we just processed - */ - UPDATE bx - SET bx.is_processed = 1 - FROM collect.blocked_process_xml AS bx - WHERE bx.is_processed = 0 - AND (@start_date IS NULL OR bx.event_time >= @start_date) - AND (@end_date IS NULL OR bx.event_time <= @end_date); - - SELECT - @rows_marked = ROWCOUNT_BIG(); + END; - IF @debug = 1 - BEGIN - RAISERROR(N'Marked %I64d raw XML rows as processed (%I64d parsed blocking events)', 0, 1, @rows_marked, @rows_parsed) WITH NOWAIT; - END; + /* + Mark the raw XML rows we handed to sp_HumanEventsBlockViewer as + processed - UNCONDITIONALLY after a clean parse run, not only when + @rows_parsed > 0. The viewer legitimately returns zero rows for + events that carry no lock-blocking chain between distinct sessions: + self-blocks, and non-lock GENERIC/NL waits such as a memory-grant + RESOURCE_SEMAPHORE wait that tripped blocked_process_threshold. + Those never parse, so gating the mark on @rows_parsed > 0 left them + unprocessed forever - the processor re-ran the viewer over the same + dead events every cycle and re-logged NO_RESULTS indefinitely. 
+ + Safe because the upper bound was padded (+1s) above, so the viewer's + half-open window actually covers every unprocessed event - we never + mark a row the viewer did not get to see. Genuine failures never + reach here either: the XACT_STATE() = -1 check and the CATCH block + both roll back without marking, so a real parse failure still + retries next run. Raw XML is retained (is_processed = 1, not + deleted); data-retention handles cleanup. event_time is UTC, + matching @start_date / @end_date. + */ + IF @rows_parsed = 0 AND @debug = 1 + BEGIN + RAISERROR(N'sp_HumanEventsBlockViewer produced 0 parsed results for %d XML event(s) - no lock-blocking chains (self-block / non-lock waits); events still marked processed', 0, 1, @rows_available) WITH NOWAIT; END; - ELSE + + UPDATE bx + SET bx.is_processed = 1 + FROM collect.blocked_process_xml AS bx + WHERE bx.is_processed = 0 + AND (@start_date IS NULL OR bx.event_time >= @start_date) + AND (@end_date IS NULL OR bx.event_time <= @end_date); + + SELECT + @rows_marked = ROWCOUNT_BIG(); + + IF @debug = 1 BEGIN - IF @debug = 1 - BEGIN - RAISERROR(N'sp_HumanEventsBlockViewer produced 0 parsed results for %d XML events - rows left unprocessed for retry', 0, 1, @rows_available) WITH NOWAIT; - END; + RAISERROR(N'Marked %I64d raw XML rows as processed (%I64d parsed blocking events)', 0, 1, @rows_marked, @rows_parsed) WITH NOWAIT; END; END; @@ -400,18 +426,18 @@ BEGIN VALUES ( N'process_blocked_process_xml', - CASE WHEN @rows_available = 0 THEN N'SUCCESS' - WHEN @rows_parsed > 0 THEN N'SUCCESS' - ELSE N'NO_RESULTS' - END, + /* + A clean parse run is SUCCESS even when it produced 0 blocking + chains: the events were processed and marked, they simply carried + no lock-blocking between distinct sessions (self-block / non-lock + waits). Genuine failures take the CATCH path and log ERROR. This + ends the perpetual NO_RESULTS this collector used to emit for + un-parseable-by-design events. 
+ */ + N'SUCCESS', @rows_available, DATEDIFF(MILLISECOND, @start_time, SYSDATETIME()), - CASE WHEN @rows_available > 0 AND @rows_parsed = 0 - THEN N'sp_HumanEventsBlockViewer returned 0 parsed results for ' - + CAST(@rows_available AS nvarchar(20)) - + N' XML events - rows left unprocessed for retry' - ELSE NULL - END + NULL ); IF @debug = 1 diff --git a/install/25_process_deadlock_xml.sql b/install/25_process_deadlock_xml.sql index 263b6b6c..23808747 100644 --- a/install/25_process_deadlock_xml.sql +++ b/install/25_process_deadlock_xml.sql @@ -232,33 +232,41 @@ BEGIN AND d.event_date <= @end_date_local OPTION(RECOMPILE); - IF @rows_parsed > 0 + /* + Mark the raw XML rows we handed to sp_BlitzLock as processed - + UNCONDITIONALLY after a clean parse run, not only when + @rows_parsed > 0. sp_BlitzLock legitimately returns zero rows for + deadlock graphs it cannot parse (malformed/partial graphs, or + non-deadlock events captured by the session). Gating the mark on + @rows_parsed > 0 left those unprocessed forever - the processor + re-ran sp_BlitzLock over the same dead events every cycle and + re-logged NO_RESULTS indefinitely. Genuine failures never reach + here: the XACT_STATE() = -1 check and the CATCH block both roll back + without marking, so a real parse failure still retries next run. Raw + XML is retained (is_processed = 1, not deleted); data-retention + handles cleanup. The +1s pad on @end_date above guarantees the parse + window covers every unprocessed event, so we never mark a row + sp_BlitzLock did not get to see. event_time is UTC, matching + @start_date / @end_date. 
+ */ + IF @rows_parsed = 0 AND @debug = 1 BEGIN - /* - Mark raw XML rows as processed - Only mark the rows in the date range we just processed - */ - UPDATE dx - SET dx.is_processed = 1 - FROM collect.deadlock_xml AS dx - WHERE dx.is_processed = 0 - AND (@start_date IS NULL OR dx.event_time >= @start_date) - AND (@end_date IS NULL OR dx.event_time <= @end_date); + RAISERROR(N'sp_BlitzLock produced 0 parsed results for %d XML event(s) - no parseable deadlock graphs; events still marked processed', 0, 1, @rows_available) WITH NOWAIT; + END; - SELECT - @rows_marked = ROWCOUNT_BIG(); + UPDATE dx + SET dx.is_processed = 1 + FROM collect.deadlock_xml AS dx + WHERE dx.is_processed = 0 + AND (@start_date IS NULL OR dx.event_time >= @start_date) + AND (@end_date IS NULL OR dx.event_time <= @end_date); - IF @debug = 1 - BEGIN - RAISERROR(N'Marked %I64d raw XML rows as processed (%I64d parsed deadlocks)', 0, 1, @rows_marked, @rows_parsed) WITH NOWAIT; - END; - END; - ELSE + SELECT + @rows_marked = ROWCOUNT_BIG(); + + IF @debug = 1 BEGIN - IF @debug = 1 - BEGIN - RAISERROR(N'sp_BlitzLock produced 0 parsed results for %d XML events - rows left unprocessed for retry', 0, 1, @rows_available) WITH NOWAIT; - END; + RAISERROR(N'Marked %I64d raw XML rows as processed (%I64d parsed deadlocks)', 0, 1, @rows_marked, @rows_parsed) WITH NOWAIT; END; END; diff --git a/tools/test_blocked_process_processor.sql b/tools/test_blocked_process_processor.sql new file mode 100644 index 00000000..477b5ebb --- /dev/null +++ b/tools/test_blocked_process_processor.sql @@ -0,0 +1,100 @@ +/* +================================================================================ +Regression test: collect.process_blocked_process_xml (blocked-process XML processor) +================================================================================ +Guards the fixes in install/23_process_blocked_process_xml.sql: + + 1. 
Mark-on-clean-run: events sp_HumanEventsBlockViewer legitimately yields 0 + rows for (self-blocks; non-lock GENERIC/NL waits such as a memory-grant + RESOURCE_SEMAPHORE wait that tripped blocked_process_threshold) are marked + is_processed = 1 instead of looping forever. The old @rows_parsed > 0 gate + left them unprocessed and re-logged NO_RESULTS every cycle. + + 2. Inclusive +1s window: a batch of events sharing one timestamp (MIN = MAX, + the common case - a monitor loop emits every report at one instant) is + parsed instead of dropped by the viewer's half-open event_time < @end_date. + +Both sample events are stamped at the SAME sentinel timestamp (2099) so the run +exercises the single-timestamp window path directly. + +USAGE: run against an IDLE PerformanceMonitor instance (no pending blocked-process +backlog) - e.g. a quiet test box. Aborts if a real backlog is present; cleans up +its own sentinel rows on exit. Requires sp_HumanEventsBlockViewer installed. + +Samples are REAL captures: the two-session block came from a live server (spid +96 blocked by spid 100 on a KEY lock); the self-block is +a real non-lock self-referential report (spid 61). inputbuf text was +replaced with a placeholder to keep the script compact. +================================================================================ +*/ +USE PerformanceMonitor; +GO +SET NOCOUNT ON; + +DECLARE + @sentinel datetime2(7) = '2099-12-31 23:59:59.000', + @fail integer = 0, + @msg nvarchar(400), + @raw_unprocessed integer, + @chains integer, + @self_chains integer, + @status nvarchar(50); + +/* Guard: an idle instance ensures the processor's auto-derive sees ONLY our rows. */ +IF EXISTS (SELECT 1 FROM collect.blocked_process_xml WHERE is_processed = 0) +BEGIN + RAISERROR('ABORT: this instance has pending unprocessed blocked-process rows. Run the regression on an idle test instance.', 16, 1); + RETURN; +END; + +/* Remove any sentinel residue from a previous aborted run. 
*/ +DELETE FROM collect.blocking_BlockedProcessReport WHERE event_time >= '2099-01-01'; +DELETE FROM collect.blocked_process_xml WHERE event_time >= '2099-01-01'; + +/* Inject the two sample events at the same sentinel timestamp. */ +INSERT INTO collect.blocked_process_xml (collection_time, event_time, blocked_process_xml, is_processed) +VALUES + (@sentinel, @sentinel, CONVERT(xml, N'505800071432565X64038860x00000001LOCKSELECT 1; /*regression-sample*/SELECT 1; /*regression-sample*/hammerdb_tpcc'), 0), + (@sentinel, @sentinel, CONVERT(xml, N'90460000000NL00x00000008GENERICSELECT 1; /*regression-sample*/SELECT 1; /*regression-sample*/'), 0); + +/* Run the processor (NULL dates -> auto-derive bounds from the unprocessed rows). */ +EXEC collect.process_blocked_process_xml @debug = 1; + +/* -------------------------------- assertions -------------------------------- */ +SELECT @raw_unprocessed = COUNT(*) FROM collect.blocked_process_xml WHERE event_time = @sentinel AND is_processed = 0; +SELECT @chains = COUNT(*) FROM collect.blocking_BlockedProcessReport WHERE event_time >= '2099-01-01' AND blocking_spid IS NOT NULL; +SELECT @self_chains = COUNT(*) FROM collect.blocking_BlockedProcessReport WHERE event_time >= '2099-01-01' AND blocking_spid = 61; +SELECT TOP 1 @status = collection_status FROM config.collection_log WHERE collector_name = 'process_blocked_process_xml' ORDER BY collection_time DESC; + +IF @raw_unprocessed = 0 + PRINT 'PASS 1: both sample events marked processed (self-block drains; no perpetual retry)'; +ELSE +BEGIN SET @fail += 1; PRINT 'FAIL 1: a sentinel raw row was left unprocessed (stuck-retry not fixed)'; END; + +IF @chains >= 1 + PRINT 'PASS 2: two-session block parsed to a chain (happy path + single-timestamp +1s window)'; +ELSE +BEGIN SET @fail += 1; PRINT 'FAIL 2: two-session block did NOT parse - inclusive-window or happy-path regression'; END; + +IF @self_chains = 0 + PRINT 'PASS 3: self-block produced no chain (correctly filtered, yet still 
marked processed)'; +ELSE +BEGIN SET @fail += 1; PRINT 'FAIL 3: self-block leaked a chain row (spid 61)'; END; + +IF @status = 'SUCCESS' + PRINT 'PASS 4: collection_log status = SUCCESS (not the old perpetual NO_RESULTS)'; +ELSE +BEGIN SET @fail += 1; SET @msg = 'FAIL 4: collection_log status = ' + ISNULL(@status, '(null)') + ' (expected SUCCESS)'; PRINT @msg; END; + +/* --------------------------------- cleanup ---------------------------------- */ +DELETE FROM collect.blocking_BlockedProcessReport WHERE event_time >= '2099-01-01'; +DELETE FROM collect.blocked_process_xml WHERE event_time >= '2099-01-01'; + +IF @fail = 0 + PRINT '=== RESULT: ALL PASS ==='; +ELSE +BEGIN + SET @msg = '=== RESULT: ' + CONVERT(nvarchar(10), @fail) + ' assertion(s) FAILED ==='; + RAISERROR(@msg, 16, 1); +END; +GO From b1446972530c93a12703e8c7e5470fcd92d1ebd6 Mon Sep 17 00:00:00 2001 From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com> Date: Tue, 16 Jun 2026 08:00:45 -0400 Subject: [PATCH 4/5] Lite: show "Azure SQL Database ()" instead of legacy "SQL Azure" edition SERVERPROPERTY('Edition') returns the legacy "SQL Azure" string on Azure SQL Database, which showed verbatim in the FinOps Server Inventory Edition column. Both the Server Inventory live query (LocalDataService.FinOps.ServerProperties) and the server_properties collector (RemoteCollectorService.ServerProperties) now map engine edition 5 to "Azure SQL Database ()" using DATABASEPROPERTYEX(DB_NAME(),'Edition') -- e.g. "Azure SQL Database (General Purpose)". On-prem editions are unchanged; the recommendation engine only string-matches "Enterprise", which neither value contains, so logic is unaffected. Verified live against an Azure SQL DB GP-serverless test instance: the Server Inventory Edition column now reads "Azure SQL Database (General Purpose)". 
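The mapping can be tried without an Azure connection by feeding it literal values. The sketch below is an assumption-laden stand-in: the `@samples` table variable and its rows are made up for illustration and replace the `SERVERPROPERTY` / `DATABASEPROPERTYEX` calls the real queries use.

```sql
/* Sketch only: @samples replaces the live SERVERPROPERTY / DATABASEPROPERTYEX calls. */
DECLARE @samples table
(
    engine_edition integer NOT NULL,
    server_edition nvarchar(128) NOT NULL,
    database_edition nvarchar(128) NULL
);

INSERT @samples (engine_edition, server_edition, database_edition)
VALUES
    (5, N'SQL Azure', N'GeneralPurpose'),
    (5, N'SQL Azure', N'Hyperscale'),
    (5, N'SQL Azure', NULL),
    (3, N'Enterprise Edition (64-bit)', NULL);

SELECT
    s.server_edition,
    s.database_edition,
    displayed_edition =
        CASE
            WHEN s.engine_edition = 5
            THEN N'Azure SQL Database'
               + ISNULL
                 (
                     N' ('
                   + CASE s.database_edition
                         WHEN N'GeneralPurpose' THEN N'General Purpose'
                         WHEN N'BusinessCritical' THEN N'Business Critical'
                         ELSE s.database_edition
                     END
                   + N')',
                     N''
                 )
            ELSE s.server_edition
        END
FROM @samples AS s;
```

With the default `CONCAT_NULL_YIELDS_NULL` setting, a NULL tier collapses the whole parenthesized suffix to an empty string, so the third row displays plain `Azure SQL Database`. Tiers without an explicit mapping (such as `Hyperscale`) pass through unchanged, and the non-Azure row keeps its original edition string.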
Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 1 + .../LocalDataService.FinOps.ServerProperties.cs | 15 ++++++++++++++- .../RemoteCollectorService.ServerProperties.cs | 14 +++++++++++++- 3 files changed, 28 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index eb511c0e..ad735985 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Fixed +- **Lite: Azure SQL Database shows its real product name in FinOps → Server Inventory** — the Edition column displayed the legacy `SQL Azure` value that `SERVERPROPERTY('Edition')` returns for Azure SQL DB; it now reads `Azure SQL Database ()` (e.g. `Azure SQL Database (General Purpose)`), derived from `DATABASEPROPERTYEX(DB_NAME(), 'Edition')`, for any engine-edition-5 instance. Applied to both the live inventory query and the stored `server_properties` value so every edition display is consistent; on-prem editions are unchanged - **Dashboard: "Deadlocks Cleared" no longer flaps right after every deadlock** ([#1091]) — deadlock detection is edge-triggered off a delta against the cumulative perfmon counter, so the check immediately after a deadlock saw a zero delta and fired a *"Deadlocks Cleared"* notification ~one interval (≈60s) after every *"Deadlock Detected"*. The alert now stays active and clears only once a deadlock-quiet window (1 hour) has elapsed since the last new deadlock, so the detect/clear pair lines up with Lite, whose rolling 1-hour count drains about an hour after the last deadlock. Each new deadlock resets the window. The clear message is now *"No deadlocks in the last hour"* (was *"No deadlocks since last check"*). 
Covered by `DeadlockAlertClearPolicyTests` - **Lite: blocking and deadlock alerts no longer re-fire for the same events every cooldown** ([#1091]) — the overview alert engine treated the blocking and deadlock counts as a level: each check compared the rolling **1-hour** count against the threshold, so a single deadlock (or blocked-process report) kept the count above the threshold for the whole hour it lingered in the window, and the alert re-fired every cooldown (the reporter saw the same "2 deadlocks in the last hour" notification every five minutes for an hour). The Dashboard already edge-triggers off a delta; Lite now does too. Both alerts are gated by a new `RollingCountAlertGate` that fires only when the rolling count climbs above the count recorded at the last fired alert — a genuinely new event. The watermark decays as old events age out of the window (so a later rise re-alerts), resets when the window empties, and advances only when an alert actually fires (so an event arriving during a cooldown is reported once the cooldown elapses rather than being swallowed). Covered by `RollingCountAlertGateTests` - **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. 
A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover diff --git a/Lite/Services/LocalDataService.FinOps.ServerProperties.cs b/Lite/Services/LocalDataService.FinOps.ServerProperties.cs index b05da0c5..e4279b30 100644 --- a/Lite/Services/LocalDataService.FinOps.ServerProperties.cs +++ b/Lite/Services/LocalDataService.FinOps.ServerProperties.cs @@ -71,7 +71,20 @@ FROM sys.dm_hadr_availability_replica_states AS ars END; SELECT - CONVERT(nvarchar(256), SERVERPROPERTY('Edition')), + /* Azure SQL DB reports the legacy 'SQL Azure' for SERVERPROPERTY('Edition'); + show the actual product name + service tier (e.g. 'Azure SQL Database + (General Purpose)') instead. */ + CASE + WHEN CONVERT(int, SERVERPROPERTY('EngineEdition')) = 5 + THEN N'Azure SQL Database' + + ISNULL(N' (' + + CASE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Edition')) + WHEN N'GeneralPurpose' THEN N'General Purpose' + WHEN N'BusinessCritical' THEN N'Business Critical' + ELSE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Edition')) + END + N')', N'') + ELSE CONVERT(nvarchar(256), SERVERPROPERTY('Edition')) + END, CONVERT(nvarchar(128), SERVERPROPERTY('ProductVersion')), CONVERT(nvarchar(128), SERVERPROPERTY('ProductLevel')), CONVERT(nvarchar(128), SERVERPROPERTY('ProductUpdateLevel')), diff --git a/Lite/Services/RemoteCollectorService.ServerProperties.cs b/Lite/Services/RemoteCollectorService.ServerProperties.cs index a367f788..78bf195e 100644 --- a/Lite/Services/RemoteCollectorService.ServerProperties.cs +++ b/Lite/Services/RemoteCollectorService.ServerProperties.cs @@ -40,7 +40,19 @@ private async Task CollectServerPropertiesAsync(ServerConnection server, 
Ca server_name = CONVERT(nvarchar(128), SERVERPROPERTY(N'ServerName')), edition = - CONVERT(nvarchar(128), SERVERPROPERTY(N'Edition')), + /* Azure SQL DB reports the legacy 'SQL Azure' for SERVERPROPERTY('Edition'); + store the actual product name + service tier instead. */ + CASE + WHEN CONVERT(int, SERVERPROPERTY(N'EngineEdition')) = 5 + THEN N'Azure SQL Database' + + ISNULL(N' (' + + CASE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), N'Edition')) + WHEN N'GeneralPurpose' THEN N'General Purpose' + WHEN N'BusinessCritical' THEN N'Business Critical' + ELSE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), N'Edition')) + END + N')', N'') + ELSE CONVERT(nvarchar(128), SERVERPROPERTY(N'Edition')) + END, product_version = CONVERT(nvarchar(128), SERVERPROPERTY(N'ProductVersion')), product_level = From 641179b3a3a8f6b929113abcb455f82e0bc9e441 Mon Sep 17 00:00:00 2001 From: Erik Darling <2136037+erikdarlingdata@users.noreply.github.com> Date: Tue, 16 Jun 2026 08:15:38 -0400 Subject: [PATCH 5/5] Dashboard + SQL collectors: same Azure edition normalization as Lite Mirror the Lite edition fix (b144697) across the Dashboard and the SQL-side collectors so "SQL Azure" is never shown or stored for an Azure SQL DB anywhere: - Dashboard FinOps Server Inventory live query (DatabaseService.FinOps.Inventory) - install/53_collect_server_properties.sql (stored server_properties.edition) - install/42_scheduled_master_collector.sql (server inventory edition) All map engine edition 5 to "Azure SQL Database ()" via DATABASEPROPERTYEX(DB_NAME(),'Edition'); on-prem editions take the unchanged ELSE branch. The licensing-recommendation queries (Lite + Dashboard) are left raw and identical -- they only do a Contains("Enterprise") check and never display the edition for Azure. Verified: Dashboard CASE returns "Azure SQL Database (General Purpose)" against a live Azure SQL DB; both install scripts deploy clean (0 errors) to SQL2019 and the Dashboard builds with 0 warnings/errors. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 2 +- .../Services/DatabaseService.FinOps.Inventory.cs | 15 ++++++++++++++- install/42_scheduled_master_collector.sql | 15 ++++++++++++++- install/53_collect_server_properties.sql | 14 +++++++++++++- 4 files changed, 42 insertions(+), 4 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index ad735985..5f523d85 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,7 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ### Fixed -- **Lite: Azure SQL Database shows its real product name in FinOps → Server Inventory** — the Edition column displayed the legacy `SQL Azure` value that `SERVERPROPERTY('Edition')` returns for Azure SQL DB; it now reads `Azure SQL Database ()` (e.g. `Azure SQL Database (General Purpose)`), derived from `DATABASEPROPERTYEX(DB_NAME(), 'Edition')`, for any engine-edition-5 instance. Applied to both the live inventory query and the stored `server_properties` value so every edition display is consistent; on-prem editions are unchanged +- **Lite and Dashboard: Azure SQL Database shows its real product name in FinOps → Server Inventory** — the Edition column displayed the legacy `SQL Azure` value that `SERVERPROPERTY('Edition')` returns for Azure SQL DB; it now reads `Azure SQL Database ()` (e.g. `Azure SQL Database (General Purpose)`), derived from `DATABASEPROPERTYEX(DB_NAME(), 'Edition')`, for any engine-edition-5 instance. Normalized at every edition display/storage site across **both** apps — the live inventory queries (Lite + Dashboard) and the SQL-side collectors (`install/42`, `install/53`) plus Lite's `server_properties` collector — so the value is consistent app-wide; on-prem editions are unchanged. (The licensing-recommendation queries are left raw and identical in both apps: they only do an `Enterprise` substring check and never display the edition for Azure.) 
 - **Dashboard: "Deadlocks Cleared" no longer flaps right after every deadlock** ([#1091]) — deadlock detection is edge-triggered off a delta against the cumulative perfmon counter, so the check immediately after a deadlock saw a zero delta and fired a *"Deadlocks Cleared"* notification ~one interval (≈60s) after every *"Deadlock Detected"*. The alert now stays active and clears only once a deadlock-quiet window (1 hour) has elapsed since the last new deadlock, so the detect/clear pair lines up with Lite, whose rolling 1-hour count drains about an hour after the last deadlock. Each new deadlock resets the window. The clear message is now *"No deadlocks in the last hour"* (was *"No deadlocks since last check"*). Covered by `DeadlockAlertClearPolicyTests`
 - **Lite: blocking and deadlock alerts no longer re-fire for the same events every cooldown** ([#1091]) — the overview alert engine treated the blocking and deadlock counts as a level: each check compared the rolling **1-hour** count against the threshold, so a single deadlock (or blocked-process report) kept the count above the threshold for the whole hour it lingered in the window, and the alert re-fired every cooldown (the reporter saw the same "2 deadlocks in the last hour" notification every five minutes for an hour). The Dashboard already edge-triggers off a delta; Lite now does too. Both alerts are gated by a new `RollingCountAlertGate` that fires only when the rolling count climbs above the count recorded at the last fired alert — a genuinely new event. The watermark decays as old events age out of the window (so a later rise re-alerts), resets when the window empties, and advances only when an alert actually fires (so an event arriving during a cooldown is reported once the cooldown elapses rather than being swallowed). Covered by `RollingCountAlertGateTests`
 - **Lite: blocking/deadlock XE sessions now self-heal and failures are surfaced** ([#1086]) — the `PerformanceMonitor_BlockedProcess` and `PerformanceMonitor_Deadlock` Extended Events sessions were created only when a server tab was opened; the recurring background collection loop never created or retried them. A server monitored without an open tab (e.g. app minimized to tray after a restart), or a first attempt that failed (connection not ready, missing `ALTER ANY EVENT SESSION`), left blocking/deadlock capture permanently dead — while the collectors read the non-existent ring buffer, got zero rows, and reported **OK**. The session ensure now runs inside the collector itself on every cycle (cheap existence check once created), so both the tab-open path and the background loop create/start/retry it. A failed ensure can no longer be masked: it fails the collector run, shows in the status-bar collector health (including permission failures, which previously didn't count as "erroring"), and fires a one-time tray notification ("Capture Not Running") on the transition. The Azure SQL DB database-scoped sessions also gain `STARTUP_STATE = ON` so they restart automatically after a failover
diff --git a/Dashboard/Services/DatabaseService.FinOps.Inventory.cs b/Dashboard/Services/DatabaseService.FinOps.Inventory.cs
index ba4048ee..d1d26cf1 100644
--- a/Dashboard/Services/DatabaseService.FinOps.Inventory.cs
+++ b/Dashboard/Services/DatabaseService.FinOps.Inventory.cs
@@ -126,7 +126,20 @@ IF @on_pos > 0
 SELECT
     edition =
-        CONVERT(nvarchar(256), SERVERPROPERTY('Edition')),
+        /* Azure SQL DB reports the legacy 'SQL Azure' for SERVERPROPERTY('Edition');
+           show the actual product name + service tier instead. Mirrors Lite's
+           LocalDataService.FinOps.ServerProperties. */
+        CASE
+            WHEN CONVERT(int, SERVERPROPERTY('EngineEdition')) = 5
+            THEN N'Azure SQL Database'
+                + ISNULL(N' ('
+                    + CASE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Edition'))
+                          WHEN N'GeneralPurpose' THEN N'General Purpose'
+                          WHEN N'BusinessCritical' THEN N'Business Critical'
+                          ELSE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Edition'))
+                      END + N')', N'')
+            ELSE CONVERT(nvarchar(256), SERVERPROPERTY('Edition'))
+        END,
     product_version =
         CONVERT(nvarchar(128), SERVERPROPERTY('ProductVersion')),
     product_level =
diff --git a/install/42_scheduled_master_collector.sql b/install/42_scheduled_master_collector.sql
index c1a07b2e..c003d222 100644
--- a/install/42_scheduled_master_collector.sql
+++ b/install/42_scheduled_master_collector.sql
@@ -121,7 +121,20 @@ BEGIN
         sql_version = CONVERT(nvarchar(128), SERVERPROPERTY('ProductVersion')) + N' - ' + CONVERT(nvarchar(128), SERVERPROPERTY('ProductLevel')),
-        edition = CONVERT(nvarchar(128), SERVERPROPERTY('Edition')),
+        edition =
+            /* Azure SQL DB reports the legacy 'SQL Azure' for SERVERPROPERTY('Edition');
+               store the actual product name + service tier instead. */
+            CASE
+                WHEN CONVERT(int, SERVERPROPERTY('EngineEdition')) = 5
+                THEN N'Azure SQL Database'
+                    + ISNULL(N' ('
+                        + CASE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Edition'))
+                              WHEN N'GeneralPurpose' THEN N'General Purpose'
+                              WHEN N'BusinessCritical' THEN N'Business Critical'
+                              ELSE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), 'Edition'))
+                          END + N')', N'')
+                ELSE CONVERT(nvarchar(128), SERVERPROPERTY('Edition'))
+            END,
         physical_memory_mb = osi.physical_memory_kb / 1024,
         cpu_count = osi.cpu_count,
         environment_type =
diff --git a/install/53_collect_server_properties.sql b/install/53_collect_server_properties.sql
index 499da32d..901da4ea 100644
--- a/install/53_collect_server_properties.sql
+++ b/install/53_collect_server_properties.sql
@@ -397,7 +397,19 @@ BEGIN
         server_name =
             CONVERT(sysname, SERVERPROPERTY(N'ServerName')),
         edition =
-            CONVERT(sysname, SERVERPROPERTY(N'Edition')),
+            /* Azure SQL DB reports the legacy 'SQL Azure' for SERVERPROPERTY('Edition');
+               store the actual product name + service tier instead. */
+            CASE
+                WHEN @engine_edition = 5
+                THEN N'Azure SQL Database'
+                    + ISNULL(N' ('
+                        + CASE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), N'Edition'))
+                              WHEN N'GeneralPurpose' THEN N'General Purpose'
+                              WHEN N'BusinessCritical' THEN N'Business Critical'
+                              ELSE CONVERT(nvarchar(128), DATABASEPROPERTYEX(DB_NAME(), N'Edition'))
+                          END + N')', N'')
+                ELSE CONVERT(sysname, SERVERPROPERTY(N'Edition'))
+            END,
         product_version =
             CONVERT(sysname, SERVERPROPERTY(N'ProductVersion')),
         product_level =