Add blame-hang-timeout to .NET test runs so hangs fail fast #4142

Open
mattleibow wants to merge 1 commit into main from mattleibow/dev-test-hang-timeout-xunit3

@mattleibow

Summary

Adds a process-level hang-timeout safety net so a stuck .NET Core test fails fast (~15 min) instead of burning to the 180-min job cap.

This came out of the investigation into intermittent macOS (.NET Core) test-job hangs — see #4139 for the full root cause.

Background

The macOS (.NET Core) test job intermittently hangs for 30 min – 3 h (vs a healthy ~8 min) and sometimes hits the 180-min job timeout. Root cause: a GC-finalizer-bound stress test (SKBitmapThreadingTest.ImageScalingMultipleThreadsTest) can take 13–15 min or crash the test host on a contended Microsoft-hosted macOS VM. When the host crashes, AzDO retries the entire tests-netcore step up to 3× (retryCountOnTaskFailure: 3), bounded only by timeoutInMinutes: 180. There is currently no per-test or process-level hang timeout.

Change

In the shared RunDotNetTest helper (scripts/infra/tests/test-shared.cake), append the VSTest blame-hang collector flags to the ArgumentCustomization. This single place covers all .NET Core desktop test projects run by the tests-netcore target:

.Append("--blame-hang-timeout").Append("15m")
.Append("--blame-hang-dump-type").Append("none");
  • 15m is safely above the worst observed passing run (~793s ≈ 13 min) so it won't false-fire, while bounding a true hang well under the 180-min cap.
  • --blame-hang-timeout auto-enables blame-hang mode (no separate --blame needed).
  • none dump type avoids large dump artifacts (can revisit to mini/full later if a stack trace is wanted).
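
For reference, the two appended flag pairs correspond to the `dotnet test` command line sketched below. This is a minimal illustration, not the Cake helper itself: the function names and the project path are hypothetical, and the script only prints the command rather than running it. The second function works out the headroom between the 15m timeout and the worst observed passing run (~793s).

```shell
# Sketch only: prints the dotnet test command line implied by the appended flags.
# The project path is a caller-supplied placeholder, not a path taken from this PR.
hang_test_command() {
  printf 'dotnet test %s --blame-hang-timeout 15m --blame-hang-dump-type none\n' "$1"
}

# Headroom in seconds between the 15m timeout and the worst observed passing run (~793s).
timeout_margin_seconds() {
  echo $(( 15 * 60 - 793 ))
}

hang_test_command "path/to/Tests.csproj"
timeout_margin_seconds
```

The margin works out to 107 seconds, which is why 15m is the smallest round value that clears the slowest known passing run.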

The device/WASM runner path (RunDeviceRunnersTest) is intentionally left unchanged: those tests run in-process inside a MAUI/Blazor host (DeviceRunners) and do not use the VSTest blame collector, so these flags don't apply there.

Verification

C#-only/infra change — bootstrapped with externals-download (no native rebuild). Ran a filtered subset of the netcore tests locally; the runner accepted the --blame-hang-timeout/--blame-hang-dump-type flags and executed tests normally (no argument-parsing rejection).

Refs #4139

The macOS (.NET Core) test job intermittently hangs for 30 min to 3 h
(vs a healthy ~8 min) and occasionally hits the 180-min job timeout.
There is currently no per-test or process-level hang timeout, so a stuck
test host burns to the job cap and AzDO retries the whole tests-netcore
step up to 3x.

Add a process-level safety net via the VSTest blame-hang collector in the
shared RunDotNetTest helper (covers all .NET Core desktop test projects
run by the tests-netcore target). 15m is safely above the worst observed
passing run (~13 min) so it won't false-fire, while bounding a true hang.
A 'none' dump type avoids large artifacts.

--blame-hang-timeout auto-enables blame mode, so no separate --blame flag
is needed. The device/WASM runner path (RunDeviceRunnersTest) is
intentionally left unchanged: those tests run in-process inside a
MAUI/Blazor host and do not use the VSTest blame collector.

Refs #4139

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@github-actions

📦 Try the packages from this PR

Warning

Do not run these scripts without first reviewing the code in this PR.

Step 1 — Download the packages

bash / macOS / Linux:

curl -fsSL https://raw.githubusercontent.com/mono/SkiaSharp/main/scripts/get-skiasharp-pr.sh | bash -s -- 4142

PowerShell / Windows:

iex "& { $(irm https://raw.githubusercontent.com/mono/SkiaSharp/main/scripts/get-skiasharp-pr.ps1) } 4142"

Step 2 — Add the local NuGet source

dotnet nuget add source ~/.skiasharp/hives/pr-4142/packages --name skiasharp-pr-4142

More options

Option (bash / PowerShell)            Description
--successful-only / -SuccessfulOnly   Only use successful builds
--force / -Force                      Overwrite previously downloaded packages
--list / -List                        List available artifacts without downloading
--build-id ID / -BuildId ID           Download from a specific build

Or download manually from Azure Pipelines — look for the nuget artifact on the build for this PR.

Remove the source when you're done:

dotnet nuget remove source skiasharp-pr-4142

@github-actions

📖 Documentation Preview

The documentation for this PR has been deployed and is available at:

🔗 View Staging Site
🔗 View Staging Docs
🔗 View Staging Gallery (Blazor)
🔗 View Staging Gallery (Uno Platform)
🔗 View Staging SkiaFiddle

This preview will be updated automatically when you push new commits to this PR.


This comment is automatically updated by the documentation staging workflow.

mattleibow added a commit that referenced this pull request Jun 11, 2026
Follow-up from dual-model PR review:

- Add the same always-run managed-only smoke test to the migrated
  SkiaSharp.Views.Gtk4.Tests project. Its other tests initialise native
  GTK4 in their constructors and skip every test on a headless/GTK-less
  agent; under Microsoft.Testing.Platform that all-skipped run would exit
  8 (failure). The smoke test exercises pure-managed SkiaSharp geometry
  types (SKPointI/SKSizeI — no GTK, no native call). Verified: the suite
  runs 30 tests (1 executed, 29 skipped) and exits 0. (This project is
  not currently wired into CI, but it was migrated to MTP, so the guard
  keeps it safe if it is ever run standalone or added to a leg.)

- Document in test-shared.cake why the hang-dump uses `--hangdump-type
  Mini` rather than #4142's VSTest `none`: MTP only detects a per-test
  hang via the HangDump extension, which always writes a dump (no "none"
  type exists); the global `--timeout` aborts without a dump but is a
  whole-session timeout, not per-test, so it is unsuitable. Mini is the
  smallest dump and only materialises on an actual hang.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
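
The exit-code behaviour described in the commit above can be illustrated with a small helper. This is a sketch under stated assumptions: the function name is hypothetical, and only the two exit codes mentioned in the commit message (0 and 8) are interpreted; every other code is reported generically.

```shell
# Sketch: interpret the exit code of a Microsoft.Testing.Platform test run.
# Only 0 and 8 come from the commit message; all other codes are lumped together.
describe_mtp_exit() {
  case "$1" in
    0) echo "passed" ;;
    8) echo "zero tests ran (all skipped or none discovered)" ;;
    *) echo "failed with exit code $1" ;;
  esac
}

describe_mtp_exit 8
```

A helper like this only reports the condition. The commit keeps exit 8 as a failure and avoids it by adding an always-run smoke test instead of masking the code.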
mattleibow added a commit that referenced this pull request Jun 16, 2026
Migrate test suite from xUnit v2 to xUnit v3 (#4143)

Context: #4139

The macOS CI legs intermittently hung under the xUnit v2 console runner, which
had no per-test timeout to recover a wedged test host. Rather than only paper
over the hang, this migrates the entire test suite to xUnit v3 and drops two
pieces of hand-rolled infrastructure in favour of native v3 features.

Runners:
  * Desktop `dotnet test` projects move to Microsoft.Testing.Platform (MTP):
    OutputType=Exe, xunit.v3 + Microsoft.Testing.Extensions.TrxReport/HangDump,
    dropping Microsoft.NET.Test.Sdk, xunit.runner.visualstudio and
    XunitXml.TestLogger.
  * Device (MAUI) and WASM (Blazor) in-app runners move to DeviceRunners
    *.Xunit3 (.AddXunit3()) at 0.1.0-preview.11.
  * SkiaSharp.Views.Gtk4.Tests is migrated and wired into the tests-netcore
    leg so its conversion tests run in CI for the first time.
  * net48 (x86 and x64) is unchanged: each architecture now builds its own
    runnable exe, so the old console-runner `is32` bitness selector is removed
    as dead code.

Native dynamic skip (drops Xunit.SkippableFact):
  * [SkippableFact]/[SkippableTheory] -> [Fact]/[Theory]
  * Skip.If/Skip.IfNot -> Assert.SkipWhen/Assert.SkipUnless
  * throw new SkipException(...) -> Assert.Skip(...)
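
The three mappings above are mechanical enough to sketch as a text rewrite. The filter below is an illustration only, not the tooling used for this migration: the function name is hypothetical, it handles just the plain spellings listed above, and it would miss attributes that carry arguments or calls split across lines.

```shell
# Sketch: rewrite Xunit.SkippableFact idioms to their xUnit v3 equivalents.
# Reads C# source on stdin and writes the rewritten source to stdout.
# Skip.IfNot is rewritten before Skip.If so the two patterns stay easy to audit.
migrate_skips() {
  sed -e 's/\[SkippableFact]/[Fact]/g' \
      -e 's/\[SkippableTheory]/[Theory]/g' \
      -e 's/Skip\.IfNot(/Assert.SkipUnless(/g' \
      -e 's/Skip\.If(/Assert.SkipWhen(/g' \
      -e 's/throw new SkipException(/Assert.Skip(/g'
}

# HasGpu is a made-up identifier used only for this demonstration.
printf '%s\n' '[SkippableFact]' 'Skip.If(!HasGpu, "No GPU");' | migrate_skips
```

Any output from such a filter would still need review, since the rewrite keeps the original arguments unchanged.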

Assembly fixtures (drops the custom test framework): CustomTestFramework.cs and
AssemblyFixtureAttribute.cs are deleted and the GarbageCleanupFixture rebinds to
v3's native Xunit.AssemblyFixtureAttribute. ITestOutputHelper moves from
Xunit.Abstractions to Xunit, and IAsyncLifetime now returns ValueTask.

Zero-executed-test guard, with no masking: MTP exits 8 when zero tests run, and
a dynamically-skipped test counts as not-run, so a fully-skipped suite also
exits 8 (v2/VSTest treated all-skipped as success). Instead of suppressing exit
8 — which would also hide a real zero-discovery misconfiguration — the
hardware-gated Vulkan and Direct3D suites each gain an always-run SmokeTest
exercising a backend type that needs no GPU runtime (GRVkImageInfo,
GRD3DTextureResourceInfo). The headless Linux Gtk4 leg is handled the same way:
gtk_init/gtk_init_check call native exit() with no display, so the three
display-dependent tests are gated behind a managed DISPLAY/WAYLAND_DISPLAY check
before any GTK call while the ~26 conversion tests still run (libgtk-4-1 is
installed on the agent). No --ignore-exit-code or allowNoTests masking remains.

Hang protection / #4142 reconciliation: RunDotNetTest forwards MTP hang-dump
args (--hangdump --hangdump-timeout 15m --hangdump-type Mini) — the MTP
equivalent of #4142's VSTest --blame-hang-timeout. MTP HangDump has no "none"
type, so the smallest (Mini) is used; these supersede #4142's VSTest blame flags
on merge.

Packaging: DeviceRunners.*.Xunit3 are now mirrored to the dnceng dotnet-public
feed, so nuget.config restores exclusively from the two dnceng mirrors with no
nuget.org source and no packageSourceMapping. No package uses a floating `*`
version; SkiaSharp.Tests.Integration requires an explicit -p:SkiaSharpVersion=
and fails fast via a ValidateVersions target. CI test-results publishing is
switched from xUnit/TestResults.xml to VSTest/*.trx to match MTP's output so the
desktop legs (including the 32-bit run) keep publishing.

Closes #4139.

Co-authored-by: Matthew Leibowitz <mattleibow@live.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
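
The hang-protection reconciliation described in the commit above comes down to two argument sets, one per runner. The sketch below places them side by side. The function name and the `vstest`/`mtp` selector are hypothetical, and how the arguments reach the test host (through `dotnet test` or by running the test executable directly) is left out because the source does not specify it.

```shell
# Sketch: hang-protection arguments for each runner, as quoted in this PR and the follow-up commit.
hang_args() {
  case "$1" in
    vstest) echo "--blame-hang-timeout 15m --blame-hang-dump-type none" ;;
    mtp)    echo "--hangdump --hangdump-timeout 15m --hangdump-type Mini" ;;
    *)      echo "unknown runner: $1" >&2; return 1 ;;
  esac
}

hang_args vstest
hang_args mtp
```

Both sets use the same 15m bound. The only behavioural difference noted in the source is that MTP always writes a dump on a hang, so the smallest type (Mini) stands in for VSTest's `none`.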