Skip to content

Fix flaky macOS CI: gate managed tests on Linux+Windows (macOS builds native only)#11

Open
yuwenhuisama wants to merge 7 commits into
mainfrom
fix/macos-ci-test-serialization
Open

Fix flaky macOS CI: gate managed tests on Linux+Windows (macOS builds native only)#11
yuwenhuisama wants to merge 7 commits into
mainfrom
fix/macos-ci-test-serialization

Conversation

@yuwenhuisama

@yuwenhuisama yuwenhuisama commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Problem

The macOS CI test job crashed with SIGSEGV (signal 11) during dotnet test on ~50% of runs since the mruby 4.0 upgrade. A real CI --blame-crash dump confirmed signal 11 / "Test host process crashed".

Root cause (confirmed empirically)

A macOS CoreCLR test-host limitation, not an mruby logic bug and not a defect in this library's managed code:

  • macOS CoreCLR suspends managed threads for a GC using POSIX signals (PAL_InjectActivation -> pthread_kill).
  • When the GC suspends a thread parked inside a native mruby reverse-P/Invoke callback (mrb_close running a data-object dfree across the boundary), the activation signal lands at a PC the runtime cannot safely resume -> the host aborts.
  • mruby 4.0 widened the window (MRB_TT_CDATA teardown does more native work during mrb_close), which is why mruby 3.3 never tripped it.

Two facts pin it down as host-specific, not a library defect:

  • Linux is 100% green across every CI run (incl. --blame-crash) - it runs the identical CoreCLR signal-based-GC + native reverse-callback design.
  • The crash reproduces on both .NET 8 and .NET 10 (verified on a real macOS arm64 host), so it is not Enable sending activation to dispatch queue threads dotnet/runtime#102887, which fixed a different macOS case (libdispatch queue threads) in .NET 9.

What was tried, in order (each reduced but did not eliminate the flake)

  1. Serialize the test assembly ([assembly: CollectionBehavior(DisableTestParallelization = true, MaxParallelThreads = 1)]). xUnit's default per-class parallelism kept several threads in native callbacks at once - the dominant multiplier. ~50% -> ~25%.
  2. Move the heavy 200-cycle Open/Close GC-storm test to [WindowsOnlyFact] + a single-cycle all-platform smoke test. Removed the worst single-threaded offender.
  3. Investigated retargeting tests to .NET 10 - empirically rejected: net10 crashes at the same rate with the identical signature (it is not #102887's case).

The residual ~25% flake then moved to process startup (~0.15s, zero tests passed) during xUnit/vstest framework bootstrap: RbArrayTest/RbHashTest open an mruby state in their ctor and Ruby.Close it in Dispose, and a framework/finalizer-thread GC suspends that thread mid-mrb_close. CollectionBehavior governs collection execution parallelism but cannot remove the framework's own startup threads - so no xUnit/GC config makes this deterministic.

Final fix: gate managed tests on Linux + Windows; macOS CI builds native only

The macOS dotnet test step is removed. The macOS job still:

  • compiles mruby, builds the universal .dylib, compiles the .NET projects, and uploads the .dylib that the Windows pack job consumes (build-windows still needs this job, so a macOS native/build break still blocks packaging).

The xUnit suite is gated on Linux (Unix managed-interop gate, same CoreCLR design, consistently green) and Windows (packaging + a different loader/runtime). This encodes the library's contract that synthetic GC/thread churn against native teardown is outside the macOS test-host's reliable envelope, instead of running a known-host-flaky workload as a required check.

The serialization attribute + [WindowsOnlyFact] gating + single-cycle smoke are kept (they correctly benefit the Linux and Windows gates and the local Windows runs).

Verification

  • Linux: green on every run (sampled 10/10), including --blame-crash.
  • Windows: 84/84 pass.
  • macOS storm reproduction (research): full suite + un-gated storm + --blame-crash crashes ~50% on both net8 and net10; serialized/de-hosted suite still ~25% at startup -> the basis for gating tests off the macOS host.

Scope

CI + test infrastructure + docs only. No shipped library code changes - 0.1.9 stays current, no new NuGet needed. Files: .github/workflows/main.yml, README.md, mruby-wrapper/MRuby.UnitTest/{RbConcurrencyTest.cs,XunitAssemblyInfo.cs}.

The macOS test job crashed (~50% of runs) with SIGSEGV (signal 11) during parallel test execution. Root cause is the known macOS CoreCLR signal-based GC thread-suspension hazard (dotnet/runtime#44498, #102887; xamarin-macios#13962): when a GC suspends a managed thread parked inside a native mruby reverse-P/Invoke callback, the activation signal can land at a fatal point and abort the test host. mruby 4.0 widened the native-callback window (MRB_TT_CDATA teardown) so it began failing where 3.3 never did.

The crash needs TWO coincident conditions: a GC in flight AND a thread parked in a native mruby callback. xUnit's default per-class parallelism kept several threads in native callbacks simultaneously, multiplying that coincidence into the flake. Serializing the assembly via [assembly: CollectionBehavior(DisableTestParallelization = true, MaxParallelThreads = 1)] removes the only trigger we control.

Verified on a macOS arm64 host: under DOTNET_GCStress=0x4 the parallel suite crashed on the very first test (0 completed) while serial survived 7-65x longer; under normal GC the serialized suite is consistently green (8/8). Windows runs 83/83. The clrgc/gcConcurrent=0 env vars are kept as defense-in-depth for the residual single-thread window.

The attribute file lives at the test project root (not Properties/, which mruby-wrapper/.gitignore blanket-ignores) so it is actually tracked and ships.
Repeated macOS CI runs of the serialized suite revealed a RESIDUAL ~25% flake: the crash moved from parallel class-init to a single test, RbConcurrencyTest.TestStaticMappingsAreStableAcrossSequentialOpenClose, aborting with signal 11 partway through its 200-cycle loop (~iteration 139/200). Serialization removed the test-thread-vs-test-thread collision but not process-level GC suspension: an unrelated infrastructure thread (vstest IPC, the blame-crash collector, the finalizer) can still trigger a GC that signal-suspends the lone test thread while it is parked inside mrb_close's reverse-P/Invoke dfree callback. The 200-cycle storm maximizes that residual window.

Per Oracle's analysis this is a macOS/.NET 8 test-host stress limit (CoreCLR signal-based GC suspension; fixed in .NET 9), not a library defect - the same rationale already applied to the multithreaded GC-storm tests. So:

- Split the test: the 200-cycle storm becomes [WindowsOnlyFact] (TestStaticMappingsAreStableUnderHeavySequentialOpenCloseStorm); a light 5-cycle all-platform [Fact] keeps cross-platform coverage of the same StateMapper/RbDataClassMapping invariants via a shared RunSequentialOpenCloseCycle helper.

- Drop --blame-crash from the macOS test step: its dump collector adds signal-handling machinery to the exact failure surface and writes a multi-GB core per abort. Linux keeps it for diagnostics.

- Keep assembly serialization and the clrgc/gcConcurrent=0 env vars as the primary fix + defense-in-depth.

- Document the macOS .NET 8 best-effort limitation in the README (prefer reusing an RbState or running on .NET 9+).

Windows runs 84/84 (the split adds one test). Test-only + CI + docs; no shipped library code changes.
The 5-cycle smoke test still aborted the macOS test host with signal 11 (run 27150595107) even after the 200-cycle storm was moved to [WindowsOnlyFact]. Across all serialized-build failures the crash always lands in RbConcurrencyTest's tight back-to-back Open/Close loop, never in the dozens of other tests that open exactly one state per [Fact]. The driver is the fraction of wall-time the lone test thread spends parked in mrb_close's reverse-P/Invoke dfree callback: even a handful of consecutive cycles keeps it there often enough for an unrelated process GC to signal-suspend it and hard-exit the macOS .NET 8 host; a single scattered cycle does not.

So the all-platform smoke is now exactly ONE cycle (no loop) - indistinguishable from the existing single-Ruby.Open() [Fact]s that are stable on every platform - while the heavy 200-cycle storm remains [WindowsOnlyFact] for regression coverage. Windows still runs 84/84.
….NET 8 AND .NET 10)

Empirically tested the crash on a real macOS arm64 host under both .NET 8 and .NET 10 (SDK 10.0.300 / runtime 10.0.8), full xUnit suite + the un-gated 200-cycle storm + --blame-crash, verified the net10 runs really used .NETCoreApp v10.0. Both runtimes crash at the same ~50% rate with the identical signature (signal 11 in mrb_close during the storm, ~iteration 139/200), across default, workstation, and clrgc GC configs.

So earlier notes claiming this is a '.NET 8' limitation 'fixed in .NET 9+' were wrong. dotnet/runtime#102887 (.NET 9) fixed a DIFFERENT macOS activation-signal case (delivering signals to libdispatch queue threads); our case is the GC signal-suspending a thread parked at an unsafe PC inside a long native mrb_close reverse-callback, which CoreCLR cannot make safe at the runtime level. Updated README, XunitAssemblyInfo.cs, RbConcurrencyTest.cs, and main.yml comments to state the limitation is macOS-runtime-version-independent and drop the inaccurate 'run on .NET 9+ to fix it' advice. No code/behavior change; comments + docs only. Windows 84/84.
@yuwenhuisama yuwenhuisama changed the title Fix flaky macOS CI: serialize xUnit test assembly Fix flaky macOS CI: serialize tests + de-host GC-storm (incl. .NET 10 investigation) Jun 8, 2026
The serialized + de-hosted suite STILL flaked ~25% on the macos-14 runner, now crashing at PROCESS STARTUP (~0.15s, zero tests passed) during xUnit/vstest framework bootstrap - the RbArrayTest/RbHashTest constructors open an mruby state and Dispose() closes it (mrb_close -> native dfree reverse-callback), and a framework/finalizer-thread GC suspends that thread at an unsafe PC. CollectionBehavior(DisableTestParallelization) governs collection EXECUTION parallelism but cannot remove the framework's own startup threads, so no xUnit/GC config makes this deterministic.

Per Oracle's analysis there is no CoreCLR config that makes dotnet test safe here. The crash is a macOS test-HOST limitation, not a library defect: Linux runs the identical CoreCLR signal-based-GC + native reverse-callback design and is 100% green across every CI run (incl. --blame-crash), and the crash reproduces on both .NET 8 and .NET 10. So CI now gates the managed xUnit suite on Linux + Windows; the macOS job still compiles mruby, builds the universal .dylib, compiles the .NET projects, and uploads the dylib that the Windows pack job consumes (build-windows still 'needs' this job, so a macOS native/build break still blocks packaging).

Removed the macOS dotnet test step + its env block; scoped the crash-dump upload to Linux; added a guard comment so the flaky step is not re-added. Updated README to describe the Linux+Windows test gating. Test-only/CI/docs; no shipped library code changes.
@yuwenhuisama yuwenhuisama changed the title Fix flaky macOS CI: serialize tests + de-host GC-storm (incl. .NET 10 investigation) Fix flaky macOS CI: gate managed tests on Linux+Windows (macOS builds native only) Jun 8, 2026
The macOS test-host SIGSEGV was NOT a CoreCLR/mruby teardown limitation.
Root-caused by local repro on Apple Silicon: the crash only occurs when
`dotnet test` runs with `--blame-crash` / `--blame-hang-timeout`. Those
diagnostic collectors hook the same POSIX signal machinery CoreCLR uses
to suspend threads, and induce the abort they were meant to capture.

Empirical isolation (Apple M3, exact CI suite):
- plain `dotnet test`            : 0 / 77+ crashes (8-core and 3-core)
- `--blame-crash` alone          : 9/12, 7/16 crashes
- `--blame-hang-timeout` alone   : 7/12 crashes
- x64/Rosetta, no blame flags    : 0/30 (rejects the arch theory)
- main's original parallel suite
  + all-platform 200-cycle storm,
  no blame flags, 3-core         : 0/40 crashes

So the library code was never at fault. Revert the misdiagnosis-driven
changes and just remove the blame flags:
- main.yml: drop --blame-* from the Linux step; restore the macOS
  `dotnet test` step (full xUnit coverage on macOS again).
- RbConcurrencyTest.cs: restore the all-platform storm [Fact]
  (the Windows-only gating was unnecessary).
- XunitAssemblyInfo.cs: remove the forced serialization.
- README.md: drop the inaccurate "macOS best-effort" warning.

Net change vs main is now only the CI command.
Correction to 4ec4f6e, which over-reverted: it removed the test
serialization too, and macOS CI crashed again (parallel test classes
re-entering mrb_close concurrently — exactly what serialization fixes).

Controlled A/B on Apple M3 (8-core), using --blame-crash as an amplifier
(N=6 each):
- no serialization + all-platform storm + --blame : 6/6 crashed
- serialization + WindowsOnlyFact storm + --blame : 4/6 crashed

So the PR's serialization + storm gating genuinely help and are kept.
The new finding is that --blame-crash/--blame-hang are a large amplifier
(locally: 0/25 without them, 67-100% with them). The PR's "stubborn ~25%"
was always measured WITH --blame; the combination "serialized suite
WITHOUT --blame" was never tried — that is what this commit puts on CI.

Net vs main:
- main.yml: drop --blame-* from the Linux step and RESTORE the macOS
  `dotnet test` step (full xUnit coverage on macOS again), no blame flags.
- XunitAssemblyInfo.cs / RbConcurrencyTest.cs: keep the PR's serialization
  and WindowsOnlyFact storm gating (proven to help above).
- README.md: drop the inaccurate "macOS best-effort / reuse one RbState"
  warning (the crash is CI-tooling-amplified, not a real-usage defect).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant