diagnose: per-test SIGALRM watchdog to identify macOS-debug hang #224

Closed
ser-vasilich wants to merge 1 commit into master from diagnose/macos-debug-hang

@ser-vasilich
Collaborator

What

Adds a SIGALRM-based per-test timeout watchdog to test/main.c. When a test
exceeds the timeout (default 90 s), the SIGALRM handler writes the offending
test name to stderr via async-signal-safe write(2) and calls _exit(124)
so CI logs capture the culprit and the job finishes instead of hanging.

Why a draft PR

PR #223's macOS-debug job hung for 27+ minutes (vs ~150 s on master), suggesting
a platform-specific infinite loop or UB triggered only under macOS + ASan/UBSan.
Without a per-test timeout, the suite stalls and we never learn which test is the
culprit. This draft PR contains only the watchdog, so CI itself can identify the
hanging test; we can then investigate the actual bug separately.

The watchdog is opt-out via RAY_TEST_TIMEOUT_S=0; the default 90 s comfortably
exceeds the slowest legitimate tests (1.05M-row HLL, splayed I/O round-trips).
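
How the env var is parsed is not shown on this page. One plausible reading, sketched under assumptions: the helper names `parse_timeout_s` and `test_timeout_s` are hypothetical, and falling back to the default on malformed input is a design choice of this sketch, not something the PR states.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

#define DEFAULT_TEST_TIMEOUT_S 90u  /* default from the PR description */

/* Parse a timeout in whole seconds. NULL, empty, non-numeric, negative,
 * or out-of-range input falls back to the default (an assumption of this
 * sketch). "0" is returned as 0, which the caller treats as "disabled". */
static unsigned parse_timeout_s(const char *text) {
    char *end = NULL;
    unsigned long v;

    /* Require a leading digit: strtoul would otherwise accept
     * leading whitespace and a sign. */
    if (text == NULL || text[0] < '0' || text[0] > '9')
        return DEFAULT_TEST_TIMEOUT_S;

    errno = 0;
    v = strtoul(text, &end, 10);
    if (errno != 0 || *end != '\0' || v > UINT_MAX)
        return DEFAULT_TEST_TIMEOUT_S;
    return (unsigned)v;
}

/* Read the override from the environment variable named in the PR. */
static unsigned test_timeout_s(void) {
    return parse_timeout_s(getenv("RAY_TEST_TIMEOUT_S"));
}
```

Keeping the parsing separate from the `getenv` call makes the fallback rules easy to unit-test without touching the process environment.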

Expected outcome

  • ubuntu-{debug,release}, macos-release: green in ~2-3 min as on master
  • macos-debug: should fail with TIMEOUT in test: <name> after ≤90 s per test,
    pinpointing which test introduces the hang

Once we know the test name, we can reproduce locally / inspect for UB and open a
proper fix PR.

🤖 Generated with Claude Code

A test hang in the suite currently stalls the whole CI job and we
learn nothing about which test caused it (macOS+ASan on PR #223 hung
26+ min vs 150 s on master).  Install a SIGALRM-based watchdog: when
a test exceeds the timeout the handler writes its name to stderr
using async-signal-safe write(2) + _exit(124), so the CI log
captures the culprit and the job actually finishes.

Default 90 s comfortably exceeds the slowest legitimate tests
(1.05M-row HLL, splayed I/O round-trips); override via
RAY_TEST_TIMEOUT_S env var (set to 0 to disable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ser-vasilich marked this pull request as ready for review on June 3, 2026, 15:40

@ser-vasilich
Collaborator Author

Closing as a one-off diagnostic: the macOS-debug hang on PR #223 turned out to be a transient CI runner flake (master is green for the same code). The watchdog itself is small and uncontroversial, so this PR can be reopened if a future hang makes it worth carrying as standing infrastructure.

ser-vasilich deleted the diagnose/macos-debug-hang branch on June 4, 2026, 12:06