diagnose: per-test SIGALRM watchdog to identify macOS-debug hang #224
Closed
ser-vasilich wants to merge 1 commit into
A test hang in the suite currently stalls the whole CI job and we learn nothing about which test caused it (macOS+ASan on PR #223 hung 26+ min vs 150 s on master).

Install a SIGALRM-based watchdog: when a test exceeds the timeout the handler writes its name to stderr using async-signal-safe write(2) + _exit(124), so the CI log captures the culprit and the job actually finishes.

Default 90 s comfortably exceeds the slowest legitimate tests (1.05M-row HLL, splayed I/O round-trips); override via RAY_TEST_TIMEOUT_S env var (set to 0 to disable).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closing as a one-off diagnostic: the macOS-debug hang on PR #223 turned out to be a transient CI runner flake (master is green for the same code). The watchdog itself is small and uncontroversial; it can be reopened if a future hang makes it worth carrying as standing infrastructure.
What
Adds a SIGALRM-based per-test timeout watchdog to `test/main.c`. When a test
exceeds the timeout (default 90 s), the SIGALRM handler writes the offending
test name to stderr via async-signal-safe `write(2)` and calls `_exit(124)`,
so CI logs capture the culprit and the job finishes instead of hanging.
Why a draft PR
PR #223's macOS-debug job hung for 27+ minutes (vs ~150 s on master), suggesting
a platform-specific infinite loop or UB triggered only under macOS + ASan/UBSan.
Without per-test timeout, the suite stalls and we never learn which test is the
culprit. This draft PR isolates the watchdog so CI can identify the hanging test
on its own, then we can investigate the actual bug separately.
The watchdog is opt-out via `RAY_TEST_TIMEOUT_S=0`; the default 90 s comfortably
exceeds the slowest legitimate tests (1.05M-row HLL, splayed I/O round-trips).
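One way the override could be read is sketched below. Only the variable name `RAY_TEST_TIMEOUT_S`, the 90 s default, and the meaning of `0` come from this PR; the helper names `parse_timeout_s` and `test_timeout_from_env` are hypothetical, and falling back to the default on malformed input is an assumption rather than the PR's documented behavior.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: convert the env value into a timeout in seconds.
 * NULL, empty, negative, or malformed input falls back to default_s
 * (an assumption); "0" is returned as 0, meaning "watchdog disabled". */
unsigned parse_timeout_s(const char *value, unsigned default_s)
{
    char *end = NULL;
    long parsed;

    if (value == NULL || *value == '\0')
        return default_s;

    errno = 0;
    parsed = strtol(value, &end, 10);

    /* Reject overflow, no digits at all, or trailing junk such as "90s". */
    if (errno != 0 || end == value || *end != '\0')
        return default_s;
    if (parsed < 0 || (unsigned long)parsed > UINT_MAX)
        return default_s;

    return (unsigned)parsed;
}

/* Read the override from the environment, defaulting to 90 seconds. */
unsigned test_timeout_from_env(void)
{
    return parse_timeout_s(getenv("RAY_TEST_TIMEOUT_S"), 90u);
}
```

Keeping the parsing separate from `getenv` makes the fallback rules easy to check in isolation. A result of 0 can be passed straight to `alarm`, since `alarm(0)` schedules nothing.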
Expected outcome
The CI log shows `TIMEOUT in test: <name>` after ≤90 s per test, pinpointing which test introduces the hang.
Once we know the test name, we can reproduce locally / inspect for UB and open a
proper fix PR.
🤖 Generated with Claude Code