[core] Enable process_group_cleanup_enabled by default#64368
[core] Enable process_group_cleanup_enabled by default#64368kevin85421 wants to merge 3 commits into
Conversation
Flip the default of RAY_process_group_cleanup_enabled from false to true so per-worker process-group cleanup is used out of the box. The deprecated subreaper-based cleanup remains available by disabling this flag, and when both are enabled the raylet already prefers process-group cleanup. Update doc/source/ray-core/user-spawn-processes.rst to reflect the new default. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
There was a problem hiding this comment.
Code Review
This pull request enables per-worker process-group-based cleanup by default by changing the default value of process_group_cleanup_enabled from false to true. The documentation has been updated accordingly to reflect this change and note that it is the preferred cleanup mechanism. No review comments were provided, and the changes are straightforward and correct.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
I agree, thanks for opening it @kevin85421! Did you test w/ the flag turned on and did it solve the issue for you? |
|
Some failing tests that are likely related: https://buildkite.com/ray-project/premerge/builds/68996#019f02ba-31ba-4a2c-b8bd-fbee51ba24b4 |
|
Hi @edoakes
Yes, we verified this with our FSDP training jobs, and it cleaned up the processes successfully.
I will fix the CI failures and ping you when this PR is ready for review. |
With process_group_cleanup_enabled defaulting to true, every worker disconnect triggered killpg(SIGTERM) + killpg(SIGKILL) against the worker's own process group. Because the worker is its own process-group leader, a graceful exit (__ray_terminate__) was being signal-killed before it could finish its shutdown sequence (atexit / __ray_shutdown__ handlers), breaking test_actor_failures.py shutdown tests. The same mechanism tore down Serve's HAProxyManager actor and its HAProxy subprocess mid-graceful-shutdown, surfacing as test_grpc.py::test_serving_grpc_requests failures. Split per-worker process-group cleanup by disconnect type: - Non-graceful (crash): the worker is already gone, so signal the group immediately (SIGTERM, then SIGKILL) to reap orphaned descendants. - Graceful: poll for the worker process to exit on its own, then SIGKILL the group to reap any orphaned descendants. The process group outlives the dead leader as long as members remain, so the pgid stays valid for this post-exit sweep. This preserves orphan cleanup on graceful actor deletion (test_nested_subprocess_cleanup_with_pg_cleanup) without interrupting the worker's own shutdown. Add a regression test asserting __ray_shutdown__ runs on graceful termination with PG cleanup enabled. Verified locally (macOS, Apple clang): test_kill_subprocesses.py (2 passed, 7 linux-only skipped) and the test_actor_failures.py shutdown suite (8 passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
49166d7 to
492e111
Compare
Why are these changes needed?
This is my second time needs this feature (first time: #57638).
I built a Ray script similar to Slurm's
sbatchto streamline migrating our training pipelines from Slurm to KubeRay. It uses subprocess to launch torchrun processes, and the torchrun processes typically launch their own child processes. There's no easy way for the application side to ensure the processes are cleaned up thoroughly. Therefore, I'd suggest enablingRAY_process_group_cleanup_enabledby default.Related issue number
Follow-up to #56476.
Checks
git commit -s) in this PR.scripts/format.shto lint the changes in this PR.🤖 Generated with Claude Code