[core] Enable process_group_cleanup_enabled by default by kevin85421 · Pull Request #64368 · ray-project/ray

kevin85421 · 2026-06-26T06:36:32Z

Why are these changes needed?

This is the second time I have needed this feature (first time: #57638).

I built a Ray script similar to Slurm's sbatch to streamline migrating our training pipelines from Slurm to KubeRay. It uses subprocess to launch torchrun processes, and the torchrun processes typically launch their own child processes. There's no easy way for the application side to ensure the processes are cleaned up thoroughly. Therefore, I'd suggest enabling RAY_process_group_cleanup_enabled by default.
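To make the problem concrete, here is a minimal sketch of what the application side would otherwise have to do by hand: start the launcher as the leader of its own process group and signal the whole group on teardown. The helper names `launch_in_own_group` and `kill_group` are hypothetical and are not part of Ray or torchrun. The sketch assumes a POSIX system, and it only reaches descendants that stay in the launcher's process group.

```python
import os
import signal
import subprocess


def launch_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group (POSIX only)."""
    return subprocess.Popen(cmd, start_new_session=True)


def kill_group(proc, grace_s=2.0):
    """SIGTERM the launcher's whole group, then SIGKILL any survivors."""
    # With start_new_session=True the child's pgid equals its pid.
    pgid = proc.pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the group is already gone
    try:
        # Give the leader a chance to exit on SIGTERM.
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        pass
    try:
        # Sweep descendants that ignored or outlived SIGTERM.
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass
    proc.wait()
```

Every script that shells out to something like torchrun would need to repeat this pattern, and a child that calls `setsid` itself would still escape it, which is part of why cleanup at the Ray level is attractive.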

Related issue number

Follow-up to #56476.

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the related rules for more info.

🤖 Generated with Claude Code

Flip the default of RAY_process_group_cleanup_enabled from false to true so per-worker process-group cleanup is used out of the box. The deprecated subreaper-based cleanup remains available by disabling this flag, and when both are enabled the raylet already prefers process-group cleanup.

Update doc/source/ray-core/user-spawn-processes.rst to reflect the new default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
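For anyone who needs to opt out after the default flips, a small sketch of building the environment for the process that starts Ray. The helper `ray_env` is hypothetical, and the sketch assumes the flag is read from the environment of whatever process launches the raylet and that "1" and "0" are accepted values.

```python
import os


def ray_env(pg_cleanup_enabled, base=None):
    """Return a copy of the environment with the cleanup flag set explicitly.

    Assumption: the raylet reads RAY_process_group_cleanup_enabled from the
    environment it is started in, and accepts "1" / "0".
    """
    env = dict(os.environ if base is None else base)
    env["RAY_process_group_cleanup_enabled"] = "1" if pg_cleanup_enabled else "0"
    return env


# Example (not executed here): start a head node with the flag disabled.
# subprocess.run(["ray", "start", "--head"], env=ray_env(False), check=True)
```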

gemini-code-assist

Code Review

This pull request enables per-worker process-group-based cleanup by default by changing the default value of process_group_cleanup_enabled from false to true. The documentation has been updated accordingly to reflect this change and note that it is the preferred cleanup mechanism. No review comments were provided, and the changes are straightforward and correct.

edoakes · 2026-06-26T19:39:49Z

I agree, thanks for opening it @kevin85421!

Did you test w/ the flag turned on and did it solve the issue for you?

edoakes · 2026-06-26T19:40:21Z

Some failing tests that are likely related: https://buildkite.com/ray-project/premerge/builds/68996#019f02ba-31ba-4a2c-b8bd-fbee51ba24b4

kevin85421 · 2026-06-26T20:32:58Z

Hi @edoakes

> Did you test w/ the flag turned on and did it solve the issue for you?

Yes, we verified this with our FSDP training jobs, and it cleaned up the processes successfully.

> Some failing tests that are likely related: https://buildkite.com/ray-project/premerge/builds/68996#019f02ba-31ba-4a2c-b8bd-fbee51ba24b4

I will fix the CI failures and ping you when this PR is ready for review.

With process_group_cleanup_enabled defaulting to true, every worker disconnect triggered killpg(SIGTERM) + killpg(SIGKILL) against the worker's own process group. Because the worker is its own process-group leader, a graceful exit (__ray_terminate__) was being signal-killed before it could finish its shutdown sequence (atexit / __ray_shutdown__ handlers), breaking test_actor_failures.py shutdown tests. The same mechanism tore down Serve's HAProxyManager actor and its HAProxy subprocess mid-graceful-shutdown, surfacing as test_grpc.py::test_serving_grpc_requests failures.

Split per-worker process-group cleanup by disconnect type:

- Non-graceful (crash): the worker is already gone, so signal the group immediately (SIGTERM, then SIGKILL) to reap orphaned descendants.
- Graceful: poll for the worker process to exit on its own, then SIGKILL the group to reap any orphaned descendants. The process group outlives the dead leader as long as members remain, so the pgid stays valid for this post-exit sweep.

This preserves orphan cleanup on graceful actor deletion (test_nested_subprocess_cleanup_with_pg_cleanup) without interrupting the worker's own shutdown.

Add a regression test asserting __ray_shutdown__ runs on graceful termination with PG cleanup enabled.

Verified locally (macOS, Apple clang): test_kill_subprocesses.py (2 passed, 7 linux-only skipped) and the test_actor_failures.py shutdown suite (8 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
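The split described in this commit message can be sketched in Python as follows. This is an illustration of the idea only, not the raylet's actual C++ code: `cleanup_worker_group`, the `worker_exited` callback, and all timing parameters are hypothetical, and sweeping the group anyway once the wait times out is an assumption of this sketch rather than something the commit message states.

```python
import os
import signal
import time


def signal_group(pgid, sig):
    """Send sig to every process in pgid; return False if the group is gone."""
    try:
        os.killpg(pgid, sig)
        return True
    except ProcessLookupError:
        return False


def cleanup_worker_group(pgid, graceful, worker_exited,
                         term_grace_s=0.5, exit_timeout_s=5.0, poll_s=0.05):
    """Clean up a worker's process group; return the list of signals delivered."""
    sent = []
    if not graceful:
        # Crash path: the leader is already dead, so signal the group right away.
        if signal_group(pgid, signal.SIGTERM):
            sent.append(signal.SIGTERM)
            time.sleep(term_grace_s)
            if signal_group(pgid, signal.SIGKILL):
                sent.append(signal.SIGKILL)
        return sent
    # Graceful path: let the worker finish its own shutdown before sweeping.
    deadline = time.monotonic() + exit_timeout_s
    while not worker_exited() and time.monotonic() < deadline:
        time.sleep(poll_s)
    # The group outlives its dead leader while members remain, so pgid is still
    # a valid target. Assumption: after a timeout the sweep happens regardless.
    if signal_group(pgid, signal.SIGKILL):
        sent.append(signal.SIGKILL)
    return sent
```

With a `subprocess.Popen` handle named `worker`, the callback could be `lambda: worker.poll() is not None`. The point of the graceful branch is that no signal reaches the group until the leader has exited, so its atexit-style handlers get to run.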

kevin85421 requested a review from a team as a code owner June 26, 2026 06:36

gemini-code-assist Bot reviewed Jun 26, 2026

kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Jun 26, 2026

Merge branch 'master' into enable-process-group-cleanup-by-default

67a9977

ray-gardener Bot added the "core" (Issues that should be addressed in Ray Core) and "community-contribution" (Contributed by the community) labels Jun 26, 2026

kevin85421 marked this pull request as draft June 26, 2026 21:28

kevin85421 force-pushed the enable-process-group-cleanup-by-default branch from 49166d7 to 492e111 Compare June 26, 2026 21:53

kevin85421 wants to merge 3 commits into ray-project:master from kevin85421:enable-process-group-cleanup-by-default

2 participants
