[core] Enable process_group_cleanup_enabled by default #64368

Draft
kevin85421 wants to merge 3 commits into ray-project:master from kevin85421:enable-process-group-cleanup-by-default

@kevin85421 kevin85421 commented Jun 26, 2026

Member

Why are these changes needed?

This is the second time I have needed this feature (first time: #57638).

I built a Ray script similar to Slurm's sbatch to streamline migrating our training pipelines from Slurm to KubeRay. It uses subprocess to launch torchrun processes, and the torchrun processes typically launch their own child processes. There's no easy way for the application side to ensure the processes are cleaned up thoroughly. Therefore, I'd suggest enabling RAY_process_group_cleanup_enabled by default.
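For readers unfamiliar with the mechanism, the following is a minimal application-side sketch of process-group cleanup, assuming a POSIX system. It is not the script described above and not Ray code; `launch_in_new_group` and `kill_group` are hypothetical helper names. It shows why a process group helps: a launcher and the children it spawns share one group ID, so a single `killpg` call reaches all of them.

```python
import os
import signal
import subprocess


def launch_in_new_group(cmd, **popen_kwargs):
    """Start cmd as the leader of its own session and process group."""
    # start_new_session=True makes the child call setsid(), so its pgid equals
    # its pid, and descendants that do not change groups inherit that pgid.
    return subprocess.Popen(cmd, start_new_session=True, **popen_kwargs)


def kill_group(proc, grace_s=2.0):
    """SIGTERM the whole group, give the leader grace_s to exit, then SIGKILL the group."""
    pgid = proc.pid  # valid only because proc was started as a group leader
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except (ProcessLookupError, PermissionError):
            # ProcessLookupError: no members are left. PermissionError is
            # caught as a precaution (assumption: some platforms report it
            # when only zombie members remain).
            break
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            pass
    proc.wait()  # reap the leader so it does not linger as a zombie
```

This only works if the application controls how the top-level process is launched and if no descendant moves itself into a different group, which is part of why cleanup at the Ray level is more convenient.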

Related issue number

Follow-up to #56476.

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the related rules for more info.

🤖 Generated with Claude Code

Flip the default of RAY_process_group_cleanup_enabled from false to true
so per-worker process-group cleanup is used out of the box. The deprecated
subreaper-based cleanup remains available by disabling this flag, and when
both are enabled the raylet already prefers process-group cleanup.

Update doc/source/ray-core/user-spawn-processes.rst to reflect the new
default.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
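As a rough illustration of opting out after this change, the sketch below builds an environment with the flag set explicitly. It assumes the flag is read from the environment of the process that starts the raylet and that `"0"` and `"1"` are accepted values; `raylet_env` is a hypothetical helper, not part of Ray.

```python
import os

FLAG = "RAY_process_group_cleanup_enabled"


def raylet_env(enable_pg_cleanup, base=None):
    """Return a copy of the environment with the cleanup flag set explicitly."""
    env = dict(os.environ if base is None else base)
    # Assumption: "1" enables and "0" disables the flag.
    env[FLAG] = "1" if enable_pg_cleanup else "0"
    return env
```

For example, `subprocess.run(["ray", "start", "--head"], env=raylet_env(False))` would start a head node with process-group cleanup disabled, under the assumptions above.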
@kevin85421 kevin85421 requested a review from a team as a code owner June 26, 2026 06:36

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request enables per-worker process-group-based cleanup by default by changing the default value of process_group_cleanup_enabled from false to true. The documentation has been updated accordingly to reflect this change and note that it is the preferred cleanup mechanism. No review comments were provided, and the changes are straightforward and correct.

@kevin85421 kevin85421 added the "go" label (add ONLY when ready to merge, run all tests) Jun 26, 2026
@ray-gardener ray-gardener Bot added the "core" (Issues that should be addressed in Ray Core) and "community-contribution" (Contributed by the community) labels Jun 26, 2026
edoakes commented Jun 26, 2026

Collaborator

I agree, thanks for opening it @kevin85421!

Did you test w/ the flag turned on and did it solve the issue for you?

edoakes commented Jun 26, 2026

Collaborator

@kevin85421

Member Author

Hi @edoakes

> Did you test w/ the flag turned on and did it solve the issue for you?

Yes, we verified this with our FSDP training jobs, and it cleaned up the processes successfully.

Some failing tests that are likely related: https://buildkite.com/ray-project/premerge/builds/68996#019f02ba-31ba-4a2c-b8bd-fbee51ba24b4

I will fix the CI failures and ping you when this PR is ready for review.

@kevin85421 kevin85421 marked this pull request as draft June 26, 2026 21:28
With process_group_cleanup_enabled defaulting to true, every worker
disconnect triggered killpg(SIGTERM) + killpg(SIGKILL) against the
worker's own process group. Because the worker is its own process-group
leader, a graceful exit (__ray_terminate__) was being signal-killed
before it could finish its shutdown sequence (atexit / __ray_shutdown__
handlers), breaking test_actor_failures.py shutdown tests. The same
mechanism tore down Serve's HAProxyManager actor and its HAProxy
subprocess mid-graceful-shutdown, surfacing as
test_grpc.py::test_serving_grpc_requests failures.

Split per-worker process-group cleanup by disconnect type:
- Non-graceful (crash): the worker is already gone, so signal the group
  immediately (SIGTERM, then SIGKILL) to reap orphaned descendants.
- Graceful: poll for the worker process to exit on its own, then SIGKILL
  the group to reap any orphaned descendants. The process group outlives
  the dead leader as long as members remain, so the pgid stays valid for
  this post-exit sweep. This preserves orphan cleanup on graceful actor
  deletion (test_nested_subprocess_cleanup_with_pg_cleanup) without
  interrupting the worker's own shutdown.
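
The split described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the raylet's actual code: `cleanup_worker_group`, its parameters, and the choice to sweep with SIGKILL once the polling deadline passes are all hypothetical.

```python
import os
import signal
import subprocess
import time


def _signal_group(pgid, sig):
    """Send sig to every member of pgid; return False if nothing could be signaled."""
    try:
        os.killpg(pgid, sig)
        return True
    except (ProcessLookupError, PermissionError):
        # ProcessLookupError: the group is empty. PermissionError is treated
        # the same way as a precaution (assumption: the group holds only
        # processes owned by this user).
        return False


def cleanup_worker_group(worker, graceful, grace_s=1.0, poll_s=0.05, timeout_s=10.0):
    """Clean up the group led by worker, a Popen started with start_new_session=True."""
    pgid = worker.pid  # the worker is its own process-group leader
    if graceful:
        # Do not signal the worker: poll until it has finished its own shutdown.
        deadline = time.monotonic() + timeout_s
        while worker.poll() is None and time.monotonic() < deadline:
            time.sleep(poll_s)
        # Post-exit sweep: the pgid stays valid while any member remains.
        # Assumption of this sketch: the sweep also runs if the deadline passed.
        _signal_group(pgid, signal.SIGKILL)
    else:
        # Crash path: signal the group immediately, SIGTERM first, then SIGKILL.
        if _signal_group(pgid, signal.SIGTERM):
            time.sleep(grace_s)
            _signal_group(pgid, signal.SIGKILL)
    worker.wait()  # reap the leader if it has not been reaped yet
```

The point of the graceful branch is ordering: the group is signaled only after the leader has exited, so the leader's shutdown handlers are never interrupted, while descendants it left behind are still reaped.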

Add a regression test asserting __ray_shutdown__ runs on graceful
termination with PG cleanup enabled.

Verified locally (macOS, Apple clang): test_kill_subprocesses.py (2
passed, 7 linux-only skipped) and the test_actor_failures.py shutdown
suite (8 passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
@kevin85421 kevin85421 force-pushed the enable-process-group-cleanup-by-default branch from 49166d7 to 492e111 Compare June 26, 2026 21:53