[RayJob][SidecarMode] Optionally wait for head-node schedulability before submit (prevent counterpart to #4399) #4944
Open
tmrtmrtmrtmr wants to merge 1 commit into
Conversation
868846b to
76211be
Compare
…fore submit

Add an alpha feature gate SidecarWaitForHeadSchedulable (off by default) that appends a second readiness wait-loop to the SidecarMode submitter, polling api/node_schedulable_healthz until the head node is in the GCS actor-scheduler resource view, before `ray job submit`. Prevents the transient head-affinity placement failure that ray-project#4399 (SidecarSubmitterRestart) recovers from via restart. Off by default since the Ray endpoint (ray-project/ray#64285) is not yet released.

Refs ray-project#4399, ray-project#4943; depends on ray-project/ray#64285.

Signed-off-by: Timur Vankov <timur.vankov@silvermont.team>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
76211be to
ce1a861
Compare
Author
Ray-side implementation PR (the endpoint this waits on) is up: ray-project/ray#64287 (draft). Design issue: ray-project/ray#64285.
tmrtmrtmrtmr
pushed a commit
to tmrtmrtmrtmr/ray
that referenced
this pull request
Jun 23, 2026
Add GET /api/node_schedulable_healthz returning success once the head node is in the GCS actor-scheduler resource view (placeable for a hard node-affinity actor). Binds GetAllTotalResources into the Python GcsClient (new sync accessor shim), adds HealthChecker.check_head_node_schedulable, the dashboard-head route, and tests.

WIP: not built locally (no Bazel); see PR description for unverified items.

Refs ray-project#64285; refs ray-project/kuberay#4944, ray-project/kuberay#4399.

Signed-off-by: Timur Vankov <timur.vankov@silvermont.team>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Why are these changes needed?
In SidecarMode, the submitter waits only for `api/gcs_healthz` before `ray job submit`. That reflects GCS-process liveness, not head-node schedulability: when the entrypoint runs as the default head-pinned driver, there is a bootstrap window where GCS is up but the head node is not yet in the GCS actor-scheduler resource view, so the supervisor actor (soft=False) is transiently infeasible and the job fails.

This adds an alpha feature gate `SidecarWaitForHeadSchedulable` (off by default). When enabled, the SidecarMode submitter waits on `api/node_schedulable_healthz` instead of `api/gcs_healthz`. That endpoint reflects resource-view membership, which subsumes GCS health, so it replaces (not stacks) the existing wait. The job is only submitted once the head node is actually placeable.

This is the prevent counterpart to the restart-based recovery in #4399 (`SidecarSubmitterRestart`); the two are complementary. Off by default because the endpoint is new in Ray (ray-project/ray#64287, design ray-project/ray#64285); enabling it against an older Ray would loop forever.

Related issue number
Implements #4943. Related to #4399. Depends on ray-project/ray#64287.
Checks
`go test ./controllers/ray/common/...`: gate off keeps the GCS wait byte-identical; gate on swaps it to the schedulable wait (asserts the GCS path is gone and the schedulable path is present).
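The behaviour those tests assert can be pictured with a minimal Go sketch. The function and constant names here are hypothetical and are not taken from the KubeRay source; only the two endpoint paths come from the description above.

```go
package main

import "fmt"

// Endpoint paths as named in the PR description.
const (
	gcsHealthPath         = "api/gcs_healthz"
	schedulableHealthPath = "api/node_schedulable_healthz"
)

// readinessPath (hypothetical) returns the single endpoint the submitter
// waits on. The gate swaps the path rather than adding a second wait.
func readinessPath(waitForHeadSchedulable bool) string {
	if waitForHeadSchedulable {
		return schedulableHealthPath
	}
	return gcsHealthPath
}

func main() {
	for _, gate := range []bool{false, true} {
		fmt.Printf("SidecarWaitForHeadSchedulable=%v waits on %s\n", gate, readinessPath(gate))
	}
}
```

Returning exactly one path per gate state mirrors the "replaces, not stacks" choice: with the gate off the existing behaviour is untouched, and with it on the GCS wait does not appear at all.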