[RayJob][SidecarMode] Optionally wait for head-node schedulability before submit (prevent counterpart to #4399) #4944

Open
tmrtmrtmrtmr wants to merge 1 commit into ray-project:master from tmrtmrtmrtmr:feat/sidecar-wait-head-schedulable

tmrtmrtmrtmr commented Jun 23, 2026

Why are these changes needed?

In SidecarMode, the submitter waits only for api/gcs_healthz before running ray job submit. That endpoint reflects GCS process liveness, not head-node schedulability. When the entrypoint runs as the default head-pinned driver, there is a bootstrap window in which GCS is up but the head node is not yet in the GCS actor-scheduler resource view, so the supervisor actor (soft=False) is transiently infeasible and the job fails.

This adds an alpha feature gate, SidecarWaitForHeadSchedulable (off by default). When enabled, the SidecarMode submitter waits on api/node_schedulable_healthz instead of api/gcs_healthz. That endpoint reflects resource-view membership, which subsumes GCS health, so it replaces the existing wait rather than stacking on top of it. The job is submitted only once the head node is actually placeable.
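The replace-not-stack behaviour described above can be sketched in a few lines of Go. This is a minimal illustration, not the code in this PR: `readinessPath` and `readinessURL` are hypothetical helper names, the boolean parameter stands in for a lookup of the SidecarWaitForHeadSchedulable gate, and the dashboard address is an assumed example value. Only the two endpoint paths come from the description.

```go
package main

import "fmt"

// Endpoint paths named in the PR description.
const (
	gcsHealthPath         = "api/gcs_healthz"
	schedulableHealthPath = "api/node_schedulable_healthz"
)

// readinessPath (hypothetical helper) returns the single endpoint the
// submitter polls before `ray job submit`. waitForHeadSchedulable stands in
// for the SidecarWaitForHeadSchedulable gate. Exactly one path is returned,
// so the schedulable wait replaces the GCS wait instead of being added to it.
func readinessPath(waitForHeadSchedulable bool) string {
	if waitForHeadSchedulable {
		return schedulableHealthPath
	}
	return gcsHealthPath
}

// readinessURL (hypothetical helper) joins an assumed dashboard address with
// the selected path.
func readinessURL(dashboardAddr string, waitForHeadSchedulable bool) string {
	return fmt.Sprintf("http://%s/%s", dashboardAddr, readinessPath(waitForHeadSchedulable))
}

func main() {
	// 127.0.0.1:8265 is an assumed example address, not taken from the PR.
	for _, gate := range []bool{false, true} {
		fmt.Printf("gate=%v -> %s\n", gate, readinessURL("127.0.0.1:8265", gate))
	}
}
```

Returning a single path, rather than a list of waits, is what keeps the gate-off output identical to the existing behaviour.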

This is the preventive counterpart to the restart-based recovery in #4399 (SidecarSubmitterRestart); the two are complementary. The gate is off by default because the endpoint is new in Ray (ray-project/ray#64287, design ray-project/ray#64285): enabling it against an older Ray that lacks the endpoint would make the submitter loop forever.

Related issue number

Implements #4943. Related to #4399. Depends on ray-project/ray#64287.

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests (go test ./controllers/ray/common/...): gate off keeps the GCS wait byte-identical; gate on swaps it to the schedulable wait (asserts the GCS path is gone and the schedulable path is present).

…fore submit

Add an alpha feature gate SidecarWaitForHeadSchedulable (off by default) that
makes the SidecarMode submitter poll api/node_schedulable_healthz, in place of
api/gcs_healthz, until the head node is in the GCS actor-scheduler resource
view, before `ray job submit`. Prevents the transient head-affinity
placement failure that ray-project#4399 (SidecarSubmitterRestart) recovers from via restart.
Off by default since the Ray endpoint (ray-project/ray#64285) is not yet released.

Refs ray-project#4399, ray-project#4943; depends on ray-project/ray#64285.

Signed-off-by: Timur Vankov <timur.vankov@silvermont.team>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

tmrtmrtmrtmr (Author) commented

Ray-side implementation PR (the endpoint this waits on) is up: ray-project/ray#64287 (draft). Design issue: ray-project/ray#64285.

tmrtmrtmrtmr pushed a commit to tmrtmrtmrtmr/ray that referenced this pull request Jun 23, 2026
Add GET /api/node_schedulable_healthz returning success once the head node
is in the GCS actor-scheduler resource view (placeable for a hard
node-affinity actor). Binds GetAllTotalResources into the Python GcsClient
(new sync accessor shim), adds HealthChecker.check_head_node_schedulable,
the dashboard-head route, and tests.

WIP: not built locally (no Bazel) — see PR description for unverified items.

Refs ray-project#64285; refs ray-project/kuberay#4944, ray-project/kuberay#4399.

Signed-off-by: Timur Vankov <timur.vankov@silvermont.team>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>