[dashboard] Add a head-node-schedulable health endpoint (api/node_schedulable_healthz)#64287

Open
tmrtmrtmrtmr wants to merge 1 commit into ray-project:master from tmrtmrtmrtmr:feat/node-schedulable-healthz

@tmrtmrtmrtmr tmrtmrtmrtmr commented Jun 23, 2026

Description

Adds a dashboard-head HTTP endpoint GET /api/node_schedulable_healthz that returns success once the head node is present in the GCS actor-scheduler resource view — i.e. once a hard NodeAffinitySchedulingStrategy(head, soft=False) actor (the default Ray Jobs driver/supervisor placement) can actually be placed on it.

Tooling that submits a head-pinned entrypoint can wait on this instead of /api/gcs_healthz. gcs_healthz only reflects GCS-process liveness and can be green before the head node enters the resource view. In that window the supervisor actor is transiently infeasible and the job fails, because soft=False allows no fallback to another node. Same class as #32167 / #35387; a liveness probe can't close the gap because liveness ≠ resource-view membership.
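For illustration, a client-side wait loop against the new route could look like the sketch below. It is not part of this PR: `http_status` and `wait_until_head_schedulable` are hypothetical helper names, the dashboard address in the usage note is an assumption, and the only things taken from the PR are the route path and the 200/503 behavior.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

HEALTH_PATH = "/api/node_schedulable_healthz"  # route name taken from this PR


def http_status(url: str, timeout_s: float = 2.0) -> int:
    """Return the HTTP status of a GET, or 0 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx replies (e.g. 503 while the head node is not in the view).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Dashboard not listening yet, connection reset, or socket timeout.
        return 0


def wait_until_head_schedulable(
    base_url: str,
    deadline_s: float = 60.0,
    interval_s: float = 0.5,
    probe: Callable[[str], int] = http_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the endpoint until it returns 200 or the deadline passes."""
    url = base_url.rstrip("/") + HEALTH_PATH
    end = time.monotonic() + deadline_s
    while True:
        if probe(url) == 200:
            return True
        if time.monotonic() >= end:
            return False
        sleep(interval_s)
```

A submitter would call `wait_until_head_schedulable("http://127.0.0.1:8265")` (assuming the dashboard listens there) and submit the job only if it returns `True`. The `probe` and `sleep` parameters exist only so the loop can be exercised without a running cluster.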

The signal already exists internally (GcsResourceManager::HandleGetAllTotalResources iterates the same cluster_resource_manager_.GetResourceView() that hard-affinity placement reads, head node included) but no Python GcsClient method or HTTP route surfaces it. This only surfaces it — no scheduler/resource-manager changes.

Layers: sync NodeResourceInfoAccessor::GetAllTotalResources (mirrors GetAllResourceUsage) → GcsClient.get_all_total_resources → HealthChecker.check_head_node_schedulable → the route. Unit tests cover not-in-view → 503 and in-view → 200/success.
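The membership check at the top of that stack reduces to a dictionary lookup. The sketch below is a simplified stand-in, not the code in this PR: `head_node_schedulable_response` is a hypothetical name, the resource view is modeled as a plain mapping from node ID to total resources, and the response bodies are illustrative.

```python
from typing import Mapping, Tuple


def head_node_schedulable_response(
    head_node_id: str,
    total_resources_by_node: Mapping[str, Mapping[str, float]],
) -> Tuple[int, str]:
    """Map resource-view membership of the head node to (status, body)."""
    if head_node_id in total_resources_by_node:
        # Head node is in the view, so a hard node-affinity actor is placeable.
        return 200, "success"
    # GCS may be alive, but the head node has not entered the view yet.
    return 503, f"head node {head_node_id} is not in the resource view"
```

The two branches correspond to the two cases the unit tests are described as covering: not-in-view gives 503, in-view gives 200.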

Related issues

Related to #64285 (design and analysis). Downstream consumer: ray-project/kuberay#4944.

Additional information

Draft — authored without a Bazel build environment, so the C++/Cython layers are validated by CI's first build rather than locally.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces a new health check endpoint /api/node_schedulable_healthz to verify if the head node is schedulable. To support this, it exposes the synchronous GetAllTotalResources GCS RPC through Cython and implements a check in HealthChecker to ensure the head node ID is present in the resource view. Feedback on these changes suggests declaring CGetAllTotalResourcesReply under its correct protobuf header block to prevent compilation issues, and increasing the RPC timeout in HealthChecker from 0.1s to 1.0s to avoid flaky timeouts under heavy load.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/ray/includes/common.pxd Outdated
Comment thread python/ray/dashboard/modules/reporter/utils.py
@tmrtmrtmrtmr tmrtmrtmrtmr force-pushed the feat/node-schedulable-healthz branch from 1a5b157 to 8ef1a62 Compare June 23, 2026 19:51
@tmrtmrtmrtmr

Author

Addressed the review feedback: moved CGetAllTotalResourcesReply to its own gcs_service.pb.h extern block, and bumped the check_head_node_schedulable RPC timeout 0.1s → 1.0s. Also dropped the integration test in favor of the mocked unit tests and trimmed inline comments.
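To show why the timeout value matters, here is a generic asyncio sketch, assuming the check treats an RPC timeout as a failed probe. `check_with_timeout` and `slow_but_healthy_rpc` are hypothetical names and the 0.3 s delay is an arbitrary stand-in for a loaded GCS; none of this is the PR's actual code.

```python
import asyncio
from typing import Awaitable, Callable


async def check_with_timeout(
    rpc: Callable[[], Awaitable[bool]], timeout_s: float
) -> bool:
    """Run the RPC; report unhealthy if it does not answer within timeout_s."""
    try:
        return await asyncio.wait_for(rpc(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return False


async def slow_but_healthy_rpc() -> bool:
    """Stand-in for a GCS call that succeeds, but only after 0.3 s."""
    await asyncio.sleep(0.3)
    return True
```

With a 0.1 s budget the slow reply is reported as unhealthy even though the underlying state is fine; with 1.0 s the same reply passes.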

@tmrtmrtmrtmr tmrtmrtmrtmr marked this pull request as ready for review June 23, 2026 19:54
@tmrtmrtmrtmr tmrtmrtmrtmr requested a review from a team as a code owner June 23, 2026 19:54
Add GET /api/node_schedulable_healthz returning success once the head node
is in the GCS actor-scheduler resource view (placeable for a hard
node-affinity actor). Binds GetAllTotalResources into the Python GcsClient
(new sync accessor shim), adds HealthChecker.check_head_node_schedulable,
the dashboard-head route, and tests.

WIP: not built locally (no Bazel) — see PR description for unverified items.

Refs ray-project#64285; refs ray-project/kuberay#4944, ray-project/kuberay#4399.

Signed-off-by: Timur Vankov <timur.vankov@silvermont.team>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@tmrtmrtmrtmr tmrtmrtmrtmr force-pushed the feat/node-schedulable-healthz branch from 8ef1a62 to 8e9d937 Compare June 23, 2026 20:43
@ray-gardener ray-gardener Bot added the core (Issues that should be addressed in Ray Core) and community-contribution (Contributed by the community) labels Jun 24, 2026