[dashboard] Add a head-node-schedulable health endpoint (api/node_schedulable_healthz) #64287
Code Review
This pull request introduces a new health check endpoint /api/node_schedulable_healthz to verify if the head node is schedulable. To support this, it exposes the synchronous GetAllTotalResources GCS RPC through Cython and implements a check in HealthChecker to ensure the head node ID is present in the resource view. Feedback on these changes suggests declaring CGetAllTotalResourcesReply under its correct protobuf header block to prevent compilation issues, and increasing the RPC timeout in HealthChecker from 0.1s to 1.0s to avoid flaky timeouts under heavy load.
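The check summarized in the review can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `head_node_schedulable`, `fetch_total_resources`, `http_status`, and the dict-shaped resource view are hypothetical stand-ins for `HealthChecker.check_head_node_schedulable` and the `GetAllTotalResources` reply. The 1.0s default mirrors the timeout suggested in the review, and the 200/503 mapping follows the PR description.

```python
from typing import Callable, Dict

# Hypothetical stand-in for the GCS reply: node ID (hex) -> total resources.
ResourceView = Dict[str, Dict[str, float]]


def head_node_schedulable(
    fetch_total_resources: Callable[[float], ResourceView],
    head_node_id: str,
    timeout_s: float = 1.0,  # the review suggests 1.0s rather than 0.1s
) -> bool:
    """Return True only if the head node appears in the resource view."""
    try:
        view = fetch_total_resources(timeout_s)
    except TimeoutError:
        # A timed-out RPC is treated as "not schedulable yet".
        return False
    return head_node_id in view


def http_status(schedulable: bool) -> int:
    """Map the check result to the status codes named in the PR description."""
    return 200 if schedulable else 503
```

Injecting the fetch function keeps the membership test separate from the RPC, so the not-in-view and in-view cases can be exercised without a running GCS.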
Addressed the review feedback: moved |
Add GET /api/node_schedulable_healthz returning success once the head node is in the GCS actor-scheduler resource view (placeable for a hard node-affinity actor). Binds GetAllTotalResources into the Python GcsClient (new sync accessor shim), adds HealthChecker.check_head_node_schedulable, the dashboard-head route, and tests.

WIP: not built locally (no Bazel); see PR description for unverified items.

Refs ray-project#64285; refs ray-project/kuberay#4944, ray-project/kuberay#4399.

Signed-off-by: Timur Vankov <timur.vankov@silvermont.team>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Description
Adds a dashboard-head HTTP endpoint `GET /api/node_schedulable_healthz` that returns `success` once the head node is present in the GCS actor-scheduler resource view, i.e. once a hard `NodeAffinitySchedulingStrategy(head, soft=False)` actor (the default Ray Jobs driver/supervisor placement) can actually be placed on it.

Tooling that submits a head-pinned entrypoint can wait on this instead of `/api/gcs_healthz`. `gcs_healthz` only reflects GCS-process liveness and can be green before the head node enters the resource view, so the supervisor actor is transiently infeasible and the job fails (no `soft=False` fallback). Same class as #32167 / #35387; a liveness probe can't close it because liveness ≠ resource-view membership.

The signal already exists internally (`GcsResourceManager::HandleGetAllTotalResources` iterates the same `cluster_resource_manager_.GetResourceView()` that hard-affinity placement reads, head node included) but no Python `GcsClient` method or HTTP route surfaces it. This only surfaces it; there are no scheduler or resource-manager changes.

Layers: sync `NodeResourceInfoAccessor::GetAllTotalResources` (mirrors `GetAllResourceUsage`) → `GcsClient.get_all_total_resources` → `HealthChecker.check_head_node_schedulable` → the route. Unit tests cover not-in-view → 503 and in-view → 200/`success`.

Related issues
Related to #64285 (design and analysis). Downstream consumer: ray-project/kuberay#4944.
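For illustration, a downstream consumer could gate job submission on the new endpoint with a small polling loop like the one below. This is a minimal sketch under assumptions, not code from this PR or from KubeRay: `probe_once` and `wait_until_schedulable` are hypothetical helpers, and the deadline and interval values are arbitrary. Only the route path and the 200/503 behaviour come from the description.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_once(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if the endpoint answers 200; any error counts as not ready."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # A 503 surfaces as HTTPError (a URLError subclass); refused
        # connections and socket timeouts are OSError.
        return False


def wait_until_schedulable(
    url: str,
    deadline_s: float = 60.0,   # assumed value, tune per deployment
    interval_s: float = 1.0,    # assumed value
    probe: Callable[[str], bool] = probe_once,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the head node is reported schedulable or the deadline passes."""
    end = clock() + deadline_s
    while True:
        if probe(url):
            return True
        if clock() >= end:
            return False
        sleep(interval_s)
```

A caller might run `wait_until_schedulable("http://127.0.0.1:8265/api/node_schedulable_healthz")` before submitting a head-pinned job; the host and port here are assumptions about a local dashboard, not something this PR specifies.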
Additional information
Draft — authored without a Bazel build environment, so the C++/Cython layers are validated by CI's first build rather than locally.