[serve] Fix proxy update loop getting stuck when a proxy's node is removed #64403
Open
nadongjun wants to merge 2 commits into
Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Code Review
This pull request updates the proxy state management in Ray Serve to handle ActorUnschedulableError when checking if a proxy actor is shut down. Specifically, in proxy_state.py, the is_shutdown method now catches ActorUnschedulableError in addition to RayActorError, treating permanently unschedulable proxy actors as ready for shutdown. Additionally, comprehensive unit tests have been added in test_proxy_state.py to verify this behavior under various actor states. There are no review comments, so no further feedback is provided.
Description
When a Serve proxy actor is pinned to a node that later gets removed (autoscaler consolidation, spot reclaim, or abrupt termination), calling `check_health` on it raises `ActorUnschedulableError`. `ActorProxyWrapper.is_shutdown()` only catches `RayActorError` and `GetTimeoutError`, so this exception propagates up through `kill` -> `ProxyState.shutdown` -> `ProxyStateManager.update`. The controller's proxy update loop then hangs every cycle before it can start a new proxy, so `proxies` stays empty and all ingress returns 503. Restarting the head node does not help, because the detached proxy state is restored from GCS.

We hit this in production (Ray 2.56) with HAProxy mode (`RAY_SERVE_ENABLE_HA_PROXY=1`) under repeated head/proxy node churn: the fallback proxy ended up pinned to a removed node, so `update()` got stuck and never reached `_start_proxies_if_needed`. As a result, the proxy actor's `setup()` was never called, `HAProxyManager` never started the HAProxy process, port 8000 had no listener, and external requests got 503.

Catch `ActorUnschedulableError` in `is_shutdown()` and treat the actor as ready for shutdown, so the controller can drop the stale proxy and start a new one.

Related issues
Additional information
`ActorUnschedulableError` is a `RayError` but not a `RayActorError`, which is why the existing `except RayActorError` misses it. Only the unschedulable path changes; the dead, pending, and alive paths behave as before.
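
To illustrate why the old `except` clause misses the error, here is a minimal sketch that does not import Ray. The exception classes are local stand-ins that only mirror the hierarchy described above, and `is_shutdown_before`, `is_shutdown_after`, `unschedulable_health_check`, and `run_update_cycle` are hypothetical helpers, not the actual code in `proxy_state.py`.

```python
# Local stand-ins that mirror the hierarchy described in this PR.
# Assumption: these are NOT the real classes from ray.exceptions.
class RayError(Exception):
    pass


class RayActorError(RayError):
    pass


class GetTimeoutError(RayError, TimeoutError):
    pass


class ActorUnschedulableError(RayError):
    # Sibling of RayActorError, not a subclass of it.
    pass


def is_shutdown_before(check_health):
    """Simplified old behavior: only RayActorError means the actor is gone."""
    try:
        check_health()
    except RayActorError:
        return True
    except GetTimeoutError:
        return False  # health check still pending, actor not confirmed dead
    return False  # health check succeeded, actor is alive


def is_shutdown_after(check_health):
    """Simplified new behavior: an unschedulable actor is also treated as shut down."""
    try:
        check_health()
    except (RayActorError, ActorUnschedulableError):
        return True
    except GetTimeoutError:
        return False
    return False


def unschedulable_health_check():
    """Hypothetical health check for a proxy whose node was removed."""
    raise ActorUnschedulableError("proxy node was removed")


def run_update_cycle(is_shutdown):
    """Hypothetical update cycle: shut down the stale proxy, then start a new one.

    Returns True only if the cycle reaches the "start proxies" step.
    """
    is_shutdown(unschedulable_health_check)  # raises with the old behavior
    return True  # stands in for reaching _start_proxies_if_needed
```

With the old clause, `run_update_cycle(is_shutdown_before)` raises `ActorUnschedulableError` before reaching the start step; with the new clause, `run_update_cycle(is_shutdown_after)` completes.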