[serve] Fix proxy update loop getting stuck when a proxy's node is removed #64403

Open: nadongjun wants to merge 2 commits into ray-project:master from nadongjun:fix-proxy-is-shutdown-unschedulable
@nadongjun

Contributor

Description

When a Serve proxy actor is pinned to a node that later gets removed (autoscaler consolidation, spot reclaim, or abrupt termination), calling check_health on it raises ActorUnschedulableError.

ActorProxyWrapper.is_shutdown() only catches RayActorError and GetTimeoutError, so this exception propagates up through kill -> ProxyState.shutdown -> ProxyStateManager.update. The controller's proxy update loop then raises on every cycle before it can start a new proxy, so the set of proxies stays empty and all ingress returns 503. Restarting the head node does not help, because the detached proxy state is restored from GCS.
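The failure mode above can be modeled with a toy loop. This is only a sketch and not Serve's controller code: `shutdown_stale_proxy`, `update`, and `run_cycles` are hypothetical names, the exception class is a local stand-in for Ray's, and the per-cycle `try/except` in `run_cycles` is an assumption made so the loop can be observed over several cycles.

```python
class ActorUnschedulableError(Exception):
    """Local stand-in for Ray's exception so the sketch runs without Ray."""


def shutdown_stale_proxy() -> None:
    # Hypothetical: health-checking a proxy whose node is gone raises.
    raise ActorUnschedulableError("proxy node was removed")


def update(started: list) -> None:
    # Step 1: tear down the stale proxy. The exception escapes here...
    shutdown_stale_proxy()
    # Step 2: ...so starting a replacement proxy is never reached.
    started.append("new-proxy")


def run_cycles(n: int):
    started, failures = [], 0
    for _ in range(n):
        try:
            update(started)
        except ActorUnschedulableError:
            failures += 1  # every cycle fails at the same point
    return started, failures


started, failures = run_cycles(3)
```

In this model every cycle fails at the teardown step, so `started` stays empty no matter how many cycles run, which mirrors the empty proxy set described above.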

We hit this in production (Ray 2.56) with HAProxy mode (RAY_SERVE_ENABLE_HA_PROXY=1) under repeated head/proxy node churn: the fallback proxy ended up pinned to a removed node, so update() got stuck and never reached _start_proxies_if_needed. As a result, the proxy actor's setup() was never called, HAProxyManager never started the HAProxy process, port 8000 had no listener, and external requests got 503.

  • Fix: catch ActorUnschedulableError in is_shutdown() and treat the actor as ready for shutdown, so the controller can drop the stale proxy and start a new one.

Related issues

Additional information

ActorUnschedulableError is a RayError but not a RayActorError, which is why the existing except RayActorError misses it. Only the unschedulable path changes; the dead, pending, and alive paths behave as before.
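The hierarchy point can be illustrated with stand-in classes. These are local definitions that assume the relationships stated above, not imports from `ray.exceptions`, and `handled_by_actor_error_clause` is a hypothetical helper.

```python
# Stand-ins mirroring the relationships stated in this PR.
class RayError(Exception): ...
class RayActorError(RayError): ...
class ActorUnschedulableError(RayError): ...


def handled_by_actor_error_clause(exc: Exception) -> bool:
    """Report whether `except RayActorError` would catch the exception."""
    try:
        raise exc
    except RayActorError:
        return True
    except RayError:
        # A sibling of RayActorError falls through to the base class.
        return False


is_subclass = issubclass(ActorUnschedulableError, RayActorError)
```

Because the two classes are siblings under `RayError`, an `except RayActorError` clause alone never matches the unschedulable case. With Ray installed, the same `issubclass` check against the real classes in `ray.exceptions` would confirm the relationship described above.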

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
@nadongjun nadongjun requested a review from a team as a code owner June 29, 2026 01:39

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the proxy state management in Ray Serve to handle ActorUnschedulableError when checking if a proxy actor is shut down. Specifically, in proxy_state.py, the is_shutdown method now catches ActorUnschedulableError in addition to RayActorError, treating permanently unschedulable proxy actors as ready for shutdown. Additionally, comprehensive unit tests have been added in test_proxy_state.py to verify this behavior under various actor states. There are no review comments, so no further feedback is provided.

@ray-gardener ray-gardener Bot added the serve (Ray Serve Related Issue) and community-contribution (Contributed by the community) labels Jun 29, 2026

Labels

community-contribution Contributed by the community serve Ray Serve Related Issue
