[serve] Router skips cache invalidation on gRPC request failure #63261

Description

@jeffreywang88

What happened + What you expected to happen

When a DeploymentHandle uses _by_reference=False (gRPC transport), the AsyncioRouter's request-completion callback never invalidates the queue-length cache entry for a failed replica. After a gRPC failure, the next request may still be routed to that replica by power-of-2-choices until either (a) the rejection path on the next request to that replica invalidates the cache, or (b) the controller's long-poll pushes a new replica set.

The actor path (_by_reference=True, default) is unaffected.
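To make the routing consequence concrete, here is a minimal, self-contained sketch of power-of-two-choices over a queue-length cache. It is a simplified model and not Ray Serve's actual router: `choose_replica`, the `probe` callable, and the replica names are hypothetical, and the only assumption taken from the description above is that a cached entry is trusted without contacting the replica.

```python
import random
from typing import Callable, Dict, List, Optional


def choose_replica(
    replicas: List[str],
    cache: Dict[str, int],
    probe: Callable[[str], Optional[int]],
    rng: random.Random,
) -> Optional[str]:
    """Sample up to two replicas and return the one with the shortest queue.

    A cached queue length is used as-is. A replica without a cache entry is
    probed; a probe that returns None (replica unreachable) excludes it.
    """
    candidates = rng.sample(replicas, k=min(2, len(replicas)))
    scored = []
    for replica in candidates:
        qlen = cache.get(replica)
        if qlen is None:
            qlen = probe(replica)  # only reached when there is no cache entry
        if qlen is not None:
            scored.append((qlen, replica))
    return min(scored)[1] if scored else None


rng = random.Random(0)
replicas = ["replica-a", "replica-b"]
# Ground truth seen by a probe: replica-a has failed, replica-b has 3 queued.
actual = {"replica-a": None, "replica-b": 3}
# The cache still holds a stale, attractive entry for the failed replica.
cache = {"replica-a": 0, "replica-b": 3}

stale_choice = choose_replica(replicas, cache, actual.get, rng)

# Invalidating the entry forces a probe, which excludes the failed replica.
cache.pop("replica-a")
fresh_choice = choose_replica(replicas, cache, actual.get, rng)
```

In this model the failed replica keeps winning the comparison for as long as its stale entry survives, which is the window described above.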

Root cause

AsyncioRouter._process_finished_request dispatches on the type of the result passed to the done-callback:

def _process_finished_request(
    self,
    replica_id: ReplicaID,
    internal_request_id: str,
    replica_actor_id: Optional[ray.ActorID],
    result: Union[Any, RayError],
) -> None:
    if RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE:
        self._metrics_manager.dec_num_running_requests_for_replica(replica_id)
    # Notify request router that request completed (for cleanup, e.g., token release)
    if self.request_router:
        self.request_router.on_request_completed(replica_id, internal_request_id)
    actor_died_error = self._get_actor_died_error(result)
    if actor_died_error is not None:
        self._handle_actor_died_error(
            replica_id, replica_actor_id, actor_died_error
        )
    elif isinstance(result, ActorUnavailableError):
        # There are network issues, or replica has died but GCS is down so
        # ActorUnavailableError will be raised until GCS recovers. For the
        # time being, invalidate the cache entry so that we don't try to
        # send requests to this replica without actively probing, and retry
        # routing request.
        if self.request_router:
            self.request_router.on_replica_actor_unavailable(replica_id)

The shape of result depends on the transport:

| Transport | _by_reference | result in done-callback | Source |
|---|---|---|---|
| Actor | True (default) | Deserialized return value or RayError subclass | ObjectRef._on_completed → async_callback in _raylet.pyx |
| gRPC | False | grpc.aio.Call object | gRPCReplicaResult.add_done_callback → grpc.aio.Call.add_done_callback |

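On the gRPC path, result is a grpc.aio.Call object rather than an exception instance, so _get_actor_died_error returns None and the isinstance(result, ActorUnavailableError) check is False. Both branches are skipped silently, and the queue-length cache entry for the replica is left in place.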
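The fall-through can be reproduced without Ray or gRPC installed. The sketch below is an illustration only: the two error classes and `FakeGrpcCall` are hypothetical stand-ins for the Ray errors and for grpc.aio.Call, and `dispatch` mirrors only the branch structure of the excerpt above (the real _get_actor_died_error is reduced to a plain isinstance check here).

```python
from typing import Any


class ActorDiedError(Exception):
    """Hypothetical stand-in for the Ray error of the same name."""


class ActorUnavailableError(Exception):
    """Hypothetical stand-in for the Ray error of the same name."""


class FakeGrpcCall:
    """Hypothetical stand-in for grpc.aio.Call.

    It wraps the outcome of the RPC instead of being the outcome itself,
    so it is never an instance of either error class.
    """

    def __init__(self, failed: bool) -> None:
        self.failed = failed


def dispatch(result: Any) -> str:
    """Return which branch a type-based dispatch on `result` would take."""
    if isinstance(result, ActorDiedError):
        return "handle_actor_died"
    elif isinstance(result, ActorUnavailableError):
        return "invalidate_cache"
    # Neither branch matched: nothing is invalidated.
    return "no_action"


# Actor path: the callback receives the error object itself.
actor_outcome = dispatch(ActorUnavailableError("replica unreachable"))

# gRPC path: the callback receives the call object, even when the RPC failed.
grpc_outcome = dispatch(FakeGrpcCall(failed=True))
```

With these stand-ins, the same failure reaches the invalidation branch on the actor path and no branch at all on the gRPC path, because the failure is only visible inside the call object.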

Versions / Dependencies

master

Reproduction script

N/A

Issue Severity

None


Labels: bug, good-first-issue, serve


Project status: In progress
