[llm][kv][5/N] Implement request routing with KV awareness#64108
jeffreywang88 wants to merge 7 commits into
Code Review
This pull request implements the core KV-aware routing logic by updating KVAwareActor to delegate worker selection, implementing state initialization and replica selection in KVAwareRouter, and adding unit tests for worker selection. Feedback highlights a critical issue in choose_replicas where a lack of error handling and fallback mechanisms for empty candidate lists, missing token IDs, or invalid worker IDs could cause routing crashes.
# TODO: fall back to default routing when there are no token ids to score
# on (``pending_request`` is None, or a body the tokenizer skipped). This
# branch implements the KV-scoring happy path only.
token_ids = pending_request.kwargs[REQUEST_TOKEN_IDS_KWARG]

worker_id_to_replica = {
    get_worker_id(replica.replica_id.unique_id): replica
    for replica in candidate_replicas
}
selection = await self._kv_router_actor.select_worker.remote(
    pending_request.metadata.request_id,
    token_ids,
    list(worker_id_to_replica),
)
chosen = worker_id_to_replica[selection["worker_id"]]
return [[chosen]]
The current implementation of choose_replicas does not handle cases where pending_request is None or when REQUEST_TOKEN_IDS_KWARG is missing from pending_request.kwargs, which will lead to an AttributeError or KeyError and crash the routing task. Additionally, if candidate_replicas is empty or if the selected worker_id is not found in worker_id_to_replica, it can raise an IndexError or KeyError.
We should implement a robust fallback mechanism:
- If candidate_replicas is empty, return [].
- If pending_request is None or lacks token IDs, fall back to default routing by returning [candidate_replicas].
- If the selected worker_id is missing or invalid, fall back to the first candidate replica.
if not candidate_replicas:
    return []
if (
    pending_request is None
    or pending_request.kwargs is None
    or REQUEST_TOKEN_IDS_KWARG not in pending_request.kwargs
):
    return [candidate_replicas]
token_ids = pending_request.kwargs[REQUEST_TOKEN_IDS_KWARG]
worker_id_to_replica = {
    get_worker_id(replica.replica_id.unique_id): replica
    for replica in candidate_replicas
}
selection = await self._kv_router_actor.select_worker.remote(
    pending_request.metadata.request_id,
    token_ids,
    list(worker_id_to_replica),
)
chosen_worker_id = selection.get("worker_id")
if chosen_worker_id in worker_id_to_replica:
    chosen = worker_id_to_replica[chosen_worker_id]
else:
    chosen = candidate_replicas[0]
return [[chosen]]
This is now outdated.
- pending_request is None is handled by returning the candidates unchanged; that can happen from Serve's lazy queue cleanup path after another routing task consumed the request metadata.
- For real requests, missing/empty token IDs are intentionally errors now as the ingress surfaces this as HTTP 400.
- For invalid Dynamo output, allowed_worker_ids is built from the current candidates, so a selected worker outside that set would be a selector contract violation. Now, it fails loudly rather than route to an arbitrary fallback replica.
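The three behaviors described in this reply can be sketched as follows. This is a minimal, self-contained illustration, not the code in this PR: FakeReplica, FakeRequest, FakeSelector, and the value of REQUEST_TOKEN_IDS_KWARG are hypothetical stand-ins, the selector is a local coroutine instead of a Ray actor call, and the exception type raised for a contract violation is an assumption (the reply only says it fails loudly).

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Hypothetical key name for this sketch; the real constant lives in the PR.
REQUEST_TOKEN_IDS_KWARG = "token_ids"


@dataclass
class FakeReplica:
    """Stand-in for a Serve replica, identified by its worker id."""
    worker_id: int


@dataclass
class FakeRequest:
    """Stand-in for a pending request carrying tokenized prompt ids."""
    request_id: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


class FakeSelector:
    """Stand-in for the KV router actor; always answers with one worker id."""

    def __init__(self, worker_id: int) -> None:
        self.worker_id = worker_id

    async def select_worker(
        self, request_id: str, token_ids: List[int], allowed_worker_ids: List[int]
    ) -> Dict[str, int]:
        return {"worker_id": self.worker_id}


async def choose_replicas(
    selector: FakeSelector,
    candidates: List[FakeReplica],
    pending_request: Optional[FakeRequest],
) -> List[List[FakeReplica]]:
    # No request metadata: hand the candidates back unchanged.
    if pending_request is None:
        return [candidates]

    # Missing or empty token ids are an error for real requests.
    token_ids = pending_request.kwargs.get(REQUEST_TOKEN_IDS_KWARG)
    if not token_ids:
        raise ValueError("request has no prompt token ids to score on")

    by_worker = {replica.worker_id: replica for replica in candidates}
    selection = await selector.select_worker(
        pending_request.request_id, token_ids, list(by_worker)
    )

    # A worker outside the allowed set is a selector contract violation.
    # RuntimeError is an assumption of this sketch.
    worker_id = selection["worker_id"]
    if worker_id not in by_worker:
        raise RuntimeError(f"selector returned unknown worker {worker_id}")
    return [[by_worker[worker_id]]]
```

With two candidates and a selector that answers with worker 2, a tokenized request resolves to the second replica only, while a None request returns both candidates untouched.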
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 867cbc3.
@pytest.mark.asyncio
@pytest.mark.timeout(600)
async def test_routes_to_higher_overlap_replica(self, kv_aware_handle):
We should have a release test that measures whether, when too many requests share the same prefix, they are still load balanced rather than all greedily going to the same replica.
Agreed, token loads are implemented as a request routing indicator in #64400, and there's a test specifically for this scenario: https://github.com/ray-project/ray/pull/64400/changes#diff-16bbbb92eb588df44fb671239d980a99e78ce7c17115e1bc97b87f9bb5fd0469R435.
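To make the concern concrete, here is a toy simulation of why overlap-only scoring piles a hot prefix onto one replica and how a load term spreads it out. The scoring formula (overlap minus a weighted load), the names pick and simulate, and the numbers are all assumptions made for illustration; this is not the formula used by #64400 or by Dynamo, and it ignores request completion and the fact that other replicas would start caching the prefix too.

```python
from typing import List


def pick(overlaps: List[float], loads: List[int], load_weight: float) -> int:
    """Return the index of the replica with the best toy score.

    Toy score (an assumption of this sketch): cached-prefix overlap minus
    load_weight times the number of requests already sent to the replica.
    Ties go to the lowest index.
    """
    scores = [o - load_weight * l for o, l in zip(overlaps, loads)]
    return max(range(len(scores)), key=scores.__getitem__)


def simulate(num_requests: int, overlaps: List[float], load_weight: float) -> List[int]:
    """Route num_requests identical-prefix requests and return per-replica counts."""
    loads = [0] * len(overlaps)
    for _ in range(num_requests):
        loads[pick(overlaps, loads, load_weight)] += 1
    return loads


# Replica 0 holds 8 cached blocks of the shared prefix; the others hold none.
greedy = simulate(12, [8, 0, 0], load_weight=0.0)    # overlap only
balanced = simulate(12, [8, 0, 0], load_weight=2.0)  # overlap plus load
```

In this toy setup, greedy sends every request to replica 0, while balanced still favors replica 0 but gives the other two replicas a share once its load outweighs its cache advantage.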
# Conflicts:
#	python/ray/llm/_internal/serve/core/ingress/builder.py
#	python/ray/llm/_internal/serve/core/ingress/router.py
#	python/ray/llm/_internal/serve/core/ingress/tokenizer.py
#	python/ray/llm/_internal/serve/routing_policies/kv_aware/utils.py
#	python/ray/llm/tests/serve/cpu/deployments/routers/test_router.py
#	python/ray/llm/tests/serve/cpu/deployments/routers/test_tokenizer.py
#	release/llm_tests/kv_router_test/test_kv_events.py

Description
After #64224, each request's prompt token IDs are available before routing. Now, KVAwareRouter receives request token IDs, scores candidate replicas by KV-cache overlap and load, and routes to the best one. Scoring is delegated to the deployment-scoped KVRouterActor, which owns the Dynamo SelectionService.
- choose_replicas maps the candidate replicas to their Dynamo worker ids, calls KVRouterActor.select_worker(request_id, token_ids, allowed_worker_ids), and routes to the selected worker's replica.
- select_worker: forwards token_ids and the allowed worker set to SelectionService.select (query-only, no load booking) and returns the chosen worker plus its overlap/score.
- Missing or empty token IDs → ValueError, surfaced by the ingress as HTTP 400.
- ai-dynamo not installed → RuntimeError (HTTP 503) instead of silently routing without KV awareness.
- test_kv_aware_router.py, test_router.py: routing/scoring boundary, token-id validation, no-pending-request fallback, and the ingress 400 mapping.
- test_kv_event_ingestion.py: replica KV events reach the selection service and are scored.
- test_kv_events.py: a prompt cached on one replica is routed back to it through the full KVAwareRouter path, plus /tokenize chat-template parity.

Additional information
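As an illustration of the error handling listed in the description, the mapping from routing failures to HTTP statuses could look roughly like this. The helper name status_for_routing_error is hypothetical and does not appear in the PR, and the 500 default for any other exception is an assumption of this sketch; only the ValueError to 400 and RuntimeError to 503 pairs come from the description.

```python
def status_for_routing_error(exc: Exception) -> int:
    """Map a routing failure to an HTTP status code (hypothetical helper)."""
    if isinstance(exc, ValueError):
        # Bad request input, e.g. missing or empty prompt token ids.
        return 400
    if isinstance(exc, RuntimeError):
        # KV-aware routing unavailable, e.g. ai-dynamo is not installed.
        return 503
    # Assumption: anything else is treated as an internal server error.
    return 500
```

Keeping the two cases distinct lets a client tell a malformed request apart from a deployment that cannot do KV-aware routing at all.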