[llm][kv][5/N] Implement request routing with KV awareness by jeffreywang88 · Pull Request #64108 · ray-project/ray

jeffreywang88 · 2026-06-15T18:54:13Z

Description

After #64224, each request's prompt token IDs are available before routing. KVAwareRouter now receives those token IDs, scores candidate replicas by KV-cache overlap and load, and routes to the best one. Scoring is delegated to the deployment-scoped KVRouterActor, which owns the Dynamo SelectionService.

Scoring: choose_replicas maps the candidate replicas to their Dynamo worker ids, calls KVRouterActor.select_worker(request_id, token_ids, allowed_worker_ids), and routes to the selected worker's replica.
select_worker: forwards token_ids and the allowed worker set to SelectionService.select (query-only, no load booking) and returns the chosen worker plus its overlap/score.
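
The scoring flow described above can be sketched in a few lines. This is a minimal illustration, not the code in this PR: `FakeReplica`, `FakeSelector`, and the `get_worker_id` parsing rule are hypothetical stand-ins, the selector scores by plain prefix overlap only, and the real router calls the actor through `.remote()` rather than awaiting a local object.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class FakeReplica:
    """Hypothetical stand-in for a Serve replica; only the id matters here."""

    unique_id: str


def get_worker_id(unique_id: str) -> int:
    """Assumed mapping from a replica id to an integer worker id."""
    return int(unique_id.rsplit("-", 1)[-1])


class FakeSelector:
    """Hypothetical stand-in for the actor: scores by cached-prefix overlap."""

    def __init__(self, cached: Dict[int, Sequence[int]]):
        self._cached = cached

    def _overlap(self, worker_id: int, token_ids: Sequence[int]) -> int:
        # Count leading tokens shared by the cached prefix and the prompt.
        count = 0
        for cached_tok, tok in zip(self._cached.get(worker_id, ()), token_ids):
            if cached_tok != tok:
                break
            count += 1
        return count

    async def select_worker(
        self, request_id: str, token_ids: Sequence[int], allowed_worker_ids: List[int]
    ) -> dict:
        # Query-only: nothing is booked against the chosen worker.
        best = max(allowed_worker_ids, key=lambda w: self._overlap(w, token_ids))
        return {"worker_id": best, "overlap": self._overlap(best, token_ids)}


async def choose_replicas(selector, request_id, token_ids, candidates):
    # Map candidates to worker ids, ask the selector, route to its choice.
    worker_id_to_replica = {get_worker_id(r.unique_id): r for r in candidates}
    selection = await selector.select_worker(
        request_id, token_ids, list(worker_id_to_replica)
    )
    return [[worker_id_to_replica[selection["worker_id"]]]]
```

Restricting the selector to `allowed_worker_ids` keeps the choice inside the current candidate set, so the returned worker id can always be mapped back to a replica.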
Failure handling:
- No pending request (a routing task with no queued request) → return all candidates so the base router falls back to load-based selection.
- Missing/empty prompt token IDs → ValueError, surfaced by the ingress as HTTP 400.
- ai-dynamo not installed → RuntimeError (HTTP 503) instead of silently routing without KV awareness.
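
As a rough illustration of the error contract listed above, the sketch below maps the two exception types to the HTTP statuses named in the description. The helper names `validate_token_ids` and `status_for_routing_error` are hypothetical and do not appear in the PR; the catch-all 500 for other exceptions is also an assumption.

```python
from http import HTTPStatus
from typing import Optional, Sequence


def validate_token_ids(token_ids: Optional[Sequence[int]]) -> Sequence[int]:
    """Reject missing or empty prompt token IDs (hypothetical helper)."""
    if not token_ids:
        raise ValueError("request has no prompt token IDs to score on")
    return token_ids


def status_for_routing_error(exc: Exception) -> HTTPStatus:
    """Map a routing failure to an HTTP status (hypothetical helper)."""
    if isinstance(exc, ValueError):
        # Bad input from the client: missing or empty token IDs.
        return HTTPStatus.BAD_REQUEST
    if isinstance(exc, RuntimeError):
        # Server-side dependency unavailable, e.g. ai-dynamo not installed.
        return HTTPStatus.SERVICE_UNAVAILABLE
    # Assumption: anything else is treated as an internal error.
    return HTTPStatus.INTERNAL_SERVER_ERROR
```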
Tests:
- CPU (test_kv_aware_router.py, test_router.py): routing/scoring boundary, token-id validation, no-pending-request fallback, and the ingress 400 mapping.
- test_kv_event_ingestion.py: replica KV events reach the selection service and are scored.
- test_kv_events.py: a prompt cached on one replica is routed back to it through the full KVAwareRouter path, plus /tokenize chat-template parity.

gemini-code-assist

Code Review

This pull request implements the core KV-aware routing logic by updating KVAwareActor to delegate worker selection, implementing state initialization and replica selection in KVAwareRouter, and adding unit tests for worker selection. Feedback highlights a critical issue in choose_replicas where a lack of error handling and fallback mechanisms for empty candidate lists, missing token IDs, or invalid worker IDs could cause routing crashes.

gemini-code-assist · 2026-06-15T18:55:42Z

+        # TODO: fall back to default routing when there are no token ids to score
+        # on (``pending_request`` is None, or a body the tokenizer skipped). This
+        # branch implements the KV-scoring happy path only.
+        token_ids = pending_request.kwargs[REQUEST_TOKEN_IDS_KWARG]
+
+        worker_id_to_replica = {
+            get_worker_id(replica.replica_id.unique_id): replica
+            for replica in candidate_replicas
+        }
+        selection = await self._kv_router_actor.select_worker.remote(
+            pending_request.metadata.request_id,
+            token_ids,
+            list(worker_id_to_replica),
+        )
+        chosen = worker_id_to_replica[selection["worker_id"]]
+        return [[chosen]]


The current implementation of choose_replicas does not handle cases where pending_request is None or when REQUEST_TOKEN_IDS_KWARG is missing from pending_request.kwargs, which will lead to an AttributeError or KeyError and crash the routing task. Additionally, if candidate_replicas is empty or if the selected worker_id is not found in worker_id_to_replica, it can raise an IndexError or KeyError.

We should implement a robust fallback mechanism:

If candidate_replicas is empty, return [].

If pending_request is None or lacks token IDs, fall back to default routing by returning [candidate_replicas].

If the selected worker_id is missing or invalid, fall back to the first candidate replica.

        if not candidate_replicas:
            return []

        if (
            pending_request is None
            or pending_request.kwargs is None
            or REQUEST_TOKEN_IDS_KWARG not in pending_request.kwargs
        ):
            return [candidate_replicas]

        token_ids = pending_request.kwargs[REQUEST_TOKEN_IDS_KWARG]

        worker_id_to_replica = {
            get_worker_id(replica.replica_id.unique_id): replica
            for replica in candidate_replicas
        }
        selection = await self._kv_router_actor.select_worker.remote(
            pending_request.metadata.request_id,
            token_ids,
            list(worker_id_to_replica),
        )
        chosen_worker_id = selection.get("worker_id")
        if chosen_worker_id in worker_id_to_replica:
            chosen = worker_id_to_replica[chosen_worker_id]
        else:
            chosen = candidate_replicas[0]
        return [[chosen]]

This is now outdated.

pending_request is None is handled by returning the candidates unchanged; that can happen from Serve’s lazy queue cleanup path after another routing task consumed the request metadata.

For real requests, missing or empty token IDs are now intentionally treated as errors, which the ingress surfaces as HTTP 400.

For invalid Dynamo output: allowed_worker_ids is built from the current candidates, so a selected worker outside that set would be a selector contract violation. The router now fails loudly rather than routing to an arbitrary fallback replica.
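
The three behaviors in this reply can be summarized in a small synchronous sketch. It is only an illustration under stated assumptions: `route` and its parameters are hypothetical names, replicas are represented as strings, and raising `RuntimeError` for a selector contract violation is a guess, since the thread says only that the router "fails loudly".

```python
from typing import Callable, Dict, List, Optional, Sequence


def route(
    token_ids: Optional[Sequence[int]],
    has_pending_request: bool,
    worker_id_to_replica: Dict[int, str],
    select: Callable[[Sequence[int], List[int]], int],
) -> List[List[str]]:
    """Hypothetical condensed version of the routing decision."""
    candidates = list(worker_id_to_replica.values())
    if not has_pending_request:
        # No queued request: hand the candidates back unchanged.
        return [candidates]
    if not token_ids:
        # Real request without token IDs: an error, not a silent fallback.
        raise ValueError("missing or empty prompt token IDs")
    chosen_id = select(token_ids, list(worker_id_to_replica))
    if chosen_id not in worker_id_to_replica:
        # Assumed exception type; the selector broke its contract.
        raise RuntimeError(f"selector returned worker {chosen_id!r} outside the allowed set")
    return [[worker_id_to_replica[chosen_id]]]
```

Compared with the earlier suggestion to fall back to the first candidate, raising here keeps a misbehaving selector visible instead of masking it with arbitrary routing.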

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 867cbc3.

kouroshHakha · 2026-06-24T20:03:00Z

+
+    @pytest.mark.asyncio
+    @pytest.mark.timeout(600)
+    async def test_routes_to_higher_overlap_replica(self, kv_aware_handle):


We should have a release test that measures whether, when too many requests share the same prefix, they are still load balanced rather than all greedily routed to the same replica.

Agreed. Token loads are implemented as a request routing indicator in #64400, and there is a test specifically for this scenario: https://github.com/ray-project/ray/pull/64400/changes#diff-16bbbb92eb588df44fb671239d980a99e78ce7c17115e1bc97b87f9bb5fd0469R435.
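
To make the concern concrete, here is a toy simulation of how a load term can offset prefix affinity. The cost formula (uncached prompt tokens plus in-flight tokens) is a hypothetical choice for illustration; it is not claimed to be the formula used by Dynamo or by #64400. Requests never complete and overlap stays fixed, which exaggerates the effect.

```python
from typing import Dict, List


def pick_worker(prompt_len: int, overlap: Dict[int, int], load: Dict[int, int]) -> int:
    """Pick the worker with the lowest estimated cost (hypothetical formula)."""

    def cost(worker_id: int) -> int:
        # Tokens that still need prefill, plus tokens already in flight.
        return (prompt_len - overlap[worker_id]) + load[worker_id]

    # Ties are broken by the lower worker id to keep the result deterministic.
    return min(overlap, key=lambda w: (cost(w), w))


def simulate(num_requests: int, prompt_len: int, overlap: Dict[int, int]) -> List[int]:
    """Route identical prompts one after another, booking load per pick."""
    load = {worker_id: 0 for worker_id in overlap}
    picks = []
    for _ in range(num_requests):
        worker_id = pick_worker(prompt_len, overlap, load)
        load[worker_id] += prompt_len  # the request is now in flight there
        picks.append(worker_id)
    return picks
```

With `overlap={1: 80, 2: 0}` and 100-token prompts, the first request goes to worker 1 because of its cached prefix, but the booked load then makes worker 2 cheaper for the next one, so the picks alternate instead of piling onto a single replica.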

# Conflicts:
#   python/ray/llm/_internal/serve/core/ingress/builder.py
#   python/ray/llm/_internal/serve/core/ingress/router.py
#   python/ray/llm/_internal/serve/core/ingress/tokenizer.py
#   python/ray/llm/_internal/serve/routing_policies/kv_aware/utils.py
#   python/ray/llm/tests/serve/cpu/deployments/routers/test_router.py
#   python/ray/llm/tests/serve/cpu/deployments/routers/test_tokenizer.py
#   release/llm_tests/kv_router_test/test_kv_events.py

jeffreywang88 changed the title ~~[serve][llm] KV-aware scoring: route via KvRouter.select_worker~~ [DRAFT][llm][kv] KV-aware scoring via KvRouter.select_worker Jun 15, 2026

gemini-code-assist Bot reviewed Jun 15, 2026

jeffreywang88 added 3 commits June 22, 2026 15:56

[serve][llm] Add Tokenizer that tokenizes requests via replica /tokenize

3dd3b1b

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Pass request token IDs into choose_replica from LLMRouter

597f655

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Add CPU tests for pre-routing tokenization

0672485

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the kv-scoring branch from cc992c6 to 9857eac Compare June 22, 2026 23:15

jeffreywang88 changed the base branch from tok-plus-kv-connector to pre-routing-tokenization June 22, 2026 23:16

jeffreywang88 force-pushed the kv-scoring branch 2 times, most recently from f9be1dc to 9c52388 Compare June 23, 2026 07:01

jeffreywang88 added 2 commits June 23, 2026 00:19

[serve][llm] Score KV-aware requests via SelectionService.select

190fabe

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Add CPU and GPU tests for KV request scoring

aad5499

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the kv-scoring branch from 9c52388 to aad5499 Compare June 23, 2026 07:19

Minor edits

867cbc3

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 changed the title ~~[DRAFT][llm][kv] KV-aware scoring via KvRouter.select_worker~~ [llm][kv][5/N] Implement request routing with KV awareness Jun 23, 2026

jeffreywang88 marked this pull request as ready for review June 23, 2026 18:55

jeffreywang88 requested a review from a team as a code owner June 23, 2026 18:55

cursor Bot reviewed Jun 23, 2026

Comment thread python/ray/llm/_internal/serve/routing_policies/kv_aware/kv_aware_router.py

ray-gardener Bot added serve Ray Serve Related Issue llm labels Jun 23, 2026

kouroshHakha reviewed Jun 24, 2026

jeffreywang88 force-pushed the pre-routing-tokenization branch from 0672485 to e671dbc Compare June 26, 2026 08:18

jeffreywang88 mentioned this pull request Jun 27, 2026

[serve][llm] KV cache aware routing & management tracker #64389

Open

17 tasks

jeffreywang88 wants to merge 7 commits into pre-routing-tokenization from kv-scoring

2 participants
