
[llm][kv][5/N] Implement request routing with KV awareness#64108

Open
jeffreywang88 wants to merge 7 commits into pre-routing-tokenization from kv-scoring

Conversation

@jeffreywang88

@jeffreywang88 jeffreywang88 commented Jun 15, 2026

Contributor

Description

After #64224, each request's prompt token IDs are available before routing. KVAwareRouter now receives those token IDs, scores the candidate replicas by KV-cache overlap and load, and routes to the best one. Scoring is delegated to the deployment-scoped KVRouterActor, which owns the Dynamo SelectionService.

  • Scoring: choose_replicas maps the candidate replicas to their Dynamo worker ids, calls KVRouterActor.select_worker(request_id, token_ids, allowed_worker_ids), and routes to the selected worker's replica.
  • select_worker: forwards token_ids and the allowed worker set to SelectionService.select (query-only, no load booking) and returns the chosen worker plus its overlap/score.
  • Failure handling:
    • No pending request (a routing task with no queued request) → return all candidates so the base router falls back to load-based selection.
    • Missing/empty prompt token IDs → ValueError, surfaced by the ingress as HTTP 400.
    • ai-dynamo not installed → RuntimeError (HTTP 503) instead of silently routing without KV awareness.
  • Tests:
    • CPU (test_kv_aware_router.py, test_router.py): routing/scoring boundary, token-id validation, no-pending-request fallback, and the ingress 400 mapping.
    • test_kv_event_ingestion.py: replica KV events reach the selection service and are scored.
    • test_kv_events.py: a prompt cached on one replica is routed back to it through the full KVAwareRouter path, plus /tokenize chat-template parity.
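
The control flow described above can be sketched as follows. This is a minimal, Ray-free illustration, not the PR's code: `Replica`, `PendingRequest`, and `FakeSelector` are hypothetical stand-ins, the value of `REQUEST_TOKEN_IDS_KWARG` is assumed, and the fake selector scores by prefix overlap only (it ignores load, which the real SelectionService also considers).

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Assumed value; the PR only names the constant.
REQUEST_TOKEN_IDS_KWARG = "token_ids"


@dataclass(frozen=True)
class Replica:
    # Hypothetical stand-in for get_worker_id(replica.replica_id.unique_id).
    worker_id: int


@dataclass
class PendingRequest:
    request_id: str
    kwargs: Dict[str, Any] = field(default_factory=dict)


class FakeSelector:
    """Hypothetical stand-in for KVRouterActor.select_worker."""

    def __init__(self, cached: Dict[int, List[int]]):
        self._cached = cached  # worker id -> token IDs cached on that worker

    async def select_worker(self, request_id, token_ids, allowed_worker_ids):
        def overlap(worker_id: int) -> int:
            # Length of the shared prefix between the cache and the prompt.
            shared = 0
            for a, b in zip(self._cached.get(worker_id, []), token_ids):
                if a != b:
                    break
                shared += 1
            return shared

        best = max(allowed_worker_ids, key=overlap)
        return {"worker_id": best, "overlap": overlap(best)}


async def choose_replicas(
    selector: FakeSelector,
    candidates: List[Replica],
    pending_request: Optional[PendingRequest],
) -> List[List[Replica]]:
    # No pending request: hand every candidate back to the base router.
    if pending_request is None:
        return [candidates]

    # Missing or empty token IDs are an error (surfaced upstream as HTTP 400).
    token_ids = pending_request.kwargs.get(REQUEST_TOKEN_IDS_KWARG)
    if not token_ids:
        raise ValueError("prompt token IDs are required for KV-aware routing")

    # Restrict the selector to the current candidates, then map back.
    by_worker = {replica.worker_id: replica for replica in candidates}
    selection = await selector.select_worker(
        pending_request.request_id, token_ids, list(by_worker)
    )
    return [[by_worker[selection["worker_id"]]]]
```

With worker 2 holding the longer cached prefix, a matching prompt is routed to worker 2's replica, while a `None` pending request returns the candidate list unchanged.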

@jeffreywang88 jeffreywang88 changed the title [serve][llm] KV-aware scoring: route via KvRouter.select_worker [DRAFT][llm][kv] KV-aware scoring via KvRouter.select_worker Jun 15, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements the core KV-aware routing logic by updating KVAwareActor to delegate worker selection, implementing state initialization and replica selection in KVAwareRouter, and adding unit tests for worker selection. Feedback highlights a critical issue in choose_replicas where a lack of error handling and fallback mechanisms for empty candidate lists, missing token IDs, or invalid worker IDs could cause routing crashes.

Comment on lines +97 to +112
        # TODO: fall back to default routing when there are no token ids to score
        # on (``pending_request`` is None, or a body the tokenizer skipped). This
        # branch implements the KV-scoring happy path only.
        token_ids = pending_request.kwargs[REQUEST_TOKEN_IDS_KWARG]

        worker_id_to_replica = {
            get_worker_id(replica.replica_id.unique_id): replica
            for replica in candidate_replicas
        }
        selection = await self._kv_router_actor.select_worker.remote(
            pending_request.metadata.request_id,
            token_ids,
            list(worker_id_to_replica),
        )
        chosen = worker_id_to_replica[selection["worker_id"]]
        return [[chosen]]

Contributor

high

The current implementation of choose_replicas does not handle cases where pending_request is None or when REQUEST_TOKEN_IDS_KWARG is missing from pending_request.kwargs, which will lead to an AttributeError or KeyError and crash the routing task. Additionally, if candidate_replicas is empty or if the selected worker_id is not found in worker_id_to_replica, it can raise an IndexError or KeyError.

We should implement a robust fallback mechanism:

  1. If candidate_replicas is empty, return [].
  2. If pending_request is None or lacks token IDs, fall back to default routing by returning [candidate_replicas].
  3. If the selected worker_id is missing or invalid, fall back to the first candidate replica.
        if not candidate_replicas:
            return []

        if (
            pending_request is None
            or pending_request.kwargs is None
            or REQUEST_TOKEN_IDS_KWARG not in pending_request.kwargs
        ):
            return [candidate_replicas]

        token_ids = pending_request.kwargs[REQUEST_TOKEN_IDS_KWARG]

        worker_id_to_replica = {
            get_worker_id(replica.replica_id.unique_id): replica
            for replica in candidate_replicas
        }
        selection = await self._kv_router_actor.select_worker.remote(
            pending_request.metadata.request_id,
            token_ids,
            list(worker_id_to_replica),
        )
        chosen_worker_id = selection.get("worker_id")
        if chosen_worker_id in worker_id_to_replica:
            chosen = worker_id_to_replica[chosen_worker_id]
        else:
            chosen = candidate_replicas[0]
        return [[chosen]]

@jeffreywang88 jeffreywang88 Jun 23, 2026

Contributor Author

This is now outdated.

  • pending_request is None is handled by returning the candidates unchanged; that can happen from Serve’s lazy queue cleanup path after another routing task consumed the request metadata.

  • For real requests, missing or empty token IDs are now intentionally errors; the ingress surfaces them as HTTP 400.

  • For invalid Dynamo output, allowed_worker_ids is built from the current candidates, so a selected worker outside that set would be a selector contract violation. The router now fails loudly rather than routing to an arbitrary fallback replica.
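
The "fail loudly" choice in the last point can be illustrated with a small helper. This is a sketch under assumptions: `resolve_selected_replica` is a hypothetical name, and raising `RuntimeError` is an illustrative choice, since the thread does not say which exception the router raises.

```python
from typing import Dict, Mapping, TypeVar

R = TypeVar("R")


def resolve_selected_replica(
    selection: Mapping[str, object],
    worker_id_to_replica: Dict[int, R],
) -> R:
    """Map the selector's answer back to a replica, rejecting unknown workers."""
    worker_id = selection.get("worker_id")
    if worker_id not in worker_id_to_replica:
        # The selector was only offered the keys of worker_id_to_replica, so
        # any other answer breaks its contract. Raising keeps the bug visible
        # instead of hiding it behind an arbitrary fallback replica.
        raise RuntimeError(
            f"selector returned worker_id={worker_id!r}, "
            f"expected one of {sorted(worker_id_to_replica)}"
        )
    return worker_id_to_replica[worker_id]
```

Compared with falling back to `candidate_replicas[0]`, this trades availability in a should-never-happen case for an error that points directly at the selector.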

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 changed the base branch from tok-plus-kv-connector to pre-routing-tokenization June 22, 2026 23:16
@jeffreywang88 jeffreywang88 force-pushed the kv-scoring branch 2 times, most recently from f9be1dc to 9c52388 Compare June 23, 2026 07:01
@jeffreywang88 jeffreywang88 changed the title [DRAFT][llm][kv] KV-aware scoring via KvRouter.select_worker [llm][kv][5/N] Implement request routing with KV awareness Jun 23, 2026
@jeffreywang88 jeffreywang88 marked this pull request as ready for review June 23, 2026 18:55
@jeffreywang88 jeffreywang88 requested a review from a team as a code owner June 23, 2026 18:55

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 867cbc3.

@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue llm labels Jun 23, 2026

@pytest.mark.asyncio
@pytest.mark.timeout(600)
async def test_routes_to_higher_overlap_replica(self, kv_aware_handle):

Contributor

We should have a release test that verifies that when too many requests share the same prefix, they are still load balanced rather than all greedily routed to the same replica.

Contributor Author

Agreed, token loads are implemented as a request routing indicator in #64400, and there's a test specifically for this scenario: https://github.com/ray-project/ray/pull/64400/changes#diff-16bbbb92eb588df44fb671239d980a99e78ce7c17115e1bc97b87f9bb5fd0469R435.
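
To make the concern concrete, here is a toy simulation of how adding a load term keeps a hot prefix from pinning every request to one replica. The cost function, the block counts, and the names `pick_worker` and `simulate` are all assumptions for illustration; this is neither Dynamo's scoring nor the implementation in #64400, and load never drains because requests never complete in this toy.

```python
from typing import Dict, List


def pick_worker(
    overlap_blocks: Dict[int, int],
    active_blocks: Dict[int, int],
    request_blocks: int,
    overlap_weight: float = 1.0,
) -> int:
    """Pick the worker with the lowest toy cost (ties go to the lowest id)."""

    def cost(worker_id: int) -> float:
        # Blocks that would need fresh prefill, plus the load already booked.
        new_blocks = request_blocks - overlap_blocks.get(worker_id, 0)
        return overlap_weight * new_blocks + active_blocks.get(worker_id, 0)

    return min(sorted(active_blocks), key=cost)


def simulate(num_requests: int, request_blocks: int = 8) -> List[int]:
    """Send identical-prefix requests to two workers and record each pick."""
    overlap = {0: request_blocks, 1: 0}  # only worker 0 starts with the prefix
    load = {0: 0, 1: 0}
    picks = []
    for _ in range(num_requests):
        worker = pick_worker(overlap, load, request_blocks)
        load[worker] += request_blocks     # book the request's load
        overlap[worker] = request_blocks   # the prefix is now cached there
        picks.append(worker)
    return picks
```

In this toy, the first request goes to the worker that already holds the prefix, but once its booked load outweighs the prefill savings, later requests spill over to the other worker.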

# Conflicts:
#	python/ray/llm/_internal/serve/core/ingress/builder.py
#	python/ray/llm/_internal/serve/core/ingress/router.py
#	python/ray/llm/_internal/serve/core/ingress/tokenizer.py
#	python/ray/llm/_internal/serve/routing_policies/kv_aware/utils.py
#	python/ray/llm/tests/serve/cpu/deployments/routers/test_router.py
#	python/ray/llm/tests/serve/cpu/deployments/routers/test_tokenizer.py
#	release/llm_tests/kv_router_test/test_kv_events.py

Labels

llm serve Ray Serve Related Issue

2 participants