[llm][kv][4/N] Pre-routing tokenization by jeffreywang88 · Pull Request #64224 · ray-project/ray

jeffreywang88 · 2026-06-18T23:36:59Z

Description

KV-aware routing scores replicas by prompt-token KV-cache overlap, so the router needs the request's token IDs before it picks a replica.

This PR adds pre-routing tokenization: the ingress LLMRouter tokenizes each request via a replica's /tokenize endpoint and forwards the token IDs into choose_replica, so a KV-aware request router (introduced in a follow-up) can score on them.

- Tokenizer: tokenizes the body via the LLMServer /tokenize endpoint (add_generation_prompt/add_special_tokens mirror the generation path so the IDs match the engine's prefill tokens). Returns None for bodies it does not route on (truncated/empty, multi-prompt, no messages/prompt); raises TokenizeError on rejection.
- Router wiring (core/ingress/router.py): the pre_routing_tokenization flag builds a Tokenizer; route tokenizes before _pick_replica, forwards request_token_ids via choose_replica (into PendingRequest.kwargs), and maps TokenizeError → HTTPException.
- Not yet consumed: KVAwareRouter.choose_replicas is still a stub; reading the token IDs and forwarding them to KVRouterActor.select_worker lands in a follow-up.
- Tests: test_tokenizer.py (Tokenizer unit + router wiring + builder gate) and a real-Tokenizer assertion in test_router.py.
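For readers skimming the description, the control flow it outlines can be sketched roughly as below. This is a simplified illustration, not the code in this PR: `FakeTokenizer`, `HTTPError`, the local `TokenizeError`, and the free functions `route` and `choose_replica` are hypothetical stand-ins, and whitespace splitting replaces the real /tokenize call.

```python
import asyncio
from typing import Any, Dict, List, Optional


class TokenizeError(Exception):
    """Stand-in for a rejected tokenize request (hypothetical shape)."""

    def __init__(self, message: str, status_code: int = 400):
        super().__init__(message)
        self.status_code = status_code


class HTTPError(Exception):
    """Stand-in for an HTTP exception type such as FastAPI's HTTPException."""

    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


class FakeTokenizer:
    """Hypothetical tokenizer: splits on whitespace instead of calling /tokenize."""

    async def tokenize(self, body: Dict[str, Any]) -> Optional[List[int]]:
        prompt = body.get("prompt")
        if not isinstance(prompt, str) or not prompt:
            return None  # body is not routed on: token-less routing
        if len(prompt) > 1000:
            raise TokenizeError("prompt too long", status_code=400)
        return [len(word) for word in prompt.split()]


async def choose_replica(request_token_ids: Optional[List[int]] = None) -> dict:
    # A KV-aware policy would score replicas on these IDs; here they are echoed.
    return {"replica": "replica-0", "request_token_ids": request_token_ids}


async def route(body: Dict[str, Any], tokenizer: Optional[FakeTokenizer]) -> dict:
    """Tokenize first, then hand the token IDs (or None) to replica selection."""
    token_ids = None
    if tokenizer is not None:  # plays the role of the pre_routing_tokenization flag
        try:
            token_ids = await tokenizer.tokenize(body)
        except TokenizeError as e:
            raise HTTPError(e.status_code, str(e)) from e
    return await choose_replica(request_token_ids=token_ids)


routed = asyncio.run(route({"prompt": "hello kv router"}, FakeTokenizer()))
fallback = asyncio.run(route({"prompt": ["a", "b"]}, FakeTokenizer()))
```

In this sketch a multi-prompt body yields `None` and is still routed, while a rejected body surfaces as an HTTP-style error instead of reaching replica selection.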

gemini-code-assist

Code Review

This pull request introduces pre-routing tokenization for KV-aware routing in Ray LLM. It adds a Tokenizer class to tokenize incoming chat or completion requests via the replica's /tokenize endpoint, passing the resulting token IDs to the replica selection mechanism. This pre-routing tokenization is conditionally enabled only when a KVAwareRouter is configured. The review feedback focuses on improving robustness: handling non-integer error codes and missing attributes in tokenization responses, catching unexpected exceptions during tokenization to gracefully fall back to token-less routing, and defensively verifying that the retrieved request router is a class before calling issubclass.


kouroshHakha · 2026-06-24T19:38:22Z

+            if not isinstance(payload, dict):
+                return None
+
+            if "messages" in payload:


This is a bit brittle. For example, how do we support new API formats, e.g. the Anthropic SDK? Is there a better way? (Worth pondering.)

Fair enough, but I'd rather cross that bridge when we get there. Probably need to introduce some schema that the payload is validated against and dispatch off that request type.

Noted this gap in #64389.
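One possible shape for the schema-style dispatch mentioned in the reply is a small registry of request kinds, each with its own detector. This is only a sketch of the idea under discussion: the names `DETECTORS` and `classify_payload` and the two detectors are assumptions for illustration, not part of this PR or of #64389.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Detector = Callable[[Dict[str, Any]], bool]


def _is_chat(payload: Dict[str, Any]) -> bool:
    # A chat body carries a non-empty list of messages.
    messages = payload.get("messages")
    return isinstance(messages, list) and len(messages) > 0


def _is_completion(payload: Dict[str, Any]) -> bool:
    # A completion body carries a single non-empty string prompt.
    prompt = payload.get("prompt")
    return isinstance(prompt, str) and len(prompt) > 0


# Ordered registry: the first matching detector decides the request kind.
# Supporting another API format would mean appending one more entry here.
DETECTORS: List[Tuple[str, Detector]] = [
    ("chat", _is_chat),
    ("completion", _is_completion),
]


def classify_payload(payload: Any) -> Optional[str]:
    """Return the request kind, or None when the body should not be routed on."""
    if not isinstance(payload, dict):
        return None
    for kind, detect in DETECTORS:
        if detect(payload):
            return kind
    return None
```

With this layout the tokenizer would branch on the returned kind rather than on raw key checks, so a new format touches only the registry.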

kouroshHakha · 2026-06-24T19:39:45Z

+        # /tokenize yields a single response; drain the stream fully so the
+        # handle response is cleaned up.
+        resp = None
+        async for chunk in self._handle.options(stream=True).tokenize.remote(


Why use streaming? Is that how the server APIs are written? (Do they return generators?)

Yeah, our LLMServer (LLMServerProtocol) implements tokenize to return generators:

ray/python/ray/llm/_internal/serve/core/protocol.py

Lines 174 to 178 in ff2cfc6

	async def tokenize(
	    self,
	    request: "TokenizeRequest",
	    raw_request_info: Optional[RawRequestInfo] = None,
	) -> AsyncGenerator[Union["TokenizeResponse", "ErrorResponse"], None]:

Adjusting the return type would be a breaking change and is out of scope for this PR.

Noted this gap in #64389.
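The drain-the-stream pattern discussed in this thread can be shown in isolation with plain asyncio. The sketch below assumes nothing about Ray Serve handles: `fake_tokenize` is a hypothetical async generator standing in for a tokenize method that yields exactly one response, and `drain_last` is an illustrative helper name.

```python
import asyncio
from typing import AsyncGenerator, Optional


async def fake_tokenize() -> AsyncGenerator[dict, None]:
    """Hypothetical stand-in for a tokenize method that yields one response."""
    yield {"tokens": [1, 2, 3]}


async def drain_last(stream: AsyncGenerator[dict, None]) -> Optional[dict]:
    """Consume the stream to exhaustion and return its last item, if any."""
    last = None
    async for chunk in stream:
        last = chunk  # keep iterating so the generator finishes cleanly
    return last


result = asyncio.run(drain_last(fake_tokenize()))
```

Iterating to the end, rather than breaking after the first chunk, lets the generator run to completion, which is the cleanup concern the code comment in the diff refers to.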

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 6b5f0d6.

cursor · 2026-06-28T21:02:16Z

+                if not isinstance(payload["prompt"], str):
+                    # TODO (jeffreywang): Multi-prompt (list) tokenization is unsupported;
+                    # fall back to token-less routing.
+                    return None


Empty messages skips prompt tokenize

Medium Severity

_build_tokenize_request chooses the chat tokenize path whenever the messages key exists, but _parse_routing_payload treats empty or null messages as no routing signal and still routes on a truthy prompt. Those bodies reach the tokenizer with a valid routing payload yet return None or chat tokens that do not match the completion prefill.
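The mismatch Bugbot describes can be reproduced with two simplified path selectors. These functions are illustrative stand-ins written from the comment's description, not the actual _build_tokenize_request or _parse_routing_payload; the names `path_by_key_presence` and `path_by_truthiness` are hypothetical.

```python
from typing import Any, Dict, Optional


def path_by_key_presence(payload: Dict[str, Any]) -> Optional[str]:
    """Pick the chat path whenever the messages key exists, even if empty."""
    if "messages" in payload:
        return "chat"
    if "prompt" in payload:
        return "completion"
    return None


def path_by_truthiness(payload: Dict[str, Any]) -> Optional[str]:
    """Pick the chat path only when messages has content; otherwise try prompt."""
    if payload.get("messages"):
        return "chat"
    if payload.get("prompt"):
        return "completion"
    return None


# A body with empty messages but a usable prompt exposes the disagreement.
body = {"messages": [], "prompt": "hello"}
key_presence_choice = path_by_key_presence(body)
truthiness_choice = path_by_truthiness(body)
```

For this body the key-presence check selects the chat path while the truthiness check selects the completion path, so aligning both functions on one rule would be one way to remove the inconsistency.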


jeffreywang88 requested a review from a team as a code owner June 18, 2026 23:37

gemini-code-assist Bot reviewed Jun 18, 2026


Comment thread python/ray/llm/_internal/serve/core/ingress/tokenizer.py

Comment thread python/ray/llm/_internal/serve/core/ingress/router.py Outdated

Comment thread python/ray/llm/_internal/serve/routing_policies/kv_aware/utils.py Outdated

cursor Bot reviewed Jun 18, 2026


Comment thread python/ray/llm/_internal/serve/core/ingress/tokenizer.py

ray-gardener Bot added the serve (Ray Serve Related Issue) and llm labels Jun 19, 2026

jeffreywang88 force-pushed the kv-aware-routing-event-plane branch from 108973f to e2e4cfa Compare June 22, 2026 19:00

jeffreywang88 added 4 commits June 22, 2026 12:42

[serve][llm] Emit engine KV events for KVAwareRouter deployments

60d043e

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Ingest KV events via Dynamo SelectionService in KVRouter…

eee024b

…Actor Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Group KV router CPU tests under kv/ and add KV-events tests

af63bed

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Add Dynamo KV-router release tests (selection service + …

b021b1c

…GPU e2e) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the kv-aware-routing-event-plane branch from e2e4cfa to b021b1c Compare June 22, 2026 19:43

Self-review

627e01f

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the pre-routing-tokenization branch 2 times, most recently from 7b7452c to 0672485 Compare June 22, 2026 22:56

jeffreywang88 added the go (add ONLY when ready to merge, run all tests) label Jun 22, 2026

jeffreywang88 mentioned this pull request Jun 23, 2026

[llm][kv][5/N] Implement request routing with KV awareness #64108

Open

kouroshHakha reviewed Jun 24, 2026


jeffreywang88 added 10 commits June 24, 2026 21:06

CR feedback round 1

eac08a0

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

CR feedback round 2

08aae97

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Add assign_replica_kv_events_endpoint test

adb7c6b

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Remove unreachable conditional

685c2d4

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Refine comment

55fb0c4

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Fix black

9b88fac

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

Merge branch 'master' into kv-aware-routing-event-plane

87aaa77

[serve][llm] Add Tokenizer that tokenizes requests via replica /tokenize

7d5d9e0

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Pass request token IDs into choose_replica from LLMRouter

43d90cf

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

[serve][llm] Add CPU tests for pre-routing tokenization

e671dbc

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>

jeffreywang88 force-pushed the pre-routing-tokenization branch from 0672485 to e671dbc Compare June 26, 2026 08:18

jeffreywang88 mentioned this pull request Jun 27, 2026

[serve][llm] KV cache aware routing & management tracker #64389

Open

17 tasks

Base automatically changed from kv-aware-routing-event-plane to master June 27, 2026 19:48

Merge remote-tracking branch 'origin/master' into pre-routing-tokeniz…

6b5f0d6

…ation # Conflicts: # release/llm_tests/kv_router_test/test_kv_event_ingestion.py # release/llm_tests/kv_router_test/test_kv_events.py

cursor Bot reviewed Jun 28, 2026

