[llm][kv][4/N] Pre-routing tokenization#64224
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces pre-routing tokenization for KV-aware routing in Ray LLM. It adds a Tokenizer class to tokenize incoming chat or completion requests via the replica's /tokenize endpoint, passing the resulting token IDs to the replica selection mechanism. This pre-routing tokenization is conditionally enabled only when a KVAwareRouter is configured. The review feedback focuses on improving robustness: handling non-integer error codes and missing attributes in tokenization responses, catching unexpected exceptions during tokenization to gracefully fall back to token-less routing, and defensively verifying that the retrieved request router is a class before calling issubclass.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
108973f to
e2e4cfa
Compare
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…Actor Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…GPU e2e) Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
e2e4cfa to
b021b1c
Compare
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
7b7452c to
0672485
Compare
| if not isinstance(payload, dict): | ||
| return None | ||
|
|
||
| if "messages" in payload: |
There was a problem hiding this comment.
this is a bit brittle. For example, how do we support new api formats e.g. anthropic sdk, etc. Is there a better way. (worth pondering about)?
There was a problem hiding this comment.
Fair enough, but I'd rather cross that bridge when we get there. Probably need to introduce some schema that the payload is validated against and dispatch off that request type.
| # /tokenize yields a single response; drain the stream fully so the | ||
| # handle response is cleaned up. | ||
| resp = None | ||
| async for chunk in self._handle.options(stream=True).tokenize.remote( |
There was a problem hiding this comment.
why use streaming? is that how servers api are written ? (they return generators? )
There was a problem hiding this comment.
yeah our LLMServer(LLMServerProtocol) implements tokenize to return generators:
ray/python/ray/llm/_internal/serve/core/protocol.py
Lines 174 to 178 in ff2cfc6
There was a problem hiding this comment.
Adjusting the return type would cause breaking changes and is out-of-scope of this PR.
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
0672485 to
e671dbc
Compare
…ation # Conflicts: # release/llm_tests/kv_router_test/test_kv_event_ingestion.py # release/llm_tests/kv_router_test/test_kv_events.py
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 6b5f0d6. Configure here.
| if not isinstance(payload["prompt"], str): | ||
| # TODO (jeffreywang): Multi-prompt (list) tokenization is unsupported; | ||
| # fall back to token-less routing. | ||
| return None |
There was a problem hiding this comment.
Empty messages skips prompt tokenize
Medium Severity
_build_tokenize_request chooses the chat tokenize path whenever the messages key exists, but _parse_routing_payload treats empty or null messages as no routing signal and still routes on a truthy prompt. Those bodies reach the tokenizer with a valid routing payload yet return None or chat tokens that do not match the completion prefill.
Reviewed by Cursor Bugbot for commit 6b5f0d6. Configure here.


Description
KV-aware routing scores replicas by prompt-token KV-cache overlap, so the router needs the request's token IDs before it picks a replica.
This PR adds pre-routing tokenization: the ingress
LLMRoutertokenizes each request via a replica's/tokenizeendpoint and forwards the token IDs intochoose_replica, so a KV-aware request router (introduced in a follow-up) can score on them.LLMServer/tokenizeendpoint (add_generation_prompt/add_special_tokensmirror the generation path so the IDs match the engine's prefill tokens). ReturnsNonefor bodies not routed on (truncated/empty, multi-prompt, nomessages/prompt); raisesTokenizeErroron rejection.core/ingress/router.py):pre_routing_tokenizationflag builds aTokenizer;routetokenizes before_pick_replica, forwardsrequest_token_idsviachoose_replica(intoPendingRequest.kwargs), and mapsTokenizeError→HTTPException.KVAwareRouter.choose_replicasis still a stub; reading the token IDs and forwarding toKVRouterActor.select_workerlands in a follow-up.test_tokenizer.py(Tokenizer unit + router wiring + builder gate) and a real-Tokenizerassertion intest_router.py.Related issues
Additional information