Skip to content

[llm][kv][4/N] Pre-routing tokenization#64224

Open
jeffreywang88 wants to merge 16 commits into
masterfrom
pre-routing-tokenization
Open

[llm][kv][4/N] Pre-routing tokenization#64224
jeffreywang88 wants to merge 16 commits into
masterfrom
pre-routing-tokenization

Conversation

@jeffreywang88

Copy link
Copy Markdown
Contributor

Description

KV-aware routing scores replicas by prompt-token KV-cache overlap, so the router needs the request's token IDs before it picks a replica.

This PR adds pre-routing tokenization: the ingress LLMRouter tokenizes each request via a replica's /tokenize endpoint and forwards the token IDs into choose_replica, so a KV-aware request router (introduced in a follow-up) can score on them.

  • Tokenizer: tokenizes the body via the LLMServer /tokenize endpoint (add_generation_prompt/add_special_tokens mirror the generation path so the IDs match the engine's prefill tokens). Returns None for bodies not routed on (truncated/empty, multi-prompt, no messages/prompt); raises TokenizeError on rejection.
  • Router wiring (core/ingress/router.py): pre_routing_tokenization flag builds a Tokenizer; route tokenizes before _pick_replica, forwards request_token_ids via choose_replica (into PendingRequest.kwargs), and maps TokenizeErrorHTTPException.
  • Not yet consumed: KVAwareRouter.choose_replicas is still a stub; reading the token IDs and forwarding to KVRouterActor.select_worker lands in a follow-up.
  • Tests: test_tokenizer.py (Tokenizer unit + router wiring + builder gate) and a real-Tokenizer assertion in test_router.py.
image

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

@jeffreywang88 jeffreywang88 requested a review from a team as a code owner June 18, 2026 23:37

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces pre-routing tokenization for KV-aware routing in Ray LLM. It adds a Tokenizer class to tokenize incoming chat or completion requests via the replica's /tokenize endpoint, passing the resulting token IDs to the replica selection mechanism. This pre-routing tokenization is conditionally enabled only when a KVAwareRouter is configured. The review feedback focuses on improving robustness: handling non-integer error codes and missing attributes in tokenization responses, catching unexpected exceptions during tokenization to gracefully fall back to token-less routing, and defensively verifying that the retrieved request router is a class before calling issubclass.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread python/ray/llm/_internal/serve/core/ingress/tokenizer.py
Comment thread python/ray/llm/_internal/serve/core/ingress/router.py Outdated
Comment thread python/ray/llm/_internal/serve/routing_policies/kv_aware/utils.py Outdated
Comment thread python/ray/llm/_internal/serve/core/ingress/tokenizer.py
@ray-gardener ray-gardener Bot added serve Ray Serve Related Issue llm labels Jun 19, 2026
@jeffreywang88 jeffreywang88 force-pushed the kv-aware-routing-event-plane branch from 108973f to e2e4cfa Compare June 22, 2026 19:00
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…Actor

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
…GPU e2e)

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 force-pushed the kv-aware-routing-event-plane branch from e2e4cfa to b021b1c Compare June 22, 2026 19:43
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 force-pushed the pre-routing-tokenization branch 2 times, most recently from 7b7452c to 0672485 Compare June 22, 2026 22:56
@jeffreywang88 jeffreywang88 added the go add ONLY when ready to merge, run all tests label Jun 22, 2026
Comment thread python/ray/llm/_internal/serve/core/ingress/tokenizer.py Outdated
if not isinstance(payload, dict):
return None

if "messages" in payload:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bit brittle. For example, how do we support new api formats e.g. anthropic sdk, etc. Is there a better way. (worth pondering about)?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, but I'd rather cross that bridge when we get there. Probably need to introduce some schema that the payload is validated against and dispatch off that request type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted this gap in #64389.

Comment thread python/ray/llm/_internal/serve/core/ingress/tokenizer.py Outdated
# /tokenize yields a single response; drain the stream fully so the
# handle response is cleaned up.
resp = None
async for chunk in self._handle.options(stream=True).tokenize.remote(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why use streaming? is that how servers api are written ? (they return generators? )

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah our LLMServer(LLMServerProtocol) implements tokenize to return generators:

async def tokenize(
self,
request: "TokenizeRequest",
raw_request_info: Optional[RawRequestInfo] = None,
) -> AsyncGenerator[Union["TokenizeResponse", "ErrorResponse"], None]:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adjusting the return type would cause breaking changes and is out-of-scope of this PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted this gap in #64389.

Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
@jeffreywang88 jeffreywang88 force-pushed the pre-routing-tokenization branch from 0672485 to e671dbc Compare June 26, 2026 08:18
Base automatically changed from kv-aware-routing-event-plane to master June 27, 2026 19:48
…ation

# Conflicts:
#	release/llm_tests/kv_router_test/test_kv_event_ingestion.py
#	release/llm_tests/kv_router_test/test_kv_events.py

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6b5f0d6. Configure here.

if not isinstance(payload["prompt"], str):
# TODO (jeffreywang): Multi-prompt (list) tokenization is unsupported;
# fall back to token-less routing.
return None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty messages skips prompt tokenize

Medium Severity

_build_tokenize_request chooses the chat tokenize path whenever the messages key exists, but _parse_routing_payload treats empty or null messages as no routing signal and still routes on a truthy prompt. Those bodies reach the tokenizer with a valid routing payload yet return None or chat tokens that do not match the completion prefill.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6b5f0d6. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests llm serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants