Skip to content

feat(vllm-model): consume native dynamo token data#1784

Draft
jthomson04 wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
jthomson04:codex/dynamo-native-token-transport
Draft

feat(vllm-model): consume native dynamo token data#1784
jthomson04 wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
jthomson04:codex/dynamo-native-token-transport

Conversation

@jthomson04

@jthomson04 jthomson04 commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

  • consume Dynamo-native token IDs from response.nvext.engine_data when a chat-completion response includes them
  • attach prompt_token_ids, generation_token_ids, and generation_log_probs from the native response data without calling /tokenize
  • keep the existing /tokenize fallback only for normal vLLM-style responses that do not include engine_data
  • fail fast on malformed engine_data instead of silently falling back to /tokenize

The NeMo-RL Dynamo wrapper is responsible for requesting nvext.engine_data and supplying Dynamo nvext.token_data; this Gym change only consumes the response-side native token data.

Companion NeMo-RL PR: jthomson04/RL#9

Validation

  • uv run pytest responses_api_models/vllm_model/tests/test_app.py -q -k TokenIDInformation -> 3 passed, 66 deselected
  • uv run ruff check responses_api_models/vllm_model/app.py responses_api_models/vllm_model/tests/test_app.py

…nize

When per-message prompt_token_ids/generation_token_ids are attached to
assistant messages (training mode), populate the top-level
required_prefix_token_ids field on both the chat-completion request and
the separate tokenize request. Mirrors NeMoRLOpenAIChatRequestMixin
auto-derive in nemo-rl's custom vLLM serving (vllm_worker_async.py).

Without this, Dynamo - which has the splice machinery server-side but
no auto-derive - re-tokenizes the chat history each turn, breaking the
byte-level token-contiguity invariant on multi-turn rollouts. The fix
must apply to BOTH endpoints because the contiguity assert in
nemo_rl/environments/nemo_gym.py reads prompt_token_ids from the
tokenize response, not the chat response. Patching only chat fails at
the tokenize step.

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
@jthomson04 jthomson04 force-pushed the codex/dynamo-native-token-transport branch from 683aa26 to 0be56ef Compare June 26, 2026 20:08
@jthomson04 jthomson04 changed the title feat(vllm-model): support dynamo native token transport feat(vllm-model): consume native dynamo token data Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant