[rollout] feat: Add Multimodal Continuous Token for AgentLoop#6804
[rollout] feat: Add Multimodal Continuous Token for AgentLoop#6804gxlvera wants to merge 62 commits into
Conversation
- KimiVLContinuousTokenBuilder: inherits ContinuousTokenBuilder (direct concatenation, no newline insertion). Uses <|media_start|>/<|media_end|> vision tokens. Handles merge_kernel_size as list [2,2]. - GLM4VContinuousTokenBuilder: inherits GLMContinuousTokenBuilder (observation/user boundary removal). Uses <|begin_of_image|>/<|end_of_image|> wrappers. Same pixel_values/image_grid_thw format as Qwen2.5-VL. Both verified on H100 with CT vs Legacy comparison (3/3 MATCH each). Models tested: moonshotai/Kimi-VL-A3B-Instruct, zai-org/GLM-4.5V.
- Fix _slice_mm_delta: preserve mm_processor_kwargs in return dict (all 4 builders) - Fix Kimi-VL: pass full pixel_values when image_grid_thw unavailable (no silent drop) - Remove unused pad token IDs (_image_pad_id, _media_pad_id, _image_token_id) - Remove unused model_type parameter and attribute from QwenVL/MiMoVL - Update tests to match new __init__ signatures
…er-then-slice Instead of re-rendering ALL images through the processor and slicing out the delta, now renders ONLY new images incrementally. pixel_values are context-independent per image (confirmed byte-identical on H100), so this produces the same result with less compute. Before: O(all_images) processor calls per merge with new images After: O(new_images) processor calls per merge Full render is still used for token_ids (chat template needs full context). CT vs Legacy verified 4/4 models x 3 scenarios = 12/12 MATCH.
Add GLM4V and KimiVL to model support tables and comparison results. Updated total: 5 models, 12/12 scenarios MATCH.
When merge_tokens uses full render for token_ids, the rendered sequence already contains correct boundary tokens (e.g. \n after <|im_end|> for Qwen). Using _merge_token_ids would double-insert the boundary. Fix: use direct concatenation in the VL image path since full render produces the ground-truth token sequence. The text-only path still uses _merge_token_ids correctly (it computes tokens incrementally without full render, so boundary insertion is needed there). Found by comparing with slime's approach which revealed the asymmetry.
Replaces 2 processor calls with 1 using dummy+trim pattern: - Synthetic prefix + new messages → single processor call - Trim synthetic prefix tokens → incremental token_ids + pixel_values - Apply _merge_token_ids for boundary handling Qwen/MiMo/Kimi: fully single-call (1 processor call per merge) GLM4V: 2 calls (full render for token_ids due to boundary deletion, incremental for pixel_values) — GLM boundary handler deletes tokens from runtime, which breaks the dummy+trim assumption for token_ids. Verified 4/4 models PASS on H100 with CT vs Legacy comparison.
Replaces 2 processor calls with 1 using dummy+trim pattern: - Synthetic prefix + new messages → single processor call - Trim synthetic prefix tokens → incremental token_ids + pixel_values - Apply _merge_token_ids for boundary handling Qwen/MiMo/Kimi: fully single-call (1 processor call per merge) GLM4V: 2 calls (full render for token_ids due to boundary deletion, incremental for pixel_values) — GLM boundary handler deletes tokens from runtime, which breaks the dummy+trim assumption for token_ids. Verified 4/4 models PASS on H100 with CT vs Legacy comparison.
Merge学姐's refactoring from gxl-ct-dev: - Rename tokenize_incremental_messages -> tokenize_non_assistant_incremental_messages - Rename _merge_token_ids -> _merge_non_assistant_token_ids - Remove ct_align_response_metadata (moved elsewhere) - Add _DUMMY_TOOL_NAME constant - Restructure MergeResult docstring Resolved conflicts in continuous_token.py and test files. 113 tests pass.
…gxl-ct-dev-mm # Conflicts: # verl/utils/continuous_token.py
|
|
There was a problem hiding this comment.
Code Review
This pull request extends the Continuous Token (CT) framework to support vision-language (VL) models (such as Qwen2.5-VL, MiMo-VL, GLM-4V, and Kimi-VL) in multi-turn agentic rollouts. It introduces specialized multimodal token builders, integrates them into the agent loops to handle incremental image token expansion, and adds comprehensive unit and integration tests to verify token-level correctness against legacy re-encoding paths. The review feedback highlights opportunities to simplify the code by removing redundant hasattr checks for supports_multimodal in agent_loop.py, as this method is already defined on the base builder class.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| tokenizer_name_or_path=model_config.tokenizer_path, | ||
| ) | ||
| builder_cls = get_continuous_token_builder_class(resolved_family) | ||
| needs_processor = hasattr(builder_cls, "supports_multimodal") and builder_cls.supports_multimodal() |
There was a problem hiding this comment.
The hasattr(builder_cls, "supports_multimodal") check is redundant because supports_multimodal is defined on the base class ContinuousTokenBuilder from which all registered builders inherit. This violates the repository's general rule against over-defensive programming.
| needs_processor = hasattr(builder_cls, "supports_multimodal") and builder_cls.supports_multimodal() | |
| needs_processor = builder_cls.supports_multimodal() |
References
- Avoid over-defensive programming, such as wrapping inspect.signature in try-except blocks or using redundant getattr and None checks, unless there is a concrete, documented risk of runtime failure.
| if ( | ||
| self.continuous_token_builder is not None | ||
| and hasattr(self.continuous_token_builder, "supports_multimodal") | ||
| and self.continuous_token_builder.supports_multimodal() | ||
| ): |
There was a problem hiding this comment.
The hasattr(self.continuous_token_builder, "supports_multimodal") check is redundant because supports_multimodal is defined on the base class ContinuousTokenBuilder. This violates the repository's general rule against over-defensive programming.
if (
self.continuous_token_builder is not None
and self.continuous_token_builder.supports_multimodal()
):References
- Avoid over-defensive programming, such as wrapping inspect.signature in try-except blocks or using redundant getattr and None checks, unless there is a concrete, documented risk of runtime failure.
|
The class design will be refactored soon, we will add a VLContinuousTokenBuilder which will be extends by all modal family. So the duplicated logics in each model-specific VLCTB will be deleted. |
Per design review: replace 4 copy-pasted VL builders with a shared mixin
that combines via Python MRO with the appropriate text-family builder.
VLContinuousTokenMixin
- shared: build_initial_tokens, merge_tokens, render_tokens_with_mm,
_render_incremental_with_mm, _extract_images_from_messages,
extract_vision_placeholders, count_vision_tokens,
_resolve_spatial_merge_size, supports_multimodal
- subclass attrs: vision_start_token, vision_end_token, merge_size_attr
- subclass hook: _prepare_mm_messages (MiMo overrides for content flatten)
QwenVLContinuousTokenBuilder(VLMixin, QwenContinuousTokenBuilder)
MiMoVLContinuousTokenBuilder(VLMixin, QwenContinuousTokenBuilder) + flatten
GLM4VContinuousTokenBuilder(VLMixin, GLMContinuousTokenBuilder)
KimiVLContinuousTokenBuilder(VLMixin, ContinuousTokenBuilder)
MRO dispatches _merge_non_assistant_token_ids to the correct text-family
boundary handler (Qwen newline patch, GLM observation/user trim, or base
direct concat).
1527 -> 1028 lines (~500 lines of duplication removed).
Verified:
- 113 unit tests pass
- GPU CT vs Legacy: 4/4 models MATCH (Qwen2.5-VL, MiMo-VL, GLM-4.5V, Kimi-VL)
- 6/6 corner cases PASS (tool+image, 3 images, system+MM, image-only,
text-then-image, 3-turn alternating)
- vLLM dedup compatibility: 4/4 PASS (Qwen-style dedup; others passthrough)
Direct concatenation boundary handling (no inter-turn separator). Validates EOS token with correct Unicode (U+FF5C fullwidth vertical line + U+2581 lower one-eighth block). V3/R1-specific tokens (User, Assistant, BOS) are optional lookups (tolerate absence on V2-Lite). GPU verified: DeepSeek-V2-Lite-Chat CT vs Legacy 3/3 MATCH. 115 local tests pass.
DeepSeek-VL2 uses a custom DeepseekVLV2Processor that handles conversation formatting + image token expansion in a single __call__ (no apply_chat_template). The builder bypasses the standard VLContinuousTokenMixin render path and uses the processor directly with full render + prefix diff. Key design decisions: - Does NOT inherit VLContinuousTokenMixin (VL2 has no paired vision markers) - Inherits DeepSeekContinuousTokenBuilder for boundary handling (direct concat) - Uses processor __call__ for all rendering (tokenizer has no chat_template) - Always uses full render + prefix diff (processor has stable prefixes) - extract_vision_placeholders: contiguous <image> token run detection - count_vision_tokens formula: 211 + 196*m*n + 14*m (verified empirically) - Requires monkey-patch to bypass transformers version incompatibility GPU verified: deepseek-ai/deepseek-vl2-tiny CT vs Legacy 3/3 MATCH. 117 local tests pass.
| """ | ||
| return False | ||
|
|
||
| def count_vision_tokens(self, image_grid_thw_row: tuple[int, int, int]) -> int: |
There was a problem hiding this comment.
is this function called anywhere?
| "Override supports_multimodal() and this method for VL models." | ||
| ) | ||
|
|
||
| def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]: |
There was a problem hiding this comment.
same, is this function called anywhere?
| def supports_multimodal(cls) -> bool: | ||
| return True | ||
|
|
||
| def count_vision_tokens(self, image_grid_thw_row: tuple[int, int, int]) -> int: |
There was a problem hiding this comment.
is it called anywhere?
| def supports_multimodal(cls) -> bool: | ||
| return True | ||
|
|
||
| def count_vision_tokens(self, spatial_crop_row: tuple[int, int]) -> int: |
| merge = self._spatial_merge_size | ||
| return t * (h // merge) * (w // merge) | ||
|
|
||
| def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]: |
| m, n = spatial_crop_row | ||
| return 211 + 196 * m * n + 14 * m | ||
|
|
||
| def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]: |
| return self._render_tokens(messages, add_generation_prompt=True, tools=tools) | ||
| return self.render_tokens_with_mm(messages, images, add_generation_prompt=True, tools=tools) | ||
|
|
||
| def merge_tokens( |
There was a problem hiding this comment.
this function should override merge_non_assistant_tokens, and then should be called by ToolAgentLoop
| all_images = self._extract_images_from_messages(updated_messages) | ||
| full_token_ids = self._render_via_processor(updated_messages, all_images, add_generation_prompt=True) | ||
|
|
||
| prefix_len = len(runtime_token_ids) |
There was a problem hiding this comment.
This slices full_token_ids by length without verifying that runtime_token_ids is actually a prefix, so prefix drift from the processor would be silently accepted, which is dangerous
What does this PR do?
This PR depends on #6779 (Continuous Token base infrastructure). It extends ContinuousTokenBuilder to VLContinuouTokenBuilder and integrate with ToolAgentLoop.
What's added is that it now enables processor's encoding of text with multimodal info. Multimodal info is not concatenated by turns, instead AgentLoopWorker will postprocess it by using processor on full messages to get all multimodal info )(e.g. pixel values).
What's unchanged is each VL model family can still inherent its model-specific behavior (especiall merge_token_id behavior) on text.
Test
API and Usage Example
# Add code snippet or script demonstrating how to use thisDesign & Code Changes