[rollout] feat: Add Multimodal Continuous Token for AgentLoop by gxlvera · Pull Request #6804 · verl-project/verl

gxlvera · 2026-06-21T10:31:44Z

What does this PR do?

This PR depends on #6779 (Continuous Token base infrastructure). It extends ContinuousTokenBuilder to VLContinuousTokenBuilder and integrates it with ToolAgentLoop.
What's added: the processor can now encode text together with multimodal info. Multimodal info is not concatenated turn by turn; instead, AgentLoopWorker postprocesses it by running the processor on the full messages to obtain all multimodal info (e.g. pixel values).
What's unchanged: each VL model family can still inherit its model-specific behavior on text (especially merge_token_id behavior).

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

- KimiVLContinuousTokenBuilder: inherits ContinuousTokenBuilder (direct concatenation, no newline insertion). Uses <|media_start|>/<|media_end|> vision tokens. Handles merge_kernel_size as list [2,2].
- GLM4VContinuousTokenBuilder: inherits GLMContinuousTokenBuilder (observation/user boundary removal). Uses <|begin_of_image|>/<|end_of_image|> wrappers. Same pixel_values/image_grid_thw format as Qwen2.5-VL.

Both verified on H100 with CT vs Legacy comparison (3/3 MATCH each). Models tested: moonshotai/Kimi-VL-A3B-Instruct, zai-org/GLM-4.5V.

- Fix _slice_mm_delta: preserve mm_processor_kwargs in return dict (all 4 builders)
- Fix Kimi-VL: pass full pixel_values when image_grid_thw unavailable (no silent drop)
- Remove unused pad token IDs (_image_pad_id, _media_pad_id, _image_token_id)
- Remove unused model_type parameter and attribute from QwenVL/MiMoVL
- Update tests to match new __init__ signatures

…er-then-slice

Instead of re-rendering ALL images through the processor and slicing out the delta, now renders ONLY new images incrementally. pixel_values are context-independent per image (confirmed byte-identical on H100), so this produces the same result with less compute.

Before: O(all_images) processor calls per merge with new images
After: O(new_images) processor calls per merge

Full render is still used for token_ids (chat template needs full context). CT vs Legacy verified 4/4 models x 3 scenarios = 12/12 MATCH.
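The incremental idea in this commit message can be sketched as follows. This is a minimal illustration, not the PR's implementation: `fake_encode_image` and `IncrementalPixelCache` are hypothetical names, and the per-image encoder is a stand-in for whatever the real processor does. The only property it relies on is the one stated above, that per-image output does not depend on the surrounding context.

```python
from typing import Callable, Sequence


def fake_encode_image(image: bytes) -> list[int]:
    # Hypothetical stand-in for the per-image work done by a real processor.
    return [b % 7 for b in image]


class IncrementalPixelCache:
    """Encode only the images that have not been seen in earlier merges."""

    def __init__(self, encode: Callable[[bytes], list[int]]) -> None:
        self._encode = encode
        self.pixel_values: list[list[int]] = []
        self.calls = 0  # number of per-image encode calls made so far

    def extend(self, all_images: Sequence[bytes]) -> list[list[int]]:
        # Images before this index were already encoded in a previous merge.
        start = len(self.pixel_values)
        for image in all_images[start:]:
            self.pixel_values.append(self._encode(image))
            self.calls += 1
        # Return only the delta produced by this merge.
        return self.pixel_values[start:]


cache = IncrementalPixelCache(fake_encode_image)
cache.extend([b"ab"])                   # turn 1: one image
delta = cache.extend([b"ab", b"cd"])    # turn 2: one new image
```

With two turns and two images in total, the encoder runs twice rather than three times, which is the O(new_images) versus O(all_images) difference described above.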

Add GLM4V and KimiVL to model support tables and comparison results. Updated total: 5 models, 12/12 scenarios MATCH.

When merge_tokens uses full render for token_ids, the rendered sequence already contains correct boundary tokens (e.g. \n after <|im_end|> for Qwen). Using _merge_token_ids would double-insert the boundary. Fix: use direct concatenation in the VL image path since full render produces the ground-truth token sequence. The text-only path still uses _merge_token_ids correctly (it computes tokens incrementally without full render, so boundary insertion is needed there). Found by comparing with slime's approach which revealed the asymmetry.
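A toy reproduction of the double-insertion problem may make the asymmetry easier to see. The token ids and both helper functions below are invented for illustration; they only mimic the described behavior (a newline after the end-of-turn token) and are not the PR's `_merge_token_ids`.

```python
IM_END = 2    # hypothetical id standing in for <|im_end|>
NEWLINE = 3   # hypothetical id standing in for "\n"


def merge_with_boundary(runtime: list[int], incremental: list[int]) -> list[int]:
    """Text-only path: the incremental tokens lack the separator, so add it."""
    if runtime and runtime[-1] == IM_END:
        return runtime + [NEWLINE] + incremental
    return runtime + incremental


def merge_direct(runtime: list[int], rendered_suffix: list[int]) -> list[int]:
    """VL image path: the suffix comes from a full render and is already complete."""
    return runtime + rendered_suffix


# A full render of two turns already contains the newline between them.
full_render = [1, 10, IM_END, NEWLINE, 1, 11, IM_END]
runtime = [1, 10, IM_END]
suffix = full_render[len(runtime):]

correct = merge_direct(runtime, suffix)
doubled = merge_with_boundary(runtime, suffix)
```

Here `correct` equals the full render, while `doubled` contains two consecutive newline tokens, which is the mismatch the fix removes.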

Replaces 2 processor calls with 1 using dummy+trim pattern:
- Synthetic prefix + new messages → single processor call
- Trim synthetic prefix tokens → incremental token_ids + pixel_values
- Apply _merge_token_ids for boundary handling

Qwen/MiMo/Kimi: fully single-call (1 processor call per merge)
GLM4V: 2 calls (full render for token_ids due to boundary deletion, incremental for pixel_values); the GLM boundary handler deletes tokens from runtime, which breaks the dummy+trim assumption for token_ids.

Verified 4/4 models PASS on H100 with CT vs Legacy comparison.
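The dummy+trim pattern can be sketched with a toy renderer. Everything here is hypothetical (`toy_render`, `DUMMY_PREFIX`, the role ids); a real chat template is more involved, and the sketch assumes the rendered prefix is stable when more messages are appended, which is the assumption the commit message says breaks for GLM4V.

```python
ROLE_IDS = {"user": 1, "tool": 2}  # hypothetical role marker ids
END = 0                            # hypothetical end-of-turn id

Message = tuple[str, str]


def toy_render(messages: list[Message]) -> list[int]:
    # Stand-in for a chat-template render: role marker, text, end-of-turn.
    tokens: list[int] = []
    for role, text in messages:
        tokens += [ROLE_IDS[role]] + [ord(c) for c in text] + [END]
    return tokens


DUMMY_PREFIX: list[Message] = [("user", "x")]
DUMMY_TOKENS = toy_render(DUMMY_PREFIX)  # computed once, reused for every merge


def render_incremental(new_messages: list[Message]) -> list[int]:
    # Single render call per merge: synthetic prefix plus the new messages.
    combined = toy_render(DUMMY_PREFIX + new_messages)
    # Check the prefix really is stable before trimming it away.
    if combined[: len(DUMMY_TOKENS)] != DUMMY_TOKENS:
        raise ValueError("synthetic prefix changed after appending messages")
    return combined[len(DUMMY_TOKENS):]


incremental = render_incremental([("tool", "ok")])
```

The trimmed result contains only the tokens for the new tool message; boundary handling against the running sequence would still be applied afterwards.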

Merge a senior schoolmate's refactoring from gxl-ct-dev:
- Rename tokenize_incremental_messages -> tokenize_non_assistant_incremental_messages
- Rename _merge_token_ids -> _merge_non_assistant_token_ids
- Remove ct_align_response_metadata (moved elsewhere)
- Add _DUMMY_TOOL_NAME constant
- Restructure MergeResult docstring

Resolved conflicts in continuous_token.py and test files. 113 tests pass.

…gxl-ct-dev-mm # Conflicts: # verl/utils/continuous_token.py

CLAassistant · 2026-06-21T10:31:53Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Duckycoders
❌ gxlvera
You have signed the CLA already but the status is still pending? Let us recheck it.

gemini-code-assist

Code Review

This pull request extends the Continuous Token (CT) framework to support vision-language (VL) models (such as Qwen2.5-VL, MiMo-VL, GLM-4V, and Kimi-VL) in multi-turn agentic rollouts. It introduces specialized multimodal token builders, integrates them into the agent loops to handle incremental image token expansion, and adds comprehensive unit and integration tests to verify token-level correctness against legacy re-encoding paths. The review feedback highlights opportunities to simplify the code by removing redundant hasattr checks for supports_multimodal in agent_loop.py, as this method is already defined on the base builder class.

gemini-code-assist · 2026-06-21T10:33:32Z

+                tokenizer_name_or_path=model_config.tokenizer_path,
+            )
+            builder_cls = get_continuous_token_builder_class(resolved_family)
+            needs_processor = hasattr(builder_cls, "supports_multimodal") and builder_cls.supports_multimodal()


The hasattr(builder_cls, "supports_multimodal") check is redundant because supports_multimodal is defined on the base class ContinuousTokenBuilder from which all registered builders inherit. This violates the repository's general rule against over-defensive programming.

Suggested change

needs_processor = hasattr(builder_cls, "supports_multimodal") and builder_cls.supports_multimodal()

needs_processor = builder_cls.supports_multimodal()

References

Avoid over-defensive programming, such as wrapping inspect.signature in try-except blocks or using redundant getattr and None checks, unless there is a concrete, documented risk of runtime failure.
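A small sketch of why the direct call is safe, assuming (as the review states) that the base class defines `supports_multimodal` as a classmethod. The class bodies are stripped-down stand-ins that reuse names from this PR, and `needs_processor` is written as a free function purely for illustration.

```python
class ContinuousTokenBuilder:
    """Stand-in for the base builder: text-only by default."""

    @classmethod
    def supports_multimodal(cls) -> bool:
        return False


class QwenVLContinuousTokenBuilder(ContinuousTokenBuilder):
    """Stand-in for a VL builder that opts in to multimodal handling."""

    @classmethod
    def supports_multimodal(cls) -> bool:
        return True


def needs_processor(builder_cls: type[ContinuousTokenBuilder]) -> bool:
    # Every registered builder inherits the method, so no hasattr guard is needed.
    return builder_cls.supports_multimodal()


text_only = needs_processor(ContinuousTokenBuilder)
multimodal = needs_processor(QwenVLContinuousTokenBuilder)
```

The guard would only matter if a class outside this hierarchy could be registered, which the review says is not the case.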

gemini-code-assist · 2026-06-21T10:33:32Z

+            if (
+                self.continuous_token_builder is not None
+                and hasattr(self.continuous_token_builder, "supports_multimodal")
+                and self.continuous_token_builder.supports_multimodal()
+            ):


The hasattr(self.continuous_token_builder, "supports_multimodal") check is redundant because supports_multimodal is defined on the base class ContinuousTokenBuilder. This violates the repository's general rule against over-defensive programming.

            if (
                self.continuous_token_builder is not None
                and self.continuous_token_builder.supports_multimodal()
            ):

References

Avoid over-defensive programming, such as wrapping inspect.signature in try-except blocks or using redundant getattr and None checks, unless there is a concrete, documented risk of runtime failure.

gxlvera · 2026-06-21T10:34:22Z

The class design will be refactored soon: we will add a VLContinuousTokenBuilder that all model families will extend, so the duplicated logic in each model-specific VL builder will be deleted.

Per design review: replace 4 copy-pasted VL builders with a shared mixin that combines via Python MRO with the appropriate text-family builder.

VLContinuousTokenMixin
- shared: build_initial_tokens, merge_tokens, render_tokens_with_mm, _render_incremental_with_mm, _extract_images_from_messages, extract_vision_placeholders, count_vision_tokens, _resolve_spatial_merge_size, supports_multimodal
- subclass attrs: vision_start_token, vision_end_token, merge_size_attr
- subclass hook: _prepare_mm_messages (MiMo overrides for content flatten)

QwenVLContinuousTokenBuilder(VLMixin, QwenContinuousTokenBuilder)
MiMoVLContinuousTokenBuilder(VLMixin, QwenContinuousTokenBuilder) + flatten
GLM4VContinuousTokenBuilder(VLMixin, GLMContinuousTokenBuilder)
KimiVLContinuousTokenBuilder(VLMixin, ContinuousTokenBuilder)

MRO dispatches _merge_non_assistant_token_ids to the correct text-family boundary handler (Qwen newline patch, GLM observation/user trim, or base direct concat).

1527 -> 1028 lines (~500 lines of duplication removed).

Verified:
- 113 unit tests pass
- GPU CT vs Legacy: 4/4 models MATCH (Qwen2.5-VL, MiMo-VL, GLM-4.5V, Kimi-VL)
- 6/6 corner cases PASS (tool+image, 3 images, system+MM, image-only, text-then-image, 3-turn alternating)
- vLLM dedup compatibility: 4/4 PASS (Qwen-style dedup; others passthrough)
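The MRO dispatch described in this commit message can be illustrated with a reduced sketch. The class names follow the commit message, but every method body is a toy stand-in (the newline id is made up, and the real builders take more arguments), so treat this as a picture of the dispatch mechanism only.

```python
NEWLINE = -1  # hypothetical token id for the Qwen newline patch


class ContinuousTokenBuilder:
    def _merge_non_assistant_token_ids(self, runtime: list[int], new: list[int]) -> list[int]:
        # Base behavior: direct concatenation.
        return runtime + new


class QwenContinuousTokenBuilder(ContinuousTokenBuilder):
    def _merge_non_assistant_token_ids(self, runtime: list[int], new: list[int]) -> list[int]:
        # Text-family behavior: insert a newline at the turn boundary.
        return runtime + [NEWLINE] + new


class VLContinuousTokenMixin:
    @classmethod
    def supports_multimodal(cls) -> bool:
        return True

    def merge_tokens(self, runtime: list[int], new: list[int]) -> list[int]:
        # Shared VL logic; boundary handling is resolved through the MRO.
        return self._merge_non_assistant_token_ids(runtime, new)


class QwenVLContinuousTokenBuilder(VLContinuousTokenMixin, QwenContinuousTokenBuilder):
    pass


class KimiVLContinuousTokenBuilder(VLContinuousTokenMixin, ContinuousTokenBuilder):
    pass


qwen_merged = QwenVLContinuousTokenBuilder().merge_tokens([1], [2])
kimi_merged = KimiVLContinuousTokenBuilder().merge_tokens([1], [2])
```

Because the mixin is listed first, its shared methods take precedence, while `_merge_non_assistant_token_ids` is found on whichever text-family builder follows it in the MRO.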

Direct concatenation boundary handling (no inter-turn separator). Validates EOS token with correct Unicode (U+FF5C fullwidth vertical line + U+2581 lower one-eighth block). V3/R1-specific tokens (User, Assistant, BOS) are optional lookups (tolerate absence on V2-Lite). GPU verified: DeepSeek-V2-Lite-Chat CT vs Legacy 3/3 MATCH. 115 local tests pass.
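The Unicode point above is easy to get wrong because the special characters look like ASCII `|` and `_`. The sketch below shows one way such a check might look; the EOS literal is my reading of the DeepSeek end-of-sentence string and should be confirmed against the actual tokenizer, and `describe_non_ascii` is a hypothetical helper.

```python
import unicodedata

# Assumed EOS string: U+FF5C around the name, U+2581 between the words.
EXPECTED_EOS = "<\uff5cend\u2581of\u2581sentence\uff5c>"
ASCII_LOOKALIKE = "<|end_of_sentence|>"


def describe_non_ascii(token: str) -> list[tuple[str, str]]:
    """List the code point and Unicode name of every non-ASCII character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in token if ord(c) > 127]


def is_expected_eos(token: str) -> bool:
    return token == EXPECTED_EOS


eos_chars = describe_non_ascii(EXPECTED_EOS)
lookalike_chars = describe_non_ascii(ASCII_LOOKALIKE)
```

A plain string comparison is enough for validation; the description helper is mainly useful for error messages when a lookalike slips in.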

DeepSeek-VL2 uses a custom DeepseekVLV2Processor that handles conversation formatting + image token expansion in a single __call__ (no apply_chat_template). The builder bypasses the standard VLContinuousTokenMixin render path and uses the processor directly with full render + prefix diff.

Key design decisions:
- Does NOT inherit VLContinuousTokenMixin (VL2 has no paired vision markers)
- Inherits DeepSeekContinuousTokenBuilder for boundary handling (direct concat)
- Uses processor __call__ for all rendering (tokenizer has no chat_template)
- Always uses full render + prefix diff (processor has stable prefixes)
- extract_vision_placeholders: contiguous <image> token run detection
- count_vision_tokens formula: 211 + 196*m*n + 14*m (verified empirically)
- Requires monkey-patch to bypass transformers version incompatibility

GPU verified: deepseek-ai/deepseek-vl2-tiny CT vs Legacy 3/3 MATCH. 117 local tests pass.
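The two VL2-specific helpers named above can be sketched as standalone functions. The formula is taken from the commit message; the image token id, the half-open `(start, end)` span convention, and the free-function form are assumptions made for this example.

```python
from typing import Sequence

IMAGE_TOKEN_ID = 100  # hypothetical id for the <image> token


def count_vision_tokens(spatial_crop_row: tuple[int, int]) -> int:
    # Formula from the commit message: 211 + 196*m*n + 14*m.
    m, n = spatial_crop_row
    return 211 + 196 * m * n + 14 * m


def extract_vision_placeholders(
    token_ids: Sequence[int], image_token_id: int = IMAGE_TOKEN_ID
) -> list[tuple[int, int]]:
    """Return half-open (start, end) spans of contiguous image tokens."""
    spans: list[tuple[int, int]] = []
    start = None
    for i, tok in enumerate(token_ids):
        if tok == image_token_id:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            spans.append((start, i))  # the run ended on the previous token
            start = None
    if start is not None:
        spans.append((start, len(token_ids)))  # run reaches the end
    return spans


single_crop = count_vision_tokens((1, 1))   # 211 + 196 + 14
wide_crop = count_vision_tokens((2, 1))     # 211 + 392 + 28
spans = extract_vision_placeholders([1, 100, 100, 2, 100])
```

One limitation of pure run detection: two images rendered back to back with no separating token would be reported as a single span, so the counts from `count_vision_tokens` would be needed to split them.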

gxlvera · 2026-06-24T17:39:53Z

+        """
+        return False
+
+    def count_vision_tokens(self, image_grid_thw_row: tuple[int, int, int]) -> int:


is this function called anywhere?

gxlvera · 2026-06-24T17:40:09Z

+            "Override supports_multimodal() and this method for VL models."
+        )
+
+    def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]:


same, is this function called anywhere?

gxlvera · 2026-06-24T18:07:25Z

+    def supports_multimodal(cls) -> bool:
+        return True
+
+    def count_vision_tokens(self, image_grid_thw_row: tuple[int, int, int]) -> int:


is it called anywhere?

gxlvera · 2026-06-24T18:08:13Z

+    def supports_multimodal(cls) -> bool:
+        return True
+
+    def count_vision_tokens(self, spatial_crop_row: tuple[int, int]) -> int:


called anywhere?

gxlvera · 2026-06-24T18:08:52Z

+        merge = self._spatial_merge_size
+        return t * (h // merge) * (w // merge)
+
+    def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]:


called anywhere?
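For reference, the count in the hunk above works out as follows on a sample grid. The unpacking of `image_grid_thw_row` into `t, h, w` and the default merge size of 2 are assumptions, since the hunk only shows the final expression.

```python
def count_vision_tokens(image_grid_thw_row: tuple[int, int, int], spatial_merge_size: int = 2) -> int:
    # Assumed unpacking: temporal, height, and width patch counts.
    t, h, w = image_grid_thw_row
    return t * (h // spatial_merge_size) * (w // spatial_merge_size)


# One frame with a 34 x 46 patch grid and 2 x 2 merging: 1 * 17 * 23 tokens.
tokens = count_vision_tokens((1, 34, 46))
```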

gxlvera · 2026-06-24T18:09:10Z

+        m, n = spatial_crop_row
+        return 211 + 196 * m * n + 14 * m
+
+    def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]:


called anywhere?

gxlvera · 2026-06-24T18:09:26Z

+            return self._render_tokens(messages, add_generation_prompt=True, tools=tools)
+        return self.render_tokens_with_mm(messages, images, add_generation_prompt=True, tools=tools)
+
+    def merge_tokens(


this function should override merge_non_assistant_tokens, and then should be called by ToolAgentLoop

gxlvera · 2026-06-24T18:23:48Z

+        all_images = self._extract_images_from_messages(updated_messages)
+        full_token_ids = self._render_via_processor(updated_messages, all_images, add_generation_prompt=True)
+
+        prefix_len = len(runtime_token_ids)


This slices full_token_ids by length without verifying that runtime_token_ids is actually a prefix, so prefix drift from the processor would be silently accepted, which is dangerous
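One way to guard the slice, sketched as a standalone helper: compare the prefix before trimming and fail loudly on drift. `slice_incremental` is a hypothetical name and the error message format is only a suggestion.

```python
from typing import Sequence


def slice_incremental(full_token_ids: Sequence[int], runtime_token_ids: Sequence[int]) -> list[int]:
    """Return the tokens after runtime_token_ids, verifying it is a true prefix."""
    prefix_len = len(runtime_token_ids)
    if list(full_token_ids[:prefix_len]) != list(runtime_token_ids):
        # Locate the first differing position to make the failure debuggable.
        mismatch = next(
            (i for i, (a, b) in enumerate(zip(full_token_ids, runtime_token_ids)) if a != b),
            min(len(full_token_ids), prefix_len),
        )
        raise ValueError(
            f"runtime_token_ids is not a prefix of the full render "
            f"(first mismatch at index {mismatch})"
        )
    return list(full_token_ids[prefix_len:])


delta = slice_incremental([1, 2, 3, 4], [1, 2])

try:
    slice_incremental([1, 9, 3], [1, 2])
    drift_detected = False
except ValueError:
    drift_detected = True
```

The comparison costs O(prefix length) per merge, which is small next to a processor call, and it turns silent prefix drift into an explicit error.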

gxlvera added 30 commits June 12, 2026 16:20

[ct] Add continuous token builders

9db6d26

[ct] Wire continuous token builder configuration

d2a9ba7

[ct] Warn when defaulting continuous token builder

902ef76

[ct] Rename updated continuous token messages

51f0721

[ct] Simplify continuous token builder wiring

8a90193

[ct] Add GPT-OSS and Gemma continuous token builders

8131d58

[ct] Use enum for continuous token model families

27cbcd7

[ct] Apply pre-commit updates

738e811

[ct] Integrate continuous token into agent loops

04e7ae3

[ct] gate agentloop with enable_continuous_token flag

dd21fd6

[ct] Prefix continuous token helper names

24e1f02

[ct] Rename initial continuous token helper

24ac1a4

[ct] Require response logprobs for assistant alignment

307a7b0

[ct] Carry tool call ids in parsed calls

acf5af8

[ct] Rename updated agent loop messages

12b557a

[ct] Add model-specific tool response builders

c72c41b

[ct] Restore legacy tool response name handling

f71b9fb

[ct] Resolve tool names from assistant calls

74d8042

[ct] Use simplified continuous token builder factory

c07b753

[ct] Apply agent loop pre-commit updates

998c515

[ct] Document continuous token comparison results

2abcf19

[ct] Add continuous token mock trajectories

d9dfbf6

[ct] Add continuous token comparison harness

ae3f4e7

[ct] Apply comparison test pre-commit updates

85e42e2

[ct] Add chat template checker

199dc3c

[ct] Apply chat template checker pre-commit updates

6d8b93d

[ct] Add response metadata alignment helper

918ee16

[ct] Update continuous token CPU tests

940b81a

[ct] Add continuous token CI coverage

edfd096

[ct] Keep agent loop comparison legacy-only

a1ea9c4

Duckycoders and others added 12 commits June 20, 2026 22:57

docs(ct): update RFC and comparison results with GLM-4.5V and Kimi-VL

0b5577c

[ct] Refine continuous token metadata and tool alignment

d213889

Merge branch 'gxl-ct-dev-mm' of https://github.com/gxlvera/verl into …

9777dc3

…gxl-ct-dev-mm # Conflicts: # verl/utils/continuous_token.py

[ct] Keep merge result token-only for VL

cb107ed

[ct] Validate VL synthetic prefix trimming

dd31a3d

gemini-code-assist Bot reviewed Jun 21, 2026

gxlvera changed the title ~~[rollout] Add Multimodal Continuou Token for AgentLoop~~ [rollout] feat: Add Multimodal Continuou Token for AgentLoop Jun 22, 2026

gxlvera changed the title ~~[rollout] feat: Add Multimodal Continuou Token for AgentLoop~~ [rollout] feat: Add Multimodal Continuous Token for AgentLoop Jun 22, 2026

Duckycoders added 3 commits June 22, 2026 21:32

gxlvera commented Jun 24, 2026

gxlvera added 2 commits June 24, 2026 12:10

fix(ct): route VL merges through non-assistant entrypoint

07b9777

fix(agent-loop): include tool images in VL messages

9cc514f

3 participants
