Skip to content

[rollout] feat: Add Multimodal Continuous Token for AgentLoop#6804

Draft
gxlvera wants to merge 62 commits into
verl-project:mainfrom
gxlvera:gxl-ct-dev-mm
Draft

[rollout] feat: Add Multimodal Continuous Token for AgentLoop#6804
gxlvera wants to merge 62 commits into
verl-project:mainfrom
gxlvera:gxl-ct-dev-mm

Conversation

@gxlvera

@gxlvera gxlvera commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

This PR depends on #6779 (Continuous Token base infrastructure). It extends ContinuousTokenBuilder to VLContinuouTokenBuilder and integrate with ToolAgentLoop.
What's added is that it now enables processor's encoding of text with multimodal info. Multimodal info is not concatenated by turns, instead AgentLoopWorker will postprocess it by using processor on full messages to get all multimodal info )(e.g. pixel values).
What's unchanged is each VL model family can still inherent its model-specific behavior (especiall merge_token_id behavior) on text.

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

gxlvera added 30 commits June 12, 2026 16:20
Duckycoders and others added 12 commits June 20, 2026 22:57
- KimiVLContinuousTokenBuilder: inherits ContinuousTokenBuilder (direct
  concatenation, no newline insertion). Uses <|media_start|>/<|media_end|>
  vision tokens. Handles merge_kernel_size as list [2,2].
- GLM4VContinuousTokenBuilder: inherits GLMContinuousTokenBuilder
  (observation/user boundary removal). Uses <|begin_of_image|>/<|end_of_image|>
  wrappers. Same pixel_values/image_grid_thw format as Qwen2.5-VL.

Both verified on H100 with CT vs Legacy comparison (3/3 MATCH each).
Models tested: moonshotai/Kimi-VL-A3B-Instruct, zai-org/GLM-4.5V.
- Fix _slice_mm_delta: preserve mm_processor_kwargs in return dict (all 4 builders)
- Fix Kimi-VL: pass full pixel_values when image_grid_thw unavailable (no silent drop)
- Remove unused pad token IDs (_image_pad_id, _media_pad_id, _image_token_id)
- Remove unused model_type parameter and attribute from QwenVL/MiMoVL
- Update tests to match new __init__ signatures
…er-then-slice

Instead of re-rendering ALL images through the processor and slicing out
the delta, now renders ONLY new images incrementally. pixel_values are
context-independent per image (confirmed byte-identical on H100), so this
produces the same result with less compute.

Before: O(all_images) processor calls per merge with new images
After:  O(new_images) processor calls per merge

Full render is still used for token_ids (chat template needs full context).
CT vs Legacy verified 4/4 models x 3 scenarios = 12/12 MATCH.
Add GLM4V and KimiVL to model support tables and comparison results.
Updated total: 5 models, 12/12 scenarios MATCH.
When merge_tokens uses full render for token_ids, the rendered sequence
already contains correct boundary tokens (e.g. \n after <|im_end|> for Qwen).
Using _merge_token_ids would double-insert the boundary.

Fix: use direct concatenation in the VL image path since full render
produces the ground-truth token sequence. The text-only path still uses
_merge_token_ids correctly (it computes tokens incrementally without
full render, so boundary insertion is needed there).

Found by comparing with slime's approach which revealed the asymmetry.
Replaces 2 processor calls with 1 using dummy+trim pattern:
- Synthetic prefix + new messages → single processor call
- Trim synthetic prefix tokens → incremental token_ids + pixel_values
- Apply _merge_token_ids for boundary handling

Qwen/MiMo/Kimi: fully single-call (1 processor call per merge)
GLM4V: 2 calls (full render for token_ids due to boundary deletion,
incremental for pixel_values) — GLM boundary handler deletes tokens
from runtime, which breaks the dummy+trim assumption for token_ids.

Verified 4/4 models PASS on H100 with CT vs Legacy comparison.
Replaces 2 processor calls with 1 using dummy+trim pattern:
- Synthetic prefix + new messages → single processor call
- Trim synthetic prefix tokens → incremental token_ids + pixel_values
- Apply _merge_token_ids for boundary handling

Qwen/MiMo/Kimi: fully single-call (1 processor call per merge)
GLM4V: 2 calls (full render for token_ids due to boundary deletion,
incremental for pixel_values) — GLM boundary handler deletes tokens
from runtime, which breaks the dummy+trim assumption for token_ids.

Verified 4/4 models PASS on H100 with CT vs Legacy comparison.
Merge学姐's refactoring from gxl-ct-dev:
- Rename tokenize_incremental_messages -> tokenize_non_assistant_incremental_messages
- Rename _merge_token_ids -> _merge_non_assistant_token_ids
- Remove ct_align_response_metadata (moved elsewhere)
- Add _DUMMY_TOOL_NAME constant
- Restructure MergeResult docstring

Resolved conflicts in continuous_token.py and test files.
113 tests pass.
…gxl-ct-dev-mm

# Conflicts:
#	verl/utils/continuous_token.py
@CLAassistant

CLAassistant commented Jun 21, 2026

Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
0 out of 2 committers have signed the CLA.

❌ Duckycoders
❌ gxlvera
You have signed the CLA already but the status is still pending? Let us recheck it.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request extends the Continuous Token (CT) framework to support vision-language (VL) models (such as Qwen2.5-VL, MiMo-VL, GLM-4V, and Kimi-VL) in multi-turn agentic rollouts. It introduces specialized multimodal token builders, integrates them into the agent loops to handle incremental image token expansion, and adds comprehensive unit and integration tests to verify token-level correctness against legacy re-encoding paths. The review feedback highlights opportunities to simplify the code by removing redundant hasattr checks for supports_multimodal in agent_loop.py, as this method is already defined on the base builder class.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

tokenizer_name_or_path=model_config.tokenizer_path,
)
builder_cls = get_continuous_token_builder_class(resolved_family)
needs_processor = hasattr(builder_cls, "supports_multimodal") and builder_cls.supports_multimodal()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The hasattr(builder_cls, "supports_multimodal") check is redundant because supports_multimodal is defined on the base class ContinuousTokenBuilder from which all registered builders inherit. This violates the repository's general rule against over-defensive programming.

Suggested change
needs_processor = hasattr(builder_cls, "supports_multimodal") and builder_cls.supports_multimodal()
needs_processor = builder_cls.supports_multimodal()
References
  1. Avoid over-defensive programming, such as wrapping inspect.signature in try-except blocks or using redundant getattr and None checks, unless there is a concrete, documented risk of runtime failure.

Comment on lines +376 to +380
if (
self.continuous_token_builder is not None
and hasattr(self.continuous_token_builder, "supports_multimodal")
and self.continuous_token_builder.supports_multimodal()
):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The hasattr(self.continuous_token_builder, "supports_multimodal") check is redundant because supports_multimodal is defined on the base class ContinuousTokenBuilder. This violates the repository's general rule against over-defensive programming.

            if (
                self.continuous_token_builder is not None
                and self.continuous_token_builder.supports_multimodal()
            ):
References
  1. Avoid over-defensive programming, such as wrapping inspect.signature in try-except blocks or using redundant getattr and None checks, unless there is a concrete, documented risk of runtime failure.

@gxlvera

gxlvera commented Jun 21, 2026

Copy link
Copy Markdown
Contributor Author

The class design will be refactored soon, we will add a VLContinuousTokenBuilder which will be extends by all modal family. So the duplicated logics in each model-specific VLCTB will be deleted.

@gxlvera gxlvera changed the title [rollout] Add Multimodal Continuou Token for AgentLoop [rollout] feat: Add Multimodal Continuou Token for AgentLoop Jun 22, 2026
@gxlvera gxlvera changed the title [rollout] feat: Add Multimodal Continuou Token for AgentLoop [rollout] feat: Add Multimodal Continuous Token for AgentLoop Jun 22, 2026
Per design review: replace 4 copy-pasted VL builders with a shared mixin
that combines via Python MRO with the appropriate text-family builder.

  VLContinuousTokenMixin
    - shared: build_initial_tokens, merge_tokens, render_tokens_with_mm,
              _render_incremental_with_mm, _extract_images_from_messages,
              extract_vision_placeholders, count_vision_tokens,
              _resolve_spatial_merge_size, supports_multimodal
    - subclass attrs: vision_start_token, vision_end_token, merge_size_attr
    - subclass hook: _prepare_mm_messages (MiMo overrides for content flatten)

  QwenVLContinuousTokenBuilder(VLMixin, QwenContinuousTokenBuilder)
  MiMoVLContinuousTokenBuilder(VLMixin, QwenContinuousTokenBuilder) + flatten
  GLM4VContinuousTokenBuilder(VLMixin, GLMContinuousTokenBuilder)
  KimiVLContinuousTokenBuilder(VLMixin, ContinuousTokenBuilder)

MRO dispatches _merge_non_assistant_token_ids to the correct text-family
boundary handler (Qwen newline patch, GLM observation/user trim, or base
direct concat).

1527 -> 1028 lines (~500 lines of duplication removed).

Verified:
  - 113 unit tests pass
  - GPU CT vs Legacy: 4/4 models MATCH (Qwen2.5-VL, MiMo-VL, GLM-4.5V, Kimi-VL)
  - 6/6 corner cases PASS (tool+image, 3 images, system+MM, image-only,
    text-then-image, 3-turn alternating)
  - vLLM dedup compatibility: 4/4 PASS (Qwen-style dedup; others passthrough)
Direct concatenation boundary handling (no inter-turn separator).
Validates EOS token with correct Unicode (U+FF5C fullwidth vertical line
+ U+2581 lower one-eighth block). V3/R1-specific tokens (User, Assistant,
BOS) are optional lookups (tolerate absence on V2-Lite).

GPU verified: DeepSeek-V2-Lite-Chat CT vs Legacy 3/3 MATCH.
115 local tests pass.
DeepSeek-VL2 uses a custom DeepseekVLV2Processor that handles conversation
formatting + image token expansion in a single __call__ (no apply_chat_template).
The builder bypasses the standard VLContinuousTokenMixin render path and uses
the processor directly with full render + prefix diff.

Key design decisions:
- Does NOT inherit VLContinuousTokenMixin (VL2 has no paired vision markers)
- Inherits DeepSeekContinuousTokenBuilder for boundary handling (direct concat)
- Uses processor __call__ for all rendering (tokenizer has no chat_template)
- Always uses full render + prefix diff (processor has stable prefixes)
- extract_vision_placeholders: contiguous <image> token run detection
- count_vision_tokens formula: 211 + 196*m*n + 14*m (verified empirically)
- Requires monkey-patch to bypass transformers version incompatibility

GPU verified: deepseek-ai/deepseek-vl2-tiny CT vs Legacy 3/3 MATCH.
117 local tests pass.
"""
return False

def count_vision_tokens(self, image_grid_thw_row: tuple[int, int, int]) -> int:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this function called anywhere?

"Override supports_multimodal() and this method for VL models."
)

def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same, is this function called anywhere?

def supports_multimodal(cls) -> bool:
return True

def count_vision_tokens(self, image_grid_thw_row: tuple[int, int, int]) -> int:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it called anywhere?

def supports_multimodal(cls) -> bool:
return True

def count_vision_tokens(self, spatial_crop_row: tuple[int, int]) -> int:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called anywhere?

merge = self._spatial_merge_size
return t * (h // merge) * (w // merge)

def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called anywhere?

m, n = spatial_crop_row
return 211 + 196 * m * n + 14 * m

def extract_vision_placeholders(self, token_ids: Sequence[int]) -> list[tuple[int, int]]:

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

called anywhere?

Comment thread verl/utils/continuous_token.py Outdated
return self._render_tokens(messages, add_generation_prompt=True, tools=tools)
return self.render_tokens_with_mm(messages, images, add_generation_prompt=True, tools=tools)

def merge_tokens(

@gxlvera gxlvera Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function should override merge_non_assistant_tokens, and then should be called by ToolAgentLoop

all_images = self._extract_images_from_messages(updated_messages)
full_token_ids = self._render_via_processor(updated_messages, all_images, add_generation_prompt=True)

prefix_len = len(runtime_token_ids)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This slices full_token_ids by length without verifying that runtime_token_ids is actually a prefix, so prefix drift from the processor would be silently accepted, which is dangerous

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants