fix(data): prompt-length filtering crashes on VLM dataset with apply_chat_template#2126
Open
Meihan-chen wants to merge 2 commits into
Open
fix(data): prompt-length filtering crashes on VLM dataset with apply_chat_template#2126Meihan-chen wants to merge 2 commits into
Meihan-chen wants to merge 2 commits into
Conversation
filter_long_prompt re-extracted vision info from sample.prompt via process_vision_info in the multimodal branch. When apply_chat_template is set, sample.prompt is the rendered *string* (not a conversation list), so process_vision_info -> qwen_vl_utils crashed with "TypeError: string indices must be integers, not 'str'". This made prompt-length filtering unusable for any VLM dataset: setting --rollout-max-context-len (which derives rollout_max_prompt_len) or --rollout-max-prompt-len / --eval-max-prompt-len activates the filter and hits the crash. Reuse the multimodal inputs already computed during dataset construction via build_processor_kwargs (matching the sglang_rollout path) instead of recomputing them from the string prompt. Add CPU unit tests covering the multimodal branch and a mixed text-only + multimodal dataset. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Meihan-chen <zr010426ztt@outlook.com>
7447cd0 to
a4560f6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Repro
Prompt-length filtering crashes for multimodal (VLM) datasets when
--apply-chat-templateis set, making the feature unusable. Reproduced end-to-end on the geo3k VLM example with--apply-chat-template+--rollout-max-prompt-len:Root cause
In
filter_long_prompt(slime/utils/data.py), the multimodal branch re-derived vision info fromsample.prompt:With
apply_chat_template=True,Sample.promptis the rendered string, butfilter_long_promptpassed it toprocess_vision_info, which expects a conversation list → crash. The vision inputs are already computed and stored inSample.multimodal_inputs, so this recomputation is both wrong and redundant.Why it's easy to hit
Setting
--rollout-max-context-len(which derivesrollout_max_prompt_len),--rollout-max-prompt-len, or--eval-max-prompt-lenactivates the filter on a VLM dataset and trips the crash.Fix
Reuse the multimodal inputs already stored on the sample, routed through the same
build_processor_kwargshelper the rollout path (sglang_rollout) uses, so the token length measured during filtering matches the real pipeline: