perf(rope): vectorize qwen3_vl_get_rope_index, drop per-token sync #184
Merged
Summary
`qwen3_vl_get_rope_index` is called once per training step on every multimodal model that uses Qwen3-VL-style mrope (`qwen3_vl`, `qwen3_vl_moe`, `qwen3_5`, `qwen3_5_moe`, plus `aero_realtime` indirectly). The previous implementation triggers O(B + N_vision) device syncs per call:

- `torch.repeat_interleave` with a GPU repeats tensor (waits on its sum)
- `input_ids.tolist()` per row
- `t.item()` / `h.item()` / `w.item()` per image / per video frame
- `llm_pos_ids_list[-1].max() + 1` per span
- `list.append` of Python ints, then `torch.tensor([...])` at the end

On large multimodal sequences these add up to tens of ms / step purely in host-side glue, and they show up in PyTorch traces as the largest single CPU op outside the LM forward.
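As a rough illustration of the per-item sync pattern listed above, the sketch below contrasts one `.item()` call per grid entry with a single device-to-host copy. The function names and the `t * h * w` grid-volume computation are hypothetical, chosen only for illustration; they are not taken from `rope.py`.

```python
import torch


def grid_volumes_per_item(grid_thw: torch.Tensor) -> list[int]:
    """Per-item pattern: three .item() calls for every image or video.

    On a CUDA tensor, each .item() blocks the host until the device has
    produced that value, so the cost grows with the number of vision items.
    """
    volumes = []
    for row in grid_thw:
        t, h, w = row[0].item(), row[1].item(), row[2].item()
        volumes.append(t * h * w)
    return volumes


def grid_volumes_single_pull(grid_thw: torch.Tensor) -> list[int]:
    """Bulk pattern: one device -> host copy, then host-only NumPy work."""
    grid_np = grid_thw.detach().cpu().numpy()  # the only transfer
    return grid_np.prod(axis=1).tolist()


# Hypothetical (t, h, w) grids for two vision items.
example_grid = torch.tensor([[1, 4, 6], [2, 2, 2]])
assert grid_volumes_per_item(example_grid) == grid_volumes_single_pull(example_grid)
```

Both functions return the same values; the difference is only in how many times the host has to wait on the device when `grid_thw` lives on a GPU.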
This PR rewrites the hot path so that the entire function performs a single up-front device → host pull for `input_ids` / `attention_mask` / `image_grid_thw` / `video_grid_thw`, then builds each row's `(3, n_valid)` position tensor with numpy slice-assigns (no per-token Python list growth), and copies it back with one H2D per row. The trivial no-vision branch is unchanged.

Correctness
Verified element-wise equal to the original implementation across:
- `attention_mask=None` with vision

`mrope_position_deltas` also matches in all cases.

Bench
A synthetic bench script (`scripts/bench_qwen3_vl_rope_index.py`, gitignored) measures end-to-end function time on an RTX A6000.

Video-heavy cases see the largest improvement because each video frame previously paid the per-span sync cost; the new path amortizes everything to a single CPU pass + numpy slice.
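The "single CPU pass + numpy slice" idea can be sketched as follows. This is a minimal, simplified mrope-like layout and not the code in this PR: the helper name `build_positions` and its arguments are hypothetical, and details of the real function (spatial merge size, multiple vision spans, video handling, masking) are deliberately left out.

```python
import numpy as np


def build_positions(n_text_before: int, t: int, h: int, w: int,
                    n_text_after: int) -> np.ndarray:
    """Fill a (3, n_tokens) position array for text + one vision span + text.

    Assumed layout: text tokens share one index on all three axes; vision
    tokens get (t, h, w) grid indices offset by the span's start position.
    """
    n_vision = t * h * w
    start, end = n_text_before, n_text_before + n_vision
    pos = np.empty((3, end + n_text_after), dtype=np.int64)

    # Leading text: one arange broadcast across the three rows.
    pos[:, :start] = np.arange(n_text_before)

    # Vision span: np.indices yields the t/h/w index grids, flattened to
    # (3, n_vision) and written with a single slice-assign.
    pos[:, start:end] = np.indices((t, h, w)).reshape(3, -1) + start

    # Trailing text resumes after the largest vision index. This comes from
    # plain Python ints, so no tensor .max() call is involved.
    next_pos = start + max(t, h, w)
    pos[:, end:] = next_pos + np.arange(n_text_after)
    return pos


positions = build_positions(n_text_before=2, t=1, h=2, w=2, n_text_after=2)
print(positions)
```

Because every span is written through a slice, the work per row is a handful of NumPy calls regardless of how many tokens the span covers, and the finished array can be converted with `torch.from_numpy` and moved to the device in one copy.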
Scope
- `qwen3_vl_get_rope_index` in `src/lmms_engine/models/common_ops/rope.py`.
- `qwen2_5_vl_rope_index` (different layout, has `second_per_grid_ts`) is not touched in this PR; can be done separately if needed.