perf(rope): vectorize qwen3_vl_get_rope_index, drop per-token sync #184

Merged: kcz358 merged 1 commit into main from perf/qwen3-vl-rope-index on Jun 9, 2026
kcz358 (Collaborator) commented Jun 9, 2026

Summary

qwen3_vl_get_rope_index is called once per training step on every multimodal model that uses Qwen3-VL-style mrope (qwen3_vl, qwen3_vl_moe, qwen3_5, qwen3_5_moe, plus aero_realtime indirectly). The previous implementation triggers O(B + N_vision) device syncs per call:

  • torch.repeat_interleave with a GPU repeats tensor (waits on its sum)
  • input_ids.tolist() per row
  • t.item() / h.item() / w.item() per image / per video frame
  • llm_pos_ids_list[-1].max() + 1 per span
  • per-token list.append of Python ints, then torch.tensor([...]) at the end

On large multimodal sequences these syncs add up to tens of milliseconds per step purely in host-side glue, and they show up in PyTorch traces as the largest single CPU op outside the LM forward.

This PR rewrites the hot path so that the entire function performs a single up-front device → host pull for input_ids / attention_mask / image_grid_thw / video_grid_thw, then builds each row's (3, n_valid) position tensor with numpy slice-assigns (no per-token Python list growth), and copies it back with one H2D per row. The trivial no-vision branch is unchanged.
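
The slice-assign construction can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `build_row_positions` and its `segments` argument are hypothetical, the vision grids are assumed to be already divided by the spatial merge size, and the position layout is an assumption based on the `llm_pos_ids_list[-1].max() + 1` bookkeeping listed above (text tokens share one index across the three axes, a vision span gets temporal / height / width grid indices offset by the running start).

```python
import numpy as np


def build_row_positions(segments):
    """Build a (3, n_tokens) position array for one row on the host.

    `segments` is a hypothetical pre-parsed description of the row:
    ("text", n_tokens) or ("vision", (t, h, w)), with the grid assumed
    to be already divided by the spatial merge size.
    """
    total = sum(
        spec if kind == "text" else spec[0] * spec[1] * spec[2]
        for kind, spec in segments
    )
    pos = np.empty((3, total), dtype=np.int64)  # allocated once, no list growth
    col = 0    # next free column in `pos`
    start = 0  # next position index (previous span's max + 1)
    for kind, spec in segments:
        if kind == "text":
            n = spec
            # One arange broadcast to all three axes.
            pos[:, col:col + n] = np.arange(start, start + n)
            start += n
        else:
            t, h, w = spec
            n = t * h * w
            # np.indices yields the (t, h, w) grid coordinates in C order.
            pos[:, col:col + n] = np.indices((t, h, w)).reshape(3, -1) + start
            start += max(t, h, w)
        col += n
    return pos


if __name__ == "__main__":
    row = [("text", 2), ("vision", (1, 2, 2)), ("text", 2)]
    print(build_row_positions(row))
```

Under these assumptions the row is filled with one slice assignment per span, so the amount of Python-level work scales with the number of spans rather than the number of tokens.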

Correctness

Verified element-wise equal to the original implementation across:

  • batch with mixed images + videos (B=2, S=4k, 8 img + 2 vid)
  • larger batch (B=4, S=8k, 16 img + 4 vid)
  • video-heavy (B=1, S=16k, 0 img + 8 vid with T=16)
  • no vision at all
  • attention_mask=None with vision
  • ragged sequences with padding

mrope_position_deltas also matches in all cases.
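
An element-wise equivalence check of this kind can be sketched as below. It is a toy harness under stated assumptions, not the test used for this PR: `positions_per_token`, `positions_vectorized`, `check_equivalence`, and the `segments` format are all hypothetical stand-ins for the original and rewritten code paths, and the inputs are small random span layouts rather than real batches.

```python
import numpy as np


def positions_per_token(segments):
    """Reference path: grow Python lists one token at a time."""
    rows = [[], [], []]
    start = 0
    for kind, spec in segments:
        if kind == "text":
            for i in range(spec):
                for r in rows:
                    r.append(start + i)
            start += spec
        else:
            t, h, w = spec
            for ti in range(t):
                for hi in range(h):
                    for wi in range(w):
                        rows[0].append(start + ti)
                        rows[1].append(start + hi)
                        rows[2].append(start + wi)
            start += max(t, h, w)
    return np.array(rows, dtype=np.int64).reshape(3, -1)


def positions_vectorized(segments):
    """Candidate path: one slice assignment per span."""
    total = sum(s if k == "text" else s[0] * s[1] * s[2] for k, s in segments)
    pos = np.empty((3, total), dtype=np.int64)
    col = start = 0
    for kind, spec in segments:
        if kind == "text":
            n = spec
            pos[:, col:col + n] = np.arange(start, start + n)
            start += n
        else:
            n = spec[0] * spec[1] * spec[2]
            pos[:, col:col + n] = np.indices(spec).reshape(3, -1) + start
            start += max(spec)
        col += n
    return pos


def check_equivalence(n_cases=200, seed=0):
    """Compare both paths on random span layouts; return cases checked."""
    rng = np.random.default_rng(seed)
    for _ in range(n_cases):
        segments = []
        for _ in range(int(rng.integers(1, 6))):
            if rng.random() < 0.5:
                segments.append(("text", int(rng.integers(1, 9))))
            else:
                grid = tuple(int(x) for x in rng.integers(1, 4, size=3))
                segments.append(("vision", grid))
        ref = positions_per_token(segments)
        new = positions_vectorized(segments)
        assert np.array_equal(ref, new), segments
    return n_cases


if __name__ == "__main__":
    print("cases checked:", check_equivalence())
```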

Bench

A synthetic bench script (scripts/bench_qwen3_vl_rope_index.py, gitignored) measures end-to-end function time on an RTX A6000:

| Setup | Original | New | Speedup |
| --- | --- | --- | --- |
| B=2, S=4k, 8 img + 2 vid | 7.9 ms | 1.3 ms | 6.3x |
| B=4, S=8k, 16 img + 4 vid | 30.0 ms | 3.7 ms | 7.9x |
| B=1, S=16k, 0 img + 8 vid (T=16) | 42.4 ms | 3.0 ms | 13.7x |

Video-heavy cases see the largest improvement because each video frame previously paid the per-span sync cost; the new path amortizes everything to a single CPU pass + numpy slice.
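
For readers who want to reproduce this kind of measurement, a minimal timing harness might look like the sketch below. It is not the gitignored bench script: `time_call`, `fill_per_token`, and `fill_slice_assign` are hypothetical names, and the workload is a CPU-only toy that contrasts per-token list growth with a single slice assignment. When timing code that launches GPU work, a device sync (for example `torch.cuda.synchronize()`) is needed before reading the clock, because CUDA kernels run asynchronously.

```python
import statistics
import time

import numpy as np


def time_call(fn, *args, warmup=3, iters=20):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(samples)


def fill_per_token(n):
    """Toy stand-in for per-token list.append followed by one conversion."""
    out = []
    for i in range(n):
        out.append(i)
    return np.array(out, dtype=np.int64)


def fill_slice_assign(n):
    """Toy stand-in for a preallocated buffer filled by slice assignment."""
    out = np.empty(n, dtype=np.int64)
    out[:] = np.arange(n)
    return out


if __name__ == "__main__":
    n = 16_000
    print(f"per-token:    {time_call(fill_per_token, n):.3f} ms")
    print(f"slice-assign: {time_call(fill_slice_assign, n):.3f} ms")
```

The median is used rather than the mean so that a single slow iteration (for example, an allocator warm-up) does not dominate the reported number.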

Scope

  • Only touches qwen3_vl_get_rope_index in src/lmms_engine/models/common_ops/rope.py.
  • Signature and return values unchanged — drop-in replacement for all callers.
  • qwen2_5_vl_rope_index (different layout, has second_per_grid_ts) is not touched in this PR; can be done separately if needed.

@kcz358 kcz358 merged commit e7cdc94 into main Jun 9, 2026
3 checks passed
@kcz358 kcz358 deleted the perf/qwen3-vl-rope-index branch June 9, 2026 06:10