perf(rope): vectorize qwen3_vl_get_rope_index, drop per-token sync by kcz358 · Pull Request #184 · EvolvingLMMs-Lab/lmms-engine

kcz358 · 2026-06-09T06:09:10Z

Summary

qwen3_vl_get_rope_index is called once per training step by every multimodal model that uses Qwen3-VL-style mrope (qwen3_vl, qwen3_vl_moe, qwen3_5, qwen3_5_moe, plus aero_realtime indirectly). The previous implementation triggers O(B + N_vision) device syncs per call, where B is the batch size and N_vision counts images and video frames:

torch.repeat_interleave with a GPU repeats tensor (waits on its sum)
input_ids.tolist() per row
t.item() / h.item() / w.item() per image / per video frame
llm_pos_ids_list[-1].max() + 1 per span
per-token list.append of Python ints, then torch.tensor([...]) at the end

On large multimodal sequences these add up to tens of ms / step purely in host-side glue, and they show up in PyTorch traces as the largest single CPU op outside the LM forward.

This PR rewrites the hot path so that the function performs a single up-front device → host pull of input_ids, attention_mask, image_grid_thw and video_grid_thw. It then builds each row's (3, n_valid) position tensor with numpy slice-assigns (no per-token Python list growth) and copies it back with one host-to-device (H2D) copy per row. The trivial no-vision branch is unchanged.
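To illustrate the slice-assign idea, here is a minimal numpy-only sketch, not the code in this PR. The helper name build_row_positions, the segments input format and the spatial_merge_size default are hypothetical and chosen for illustration; the sketch assumes inputs have already been pulled to the host and that each vision span is described by a (t, h, w) grid.

```python
import numpy as np

def build_row_positions(segments, spatial_merge_size=2):
    """Build a (3, n_tokens) mrope position array for one row.

    segments: list of ("text", n_tokens) or ("vision", (t, h, w)).
    This input format is an assumption made for the sketch.
    """
    # First pass: count tokens so the output is allocated exactly once.
    total = 0
    for kind, spec in segments:
        if kind == "text":
            total += spec
        else:
            t, h, w = spec
            total += t * (h // spatial_merge_size) * (w // spatial_merge_size)

    pos = np.empty((3, total), dtype=np.int64)
    cursor = 0    # write offset into pos
    next_pos = 0  # next free position id (running max + 1)
    for kind, spec in segments:
        if kind == "text":
            n = spec
            # Text tokens share the same index on all three axes.
            pos[:, cursor:cursor + n] = np.arange(next_pos, next_pos + n)
            next_pos += n
        else:
            t, h, w = spec
            gh, gw = h // spatial_merge_size, w // spatial_merge_size
            n = t * gh * gw
            # Rows are the (t, h, w) indices of every grid cell, t-major.
            pos[:, cursor:cursor + n] = (
                np.indices((t, gh, gw)).reshape(3, -1) + next_pos
            )
            # The next span starts after the largest index used here.
            next_pos += max(t, gh, gw)
        cursor += n

    # Delta between the last position id + 1 and the sequence length.
    delta = int(pos.max()) + 1 - total if total else 0
    return pos, delta
```

In this sketch the whole row is written with a handful of array assignments, so the amount of Python-level work scales with the number of spans rather than the number of tokens. A real implementation would then move the finished array to the device in one copy.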

Correctness

Verified element-wise equal to the original implementation across:

batch with mixed images + videos (B=2, S=4k, 8 img + 2 vid)
larger batch (B=4, S=8k, 16 img + 4 vid)
video-heavy (B=1, S=16k, 0 img + 8 vid with T=16)
no vision at all
attention_mask=None with vision
ragged sequences with padding

mrope_position_deltas also matches in all cases.
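An element-wise check of this kind can be sketched as follows. This is an illustrative harness under stated assumptions, not the test used for this PR: vision_span_loop, vision_span_vectorized and check_equivalence are hypothetical names, and the sketch compares only a single vision span built with per-token appends against the same span built with numpy.

```python
import numpy as np

def vision_span_loop(t, gh, gw, start):
    """Reference: per-token Python appends, the slow style."""
    rows = [[], [], []]
    for ti in range(t):
        for hi in range(gh):
            for wi in range(gw):
                rows[0].append(start + ti)
                rows[1].append(start + hi)
                rows[2].append(start + wi)
    return np.array(rows, dtype=np.int64).reshape(3, -1)

def vision_span_vectorized(t, gh, gw, start):
    """Same span built in one shot from the grid indices."""
    return np.indices((t, gh, gw)).reshape(3, -1).astype(np.int64) + start

def check_equivalence(n_cases=50, seed=0):
    """Return True if both builders agree on random grid shapes."""
    rng = np.random.default_rng(seed)
    for _ in range(n_cases):
        t, gh, gw = (int(v) for v in rng.integers(1, 6, size=3))
        start = int(rng.integers(0, 100))
        a = vision_span_loop(t, gh, gw, start)
        b = vision_span_vectorized(t, gh, gw, start)
        if not np.array_equal(a, b):
            return False
    return True
```

Comparing with np.array_equal (or torch.equal for tensors) rather than a tolerance is appropriate here because position ids are integers, so any mismatch is a real bug.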

Bench

A synthetic bench script (scripts/bench_qwen3_vl_rope_index.py, gitignored) measures end-to-end function time on an RTX A6000:

Setup	Original	New	Speedup
B=2, S=4k, 8 img + 2 vid	7.9 ms	1.3 ms	6.3x
B=4, S=8k, 16 img + 4 vid	30.0 ms	3.7 ms	7.9x
B=1, S=16k, 0 img + 8 vid (T=16)	42.4 ms	3.0 ms	13.7x

Video-heavy cases see the largest improvement because each video frame previously paid the per-span sync cost; the new path amortizes everything to a single CPU pass + numpy slice.

Scope

Only touches qwen3_vl_get_rope_index in src/lmms_engine/models/common_ops/rope.py.
Signature and return values are unchanged, so this is a drop-in replacement for all callers.
qwen2_5_vl_rope_index (different layout, has second_per_grid_ts) is not touched in this PR; can be done separately if needed.

Commit d49ca0c: perf(rope): vectorize qwen3_vl_get_rope_index, drop per-token sync

kcz358 merged commit e7cdc94 into main Jun 9, 2026
3 checks passed

kcz358 deleted the perf/qwen3-vl-rope-index branch June 9, 2026 06:10
