Qwen3 5 mtp final 2 by wanfengcxz · Pull Request #40 · DeepLink-org/lmdeploy

wanfengcxz · 2026-04-21T07:50:50Z

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

Pre-commit or other linting tools are used to fix the potential lint issues.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
The documentation has been modified accordingly, like docstring or example tutorials.

Fix 5 instances of mutable default arguments (=[] and ={}) in function signatures across 3 files. This is a latent Python bug where shared mutable state can leak across function calls. - lmdeploy/turbomind/deploy/module.py: apply_gs=[] → apply_gs=None (2 places) - lmdeploy/turbomind/deploy/config.py: config: dict = {} → None - lmdeploy/lite/quantization/calibration.py: kwargs={} → None (2 places)

* optimize prefill waiting time * fix comment * check prefill_interval

InternLM#4546) * Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. * add ProcessContextFilter in logger * fix reviewer comment * checkin .dockerignore * fix

* fix qwen35 moe dp * fix qwen35 dp * fix comment

* fix mtp experts * fix * fix set_step for ar spec when evict * fix evit and reprefill with bad token cache * fix mtp second step inputs * refactor ar spec seq and resp when canneled * add ut for spe seq * fix lint * resolve comment

* cancel request and block new inputs when sleep * fix

… parser (InternLM#4548) * add glm47 tool call parser * fix * add glm47 tool call parser * fix * fix comment

@staticmethod

* WIP: support mixed modality * fix mm processor kwargs, cleanup * qwen3.5 mixed modality * interns1 pro mixed modality, fix kwargs * fix generate, cleanup * minor * simplify * fix glm4.1v * compatible with legacy preprocess, give up re-writing all ... * fix bugs * minor * minor * minor * fix ut * fix qwen3vl moe * allow modality-specific kwargs, add ut * docs: add multi-modal input format reference (EN + ZH) Add multimodal_inputs.md covering all modalities (text, image, video, audio, time series, mixed) with OpenAI-style examples, local file / base64 usage via lmdeploy.vl.utils helpers, and mm_processor_kwargs / media_io_kwargs guidance. Link from vl_pipeline.md and index.rst. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: update video/audio URLs to official Qwen assets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * docs: fix model name Qwen3.5-VL -> Qwen3.5 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * fix: address PR InternLM#4531 review comments - glm4_1v: guard chat_template_kwargs against None before ** expansion - base: use local time_series_processor to avoid mutating self.processor - base: fix preprocess return type annotation list[dict] -> dict[str, Any] - base: lower valid size-override log from WARNING to INFO Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: rename interns1_pro_ts.py to interns1_pro_time_series.py * docs: remove audio sections (not yet supported) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: extract preprocess helpers from VisionModel into preprocess_utils.py Move get_mm_items_offset, get_override_size, get_expanded_input_ids, and get_expanded_mm_items out of VisionModel into a standalone module. Functions now receive explicit params (processor, mm_tokens) instead of relying on self, making them unit-testable without a full VisionModel instance. Also replace inline signature-detection logic with _is_new_preprocess_api() helper in multimodal.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * refactor: move MultimodalSpecialTokens to constants.py, promote API detector to staticmethod - Move MultimodalSpecialTokens from vl/model/base.py to vl/constants.py alongside Modality; fixes circular import and enables type annotations on mm_tokens params in preprocess_utils.py - Promote _is_new_preprocess_api to MultimodalProcessor.@staticmethod, encapsulating the vl_encoder None guard inside the method Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * minor * minor * bunch of fix * update glm4.1v * simplify map dict * Fix Qwen3VL tests for input prompt API * update * minor --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

* optimize get_sorted_idx in moe * add assert

… inference on Blackwell GPUs with memory copy optimizations (InternLM#4490) * feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE on Blackwell Use grouped batched GEMM on SM100, SM90 CUTLASS kernels split into a separate STATIC library for arch-specific builds, copy path workaround for Blackwell, and Llama MoE weight layout adjustments. Move tma.cu into libgemm2_sm90.a (its only callers are SM90 kernels), fixing undefined symbol make_2d_tma_desc from single-pass static link order between two archives. * fix: resolve undefined symbol and MoE dispatch crash CMakeLists.txt: Move tma.cu from gemm2 into GEMM2_KERNELS_SM90, so make_2d_tma_desc resides in the same archive (libgemm2_sm90.a) as its SM90 CUTLASS callers. This fixes the undefined symbol error caused by single-pass static-link ordering between libgemm2.a and libgemm2_sm90.a. LlamaLinear.cu: Guard invokeMoeDispatchScales with `if (U)`. The is_cublas_grouped path (SM100 bf16 MoE) enters the dispatch block without quantization, leaving the scales tensor U empty. Calling invokeMoeDispatchScales on an empty tensor crashes with std::out_of_range on B200. * fix: pass Adesc.ld/Ddesc.ld as ldb/ldc for cublas grouped batched GEMM --------- Co-authored-by: da.huo <da.huo@shopee.com>

* fix mp engine * fix name * fix ut * improve cancel * filter cancel in mp * clear prev chunk info * update * resolve comment

* remove barely used skills and checkin docker-build skill * remove resolve-review and submit-pr * fix * fix according to reviewer comment

…identity (InternLM#4523) * tell user-input session_id from the inner session_id * fix * log user's session_id * remove unnecessary log

* fix num_gpu_blocks for spec decoding * update cache engine * update config and message * fix ut * fix * fix

* support more message item types * make copilot happy

* fix draft tp by change dist ctx * resolve comment

* feat: add Anthropic-compatible serving endpoints Introduce Anthropic-style messages, count_tokens, and model-list endpoints with dedicated per-endpoint handlers so LMDeploy can interoperate with Anthropic-oriented clients while keeping OpenAI routes unchanged. Made-with: Cursor * update v1/messages * update user guide * fix according to review comments * integrate claude code * add claude code integration guide

InternLM#4511) * add explicit trust_remote_code controls * add trust remote code in pipeline * fix * fix * fix * fix * fix ut * add trust-remote-code in cli * fix * fix * fix * fix * fix * fix * fix * fix * pr_ete_test --trust-remote-code * use ArgumentHelper.trust_remote_code(parser) in serve.py --------- Co-authored-by: zhulin1 <zhulinJulia24@163.com>

* a tmp fix * zero out blocks

* fix gemma3 vl * fix ppl oom * interns2preview tool parser * fix accordintg to review comments * fix * fix ut

…LM#4576)

* yield error when prompt processing suffers exception * fix

* support interns2preview * support time series * fix time series * fix visual * fix: address InternS2 preview review comments * fix: align InternS1 Pro time-series handling * fix: restore InternS1 Pro processor dtype contract * fix: require dtype for Qwen3 VL input processor --------- Co-authored-by: RunningLeon <mnsheng@yeah.net> Co-authored-by: 吕晗 <lvhan@pjlab.org.cn>

…#4564)

…figs (InternLM#4572)

* bump version to v0.13.0 * update * fix as copilot suggests

- op_backend.py: MTP detection (is_multi_token_decoding), effective_is_decoding, actual_seq_lengths_q, vendor_device_init trigger - attention.py: add is_multi_token_decoding and actual_seq_lengths_q fields - pagedattention.py: MTP verify reuses paged_prefill_attention - config.py: SpecDecodeConfig.from_config add device_type param - config_builder.py: pass device_type to SpecDecodeConfig Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Keep only the generic draft-step and accepted-token metadata plumbing in lmdeploy so the dlinfer backend can drive Ascend multi-token state updates without broad runtime hooks in the core runtime. Made-with: Cursor

ZhijunLStudio and others added 27 commits April 22, 2026 12:15

fix: typo shoudd -> should and MODLES -> MODELS (InternLM#4543)

dff7c0a

fix: prevent prefill starvation under high decode load (InternLM#4532)

63d8bb3

* optimize prefill waiting time * fix comment * check prefill_interval

Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. (

7049f17

InternLM#4546) * Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. * add ProcessContextFilter in logger * fix reviewer comment * checkin .dockerignore * fix

Fix qwen35 dp (InternLM#4535)

964c878

* fix qwen35 moe dp * fix qwen35 dp * fix comment

Fix mtp for rl (InternLM#4520)

5a55716

* fix mtp experts * fix * fix set_step for ar spec when evict * fix evit and reprefill with bad token cache * fix mtp second step inputs * refactor ar spec seq and resp when canneled * add ut for spe seq * fix lint * resolve comment

cancel request and block new inputs when sleeping (InternLM#4541)

7ee73e7

* cancel request and block new inputs when sleep * fix

[refactor] [api_server] [2/N] improve tool parsers by abstracting xml…

69c7f2d

… parser (InternLM#4548) * add glm47 tool call parser * fix * add glm47 tool call parser * fix * fix comment

optimize get_sorted_idx in moe (InternLM#4529)

8e1445c

* optimize get_sorted_idx in moe * add assert

Fix mp engine (InternLM#4540)

47db6c2

* fix mp engine * fix name * fix ut * improve cancel * filter cancel in mp * clear prev chunk info * update * resolve comment

remove barely used skills and checkin docker-build skill (InternLM#4560)

6f5673f

* remove barely used skills and checkin docker-build skill * remove resolve-review and submit-pr * fix * fix according to reviewer comment

Map user-input session_id to internal session_id to maintain session …

0a537ef

…identity (InternLM#4523) * tell user-input session_id from the inner session_id * fix * log user's session_id * remove unnecessary log

Fix cache sizing and cache block layout edge cases (InternLM#4552)

bed8464

* fix num_gpu_blocks for spec decoding * update cache engine * update config and message * fix ut * fix * fix

support more message item types (InternLM#4501)

9df0eff

* support more message item types * make copilot happy

Fix qwen3.5-moe mtp with tp>1 (InternLM#4568)

cb4cc8a

* fix draft tp by change dist ctx * resolve comment

block_offsets padding 0 (InternLM#4569)

efe3b88

* a tmp fix * zero out blocks

hotfix: resolve test issues for v0.13.0 (InternLM#4571)

3b6c9ee

* fix gemma3 vl * fix ppl oom * interns2preview tool parser * fix accordintg to review comments * fix * fix ut

ResponseParser forget to strip <think> tag in non-stream mode (Intern…

6172fc2

…LM#4576)

yield error when prompt processing suffers exception (InternLM#4574)

0bf8a07

* yield error when prompt processing suffers exception * fix

Fix the reprefill of evicted seqs with invalid draft tokens (InternLM…

34a1ef6

…#4564)

disable quantization for the MTP fc projection to match FP8 model con…

f5a9860

…figs (InternLM#4572)

bump version to v0.13.0 (InternLM#4549)

e6948c1

* bump version to v0.13.0 * update * fix as copilot suggests

wanfengcxz force-pushed the qwen3_5_mtp_final_2 branch from 5516a2f to 5883957 Compare May 21, 2026 10:36

tangzhiyi11 and others added 2 commits May 21, 2026 10:38

[Ascend] Minimize lmdeploy MTP hooks for dlinfer

123d8c2

Keep only the generic draft-step and accepted-token metadata plumbing in lmdeploy so the dlinfer backend can drive Ascend multi-token state updates without broad runtime hooks in the core runtime. Made-with: Cursor

wanfengcxz added 6 commits May 21, 2026 10:39

[ascend] fix attn mask

6fd48a9

Refactor GDN and conv1d computation flow

2f694fc

Refactor GDN and conv1d computation flow

415c2c7

[ascend] remove unused code

a219b8a

[ascend] remote unused code

c4688c2

[ascend] add device_type

25eebc5

wanfengcxz force-pushed the qwen3_5_mtp_final_2 branch from 5883957 to 25eebc5 Compare May 25, 2026 07:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Qwen3 5 mtp final 2#40

Qwen3 5 mtp final 2#40
wanfengcxz wants to merge 35 commits into
DeepLink-org:mainfrom
wanfengcxz:qwen3_5_mtp_final_2

wanfengcxz commented Apr 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants

Conversation

wanfengcxz commented Apr 21, 2026

Motivation

Modification

BC-breaking (Optional)

Use cases (Optional)

Checklist

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

8 participants