Skip to content

Qwen3 5 mtp final 2#40

Open
wanfengcxz wants to merge 35 commits into
DeepLink-org:mainfrom
wanfengcxz:qwen3_5_mtp_final_2
Open

Qwen3 5 mtp final 2#40
wanfengcxz wants to merge 35 commits into
DeepLink-org:mainfrom
wanfengcxz:qwen3_5_mtp_final_2

Conversation

@wanfengcxz
Copy link
Copy Markdown
Collaborator

Thanks for your contribution and we appreciate it a lot. The following instructions would make your pull request more healthy and more easily receiving feedbacks. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

Please describe the motivation of this PR and the goal you want to achieve through this PR.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

ZhijunLStudio and others added 27 commits April 22, 2026 12:15
Fix 5 instances of mutable default arguments (=[] and ={}) in
function signatures across 3 files. This is a latent Python bug
where shared mutable state can leak across function calls.

- lmdeploy/turbomind/deploy/module.py: apply_gs=[] → apply_gs=None (2 places)
- lmdeploy/turbomind/deploy/config.py: config: dict = {} → None
- lmdeploy/lite/quantization/calibration.py: kwargs={} → None (2 places)
* optimize prefill waiting time

* fix comment

* check prefill_interval
InternLM#4546)

* Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py.

* add ProcessContextFilter in logger

* fix reviewer comment

* checkin .dockerignore

* fix
* fix qwen35 moe dp

* fix qwen35 dp

* fix comment
* fix mtp experts

* fix

* fix set_step for ar spec when evict

* fix evit and reprefill with bad token cache

* fix mtp second step inputs

* refactor ar spec seq and resp when canneled

* add ut for spe seq

* fix lint

* resolve comment
* cancel request and block new inputs when sleep

* fix
… parser (InternLM#4548)

* add glm47 tool call parser

* fix

* add glm47 tool call parser

* fix

* fix comment
* WIP: support mixed modality

* fix mm processor kwargs, cleanup

* qwen3.5 mixed modality

* interns1 pro mixed modality, fix kwargs

* fix generate, cleanup

* minor

* simplify

* fix glm4.1v

* compatible with legacy preprocess, give up re-writing all ...

* fix bugs

* minor

* minor

* minor

* fix ut

* fix qwen3vl moe

* allow modality-specific kwargs, add ut

* docs: add multi-modal input format reference (EN + ZH)

Add multimodal_inputs.md covering all modalities (text, image, video,
audio, time series, mixed) with OpenAI-style examples, local file /
base64 usage via lmdeploy.vl.utils helpers, and mm_processor_kwargs /
media_io_kwargs guidance. Link from vl_pipeline.md and index.rst.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: update video/audio URLs to official Qwen assets

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* docs: fix model name Qwen3.5-VL -> Qwen3.5

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: address PR InternLM#4531 review comments

- glm4_1v: guard chat_template_kwargs against None before ** expansion
- base: use local time_series_processor to avoid mutating self.processor
- base: fix preprocess return type annotation list[dict] -> dict[str, Any]
- base: lower valid size-override log from WARNING to INFO

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: rename interns1_pro_ts.py to interns1_pro_time_series.py

* docs: remove audio sections (not yet supported)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: extract preprocess helpers from VisionModel into preprocess_utils.py

Move get_mm_items_offset, get_override_size, get_expanded_input_ids, and
get_expanded_mm_items out of VisionModel into a standalone module. Functions
now receive explicit params (processor, mm_tokens) instead of relying on self,
making them unit-testable without a full VisionModel instance.

Also replace inline signature-detection logic with _is_new_preprocess_api()
helper in multimodal.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* refactor: move MultimodalSpecialTokens to constants.py, promote API detector to staticmethod

- Move MultimodalSpecialTokens from vl/model/base.py to vl/constants.py
  alongside Modality; fixes circular import and enables type annotations
  on mm_tokens params in preprocess_utils.py
- Promote _is_new_preprocess_api to MultimodalProcessor.@staticmethod,
  encapsulating the vl_encoder None guard inside the method

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* minor

* minor

* bunch of fix

* update glm4.1v

* simplify map dict

* Fix Qwen3VL tests for input prompt API

* update

* minor

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* optimize get_sorted_idx in moe

* add assert
… inference on Blackwell GPUs with memory copy optimizations (InternLM#4490)

* feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE on Blackwell

Use grouped batched GEMM on SM100, SM90 CUTLASS kernels split into a
separate STATIC library for arch-specific builds, copy path workaround
for Blackwell, and Llama MoE weight layout adjustments.

Move tma.cu into libgemm2_sm90.a (its only callers are SM90 kernels),
fixing undefined symbol make_2d_tma_desc from single-pass static link
order between two archives.

* fix: resolve undefined symbol and MoE dispatch crash

CMakeLists.txt:
  Move tma.cu from gemm2 into GEMM2_KERNELS_SM90, so make_2d_tma_desc
  resides in the same archive (libgemm2_sm90.a) as its SM90 CUTLASS
  callers. This fixes the undefined symbol error caused by single-pass
  static-link ordering between libgemm2.a and libgemm2_sm90.a.

LlamaLinear.cu:
  Guard invokeMoeDispatchScales with `if (U)`. The is_cublas_grouped
  path (SM100 bf16 MoE) enters the dispatch block without quantization,
  leaving the scales tensor U empty. Calling invokeMoeDispatchScales on
  an empty tensor crashes with std::out_of_range on B200.

* fix: pass Adesc.ld/Ddesc.ld as ldb/ldc for cublas grouped batched GEMM

---------

Co-authored-by: da.huo <da.huo@shopee.com>
* fix mp engine

* fix name

* fix ut

* improve cancel

* filter cancel in mp

* clear prev chunk info

* update

* resolve comment
* remove barely used skills and checkin docker-build skill

* remove resolve-review and submit-pr

* fix

* fix according to reviewer comment
…identity (InternLM#4523)

* tell user-input session_id from the inner session_id

* fix

* log user's session_id

* remove unnecessary log
* fix num_gpu_blocks for spec decoding

* update cache engine

* update config and message

* fix ut

* fix

* fix
* support more message item types

* make copilot happy
* fix draft tp by change dist ctx

* resolve comment
* feat: add Anthropic-compatible serving endpoints

Introduce Anthropic-style messages, count_tokens, and model-list endpoints with dedicated per-endpoint handlers so LMDeploy can interoperate with Anthropic-oriented clients while keeping OpenAI routes unchanged.

Made-with: Cursor

* update v1/messages

* update user guide

* fix according to review comments

* integrate claude code

* add claude code integration guide
InternLM#4511)

* add explicit trust_remote_code controls

* add trust remote code in pipeline

* fix

* fix

* fix

* fix

* fix ut

* add trust-remote-code in cli

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* fix

* pr_ete_test --trust-remote-code

* use ArgumentHelper.trust_remote_code(parser) in serve.py

---------

Co-authored-by: zhulin1 <zhulinJulia24@163.com>
* a tmp fix

* zero out blocks
* fix gemma3 vl

* fix ppl oom

* interns2preview tool parser

* fix accordintg to review comments

* fix

* fix ut
* yield error when prompt processing suffers exception

* fix
* support interns2preview

* support time series

* fix time series

* fix visual

* fix: address InternS2 preview review comments

* fix: align InternS1 Pro time-series handling

* fix: restore InternS1 Pro processor dtype contract

* fix: require dtype for Qwen3 VL input processor

---------

Co-authored-by: RunningLeon <mnsheng@yeah.net>
Co-authored-by: 吕晗 <lvhan@pjlab.org.cn>
* bump version to v0.13.0

* update

* fix as copilot suggests
@wanfengcxz wanfengcxz force-pushed the qwen3_5_mtp_final_2 branch from 5516a2f to 5883957 Compare May 21, 2026 10:36
tangzhiyi11 and others added 2 commits May 21, 2026 10:38
- op_backend.py: MTP detection (is_multi_token_decoding),
  effective_is_decoding, actual_seq_lengths_q, vendor_device_init trigger
- attention.py: add is_multi_token_decoding and actual_seq_lengths_q fields
- pagedattention.py: MTP verify reuses paged_prefill_attention
- config.py: SpecDecodeConfig.from_config add device_type param
- config_builder.py: pass device_type to SpecDecodeConfig

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep only the generic draft-step and accepted-token metadata plumbing in lmdeploy so the dlinfer backend can drive Ascend multi-token state updates without broad runtime hooks in the core runtime.

Made-with: Cursor
@wanfengcxz wanfengcxz force-pushed the qwen3_5_mtp_final_2 branch from 5883957 to 25eebc5 Compare May 25, 2026 07:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants