Qwen3 5 mtp final 2 #40
Open
wanfengcxz wants to merge 35 commits into
Fix 5 instances of mutable default arguments (=[] and ={}) in
function signatures across 3 files. This is a latent Python bug
where shared mutable state can leak across function calls.
- lmdeploy/turbomind/deploy/module.py: apply_gs=[] → apply_gs=None (2 places)
- lmdeploy/turbomind/deploy/config.py: config: dict = {} → None
- lmdeploy/lite/quantization/calibration.py: kwargs={} → None (2 places)
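A minimal sketch of the bug class this commit addresses and the `None`-sentinel fix it describes. The function names `append_buggy` and `append_fixed` are hypothetical and are not taken from the lmdeploy sources.

```python
def append_buggy(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # so every call that omits `bucket` shares the same object.
    bucket.append(item)
    return bucket


def append_fixed(item, bucket=None):
    # A None sentinel gives each call its own fresh list.
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket


buggy_first = append_buggy("a")
buggy_second = append_buggy("b")   # same list object as buggy_first
fixed_first = append_fixed("a")
fixed_second = append_fixed("b")   # independent list
```

With the buggy version, state from the first call is still present in the second; with the fixed version, each call starts from an empty list unless the caller passes one in.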
* optimize prefill waiting time * fix comment * check prefill_interval
InternLM#4546) * Add docker/Dockerfile_patch; minor tweaks in messages.py and setup.py. * add ProcessContextFilter in logger * fix reviewer comment * checkin .dockerignore * fix
* fix qwen35 moe dp * fix qwen35 dp * fix comment
* fix mtp experts * fix * fix set_step for ar spec when evict * fix evict and reprefill with bad token cache * fix mtp second step inputs * refactor ar spec seq and resp when canceled * add ut for spec seq * fix lint * resolve comment
* cancel request and block new inputs when sleep * fix
… parser (InternLM#4548) * add glm47 tool call parser * fix * add glm47 tool call parser * fix * fix comment
* WIP: support mixed modality
* fix mm processor kwargs, cleanup
* qwen3.5 mixed modality
* interns1 pro mixed modality, fix kwargs
* fix generate, cleanup
* minor
* simplify
* fix glm4.1v
* compatible with legacy preprocess, give up re-writing all ...
* fix bugs
* minor
* minor
* minor
* fix ut
* fix qwen3vl moe
* allow modality-specific kwargs, add ut
* docs: add multi-modal input format reference (EN + ZH)
  Add multimodal_inputs.md covering all modalities (text, image, video, audio, time series, mixed) with OpenAI-style examples, local file / base64 usage via lmdeploy.vl.utils helpers, and mm_processor_kwargs / media_io_kwargs guidance. Link from vl_pipeline.md and index.rst.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: update video/audio URLs to official Qwen assets
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* docs: fix model name Qwen3.5-VL -> Qwen3.5
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* fix: address PR InternLM#4531 review comments
  - glm4_1v: guard chat_template_kwargs against None before ** expansion
  - base: use local time_series_processor to avoid mutating self.processor
  - base: fix preprocess return type annotation list[dict] -> dict[str, Any]
  - base: lower valid size-override log from WARNING to INFO
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: rename interns1_pro_ts.py to interns1_pro_time_series.py
* docs: remove audio sections (not yet supported)
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: extract preprocess helpers from VisionModel into preprocess_utils.py
  Move get_mm_items_offset, get_override_size, get_expanded_input_ids, and get_expanded_mm_items out of VisionModel into a standalone module. Functions now receive explicit params (processor, mm_tokens) instead of relying on self, making them unit-testable without a full VisionModel instance. Also replace inline signature-detection logic with _is_new_preprocess_api() helper in multimodal.py.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor: move MultimodalSpecialTokens to constants.py, promote API detector to staticmethod
  - Move MultimodalSpecialTokens from vl/model/base.py to vl/constants.py alongside Modality; fixes circular import and enables type annotations on mm_tokens params in preprocess_utils.py
  - Promote _is_new_preprocess_api to MultimodalProcessor.@staticmethod, encapsulating the vl_encoder None guard inside the method
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* minor
* minor
* bunch of fix
* update glm4.1v
* simplify map dict
* Fix Qwen3VL tests for input prompt API
* update
* minor
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
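One of the review fixes in this commit guards `chat_template_kwargs` against `None` before `**` expansion. The sketch below illustrates that pattern only; `render` and `apply_template` are hypothetical stand-ins and do not reproduce the actual glm4_1v code.

```python
def render(prompt, add_generation_prompt=True, enable_thinking=False):
    """Hypothetical stand-in for a chat-template call."""
    return f"{prompt}|gen={add_generation_prompt}|think={enable_thinking}"


def apply_template(prompt, chat_template_kwargs=None):
    # `**None` raises TypeError, so fall back to an empty dict first.
    kwargs = chat_template_kwargs or {}
    return render(prompt, **kwargs)


default_out = apply_template("hi")
custom_out = apply_template("hi", {"enable_thinking": True})
```

Callers that omit the argument, or pass `None` explicitly, get the template defaults instead of an exception.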
* optimize get_sorted_idx in moe * add assert
… inference on Blackwell GPUs with memory copy optimizations (InternLM#4490)
* feat(turbomind): integrate cublasGemmGroupedBatchedEx for Qwen3.5 MoE on Blackwell
  Use grouped batched GEMM on SM100, SM90 CUTLASS kernels split into a separate STATIC library for arch-specific builds, copy path workaround for Blackwell, and Llama MoE weight layout adjustments. Move tma.cu into libgemm2_sm90.a (its only callers are SM90 kernels), fixing undefined symbol make_2d_tma_desc from single-pass static link order between two archives.
* fix: resolve undefined symbol and MoE dispatch crash
  CMakeLists.txt: Move tma.cu from gemm2 into GEMM2_KERNELS_SM90, so make_2d_tma_desc resides in the same archive (libgemm2_sm90.a) as its SM90 CUTLASS callers. This fixes the undefined symbol error caused by single-pass static-link ordering between libgemm2.a and libgemm2_sm90.a.
  LlamaLinear.cu: Guard invokeMoeDispatchScales with `if (U)`. The is_cublas_grouped path (SM100 bf16 MoE) enters the dispatch block without quantization, leaving the scales tensor U empty. Calling invokeMoeDispatchScales on an empty tensor crashes with std::out_of_range on B200.
* fix: pass Adesc.ld/Ddesc.ld as ldb/ldc for cublas grouped batched GEMM
---------
Co-authored-by: da.huo <da.huo@shopee.com>
* fix mp engine * fix name * fix ut * improve cancel * filter cancel in mp * clear prev chunk info * update * resolve comment
* remove barely used skills and checkin docker-build skill * remove resolve-review and submit-pr * fix * fix according to reviewer comment
…identity (InternLM#4523) * tell user-input session_id from the inner session_id * fix * log user's session_id * remove unnecessary log
* fix num_gpu_blocks for spec decoding * update cache engine * update config and message * fix ut * fix * fix
* support more message item types * make copilot happy
* fix draft tp by change dist ctx * resolve comment
* feat: add Anthropic-compatible serving endpoints
  Introduce Anthropic-style messages, count_tokens, and model-list endpoints with dedicated per-endpoint handlers so LMDeploy can interoperate with Anthropic-oriented clients while keeping OpenAI routes unchanged.
  Made-with: Cursor
* update v1/messages
* update user guide
* fix according to review comments
* integrate claude code
* add claude code integration guide
InternLM#4511) * add explicit trust_remote_code controls * add trust remote code in pipeline * fix * fix * fix * fix * fix ut * add trust-remote-code in cli * fix * fix * fix * fix * fix * fix * fix * fix * pr_ete_test --trust-remote-code * use ArgumentHelper.trust_remote_code(parser) in serve.py --------- Co-authored-by: zhulin1 <zhulinJulia24@163.com>
* a tmp fix * zero out blocks
* fix gemma3 vl * fix ppl oom * interns2preview tool parser * fix according to review comments * fix * fix ut
* yield error when prompt processing suffers exception * fix
* support interns2preview * support time series * fix time series * fix visual * fix: address InternS2 preview review comments * fix: align InternS1 Pro time-series handling * fix: restore InternS1 Pro processor dtype contract * fix: require dtype for Qwen3 VL input processor --------- Co-authored-by: RunningLeon <mnsheng@yeah.net> Co-authored-by: 吕晗 <lvhan@pjlab.org.cn>
* bump version to v0.13.0 * update * fix as copilot suggests
5516a2f to 5883957
- op_backend.py: MTP detection (is_multi_token_decoding), effective_is_decoding, actual_seq_lengths_q, vendor_device_init trigger
- attention.py: add is_multi_token_decoding and actual_seq_lengths_q fields
- pagedattention.py: MTP verify reuses paged_prefill_attention
- config.py: SpecDecodeConfig.from_config add device_type param
- config_builder.py: pass device_type to SpecDecodeConfig
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep only the generic draft-step and accepted-token metadata plumbing in lmdeploy so the dlinfer backend can drive Ascend multi-token state updates without broad runtime hooks in the core runtime.
Made-with: Cursor
5883957 to 25eebc5
Thanks for your contribution; we appreciate it a lot. Following the instructions below will make your pull request healthier and help it receive feedback more easily. If you do not understand some items, don't worry: just open the pull request and seek help from the maintainers.
Motivation
Please describe the motivation of this PR and the goal you want to achieve through this PR.
Modification
Please briefly describe what modification is made in this PR.
BC-breaking (Optional)
Does the modification introduce changes that break backward compatibility for downstream repositories?
If so, please describe how it breaks compatibility and how downstream projects should modify their code to stay compatible with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist