Add VLM components to _default_disabled_quantizer_cfg? #1396

@harmya

Description

Hello, I'd like to propose a small change to how multimodal models are handled during quantization. Please correct me if my understanding is incorrect.

Currently, when quantizing a multimodal model with any recipe, ModelOpt attaches observers to the vision tower and multimodal projector alongside the language model. For models like moonshotai/Kimi-K2.6, this breaks at export with `ValueError: tensor column shape must be divisible by the given group_size 32 but got 4304` (raised by `compressed_tensors/quantization/lifecycle/forward_helpers.py:138` from inside `modelopt.export_hf_checkpoint`): the vision layer's column dimension of 4304 is not a multiple of the group size 32 (4304 / 32 = 134.5), so group quantization cannot be applied to it. The error is raised only at export, after calibration has already completed, so an entire calibration run is wasted before the failure surfaces.

Every published VLM NVFP4 checkpoint I could find (e.g., nvidia/Kimi-K2.5-NVFP4, wafer-ai/Kimi-K2.6-NVFP4, RedHatAI/Kimi-K2.6-NVFP4) keeps the vision and projector layers in BF16. Intuitively this makes sense: vision encoders are tiny on large VLMs (typically <1% of parameters), so quantizing them yields negligible memory savings, while quantization noise introduced at the visual-feature stage compounds through every LLM layer downstream.

I think we can append ~4 patterns to `_default_disabled_quantizer_cfg`, the same universal-disable list that already covers `lm_head`, MoE routers, BatchNorm, and Mamba conv1d (see the sketch below). Looking at the code, the convention for that list seems to be "always wrong to quantize regardless of recipe," and VLM vision/projector components appear to meet that bar.
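
A minimal sketch of what the addition could look like. The glob patterns here are assumptions on my part, based on common Hugging Face VLM module naming; the exact set would need to be validated against the architectures ModelOpt supports:

```python
# Proposed additions to _default_disabled_quantizer_cfg (pattern names are
# illustrative; they follow common Hugging Face VLM module naming).
_default_disabled_quantizer_cfg = {
    # ... existing universal-disable entries (lm_head, MoE routers,
    # BatchNorm, Mamba conv1d) stay unchanged ...
    "*vision_tower*": {"enable": False},           # vision encoder (CLIP/SigLIP-style towers)
    "*vision_model*": {"enable": False},           # alternate encoder naming
    "*multi_modal_projector*": {"enable": False},  # vision -> LLM projector
    "*mm_projector*": {"enable": False},           # LLaVA-style projector naming
}
```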

I had to fix these issues locally to get a working Kimi-K2.6 quant serving inference via SGLang. Happy to provide details if needed and open a PR!
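
In the meantime, here is roughly the workaround I used; a sketch, not the proposed fix itself, assuming the standard `mtq.quantize` flow and the same illustrative glob patterns as above:

```python
import copy

import modelopt.torch.quantization as mtq

# Workaround sketch: disable vision/projector quantizers per-run by extending
# the recipe's quant_cfg before calibration. Pattern names are assumptions
# based on common HF VLM module naming.
cfg = copy.deepcopy(mtq.NVFP4_DEFAULT_CFG)
for pattern in ("*vision_tower*", "*vision_model*",
                "*multi_modal_projector*", "*mm_projector*"):
    cfg["quant_cfg"][pattern] = {"enable": False}

# calibrate_fn is the user's calibration forward loop.
model = mtq.quantize(model, cfg, forward_loop=calibrate_fn)
```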
