[feature] Support deepseekv4 online quantization #1027
1264347 to de75701
Two spots where the newly added logic can be merged/de-duplicated: 1.
Signed-off-by: root <root@smci355-ccs-aus-m06-05.cs-aus.dcgpu>
de75701 to 28a62ec
1. make_v4_quant_config now takes online_quant_config and forwards it; both DeepseekV4ForCausalLM and MTP pass it in. Previously the flag was silently dropped for V4.
2. Keep sensitive layers in BF16. Compressor / indexer.weights_proj stay raw; wo_a stays BF16 even under online quant (aiter has no FP8 grouped einsum). The new QuantType.No → skip guard makes online quant actually honor this.
3. Quantize the norm layer. q_norm's fused output feeds wq_b directly, so its activation scheme must follow wq_b's online target; otherwise the GEMM misinterprets the bits and produces garbage (GSM8K drops to 0). Added online_quantize_activation on RMSNorm and dropped the old guard that pinned wq_b to the source scheme.
4. Harden linear/moe online quant. Skip early if the layer is excluded (No) or already in the target format; this saves compute and avoids re-quantizing and corrupting the fp4 experts.
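For reviewers, the skip guard and the norm/consumer coupling described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `QuantType` members other than `No`, the `LayerQuantState` dataclass, and the helper names `should_online_quantize`, `effective_scheme`, and `norm_activation_scheme` are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum


class QuantType(Enum):
    # Hypothetical stand-in; only `No` is named in the PR description.
    No = "no"                      # unquantized (BF16) / excluded from quant
    PER_TOKEN_FP8 = "per_token_fp8"
    PER_BLOCK_FP4 = "per_block_fp4"


@dataclass
class LayerQuantState:
    name: str
    current: QuantType  # format the weights are stored in right now
    target: QuantType   # format online quant asks for (No = excluded)


def should_online_quantize(layer: LayerQuantState) -> bool:
    """Return True only if online quant has real work to do on this layer."""
    if layer.target is QuantType.No:
        return False  # excluded layer, e.g. wo_a stays BF16
    if layer.current is layer.target:
        return False  # already in the target format, e.g. fp4 experts
    return True


def effective_scheme(layer: LayerQuantState) -> QuantType:
    """Scheme the layer actually runs with once online quant has finished."""
    return layer.target if should_online_quantize(layer) else layer.current


def norm_activation_scheme(consumer: LayerQuantState) -> QuantType:
    """A fused norm must emit activations in the scheme its consumer GEMM uses."""
    return effective_scheme(consumer)


wo_a = LayerQuantState("wo_a", QuantType.No, QuantType.No)
wq_b = LayerQuantState("wq_b", QuantType.No, QuantType.PER_TOKEN_FP8)
experts = LayerQuantState("experts", QuantType.PER_BLOCK_FP4, QuantType.PER_BLOCK_FP4)

decisions = [should_online_quantize(layer) for layer in (wo_a, wq_b, experts)]
q_norm_scheme = norm_activation_scheme(wq_b)
```

In this sketch only wq_b is quantized, and q_norm picks up wq_b's online target rather than the source scheme, which is the coupling item 3 relies on.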
original
Attn-shared expert ptpc fp8