
[feature] Support deepseekv4 online quantization #1027

Open
haoyangli0109 wants to merge 2 commits into ROCm:main from haoyangli0109:lhy/online_quant_update
Contributor

haoyangli0109 commented Jun 2, 2026

1. make_v4_quant_config now takes online_quant_config and forwards it; both DeepseekV4ForCausalLM and MTP pass it in. Previously, the flag was silently dropped for V4.

2. Keep sensitive layers in BF16. Compressor and indexer.weights_proj stay raw (not quantized), and wo_a stays BF16 even under online quant (aiter has no FP8 grouped einsum). The new guard that skips any layer whose online quant type is QuantType.No makes online quant actually honor these exclusions.

3. Quantize the norm layer. q_norm's fused output feeds wq_b directly, so its activation scheme must follow wq_b's online target; otherwise the GEMM misreads the bits and produces garbage (GSM8K drops to 0). Added online_quantize_activation on RMSNorm and dropped the old guard that pinned wq_b to the source scheme.

4. Harden linear/MoE online quant. Return early if the layer is excluded (QuantType.No) or is already in the target format. This saves compute and avoids re-quantizing, and thereby corrupting, the FP4 experts.
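The failure mode in item 3 can be illustrated with a toy example. This is a minimal sketch, not the aiter or ATOM code path: it uses a signed 8-bit integer grid as a stand-in for FP8, and the function names are hypothetical. It shows that when the producer quantizes activations with one scale per token but the consumer decodes them with a single tensor-wide scale, the decoded values are badly wrong, which is the kind of mismatch that makes a GEMM output garbage.

```python
def quantize_per_token(rows, qmax=127):
    """Symmetric quantization with one scale per token (row)."""
    q_rows, scales = [], []
    for row in rows:
        peak = max(abs(v) for v in row)
        scale = peak / qmax if peak > 0 else 1.0
        q_rows.append([round(v / scale) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Decode each row with the scale the consumer believes applies to it."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


def max_abs_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Two tokens with very different magnitudes.
activations = [[0.01, -0.02, 0.015], [8.0, -4.0, 2.0]]
q, per_token_scales = quantize_per_token(activations)

# Consumer uses the same scheme as the producer: small rounding error only.
matched = dequantize(q, per_token_scales)

# Consumer wrongly assumes a single per-tensor scale: the small-magnitude
# token is decoded with the wrong scale and comes out far too large.
per_tensor_scale = max(abs(v) for row in activations for v in row) / 127
mismatched = dequantize(q, [per_tensor_scale] * len(q))

matched_err = max_abs_error(activations, matched)
mismatched_err = max_abs_error(activations, mismatched)
print(f"matched scheme error:    {matched_err:.4f}")
print(f"mismatched scheme error: {mismatched_err:.4f}")
```

The same bits are reused in both decodes; only the assumed scale layout differs, which is why the norm's output scheme has to track the scheme of the layer that consumes it.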

Original (baseline, no --online_quant_config)

ATOM_USE_TRITON_MOE=1 \
python -m atom.entrypoints.openai_server \
  --model /shareddata/deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  -tp 4 \
  --port 5679 \
  --server-port 7779


lm_eval \
  --model local-completions \
  --model_args "model=/shareddata/deepseek-ai/DeepSeek-V4-Pro,base_url=http://localhost:7779/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9469|±  |0.0062|

Attention + shared expert, PTPC FP8 (online quant)

HIP_VISIBLE_DEVICES=4,5,6,7 ATOM_USE_TRITON_MOE=1 \
python -m atom.entrypoints.openai_server \
  --model /shareddata/deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --online_quant_config '{"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "*.expert.*"]}' \
  -tp 4 \
  --port 5678 \
  --server-port 7778

lm_eval \
  --model local-completions \
  --model_args "model=/shareddata/deepseek-ai/DeepSeek-V4-Pro,base_url=http://localhost:7778/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" \
  --tasks gsm8k \
  --num_fewshot 5 \
  --batch_size auto

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0060|
|     |       |strict-match    |     5|exact_match|↑  |0.9484|±  |0.0061|
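The --online_quant_config value used in the PTPC FP8 run is a JSON object passed as a single shell argument, which is easy to get wrong when quoting by hand. The sketch below builds the same value from a Python dict and shell-quotes it with the standard library. Only the keys and values shown in the command above are taken from the PR; the variable names are illustrative.

```python
import json
import shlex

# Same content as the --online_quant_config value in the run above.
online_quant_config = {
    "global_quant_config": "ptpc_fp8",
    "exclude_layer": ["lm_head", "*.expert.*"],
}

# json.dumps produces valid JSON; shlex.quote makes it safe as one shell word.
quoted_arg = shlex.quote(json.dumps(online_quant_config))
print(f"--online_quant_config {quoted_arg}")

# Round trip: what the shell would hand to the server parses back to the dict.
shell_words = shlex.split(quoted_arg)
round_tripped = json.loads(shell_words[0])
```

Generating the argument this way avoids mismatched quotes around the glob patterns in exclude_layer when the command is assembled by a launch script.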


haoyangli0109 force-pushed the lhy/online_quant_update branch 2 times, most recently from 1264347 to de75701, June 3, 2026 09:50
haoyangli0109 marked this pull request as ready for review June 3, 2026 09:50
Collaborator

valarLip commented Jun 4, 2026

Two spots where the newly added logic can be merged/de-duplicated:

1. make_v4_quant_config.overridden — fold the two wo_a branches into one

if wo_a_is_bf16 and ".wo_a" in layer_name:
    return no_spec
...
if use_online_quant and "attn.wo_a" in layer_name:
    return no_spec

attn.wo_a is the only wo_a in V4 (constructed as prefix=f"{p}.wo_a", probed via layers.0.attn.wo_a), so .wo_a and attn.wo_a match the same layer. The two branches collapse to:

# wo_a feeds a grouped-LoRA einsum that aiter only supports in BF16, so keep it
# BF16 whenever it isn't FP8-with-load-time-dequant:
#   - wo_a_is_bf16:    ckpt (V4-Flash-FP8) ships it BF16 already (no scale)
#   - use_online_quant: skip the dequant->requant round-trip; keep BF16
if ".wo_a" in layer_name and (wo_a_is_bf16 or use_online_quant):
    return no_spec

Behaviour is identical across all four (wo_a_is_bf16, use_online_quant) combinations, and it also removes the .wo_a vs attn.wo_a match-string inconsistency.
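The equivalence claim can be checked mechanically. The following is a small sketch under two assumptions: the elided code between the two original branches never returns for a wo_a layer, and NO_SPEC is a placeholder string standing in for the real no_spec object. The layer name layers.0.attn.wo_a comes from the comment above; the wq_b name is a hypothetical non-wo_a layer added for contrast.

```python
from itertools import product

NO_SPEC = "no_spec"  # placeholder for the real no_spec object


def original_branches(layer_name, wo_a_is_bf16, use_online_quant):
    """The two separate wo_a branches as they appear in the PR."""
    if wo_a_is_bf16 and ".wo_a" in layer_name:
        return NO_SPEC
    # Elided logic from the PR is assumed not to return for wo_a layers.
    if use_online_quant and "attn.wo_a" in layer_name:
        return NO_SPEC
    return None


def collapsed_branch(layer_name, wo_a_is_bf16, use_online_quant):
    """The single branch proposed in the review."""
    if ".wo_a" in layer_name and (wo_a_is_bf16 or use_online_quant):
        return NO_SPEC
    return None


layer_names = ["layers.0.attn.wo_a", "layers.0.attn.wq_b"]

# Collect every (layer, flags) combination where the two versions disagree.
mismatches = [
    (name, bf16, online)
    for name in layer_names
    for bf16, online in product([False, True], repeat=2)
    if original_branches(name, bf16, online) != collapsed_branch(name, bf16, online)
]
print(mismatches)
```

The check only holds while attn.wo_a remains the sole wo_a layer; a wo_a outside attn would match ".wo_a" but not "attn.wo_a" and the two versions would diverge under online quant.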

2. Extract the duplicated online-quant skip guard (linear / moe / rmsnorm)

The same two early-return guards now appear in three places — LinearBase.online_quantize_weight, FusedMoE._online_quant, and the new RMSNorm.online_quantize_activation:

if online_quant_type == QuantType.No:
    return
if (<current_type> == online_quant_type and <current_dtype> == online_quant_dtype):
    return

Suggest a shared helper so a future quant path only touches one place:

def should_skip_online_requant(cur_type, cur_dtype, online_cfg) -> bool:
    """Skip online re-quant when the layer is excluded (No) or already in target."""
    return (
        online_cfg.quant_type == QuantType.No
        or (cur_type == online_cfg.quant_type and cur_dtype == online_cfg.quant_dtype)
    )

Call sites become:

if should_skip_online_requant(<cur_type>, self.params_dtype, online_cfg):
    return

(<cur_type>: self.quant_type for linear/rmsnorm, self.layer_quant_config.quant_type for moe — all "the layer's current quant type".)
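For reference, here is a self-contained sketch of the proposed helper that can be run outside the repository. It makes several assumptions: QuantType below is a local stand-in enum rather than the real one, its PerToken member is hypothetical, OnlineQuantConfig is an illustrative dataclass exposing only the two fields the helper reads, and dtypes are represented as plain strings.

```python
from dataclasses import dataclass
from enum import Enum


class QuantType(Enum):
    """Stand-in for the real QuantType; only No comes from the PR."""
    No = 0
    PerToken = 1  # hypothetical member for illustration


@dataclass(frozen=True)
class OnlineQuantConfig:
    """Illustrative config carrying the online quant target."""
    quant_type: QuantType
    quant_dtype: str


def should_skip_online_requant(cur_type, cur_dtype, online_cfg) -> bool:
    """Skip online re-quant when the layer is excluded (No) or already in target."""
    return (
        online_cfg.quant_type == QuantType.No
        or (cur_type == online_cfg.quant_type and cur_dtype == online_cfg.quant_dtype)
    )


target = OnlineQuantConfig(QuantType.PerToken, "fp8")
excluded = OnlineQuantConfig(QuantType.No, "bf16")

# Excluded layer: always skipped, whatever its current state.
skip_excluded = should_skip_online_requant(QuantType.No, "bf16", excluded)
# Layer already in the target format: skipped to avoid a second quantization.
skip_already_done = should_skip_online_requant(QuantType.PerToken, "fp8", target)
# Unquantized layer with a real target: not skipped, so it gets quantized.
skip_needs_quant = should_skip_online_requant(QuantType.No, "bf16", target)
print(skip_excluded, skip_already_done, skip_needs_quant)
```

Keeping the predicate free of any layer state makes it usable from the linear, MoE, and RMSNorm call sites with only the "current type" argument differing.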

haoyangli0109 and others added 2 commits June 4, 2026 05:51
Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Signed-off-by: root <root@smci355-ccs-aus-m06-05.cs-aus.dcgpu>
haoyangli0109 force-pushed the lhy/online_quant_update branch from de75701 to 28a62ec, June 4, 2026 05:53