[feature] Support deepseekv4 online quantization by haoyangli0109 · Pull Request #1027 · ROCm/ATOM

haoyangli0109 · 2026-06-02T03:58:16Z

1. make_v4_quant_config now takes online_quant_config and forwards it; both DeepseekV4ForCausalLM and MTP pass it. Previously, the flag was silently dropped for V4.

2. Keep sensitive layers in BF16. Compressor and indexer.weights_proj stay raw; wo_a stays BF16 even under online quant (aiter has no FP8 grouped einsum). The new QuantType.No → skip guard makes online quant actually honor this.

3. Quantize the norm layer. q_norm's fused output feeds wq_b directly, so its activation scheme must follow wq_b's online target; otherwise the GEMM misreads the bits (garbage output, GSM8K → 0). Added online_quantize_activation on RMSNorm and dropped the old guard that pinned wq_b to the source scheme.

4. Harden linear/moe online quant. Skip early if the layer is excluded (No) or already in the target format; this saves compute and avoids re-quantizing and corrupting the fp4 experts.
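To make point 3 concrete, here is a toy sketch (not ATOM or aiter code) of what happens when the producer and the consumer of a quantized activation disagree on the scale layout. A plain integer grid stands in for FP8, and every function name below is hypothetical.

```python
# Toy illustration only: an integer grid stands in for FP8, and all names
# here are hypothetical, not ATOM/aiter APIs.

def quantize_per_token(rows, qmax=127):
    """Symmetric quantization with one scale per row (token)."""
    scales = [max(abs(v) for v in row) / qmax for row in rows]
    codes = [[round(v / s) for v in row] for row, s in zip(rows, scales)]
    return codes, scales

def dequantize(codes, scales):
    """Rebuild real values from codes using one scale per row."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]

def max_abs_err(a, b):
    """Largest element-wise absolute difference between two 2-D lists."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Two tokens with very different magnitudes.
acts = [[0.01, -0.02, 0.005], [8.0, -4.0, 2.0]]
codes, token_scales = quantize_per_token(acts)

# Consumer uses the same scheme as the producer: only rounding error remains.
matched_err = max_abs_err(acts, dequantize(codes, token_scales))

# Consumer assumes a single per-tensor scale instead: the small token's
# codes are rescaled by the large token's range and come out badly wrong.
tensor_scale = max(abs(v) for row in acts for v in row) / 127
mismatched_err = max_abs_err(acts, dequantize(codes, [tensor_scale] * len(codes)))

print(f"matched: {matched_err:.4f}, mismatched: {mismatched_err:.4f}")
```

The codes are identical in both cases; only the scale interpretation differs, which is why the norm's output scheme has to track whatever wq_b ends up using.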

Original (no --online_quant_config)

ATOM_USE_TRITON_MOE=1 \
python -m atom.entrypoints.openai_server \
  --model /shareddata/deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  -tp 4 \
  --port 5679 \
  --server-port 7779


lm_eval --model local-completions --model_args "model=/shareddata/deepseek-ai/DeepSeek-V4-Pro,base_url=http://localhost:7779/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5 --batch_size auto

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9462|±  |0.0062|
|     |       |strict-match    |     5|exact_match|↑  |0.9469|±  |0.0062|

Attention + shared expert, ptpc_fp8 online quant

HIP_VISIBLE_DEVICES=4,5,6,7 ATOM_USE_TRITON_MOE=1 \
python -m atom.entrypoints.openai_server \
  --model /shareddata/deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --online_quant_config '{"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "*.expert.*"]}' \
  -tp 4 \
  --port 5678 \
  --server-port 7778
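For readers unfamiliar with the config shape, the sketch below parses the same JSON string and tests a few layer names against the exclude patterns. It assumes shell-style glob matching via Python's fnmatch, which may not be how ATOM actually resolves exclude_layer, and the layer names other than lm_head are made up for illustration.

```python
import json
from fnmatch import fnmatchcase

# Same JSON string as passed to --online_quant_config above.
raw = '{"global_quant_config": "ptpc_fp8", "exclude_layer": ["lm_head", "*.expert.*"]}'
cfg = json.loads(raw)

def is_excluded(layer_name, patterns):
    """Assumption: patterns are shell-style globs matched on the full name."""
    return any(fnmatchcase(layer_name, p) for p in patterns)

# Hypothetical layer names, not taken from the real V4 module tree.
names = [
    "lm_head",
    "layers.0.ffn.expert.3.w1",
    "layers.0.ffn.shared_expert.w1",
    "layers.0.attn.wq_b",
]
excluded = {n: is_excluded(n, cfg["exclude_layer"]) for n in names}
for name, skip in excluded.items():
    target = "keep as loaded" if skip else cfg["global_quant_config"]
    print(f"{name}: {target}")
```

Under that glob assumption, only names containing the literal segment `.expert.` (plus lm_head) are excluded, so attention and shared-expert layers would receive the global ptpc_fp8 scheme.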

lm_eval --model local-completions --model_args "model=/shareddata/deepseek-ai/DeepSeek-V4-Pro,base_url=http://localhost:7778/v1/completions,tokenized_requests=False,tokenizer_backend=None,num_concurrent=32" --tasks gsm8k --num_fewshot 5 --batch_size auto

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9492|±  |0.0060|
|     |       |strict-match    |     5|exact_match|↑  |0.9484|±  |0.0061|
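A quick way to read the two tables side by side is to compare the score difference with the combined standard error. The sketch below does that with the reported numbers; it treats the two runs as independent, which is a simplification because both use the same GSM8K test set.

```python
import math

# (value, stderr) copied from the two lm_eval tables above.
baseline = {"flexible-extract": (0.9462, 0.0062), "strict-match": (0.9469, 0.0062)}
online = {"flexible-extract": (0.9492, 0.0060), "strict-match": (0.9484, 0.0061)}

def compare(base, new):
    """Return (delta, combined_stderr), assuming independent runs."""
    delta = new[0] - base[0]
    combined = math.hypot(base[1], new[1])
    return delta, combined

results = {name: compare(baseline[name], online[name]) for name in baseline}
for name, (delta, combined) in results.items():
    verdict = "within" if abs(delta) < combined else "outside"
    print(f"{name}: delta={delta:+.4f}, combined stderr={combined:.4f} ({verdict} one stderr)")
```

Both differences are smaller than one combined standard error, so on this benchmark the ptpc_fp8 run is statistically indistinguishable from the original.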

valarLip · 2026-06-04T03:13:34Z

Two spots where the newly added logic can be merged/de-duplicated:

1. `make_v4_quant_config.overridden` — fold the two `wo_a` branches into one

if wo_a_is_bf16 and ".wo_a" in layer_name:
    return no_spec
...
if use_online_quant and "attn.wo_a" in layer_name:
    return no_spec

attn.wo_a is the only wo_a in V4 (constructed as prefix=f"{p}.wo_a", probed via layers.0.attn.wo_a), so .wo_a and attn.wo_a match the same layer. The two branches collapse to:

# wo_a feeds a grouped-LoRA einsum that aiter only supports in BF16, so keep it
# BF16 whenever it isn't FP8-with-load-time-dequant:
#   - wo_a_is_bf16:    ckpt (V4-Flash-FP8) ships it BF16 already (no scale)
#   - use_online_quant: skip the dequant->requant round-trip; keep BF16
if ".wo_a" in layer_name and (wo_a_is_bf16 or use_online_quant):
    return no_spec

Behaviour is identical across all four (wo_a_is_bf16, use_online_quant) combinations, and it also removes the .wo_a vs attn.wo_a match-string inconsistency.
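The equivalence claim can be checked mechanically. The sketch below models both versions as small functions and enumerates the four flag combinations; it assumes that nothing in the elided code between the two original branches returns early for wo_a, and it uses string stand-ins for no_spec and the fall-through result.

```python
from itertools import product

NO_SPEC = "no_spec"        # stand-in for the real no_spec object
FALL_THROUGH = "default"   # stand-in for whatever the function returns otherwise

def original(layer_name, wo_a_is_bf16, use_online_quant):
    """The two separate branches, with the elided middle assumed inert."""
    if wo_a_is_bf16 and ".wo_a" in layer_name:
        return NO_SPEC
    if use_online_quant and "attn.wo_a" in layer_name:
        return NO_SPEC
    return FALL_THROUGH

def merged(layer_name, wo_a_is_bf16, use_online_quant):
    """The single collapsed branch proposed in the review."""
    if ".wo_a" in layer_name and (wo_a_is_bf16 or use_online_quant):
        return NO_SPEC
    return FALL_THROUGH

# layers.0.attn.wo_a is the probe name from the review; wq_b is a control.
layer_names = ["layers.0.attn.wo_a", "layers.0.attn.wq_b"]
mismatches = [
    (name, bf16, online)
    for name, bf16, online in product(layer_names, [False, True], [False, True])
    if original(name, bf16, online) != merged(name, bf16, online)
]
print("mismatches:", mismatches)
```

The two versions would only diverge for a layer whose name contains `.wo_a` but not `attn.wo_a`, which the review states does not exist in V4.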

2. Extract the duplicated online-quant skip guard (linear / moe / rmsnorm)

The same two early-return guards now appear in three places — LinearBase.online_quantize_weight, FusedMoE._online_quant, and the new RMSNorm.online_quantize_activation:

if online_quant_type == QuantType.No:
    return
if (<current_type> == online_quant_type and <current_dtype> == online_quant_dtype):
    return

Suggest a shared helper so a future quant path only touches one place:

def should_skip_online_requant(cur_type, cur_dtype, online_cfg) -> bool:
    """Skip online re-quant when the layer is excluded (No) or already in target."""
    return (
        online_cfg.quant_type == QuantType.No
        or (cur_type == online_cfg.quant_type and cur_dtype == online_cfg.quant_dtype)
    )

Call sites become:

if should_skip_online_requant(<cur_type>, self.params_dtype, online_cfg):
    return

(<cur_type>: self.quant_type for linear/rmsnorm, self.layer_quant_config.quant_type for moe — all "the layer's current quant type".)
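As a minimal runnable sketch of the suggested helper, the snippet below pairs it with stand-in types so the three interesting cases can be exercised. The QuantType enum here is a local stub (only `No` is named in this thread; `per_Token` is illustrative), OnlineQuantConfig is a hypothetical container, and dtypes are plain strings rather than torch dtypes.

```python
from dataclasses import dataclass
from enum import Enum

class QuantType(Enum):
    """Local stub; the real enum lives elsewhere."""
    No = 0
    per_Token = 1  # illustrative member, assumed for this example

@dataclass
class OnlineQuantConfig:
    """Hypothetical holder for the online quant target."""
    quant_type: QuantType
    quant_dtype: str  # string stand-in for a torch dtype

def should_skip_online_requant(cur_type, cur_dtype, online_cfg) -> bool:
    """Skip online re-quant when the layer is excluded (No) or already in target."""
    return (
        online_cfg.quant_type == QuantType.No
        or (cur_type == online_cfg.quant_type and cur_dtype == online_cfg.quant_dtype)
    )

fp8_target = OnlineQuantConfig(QuantType.per_Token, "fp8")
excluded = OnlineQuantConfig(QuantType.No, "bf16")

# Excluded layer: always skipped, regardless of current state.
skip_excluded = should_skip_online_requant(QuantType.per_Token, "fp8", excluded)
# Layer already in the target format: skipped to avoid a pointless round-trip.
skip_same = should_skip_online_requant(QuantType.per_Token, "fp8", fp8_target)
# Unquantized BF16 layer with an FP8 target: must be quantized.
skip_bf16 = should_skip_online_requant(QuantType.No, "bf16", fp8_target)

print(skip_excluded, skip_same, skip_bf16)
```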

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

Signed-off-by: root <root@smci355-ccs-aus-m06-05.cs-aus.dcgpu>

haoyangli0109 force-pushed the lhy/online_quant_update branch 2 times, most recently from 1264347 to de75701 Compare June 3, 2026 09:50

haoyangli0109 marked this pull request as ready for review June 3, 2026 09:50

haoyangli0109 and others added 2 commits June 4, 2026 05:51

WIP

e6fe0a8

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

fix comment

28a62ec

Signed-off-by: root <root@smci355-ccs-aus-m06-05.cs-aus.dcgpu>

haoyangli0109 force-pushed the lhy/online_quant_update branch from de75701 to 28a62ec Compare June 4, 2026 05:53

haoyangli0109 wants to merge 2 commits into ROCm:main from haoyangli0109:lhy/online_quant_update
