feat: add AICB workload generator with Qwen3/Qwen3.5 training mocks by yanzhenghao · Pull Request #289 · aliyun/SimAI

yanzhenghao · 2026-06-15T11:17:36Z

Summary

Flattens the AICB submodule into the main repo and adds Qwen3/Qwen3.5 training workload mocks. One clean commit on top of origin/master.

MockedQwen3.py -- Qwen3 dense training workloads (461 lines)

Supports all 6 Qwen3 dense model sizes: 0.6B, 1.7B, 4B, 8B, 14B, 32B.

Architectural correctness vs LLaMA/Megatron:

GQA: separate num_key_value_heads for K/V projections (Megatron hardcodes MHA)
head_dim from config: uses explicit head_dim=128 instead of hidden_size // num_attention_heads. Correctly handles expansion models where head_dim * num_heads > hidden_size (Qwen3-0.6B: 2.00x, Qwen3-4B: 1.60x, Qwen3-32B: 1.60x)
QK-Norm: RMSNorm on query and key per-head after projection, hardcoded always-on in Qwen3 architecture. Compute-only -- zero communication impact. Confirmed from transformers source (modeling_qwen3.py lines 248-249)
SwiGLU sizing fix: down-projection input is intermediate_size, not 2 * intermediate_size (unlike MegatronMlp which overcounts params by 2x)
Qwen3Embedding: no Megatron artifacts (removed 4x vocab multiplier and learned position_embedding; Qwen3 uses RoPE)
tie_word_embeddings: lm_head weight zeroed when tie_word_embeddings=true (0.6B, 1.7B, 4B). Communication unchanged.
MoE: reuses MOEMLP from MockedMegatron (128 experts, top-8, no shared experts)
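
The GQA and head_dim items above can be checked with a few lines of arithmetic. This is a minimal sketch, not the code in MockedQwen3.py: the `AttnConfig` class is hypothetical, and the hidden_size / head counts are assumed from the public Qwen3 configs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnConfig:
    """Hypothetical container for the attention fields discussed above."""
    hidden_size: int
    num_attention_heads: int
    num_key_value_heads: int
    head_dim: int = 128  # explicit, NOT hidden_size // num_attention_heads

    @property
    def q_dim(self) -> int:
        # Query projection output width.
        return self.num_attention_heads * self.head_dim

    @property
    def kv_dim(self) -> int:
        # Key (or value) projection output width under GQA.
        return self.num_key_value_heads * self.head_dim

    @property
    def expansion(self) -> float:
        # > 1.0 means head_dim * num_heads exceeds hidden_size.
        return self.q_dim / self.hidden_size

    def qkv_params(self) -> int:
        # Weights of the Q, K and V projections (no bias).
        return self.hidden_size * (self.q_dim + 2 * self.kv_dim)


# Assumed values, taken from the public Qwen3 model configs.
CONFIGS = {
    "Qwen3-0.6B": AttnConfig(1024, 16, 8),
    "Qwen3-4B": AttnConfig(2560, 32, 8),
    "Qwen3-32B": AttnConfig(5120, 64, 8),
}

for name, cfg in CONFIGS.items():
    print(f"{name}: Q_dim={cfg.q_dim} KV_dim={cfg.kv_dim} "
          f"expansion={cfg.expansion:.2f}x")
```

Deriving head_dim as hidden_size // num_attention_heads would give 64 for Qwen3-0.6B under these assumed values and halve the Q/K/V widths, which is why the explicit value matters.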

Reuses MegatronColumnLinear, MegatronRowLinear, MOEMLP from MockedMegatron.py -- zero duplication of TP communication primitives.
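
For the SwiGLU sizing fix, a rough parameter count looks like the sketch below. The function names are hypothetical and the fused gate+up layout is an assumption; only the down-projection input width (intermediate_size versus 2 * intermediate_size) comes from the description above.

```python
def swiglu_params(hidden_size: int, intermediate_size: int, tp: int = 1) -> int:
    """Per-rank SwiGLU MLP weights with the corrected down-projection."""
    # Gate and up projections, assumed fused into one column-parallel matmul.
    gate_up = hidden_size * (2 * intermediate_size)
    # Down projection consumes intermediate_size features, because
    # act(gate) * up collapses the two halves back into one.
    down = intermediate_size * hidden_size
    return (gate_up + down) // tp


def swiglu_params_overcounted(hidden_size: int, intermediate_size: int,
                              tp: int = 1) -> int:
    """Same count with the down-projection input taken as 2 * intermediate_size."""
    gate_up = hidden_size * (2 * intermediate_size)
    down = (2 * intermediate_size) * hidden_size  # doubled down projection
    return (gate_up + down) // tp


# Illustrative shapes only, not tied to a specific model size.
h, i = 1024, 3072
print(swiglu_params(h, i), swiglu_params_overcounted(h, i))
```

With these shapes the corrected count is 3 * h * i, while the overcounted variant is 4 * h * i.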

MockedQwen3_5.py -- Qwen3.5 dense/MoE training workloads (823 lines)

Supports Qwen3.5 dense (0.8B, 2B, 4B, 9B, 27B) and MoE (35B-A3B, 122B-A10B, 397B-A17B).

Hybrid architecture:

GatedDeltaNet linear attention on 75% of layers
Full attention on 25% of layers (3:1 interleaved pattern, full_attention_interval=4)
head_dim=256, partial_rotary_factor=0.25, attn_output_gate=true, MRoPE
MoE with shared experts (256-512 experts, top-8/10)
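
The 3:1 interleaving can be sketched as follows. The helper names are hypothetical, and the rule that every fourth layer (1-indexed) is the full-attention one is an assumption about how full_attention_interval=4 is interpreted, not a quote from MockedQwen3_5.py.

```python
from typing import List


def layer_types(num_layers: int, full_attention_interval: int = 4) -> List[str]:
    """Assign an attention type to each layer index (0-based)."""
    return [
        "full_attention"
        if (i + 1) % full_attention_interval == 0
        else "linear_attention"  # GatedDeltaNet layers
        for i in range(num_layers)
    ]


def rotary_dim(head_dim: int = 256, partial_rotary_factor: float = 0.25) -> int:
    """Number of per-head dimensions that receive rotary embeddings."""
    return int(head_dim * partial_rotary_factor)


types = layer_types(8)
print(types)
print("full-attention share:", types.count("full_attention") / len(types))
print("rotary dims per head:", rotary_dim())
```

For any layer count divisible by 4 this yields the 75% / 25% split listed above.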

Bug fixes (also benefit Megatron and DeepSeek)

MoE backward pass: MOEMLP.backward() called self.permutation() and self.unpermutation() but never passed their return values to workloads.extend(), so those communication items were dropped. As a result, all MoE models (Megatron, Qwen3, Qwen3.5, DeepSeek) reported only ~43-57% of the correct backward communication. Fixed in MockedMegatron.py (2 lines).
EP message sizing: MoE TP all-gather/reduce-scatter message sizes now divide by ep_size, fixing a conservative overestimate. Applied to both MockedMegatron.py and MockedDeepSeek.py.
SyntaxWarning: raw string prefix for \i/\d escape sequences in aicb/utils/utils.py docstring.
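
The MoE backward bug follows a common pattern: a helper returns a list of workload items and the caller discards it. The toy class below is a sketch of that pattern only. `ToyMoEMLP`, the item strings, and the ordering inside backward are all hypothetical and do not mirror the real MOEMLP.

```python
from typing import List


class ToyMoEMLP:
    """Hypothetical stand-in for a mocked MoE layer that emits workload items."""

    def permutation(self) -> List[str]:
        # Token dispatch to experts (all-to-all).
        return ["a2a_dispatch"]

    def unpermutation(self) -> List[str]:
        # Token combine from experts (all-to-all).
        return ["a2a_combine"]

    def backward_buggy(self) -> List[str]:
        workloads = ["expert_grad"]
        # Bug: the returned items are silently dropped.
        self.unpermutation()
        self.permutation()
        return workloads

    def backward_fixed(self) -> List[str]:
        workloads: List[str] = []
        workloads.extend(self.unpermutation())
        workloads.append("expert_grad")
        workloads.extend(self.permutation())
        return workloads


layer = ToyMoEMLP()
print(layer.backward_buggy())  # all-to-all items missing
print(layer.backward_fixed())  # all-to-all items present
```

A regression test in this style only needs to assert that the all-to-all items appear in the backward list, which is what the "MoE backward" test row describes.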

Supporting changes

aicb/utils/utils.py: added Qwen3/Qwen3.5 to --frame choices, get_qwen3_params() with --head_dim and --num_key_value_heads CLI args
aicb/workload_generator/generate_megatron_workload.py: Qwen3/Qwen3.5 dispatch in __main__
aicb/workload_generator/CLAUDE.md: comprehensive architecture docs, verified configs, design patterns
aicb/tuning/: scaler, variability, wrapper (previously missing from AICB)
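
A rough idea of how the new CLI arguments might be wired with argparse is shown below. Only the flag names (--frame, --head_dim, --num_key_value_heads) come from this PR. The helper name `add_qwen3_args`, the exact choice strings, and the defaults are assumptions for illustration and may differ from aicb/utils/utils.py.

```python
import argparse
from typing import List, Optional


def add_qwen3_args(parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    """Hypothetical helper: register the Qwen3-specific options."""
    parser.add_argument("--head_dim", type=int, default=128,
                        help="Per-head dimension, independent of hidden_size.")
    parser.add_argument("--num_key_value_heads", type=int, default=None,
                        help="Number of K/V heads for GQA.")
    return parser


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Workload generator sketch")
    # Assumed choice strings; the real list contains other frameworks too.
    parser.add_argument("--frame", choices=["Megatron", "Qwen3", "Qwen3.5"],
                        default="Megatron")
    return add_qwen3_args(parser)


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    return build_parser().parse_args(argv)


args = parse(["--frame", "Qwen3", "--num_key_value_heads", "8"])
print(args.frame, args.head_dim, args.num_key_value_heads)
```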

Tests: 73 total (58 new in `test_mocked_qwen3.py`), all green

| Test | Models | What it verifies |
| --- | --- | --- |
| AG/RS/A2A counts | 6 dense + 2 MoE | Matches per-layer formula exactly |
| Message sizes | 6 dense + 2 MoE | `2 x seq x batch x hidden` for ColumnLinear, correct A2A sizing |
| QK-Norm | 6 dense | L q_norm + L k_norm params, each head_dim=128, 0 comm items |
| tie_word_embeddings | 6 dense | lm_head=0 when tied, vocab*hidden/TP otherwise |
| Embedding params | 6 dense | ratio=1.0x (no Megatron 4x multiplier) |
| Head expansion | 6 dense | Q_dim = num_heads * head_dim, KV_dim = num_kv * head_dim |
| A2A symmetry | 2 MoE | fwd_A2A == bwd_A2A |
| MoE backward | 2 MoE | backward not empty (regression test for fix) |
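
Two of the checks in the table reduce to closed-form expressions. The sketch below restates them with hypothetical helper names. Reading the leading 2 in `2 x seq x batch x hidden` as bytes per 16-bit element is an assumption, as are the vocabulary and hidden sizes used in the usage lines.

```python
def column_linear_msg_bytes(seq: int, batch: int, hidden: int,
                            dtype_bytes: int = 2) -> int:
    """Message size for the ColumnLinear collective: 2 x seq x batch x hidden."""
    return dtype_bytes * seq * batch * hidden


def lm_head_params(vocab: int, hidden: int, tp: int, tied: bool) -> int:
    """lm_head weight count per TP rank; zero when embeddings are tied."""
    return 0 if tied else vocab * hidden // tp


# Assumed shapes for illustration only.
print(column_linear_msg_bytes(seq=4096, batch=1, hidden=1024))
print(lm_head_params(vocab=151936, hidden=1024, tp=1, tied=True))
print(lm_head_params(vocab=151936, hidden=4096, tp=8, tied=False))
```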

Full test suite: 79 server tests + 73 aicb tests + TypeScript type check -- all green, zero new warnings.

CLAassistant · 2026-06-15T11:17:58Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

yanzhenghao force-pushed the pr-aicb-qwen3 branch 12 times, most recently from 43aa2d0 to 964bb94 on June 15, 2026 14:16

feat: add AICB workload generator with Qwen3/Qwen3.5 training mocks

314e272

yanzhenghao force-pushed the pr-aicb-qwen3 branch from 964bb94 to 314e272 on June 15, 2026 14:46

yanzhenghao wants to merge 1 commit into aliyun:master from yanzhenghao:pr-aicb-qwen3