97 changes: 97 additions & 0 deletions docs/en/examples/gemma4.md
@@ -0,0 +1,97 @@
# Gemma4 Dense and MoE with GSM8K

This example is a small validation of model support for the Gemma4 text models.
It uses GSM8K because the goal is to verify the Megatron model path, the SGLang
rollout load path, loss masking, the backward pass, and live weight updates
without introducing task-specific runtime variables.

Larger task-specific recipes should be layered on after this validation passes.

## What to Run

Run the dense and MoE variants separately on one 8-GPU node:

| Model | Script | Megatron topology | SGLang topology |
| --- | --- | --- | --- |
| `google/gemma-4-31B-it` | `scripts/run-gemma4-31B-gsm8k.sh` | TP2 PP4 CP1 | TP8 |
| `google/gemma-4-26B-A4B-it` | `scripts/run-gemma4-26B-A4B-gsm8k.sh` | TP2 PP2 EP2 CP1 | TP8 |

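The scripts default to two rollouts with short responses. They are intended to
prove that the model can train end to end, not to report a meaningful GSM8K
score. A small default `--entropy-coef` keeps the optimizer path active even
when every response in the tiny sample receives zero reward, presumably because
the reward-driven term then contributes little or no gradient while the entropy
term still does.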

Use a fresh converted checkpoint directory for each model and topology. The
default paths include TP/PP/EP/CP because Megatron distributed checkpoints are
sharded by the conversion topology.
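
The naming convention can be sketched as a small helper. This is a minimal illustration, not part of the repository: `ckpt_dir` is a hypothetical function, and it only assumes the pattern visible in the default paths used later in this document (`_tp2_pp4_cp1_torch_dist` for the dense model, with an extra `_ep2` component for MoE).

```bash
# Hypothetical helper (not in the repository): build a checkpoint directory
# name that encodes the conversion topology, mirroring the default paths.
# Usage: ckpt_dir <hf_dir> <tp> <pp> <cp> [ep]
ckpt_dir() {
  local hf_dir="$1" tp="$2" pp="$3" cp="$4" ep="${5:-}"
  local name="${hf_dir}_tp${tp}_pp${pp}"
  # The EP component only appears for the MoE model.
  if [ -n "$ep" ]; then
    name="${name}_ep${ep}"
  fi
  printf '%s_cp%s_torch_dist\n' "$name" "$cp"
}

ckpt_dir /root/gemma-4-31B-it 2 4 1          # dense: TP2 PP4 CP1
ckpt_dir /root/gemma-4-26B-A4B-it 2 2 1 2    # MoE: TP2 PP2 EP2 CP1
```

Deriving the directory from the parallel sizes makes it harder to point a run at a checkpoint that was converted with a different topology.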

## Prepare Checkpoints and Data

```bash
cd /root
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e . --no-deps

hf download google/gemma-4-31B-it --local-dir /root/gemma-4-31B-it
hf download google/gemma-4-26B-A4B-it --local-dir /root/gemma-4-26B-A4B-it
hf download --repo-type dataset zhuzilin/gsm8k --local-dir /root/datasets/gsm8k
```

Convert the dense checkpoint:

```bash
cd /root/slime
source scripts/models/gemma4-31B.sh
PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
"${MODEL_ARGS[@]}" \
--hf-checkpoint /root/gemma-4-31B-it \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 4 \
--context-parallel-size 1 \
--save /root/gemma-4-31B-it_tp2_pp4_cp1_torch_dist
```

Convert the MoE checkpoint:

```bash
cd /root/slime
source scripts/models/gemma4-26B-A4B.sh
PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
"${MODEL_ARGS[@]}" \
--hf-checkpoint /root/gemma-4-26B-A4B-it \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 2 \
--expert-model-parallel-size 2 \
--context-parallel-size 1 \
--save /root/gemma-4-26B-A4B-it_tp2_pp2_ep2_cp1_torch_dist
```
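
Before launching a conversion, it can help to check that the parallel sizes fit the 8 GPUs. The sketch below is a simplified check, not Megatron's actual validation logic: `check_topology` is a hypothetical helper, and it assumes that TP × PP × CP must divide the GPU count and that EP must divide the resulting data-parallel size.

```bash
# Simplified sanity check (an assumption, not Megatron's exact validation):
# TP * PP * CP must divide the GPU count, and EP must divide the resulting
# data-parallel size.
# Usage: check_topology <num_gpus> <tp> <pp> <cp> [ep]
check_topology() {
  local gpus="$1" tp="$2" pp="$3" cp="$4" ep="${5:-1}"
  local model_parallel=$((tp * pp * cp))
  if [ $((gpus % model_parallel)) -ne 0 ]; then
    echo "TP*PP*CP=${model_parallel} does not divide ${gpus} GPUs" >&2
    return 1
  fi
  local dp=$((gpus / model_parallel))
  if [ $((dp % ep)) -ne 0 ]; then
    echo "EP=${ep} does not divide data-parallel size ${dp}" >&2
    return 1
  fi
  echo "ok: data-parallel size ${dp}"
}

check_topology 8 2 4 1      # dense: TP2 PP4 CP1
check_topology 8 2 2 1 2    # MoE: TP2 PP2 EP2 CP1
```

Under these assumptions the dense topology uses all 8 GPUs for model parallelism (data-parallel size 1), while the MoE topology leaves a data-parallel size of 2 that EP2 divides evenly.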

## Run Training

```bash
cd /root/slime
bash scripts/run-gemma4-31B-gsm8k.sh
bash scripts/run-gemma4-26B-A4B-gsm8k.sh
```

To log the validation runs to W&B:

```bash
USE_WANDB=1 WANDB_PROJECT=slime-gemma4-gsm8k bash scripts/run-gemma4-31B-gsm8k.sh
USE_WANDB=1 WANDB_PROJECT=slime-gemma4-gsm8k bash scripts/run-gemma4-26B-A4B-gsm8k.sh
```

## Expected Signal

A successful run should show:

- SGLang loading `Gemma4ForConditionalGeneration`.
- At least one completed rollout and train step.
- `train/loss`, `train/grad_norm`, and entropy metrics in stdout or W&B.
- Successful raw `update_weights` from Megatron to SGLang.
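
The signals listed above can be checked against a captured log. This is a rough sketch: `check_signals` is a hypothetical helper, and it assumes the marker strings appear verbatim in the log, which may not match the actual log format. The entropy metrics are left out because their exact names are not given here.

```bash
# Hypothetical helper: confirm that a captured log mentions each expected
# marker. Matching the strings verbatim with grep is an assumption about
# the log format.
# Usage: check_signals <log_file>
check_signals() {
  local log="$1" missing=0 marker
  for marker in \
    "Gemma4ForConditionalGeneration" \
    "train/loss" \
    "train/grad_norm" \
    "update_weights"; do
    if grep -qF -- "$marker" "$log"; then
      echo "found:   $marker"
    else
      echo "missing: $marker"
      missing=1
    fi
  done
  # Non-zero exit status when any marker is absent.
  return "$missing"
}
```

One way to use it is to capture a run with `bash scripts/run-gemma4-31B-gsm8k.sh 2>&1 | tee run.log` (the log file name is arbitrary) and then call `check_signals run.log`.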

For quality training, increase the rollout count, batch sizes, response length,
and evaluation interval, and set `ENTROPY_COEF=0`.
1 change: 1 addition & 0 deletions docs/en/index.rst
@@ -59,6 +59,7 @@ Start by Use Case
:caption: Dense

examples/qwen3-4B.md
examples/gemma4.md
examples/glm4-9B.md

.. toctree::
94 changes: 94 additions & 0 deletions docs/zh/examples/gemma4.md
@@ -0,0 +1,94 @@
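# GSM8K Example for Gemma4 Dense and MoE

This example validates model support for the Gemma4 text models in slime. It
uses GSM8K because the goal is to verify the Megatron model path, the SGLang
rollout load path, loss masking, backpropagation, and online weight updates
without introducing task-specific runtime variables.

Larger task-specific recipes should be added only after this validation passes.

## What to Run

Run the dense and MoE variants separately on a single 8-GPU node:

| Model | Script | Megatron topology | SGLang topology |
| --- | --- | --- | --- |
| `google/gemma-4-31B-it` | `scripts/run-gemma4-31B-gsm8k.sh` | TP2 PP4 CP1 | TP8 |
| `google/gemma-4-26B-A4B-it` | `scripts/run-gemma4-26B-A4B-gsm8k.sh` | TP2 PP2 EP2 CP1 | TP8 |

By default the scripts run only two rollouts with a short response length. They
are meant to show that the model can complete the training loop, not to report
a meaningful GSM8K score. The small default `--entropy-coef` ensures that the
optimizer path is still triggered when the small sample gets all-zero rewards.

Use a fresh converted checkpoint directory for each model and topology. The
default paths include TP/PP/EP/CP because Megatron distributed checkpoints are
sharded according to the conversion topology.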

## Prepare Checkpoints and Data

```bash
cd /root
git clone https://github.com/THUDM/slime.git
cd slime
pip install -e . --no-deps

hf download google/gemma-4-31B-it --local-dir /root/gemma-4-31B-it
hf download google/gemma-4-26B-A4B-it --local-dir /root/gemma-4-26B-A4B-it
hf download --repo-type dataset zhuzilin/gsm8k --local-dir /root/datasets/gsm8k
```

Convert the dense checkpoint:

```bash
cd /root/slime
source scripts/models/gemma4-31B.sh
PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
"${MODEL_ARGS[@]}" \
--hf-checkpoint /root/gemma-4-31B-it \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 4 \
--context-parallel-size 1 \
--save /root/gemma-4-31B-it_tp2_pp4_cp1_torch_dist
```

Convert the MoE checkpoint:

```bash
cd /root/slime
source scripts/models/gemma4-26B-A4B.sh
PYTHONPATH=/root/Megatron-LM torchrun --nproc-per-node 8 \
tools/convert_hf_to_torch_dist.py \
"${MODEL_ARGS[@]}" \
--hf-checkpoint /root/gemma-4-26B-A4B-it \
--tensor-model-parallel-size 2 \
--pipeline-model-parallel-size 2 \
--expert-model-parallel-size 2 \
--context-parallel-size 1 \
--save /root/gemma-4-26B-A4B-it_tp2_pp2_ep2_cp1_torch_dist
```

## Run Training

```bash
cd /root/slime
bash scripts/run-gemma4-31B-gsm8k.sh
bash scripts/run-gemma4-26B-A4B-gsm8k.sh
```

To log the runs to W&B:

```bash
USE_WANDB=1 WANDB_PROJECT=slime-gemma4-gsm8k bash scripts/run-gemma4-31B-gsm8k.sh
USE_WANDB=1 WANDB_PROJECT=slime-gemma4-gsm8k bash scripts/run-gemma4-26B-A4B-gsm8k.sh
```

## Expected Signal

A successful run should show:

- SGLang loading `Gemma4ForConditionalGeneration`.
- At least one completed rollout and train step.
- `train/loss`, `train/grad_norm`, and entropy metrics in stdout or W&B.
- A successful raw `update_weights` from Megatron to SGLang.

For quality training, increase the rollout count, batch size, response length,
and eval interval, and set `ENTROPY_COEF=0`.
1 change: 1 addition & 0 deletions docs/zh/index.rst
@@ -59,6 +59,7 @@ slime 的设计目标,是让这两大能力彼此强化,同时避免把系
:caption: Dense

examples/qwen3-4B.md
examples/gemma4.md
examples/glm4-9B.md

.. toctree::
19 changes: 19 additions & 0 deletions scripts/models/gemma4-12B.sh
@@ -0,0 +1,19 @@
MODEL_ARGS=(
--spec "slime_plugins.models.gemma4" "get_gemma4_spec"
--custom-model-provider-path "slime_plugins.models.gemma4_provider.model_provider"
--num-layers 48
--hidden-size 3840
--ffn-hidden-size 15360
--num-attention-heads 16
--group-query-attention
--num-query-groups 8
--kv-channels 256
--use-rotary-position-embeddings
--disable-bias-linear
--normalization "RMSNorm"
--norm-epsilon 1e-6
--rotary-base 10000
--rotary-percent 1.0
--vocab-size 262144
--qk-layernorm
)
28 changes: 28 additions & 0 deletions scripts/models/gemma4-26B-A4B.sh
@@ -0,0 +1,28 @@
MODEL_ARGS=(
--spec "slime_plugins.models.gemma4" "get_gemma4_spec"
--custom-model-provider-path "slime_plugins.models.gemma4_provider.model_provider"
--num-layers 30
--hidden-size 2816
--ffn-hidden-size 2112
--num-attention-heads 16
--group-query-attention
--num-query-groups 8
--kv-channels 256
--use-rotary-position-embeddings
--disable-bias-linear
--normalization "RMSNorm"
--norm-epsilon 1e-6
--rotary-base 10000
--rotary-percent 1.0
--vocab-size 262144
--qk-layernorm
--num-experts 128
--moe-ffn-hidden-size 704
--moe-router-topk 8
--moe-router-dtype fp32
--moe-router-score-function softmax
--moe-router-load-balancing-type none
--moe-aux-loss-coeff 0.0
--moe-token-dispatcher-type alltoall
--moe-grouped-gemm
)
19 changes: 19 additions & 0 deletions scripts/models/gemma4-31B.sh
@@ -0,0 +1,19 @@
MODEL_ARGS=(
--spec "slime_plugins.models.gemma4" "get_gemma4_spec"
--custom-model-provider-path "slime_plugins.models.gemma4_provider.model_provider"
--num-layers 60
--hidden-size 5376
--ffn-hidden-size 21504
--num-attention-heads 32
--group-query-attention
--num-query-groups 16
--kv-channels 256
--use-rotary-position-embeddings
--disable-bias-linear
--normalization "RMSNorm"
--norm-epsilon 1e-6
--rotary-base 10000
--rotary-percent 1.0
--vocab-size 262144
--qk-layernorm
)