feat(verda): B300 cluster support + reproduce pa_warm_start_sft_120b_1bmix_32k (256 TFLOP/s/GPU) #13

Open
Kyle1668 wants to merge 6 commits into main from kyle/verda-cluster-support

Summary

Stands up the fork's pipelines on the Verda-Test cluster (2×8 NVIDIA B300, x86_64, InfiniBand, CUDA 13, NeMo-container runtime) and reproduces pa_warm_start_sft_120b_1bmix_32k_v1 (warm-start SFT of Nemotron-3-Super-120B-A12B-Base on the 1B reasoning mix, seq 32768) on 8/16 GPUs — then sweeps it to the fastest config: 256 TFLOP/s/GPU (clears the 250 target).

Verda-parallel additions only; Isambard files are untouched.

What's included

Verda cluster support (29fa4e21)

  • Container env (pipeline_env_verda.sh) — enroot import of nvcr.io/nvidia/nemo:26.04.01, fork via PYTHONPATH=$REPO/src (no uv sync); pyxis is broken here so we enroot create+start.
  • Launcher + submit (pipeline_training_{launch,submit}_verda.*) — minimal NCCL (IB auto-detected by NCCL 2.29.7), GLOO/NCCL_SOCKET_IFNAME=eth0 (the cross-node bootstrap fix), MASTER_ADDR/PORT/SLURM_NODEID passed into the container, ISAMBARD_FP32_SSM_STATE=checkpoint preserved (32k Mamba NaN guard).
  • Checkpoint convert (pipeline_checkpoint_convert_verda.sh), base chat-init graft (scripts/init_super_base_chat_embeddings.py — copies the 1188 dead-norm embedding rows from Instruct, else Inf in bucket #0 at iter 2).
  • Faithful configs (configs/verda/..._16gpu.yaml, ..._8gpu.yaml) + full reproduction doc (docs/pa-warm-start-120b-verda.md) + CLAUDE.md Verda section.
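A minimal sketch of the base chat-init graft idea, assuming "dead-norm" means an embedding row whose L2 norm is (near) zero in Base. The function name, the `eps` threshold, and the list-of-lists representation are illustrative assumptions; none of it is taken from scripts/init_super_base_chat_embeddings.py, which works on real checkpoints.

```python
import math


def graft_dead_rows(base_rows, instruct_rows, eps=1e-12):
    """Overwrite near-zero-norm rows of base_rows with the matching instruct rows.

    Hypothetical helper: returns the indices of the rows that were grafted.
    """
    if len(base_rows) != len(instruct_rows):
        raise ValueError("embedding tables must have the same number of rows")
    grafted = []
    for idx, (base_row, instruct_row) in enumerate(zip(base_rows, instruct_rows)):
        norm = math.sqrt(sum(x * x for x in base_row))
        if norm <= eps:  # assumed definition of a "dead" row
            base_rows[idx] = list(instruct_row)
            grafted.append(idx)
    return grafted


# Toy 3-row table: row 1 is dead in Base and gets the Instruct values.
base = [[0.1, 0.2], [0.0, 0.0], [0.3, -0.1]]
instruct = [[0.9, 0.9], [0.5, -0.5], [0.7, 0.7]]
grafted = graft_dead_rows(base, instruct)
print(grafted)  # [1]
```

Only the dead rows change; live Base rows are left as they were, which is what keeps the rest of the Base checkpoint intact.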

Fastest config + DP logging (07a57555)

  • configs/verda/..._16gpu_fp8norc.yaml — the sweep winner (FP8 mixed + no-recompute + HybridEP, CP=8): 256 TFLOP/s/GPU, 41.1 s/iter, 0 NaN, fits 268 GB.
  • configs/verda/..._16gpu_hybridep.yaml — best byte-faithful BF16 (205 TFLOP/s/GPU) for the quantization-free loss comparison vs 6cfuh1ky.
  • data_parallel_size logged at startup and recorded to the W&B run config (training/config.py, training/state.py) — it was previously underivable from the logs.

Optimization sweep (15-iter random-init smokes, 2×8 B300, TP1·PP1·EP8·ETP1·CP8·DP2, seq 32768, GBS 64)

| Config | s/iter | TFLOP/s/GPU | peak GB | precision | recompute |
|---|---|---|---|---|---|
| fp8norc (chosen) | 41.1 | 256 | 204 | FP8 | none |
| fp8norc, TP=8/CP=1 | 41.1 | 256 | 203 | FP8 | none |
| fp8lessrc | 43.1 | 244 | 201 | FP8 | [shared_experts] |
| nvfp4norc | 43.3 | 243 | 210 | NVFP4 | none |
| nvfp4lessrc | 46.6 | 226 | 207 | NVFP4 | [shared_experts] |
| hybridep (best BF16) | 51.3 | 205 | 226 | BF16 | [moe, shared_experts] |
| baseline (all-to-all) | 55.4 | 180 | | BF16 | [moe, shared_experts] |
| optimizer-offload | 76.0 | 138 | 160 | FP8 | [moe, shared_experts] |

Findings: FP8→drop-recompute is the lever (backward 37→26 s); TP=8 ties CP=8 on B300 (only 8/88 layers are attention, experts fold to EP regardless); DP=4/CP=4 OOMs (doubles the dispatch transient past 268 GB); optimizer CPU offload is −38% at PP=1/DP=2.
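The headline gains can be rechecked from the TFLOP/s/GPU column of the sweep table. The sketch below is only arithmetic on the reported numbers; the dictionary and helper names are illustrative and not part of the PR.

```python
# TFLOP/s/GPU values copied from the sweep table.
tflops_per_gpu = {
    "fp8norc": 256,
    "fp8lessrc": 244,
    "nvfp4norc": 243,
    "nvfp4lessrc": 226,
    "hybridep": 205,
    "baseline": 180,
    "optimizer-offload": 138,
}


def gain_pct(config, ref="baseline"):
    """Percentage throughput change of `config` relative to `ref`."""
    return (tflops_per_gpu[config] / tflops_per_gpu[ref] - 1.0) * 100.0


fp8_gain = round(gain_pct("fp8norc"))       # vs the BF16 all-to-all baseline
hybridep_gain = round(gain_pct("hybridep"))  # HybridEP alone, still BF16
backward_cut = round((37 - 26) / 37 * 100)   # backward pass, 37 s -> 26 s

print(fp8_gain, hybridep_gain, backward_cut)  # 42 14 30
```

The 42% figure matches the "+42% over the BF16 all-to-all baseline" stated in the commit message for the sweep.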

Testing / verification

  • Env: in-container imports OK (torch 2.11 / TE 2.14.1 / megatron.core / megatron.bridge), arch sm_103.
  • Base graft: id-11 (<|im_end|>) embedding row non-zero post-graft; HF→Megatron import completes.
  • Data: packed parquet (992,759,966 tokens, byte-identical to the reference), 474 iters.
  • 16-GPU faithful run trains (iter-1 lm loss 0.86, ~59 s/iter at the offload baseline); sweep variants all 0-NaN.
  • DP logging verified live: > data_parallel_size = 2 (world_size=16 = TP1 x PP1 x CP8 x DP2).
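The logged identity (world_size = TP x PP x CP x DP) can be reproduced with a few lines. This is a sketch of the arithmetic only, with a hypothetical function name; it is not the code added to training/config.py or training/state.py.

```python
def data_parallel_size(world_size, tp, pp, cp):
    """Derive DP from the world size and the model-parallel degrees.

    EP is not a factor here, matching the logged formula
    world_size = TP x PP x CP x DP.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


dp_16gpu = data_parallel_size(16, tp=1, pp=1, cp=8)   # the chosen layout
dp_cp4 = data_parallel_size(16, tp=1, pp=1, cp=4)     # the DP=4/CP=4 variant that OOMs
dp_1node = data_parallel_size(8, tp=1, pp=1, cp=8)    # single-node case
print(dp_16gpu, dp_cp4, dp_1node)  # 2 4 1
```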

Status / follow-ups

  • The full 474-iter fp8norc run is in progress; §4 loss-vs-6cfuh1ky overlay will be filled in on completion (use _hybridep.yaml BF16 for a quantization-free match).
  • Single-node 8-GPU (DP=1) host-OOMs on the replicated non-expert optimizer — TP=8 (shards it) is the documented lever for that case.
  • 3rdparty/Megatron-LM is intentionally empty in this checkout (the container provides MCore); not part of this PR.

🤖 Generated with Claude Code

https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi

Kyle1668 and others added 6 commits June 24, 2026 00:41
…b_1bmix_32k on B300

Add Verda-Test (2x8 B300, x86_64, CUDA 13, InfiniBand) support as Verda-parallel
files; Isambard pipelines untouched. Runs the fork inside nvcr.io/nvidia/nemo:26.04.01
via enroot (pyxis is broken here) with PYTHONPATH=$REPO/src — no uv sync needed.

- pipeline_{env,training_launch,training_submit,checkpoint_convert}_verda.* — enroot
  create/start launcher; multi-node fixes (enroot -e MASTER_ADDR/SLURM_NODEID,
  GLOO_SOCKET_IFNAME=eth0), FP32 SSM-state, /home + /mnt/local_disk caches.
- configs/verda/pa_warm_start_sft_120b_1bmix_32k_{8gpu,16gpu}.yaml — reference recipe
  re-laid-out onto 8/16 GPUs (TP1 PP1 EP8 CP8; MoE-paper CP/EP NVLink fold).
- scripts/init_super_base_chat_embeddings.py — Base-Chat-Init graft (restored).
- docs/pa-warm-start-120b-verda.md + CLAUDE.md "Cluster Overview (Verda-Test)".

Data verified byte-identical to reference 6cfuh1ky (992,759,966 tokens, 474 iters).
16-GPU (DP=2, optimizer offloaded to host) trains faithfully: iter-1 lm loss 0.86,
~59 s/iter, ~177 TFLOP/s/GPU.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
…ze logging

Sweep the 120B/16-GPU Verda config to the fastest setup via 15-iter smokes.
fp8norc (FP8 mixed + no-recompute + HybridEP, CP=8) hits 256 TFLOP/s/GPU
(41.1 s/iter) -- clears the 250 target, +42% over the BF16 all-to-all
baseline; 0 NaN, fits 268 GB.

- configs/verda: add _16gpu_fp8norc.yaml (fastest) and _16gpu_hybridep.yaml
  (best byte-faithful BF16, 205 TFLOP/s/GPU -- for the loss comparison vs the
  pure-BF16 reference 6cfuh1ky; FP8 shifts the curve slightly)
- docs: section 3.5 optimization-sweep table + findings -- FP8->drop-recompute
  is the lever (backward 37->26 s); TP=8 ties CP=8 on B300 (only 8/88 layers
  are attention, experts fold to EP regardless); DP=4/CP=4 OOMs (doubles the
  dispatch transient); optimizer CPU offload is -38% at PP=1/DP=2
- log data_parallel_size at startup and record it to the W&B run config
  (training/config.py, training/state.py) -- previously underivable from logs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
The full epoch completed (0 NaN / 0 skipped, ckpt at iter_0000474). Final
lm loss 0.502 vs the BF16 reference 6cfuh1ky's ~0.49 -- FP8 did not perturb
convergence, so the run is a faithful reproduction at ~3.6x the reference's
per-GPU throughput (240 vs 66 TFLOP/s/GPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
…heckpoints

The fp8norc checkpoints are FP8-trained (run_config fp8: hybrid). TE FP8 GEMMs
require the product of the leading dims to be divisible by 8, so a small greedy-gen
batch (e.g. a 25-token prompt) raises an FP8-execution assertion during coherence
generation. Clear m.config.fp8 / fp8_param before the forward so the fp8_autocast
gate is off and inference runs in BF16 (load_megatron_model only nulls fp8 for the
CPU-init path; the GPU multi-rank load otherwise keeps the checkpoint's fp8).
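
A minimal sketch of the two ideas in this commit message: the divisibility constraint that trips on a short prompt, and clearing the FP8 fields before the forward pass. The model object is a stand-in built from `SimpleNamespace`, the helper names and the hidden size of 4096 are hypothetical, and resetting the fields to `None`/`False` is an assumption about what "clear" means here.

```python
from math import prod
from types import SimpleNamespace


def fp8_leading_dims_ok(shape):
    """True if the product of all dims except the last is divisible by 8."""
    return prod(shape[:-1]) % 8 == 0


def force_bf16_inference(model):
    """Clear the FP8 fields so the FP8 autocast gate stays off (assumed values)."""
    model.config.fp8 = None
    model.config.fp8_param = False
    return model


# Stand-in for a loaded FP8-trained checkpoint (run_config fp8: hybrid).
model = SimpleNamespace(config=SimpleNamespace(fp8="hybrid", fp8_param=True))

short_prompt_ok = fp8_leading_dims_ok((25, 1, 4096))  # 25-token prompt, batch 1
padded_prompt_ok = fp8_leading_dims_ok((32, 1, 4096))
print(short_prompt_ok, padded_prompt_ok)  # False True

force_bf16_inference(model)
print(model.config.fp8, model.config.fp8_param)  # None False
```

Padding the prompt to a multiple of 8 would also satisfy the shape check, but turning FP8 off avoids depending on the prompt length at all.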

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
Doubles train_iters 474 -> 948 (2 epochs of the 30,341-row pack) on the proven
fastest topology (TP1/CP8/EP8, tensorwise-FP8 current scaling, HybridEP, no
recompute) to test whether a second epoch improves downstream. Fresh load/save
dir + W&B name so it trains from the grafted Base (not a resume of the 1-epoch
run); LR cosine now decays over 948 iters, warmup stays 10%; saves at iter 474
(epoch-1, directly comparable to the 1-epoch run) and 948.
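
The iteration counts follow from the pack size and the global batch size of 64 used in the sweep. The sketch below assumes the trailing partial batch is dropped, which is the reading consistent with 474 iterations per epoch; the names are illustrative.

```python
def iters_per_epoch(rows, global_batch_size):
    """Full global batches per epoch, assuming the partial last batch is dropped."""
    return rows // global_batch_size


ROWS = 30_341  # rows in the packed dataset
GBS = 64       # global batch size

one_epoch = iters_per_epoch(ROWS, GBS)
train_iters = 2 * one_epoch
save_iters = [one_epoch * k for k in (1, 2)]
dropped_rows = ROWS - one_epoch * GBS

print(one_epoch, train_iters, save_iters, dropped_rows)  # 474 948 [474, 948] 5
```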

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
The 2-epoch config was a speculative variant with no references and no recorded
run. The 8-GPU single-node config ships the host-OOM layout the docs mark as WIP
(DP=1 replicates the non-expert optimizer x8 onto one host; needs PP=2/TP=2 to
shard) and cannot complete as-is. Remove both and the now-dangling 8gpu usage
example. The docs still note the single-node limitation as a result; the kept
configs (16gpu base, fp8norc, hybridep) are all doc-referenced.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi