feat(verda): B300 cluster support + reproduce pa_warm_start_sft_120b_1bmix_32k (256 TFLOP/s/GPU) #13
Open
Kyle1668 wants to merge 6 commits into
…b_1bmix_32k on B300
Add Verda-Test (2x8 B300, x86_64, CUDA 13, InfiniBand) support as Verda-parallel
files; Isambard pipelines untouched. Runs the fork inside nvcr.io/nvidia/nemo:26.04.01
via enroot (pyxis is broken here) with PYTHONPATH=$REPO/src — no uv sync needed.
- pipeline_{env,training_launch,training_submit,checkpoint_convert}_verda.* — enroot
create/start launcher; multi-node fixes (enroot -e MASTER_ADDR/SLURM_NODEID,
GLOO_SOCKET_IFNAME=eth0), FP32 SSM-state, /home + /mnt/local_disk caches.
- configs/verda/pa_warm_start_sft_120b_1bmix_32k_{8gpu,16gpu}.yaml — reference recipe
re-laid-out onto 8/16 GPUs (TP1 PP1 EP8 CP8; MoE-paper CP/EP NVLink fold).
- scripts/init_super_base_chat_embeddings.py — Base-Chat-Init graft (restored).
- docs/pa-warm-start-120b-verda.md + CLAUDE.md "Cluster Overview (Verda-Test)".
Data verified byte-identical to reference 6cfuh1ky (992,759,966 tokens, 474 iters).
16-GPU (DP=2, optimizer offloaded to host) trains faithfully: iter-1 lm loss 0.86,
~59 s/iter, ~177 TFLOP/s/GPU.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
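The multi-node fixes in this commit come down to forwarding a few variables into each node's container. The following is a minimal Python sketch of how a launcher might assemble them, not the code in the pipeline scripts. The helper name `enroot_env_args`, the host name, the port, the container name and the `train.py` entry point are all hypothetical; only the variable names and the `eth0` interface come from this PR (NCCL_SOCKET_IFNAME is taken from the PR description). It also assumes that `enroot start` accepts repeated `-e KEY=VALUE` flags.

```python
import shlex


def enroot_env_args(master_addr: str, master_port: int, node_id: int,
                    ifname: str = "eth0") -> list[str]:
    """Build the `-e KEY=VALUE` flags forwarded into one node's container."""
    env = {
        # Rendezvous endpoint shared by every rank.
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Node index, so ranks inside the container can compute their global rank.
        "SLURM_NODEID": str(node_id),
        # Pin the bootstrap sockets to one interface (the cross-node fix).
        "GLOO_SOCKET_IFNAME": ifname,
        "NCCL_SOCKET_IFNAME": ifname,
    }
    args: list[str] = []
    for key, value in env.items():
        args += ["-e", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    flags = enroot_env_args("node-0", 29500, node_id=1)
    print(shlex.join(["enroot", "start", *flags, "nemo-verda", "python", "train.py"]))
```

Keeping the variables in one dictionary makes it easy to diff what node 0 and node 1 receive when a cross-node bootstrap hangs.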
…ze logging
Sweep the 120B/16-GPU Verda config to the fastest setup via 15-iter smokes.
fp8norc (FP8 mixed + no-recompute + HybridEP, CP=8) hits 256 TFLOP/s/GPU
(41.1 s/iter) -- clears the 250 target, +42% over the BF16 all-to-all baseline;
0 NaN, fits 268 GB.
- configs/verda: add _16gpu_fp8norc.yaml (fastest) and _16gpu_hybridep.yaml
(best byte-faithful BF16, 205 TFLOP/s/GPU -- for the loss comparison vs the
pure-BF16 reference 6cfuh1ky; FP8 shifts the curve slightly)
- docs: section 3.5 optimization-sweep table + findings -- FP8->drop-recompute
is the lever (backward 37->26 s); TP=8 ties CP=8 on B300 (only 8/88 layers are
attention, experts fold to EP regardless); DP=4/CP=4 OOMs (doubles the dispatch
transient); optimizer CPU offload is -38% at PP=1/DP=2
- log data_parallel_size at startup and record it to the W&B run config
(training/config.py, training/state.py) -- previously underivable from logs
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
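The relationship behind the newly logged data_parallel_size can be sketched in a few lines of Python. This assumes the factorisation the PR itself reports (world_size = TP x PP x CP x DP). Expert parallelism is left out because it does not appear as a factor in that logged line. The function name and error message are illustrative and are not taken from training/config.py.

```python
def data_parallel_size(world_size: int, tp: int, pp: int, cp: int) -> int:
    """Derive DP from the world size and the model-parallel degrees."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# The layout used for the sweep winner: 16 GPUs, TP1 x PP1 x CP8.
print(data_parallel_size(16, tp=1, pp=1, cp=8))  # 2
# The layout reported to OOM in the sweep: same 16 GPUs, CP halved.
print(data_parallel_size(16, tp=1, pp=1, cp=4))  # 4
```

Both layouts are valid factorisations of 16 GPUs, which is why the value could not be read off the world size alone and had to be logged.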
The full epoch completed (0 NaN / 0 skipped, ckpt at iter_0000474). Final lm
loss 0.502 vs the BF16 reference 6cfuh1ky's ~0.49 -- FP8 did not perturb
convergence, so the run is a faithful reproduction at ~3.6x the reference's
per-GPU throughput (240 vs 66 TFLOP/s/GPU).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
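The throughput figures quoted in these commits can be cross-checked with simple arithmetic. The sketch below is illustrative only. It assumes the model FLOPs per iteration are the same in the 15-iter smoke and the full run, so that seconds per iteration scale inversely with TFLOP/s/GPU. The implied sustained iteration time is therefore an estimate, not a measured value.

```python
def speedup(new_tflops: float, ref_tflops: float) -> float:
    """Ratio of per-GPU throughput between two runs."""
    return new_tflops / ref_tflops


def scaled_iter_time(smoke_s_per_iter: float, smoke_tflops: float,
                     sustained_tflops: float) -> float:
    """Estimate s/iter at a different throughput, assuming fixed FLOPs per iteration."""
    return smoke_s_per_iter * smoke_tflops / sustained_tflops


# Full-epoch run vs the reference 6cfuh1ky (figures from the commit message).
print(f"{speedup(240, 66):.1f}x")  # 3.6x

# Smoke: 41.1 s/iter at 256 TFLOP/s/GPU; full run sustained 240 TFLOP/s/GPU.
print(f"{scaled_iter_time(41.1, 256, 240):.1f} s/iter (estimate)")
```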
…heckpoints
The fp8norc checkpoints are FP8-trained (run_config fp8: hybrid). TE FP8 GEMMs
require the product of the leading dims to be divisible by 8, so a small
greedy-gen batch (e.g. a 25-token prompt) raises an FP8-execution assertion
during coherence generation. Clear m.config.fp8 / fp8_param before the forward
so the fp8_autocast gate is off and inference runs in BF16 (load_megatron_model
only nulls fp8 for the CPU-init path; the GPU multi-rank load otherwise keeps
the checkpoint's fp8).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
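The failure mode and the workaround described above can be illustrated without a GPU. This is a minimal sketch, not the change made in this commit. `fp8_gemm_shape_ok` only encodes the divisibility rule as stated in the commit message. The hidden size of 4096, the `SimpleNamespace` stand-in for the model config, and the cleared values (`None` / `False`) are assumptions.

```python
from math import prod
from types import SimpleNamespace


def fp8_gemm_shape_ok(shape: tuple[int, ...], multiple: int = 8) -> bool:
    """True if the product of all dims except the last is divisible by `multiple`."""
    return prod(shape[:-1]) % multiple == 0


def disable_fp8(config):
    """Clear the FP8 flags so the forward pass is not wrapped in FP8 autocast."""
    config.fp8 = None          # assumed "off" value
    config.fp8_param = False   # assumed "off" value
    return config


# A 25-token greedy-generation prompt with batch size 1 (hidden size is hypothetical).
print(fp8_gemm_shape_ok((1, 25, 4096)))  # False: 25 is not a multiple of 8
print(fp8_gemm_shape_ok((1, 32, 4096)))  # True

# Stand-in for the checkpoint's config, which was saved with fp8: hybrid.
cfg = disable_fp8(SimpleNamespace(fp8="hybrid", fp8_param=True))
print(cfg.fp8, cfg.fp8_param)  # None False
```

Padding the prompt to a multiple of 8 would be another way to avoid the assertion. Turning FP8 off for inference avoids it for any prompt length.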
Doubles train_iters 474 -> 948 (2 epochs of the 30,341-row pack) on the proven
fastest topology (TP1/CP8/EP8, tensorwise-FP8 current scaling, HybridEP, no
recompute) to test whether a second epoch improves downstream. Fresh load/save
dir + W&B name so it trains from the grafted Base (not a resume of the 1-epoch
run); LR cosine now decays over 948 iters, warmup stays 10%; saves at iter 474
(epoch-1, directly comparable to the 1-epoch run) and 948.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
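The iteration counts follow from the pack size and the global batch size. The sketch below takes GBS 64 from the sweep header in the PR description and assumes the trailing partial batch is dropped. The schedule function is a generic linear-warmup plus cosine-decay shape; it is not the trainer's scheduler, and the peak learning rate is a hypothetical placeholder.

```python
import math


def iters_per_epoch(rows: int, global_batch_size: int) -> int:
    """Full batches in one pass over the packed rows (partial batch dropped)."""
    return rows // global_batch_size


def lr_at(step: int, total_iters: int, peak_lr: float,
          warmup_frac: float = 0.10, min_lr: float = 0.0) -> float:
    """Linear warmup for `warmup_frac` of training, then cosine decay to `min_lr`."""
    warmup_iters = int(total_iters * warmup_frac)
    if step < warmup_iters:
        return peak_lr * (step + 1) / warmup_iters
    progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


one_epoch = iters_per_epoch(30_341, 64)
print(one_epoch, 2 * one_epoch)  # 474 948

PEAK_LR = 1e-5  # hypothetical value, for illustration only
# At the epoch-1 save point the 948-iter schedule has decayed only part of the way.
print(lr_at(474, 948, PEAK_LR) > lr_at(473, 474, PEAK_LR))  # True
```

The last line shows why the iter-474 checkpoint of the 2-epoch run is comparable in data seen, but not in learning-rate state, to the end of the 1-epoch run.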
The 2-epoch config was a speculative variant with no references and no recorded
run. The 8-GPU single-node config ships the host-OOM layout the docs mark as WIP
(DP=1 replicates the non-expert optimizer x8 onto one host; needs PP=2/TP=2 to
shard) and cannot complete as-is. Remove both and the now-dangling 8gpu usage
example. The docs still note the single-node limitation as a result; the kept
configs (16gpu base, fp8norc, hybridep) are all doc-referenced.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
Summary
Stands up the fork's pipelines on the Verda-Test cluster (2×8 NVIDIA B300, x86_64, InfiniBand, CUDA 13, NeMo-container runtime) and reproduces pa_warm_start_sft_120b_1bmix_32k_v1 (warm-start SFT of Nemotron-3-Super-120B-A12B-Base on the 1B reasoning mix, seq 32768) on 8/16 GPUs, then sweeps it to the fastest config: 256 TFLOP/s/GPU (clears the 250 target).
Verda-parallel additions only; Isambard files are untouched.
What's included
Verda cluster support (29fa4e21)
- pipeline_env_verda.sh: enroot import of nvcr.io/nvidia/nemo:26.04.01, fork via PYTHONPATH=$REPO/src (no uv sync); pyxis is broken here, so we enroot create + start.
- pipeline_training_{launch,submit}_verda.*: minimal NCCL (IB auto-detected by NCCL 2.29.7), GLOO/NCCL_SOCKET_IFNAME=eth0 (the cross-node bootstrap fix), MASTER_ADDR/PORT/SLURM_NODEID passed into the container, ISAMBARD_FP32_SSM_STATE=checkpoint preserved (32k Mamba NaN guard).
- pipeline_checkpoint_convert_verda.sh, plus the base chat-init graft (scripts/init_super_base_chat_embeddings.py: copies the 1188 dead-norm embedding rows from Instruct, else Inf in bucket #0 at iter 2).
- configs/verda/..._16gpu.yaml and ..._8gpu.yaml, the full reproduction doc (docs/pa-warm-start-120b-verda.md), and the CLAUDE.md Verda section.
Fastest config + DP logging (07a57555)
- configs/verda/..._16gpu_fp8norc.yaml: the sweep winner (FP8 mixed + no-recompute + HybridEP, CP=8): 256 TFLOP/s/GPU, 41.1 s/iter, 0 NaN, fits 268 GB.
- configs/verda/..._16gpu_hybridep.yaml: best byte-faithful BF16 (205 TFLOP/s/GPU) for the quantization-free loss comparison vs 6cfuh1ky.
- data_parallel_size is logged at startup and recorded to the W&B run config (training/config.py, training/state.py); it was previously underivable from the logs.
Optimization sweep (15-iter random-init smokes, 2×8 B300, TP1·PP1·EP8·ETP1·CP8·DP2, seq 32768, GBS 64)
Findings: FP8→drop-recompute is the lever (backward 37→26 s); TP=8 ties CP=8 on B300 (only 8/88 layers are attention, experts fold to EP regardless); DP=4/CP=4 OOMs (doubles the dispatch transient past 268 GB); optimizer CPU offload is −38% at PP=1/DP=2.
Testing / verification
- …<|im_end|>) embedding row non-zero post-graft; HF→Megatron import completes.
- Startup log: > data_parallel_size = 2 (world_size=16 = TP1 x PP1 x CP8 x DP2).
Status / follow-ups
- fp8norc run is in progress; §4 loss-vs-6cfuh1ky overlay will be filled in on completion (use _hybridep.yaml BF16 for a quantization-free match).
- 3rdparty/Megatron-LM is intentionally empty in this checkout (the container provides MCore); not part of this PR.
🤖 Generated with Claude Code
https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi