feat(verda): B300 cluster support + reproduce pa_warm_start_sft_120b_1bmix_32k (256 TFLOP/s/GPU) by Kyle1668 · Pull Request #13 · GeodesicResearch/geodesic-megatron

Kyle1668 · 2026-06-24T05:38:23Z

Summary

Stands up the fork's pipelines on the Verda-Test cluster (2×8 NVIDIA B300, x86_64, InfiniBand, CUDA 13, NeMo-container runtime) and reproduces pa_warm_start_sft_120b_1bmix_32k_v1 (warm-start SFT of Nemotron-3-Super-120B-A12B-Base on the 1B reasoning mix, seq 32768) on 8/16 GPUs — then sweeps it to the fastest config: 256 TFLOP/s/GPU (clears the 250 target).

Verda-parallel additions only; Isambard files are untouched.

What's included

Verda cluster support (29fa4e21)

Container env (pipeline_env_verda.sh) — enroot import of nvcr.io/nvidia/nemo:26.04.01, fork via PYTHONPATH=$REPO/src (no uv sync); pyxis is broken here so we enroot create+start.
Launcher + submit (pipeline_training_{launch,submit}_verda.*) — minimal NCCL (IB auto-detected by NCCL 2.29.7), GLOO/NCCL_SOCKET_IFNAME=eth0 (the cross-node bootstrap fix), MASTER_ADDR/PORT/SLURM_NODEID passed into the container, ISAMBARD_FP32_SSM_STATE=checkpoint preserved (32k Mamba NaN guard).
Checkpoint convert (pipeline_checkpoint_convert_verda.sh), base chat-init graft (scripts/init_super_base_chat_embeddings.py — copies the 1188 dead-norm embedding rows from Instruct, else Inf in bucket #0 at iter 2).
Faithful configs (configs/verda/..._16gpu.yaml, ..._8gpu.yaml) + full reproduction doc (docs/pa-warm-start-120b-verda.md) + CLAUDE.md Verda section.
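The chat-init graft can be pictured with a minimal sketch. It assumes "dead-norm" means an embedding row whose L2 norm is (near) zero, and it uses plain Python lists in place of the real checkpoint tensors. The function name `graft_dead_rows` and the `eps` threshold are illustrative and are not taken from scripts/init_super_base_chat_embeddings.py.

```python
import math


def graft_dead_rows(base_rows, donor_rows, eps=1e-8):
    """Copy donor rows into base wherever the base row has near-zero L2 norm.

    Mutates base_rows in place and returns the indices that were replaced.
    eps is an assumed threshold, not a value from the PR.
    """
    if len(base_rows) != len(donor_rows):
        raise ValueError("base and donor must have the same number of rows")
    replaced = []
    for idx, row in enumerate(base_rows):
        norm = math.sqrt(sum(x * x for x in row))
        if norm <= eps:
            base_rows[idx] = list(donor_rows[idx])
            replaced.append(idx)
    return replaced


# Toy 3-row "vocabulary": row 1 is dead in Base and live in Instruct.
base = [[0.1, -0.2], [0.0, 0.0], [0.3, 0.4]]
instruct = [[0.1, -0.1], [0.5, 0.6], [0.2, 0.2]]
replaced = graft_dead_rows(base, instruct)
print(f"grafted rows: {replaced}")
```

Live Base rows are left untouched, so only the dead rows pick up Instruct values. In the real run this set would be the 1188 rows the PR mentions.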

Fastest config + DP logging (07a57555)

configs/verda/..._16gpu_fp8norc.yaml: the sweep winner (FP8 mixed + no-recompute + HybridEP, CP=8) at 256 TFLOP/s/GPU, 41.1 s/iter, 0 NaN, 204 GB peak (fits within 268 GB).
configs/verda/..._16gpu_hybridep.yaml — best byte-faithful BF16 (205 TFLOP/s/GPU) for the quantization-free loss comparison vs 6cfuh1ky.
data_parallel_size logged at startup and recorded to the W&B run config (training/config.py, training/state.py) — it was previously underivable from the logs.

Optimization sweep (15-iter random-init smokes, 2×8 B300, TP1·PP1·EP8·ETP1·CP8·DP2, seq 32768, GBS 64)

Config	s/iter	TFLOP/s/GPU	peak GB	precision	recompute
fp8norc (chosen)	41.1	256	204	FP8	none
fp8norc, TP=8/CP=1	41.1	256	203	FP8	none
fp8lessrc	43.1	244	201	FP8	[shared_experts]
nvfp4norc	43.3	243	210	NVFP4	none
nvfp4lessrc	46.6	226	207	NVFP4	[shared_experts]
hybridep (best BF16)	51.3	205	226	BF16	[moe, shared_experts]
baseline (all-to-all)	55.4	180	—	BF16	[moe, shared_experts]
optimizer-offload	76.0	138	160	FP8	[moe, shared_experts]

Findings: FP8→drop-recompute is the lever (backward 37→26 s); TP=8 ties CP=8 on B300 (only 8/88 layers are attention, experts fold to EP regardless); DP=4/CP=4 OOMs (doubles the dispatch transient past 268 GB); optimizer CPU offload is −38% at PP=1/DP=2.
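As a quick arithmetic check on the sweep, the sketch below recomputes each config's throughput gain over the BF16 all-to-all baseline from the table values. The dictionary simply transcribes the table; the helper name `gain_over` is illustrative.

```python
# (s/iter, TFLOP/s/GPU) transcribed from the sweep table
sweep = {
    "fp8norc": (41.1, 256),
    "fp8lessrc": (43.1, 244),
    "nvfp4norc": (43.3, 243),
    "nvfp4lessrc": (46.6, 226),
    "hybridep": (51.3, 205),
    "baseline": (55.4, 180),
    "optimizer-offload": (76.0, 138),
}


def gain_over(name, ref="baseline"):
    """Relative TFLOP/s/GPU gain of `name` over `ref`, in percent."""
    return (sweep[name][1] / sweep[ref][1] - 1.0) * 100.0


for name in sweep:
    print(f"{name:18s} {gain_over(name):+6.1f}% vs baseline")

# Backward-pass time reduction quoted in the findings (37 s -> 26 s)
backward_cut = (1.0 - 26 / 37) * 100.0
print(f"backward time reduced by {backward_cut:.0f}%")
```

Under these numbers fp8norc comes out at roughly +42% over the baseline, which matches the figure quoted in the sweep commit message.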

Testing / verification

Env: in-container imports OK (torch 2.11 / TE 2.14.1 / megatron.core / megatron.bridge), arch sm_103.
Base graft: id-11 (<|im_end|>) embedding row non-zero post-graft; HF→Megatron import completes.
Data: packed parquet (992,759,966 tokens, byte-identical to the reference), 474 iters.
16-GPU faithful run trains (iter-1 lm loss 0.86, ~59 s/iter at the offload baseline); sweep variants all 0-NaN.
DP logging verified live: > data_parallel_size = 2 (world_size=16 = TP1 x PP1 x CP8 x DP2).
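The logged line follows from the usual factorization of the world size. A minimal sketch, assuming data-parallel size is whatever remains after dividing out TP, PP and CP (EP is not a factor in the logged product); the function here is illustrative and is not the code added to training/config.py.

```python
def data_parallel_size(world_size, tp=1, pp=1, cp=1):
    """Derive DP from world size and the TP x PP x CP model-parallel product."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# The 2x8 B300 layout from the sweep: TP1, PP1, CP8 on 16 GPUs.
dp = data_parallel_size(16, tp=1, pp=1, cp=8)
print(f"> data_parallel_size = {dp} (world_size=16 = TP1 x PP1 x CP8 x DP{dp})")

# The DP=4/CP=4 variant that OOMs in the sweep has the same world size.
dp_cp4 = data_parallel_size(16, tp=1, pp=1, cp=4)
```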

Status / follow-ups

The full 474-iter fp8norc run is in progress; §4 loss-vs-6cfuh1ky overlay will be filled in on completion (use _hybridep.yaml BF16 for a quantization-free match).
Single-node 8-GPU (DP=1) host-OOMs on the replicated non-expert optimizer — TP=8 (shards it) is the documented lever for that case.
3rdparty/Megatron-LM is intentionally empty in this checkout (the container provides MCore); not part of this PR.

🤖 Generated with Claude Code

https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi

…b_1bmix_32k on B300

Add Verda-Test (2x8 B300, x86_64, CUDA 13, InfiniBand) support as Verda-parallel files; Isambard pipelines untouched. Runs the fork inside nvcr.io/nvidia/nemo:26.04.01 via enroot (pyxis is broken here) with PYTHONPATH=$REPO/src, so no uv sync is needed.

- pipeline_{env,training_launch,training_submit,checkpoint_convert}_verda.*: enroot create/start launcher; multi-node fixes (enroot -e MASTER_ADDR/SLURM_NODEID, GLOO_SOCKET_IFNAME=eth0), FP32 SSM-state, /home + /mnt/local_disk caches.
- configs/verda/pa_warm_start_sft_120b_1bmix_32k_{8gpu,16gpu}.yaml: reference recipe re-laid-out onto 8/16 GPUs (TP1 PP1 EP8 CP8; MoE-paper CP/EP NVLink fold).
- scripts/init_super_base_chat_embeddings.py: Base-Chat-Init graft (restored).
- docs/pa-warm-start-120b-verda.md + CLAUDE.md "Cluster Overview (Verda-Test)".

Data verified byte-identical to reference 6cfuh1ky (992,759,966 tokens, 474 iters). 16-GPU (DP=2, optimizer offloaded to host) trains faithfully: iter-1 lm loss 0.86, ~59 s/iter, ~177 TFLOP/s/GPU.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
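The multi-node fixes amount to forwarding a handful of variables into the container. The sketch below only builds the argument list for an `enroot start` call with `-e KEY=VALUE` flags; it is an assumption about how the launcher might be shaped, not a copy of pipeline_training_launch_verda.*. The container name `nemo-verda`, the command `python train.py`, the host `node-0` and port 29500 are hypothetical placeholders.

```python
import shlex


def enroot_start_cmd(container, command, master_addr, master_port, node_id,
                     ifname="eth0"):
    """Build an `enroot start` argv that forwards the rendezvous variables."""
    env = {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        "SLURM_NODEID": str(node_id),
        # Pin both bootstrap interfaces, as in the cross-node fix.
        "GLOO_SOCKET_IFNAME": ifname,
        "NCCL_SOCKET_IFNAME": ifname,
        # 32k Mamba NaN guard preserved from the Isambard launcher.
        "ISAMBARD_FP32_SSM_STATE": "checkpoint",
    }
    argv = ["enroot", "start"]
    for key, value in env.items():
        argv += ["-e", f"{key}={value}"]
    argv += [container] + list(command)
    return argv


# Hypothetical values, for illustration only.
argv = enroot_start_cmd("nemo-verda", ["python", "train.py"], "node-0", 29500, 0)
print(shlex.join(argv))
```

Building the argv as a list rather than a shell string keeps quoting out of the picture and makes the forwarded variables easy to inspect before launch.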

…ze logging

Sweep the 120B/16-GPU Verda config to the fastest setup via 15-iter smokes. fp8norc (FP8 mixed + no-recompute + HybridEP, CP=8) hits 256 TFLOP/s/GPU (41.1 s/iter) -- clears the 250 target, +42% over the BF16 all-to-all baseline; 0 NaN, fits 268 GB.

- configs/verda: add _16gpu_fp8norc.yaml (fastest) and _16gpu_hybridep.yaml (best byte-faithful BF16, 205 TFLOP/s/GPU -- for the loss comparison vs the pure-BF16 reference 6cfuh1ky; FP8 shifts the curve slightly)
- docs: section 3.5 optimization-sweep table + findings -- FP8->drop-recompute is the lever (backward 37->26 s); TP=8 ties CP=8 on B300 (only 8/88 layers are attention, experts fold to EP regardless); DP=4/CP=4 OOMs (doubles the dispatch transient); optimizer CPU offload is -38% at PP=1/DP=2
- log data_parallel_size at startup and record it to the W&B run config (training/config.py, training/state.py) -- previously underivable from logs

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi

The full epoch completed (0 NaN / 0 skipped, ckpt at iter_0000474). Final lm loss 0.502 vs the BF16 reference 6cfuh1ky's ~0.49 -- FP8 did not perturb convergence, so the run is a faithful reproduction at ~3.6x the reference's per-GPU throughput (240 vs 66 TFLOP/s/GPU).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi

…heckpoints

The fp8norc checkpoints are FP8-trained (run_config fp8: hybrid). TE FP8 GEMMs require the product of the leading dims to be divisible by 8, so a small greedy-gen batch (e.g. a 25-token prompt) raises an FP8-execution assertion during coherence generation. Clear m.config.fp8 / fp8_param before the forward so the fp8_autocast gate is off and inference runs in BF16 (load_megatron_model only nulls fp8 for the CPU-init path; the GPU multi-rank load otherwise keeps the checkpoint's fp8).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
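The shape constraint and the workaround can be sketched as follows. This is an illustration under stated assumptions: the divisibility rule is taken as described in the commit message, a `SimpleNamespace` stands in for the real `m.config`, the hidden size 4096 is a placeholder, and the cleared values (`None`, `False`) are a guess at what "clear" means for these fields.

```python
from math import prod
from types import SimpleNamespace


def fp8_leading_dims_ok(shape, multiple=8):
    """True if the product of all dims except the last is divisible by `multiple`."""
    return prod(shape[:-1]) % multiple == 0


def disable_fp8(config):
    """Clear the FP8 fields so the forward pass is not wrapped in FP8 autocast."""
    config.fp8 = None
    config.fp8_param = False
    return config


# Stand-in for the loaded model's config (the real object is m.config).
config = SimpleNamespace(fp8="hybrid", fp8_param=True)

# A single 25-token prompt: 1 * 25 = 25 is not a multiple of 8.
prompt_shape = (1, 25, 4096)  # (batch, tokens, hidden); hidden is a placeholder
print("FP8-safe shape:", fp8_leading_dims_ok(prompt_shape))

# Clear unconditionally before generation, as the commit describes.
disable_fp8(config)
```

Clearing the fields unconditionally avoids having to pad prompts to a multiple of 8 for what is only a coherence check.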

Doubles train_iters 474 -> 948 (2 epochs of the 30,341-row pack) on the proven fastest topology (TP1/CP8/EP8, tensorwise-FP8 current scaling, HybridEP, no recompute) to test whether a second epoch improves downstream. Fresh load/save dir + W&B name so it trains from the grafted Base (not a resume of the 1-epoch run); LR cosine now decays over 948 iters, warmup stays 10%; saves at iter 474 (epoch-1, directly comparable to the 1-epoch run) and 948.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi
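The iteration counts and the stretched schedule can be checked with a short sketch. It assumes the global batch size of 64 from the sweep header, floor division for iterations per epoch, linear warmup followed by cosine decay, and a warmup length rounded down. `MAX_LR` is a hypothetical value chosen only to make the numbers printable, and the framework's actual scheduler may differ in rounding and minimum LR.

```python
import math

ROWS = 30_341   # packed rows, from the commit message
GBS = 64        # global batch size, from the sweep header
MAX_LR = 1e-5   # hypothetical peak LR, for illustration only

iters_per_epoch = ROWS // GBS       # partial final batch dropped (assumption)
train_iters = 2 * iters_per_epoch


def lr_at(step, total, max_lr, min_lr=0.0, warmup_frac=0.10):
    """Linear warmup over warmup_frac of `total`, then cosine decay to min_lr."""
    warmup = max(1, int(total * warmup_frac))
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"iters/epoch = {iters_per_epoch}, 2-epoch train_iters = {train_iters}")
print(f"LR at iter 474, 1-epoch schedule: {lr_at(474, 474, MAX_LR):.2e}")
print(f"LR at iter 474, 2-epoch schedule: {lr_at(474, 948, MAX_LR):.2e}")
```

Under these assumptions the 2-epoch run is still partway down the cosine at iter 474, whereas the 1-epoch run has fully decayed by then, which is worth keeping in mind when comparing the two iter-474 checkpoints.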

The 2-epoch config was a speculative variant with no references and no recorded run. The 8-GPU single-node config ships the host-OOM layout the docs mark as WIP (DP=1 replicates the non-expert optimizer x8 onto one host; needs PP=2/TP=2 to shard) and cannot complete as-is. Remove both and the now-dangling 8gpu usage example. The docs still note the single-node limitation as a result; the kept configs (16gpu base, fp8norc, hybridep) are all doc-referenced.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01RgPwybywvYQvZY3koEUPbi

Kyle1668 and others added 6 commits June 24, 2026 00:41

Kyle1668 wants to merge 6 commits into main from kyle/verda-cluster-support
