[NV] B300 (Agg): migrate model path#1539
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
| HF_HUB_CACHE_MOUNT="/scratch/models/" | ||
|
|
||
| # HF_HUB_CACHE is set to help with dataset download inside the container | ||
| # for eval jobs. Can be updated to some other path on the cluster and | ||
| # mounted just like HF_HUB_CACHE_MOUNT. | ||
| export HF_HUB_CACHE="$HOME/.cache/huggingface" | ||
|
|
||
| # Rewrite MODEL from HF id (org/name) to the pre-staged local path under | ||
| # HF_HUB_CACHE_MOUNT. Skip if MODEL is already an absolute path. | ||
| if [[ -n "$MODEL" && "$MODEL" != /* ]]; then | ||
| export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}" | ||
| fi | ||
|
|
There was a problem hiding this comment.
🔴 The new universal MODEL rewrite (launch_b300-nv.sh lines 290-294) also fires for the three *-b300-vllm-agentic configs that the PR description explicitly lists as out of scope. Because every single-node B300 job goes through this launcher (runner label b300 maps to physical b300-nv_X nodes in .github/configs/runners.yaml, and benchmark-tmpl.yml dispatches via ${RUNNER_NAME%%_*}), the agentic configs get MODEL rewritten to an absolute /scratch/models/<basename> path; the agentic bench scripts still gate hf download on MODEL != /* so the fallback is skipped, and they now silently require pre-staged copies at /scratch/models/{DeepSeek-V4-Pro,Kimi-K2.5-NVFP4,MiniMax-M2.5} — a contract this PR never establishes. Scope the rewrite (e.g. guard on SCENARIO_TYPE != agentic-coding or IS_AGENTIC == 0) or explicitly include the agentic configs in the migration with documented staging.
Extended reasoning...
What the bug is
The PR replaces the old partial rewrite (Qwen-FP8 literal + MODEL_PREFIX == dsv4) with a universal rewrite in runners/launch_b300-nv.sh:
HF_HUB_CACHE_MOUNT="/scratch/models/"
if [[ -n "$MODEL" && "$MODEL" != /* ]]; then
export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}"
fiThe PR description explicitly lists *-b300-vllm-agentic configs and agentic/*_b300*.sh bench scripts as out of scope, yet this rewrite runs for them too.
Why the agentic configs flow through the same launcher
.github/configs/runners.yaml:136-145defines labelb300as the physical runnersb300-nv_0..8. There is no separateb300runner pool..github/workflows/benchmark-tmpl.yml:188runsbash ./runners/launch_${RUNNER_NAME%%_*}.sh, whereRUNNER_NAME=${{ runner.name }}(line 177). Sincerunner.nameis the physical nameb300-nv_X, this resolves tolaunch_b300-nv.shregardless of whether the config saysrunner: b300orrunner: b300-nv.- Only
runners/launch_b300-nv.shexists — there is nolaunch_b300.sh. (The pre-PR code already contained anelif [[ $MODEL_PREFIX == 'dsv4' ]]branch that clearly assumed the dsv4 agentic config flows through this launcher.) - The three agentic configs are NOT retagged in this PR — they still set
runner: b300:
nvidia-master.yaml:2644kimik2.5-fp4-b300-vllm-agenticnvidia-master.yaml:2885dsv4-fp4-b300-vllm-agenticnvidia-master.yaml:4199minimaxm2.5-fp8-b300-vllm-agentic
- The agentic bench scripts still gate the HF fallback download:
benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:42benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh:24benchmarks/single_node/agentic/minimaxm2.5_fp8_b300.sh:28
all containif [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi.
Step-by-step proof (kimik2.5-fp4-b300-vllm-agentic, post-PR)
- Job dispatched with
runner: b300lands onb300-nv_3(say).RUNNER_NAME=b300-nv_3. benchmark-tmpl.yml:188runsbash ./runners/launch_b300-nv.sh(${RUNNER_NAME%%_*}→b300-nv).multinode: false, so the launcher takes the else branch at line 281.- New block at lines 290-294 executes:
MODEL=nvidia/Kimi-K2.5-NVFP4→${MODEL##*/}=Kimi-K2.5-NVFP4, soMODEL=/scratch/models/Kimi-K2.5-NVFP4. - Launcher then
srunsbenchmarks/single_node/agentic/kimik2.5_fp4_b300.sh(SCENARIO_SUBDIR=agentic/for scenario-type=agentic-coding). - The script's line 24 (
if [[ "$MODEL" != /* ]]; then hf download ...) is now false because MODEL is absolute, so no download. vllm serve "$MODEL" ...is invoked. It fails unless/scratch/models/Kimi-K2.5-NVFP4is already staged on the cluster — a contract this PR neither documents nor verifies for the out-of-scope set.
Concrete behavior change vs. main
| Config | Pre-PR MODEL | Post-PR MODEL | New behavior |
|---|---|---|---|
| dsv4-fp4-b300-vllm-agentic | /data/models/dsv4-pro (rewritten by MODEL_PREFIX == dsv4 branch) |
/scratch/models/DeepSeek-V4-Pro |
path + basename changed (dsv4-pro → DeepSeek-V4-Pro) |
| kimik2.5-fp4-b300-vllm-agentic | nvidia/Kimi-K2.5-NVFP4 (no rewrite — fell through, hf download fetched it) |
/scratch/models/Kimi-K2.5-NVFP4 |
newly broken — was downloading via hf, now staged-or-fail |
| minimaxm2.5-fp8-b300-vllm-agentic | MiniMaxAI/MiniMax-M2.5 (no rewrite — same) |
/scratch/models/MiniMax-M2.5 |
newly broken — was downloading via hf, now staged-or-fail |
The kimik2.5 and minimaxm2.5 agentic cases are the most concerning: the pre-PR launcher did not touch them at all, so this PR newly couples them to staging on the B300 cluster, and the dsv4 case silently changes the basename.
Impact
If /scratch/models/{Kimi-K2.5-NVFP4,MiniMax-M2.5,DeepSeek-V4-Pro} are not pre-staged with those exact basenames, all three out-of-scope agentic configs will fail on first run after merge with model-not-found errors, in violation of the PR's own scope statement. Even if staging happens to exist (e.g. via overlap with the in-scope dsv4 staging), the contract is implicit rather than documented.
Fix
Either (a) scope the rewrite to non-agentic runs, e.g.
if [[ -n "$MODEL" && "$MODEL" != /* && "$SCENARIO_TYPE" != "agentic-coding" ]]; then
export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}"
fior (b) explicitly include the three agentic configs in the migration, confirm /scratch/models/{Kimi-K2.5-NVFP4,MiniMax-M2.5,DeepSeek-V4-Pro} are staged, and remove hf download from the corresponding agentic bench scripts.
|
|
||
| nvidia-smi | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
|
|
There was a problem hiding this comment.
🔴 The reduced search-space for dsr1-fp8-b300-sglang is a single point at tp=4, isl=1024, osl=1024, but benchmarks/single_node/dsr1_fp8_b300.sh lines 47-51 contain a guard that does exit 1 when TP=4 is combined with anything other than ISL=8192/OSL=1024. Every run of this config will exit before the server launches, defeating the wiring-verification goal stated in the PR description. Fix by changing the sweep point to tp: 8, switching to isl: 8192, or relaxing the script guard.
Extended reasoning...
What the bug is
The PR shrinks the dsr1-fp8-b300-sglang sweep in .github/configs/nvidia-master.yaml to one point:
- isl: 1024
osl: 1024
search-space:
- { tp: 4, ep: 1, conc-start: 4, conc-end: 4 }But the bench script the launcher resolves for this config, benchmarks/single_node/dsr1_fp8_b300.sh, contains an explicit guard:
elif [[ $TP -eq 4 ]]; then
if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
exit 1
fiWith TP=4 and ISL=1024, $ISL -ne 8192 short-circuits true, the script prints the rejection message and exits 1 before sglang.launch_server is ever invoked.
Code path
runners/launch_b300-nv.sh(agg branch) resolves the bench script asbenchmarks/single_node/dsr1_fp8_b300_sglang.shfirst, then falls back todsr1_fp8_b300.sh(LEGACY_FW_SUFFIXis empty for sglang). Onlydsr1_fp8_b300.shexists, so that is what runs..github/workflows/benchmark-tmpl.ymlexportsTPfrom the matrix entry, soTP=4is passed into the script.- The TP branch above runs and exits 1.
Why pre-existing code doesn't prevent it
Before this PR the only ISL=1024/OSL=1024 entry under dsr1-fp8-b300-sglang was tp: 8, which is handled by the TP=8 branch above the guard. TP=4 only appeared under ISL=8192 — the one combination the guard explicitly permits. The PR's reduction removed the safe TP=8 point and replaced it with the exact (TP=4, ISL=1024) combination the guard rejects.
Impact
The PR description explicitly motivates the search-space reduction as wiring verification ("Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"). Because every job for dsr1-fp8-b300-sglang will exit 1 before launching the server, the model-path wiring for this config will not be verified at all — the script bails before reaching the MODEL=… path read, nvidia-smi, or sglang.launch_server --model-path $MODEL.
Step-by-step proof
- CI matrix produces a job with
tp=4, ep=1, conc=4, isl=1024, osl=1024. benchmark-tmpl.ymlexportsTP=4 ISL=1024 OSL=1024 CONC=4 EP_SIZE=1into the container.launch_b300-nv.sh(agg branch) selectsbenchmarks/single_node/dsr1_fp8_b300.shand execs it.- Script runs
check_env_vars(passes), prints SLURM banner, then runsnvidia-smi. - Control reaches the TP dispatch.
$TP -eq 8is false;$TP -eq 4is true. - Inside the TP=4 branch:
$ISL -ne 8192→ true (ISL is 1024). Short-circuit OR makes the outer test true. - Script prints
TP=4 not yet supported for ISL=1024 OSL=1024!andexit 1. sglang.launch_serveris never invoked; the universal MODEL rewrite the PR adds is never exercised.
Fix
Three equivalent options:
- Change the search-space entry to
tp: 8(matches the previous ISL=1024 sweep). - Change
islto8192(the one ISL the TP=4 branch permits). - Relax the guard in
dsr1_fp8_b300.shso TP=4 supports ISL=1024 (this would require validating the recipe at that shape, since the existing TP=4 tuning was only for 8192/1024).
17119f4 to
a0fa8c4
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26195653467 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26197296230 |
1 similar comment
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26197296230 |
- launcher (launch_b300-nv.sh, agg block): switch HF_HUB_CACHE_MOUNT from /data/models to /scratch/models/ (audit-recommended root), export HF_HUB_CACHE to container's ~/.cache/huggingface for eval datasets, and replace the Qwen-FP8/dsv4 special-case rewrite with a universal MODEL rewrite from HF id (org/name) to absolute local path under the mount. - bench scripts: drop "hf download $MODEL" from 21 in-scope B300 scripts; contract is now staged-or-fail. - configs: pin 23 in-scope B300 keys to runner: b300-nv so jobs route only to the NVIDIA runner, and reduce search-space to a single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end. Out of scope: disagg/multi-node block, multi-node B300 configs (dynamo-trt/-vllm), and agentic configs/bench scripts.
Launcher resolves MODEL_PATH from a B300 staged allow-list: pre-staged models map to read-only /scratch/models/<basename>; everything else maps to writable /data/models/<basename> (now bind-mounted into the srun container). Bench scripts run hf download --local-dir "$MODEL_PATH" (idempotent: hf manages a .cache/huggingface/ inside --local-dir and skips up-to-date files) then serve weights from $MODEL_PATH; when MODEL_PATH is unset (stand-alone runs) they fall back to the default HF cache and serve the HF id directly. $MODEL stays the HF id for the client (tokenizer, --served-model-name).
a0fa8c4 to
5d2e1d4
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26249925931 |
…ize flag style - Add --served-model-name $MODEL to vllm and sglang launches so the API name stays the HF id even though the server loads weights from $MODEL_PATH. Without this, vllm strictly rejects client requests (HTTP 404) when the request's model field doesn't match the served name (= the local path). sglang and trtllm are lenient about the mismatch. - Drop the hardcoded --served-model-name "Qwen/Qwen3.5-397B-A17B" in the qwen3.5-bf16 scripts in favor of $MODEL. - Guard hf download with a dir-non-empty check: hf download --local-dir writes per-file metadata locks under <local-dir>/.cache/huggingface/download/ even on no-op runs, which fails on the read-only /scratch/models/ mount. - Normalize all server-launch flags to --flag value (space) style; sglang scripts previously used --flag=value. - nvidia-master.yaml: dsr1-fp8-b300-sglang 1k1k tp 4 -> 8 (the script's benchmarks/single_node/dsr1_fp8_b300.sh:57-61 guard doesn't support tp=4 at isl=osl=1024).
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26256331107 |
1 similar comment
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26256331107 |
Summary
runners/launch_b300-nv.shagg block): switchHF_HUB_CACHE_MOUNTfrom/data/modelsto/scratch/models/(audit-recommended root), exportHF_HUB_CACHEto container's~/.cache/huggingfacefor eval dataset downloads, and replace the partial Qwen-FP8/dsv4 special-case rewrite with a universalMODELrewrite from HF id (org/name) to absolute local path under the mount.hf download "$MODEL". Contract is now staged-or-fail.qwen3.5_fp8_b300{,_mtp}.shalready lacked the line (legacy of the prior special-case).runner: b300-nvso jobs route only to the NVIDIA runner, and reduce search-space to a single(isl=1024, osl=1024, conc=4)point per config. To revert to full sweep after wiring is confirmed.Out of scope
dsr1-fp{4,8}-b300-dynamo-trt,dsv4-fp4-b300-dynamo-vllm)*-b300-vllm-agentic) andagentic/*_b300*.shbench scripts