[NV] B300 (Agg): migrate model path by Ankur-singh · Pull Request #1539 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-05-20T23:19:38Z

Summary

Launcher (runners/launch_b300-nv.sh agg block): switch HF_HUB_CACHE_MOUNT from /data/models to /scratch/models/ (audit-recommended root), export HF_HUB_CACHE to container's ~/.cache/huggingface for eval dataset downloads, and replace the partial Qwen-FP8/dsv4 special-case rewrite with a universal MODEL rewrite from HF id (org/name) to absolute local path under the mount.
Bench scripts (21 in-scope B300 scripts): drop hf download "$MODEL". Contract is now staged-or-fail. qwen3.5_fp8_b300{,_mtp}.sh already lacked the line (legacy of the prior special-case).
Configs (23 in-scope keys): pin runner: b300-nv so jobs route only to the NVIDIA runner, and reduce search-space to a single (isl=1024, osl=1024, conc=4) point per config. To revert to full sweep after wiring is confirmed.

Out of scope

Disagg/multi-node block in the launcher
Multi-node B300 configs (dsr1-fp{4,8}-b300-dynamo-trt, dsv4-fp4-b300-dynamo-vllm)
Agentic configs (*-b300-vllm-agentic) and agentic/*_b300*.sh bench scripts
Other clusters (H100/H200/B200/GB200/GB300/MI3xx)

github-actions · 2026-05-20T23:19:45Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-05-20T23:35:34Z

+    HF_HUB_CACHE_MOUNT="/scratch/models/"
+
+    # HF_HUB_CACHE is set to help with dataset download inside the container
+    # for eval jobs. Can be updated to some other path on the cluster and
+    # mounted just like HF_HUB_CACHE_MOUNT.
+    export HF_HUB_CACHE="$HOME/.cache/huggingface"
+
+    # Rewrite MODEL from HF id (org/name) to the pre-staged local path under
+    # HF_HUB_CACHE_MOUNT. Skip if MODEL is already an absolute path.
+    if [[ -n "$MODEL" && "$MODEL" != /* ]]; then
+        export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}"
    fi
+


🔴 The new universal MODEL rewrite (launch_b300-nv.sh lines 290-294) also fires for the three *-b300-vllm-agentic configs that the PR description explicitly lists as out of scope. Because every single-node B300 job goes through this launcher (runner label b300 maps to physical b300-nv_X nodes in .github/configs/runners.yaml, and benchmark-tmpl.yml dispatches via ${RUNNER_NAME%%_*}), the agentic configs get MODEL rewritten to an absolute /scratch/models/<basename> path; the agentic bench scripts still gate hf download on MODEL != /* so the fallback is skipped, and they now silently require pre-staged copies at /scratch/models/{DeepSeek-V4-Pro,Kimi-K2.5-NVFP4,MiniMax-M2.5} — a contract this PR never establishes. Scope the rewrite (e.g. guard on SCENARIO_TYPE != agentic-coding or IS_AGENTIC == 0) or explicitly include the agentic configs in the migration with documented staging.

Extended reasoning...

What the bug is

The PR replaces the old partial rewrite (Qwen-FP8 literal + MODEL_PREFIX == dsv4) with a universal rewrite in runners/launch_b300-nv.sh:

HF_HUB_CACHE_MOUNT="/scratch/models/" if [[ -n "$MODEL" && "$MODEL" != /* ]]; then export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}" fi

The PR description explicitly lists *-b300-vllm-agentic configs and agentic/*_b300*.sh bench scripts as out of scope, yet this rewrite runs for them too.

Why the agentic configs flow through the same launcher

.github/configs/runners.yaml:136-145 defines label b300 as the physical runners b300-nv_0..8. There is no separate b300 runner pool.

.github/workflows/benchmark-tmpl.yml:188 runs bash ./runners/launch_${RUNNER_NAME%%_*}.sh, where RUNNER_NAME=${{ runner.name }} (line 177). Since runner.name is the physical name b300-nv_X, this resolves to launch_b300-nv.sh regardless of whether the config says runner: b300 or runner: b300-nv.

Only runners/launch_b300-nv.sh exists — there is no launch_b300.sh. (The pre-PR code already contained an elif [[ $MODEL_PREFIX == 'dsv4' ]] branch that clearly assumed the dsv4 agentic config flows through this launcher.)

The three agentic configs are NOT retagged in this PR — they still set runner: b300:

nvidia-master.yaml:2644 kimik2.5-fp4-b300-vllm-agentic

nvidia-master.yaml:2885 dsv4-fp4-b300-vllm-agentic

nvidia-master.yaml:4199 minimaxm2.5-fp8-b300-vllm-agentic

The agentic bench scripts still gate the HF fallback download:

benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:42

benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh:24

benchmarks/single_node/agentic/minimaxm2.5_fp8_b300.sh:28
all contain if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi.

Step-by-step proof (kimik2.5-fp4-b300-vllm-agentic, post-PR)

Job dispatched with runner: b300 lands on b300-nv_3 (say). RUNNER_NAME=b300-nv_3.

benchmark-tmpl.yml:188 runs bash ./runners/launch_b300-nv.sh (${RUNNER_NAME%%_*} → b300-nv).

multinode: false, so the launcher takes the else branch at line 281.

New block at lines 290-294 executes: MODEL=nvidia/Kimi-K2.5-NVFP4 → ${MODEL##*/} = Kimi-K2.5-NVFP4, so MODEL=/scratch/models/Kimi-K2.5-NVFP4.

Launcher then sruns benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh (SCENARIO_SUBDIR=agentic/ for scenario-type=agentic-coding).

The script's line 24 (if [[ "$MODEL" != /* ]]; then hf download ...) is now false because MODEL is absolute, so no download.

vllm serve "$MODEL" ... is invoked. It fails unless /scratch/models/Kimi-K2.5-NVFP4 is already staged on the cluster — a contract this PR neither documents nor verifies for the out-of-scope set.

Concrete behavior change vs. main

Config Pre-PR MODEL Post-PR MODEL New behavior

dsv4-fp4-b300-vllm-agentic /data/models/dsv4-pro (rewritten by MODEL_PREFIX == dsv4 branch) /scratch/models/DeepSeek-V4-Pro path + basename changed (dsv4-pro → DeepSeek-V4-Pro)

kimik2.5-fp4-b300-vllm-agentic nvidia/Kimi-K2.5-NVFP4 (no rewrite — fell through, hf download fetched it) /scratch/models/Kimi-K2.5-NVFP4 newly broken — was downloading via hf, now staged-or-fail

minimaxm2.5-fp8-b300-vllm-agentic MiniMaxAI/MiniMax-M2.5 (no rewrite — same) /scratch/models/MiniMax-M2.5 newly broken — was downloading via hf, now staged-or-fail

The kimik2.5 and minimaxm2.5 agentic cases are the most concerning: the pre-PR launcher did not touch them at all, so this PR newly couples them to staging on the B300 cluster, and the dsv4 case silently changes the basename.

Impact

If /scratch/models/{Kimi-K2.5-NVFP4,MiniMax-M2.5,DeepSeek-V4-Pro} are not pre-staged with those exact basenames, all three out-of-scope agentic configs will fail on first run after merge with model-not-found errors, in violation of the PR's own scope statement. Even if staging happens to exist (e.g. via overlap with the in-scope dsv4 staging), the contract is implicit rather than documented.

Fix

Either (a) scope the rewrite to non-agentic runs, e.g.

if [[ -n "$MODEL" && "$MODEL" != /* && "$SCENARIO_TYPE" != "agentic-coding" ]]; then export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}" fi

or (b) explicitly include the three agentic configs in the migration, confirm /scratch/models/{Kimi-K2.5-NVFP4,MiniMax-M2.5,DeepSeek-V4-Pro} are staged, and remove hf download from the corresponding agentic bench scripts.

claude · 2026-05-20T23:35:34Z


 nvidia-smi

-if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi



🔴 The reduced search-space for dsr1-fp8-b300-sglang is a single point at tp=4, isl=1024, osl=1024, but benchmarks/single_node/dsr1_fp8_b300.sh lines 47-51 contain a guard that does exit 1 when TP=4 is combined with anything other than ISL=8192/OSL=1024. Every run of this config will exit before the server launches, defeating the wiring-verification goal stated in the PR description. Fix by changing the sweep point to tp: 8, switching to isl: 8192, or relaxing the script guard.

Extended reasoning...

What the bug is

The PR shrinks the dsr1-fp8-b300-sglang sweep in .github/configs/nvidia-master.yaml to one point:

- isl: 1024 osl: 1024 search-space: - { tp: 4, ep: 1, conc-start: 4, conc-end: 4 }

But the bench script the launcher resolves for this config, benchmarks/single_node/dsr1_fp8_b300.sh, contains an explicit guard:

elif [[ $TP -eq 4 ]]; then if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!" exit 1 fi

With TP=4 and ISL=1024, $ISL -ne 8192 short-circuits true, the script prints the rejection message and exits 1 before sglang.launch_server is ever invoked.

Code path

runners/launch_b300-nv.sh (agg branch) resolves the bench script as benchmarks/single_node/dsr1_fp8_b300_sglang.sh first, then falls back to dsr1_fp8_b300.sh (LEGACY_FW_SUFFIX is empty for sglang). Only dsr1_fp8_b300.sh exists, so that is what runs.

.github/workflows/benchmark-tmpl.yml exports TP from the matrix entry, so TP=4 is passed into the script.

The TP branch above runs and exits 1.

Why pre-existing code doesn't prevent it

Before this PR the only ISL=1024/OSL=1024 entry under dsr1-fp8-b300-sglang was tp: 8, which is handled by the TP=8 branch above the guard. TP=4 only appeared under ISL=8192 — the one combination the guard explicitly permits. The PR's reduction removed the safe TP=8 point and replaced it with the exact (TP=4, ISL=1024) combination the guard rejects.

Impact

The PR description explicitly motivates the search-space reduction as wiring verification ("Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"). Because every job for dsr1-fp8-b300-sglang will exit 1 before launching the server, the model-path wiring for this config will not be verified at all — the script bails before reaching the MODEL=… path read, nvidia-smi, or sglang.launch_server --model-path $MODEL.

Step-by-step proof

CI matrix produces a job with tp=4, ep=1, conc=4, isl=1024, osl=1024.

benchmark-tmpl.yml exports TP=4 ISL=1024 OSL=1024 CONC=4 EP_SIZE=1 into the container.

launch_b300-nv.sh (agg branch) selects benchmarks/single_node/dsr1_fp8_b300.sh and execs it.

Script runs check_env_vars (passes), prints SLURM banner, then runs nvidia-smi.

Control reaches the TP dispatch. $TP -eq 8 is false; $TP -eq 4 is true.

Inside the TP=4 branch: $ISL -ne 8192 → true (ISL is 1024). Short-circuit OR makes the outer test true.

Script prints TP=4 not yet supported for ISL=1024 OSL=1024! and exit 1.

sglang.launch_server is never invoked; the universal MODEL rewrite the PR adds is never exercised.

Fix

Three equivalent options:

Change the search-space entry to tp: 8 (matches the previous ISL=1024 sweep).

Change isl to 8192 (the one ISL the TP=4 branch permits).

Relax the guard in dsr1_fp8_b300.sh so TP=4 supports ISL=1024 (this would require validating the recipe at that shape, since the existing TP=4 tuning was only for 8192/1024).

github-actions · 2026-05-21T00:07:47Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26195653467
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26195653467

github-actions · 2026-05-21T01:14:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26197296230
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26197296230

github-actions · 2026-05-21T05:25:02Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26197296230
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26197296230

- launcher (launch_b300-nv.sh, agg block): switch HF_HUB_CACHE_MOUNT from /data/models to /scratch/models/ (audit-recommended root), export HF_HUB_CACHE to container's ~/.cache/huggingface for eval datasets, and replace the Qwen-FP8/dsv4 special-case rewrite with a universal MODEL rewrite from HF id (org/name) to absolute local path under the mount. - bench scripts: drop "hf download $MODEL" from 21 in-scope B300 scripts; contract is now staged-or-fail. - configs: pin 23 in-scope B300 keys to runner: b300-nv so jobs route only to the NVIDIA runner, and reduce search-space to a single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end. Out of scope: disagg/multi-node block, multi-node B300 configs (dynamo-trt/-vllm), and agentic configs/bench scripts.

Launcher resolves MODEL_PATH from a B300 staged allow-list: pre-staged models map to read-only /scratch/models/<basename>; everything else maps to writable /data/models/<basename> (now bind-mounted into the srun container). Bench scripts run hf download --local-dir "$MODEL_PATH" (idempotent: hf manages a .cache/huggingface/ inside --local-dir and skips up-to-date files) then serve weights from $MODEL_PATH; when MODEL_PATH is unset (stand-alone runs) they fall back to the default HF cache and serve the HF id directly. $MODEL stays the HF id for the client (tokenizer, --served-model-name).

github-actions · 2026-05-21T21:11:01Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26249925931
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26249925931

…ize flag style - Add --served-model-name $MODEL to vllm and sglang launches so the API name stays the HF id even though the server loads weights from $MODEL_PATH. Without this, vllm strictly rejects client requests (HTTP 404) when the request's model field doesn't match the served name (= the local path). sglang and trtllm are lenient about the mismatch. - Drop the hardcoded --served-model-name "Qwen/Qwen3.5-397B-A17B" in the qwen3.5-bf16 scripts in favor of $MODEL. - Guard hf download with a dir-non-empty check: hf download --local-dir writes per-file metadata locks under <local-dir>/.cache/huggingface/download/ even on no-op runs, which fails on the read-only /scratch/models/ mount. - Normalize all server-launch flags to --flag value (space) style; sglang scripts previously used --flag=value. - nvidia-master.yaml: dsr1-fp8-b300-sglang 1k1k tp 4 -> 8 (the script's benchmarks/single_node/dsr1_fp8_b300.sh:57-61 guard doesn't support tp=4 at isl=osl=1024).

github-actions · 2026-05-21T23:14:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26256331107
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26256331107

github-actions · 2026-05-22T00:13:45Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26256331107
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26256331107

Ankur-singh requested a review from a team May 20, 2026 23:19

Ankur-singh requested review from jgangani and kedarpotdar-nv as code owners May 20, 2026 23:19

github-project-automation Bot added this to InferenceMAX Board May 20, 2026

Ankur-singh added a commit that referenced this pull request May 20, 2026

perf-changelog: set PR link to #1539

17119f4

Ankur-singh added the sweep-enabled label May 20, 2026

claude Bot reviewed May 20, 2026

View reviewed changes

Ankur-singh added a commit that referenced this pull request May 21, 2026

perf-changelog: set PR link to #1539

bfddde2

Ankur-singh force-pushed the fix-b300-model-path branch from 17119f4 to a0fa8c4 Compare May 21, 2026 00:06

Ankur-singh added 4 commits May 21, 2026 12:56

perf-changelog: set PR link to #1539

629c2ac

revert runner to b300 (drop b300-nv suffix)

b46504a

Ankur-singh force-pushed the fix-b300-model-path branch from a0fa8c4 to 5d2e1d4 Compare May 21, 2026 20:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NV] B300 (Agg): migrate model path#1539

[NV] B300 (Agg): migrate model path#1539
Ankur-singh wants to merge 5 commits into
mainfrom
fix-b300-model-path

Ankur-singh commented May 20, 2026

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

claude Bot May 20, 2026

Uh oh!

claude Bot May 20, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Config	Pre-PR MODEL	Post-PR MODEL	New behavior
dsv4-fp4-b300-vllm-agentic	`/data/models/dsv4-pro` (rewritten by `MODEL_PREFIX == dsv4` branch)	`/scratch/models/DeepSeek-V4-Pro`	path + basename changed (dsv4-pro → DeepSeek-V4-Pro)
kimik2.5-fp4-b300-vllm-agentic	`nvidia/Kimi-K2.5-NVFP4` (no rewrite — fell through, `hf download` fetched it)	`/scratch/models/Kimi-K2.5-NVFP4`	newly broken — was downloading via hf, now staged-or-fail
minimaxm2.5-fp8-b300-vllm-agentic	`MiniMaxAI/MiniMax-M2.5` (no rewrite — same)	`/scratch/models/MiniMax-M2.5`	newly broken — was downloading via hf, now staged-or-fail


		nvidia-smi

		if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

Conversation

Ankur-singh commented May 20, 2026

Summary

Out of scope

Uh oh!

github-actions Bot commented May 20, 2026

Uh oh!

claude Bot May 20, 2026

Choose a reason for hiding this comment

What the bug is

Why the agentic configs flow through the same launcher

Step-by-step proof (kimik2.5-fp4-b300-vllm-agentic, post-PR)

Concrete behavior change vs. main

Impact

Fix

Uh oh!

claude Bot May 20, 2026

Choose a reason for hiding this comment

What the bug is

Code path

Why pre-existing code doesn't prevent it

Impact

Step-by-step proof

Fix

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant