Skip to content

[NV] B300 (Agg): migrate model path#1539

Open
Ankur-singh wants to merge 5 commits into
mainfrom
fix-b300-model-path
Open

[NV] B300 (Agg): migrate model path#1539
Ankur-singh wants to merge 5 commits into
mainfrom
fix-b300-model-path

Conversation

@Ankur-singh
Copy link
Copy Markdown
Collaborator

Summary

  • Launcher (runners/launch_b300-nv.sh agg block): switch HF_HUB_CACHE_MOUNT from /data/models to /scratch/models/ (audit-recommended root), export HF_HUB_CACHE to container's ~/.cache/huggingface for eval dataset downloads, and replace the partial Qwen-FP8/dsv4 special-case rewrite with a universal MODEL rewrite from HF id (org/name) to absolute local path under the mount.
  • Bench scripts (21 in-scope B300 scripts): drop hf download "$MODEL". Contract is now staged-or-fail. qwen3.5_fp8_b300{,_mtp}.sh already lacked the line (legacy of the prior special-case).
  • Configs (23 in-scope keys): pin runner: b300-nv so jobs route only to the NVIDIA runner, and reduce search-space to a single (isl=1024, osl=1024, conc=4) point per config. To revert to full sweep after wiring is confirmed.

Out of scope

  • Disagg/multi-node block in the launcher
  • Multi-node B300 configs (dsr1-fp{4,8}-b300-dynamo-trt, dsv4-fp4-b300-dynamo-vllm)
  • Agentic configs (*-b300-vllm-agentic) and agentic/*_b300*.sh bench scripts
  • Other clusters (H100/H200/B200/GB200/GB300/MI3xx)

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread runners/launch_b300-nv.sh Outdated
Comment on lines +283 to +295
HF_HUB_CACHE_MOUNT="/scratch/models/"

# HF_HUB_CACHE is set to help with dataset download inside the container
# for eval jobs. Can be updated to some other path on the cluster and
# mounted just like HF_HUB_CACHE_MOUNT.
export HF_HUB_CACHE="$HOME/.cache/huggingface"

# Rewrite MODEL from HF id (org/name) to the pre-staged local path under
# HF_HUB_CACHE_MOUNT. Skip if MODEL is already an absolute path.
if [[ -n "$MODEL" && "$MODEL" != /* ]]; then
export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}"
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new universal MODEL rewrite (launch_b300-nv.sh lines 290-294) also fires for the three *-b300-vllm-agentic configs that the PR description explicitly lists as out of scope. Because every single-node B300 job goes through this launcher (runner label b300 maps to physical b300-nv_X nodes in .github/configs/runners.yaml, and benchmark-tmpl.yml dispatches via ${RUNNER_NAME%%_*}), the agentic configs get MODEL rewritten to an absolute /scratch/models/<basename> path; the agentic bench scripts still gate hf download on MODEL != /* so the fallback is skipped, and they now silently require pre-staged copies at /scratch/models/{DeepSeek-V4-Pro,Kimi-K2.5-NVFP4,MiniMax-M2.5} — a contract this PR never establishes. Scope the rewrite (e.g. guard on SCENARIO_TYPE != agentic-coding or IS_AGENTIC == 0) or explicitly include the agentic configs in the migration with documented staging.

Extended reasoning...

What the bug is

The PR replaces the old partial rewrite (Qwen-FP8 literal + MODEL_PREFIX == dsv4) with a universal rewrite in runners/launch_b300-nv.sh:

HF_HUB_CACHE_MOUNT="/scratch/models/"
if [[ -n "$MODEL" && "$MODEL" != /* ]]; then
    export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}"
fi

The PR description explicitly lists *-b300-vllm-agentic configs and agentic/*_b300*.sh bench scripts as out of scope, yet this rewrite runs for them too.

Why the agentic configs flow through the same launcher

  1. .github/configs/runners.yaml:136-145 defines label b300 as the physical runners b300-nv_0..8. There is no separate b300 runner pool.
  2. .github/workflows/benchmark-tmpl.yml:188 runs bash ./runners/launch_${RUNNER_NAME%%_*}.sh, where RUNNER_NAME=${{ runner.name }} (line 177). Since runner.name is the physical name b300-nv_X, this resolves to launch_b300-nv.sh regardless of whether the config says runner: b300 or runner: b300-nv.
  3. Only runners/launch_b300-nv.sh exists — there is no launch_b300.sh. (The pre-PR code already contained an elif [[ $MODEL_PREFIX == 'dsv4' ]] branch that clearly assumed the dsv4 agentic config flows through this launcher.)
  4. The three agentic configs are NOT retagged in this PR — they still set runner: b300:
  • nvidia-master.yaml:2644 kimik2.5-fp4-b300-vllm-agentic
  • nvidia-master.yaml:2885 dsv4-fp4-b300-vllm-agentic
  • nvidia-master.yaml:4199 minimaxm2.5-fp8-b300-vllm-agentic
  1. The agentic bench scripts still gate the HF fallback download:
  • benchmarks/single_node/agentic/dsv4_fp4_b300_vllm.sh:42
  • benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh:24
  • benchmarks/single_node/agentic/minimaxm2.5_fp8_b300.sh:28
    all contain if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi.

Step-by-step proof (kimik2.5-fp4-b300-vllm-agentic, post-PR)

  1. Job dispatched with runner: b300 lands on b300-nv_3 (say). RUNNER_NAME=b300-nv_3.
  2. benchmark-tmpl.yml:188 runs bash ./runners/launch_b300-nv.sh (${RUNNER_NAME%%_*}b300-nv).
  3. multinode: false, so the launcher takes the else branch at line 281.
  4. New block at lines 290-294 executes: MODEL=nvidia/Kimi-K2.5-NVFP4${MODEL##*/} = Kimi-K2.5-NVFP4, so MODEL=/scratch/models/Kimi-K2.5-NVFP4.
  5. Launcher then sruns benchmarks/single_node/agentic/kimik2.5_fp4_b300.sh (SCENARIO_SUBDIR=agentic/ for scenario-type=agentic-coding).
  6. The script's line 24 (if [[ "$MODEL" != /* ]]; then hf download ...) is now false because MODEL is absolute, so no download.
  7. vllm serve "$MODEL" ... is invoked. It fails unless /scratch/models/Kimi-K2.5-NVFP4 is already staged on the cluster — a contract this PR neither documents nor verifies for the out-of-scope set.

Concrete behavior change vs. main

Config Pre-PR MODEL Post-PR MODEL New behavior
dsv4-fp4-b300-vllm-agentic /data/models/dsv4-pro (rewritten by MODEL_PREFIX == dsv4 branch) /scratch/models/DeepSeek-V4-Pro path + basename changed (dsv4-pro → DeepSeek-V4-Pro)
kimik2.5-fp4-b300-vllm-agentic nvidia/Kimi-K2.5-NVFP4 (no rewrite — fell through, hf download fetched it) /scratch/models/Kimi-K2.5-NVFP4 newly broken — was downloading via hf, now staged-or-fail
minimaxm2.5-fp8-b300-vllm-agentic MiniMaxAI/MiniMax-M2.5 (no rewrite — same) /scratch/models/MiniMax-M2.5 newly broken — was downloading via hf, now staged-or-fail

The kimik2.5 and minimaxm2.5 agentic cases are the most concerning: the pre-PR launcher did not touch them at all, so this PR newly couples them to staging on the B300 cluster, and the dsv4 case silently changes the basename.

Impact

If /scratch/models/{Kimi-K2.5-NVFP4,MiniMax-M2.5,DeepSeek-V4-Pro} are not pre-staged with those exact basenames, all three out-of-scope agentic configs will fail on first run after merge with model-not-found errors, in violation of the PR's own scope statement. Even if staging happens to exist (e.g. via overlap with the in-scope dsv4 staging), the contract is implicit rather than documented.

Fix

Either (a) scope the rewrite to non-agentic runs, e.g.

if [[ -n "$MODEL" && "$MODEL" != /* && "$SCENARIO_TYPE" != "agentic-coding" ]]; then
    export MODEL="${HF_HUB_CACHE_MOUNT}${MODEL##*/}"
fi

or (b) explicitly include the three agentic configs in the migration, confirm /scratch/models/{Kimi-K2.5-NVFP4,MiniMax-M2.5,DeepSeek-V4-Pro} are staged, and remove hf download from the corresponding agentic bench scripts.

Comment on lines 22 to 25

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The reduced search-space for dsr1-fp8-b300-sglang is a single point at tp=4, isl=1024, osl=1024, but benchmarks/single_node/dsr1_fp8_b300.sh lines 47-51 contain a guard that does exit 1 when TP=4 is combined with anything other than ISL=8192/OSL=1024. Every run of this config will exit before the server launches, defeating the wiring-verification goal stated in the PR description. Fix by changing the sweep point to tp: 8, switching to isl: 8192, or relaxing the script guard.

Extended reasoning...

What the bug is

The PR shrinks the dsr1-fp8-b300-sglang sweep in .github/configs/nvidia-master.yaml to one point:

- isl: 1024
  osl: 1024
  search-space:
  - { tp: 4, ep: 1, conc-start: 4, conc-end: 4 }

But the bench script the launcher resolves for this config, benchmarks/single_node/dsr1_fp8_b300.sh, contains an explicit guard:

elif [[ $TP -eq 4 ]]; then
  if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
    echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
    exit 1
  fi

With TP=4 and ISL=1024, $ISL -ne 8192 short-circuits true, the script prints the rejection message and exits 1 before sglang.launch_server is ever invoked.

Code path

  1. runners/launch_b300-nv.sh (agg branch) resolves the bench script as benchmarks/single_node/dsr1_fp8_b300_sglang.sh first, then falls back to dsr1_fp8_b300.sh (LEGACY_FW_SUFFIX is empty for sglang). Only dsr1_fp8_b300.sh exists, so that is what runs.
  2. .github/workflows/benchmark-tmpl.yml exports TP from the matrix entry, so TP=4 is passed into the script.
  3. The TP branch above runs and exits 1.

Why pre-existing code doesn't prevent it

Before this PR the only ISL=1024/OSL=1024 entry under dsr1-fp8-b300-sglang was tp: 8, which is handled by the TP=8 branch above the guard. TP=4 only appeared under ISL=8192 — the one combination the guard explicitly permits. The PR's reduction removed the safe TP=8 point and replaced it with the exact (TP=4, ISL=1024) combination the guard rejects.

Impact

The PR description explicitly motivates the search-space reduction as wiring verification ("Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"). Because every job for dsr1-fp8-b300-sglang will exit 1 before launching the server, the model-path wiring for this config will not be verified at all — the script bails before reaching the MODEL=… path read, nvidia-smi, or sglang.launch_server --model-path $MODEL.

Step-by-step proof

  1. CI matrix produces a job with tp=4, ep=1, conc=4, isl=1024, osl=1024.
  2. benchmark-tmpl.yml exports TP=4 ISL=1024 OSL=1024 CONC=4 EP_SIZE=1 into the container.
  3. launch_b300-nv.sh (agg branch) selects benchmarks/single_node/dsr1_fp8_b300.sh and execs it.
  4. Script runs check_env_vars (passes), prints SLURM banner, then runs nvidia-smi.
  5. Control reaches the TP dispatch. $TP -eq 8 is false; $TP -eq 4 is true.
  6. Inside the TP=4 branch: $ISL -ne 8192 → true (ISL is 1024). Short-circuit OR makes the outer test true.
  7. Script prints TP=4 not yet supported for ISL=1024 OSL=1024! and exit 1.
  8. sglang.launch_server is never invoked; the universal MODEL rewrite the PR adds is never exercised.

Fix

Three equivalent options:

  • Change the search-space entry to tp: 8 (matches the previous ISL=1024 sweep).
  • Change isl to 8192 (the one ISL the TP=4 branch permits).
  • Relax the guard in dsr1_fp8_b300.sh so TP=4 supports ISL=1024 (this would require validating the recipe at that shape, since the existing TP=4 tuning was only for 8192/1024).

Ankur-singh added a commit that referenced this pull request May 21, 2026
@Ankur-singh Ankur-singh force-pushed the fix-b300-model-path branch from 17119f4 to a0fa8c4 Compare May 21, 2026 00:06
@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

- launcher (launch_b300-nv.sh, agg block): switch HF_HUB_CACHE_MOUNT from /data/models to /scratch/models/ (audit-recommended root), export HF_HUB_CACHE to container's ~/.cache/huggingface for eval datasets, and replace the Qwen-FP8/dsv4 special-case rewrite with a universal MODEL rewrite from HF id (org/name) to absolute local path under the mount.
- bench scripts: drop "hf download $MODEL" from 21 in-scope B300 scripts; contract is now staged-or-fail.
- configs: pin 23 in-scope B300 keys to runner: b300-nv so jobs route only to the NVIDIA runner, and reduce search-space to a single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end.

Out of scope: disagg/multi-node block, multi-node B300 configs (dynamo-trt/-vllm), and agentic configs/bench scripts.
Launcher resolves MODEL_PATH from a B300 staged allow-list: pre-staged
models map to read-only /scratch/models/<basename>; everything else maps
to writable /data/models/<basename> (now bind-mounted into the srun
container). Bench scripts run hf download --local-dir "$MODEL_PATH"
(idempotent: hf manages a .cache/huggingface/ inside --local-dir and
skips up-to-date files) then serve weights from $MODEL_PATH; when
MODEL_PATH is unset (stand-alone runs) they fall back to the default HF
cache and serve the HF id directly. $MODEL stays the HF id for the
client (tokenizer, --served-model-name).
@Ankur-singh Ankur-singh force-pushed the fix-b300-model-path branch from a0fa8c4 to 5d2e1d4 Compare May 21, 2026 20:01
@github-actions
Copy link
Copy Markdown
Contributor

…ize flag style

- Add --served-model-name $MODEL to vllm and sglang launches so the API name
  stays the HF id even though the server loads weights from $MODEL_PATH.
  Without this, vllm strictly rejects client requests (HTTP 404) when the
  request's model field doesn't match the served name (= the local path).
  sglang and trtllm are lenient about the mismatch.
- Drop the hardcoded --served-model-name "Qwen/Qwen3.5-397B-A17B" in the
  qwen3.5-bf16 scripts in favor of $MODEL.
- Guard hf download with a dir-non-empty check: hf download --local-dir
  writes per-file metadata locks under <local-dir>/.cache/huggingface/download/
  even on no-op runs, which fails on the read-only /scratch/models/ mount.
- Normalize all server-launch flags to --flag value (space) style; sglang
  scripts previously used --flag=value.
- nvidia-master.yaml: dsr1-fp8-b300-sglang 1k1k tp 4 -> 8 (the script's
  benchmarks/single_node/dsr1_fp8_b300.sh:57-61 guard doesn't support tp=4
  at isl=osl=1024).
@github-actions
Copy link
Copy Markdown
Contributor

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant