-
Notifications
You must be signed in to change notification settings - Fork 175
[NV] B300 (Agg): migrate model path #1539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Ankur-singh
wants to merge
5
commits into
main
Choose a base branch
from
fix-b300-model-path
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
5 commits
Select commit
Hold shift + click to select a range
6377d35
b300(agg): migrate model path to /scratch/models/, drop hf download
Ankur-singh 629c2ac
perf-changelog: set PR link to #1539
Ankur-singh b46504a
revert runner to b300 (drop b300-nv suffix)
Ankur-singh 5d2e1d4
b300(agg): add hf download fallback, separate MODEL_PATH from MODEL
Ankur-singh bb1a4ae
b300(agg): fix vllm 404s, skip hf download on read-only mount, normal…
Ankur-singh File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔴 The reduced search-space for
dsr1-fp8-b300-sglangis a single point attp=4, isl=1024, osl=1024, butbenchmarks/single_node/dsr1_fp8_b300.shlines 47-51 contain a guard that doesexit 1whenTP=4is combined with anything other thanISL=8192/OSL=1024. Every run of this config will exit before the server launches, defeating the wiring-verification goal stated in the PR description. Fix by changing the sweep point totp: 8, switching toisl: 8192, or relaxing the script guard.Extended reasoning...
What the bug is
The PR shrinks the
dsr1-fp8-b300-sglangsweep in.github/configs/nvidia-master.yamlto one point:But the bench script the launcher resolves for this config,
benchmarks/single_node/dsr1_fp8_b300.sh, contains an explicit guard:With
TP=4andISL=1024,$ISL -ne 8192short-circuits true, the script prints the rejection message and exits 1 beforesglang.launch_serveris ever invoked.Code path
runners/launch_b300-nv.sh(agg branch) resolves the bench script asbenchmarks/single_node/dsr1_fp8_b300_sglang.shfirst, then falls back todsr1_fp8_b300.sh(LEGACY_FW_SUFFIXis empty for sglang). Onlydsr1_fp8_b300.shexists, so that is what runs..github/workflows/benchmark-tmpl.ymlexportsTPfrom the matrix entry, soTP=4is passed into the script.Why pre-existing code doesn't prevent it
Before this PR the only
ISL=1024/OSL=1024entry underdsr1-fp8-b300-sglangwastp: 8, which is handled by the TP=8 branch above the guard.TP=4only appeared underISL=8192— the one combination the guard explicitly permits. The PR's reduction removed the safeTP=8point and replaced it with the exact(TP=4, ISL=1024)combination the guard rejects.Impact
The PR description explicitly motivates the search-space reduction as wiring verification ("Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"). Because every job for
dsr1-fp8-b300-sglangwill exit 1 before launching the server, the model-path wiring for this config will not be verified at all — the script bails before reaching theMODEL=…path read,nvidia-smi, orsglang.launch_server --model-path $MODEL.Step-by-step proof
tp=4, ep=1, conc=4, isl=1024, osl=1024.benchmark-tmpl.ymlexportsTP=4 ISL=1024 OSL=1024 CONC=4 EP_SIZE=1into the container.launch_b300-nv.sh(agg branch) selectsbenchmarks/single_node/dsr1_fp8_b300.shand execs it.check_env_vars(passes), prints SLURM banner, then runsnvidia-smi.$TP -eq 8is false;$TP -eq 4is true.$ISL -ne 8192→ true (ISL is 1024). Short-circuit OR makes the outer test true.TP=4 not yet supported for ISL=1024 OSL=1024!andexit 1.sglang.launch_serveris never invoked; the universal MODEL rewrite the PR adds is never exercised.Fix
Three equivalent options:
tp: 8(matches the previous ISL=1024 sweep).islto8192(the one ISL the TP=4 branch permits).dsr1_fp8_b300.shso TP=4 supports ISL=1024 (this would require validating the recipe at that shape, since the existing TP=4 tuning was only for 8192/1024).