[GB300][SGLang] Bump SGLang image for dsv4-fp4-gb300-dynamo-sglang-mtp#1559
Fridge003 wants to merge 5 commits into
Update SGLang container image from nightly-dev-cu13-20260510-2473659e to nightly-dev-20260522-c9153da5 across all DeepSeek-V4 8k1k disagg recipes.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
  model:
    path: "deepseek-v4-pro"
-   container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"
+   container: "lmsysorg/sglang:nightly-dev-20260522-c9153da5"
🔴 The 6 recipe YAMLs are bumped to lmsysorg/sglang:nightly-dev-20260522-c9153da5, but the matching image: field on the dsv4-fp4-gb300-dynamo-sglang-mtp block in .github/configs/nvidia-master.yaml (line 9073) is left at the stale lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. Per AGENTS.md the two must be bumped in lockstep — the launcher uses image: as the container-alias key, so without this update CI will still import/run the old image and the perf-changelog claim is untrue. Fix: bump line 9073 of nvidia-master.yaml to the same nightly-dev-20260522-c9153da5 tag.
Extended reasoning...
What's wrong
This PR bumps model.container in all six benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/*-mtp.yaml files from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:nightly-dev-20260522-c9153da5, and adds a perf-changelog entry that explicitly claims the image was updated for dsv4-fp4-gb300-dynamo-sglang-mtp. However .github/configs/nvidia-master.yaml line 9073 (the image: field on the dsv4-fp4-gb300-dynamo-sglang-mtp block) still reads lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034 — an even older 20260509 tag from before the previous bump.
Why this matters
AGENTS.md line 115 documents the invariant explicitly: multi-node srt-slurm changes must edit the recipe yaml AND nvidia-master.yaml together, and for image bumps model.container must equal image: because the launcher uses the latter as the container-alias key. Concretely, .github/workflows/profile.yml reads matrix.config.image from nvidia-master.yaml into the IMAGE env var, and runners/launch_gb300-cw.sh uses it both to build/import the enroot squash file (enroot import -o ... docker://$image) and to register the alias in the generated srtslurm.yaml containers map (${IMAGE}: ${SQUASH_FILE}). The recipe's container: is then matched against that alias by srtctl.
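The alias mechanism described above can be modeled in a few lines. This is only an illustrative sketch: the helper names `register_alias` and `resolve_container` and the squash-file path are hypothetical, and the real launcher and srtctl may behave differently when the lookup misses.

```python
from typing import Dict, Optional


def register_alias(image: str, squash_file: str) -> Dict[str, str]:
    """Model the containers map: the image tag from nvidia-master.yaml is the key."""
    return {image: squash_file}


def resolve_container(containers: Dict[str, str], recipe_container: str) -> Optional[str]:
    """Return the pre-staged squash file for a recipe, or None on an alias miss."""
    return containers.get(recipe_container)


# Tags taken from the review comment; the squash path is a made-up placeholder.
master_image = "lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034"
recipe_container = "lmsysorg/sglang:nightly-dev-20260522-c9153da5"

containers = register_alias(master_image, "/path/to/sglang.sqsh")
stale_lookup = resolve_container(containers, recipe_container)  # alias miss
matched_lookup = resolve_container(containers, master_image)    # alias hit
```

Because the map is keyed by the exact image string, any difference between the two fields produces a miss, which is why the two values have to move together.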
Precedent
The sibling non-MTP PR #1528 (commit 59980fe) for dsv4-fp4-gb300-dynamo-sglang updated BOTH .github/configs/nvidia-master.yaml AND the recipe YAMLs in lockstep. After that PR, the non-MTP block at line 8760 sits at nightly-dev-cu13-20260520-425dffbd matching its recipe — a consistent lockstep. The MTP variant has now diverged 13 days from its recipe, and the new tag has dropped the cu13 prefix.
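The 13-day figure can be reproduced from the tags themselves. The sketch below assumes the eight digits before the commit hash are a YYYYMMDD build date, which is how the tags read but is not confirmed by any SGLang documentation quoted here; `tag_date` and `tag_gap_days` are hypothetical helper names.

```python
import re
from datetime import date


def tag_date(image: str) -> date:
    """Extract the YYYYMMDD build date assumed to be embedded in a nightly tag."""
    match = re.search(r"-(\d{4})(\d{2})(\d{2})-[0-9a-f]+$", image)
    if match is None:
        raise ValueError(f"no build date found in {image!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def tag_gap_days(old_image: str, new_image: str) -> int:
    """Number of days between the build dates of two nightly tags."""
    return (tag_date(new_image) - tag_date(old_image)).days


gap = tag_gap_days(
    "lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034",
    "lmsysorg/sglang:nightly-dev-20260522-c9153da5",
)
```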
Step-by-step proof of impact
- CI launches the sweep; profile.yml reads matrix.config.image from nvidia-master.yaml → IMAGE=lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034.
- runners/launch_gb300-cw.sh runs enroot import -o $SQUASH_FILE docker://$IMAGE, squashing the 20260509 image.
- The generated srtslurm.yaml registers containers: { "${IMAGE}": ${SQUASH_FILE} }, keyed by the 20260509 tag.
- srtctl loads the recipe yaml and sees model.container: lmsysorg/sglang:nightly-dev-20260522-c9153da5, which does not match the alias.
- Result is one of: (a) srtctl falls back to a fresh docker pull of the 20260522 image at runtime (defeating the pre-stage), (b) the alias mismatch causes a launch failure, or (c) the bench actually runs against the 20260509 squash file, invalidating the perf-changelog claim. All three are bad outcomes.
Fix
Bump .github/configs/nvidia-master.yaml line 9073 from lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034 to lmsysorg/sglang:nightly-dev-20260522-c9153da5 in this PR, matching the recipe container: values and the lockstep pattern established by PR #1528.
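A lockstep check along these lines could catch the mismatch before review. It is a minimal sketch, not an existing script in the repository: `recipe_container`, `find_lockstep_violations`, and the recipe file names are hypothetical, and the regex only handles a simple single-line container: field.

```python
import re
from typing import Dict, List

# Matches a line such as:   container: "lmsysorg/sglang:<tag>"
CONTAINER_RE = re.compile(r'^\s*container:\s*"?([^"\s]+)"?\s*$', re.MULTILINE)


def recipe_container(recipe_text: str) -> str:
    """Pull the model.container value out of a recipe's YAML text."""
    match = CONTAINER_RE.search(recipe_text)
    if match is None:
        raise ValueError("recipe has no container: field")
    return match.group(1)


def find_lockstep_violations(master_image: str, recipes: Dict[str, str]) -> List[str]:
    """Return the recipe names whose container differs from the master image."""
    return sorted(
        name for name, text in recipes.items()
        if recipe_container(text) != master_image
    )


recipe_text = (
    'model:\n'
    '  path: "deepseek-v4-pro"\n'
    '  container: "lmsysorg/sglang:nightly-dev-20260522-c9153da5"\n'
)
recipes = {"example-a-mtp.yaml": recipe_text, "example-b-mtp.yaml": recipe_text}

stale = find_lockstep_violations(
    "lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034", recipes
)
fixed = find_lockstep_violations(
    "lmsysorg/sglang:nightly-dev-20260522-c9153da5", recipes
)
```

With the stale image every recipe is reported; after the proposed bump the list is empty.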
      - dsv4-fp4-gb300-dynamo-sglang-mtp
    description:
      - "Update SGLang image from nightly-dev-cu13-20260510-2473659e to nightly-dev-20260522-c9153da5"
    pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1559
🟡 The new perf-changelog.yaml entry for dsv4-fp4-gb300-dynamo-sglang-mtp has an unresolved template placeholder: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER. This link will 404 when clicked — please substitute the real PR number (1559) before merging.
Extended reasoning...
What the bug is
The newly-added entry in perf-changelog.yaml at line 3137 ends with:
- config-keys:
    - dsv4-fp4-gb300-dynamo-sglang-mtp
  description:
    - "Update SGLang image from nightly-dev-cu13-20260510-2473659e to nightly-dev-20260522-c9153da5"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

The literal string PLACEHOLDER was never replaced with the actual PR number. (Note: the rendered PR diff in some GitHub views may show pull/1559, but the actual committed file content, which is what reviewers will merge, contains PLACEHOLDER. git show f66004e -- perf-changelog.yaml confirms the committed diff added the literal PLACEHOLDER text.)
Why existing code doesn't catch it
A repo-wide grep for PLACEHOLDER returns only this one location, so there is no post-merge template-substitution step that would replace it. Schema validation (e.g. utils/matrix_logic/validation.py) only checks that pr-link is a string — it does not validate URL format or that the path resolves, so the entry passes validation cleanly while still producing a broken link.
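A stricter check would reject this entry. The following is a hedged sketch of what such a rule could look like; it is not the code in utils/matrix_logic/validation.py, and `PR_LINK_RE` and `is_valid_pr_link` are hypothetical names.

```python
import re

# Accept only GitHub pull-request URLs that end in a numeric PR id.
PR_LINK_RE = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")


def is_valid_pr_link(value: object) -> bool:
    """True when value is a string shaped like .../<owner>/<repo>/pull/<number>."""
    return isinstance(value, str) and PR_LINK_RE.fullmatch(value) is not None


placeholder_ok = is_valid_pr_link(
    "https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER"
)
real_ok = is_valid_pr_link(
    "https://github.com/SemiAnalysisAI/InferenceX/pull/1559"
)
```

This only validates the URL shape; it does not confirm that the PR exists, which would need a network call.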
Impact
This is documentation-only — there is no runtime effect. Anyone clicking the link from the changelog to find the originating PR for the dsv4-fp4-gb300-dynamo-sglang-mtp image bump will hit a 404. Every other recent entry in this file uses the real PR number (e.g. pull/1554, pull/1555, pull/1516, pull/1514), so this entry breaks the established convention.
Fix
Replace PLACEHOLDER with 1559 (this PR's number):
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1559
Step-by-step proof
- Open perf-changelog.yaml and look at line 3137. Actual content: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER.
- Run git show f66004e -- perf-changelog.yaml: the added line is + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER, i.e. the committed diff contains the placeholder literally.
- Run grep -r PLACEHOLDER over the repo: PLACEHOLDER appears in no other file and in no workflow under .github/, so there is no substitution mechanism that would replace it post-merge.
- Navigate to https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER: GitHub returns 404 (the path is not a valid PR number).
- Drop SGLANG_OPT_USE_JIT_NORM, SGLANG_OPT_USE_JIT_INDEXER_METADATA, SGLANG_OPT_USE_TOPK_V2 (now default-on in latest sglang).
- Drop the MegaMoE companion envs that sglang now auto-sets when SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE is enabled: SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE, SGLANG_OPT_FIX_HASH_MEGA_MOE, SGLANG_OPT_FIX_MEGA_MOE_MEMORY, SGLANG_OPT_FIX_NEXTN_MEGA_MOE, SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
- Drop SGLANG_RADIX_DISABLE_REUSE and SGLANG_OPT_USE_FAST_MASK_EP, which no longer exist in sglang's environ.py.
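The cleanup in this commit amounts to filtering a recipe's environment block against a list of retired names. A rough sketch, assuming the environment is available as a plain dict: the variable names are copied from the commit message, while `prune_env`, the sample values, and `EXAMPLE_KEPT_VAR` are hypothetical.

```python
from typing import Dict

# Names listed in the commit message as dropped from the recipes.
REMOVED_ENVS = frozenset({
    "SGLANG_OPT_USE_JIT_NORM",
    "SGLANG_OPT_USE_JIT_INDEXER_METADATA",
    "SGLANG_OPT_USE_TOPK_V2",
    "SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE",
    "SGLANG_OPT_FIX_HASH_MEGA_MOE",
    "SGLANG_OPT_FIX_MEGA_MOE_MEMORY",
    "SGLANG_OPT_FIX_NEXTN_MEGA_MOE",
    "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK",
    "SGLANG_RADIX_DISABLE_REUSE",
    "SGLANG_OPT_USE_FAST_MASK_EP",
})


def prune_env(env: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of env without the retired variables."""
    return {name: value for name, value in env.items() if name not in REMOVED_ENVS}


# Sample input with made-up values; EXAMPLE_KEPT_VAR stands in for any other setting.
sample_env = {
    "SGLANG_OPT_USE_JIT_NORM": "1",
    "SGLANG_RADIX_DISABLE_REUSE": "1",
    "EXAMPLE_KEPT_VAR": "1",
}
pruned = prune_env(sample_env)
```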
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26315949011
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26316947560
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26317022901
- Update dynamo commit hash to 81d0555ee23519cea80a42b4fe824e30368b7300 across all 6 dsv4 8k1k disagg recipes.
- Quote moe-a2a-backend value as "megamoe" for consistency with other string fields.
- Remove the now-unused deepep-config entries; megamoe doesn't read them.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26318696409
Summary
Update the SGLang container image from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:nightly-dev-20260522-c9153da5 across all six DeepSeek-V4 8k1k disagg recipes.