
[GB300][SGLang] Bump SGLang image for dsv4-fp4-gb300-dynamo-sglang-mtp #1559

Open
Fridge003 wants to merge 5 commits into main from sgl_image_bump_dsv4

Collaborator

@Fridge003 Fridge003 commented May 22, 2026

Summary

  • Bump SGLang container image from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:nightly-dev-20260522-c9153da5 across all six DeepSeek-V4 8k1k disagg recipes.

Update SGLang container image from nightly-dev-cu13-20260510-2473659e
to nightly-dev-20260522-c9153da5 across all DeepSeek-V4 8k1k disagg
recipes.
@Fridge003 Fridge003 requested a review from a team May 22, 2026 22:55
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


  model:
    path: "deepseek-v4-pro"
-   container: "lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e"
+   container: "lmsysorg/sglang:nightly-dev-20260522-c9153da5"

🔴 The 6 recipe YAMLs are bumped to lmsysorg/sglang:nightly-dev-20260522-c9153da5, but the matching image: field on the dsv4-fp4-gb300-dynamo-sglang-mtp block in .github/configs/nvidia-master.yaml (line 9073) is left at the stale lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034. Per AGENTS.md the two must be bumped in lockstep — the launcher uses image: as the container-alias key, so without this update CI will still import/run the old image and the perf-changelog claim is untrue. Fix: bump line 9073 of nvidia-master.yaml to the same nightly-dev-20260522-c9153da5 tag.

Extended reasoning...

What's wrong

This PR bumps model.container in all six benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/*-mtp.yaml files from lmsysorg/sglang:nightly-dev-cu13-20260510-2473659e to lmsysorg/sglang:nightly-dev-20260522-c9153da5, and adds a perf-changelog entry that explicitly claims the image was updated for dsv4-fp4-gb300-dynamo-sglang-mtp. However .github/configs/nvidia-master.yaml line 9073 (the image: field on the dsv4-fp4-gb300-dynamo-sglang-mtp block) still reads lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034 — an even older 20260509 tag from before the previous bump.

Why this matters

AGENTS.md line 115 documents the invariant explicitly: multi-node srt-slurm changes must edit the recipe yaml AND nvidia-master.yaml together, and for image bumps model.container must equal image: because the launcher uses the latter as the container-alias key. Concretely, .github/workflows/profile.yml reads matrix.config.image from nvidia-master.yaml into the IMAGE env var, and runners/launch_gb300-cw.sh uses it both to build/import the enroot squash file (enroot import -o ... docker://$image) and to register the alias in the generated srtslurm.yaml containers map (${IMAGE}: ${SQUASH_FILE}). The recipe's container: is then matched against that alias by srtctl.
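
The lockstep invariant described above could be checked mechanically before a sweep runs. The following is a minimal sketch, not the repository's tooling: `find_values` and `check_lockstep` are hypothetical helpers, the key names `image` and `container` come from the comment above, and the line-based regex is an assumption that stands in for a real YAML parser (it ignores nesting, so it should be fed only the relevant config block).

```python
import re


def find_values(yaml_text: str, key: str) -> list[str]:
    """Return every scalar value of `key:` found by a simple line-based scan.

    Hypothetical helper: this is a regex over lines, not a YAML parser.
    """
    pattern = re.compile(
        rf'^\s*{re.escape(key)}:\s*"?([^"\s#]+)"?\s*$', re.MULTILINE
    )
    return pattern.findall(yaml_text)


def check_lockstep(master_block: str, recipe_texts: dict[str, str]) -> list[str]:
    """Return the recipe names whose container differs from the master image."""
    images = find_values(master_block, "image")
    if len(images) != 1:
        raise ValueError(f"expected exactly one image: field, found {len(images)}")
    expected = images[0]
    return [
        name
        for name, text in recipe_texts.items()
        if find_values(text, "container") != [expected]
    ]
```

Called with the `dsv4-fp4-gb300-dynamo-sglang-mtp` block from `nvidia-master.yaml` and the text of each recipe, a non-empty return value would list the recipes that have drifted from the `image:` field.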

Precedent

The sibling non-MTP PR #1528 (commit 59980fe) for dsv4-fp4-gb300-dynamo-sglang updated BOTH .github/configs/nvidia-master.yaml AND the recipe YAMLs in lockstep. After that PR, the non-MTP block at line 8760 sits at nightly-dev-cu13-20260520-425dffbd matching its recipe — a consistent lockstep. The MTP variant has now diverged 13 days from its recipe, and the new tag has dropped the cu13 prefix.

Step-by-step proof of impact

  1. CI launches the sweep, profile.yml reads matrix.config.image from nvidia-master.yaml → IMAGE=lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034.
  2. runners/launch_gb300-cw.sh runs enroot import -o $SQUASH_FILE docker://$IMAGE — squashing the 20260509 image.
  3. The generated srtslurm.yaml registers containers: { "${IMAGE}": ${SQUASH_FILE} } — keyed by the 20260509 tag.
  4. srtctl loads the recipe yaml, sees model.container: lmsysorg/sglang:nightly-dev-20260522-c9153da5 — does not match the alias.
  5. Result is one of: (a) srtctl falls back to a fresh docker pull of the 20260522 image at runtime (defeating the pre-stage), (b) the alias mismatch causes a launch failure, or (c) the bench actually runs against the 20260509 squash file — invalidating the perf-changelog claim. All three are bad outcomes.
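
The alias mismatch in steps 3 and 4 can be illustrated with a plain dictionary lookup. This is only a sketch of outcome (b): `resolve_container` and the squash-file path are hypothetical, and how srtctl actually behaves on a missing alias is not confirmed by anything in this thread.

```python
def resolve_container(containers: dict[str, str], recipe_container: str) -> str:
    """Look up the squash file registered for a recipe's container string."""
    try:
        return containers[recipe_container]
    except KeyError:
        raise LookupError(
            f"no alias registered for {recipe_container!r}; "
            f"known aliases: {sorted(containers)}"
        ) from None


# IMAGE as read from nvidia-master.yaml (the stale tag quoted in step 1).
image = "lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034"
# Hypothetical squash-file path; the real path is generated by the launcher.
containers = {image: "/path/to/sglang.sqsh"}
# model.container as written in the bumped recipes (step 4).
recipe_container = "lmsysorg/sglang:nightly-dev-20260522-c9153da5"
```

With these values, `resolve_container(containers, recipe_container)` raises `LookupError`, while looking up `image` itself succeeds. Bumping `image:` to the new tag makes both strings identical and the lookup resolves.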

Fix

Bump .github/configs/nvidia-master.yaml line 9073 from lmsysorg/sglang:nightly-dev-cu13-20260509-9ee83034 to lmsysorg/sglang:nightly-dev-20260522-c9153da5 in this PR, matching the recipe container: values and the lockstep pattern established by PR #1528.

Comment thread perf-changelog.yaml
    - dsv4-fp4-gb300-dynamo-sglang-mtp
  description:
    - "Update SGLang image from nightly-dev-cu13-20260510-2473659e to nightly-dev-20260522-c9153da5"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1559

🟡 The new perf-changelog.yaml entry for dsv4-fp4-gb300-dynamo-sglang-mtp has an unresolved template placeholder: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER. This link will 404 when clicked — please substitute the real PR number (1559) before merging.

Extended reasoning...

What the bug is

The newly-added entry in perf-changelog.yaml at line 3137 ends with:

- config-keys:
    - dsv4-fp4-gb300-dynamo-sglang-mtp
  description:
    - "Update SGLang image from nightly-dev-cu13-20260510-2473659e to nightly-dev-20260522-c9153da5"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER

The literal string PLACEHOLDER was never replaced with the actual PR number. (Note: the rendered PR diff in some GitHub views may show pull/1559, but the actual committed file content — what reviewers will merge — contains PLACEHOLDER. git show f66004e -- perf-changelog.yaml confirms the committed diff added the literal PLACEHOLDER text.)

Why existing code doesn't catch it

A repo-wide grep for PLACEHOLDER returns only this one location, so there is no post-merge template-substitution step that would replace it. Schema validation (e.g. utils/matrix_logic/validation.py) only checks that pr-link is a string — it does not validate URL format or that the path resolves, so the entry passes validation cleanly while still producing a broken link.
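
A stricter check than "is a string" would be enough to catch this class of mistake. The sketch below is an assumption about what such a check could look like, not the contents of utils/matrix_logic/validation.py: `PR_LINK_RE` and `is_valid_pr_link` are hypothetical names, and the pattern only verifies the URL shape (a numeric pull-request path on github.com), not that the PR exists.

```python
import re

# Hypothetical pattern: owner/repo followed by a positive integer PR number.
PR_LINK_RE = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/[1-9]\d*")


def is_valid_pr_link(value: object) -> bool:
    """Return True only for strings shaped like a GitHub pull-request URL."""
    return isinstance(value, str) and PR_LINK_RE.fullmatch(value) is not None
```

Under this check the `pull/1559` link passes and the `pull/PLACEHOLDER` link is rejected, because `PLACEHOLDER` is not a number.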

Impact

This is documentation-only — there is no runtime effect. Anyone clicking the link from the changelog to find the originating PR for the dsv4-fp4-gb300-dynamo-sglang-mtp image bump will hit a 404. Every other recent entry in this file uses the real PR number (e.g. pull/1554, pull/1555, pull/1516, pull/1514), so this entry breaks the established convention.

Fix

Replace PLACEHOLDER with 1559 (this PR's number):

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1559

Step-by-step proof

  1. Open perf-changelog.yaml and look at line 3137. Actual content: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER.
  2. Run git show f66004e -- perf-changelog.yaml — the added line is + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER, i.e. the committed diff contains the placeholder literally.
  3. Run grep -r PLACEHOLDER over the repo — PLACEHOLDER appears in no other file and in no workflow under .github/, so there is no substitution mechanism that would replace it post-merge.
  4. Navigate to https://github.com/SemiAnalysisAI/InferenceX/pull/PLACEHOLDER — GitHub returns 404 (path is not a valid PR number).
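
Step 3 can also be reproduced without grep. The following is a rough Python equivalent of a recursive search, assuming UTF-8 text files and skipping the `.git` directory; `find_marker` is a hypothetical helper written for illustration, not a script from the repository.

```python
from pathlib import Path


def find_marker(root: str, marker: str = "PLACEHOLDER") -> list[tuple[str, int]]:
    """Return (relative path, line number) for every line containing `marker`."""
    base = Path(root)
    hits: list[tuple[str, int]] = []
    for path in sorted(base.rglob("*")):
        rel = path.relative_to(base)
        # Skip directories and anything inside the git object store.
        if not path.is_file() or ".git" in rel.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # binary or unreadable file
        for lineno, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                hits.append((str(rel), lineno))
    return hits
```

Running `find_marker(".")` from the repository root would return one entry per matching line, so a single hit in perf-changelog.yaml corresponds to the result described in step 3.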

- Drop SGLANG_OPT_USE_JIT_NORM, SGLANG_OPT_USE_JIT_INDEXER_METADATA,
  SGLANG_OPT_USE_TOPK_V2 (now default-on in latest sglang).
- Drop the MegaMoE companion envs that sglang now auto-sets when
  SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE is enabled: SGLANG_OPT_USE_DEEPGEMM_MEGA_MOE,
  SGLANG_OPT_FIX_HASH_MEGA_MOE, SGLANG_OPT_FIX_MEGA_MOE_MEMORY,
  SGLANG_OPT_FIX_NEXTN_MEGA_MOE, SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK.
- Drop SGLANG_RADIX_DISABLE_REUSE and SGLANG_OPT_USE_FAST_MASK_EP which no
  longer exist in sglang's environ.py.
- Update dynamo commit hash to 81d0555ee23519cea80a42b4fe824e30368b7300
  across all 6 dsv4 8k1k disagg recipes.
- Quote moe-a2a-backend value as "megamoe" for consistency with other
  string fields.
- Remove the now-unused deepep-config entries; megamoe doesn't read them.
