[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp8-h200-sglang (+mtp) SGLang image to v0.5.12-cu130#1460
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006218660 |
There was a problem hiding this comment.
LGTM — straightforward image bump that the full-sweep label will validate; the inline comment flags a stale rationale comment as a nit.
Extended reasoning...
Overview
This PR is a small, mechanical config change: it bumps the image: field for two recipes (dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp) in .github/configs/nvidia-master.yaml from a SHA-pinned lmsysorg/sglang:deepseek-v4-hopper custom build to the generic lmsysorg/sglang:v0.5.12-cu130 tag, and adds a corresponding perf-changelog.yaml entry.
Security risks
None. This is a YAML config change with no logic, secrets, auth, or permission surface. The new image tag belongs to the same trusted upstream registry already used across the repo.
Level of scrutiny
Low. The change is two literal string edits plus a changelog entry — exactly the kind of mechanical sweep PR the full-sweep-enabled label is designed to validate. Functional correctness (does v0.5.12-cu130 retain DSV4-specific features?) is verified by the sweep itself, not by code review.
Other factors
The bug hunter surfaced one nit: a comment block above the bumped recipes still references the now-removed deepseek-v4-hopper image and its /ix-mount rationale, and the substring branch at runners/launch_h200-dgxc-slurm.sh:308 that gates /ix vs /workspace on that image name is now effectively dead. These are documentation-rot / dead-code cleanups, not functional defects, and don't justify blocking a routine image bump. The inline comment already flags this for the author to address opportunistically.
|
|
||
| dsv4-fp8-h200-sglang: | ||
| image: lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc092e47a10fac2e41f47eab78970280d06648b8e50d312a82f0ae722f | ||
| image: lmsysorg/sglang:v0.5.12-cu130 |
There was a problem hiding this comment.
🟡 The comment at .github/configs/nvidia-master.yaml:2891-2893 justifies the h200-dgxc-slurm runner pinning on the grounds that 'the deepseek-v4-hopper image needs the /ix mount layout that only launch_h200-dgxc-slurm.sh sets up' — but this PR removes the only references to that image, making the rationale stale and misleading. Please either update/remove the comment, or explicitly note the runner pinning is now retained for a different reason (and consider whether it can be relaxed, since the conditional in runners/launch_h200-dgxc-slurm.sh:308 that gates /ix vs /workspace on the deepseek-v4-hopper substring is now dead code).
Extended reasoning...
What the bug is
The PR bumps both dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp from lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc... to the generic lmsysorg/sglang:v0.5.12-cu130. Sitting just above the first updated entry, at .github/configs/nvidia-master.yaml:2891-2893, is an explanatory comment:
Pinned to the h200-dgxc-slurm runner pool because the deepseek-v4-hopper image needs the /ix mount layout that only launch_h200-dgxc-slurm.sh sets up.
After this PR, no config in the repo references a deepseek-v4-hopper image, so the documented rationale no longer applies. The comment now describes a constraint that does not exist in the current state of the repo.
Why this matters
A future maintainer reading this block will be told the runner pin exists because of a /ix-mount requirement tied to the deepseek-v4-hopper image. That is no longer true. They may either:
- Believe the constraint still applies and propagate it to other configs, or
- Be unable to safely relax the
runner: h200-dgxcpin because they cannot trust the now-incorrect rationale.
The companion runtime code at runners/launch_h200-dgxc-slurm.sh:308 (if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then CONTAINER_MOUNT_DIR=/ix; else CONTAINER_MOUNT_DIR=/workspace; fi) is also affected: after this PR the *deepseek-v4-hopper* substring will never match any image in the configs, so the /ix branch is dead. These two recipes will now be mounted at /workspace rather than /ix at runtime.
Step-by-step proof
- Before this PR: the comment at lines 2891-2893 explains why the next two recipes (dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp at lines 2917 and 2941) use
runner: h200-dgxc, referencing thedeepseek-v4-hopperimage's /ix-mount requirement. - This PR changes both
image:fields fromlmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc...tolmsysorg/sglang:v0.5.12-cu130. - After this PR, grep for
deepseek-v4-hopperin.github/configs/nvidia-master.yamlfinds only line 2892 — the comment text itself. No config still uses that image. - At runtime,
launch_h200-dgxc-slurm.sh:308doesIMAGE=$IMAGE; if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then CONTAINER_MOUNT_DIR=/ix; else CONTAINER_MOUNT_DIR=/workspace; fi. WithIMAGE=lmsysorg/sglang:v0.5.12-cu130, the substring match fails and the else branch runs, so the mount point is now/workspacerather than/ix. The documented /ix behavior no longer occurs for these recipes.
Impact
No functional impact on this sweep — the new generic image works fine with the /workspace mount the else branch provides. This is a documentation-rot / dead-code issue introduced by this PR.
How to fix
Pick one:
- Remove the comment at lines 2891-2893 (and consider whether
runner: h200-dgxccan be relaxed to a less restrictive pool now that the /ix constraint is gone). - Rewrite the comment to capture the actual current reason for h200-dgxc pinning (if there is one — e.g. Slurm/enroot infrastructure availability rather than mount layout).
- Optionally remove the now-dead
*deepseek-v4-hopper*branch inrunners/launch_h200-dgxc-slurm.sh:308as a drive-by cleanup.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006219979 |
|
Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one. AI-generated via Claude Code /loop. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26006219979 |
…sglang-v0.5.12 # Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26260426277 |
Summary
dsv4-fp8-h200-sglanganddsv4-fp8-h200-sglang-mtpfrom SHA-pinnedlmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc...custom DSV4 build (15/14d old) tolmsysorg/sglang:v0.5.12-cu130.deepseek-v4-hoppertag is a custom DSV4 build; the generic v0.5.12-cu130 may or may not retain DSV4-specific features. Verify via sweep.Test plan
full-sweep-enabledlabel.🤖 Generated with Claude Code