Skip to content

[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp8-h200-sglang (+mtp) SGLang image to v0.5.12-cu130#1460

Open
functionstackx wants to merge 4 commits into
mainfrom
update-dsv4-fp8-h200-sglang-v0.5.12
Open

[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp8-h200-sglang (+mtp) SGLang image to v0.5.12-cu130#1460
functionstackx wants to merge 4 commits into
mainfrom
update-dsv4-fp8-h200-sglang-v0.5.12

Conversation

@functionstackx
Copy link
Copy Markdown
Collaborator

Summary

  • Bumps dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp from SHA-pinned lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc... custom DSV4 build (15/14d old) to lmsysorg/sglang:v0.5.12-cu130.
  • ⚠️ Note: the deepseek-v4-hopper tag is a custom DSV4 build; the generic v0.5.12-cu130 may or may not retain DSV4-specific features. Verify via sweep.

Test plan

  • Full sweep passes with full-sweep-enabled label.

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward image bump that the full-sweep label will validate; the inline comment flags a stale rationale comment as a nit.

Extended reasoning...

Overview

This PR is a small, mechanical config change: it bumps the image: field for two recipes (dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp) in .github/configs/nvidia-master.yaml from a SHA-pinned lmsysorg/sglang:deepseek-v4-hopper custom build to the generic lmsysorg/sglang:v0.5.12-cu130 tag, and adds a corresponding perf-changelog.yaml entry.

Security risks

None. This is a YAML config change with no logic, secrets, auth, or permission surface. The new image tag belongs to the same trusted upstream registry already used across the repo.

Level of scrutiny

Low. The change is two literal string edits plus a changelog entry — exactly the kind of mechanical sweep PR the full-sweep-enabled label is designed to validate. Functional correctness (does v0.5.12-cu130 retain DSV4-specific features?) is verified by the sweep itself, not by code review.

Other factors

The bug hunter surfaced one nit: a comment block above the bumped recipes still references the now-removed deepseek-v4-hopper image and its /ix-mount rationale, and the substring branch at runners/launch_h200-dgxc-slurm.sh:308 that gates /ix vs /workspace on that image name is now effectively dead. These are documentation-rot / dead-code cleanups, not functional defects, and don't justify blocking a routine image bump. The inline comment already flags this for the author to address opportunistically.


dsv4-fp8-h200-sglang:
image: lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc092e47a10fac2e41f47eab78970280d06648b8e50d312a82f0ae722f
image: lmsysorg/sglang:v0.5.12-cu130
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The comment at .github/configs/nvidia-master.yaml:2891-2893 justifies the h200-dgxc-slurm runner pinning on the grounds that 'the deepseek-v4-hopper image needs the /ix mount layout that only launch_h200-dgxc-slurm.sh sets up' — but this PR removes the only references to that image, making the rationale stale and misleading. Please either update/remove the comment, or explicitly note the runner pinning is now retained for a different reason (and consider whether it can be relaxed, since the conditional in runners/launch_h200-dgxc-slurm.sh:308 that gates /ix vs /workspace on the deepseek-v4-hopper substring is now dead code).

Extended reasoning...

What the bug is

The PR bumps both dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp from lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc... to the generic lmsysorg/sglang:v0.5.12-cu130. Sitting just above the first updated entry, at .github/configs/nvidia-master.yaml:2891-2893, is an explanatory comment:

Pinned to the h200-dgxc-slurm runner pool because the deepseek-v4-hopper image needs the /ix mount layout that only launch_h200-dgxc-slurm.sh sets up.

After this PR, no config in the repo references a deepseek-v4-hopper image, so the documented rationale no longer applies. The comment now describes a constraint that does not exist in the current state of the repo.

Why this matters

A future maintainer reading this block will be told the runner pin exists because of a /ix-mount requirement tied to the deepseek-v4-hopper image. That is no longer true. They may either:

  1. Believe the constraint still applies and propagate it to other configs, or
  2. Be unable to safely relax the runner: h200-dgxc pin because they cannot trust the now-incorrect rationale.

The companion runtime code at runners/launch_h200-dgxc-slurm.sh:308 (if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then CONTAINER_MOUNT_DIR=/ix; else CONTAINER_MOUNT_DIR=/workspace; fi) is also affected: after this PR the *deepseek-v4-hopper* substring will never match any image in the configs, so the /ix branch is dead. These two recipes will now be mounted at /workspace rather than /ix at runtime.

Step-by-step proof

  1. Before this PR: the comment at lines 2891-2893 explains why the next two recipes (dsv4-fp8-h200-sglang and dsv4-fp8-h200-sglang-mtp at lines 2917 and 2941) use runner: h200-dgxc, referencing the deepseek-v4-hopper image's /ix-mount requirement.
  2. This PR changes both image: fields from lmsysorg/sglang:deepseek-v4-hopper@sha256:7f19c6dc... to lmsysorg/sglang:v0.5.12-cu130.
  3. After this PR, grep for deepseek-v4-hopper in .github/configs/nvidia-master.yaml finds only line 2892 — the comment text itself. No config still uses that image.
  4. At runtime, launch_h200-dgxc-slurm.sh:308 does IMAGE=$IMAGE; if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then CONTAINER_MOUNT_DIR=/ix; else CONTAINER_MOUNT_DIR=/workspace; fi. With IMAGE=lmsysorg/sglang:v0.5.12-cu130, the substring match fails and the else branch runs, so the mount point is now /workspace rather than /ix. The documented /ix behavior no longer occurs for these recipes.

Impact

No functional impact on this sweep — the new generic image works fine with the /workspace mount the else branch provides. This is a documentation-rot / dead-code issue introduced by this PR.

How to fix

Pick one:

  • Remove the comment at lines 2891-2893 (and consider whether runner: h200-dgxc can be relaxed to a less restrictive pool now that the /ix constraint is gone).
  • Rewrite the comment to capture the actual current reason for h200-dgxc pinning (if there is one — e.g. Slurm/enroot infrastructure availability rather than mount layout).
  • Optionally remove the now-dead *deepseek-v4-hopper* branch in runners/launch_h200-dgxc-slurm.sh:308 as a drive-by cleanup.

@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx functionstackx changed the title [Klaud Cold] Update dsv4-fp8-h200-sglang (+mtp) SGLang image to v0.5.12-cu130 [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp8-h200-sglang (+mtp) SGLang image to v0.5.12-cu130 May 18, 2026
@functionstackx
Copy link
Copy Markdown
Collaborator Author

Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one.

AI-generated via Claude Code /loop.

@github-actions
Copy link
Copy Markdown
Contributor

@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

2 participants