
[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130#1455

Open
functionstackx wants to merge 1 commit into main from update-dsv4-fp4-b300-sglang-v0.5.12


@functionstackx
Collaborator

Summary

  • Bumps dsv4-fp4-b300-sglang and dsv4-fp4-b300-sglang-mtp from the SHA-pinned custom builds deepseek-v4-b300@sha256:... (20 and 18 days old, respectively) to lmsysorg/sglang:v0.5.12-cu130.
  • ⚠️ Note: the deepseek-v4-b300 tag is a custom DSV4 build; the generic v0.5.12-cu130 may or may not retain DSV4-specific features. Verify via sweep.

Test plan

  • Full sweep passes with full-sweep-enabled label.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


@functionstackx changed the title from "Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" to "[Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" on May 17, 2026
@functionstackx
Collaborator Author

Closing as not viable — same DSV4 transformers issue as #1460/#1450: the generic v0.5.12-cu130 image bundles a transformers that doesn't recognise model_type: "deepseek_v4", so the bench client crashes in AutoTokenizer.from_pretrained with KeyError: 'deepseek_v4'. Custom deepseek-v4-b300@sha256:... image bundles a patched transformers; the generic-image bump is NOT viable until sglang ships transformers with deepseek_v4 support (or the recipe ships its own pip install transformers upgrade).

Keep DSV4 b300 pinned to the SHA-pinned custom image for now. Will reopen when upstream catches up.

@functionstackx
Collaborator Author

Reopening — leaving sweep labels off so it doesn't auto-trigger while you debug + patch the recipe manually.

@github-actions
Contributor

  # Parallelisms and concurrency ranges mirror dsv4-fp4-b200-vllm.
  dsv4-fp4-b300-sglang:
-   image: lmsysorg/sglang:deepseek-v4-b300@sha256:2fec8d7958bb0d53b50d7bf04d6ae6a7de8a35503775826e0550a45dd8c3ee15
+   image: lmsysorg/sglang:v0.5.12-cu130

🔴 Bumping dsv4-fp4-b300-sglang (line 1986) and dsv4-fp4-b300-sglang-mtp (line 2027) from the SHA-pinned lmsysorg/sglang:deepseek-v4-b300@sha256:... custom image to the generic lmsysorg/sglang:v0.5.12-cu130 strips the patched transformers that registers model_type: "deepseek_v4", so AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro") will crash with KeyError: 'deepseek_v4' before the server is even probed. Either hold this change until upstream sglang ships transformers with deepseek_v4 support, or have the recipe pip install a patched transformers inside the container before invoking the bench client. The PR author has already acknowledged this in the timeline.

Extended reasoning...

What the bug is

Both modified entries (dsv4-fp4-b300-sglang at line 1986 and dsv4-fp4-b300-sglang-mtp at line 2027) swap out a SHA-pinned custom image for the generic lmsysorg/sglang:v0.5.12-cu130 image. The custom deepseek-v4-b300@sha256:... builds bundle a patched transformers that registers a model type for deepseek_v4 (the config.json of deepseek-ai/DeepSeek-V4-Pro declares model_type: "deepseek_v4"). The generic v0.5.12-cu130 image bundles the upstream transformers release, which has no deepseek_v4 entry in its model-type registry.

Code path that triggers the failure

  1. Sweep dispatcher launches a container with image: lmsysorg/sglang:v0.5.12-cu130.
  2. Bench client runs and calls AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro").
  3. transformers downloads the model repo's config.json, reads model_type: "deepseek_v4", then attempts to look it up in CONFIG_MAPPING.
  4. Upstream transformers in v0.5.12-cu130 does not have deepseek_v4 registered → KeyError: 'deepseek_v4' is raised before the SGLang server is ever probed.
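The failure mode in the steps above can be sketched without pulling the image. This is a simplified stand-in for a model-type registry lookup, not the actual transformers code; `KNOWN_MODEL_TYPES`, `resolve_config_class`, and the registry entries are hypothetical names chosen for illustration.

```python
# Simplified stand-in for a model-type registry. This is NOT the real
# transformers CONFIG_MAPPING; the entries are illustrative assumptions.
KNOWN_MODEL_TYPES = {"llama": "LlamaConfig", "deepseek_v3": "DeepseekV3Config"}


def resolve_config_class(config: dict, registry: dict) -> str:
    """Look up the config class for the model_type declared in config.json."""
    return registry[config["model_type"]]


# What the model repo's config.json declares, per the review above.
dsv4_config = {"model_type": "deepseek_v4"}

try:
    resolve_config_class(dsv4_config, KNOWN_MODEL_TYPES)
    outcome = "resolved"
except KeyError as exc:
    # A plain dict lookup raises KeyError with the missing key as its argument.
    outcome = f"KeyError: {exc}"

print(outcome)

# A registry that includes the model type (as a patched build would) resolves.
patched_registry = {**KNOWN_MODEL_TYPES, "deepseek_v4": "DeepseekV4Config"}
patched_outcome = resolve_config_class(dsv4_config, patched_registry)
```

The point of the sketch is only that the lookup fails on the client side, so the error does not depend on whether the SGLang server is healthy.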

Why existing code doesn't prevent it

The recipe scripts for these two configs only change the image tag; nothing in the recipe pipes in a pip install transformers ... upgrade to bring in deepseek-v4 support. The sister entry dsv4-fp4-b200-sglang at line 1699 is still pinned to lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc... for exactly this reason, and other DSV4 entries (e.g. trtllm variants at lines 1781/1802/3016/3039) all use specifically-tagged trtllm-deepseek-v4:feat-deepseek_v4-9aa3715 images. Every DSV4 config in this file requires a special image with deepseek_v4 support — the b300 sglang variants are no exception.
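A rough sketch of how this invariant could be linted, assuming config entries are available as a mapping from entry name to a dict with an `image` key. The helper names, the `dsv4-` name prefix check, and the substring heuristic are assumptions made for illustration, not something the repository is known to provide; the sample entries reuse the image strings quoted in this review.

```python
def has_dsv4_image(image: str) -> bool:
    """Heuristic (assumption): DSV4-capable images mention deepseek-v4 in the name."""
    lowered = image.lower()
    return "deepseek-v4" in lowered or "deepseek_v4" in lowered


def find_generic_dsv4_images(configs: dict) -> list:
    """Return names of DSV4 entries whose image does not look DSV4-specific."""
    return sorted(
        name
        for name, entry in configs.items()
        if name.startswith("dsv4-") and not has_dsv4_image(entry["image"])
    )


# Sample entries; image strings are copied from the review text, elisions kept.
sample_configs = {
    "dsv4-fp4-b200-sglang": {
        "image": "lmsysorg/sglang:deepseek-v4-blackwell@sha256:df18bfc..."
    },
    "dsv4-fp4-b300-sglang": {"image": "lmsysorg/sglang:v0.5.12-cu130"},
    "dsv4-fp4-b300-sglang-mtp": {"image": "lmsysorg/sglang:v0.5.12-cu130"},
}

flagged = find_generic_dsv4_images(sample_configs)
print(flagged)
```

A name-based heuristic like this would miss a generic image that later gains deepseek_v4 support, so it is better treated as a warning than a hard gate.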

Author confirmation

The PR author (functionstackx) acknowledged this directly in this PR's timeline on 2026-05-18T07:45:18Z: "the generic v0.5.12-cu130 image bundles a transformers that doesn't recognise model_type: "deepseek_v4", so the bench client crashes in AutoTokenizer.from_pretrained with KeyError: 'deepseek_v4'. ... the generic-image bump is NOT viable until sglang ships transformers with deepseek_v4 support." They closed the PR as not viable, then reopened it with sweep labels intentionally disabled to avoid auto-triggering failing runs while they debug.

Step-by-step proof

  1. Open .github/configs/nvidia-master.yaml at line 1986 — image is now lmsysorg/sglang:v0.5.12-cu130, model is deepseek-ai/DeepSeek-V4-Pro.
  2. Pull the v0.5.12-cu130 image: docker pull lmsysorg/sglang:v0.5.12-cu130.
  3. Inside the container: python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('deepseek-ai/DeepSeek-V4-Pro')".
  4. Observe: KeyError: 'deepseek_v4' raised from CONFIG_MAPPING.__getitem__ because upstream transformers in this image has no entry for deepseek_v4.
  5. Repeat for line 2027 (dsv4-fp4-b300-sglang-mtp) — same image, same model, identical failure.
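A lighter-weight variant of the check above is to ask the installed transformers whether it knows the model type, without downloading the model repo. The sketch below assumes `CONFIG_MAPPING` is importable from `transformers.models.auto.configuration_auto` and supports membership tests; `supports_model_type` is a hypothetical helper, and the registry parameter exists only so the logic can be exercised without transformers installed.

```python
from typing import Container, Optional


def supports_model_type(
    model_type: str, registry: Optional[Container[str]] = None
) -> bool:
    """Report whether a model-type registry contains `model_type`.

    With no registry given, fall back to the installed transformers registry.
    """
    if registry is None:
        # Imported lazily so the helper itself has no hard dependency.
        from transformers.models.auto.configuration_auto import CONFIG_MAPPING

        registry = CONFIG_MAPPING
    return model_type in registry


# Exercise the logic with stand-in registries (illustrative contents only).
generic_like = {"llama", "deepseek_v3"}
patched_like = generic_like | {"deepseek_v4"}

generic_ok = supports_model_type("deepseek_v4", generic_like)
patched_ok = supports_model_type("deepseek_v4", patched_like)
print(generic_ok, patched_ok)
```

Inside a container, calling `supports_model_type("deepseek_v4")` with no registry argument would query the bundled transformers directly.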

Impact

Both dsv4-fp4-b300-sglang and dsv4-fp4-b300-sglang-mtp sweep runs will fail at tokenizer load 100% of the time. No benchmarks will be produced. The PR description itself acknowledges this risk: "⚠️ Note: the deepseek-v4-b300 tag is a custom DSV4 build; the generic v0.5.12-cu130 may or may not retain DSV4-specific features."

How to fix

Either (a) revert the image to the SHA-pinned custom deepseek-v4-b300 builds and wait for upstream sglang to ship a transformers release with deepseek_v4 registered, or (b) keep the generic image bump but have the recipe pip install a transformers build containing deepseek_v4 support inside the container before invoking the bench client. Option (a) is the safer choice and matches what is already done for the b200 sister entry at line 1699.

@functionstackx changed the title from "[Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" to "[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" on May 18, 2026
@functionstackx
Collaborator Author

Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one.

AI-generated via Claude Code /loop.

@functionstackx force-pushed the update-dsv4-fp4-b300-sglang-v0.5.12 branch from d1c4bee to 7e3166e on May 20, 2026 05:48
@functionstackx changed the title from "[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" to "[Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" on May 20, 2026
@functionstackx changed the title from "[Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" to "[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp4-b300-sglang (+mtp) SGLang image to v0.5.12-cu130" on May 20, 2026