Add GLM5 FP8 dynamo-sglang GB300 disagg configs by yeswanthk-26 · Pull Request #1557 · SemiAnalysisAI/InferenceX

yeswanthk-26 · 2026-05-22T20:19:28Z

Summary

Add new glm5-fp8-gb300-dynamo-sglang entry in .github/configs/nvidia-master.yaml with 1k1k and 8k1k STP hightpt/lowlat scenarios.
Wire glm5-fp8 support in runners/launch_gb300-nv.sh

Port PR69 GLM5 FP8 GB300 disaggregated SGLang recipes to SA upstream and wire gb300-nv launcher support while keeping SA-default SLURM account/partition and sqsh paths.

claude · 2026-05-22T20:27:52Z

+      ep-dispatch-algorithm: static
+      moe-a2a-backend: deepep


🔴 In 1k1k_stp_hightpt_2.yaml the decode-side max-running-requests: 256 (line 134) is far below the benchmark's target concurrency of 7300 and is an outlier vs all sibling hightpt configs (which set it to 8192/8192/6500/5700, all aligned with their concurrency). The value 256 exactly matches the prefill section's setting in the same file, which strongly suggests a copy-paste error from prefill into decode. With this cap, ~7044 of the 7300 concurrent requests will perpetually queue inside the decode server and this sweep point will not reach intended decode throughput — please bump it to track the concurrency target (e.g. 7300 or 8192) like the other hightpt configs.

Extended reasoning...

What this is

In benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_hightpt_2.yaml, the decode block sets:

max-running-requests: 256
cuda-graph-max-bs: 180

while the benchmark concurrency at the bottom of the same file is concurrencies: 7300.

Why this is a bug — comparison across the sweep

All five 1k1k_stp_hightpt_* configs are part of the same hightpt sweep, and in every other file the decode max-running-requests is set to (or above) the target concurrency:

File	concurrency	decode max-running-requests	decode cuda-graph-max-bs
hightpt_0	8192	8192	512
hightpt_1	7500	8192	256
hightpt_2	7300	256 ← outlier	180
hightpt_3	6500	6500	128
hightpt_4	5700	5700	100

Only hightpt_2 has decode max-running-requests: 256. That value is identical to the prefill block earlier in the same file (line 71: max-running-requests: 256 in the prefill section), which is the classic copy-paste signature — the decode block was authored by copying prefill and the max-running-requests line was not bumped.
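The comparison above can be expressed as a small consistency check. This is a minimal sketch, not part of the PR: the `SWEEP` dictionary and the function name are hypothetical, and the numbers are transcribed by hand from the table in this comment rather than parsed from the YAML recipes.

```python
# Values transcribed from the comparison table in this review comment.
# The dictionary layout and key names are illustrative assumptions.
SWEEP = {
    "hightpt_0": {"concurrency": 8192, "decode_max_running_requests": 8192},
    "hightpt_1": {"concurrency": 7500, "decode_max_running_requests": 8192},
    "hightpt_2": {"concurrency": 7300, "decode_max_running_requests": 256},
    "hightpt_3": {"concurrency": 6500, "decode_max_running_requests": 6500},
    "hightpt_4": {"concurrency": 5700, "decode_max_running_requests": 5700},
}


def find_undersized_decode_caps(sweep):
    """Return the configs whose decode cap is below the benchmark concurrency."""
    return [
        name
        for name, cfg in sweep.items()
        if cfg["decode_max_running_requests"] < cfg["concurrency"]
    ]


outliers = find_undersized_decode_caps(SWEEP)
print(outliers)
```

A check of this shape would flag only the one file whose decode cap falls below its own concurrency target.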

Why existing settings do not save us

The same decode block also sets cuda-graph-max-bs: 180 with data-parallel-size: 40, implying the decode server was sized for roughly 180 * 40 = 7200 in-flight requests. So the rest of the decode config is consistent with a ~7300-concurrent workload — only the max-running-requests: 256 line is out of place. SGLang enforces max-running-requests as a hard cap on simultaneously-scheduled requests across all DP ranks, so the lower of (256, 7200) wins.
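The arithmetic in this paragraph can be sketched as follows. It models only the simplification used in this comment (the smaller of the two limits wins); it is not a description of SGLang's scheduler internals, and the function name is hypothetical.

```python
def effective_decode_slots(max_running_requests, cuda_graph_max_bs, dp_size):
    """Simplified model from the review: the lower of the two limits wins."""
    # Capacity implied by the CUDA-graph batch size across all DP ranks.
    graph_capacity = cuda_graph_max_bs * dp_size
    return min(max_running_requests, graph_capacity)


CONCURRENCY = 7300  # benchmark concurrency in 1k1k_stp_hightpt_2.yaml

# Values as currently set in the decode block of hightpt_2.
slots = effective_decode_slots(
    max_running_requests=256, cuda_graph_max_bs=180, dp_size=40
)
queued = max(CONCURRENCY - slots, 0)

print(f"running={slots} queued={queued}")
```

Under this model the 256 cap, not the 180 * 40 sizing, determines how many requests run at once.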

Step-by-step proof of the symptom

1. Bench harness launches with concurrencies: 7300 → opens 7300 simultaneous client connections.

2. Each client request arrives at the prefill stage, gets prefilled (prefill max-running-requests: 256 throttles the prefill side; that is intentional and matches all siblings).

3. After KV transfer, the request flips to the decode server. Decode SGLang sees max-running-requests=256 and will only schedule 256 requests at a time.

4. The remaining ~7044 client requests sit in the waiting/queue state; decode TBT is measured only over the 256 actually running.

5. The reported throughput at conc=7300 is effectively the throughput at running-concurrency ≈ 256, not 7300. The hightpt_2 sweep point reports a number that is unrelated to what a 40-way DP decode at conc=7300 actually does.

Impact

This is a freshly added config, so there is no regression to bisect, but it silently produces a misleading data point in a perf sweep that is specifically meant to characterize decode throughput at high concurrency. Across the five 1k1k hightpt points, hightpt_2 will look anomalously low (or anomalously low-utilization on the decode side), which makes the curve hard to interpret around 7300 concurrency.

Fix

Bump the decode-side max-running-requests in this file to match the target concurrency, the same way every sibling does — e.g. max-running-requests: 7300 (mirroring the hightpt_3/hightpt_4 pattern of "decode cap == conc") or max-running-requests: 8192 (mirroring hightpt_0/hightpt_1). One-line change at line 134.

claude · 2026-05-22T20:27:52Z

+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
+    DYN_REQUEST_PLANE: nats
+      # DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).


🟡 All 5 lowlat decode configs (1k1k_stp_lowlat_0/1 and 8k1k_stp_lowlat_0/1/2) copy the decode_environment block from the hightpt configs, including SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' and a comment explaining DeepEP per-rank dispatch buffer sizing. However these lowlat decode workers use moe-runner-backend: flashinfer_trtllm and do not set moe-a2a-backend: deepep or deepep-mode — DeepEP is not in use, so the env var is a no-op and the comment (talking about 4096/24 ~= 171) is misleading since lowlat decode runs DP=1 with cuda-graph-max-bs <= 32. Nit/cosmetic — consider dropping both the env var and the comment from the 5 lowlat files.

Extended reasoning...

What's going on

In each of the 5 lowlat decode configs added in this PR:

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_1.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml

the decode_environment block ends with:

# DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).
# Default 128 overflows with large DP + batch (e.g. 4096/24 ~= 171 > 128). Limit 1024.
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512'

That block was copy-pasted from the hightpt configs where DeepEP is actually enabled (moe-a2a-backend: deepep, deepep-mode: low_latency, deepep-config: /configs/deepep_config.json).

Why this is misleading in the lowlat files

The lowlat decode sglang_config.decode section in each of these files uses:

tensor-parallel-size: 4
expert-parallel-size: 1
data-parallel-size: 1
enable-flashinfer-allreduce-fusion: true
moe-runner-backend: flashinfer_trtllm

There is no moe-a2a-backend: deepep and no deepep-mode / deepep-config — DeepEP is not in use, so SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK is a no-op.

The accompanying comment is also wrong-by-context: it talks about ceil(cuda_graph_max_bs / dp_size) with an example of 4096/24 ~= 171, but the lowlat decode configs run with data-parallel-size: 1 and cuda-graph-max-bs of 1, 8, 15, or 32 — the formula gives 1..32, nowhere near the 128 default.

Step-by-step proof for 1k1k_stp_lowlat_0.yaml

1. Lines 49–52 (decode_environment) set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' with the DeepEP buffer comment.

2. Lines 117–122 (decode sglang_config) set moe-runner-backend: flashinfer_trtllm, data-parallel-size: 1, cuda-graph-max-bs: 32.

3. No moe-a2a-backend or deepep-mode is set anywhere in the decode block, so DeepEP is not invoked.

4. SGLang reads SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK only on the DeepEP code path, so it has no effect.

5. The comment's example (4096/24 ~= 171) describes hightpt math (large DP + large batch); for this file the relevant value would be ceil(32/1) = 32, which never approaches the 128 default the comment warns about.
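The sizing rule quoted in the recipe comment can be worked through numerically. This is a minimal sketch: the function name is hypothetical, and the 128 default is taken from the recipe comment itself rather than verified against SGLang.

```python
import math

# Default value as stated in the recipe comment (not independently verified).
DEFAULT_DISPATCH_TOKENS = 128


def min_dispatch_tokens_per_rank(cuda_graph_max_bs, dp_size):
    """Lower bound from the recipe comment: ceil(cuda_graph_max_bs / dp_size)."""
    return math.ceil(cuda_graph_max_bs / dp_size)


# The example in the comment: large batch spread over 24 DP ranks.
hightpt_example = min_dispatch_tokens_per_rank(4096, 24)

# The lowlat decode configs: dp_size = 1, cuda-graph-max-bs of 1, 8, 15, or 32.
lowlat_values = [min_dispatch_tokens_per_rank(bs, 1) for bs in (1, 8, 15, 32)]

print(hightpt_example, hightpt_example > DEFAULT_DISPATCH_TOKENS)
print(lowlat_values, max(lowlat_values) > DEFAULT_DISPATCH_TOKENS)
```

The comment's example exceeds the stated default, while every lowlat value stays well below it.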

Impact and fix

No functional impact — SGLang silently ignores the unused env var, and benchmark behavior is unchanged. The cost is cosmetic but real: anyone reading these lowlat recipes will see a comment promising DeepEP-related tuning and an env var that doesn't apply, which is exactly the kind of friction that erodes trust in copy-pasted recipes.

Suggested fix: in all 5 lowlat yamls, drop the 3 lines (the two # comments and the SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK entry) from decode_environment. The hightpt files should keep them as-is.
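A lint for this pattern could look like the sketch below. It assumes the decode environment and decode config have already been loaded into plain dictionaries; the function name and the example dictionaries are hypothetical, and the keys mirror those quoted in this comment.

```python
DEEPEP_ENV = "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK"


def has_unused_deepep_env(decode_env, decode_config):
    """True when the DeepEP buffer env var is set but DeepEP is not the a2a backend."""
    uses_deepep = decode_config.get("moe-a2a-backend") == "deepep"
    return DEEPEP_ENV in decode_env and not uses_deepep


# Shapes taken from the settings quoted in this review, not parsed from the files.
lowlat_env = {DEEPEP_ENV: "512"}
lowlat_config = {
    "data-parallel-size": 1,
    "moe-runner-backend": "flashinfer_trtllm",
}
hightpt_env = {DEEPEP_ENV: "512"}
hightpt_config = {"moe-a2a-backend": "deepep", "deepep-mode": "low_latency"}

print(has_unused_deepep_env(lowlat_env, lowlat_config))
print(has_unused_deepep_env(hightpt_env, hightpt_config))
```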

claude · 2026-05-22T20:27:52Z

+    - "Add GLM-5 FP8 GB300 Dynamo SGLang disaggregated multi-node coverage using lmsysorg/sglang:v0.5.11-cu130"
+    - "1k1k and 8k1k STP hightpt and lowlat srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ (resolved from upstream srt-slurm PR #160 via srtctl resolve-override)"
+    - "Wire glm5/fp8 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh with SA upstream defaults (SLURM_PARTITION=batch_1, SLURM_ACCOUNT=benchmark, SQUASH_FILE under /home/sa-shared/gharunners/squash/)"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX


🟡 The pr-link for this changelog entry contains the literal placeholder pull/XXXX instead of the actual PR number (1557). All other recent entries (lines 3084, 3091, 3102) use real PR numbers — please substitute XXXX with 1557 before merge so the changelog link resolves correctly.

Extended reasoning...

What the bug is: perf-changelog.yaml line 3110 contains:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

The XXXX is a literal placeholder that was never replaced with this PR's actual number, 1557 (visible in the PR metadata). The link as written points to /pull/XXXX, which GitHub will resolve to a 404 (or an unrelated page) forever once this is merged.

The code path that triggers it: This is purely a static YAML metadata entry. Anyone — tooling or a human — who walks the changelog and follows the link for the glm5-fp8-gb300-dynamo-sglang entry will hit a broken link.

Why existing code doesn't prevent it: perf-changelog.yaml is a hand-edited document; there is no validator that checks pr-link URLs for placeholder tokens. The PR author clearly intended to fill it in but forgot before pushing.

Impact: Cosmetic / documentation only — no runtime impact, the YAML still parses, the benchmarks still run. But the changelog's whole purpose is to let readers trace each config change back to its PR; a permanently broken link defeats that for this entry, and the mistake will be frozen in git history once merged.

How to fix: One-line substitution. Replace line 3110 with:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1557

Step-by-step proof:

Open perf-changelog.yaml at line 3110. The line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.

Compare with the most recent prior entries: line 3102 uses /pull/1534, line 3091 uses /pull/1451, line 3084 uses /pull/1548 — all real, resolvable PR numbers.

The PR metadata for this change states <pr number="1557">, so the correct value is 1557, not XXXX.

Once merged, https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX will 404 (GitHub PR URLs require a numeric ID). The changelog entry will be unable to cross-reference its own PR.
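Since no validator currently checks pr-link values, a check along these lines could catch placeholder tokens before merge. This is a sketch under assumptions: the regular expression and function name are hypothetical, and it only tests the URL shape, not whether the PR exists.

```python
import re

# Accept only GitHub pull-request URLs that end in a numeric ID.
PR_LINK_RE = re.compile(r"https://github\.com/[^/]+/[^/]+/pull/\d+")


def is_valid_pr_link(url):
    """Return True if the URL has the shape of a numeric GitHub PR link."""
    return PR_LINK_RE.fullmatch(url) is not None


placeholder = "https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX"
fixed = "https://github.com/SemiAnalysisAI/InferenceX/pull/1557"

print(is_valid_pr_link(placeholder))
print(is_valid_pr_link(fixed))
```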

Severity: nit — purely documentation/metadata, no functional consequence, but trivial to fix and worth catching pre-merge before it becomes a permanent artifact of git history.

[GB300][SGLang] Add GLM5 FP8 dynamo-sglang disagg configs

886e619


yeswanthk-26 requested a review from a team May 22, 2026 20:19

yeswanthk-26 requested review from jgangani and kedarpotdar-nv as code owners May 22, 2026 20:19

github-project-automation Bot added this to InferenceMAX Board May 22, 2026

claude Bot reviewed May 22, 2026

yeswanthk-26 wants to merge 1 commit into main from yeswanth/glm5-fp8-gb300-disagg
