Add GLM5 FP8 dynamo-sglang GB300 disagg configs #1557
Port PR69 GLM5 FP8 GB300 disaggregated SGLang recipes to SA upstream and wire gb300-nv launcher support while keeping SA-default SLURM account/partition and sqsh paths.
| ep-dispatch-algorithm: static | ||
| moe-a2a-backend: deepep |
🔴 In 1k1k_stp_hightpt_2.yaml the decode-side max-running-requests: 256 (line 134) is far below the benchmark's target concurrency of 7300 and is an outlier vs all sibling hightpt configs (which set it to 8192/8192/6500/5700, all aligned with their concurrency). The value 256 exactly matches the prefill section's setting in the same file, which strongly suggests a copy-paste error from prefill into decode. With this cap, ~7044 of the 7300 concurrent requests will perpetually queue inside the decode server and this sweep point will not reach intended decode throughput — please bump it to track the concurrency target (e.g. 7300 or 8192) like the other hightpt configs.
Extended reasoning...
What this is
In benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_hightpt_2.yaml, the decode block sets:
max-running-requests: 256
cuda-graph-max-bs: 180
while the benchmark concurrency at the bottom of the same file is concurrencies: 7300.
Why this is a bug — comparison across the sweep
All five 1k1k_stp_hightpt_* configs are part of the same hightpt sweep, and in every other file the decode max-running-requests is set to (or above) the target concurrency:
| File | concurrency | decode max-running-requests | decode cuda-graph-max-bs |
|---|---|---|---|
| hightpt_0 | 8192 | 8192 | 512 |
| hightpt_1 | 7500 | 8192 | 256 |
| hightpt_2 | 7300 | 256 ← outlier | 180 |
| hightpt_3 | 6500 | 6500 | 128 |
| hightpt_4 | 5700 | 5700 | 100 |
Only hightpt_2 has decode max-running-requests: 256. That value is identical to the prefill block earlier in the same file (line 71: max-running-requests: 256 in the prefill section), which is the classic copy-paste signature — the decode block was authored by copying prefill and the max-running-requests line was not bumped.
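The comparison above can be expressed as a small check. This is a minimal sketch: the values are transcribed by hand from the table, and the dictionary keys and the function name `undersized_decode_caps` are hypothetical, not part of the repository.

```python
# Values transcribed from the comparison table above; nothing is read from disk.
SWEEP = {
    "hightpt_0": {"concurrency": 8192, "decode_max_running_requests": 8192},
    "hightpt_1": {"concurrency": 7500, "decode_max_running_requests": 8192},
    "hightpt_2": {"concurrency": 7300, "decode_max_running_requests": 256},
    "hightpt_3": {"concurrency": 6500, "decode_max_running_requests": 6500},
    "hightpt_4": {"concurrency": 5700, "decode_max_running_requests": 5700},
}


def undersized_decode_caps(sweep):
    """Return config names whose decode cap is below the target concurrency."""
    return [
        name
        for name, cfg in sweep.items()
        if cfg["decode_max_running_requests"] < cfg["concurrency"]
    ]


print(undersized_decode_caps(SWEEP))
```

Only hightpt_2 is flagged, which matches the outlier marked in the table.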
Why existing settings do not save us
The same decode block also sets cuda-graph-max-bs: 180 with data-parallel-size: 40, implying the decode server was sized for roughly 180 * 40 = 7200 in-flight requests. So the rest of the decode config is consistent with a ~7300-concurrent workload — only the max-running-requests: 256 line is out of place. SGLang enforces max-running-requests as a hard cap on simultaneously-scheduled requests across all DP ranks, so the lower of (256, 7200) wins.
Step-by-step proof of the symptom
- Bench harness launches with concurrencies: 7300 → opens 7300 simultaneous client connections.
- Each client request arrives at the prefill stage, gets prefilled (prefill max-running-requests: 256 throttles the prefill side; that is intentional and matches all siblings).
- After KV transfer, the request flips to the decode server. Decode SGLang sees max-running-requests=256 and will only schedule 256 requests at a time.
- The remaining ~7044 client requests sit in the waiting/queue state; decode TBT is measured only over the 256 actually running.
- The reported throughput at conc=7300 is effectively the throughput at running-concurrency ≈ 256, not 7300. The hightpt_2 sweep point reports a number that is unrelated to what a 40-way DP decode at conc=7300 actually does.
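The arithmetic behind these steps can be sketched as follows. This is a simplified model that assumes, as the review does, that max-running-requests acts as a single hard cap on concurrently scheduled decode requests; the function name `decode_queue_split` is hypothetical and this is not SGLang's scheduler logic.

```python
def decode_queue_split(concurrency, max_running_requests):
    """Split in-flight requests into (running, queued) under a hard cap.

    Assumption: max-running-requests is one global cap on the decode side.
    """
    running = min(concurrency, max_running_requests)
    return running, concurrency - running


# Current hightpt_2 setting: cap of 256 against 7300 concurrent requests.
current = decode_queue_split(7300, 256)
# Proposed fix: cap tracks the concurrency target.
fixed = decode_queue_split(7300, 7300)
# Capacity implied by the rest of the decode block: cuda-graph-max-bs * DP size.
sized_for = 180 * 40

print(current, fixed, sized_for)
```

Under this model the current setting leaves 7044 requests queued, while a cap equal to the concurrency target leaves none.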
Impact
This is a freshly added config, so there is no regression to bisect, but it silently produces a misleading data point in a perf sweep that is specifically meant to characterize decode throughput at high concurrency. Across the five 1k1k hightpt points, hightpt_2 will look anomalously low (or anomalously low-utilization on the decode side) and the whole curve becomes uninterpretable around 7300 concurrency.
Fix
Bump the decode-side max-running-requests in this file to match the target concurrency, the same way every sibling does — e.g. max-running-requests: 7300 (mirroring the hightpt_3/hightpt_4 pattern of "decode cap == conc") or max-running-requests: 8192 (mirroring hightpt_0/hightpt_1). One-line change at line 134.
| SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0' | ||
| SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1' | ||
| DYN_REQUEST_PLANE: nats | ||
| # DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size). |
🟡 All 5 lowlat decode configs (1k1k_stp_lowlat_0/1 and 8k1k_stp_lowlat_0/1/2) copy the decode_environment block from the hightpt configs, including SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' and a comment explaining DeepEP per-rank dispatch buffer sizing. However these lowlat decode workers use moe-runner-backend: flashinfer_trtllm and do not set moe-a2a-backend: deepep or deepep-mode — DeepEP is not in use, so the env var is a no-op and the comment (talking about 4096/24 ~= 171) is misleading since lowlat decode runs DP=1 with cuda-graph-max-bs <= 32. Nit/cosmetic — consider dropping both the env var and the comment from the 5 lowlat files.
Extended reasoning...
What's going on
In each of the 5 lowlat decode configs added in this PR:
benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml
benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_1.yaml
benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml
benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml
benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml
the decode_environment block ends with:
# DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).
# Default 128 overflows with large DP + batch (e.g. 4096/24 ~= 171 > 128). Limit 1024.
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512'
That block was copy-pasted from the hightpt configs where DeepEP is actually enabled (moe-a2a-backend: deepep, deepep-mode: low_latency, deepep-config: /configs/deepep_config.json).
Why this is misleading in the lowlat files
The lowlat decode sglang_config.decode section in each of these files uses:
tensor-parallel-size: 4
expert-parallel-size: 1
data-parallel-size: 1
enable-flashinfer-allreduce-fusion: true
moe-runner-backend: flashinfer_trtllm
There is no moe-a2a-backend: deepep and no deepep-mode / deepep-config; DeepEP is not in use, so SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK is a no-op.
The accompanying comment is also wrong-by-context: it talks about ceil(cuda_graph_max_bs / dp_size) with an example of 4096/24 ~= 171, but the lowlat decode configs run with data-parallel-size: 1 and cuda-graph-max-bs of 1, 8, 15, or 32 — the formula gives 1..32, nowhere near the 128 default.
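The sizing rule quoted in the recipe comment, ceil(cuda_graph_max_bs / dp_size), can be evaluated directly. This is a minimal sketch: the function name `min_dispatch_tokens_per_rank` is hypothetical, and the 128 default is taken from the recipe comment rather than checked against SGLang.

```python
def min_dispatch_tokens_per_rank(cuda_graph_max_bs, dp_size):
    """Integer ceil(cuda_graph_max_bs / dp_size), per the recipe comment."""
    return -(-cuda_graph_max_bs // dp_size)


DEFAULT_BUFFER = 128  # default stated in the recipe comment (assumption)

# Example from the comment: large DP and large batch.
hightpt_example = min_dispatch_tokens_per_rank(4096, 24)

# lowlat decode: data-parallel-size 1 with cuda-graph-max-bs of 1, 8, 15, or 32.
lowlat_needs = [min_dispatch_tokens_per_rank(bs, 1) for bs in (1, 8, 15, 32)]

print(hightpt_example, hightpt_example > DEFAULT_BUFFER)
print(lowlat_needs, max(lowlat_needs) <= DEFAULT_BUFFER)
```

The comment's example exceeds the default, while every lowlat combination stays well below it, so the override would be unnecessary even if DeepEP were enabled there.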
Step-by-step proof for 1k1k_stp_lowlat_0.yaml
- Lines 49–52 (decode_environment) set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' with the DeepEP buffer comment.
- Lines 117–122 (decode sglang_config) set moe-runner-backend: flashinfer_trtllm, data-parallel-size: 1, cuda-graph-max-bs: 32.
- No moe-a2a-backend or deepep-mode is set anywhere in the decode block; DeepEP is not invoked.
- SGLang reads SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK only on the DeepEP code path, so it has no effect.
- The comment's example (4096/24 ~= 171) describes hightpt math (large DP + large batch); for this file the relevant value would be ceil(32/1) = 32, which never approaches the 128 default the comment warns about.
Impact and fix
No functional impact — SGLang silently ignores the unused env var, and benchmark behavior is unchanged. The cost is cosmetic but real: anyone reading these lowlat recipes will see a comment promising DeepEP-related tuning and an env var that doesn't apply, which is exactly the kind of friction that erodes trust in copy-pasted recipes.
Suggested fix: in all 5 lowlat yamls, drop the 3 lines (the two # comments and the SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK entry) from decode_environment. The hightpt files should keep them as-is.
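A lint along these lines could catch the same mismatch in future recipes. This is a sketch only: it works on plain dictionaries standing in for the parsed decode_environment and decode sglang_config sections, and the function name `has_stale_deepep_env` is hypothetical.

```python
DEEPEP_ENV = "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK"


def has_stale_deepep_env(decode_environment, decode_config):
    """True when the DeepEP buffer env var is set but DeepEP is not the a2a backend."""
    uses_deepep = decode_config.get("moe-a2a-backend") == "deepep"
    return DEEPEP_ENV in decode_environment and not uses_deepep


# Stand-in for a lowlat decode block: no DeepEP backend configured.
lowlat = has_stale_deepep_env(
    {DEEPEP_ENV: "512"},
    {"moe-runner-backend": "flashinfer_trtllm", "data-parallel-size": 1},
)

# Stand-in for a hightpt decode block: DeepEP enabled, env var is meaningful.
hightpt = has_stale_deepep_env(
    {DEEPEP_ENV: "512"},
    {"moe-a2a-backend": "deepep", "deepep-mode": "low_latency"},
)

print(lowlat, hightpt)
```

The lowlat stand-in is flagged and the hightpt one is not, consistent with keeping the three lines only in the hightpt files.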
| - "Add GLM-5 FP8 GB300 Dynamo SGLang disaggregated multi-node coverage using lmsysorg/sglang:v0.5.11-cu130" | ||
| - "1k1k and 8k1k STP hightpt and lowlat srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ (resolved from upstream srt-slurm PR #160 via srtctl resolve-override)" | ||
| - "Wire glm5/fp8 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh with SA upstream defaults (SLURM_PARTITION=batch_1, SLURM_ACCOUNT=benchmark, SQUASH_FILE under /home/sa-shared/gharunners/squash/)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX |
🟡 The pr-link for this changelog entry contains the literal placeholder pull/XXXX instead of the actual PR number (1557). All other recent entries (lines 3084, 3091, 3102) use real PR numbers — please substitute XXXX with 1557 before merge so the changelog link resolves correctly.
Extended reasoning...
What the bug is: perf-changelog.yaml line 3110 contains:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
The XXXX is a literal placeholder that was never replaced with this PR's actual number, 1557 (visible in the PR metadata). The link as written points to /pull/XXXX, which GitHub will resolve to a 404 (or an unrelated page) forever once this is merged.
The code path that triggers it: This is purely a static YAML metadata entry. Anyone — tooling or a human — who walks the changelog and follows the link for the glm5-fp8-gb300-dynamo-sglang entry will hit a broken link.
Why existing code doesn't prevent it: perf-changelog.yaml is a hand-edited document; there is no validator that checks pr-link URLs for placeholder tokens. The PR author clearly intended to fill it in but forgot before pushing.
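A validator of the kind described as missing could be as small as the sketch below. It assumes pr-link values are GitHub pull-request URLs ending in a numeric ID; the function name `is_valid_pr_link` and the exact pattern are hypothetical and not part of the repository.

```python
import re

# Assumed shape: https://github.com/<owner>/<repo>/pull/<digits>
PR_LINK = re.compile(r"https://github\.com/[^/\s]+/[^/\s]+/pull/\d+")


def is_valid_pr_link(url):
    """True only if the whole URL matches the assumed pull-request shape."""
    return PR_LINK.fullmatch(url) is not None


placeholder = "https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX"
resolved = "https://github.com/SemiAnalysisAI/InferenceX/pull/1557"

print(is_valid_pr_link(placeholder), is_valid_pr_link(resolved))
```

Run over every pr-link in the changelog, a check like this would reject the placeholder entry while accepting entries with real PR numbers.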
Impact: Cosmetic / documentation only — no runtime impact, the YAML still parses, the benchmarks still run. But the changelog's whole purpose is to let readers trace each config change back to its PR; a permanently broken link defeats that for this entry, and the mistake will be frozen in git history once merged.
How to fix: Single-token substitution (XXXX becomes 1557). Replace line 3110 with:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1557
Step-by-step proof:
- Open perf-changelog.yaml at line 3110. The line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.
- Compare with the most recent prior entries: line 3102 uses /pull/1534, line 3091 uses /pull/1451, line 3084 uses /pull/1548; all are real, resolvable PR numbers.
- The PR metadata for this change states <pr number="1557">, so the correct value is 1557, not XXXX.
- Once merged, https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX will 404 (GitHub PR URLs require a numeric ID). The changelog entry will be unable to cross-reference its own PR.
Severity: nit — purely documentation/metadata, no functional consequence, but trivial to fix and worth catching pre-merge before it becomes a permanent artifact of git history.
Summary
- glm5-fp8-gb300-dynamo-sglang entry in .github/configs/nvidia-master.yaml with 1k1k and 8k1k STP hightpt/lowlat scenarios.
- glm5-fp8 support in runners/launch_gb300-nv.sh