
Add GLM5 FP8 dynamo-sglang GB300 disagg configs #1557

Open
yeswanthk-26 wants to merge 1 commit into main from yeswanth/glm5-fp8-gb300-disagg

yeswanthk-26 (Collaborator) commented May 22, 2026

Summary

  • Add new glm5-fp8-gb300-dynamo-sglang entry in .github/configs/nvidia-master.yaml with 1k1k and 8k1k STP hightpt/lowlat scenarios.
  • Wire glm5-fp8 support in runners/launch_gb300-nv.sh

Port PR69 GLM5 FP8 GB300 disaggregated SGLang recipes to SA upstream and wire gb300-nv launcher support while keeping SA-default SLURM account/partition and sqsh paths.
Comment on lines +133 to +134
ep-dispatch-algorithm: static
moe-a2a-backend: deepep

🔴 In 1k1k_stp_hightpt_2.yaml the decode-side max-running-requests: 256 (line 134) is far below the benchmark's target concurrency of 7300 and is an outlier vs all sibling hightpt configs (which set it to 8192/8192/6500/5700, all aligned with their concurrency). The value 256 exactly matches the prefill section's setting in the same file, which strongly suggests a copy-paste error from prefill into decode. With this cap, ~7044 of the 7300 concurrent requests will perpetually queue inside the decode server and this sweep point will not reach intended decode throughput — please bump it to track the concurrency target (e.g. 7300 or 8192) like the other hightpt configs.

Extended reasoning...

What this is

In benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_hightpt_2.yaml, the decode block sets:

      max-running-requests: 256
      cuda-graph-max-bs: 180

while the benchmark concurrency at the bottom of the same file is concurrencies: 7300.

Why this is a bug — comparison across the sweep

All five 1k1k_stp_hightpt_* configs are part of the same hightpt sweep, and in every other file the decode max-running-requests is set to (or above) the target concurrency:

File        concurrency   decode max-running-requests   decode cuda-graph-max-bs
hightpt_0   8192          8192                          512
hightpt_1   7500          8192                          256
hightpt_2   7300          256   ← outlier               180
hightpt_3   6500          6500                          128
hightpt_4   5700          5700                          100
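
The comparison in the table can be automated. The following is a minimal sketch, not part of the PR: the values are copied from the table, and `SWEEP` and `find_undersized` are hypothetical names chosen for illustration. It flags any sweep point whose decode cap is below its benchmark concurrency.

```python
# Values copied from the table above:
# (file, benchmark concurrency, decode max-running-requests).
SWEEP = [
    ("hightpt_0", 8192, 8192),
    ("hightpt_1", 7500, 8192),
    ("hightpt_2", 7300, 256),
    ("hightpt_3", 6500, 6500),
    ("hightpt_4", 5700, 5700),
]


def find_undersized(sweep):
    """Return the files whose decode cap is below the benchmark concurrency."""
    return [name for name, concurrency, cap in sweep if cap < concurrency]


print(find_undersized(SWEEP))
```

Only hightpt_2 is reported, which matches the outlier marked in the table.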

Only hightpt_2 has decode max-running-requests: 256. That value is identical to the prefill block earlier in the same file (line 71: max-running-requests: 256 in the prefill section), which is the classic copy-paste signature — the decode block was authored by copying prefill and the max-running-requests line was not bumped.

Why existing settings do not save us

The same decode block also sets cuda-graph-max-bs: 180 with data-parallel-size: 40, implying the decode server was sized for roughly 180 * 40 = 7200 in-flight requests. So the rest of the decode config is consistent with a ~7300-concurrent workload — only the max-running-requests: 256 line is out of place. SGLang enforces max-running-requests as a hard cap on simultaneously-scheduled requests across all DP ranks, so the lower of (256, 7200) wins.
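
The arithmetic in this paragraph can be written out as a short sketch. It follows the simplified model used in this comment (the effective cap is the lower of `max-running-requests` and `cuda-graph-max-bs * data-parallel-size`); that model is an assumption taken from the reasoning above, not a statement of SGLang's scheduler internals, and `effective_decode_cap` is a hypothetical helper.

```python
def effective_decode_cap(max_running_requests, cuda_graph_max_bs, dp_size):
    """Lower of the configured cap and the batch the decode side was sized for.

    Simplified model from the review comment, not SGLang's actual scheduler.
    """
    sized_for = cuda_graph_max_bs * dp_size  # 180 * 40 in hightpt_2
    return min(max_running_requests, sized_for)


concurrency = 7300
cap = effective_decode_cap(max_running_requests=256,
                           cuda_graph_max_bs=180,
                           dp_size=40)
queued = max(concurrency - cap, 0)

print(f"effective cap: {cap}, requests left waiting: {queued}")
```

Under these assumptions the cap is 256 and 7044 of the 7300 requests are left waiting, which is the figure quoted at the top of this comment.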

Step-by-step proof of the symptom

  1. Bench harness launches with concurrencies: 7300 → opens 7300 simultaneous client connections.
  2. Each client request arrives at the prefill stage, gets prefilled (prefill max-running-requests: 256 throttles the prefill side; that is intentional and matches all siblings).
  3. After KV transfer, the request flips to the decode server. Decode SGLang sees max-running-requests=256 and will only schedule 256 requests at a time.
  4. The remaining ~7044 client requests sit in the waiting/queue state; decode TBT is measured only over the 256 actually running.
  5. The reported throughput at conc=7300 is effectively the throughput at running-concurrency ≈ 256, not 7300. The hightpt_2 sweep point reports a number that is unrelated to what a 40-way DP decode at conc=7300 actually does.

Impact

This is a freshly added config, so there is no regression to bisect, but it silently produces a misleading data point in a perf sweep that is specifically meant to characterize decode throughput at high concurrency. Across the five 1k1k hightpt points, hightpt_2 will look anomalously low (or show anomalously low utilization on the decode side), and the curve becomes hard to interpret around 7300 concurrency.

Fix

Bump the decode-side max-running-requests in this file to match the target concurrency, the same way every sibling does — e.g. max-running-requests: 7300 (mirroring the hightpt_3/hightpt_4 pattern of "decode cap == conc") or max-running-requests: 8192 (mirroring hightpt_0/hightpt_1). One-line change at line 134.

Comment on lines +49 to +52
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
DYN_REQUEST_PLANE: nats
# DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).

🟡 All 5 lowlat decode configs (1k1k_stp_lowlat_0/1 and 8k1k_stp_lowlat_0/1/2) copy the decode_environment block from the hightpt configs, including SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' and a comment explaining DeepEP per-rank dispatch buffer sizing. However these lowlat decode workers use moe-runner-backend: flashinfer_trtllm and do not set moe-a2a-backend: deepep or deepep-mode — DeepEP is not in use, so the env var is a no-op and the comment (talking about 4096/24 ~= 171) is misleading since lowlat decode runs DP=1 with cuda-graph-max-bs <= 32. Nit/cosmetic — consider dropping both the env var and the comment from the 5 lowlat files.

Extended reasoning...

What's going on

In each of the 5 lowlat decode configs added in this PR:

  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_1.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml

the decode_environment block ends with:

      # DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).
      # Default 128 overflows with large DP + batch (e.g. 4096/24 ~= 171 > 128). Limit 1024.
      SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512'

That block was copy-pasted from the hightpt configs where DeepEP is actually enabled (moe-a2a-backend: deepep, deepep-mode: low_latency, deepep-config: /configs/deepep_config.json).

Why this is misleading in the lowlat files

The lowlat decode sglang_config.decode section in each of these files uses:

      tensor-parallel-size: 4
      expert-parallel-size: 1
      data-parallel-size: 1
      enable-flashinfer-allreduce-fusion: true
      moe-runner-backend: flashinfer_trtllm

There is no moe-a2a-backend: deepep and no deepep-mode / deepep-config — DeepEP is not in use, so SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK is a no-op.

The accompanying comment is also wrong-by-context: it talks about ceil(cuda_graph_max_bs / dp_size) with an example of 4096/24 ~= 171, but the lowlat decode configs run with data-parallel-size: 1 and cuda-graph-max-bs of 1, 8, 15, or 32 — the formula gives 1..32, nowhere near the 128 default.
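
To make the mismatch concrete, here is a small sketch that evaluates the formula from the YAML comment, `ceil(cuda_graph_max_bs / dp_size)`, for the hightpt example and for the lowlat batch sizes listed above. The helper name `min_dispatch_tokens_per_rank` is hypothetical; the constant 128 is the default quoted in the YAML comment.

```python
import math

DEFAULT_DISPATCH_TOKENS = 128  # default quoted in the YAML comment


def min_dispatch_tokens_per_rank(cuda_graph_max_bs, dp_size):
    """Formula from the YAML comment: ceil(cuda_graph_max_bs / dp_size)."""
    return math.ceil(cuda_graph_max_bs / dp_size)


# Example from the comment: large batch, large DP.
hightpt_example = min_dispatch_tokens_per_rank(4096, 24)

# Lowlat decode configs: data-parallel-size 1, cuda-graph-max-bs of 1, 8, 15 or 32.
lowlat = {bs: min_dispatch_tokens_per_rank(bs, 1) for bs in (1, 8, 15, 32)}

print(hightpt_example, hightpt_example > DEFAULT_DISPATCH_TOKENS)
print(lowlat, max(lowlat.values()) > DEFAULT_DISPATCH_TOKENS)
```

The hightpt example needs 171 tokens per rank and exceeds the default, while the largest lowlat value is 32 and stays well below it.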

Step-by-step proof for 1k1k_stp_lowlat_0.yaml

  1. Lines 49–52 (decode_environment) set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' with the DeepEP buffer comment.
  2. Lines 117–122 (decode sglang_config) set moe-runner-backend: flashinfer_trtllm, data-parallel-size: 1, cuda-graph-max-bs: 32.
  3. No moe-a2a-backend or deepep-mode is set anywhere in the decode block — DeepEP is not invoked.
  4. SGLang reads SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK only on the DeepEP code path, so it has no effect.
  5. The comment's example (4096/24 ~= 171) describes hightpt math (large DP + large batch); for this file the relevant value would be ceil(32/1) = 32, which never approaches the 128 default the comment warns about.

Impact and fix

No functional impact — SGLang silently ignores the unused env var, and benchmark behavior is unchanged. The cost is cosmetic but real: anyone reading these lowlat recipes will see a comment promising DeepEP-related tuning and an env var that doesn't apply, which is exactly the kind of friction that erodes trust in copy-pasted recipes.

Suggested fix: in all 5 lowlat yamls, drop the 3 lines (the two # comments and the SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK entry) from decode_environment. The hightpt files should keep them as-is.
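
A check along the following lines could catch this kind of leftover in future recipes. It is only a sketch: it takes the decode environment and the decode SGLang config as plain dicts (how they are loaded from the recipe YAML is left out), treats `moe-a2a-backend: deepep` as the sole signal that DeepEP is enabled, and `unused_deepep_env` is a hypothetical helper name.

```python
DEEPEP_ENV = "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK"


def unused_deepep_env(decode_environment, decode_config):
    """True if the DeepEP env var is set while the decode config does not use DeepEP.

    Assumes DeepEP is enabled only via `moe-a2a-backend: deepep`.
    """
    uses_deepep = decode_config.get("moe-a2a-backend") == "deepep"
    return DEEPEP_ENV in decode_environment and not uses_deepep


# Dicts mirroring the snippets quoted in this comment.
env = {"DYN_REQUEST_PLANE": "nats", DEEPEP_ENV: "512"}
lowlat_cfg = {
    "tensor-parallel-size": 4,
    "expert-parallel-size": 1,
    "data-parallel-size": 1,
    "moe-runner-backend": "flashinfer_trtllm",
}
hightpt_cfg = {"moe-a2a-backend": "deepep", "deepep-mode": "low_latency"}

print(unused_deepep_env(env, lowlat_cfg))   # lowlat: env var set, DeepEP off
print(unused_deepep_env(env, hightpt_cfg))  # hightpt: env var set, DeepEP on
```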

Comment thread perf-changelog.yaml
- "Add GLM-5 FP8 GB300 Dynamo SGLang disaggregated multi-node coverage using lmsysorg/sglang:v0.5.11-cu130"
- "1k1k and 8k1k STP hightpt and lowlat srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ (resolved from upstream srt-slurm PR #160 via srtctl resolve-override)"
- "Wire glm5/fp8 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh with SA upstream defaults (SLURM_PARTITION=batch_1, SLURM_ACCOUNT=benchmark, SQUASH_FILE under /home/sa-shared/gharunners/squash/)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

🟡 The pr-link for this changelog entry contains the literal placeholder pull/XXXX instead of the actual PR number (1557). All other recent entries (lines 3084, 3091, 3102) use real PR numbers — please substitute XXXX with 1557 before merge so the changelog link resolves correctly.

Extended reasoning...

What the bug is: perf-changelog.yaml line 3110 contains:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

The XXXX is a literal placeholder that was never replaced with this PR's actual number, 1557 (visible in the PR metadata). The link as written points to /pull/XXXX, which GitHub will resolve to a 404 (or an unrelated page) forever once this is merged.

The code path that triggers it: This is purely a static YAML metadata entry. Anyone — tooling or a human — who walks the changelog and follows the link for the glm5-fp8-gb300-dynamo-sglang entry will hit a broken link.

Why existing code doesn't prevent it: perf-changelog.yaml is a hand-edited document; there is no validator that checks pr-link URLs for placeholder tokens. The PR author clearly intended to fill it in but forgot before pushing.

Impact: Cosmetic / documentation only — no runtime impact, the YAML still parses, the benchmarks still run. But the changelog's whole purpose is to let readers trace each config change back to its PR; a permanently broken link defeats that for this entry, and the mistake will be frozen in git history once merged.

How to fix: One-line substitution. Replace line 3110 with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1557

Step-by-step proof:

  1. Open perf-changelog.yaml at line 3110. The line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.
  2. Compare with the most recent prior entries: line 3102 uses /pull/1534, line 3091 uses /pull/1451, line 3084 uses /pull/1548 — all real, resolvable PR numbers.
  3. The PR metadata for this change states <pr number="1557">, so the correct value is 1557, not XXXX.
  4. Once merged, https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX will 404 (GitHub PR URLs require a numeric ID). The changelog entry will be unable to cross-reference its own PR.

Severity: nit — purely documentation/metadata, no functional consequence, but trivial to fix and worth catching pre-merge before it becomes a permanent artifact of git history.
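
Since no validator currently checks pr-link values, a lightweight check could look like the sketch below. It is an illustration only: the regular expressions and the `bad_pr_links` helper are hypothetical, and it assumes each link sits on its own `pr-link:` line as in the entry quoted above.

```python
import re

# Matches a line of the form "pr-link: <url>" and captures the URL.
PR_LINK_LINE = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")
# A well-formed GitHub PR URL ends in a numeric PR ID.
VALID_PR_URL = re.compile(r"^https://github\.com/[^/]+/[^/]+/pull/\d+$")


def bad_pr_links(text):
    """Return (line number, url) for every pr-link that is not a numeric PR URL."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = PR_LINK_LINE.match(line)
        if match and not VALID_PR_URL.match(match.group(1)):
            bad.append((lineno, match.group(1)))
    return bad


sample = """\
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1534
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
"""

for lineno, url in bad_pr_links(sample):
    print(f"line {lineno}: placeholder or malformed pr-link: {url}")
```

Run against the full perf-changelog.yaml text, the same function would report the placeholder entry with its real line number.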
