19 changes: 19 additions & 0 deletions .github/configs/amd-master.yaml
@@ -1933,3 +1933,22 @@ glm5-fp8-mi325x-sglang-mtp:
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

dsr1-fp8-mi300x-sglang-mtp:
  image: lmsysorg/sglang:v0.5.12-rocm700-mi30x
  model: deepseek-ai/DeepSeek-R1-0528
  model-prefix: dsr1
  runner: mi300x
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
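To make the search space concrete, here is a minimal bash sketch of how this recipe could expand into individual benchmark points. It assumes the sweep generator doubles concurrency from conc-start up to conc-end. That stepping rule and the helper name expand_conc are hypothetical, and the real logic in utils/matrix_logic/generate_sweep_configs.py may differ.

```shell
#!/usr/bin/env bash
# Hypothetical expansion of the dsr1-fp8-mi300x-sglang-mtp search space.
# ASSUMPTION: concurrency doubles from conc-start to conc-end.
set -euo pipefail

expand_conc() {
  local start=$1 end=$2 c
  (( start > 0 )) || return 1   # doubling from 0 would never terminate
  for (( c = start; c <= end; c *= 2 )); do
    printf '%s\n' "$c"
  done
}

# Two fixed-seq-len scenarios (isl 1024 and 8192, osl 1024), each tp=8, ep=1.
for isl in 1024 8192; do
  for conc in $(expand_conc 4 64); do
    echo "isl=$isl osl=1024 tp=8 ep=1 conc=$conc spec-decoding=mtp"
  done
done
```

Under the doubling assumption, each scenario yields five concurrency points (4, 8, 16, 32, 64), for ten server/benchmark runs in total.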
87 changes: 87 additions & 0 deletions benchmarks/single_node/dsr1_fp8_mi300x_mtp.sh
@@ -0,0 +1,87 @@
#!/usr/bin/env bash

# DeepSeek-R1-0528 FP8 on MI300X with EAGLE/MTP speculative decoding.
# Mirrors dsr1_fp8_mi300x.sh and adds the speculative-* flags.

Comment on lines +1 to +5 (Contributor)

🔴 The newly added benchmarks/single_node/dsr1_fp8_mi300x_mtp.sh is unreachable: runners/launch_mi300x-amds.sh:41 hardcodes the bench-script path as benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh with no SPEC_SUFFIX dispatch, so the dsr1-fp8-mi300x-sglang-mtp recipe will always invoke the existing non-MTP dsr1_fp8_mi300x.sh regardless of spec-decoding: mtp. The sweep will silently run the autoregressive recipe (no --speculative-* flags, SGLANG_ENABLE_SPEC_V2 unset) and publish vanilla numbers under the -mtp recipe name. Fix: add SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf _mtp || printf "") to launch_mi300x-amds.sh and append it to the script path, mirroring launch_b200-cw.sh:6-16 / launch_mi355x-amds.sh:183,226.

Extended reasoning

What's broken

The PR adds a new recipe, dsr1-fp8-mi300x-sglang-mtp (.github/configs/amd-master.yaml:1823), with spec-decoding: mtp, plus a sibling bench script, benchmarks/single_node/dsr1_fp8_mi300x_mtp.sh, that sets the --speculative-algorithm EAGLE … flags and SGLANG_ENABLE_SPEC_V2=1. However, the launcher that handles mi300x runners has no code path that ever executes that file.

Code path

The dispatcher .github/workflows/benchmark-tmpl.yml invokes runners/launch_mi300x-amds.sh for any recipe with runner: mi300x. That launcher ends with a single hardcoded line at runners/launch_mi300x-amds.sh:41:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh

EXP_NAME is built by utils/matrix_logic/generate_sweep_configs.py:290 as f"{model_code}_{seq_len_str}", where model_code is the recipe's model-prefix. For this PR's recipe (model-prefix: dsr1), EXP_NAME will be, for example, dsr1_1k1k or dsr1_8k1k, so ${EXP_NAME%%_*}=dsr1. With PRECISION=fp8 and no SCENARIO_SUBDIR (fixed-seq-len is the default scenario and maps to an empty subdir), the launcher always resolves to:

benchmarks/single_node/dsr1_fp8_mi300x.sh

That file exists (it is the non-MTP script from origin/main) and runs cleanly, so no error alerts anyone to the mismatch.
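The lookup described above can be reproduced with a short bash sketch. It assumes seq_len_str is formed as <isl/1024>k<osl/1024>k. That format matches the dsr1_1k1k and dsr1_8k1k examples but is not copied from generate_sweep_configs.py, and the helper names are hypothetical. The sketch shows that SPEC_DECODING never enters the path.

```shell
#!/usr/bin/env bash
# Reproduces the current mi300x bench-script lookup (launch_mi300x-amds.sh:41).
# ASSUMPTION: seq_len_str is "<isl/1024>k<osl/1024>k"; helper names are hypothetical.
set -euo pipefail

seq_len_str() {
  local isl=$1 osl=$2
  printf '%dk%dk' $(( isl / 1024 )) $(( osl / 1024 ))
}

resolve_script_current() {
  # Only EXP_NAME, PRECISION and SCENARIO_SUBDIR feed the path;
  # SPEC_DECODING is never consulted.
  local exp_name=$1 precision=$2 scenario_subdir=${3:-}
  printf 'benchmarks/single_node/%s%s_%s_mi300x.sh\n' \
    "$scenario_subdir" "${exp_name%%_*}" "$precision"
}

SPEC_DECODING=mtp  # exported by the recipe, but unused by the lookup
for isl in 1024 8192; do
  EXP_NAME="dsr1_$(seq_len_str "$isl" 1024)"
  echo "$EXP_NAME (SPEC_DECODING=$SPEC_DECODING) -> $(resolve_script_current "$EXP_NAME" fp8)"
done
```

Both iterations print benchmarks/single_node/dsr1_fp8_mi300x.sh, which is the non-MTP script.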

Why every other launcher handles this

Every launcher that owns an MTP recipe computes a SPEC_SUFFIX and appends it to the script name. Examples from the tree:

  • runners/launch_mi355x-amds.sh:183,226-227: SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '') and SCRIPT_BASE="${EXP_NAME%%_*}_${PRECISION}_mi355x", followed by ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh.
  • runners/launch_b200-cw.sh:8,13, launch_b200-nb.sh:6, launch_b200-dgxc.sh:335, launch_b300-nv.sh:294, launch_h200-cw.sh:8,47, launch_h200-nb.sh:8, and launch_h200-dgxc-slurm.sh:300 all use the same SPEC_SUFFIX pattern.
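The shared pattern, as described for launch_mi355x-amds.sh, can be sketched as a standalone function. The function name and argument order are illustrative. Only the SPEC_SUFFIX and SCRIPT_BASE expressions come from the launcher excerpt.

```shell
#!/usr/bin/env bash
# Sketch of the SPEC_SUFFIX pattern described for launch_mi355x-amds.sh.
# The function name and argument order are illustrative, not from the repo.
set -euo pipefail

mi355x_script_name() {
  local exp_name=$1 precision=$2 framework=$3 spec_decoding=${4:-}
  local spec_suffix script_base
  # "_mtp" only when the recipe asks for MTP; empty otherwise.
  spec_suffix=$([[ "$spec_decoding" == "mtp" ]] && printf '_mtp' || printf '')
  script_base="${exp_name%%_*}_${precision}_mi355x"
  printf '%s\n' "${script_base}_${framework}${spec_suffix}.sh"
}

mi355x_script_name dsr1_1k1k fp8 sglang mtp  # dsr1_fp8_mi355x_sglang_mtp.sh
mi355x_script_name dsr1_1k1k fp8 sglang      # dsr1_fp8_mi355x_sglang.sh
```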

launch_mi300x-amds.sh and (separately) launch_mi325x-amds.sh are the only two launchers without it; until now there was no MTP recipe targeting either runner, so the omission was harmless. This PR is the first MTP recipe on mi300x, so the omission becomes load-bearing.

Impact

This PR is labeled full-sweep-enabled, so when the sweep runs:

  1. The server gets launched without --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens, and without SGLANG_ENABLE_SPEC_V2=1.
  2. The benchmark client measures vanilla autoregressive decode latency/throughput on DeepSeek-R1-0528 fp8.
  3. Results are tagged dsr1-fp8-mi300x-sglang-mtp and written to the perf history.

That publishes non-MTP numbers under the MTP recipe name, which is the opposite of what the PR is trying to demonstrate (EAGLE speedup vs. baseline). The bench script added in this PR has no observable effect.

Step-by-step proof

  1. CI selects the recipe dsr1-fp8-mi300x-sglang-mtp from amd-master.yaml → exports MODEL=deepseek-ai/DeepSeek-R1-0528, PRECISION=fp8, MODEL_PREFIX=dsr1, SPEC_DECODING=mtp.
  2. generate_sweep_configs.py:290 constructs EXP_NAME=dsr1_1k1k (or dsr1_8k1k).
  3. .github/workflows/benchmark-tmpl.yml routes runner: mi300x → invokes runners/launch_mi300x-amds.sh.
  4. Line 41 expands with SCENARIO_SUBDIR="", ${EXP_NAME%%_*}=dsr1, PRECISION=fp8 → bash benchmarks/single_node/dsr1_fp8_mi300x.sh.
  5. That file (origin/main, the non-MTP recipe) runs end-to-end and succeeds; SPEC_DECODING=mtp was never consulted by the launcher.
  6. The new dsr1_fp8_mi300x_mtp.sh is never read.

Fix

Mirror launch_b200-cw.sh:6-16 (or launch_mi355x-amds.sh:183,226) in launch_mi300x-amds.sh: compute SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf _mtp || printf "") early, then change line 41's path to ${EXP_NAME%%_*}_${PRECISION}_mi300x${SPEC_SUFFIX}.sh. It is also worth landing the same patch on launch_mi325x-amds.sh preemptively, since the same problem will resurface the next time someone adds an mi325x MTP recipe.
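Here is a sketch of the proposed launcher change, written as functions so it can be exercised without a GPU runner. The names resolve_script_fixed and run_bench are hypothetical, and the SPEC_SUFFIX expression is the one suggested in the fix. The existence check in run_bench goes beyond that fix. It is an optional guard, assumed here, so that a recipe pointing at a missing script fails loudly instead of silently running something else.

```shell
#!/usr/bin/env bash
# Sketch of the suggested launch_mi300x-amds.sh change (hypothetical helpers).
set -euo pipefail

resolve_script_fixed() {
  local exp_name=$1 precision=$2 spec_decoding=${3:-} scenario_subdir=${4:-}
  local spec_suffix
  spec_suffix=$([[ "$spec_decoding" == "mtp" ]] && printf _mtp || printf "")
  printf 'benchmarks/single_node/%s%s_%s_mi300x%s.sh\n' \
    "$scenario_subdir" "${exp_name%%_*}" "$precision" "$spec_suffix"
}

run_bench() {
  local script
  script=$(resolve_script_fixed "$EXP_NAME" "$PRECISION" \
    "${SPEC_DECODING:-}" "${SCENARIO_SUBDIR:-}")
  # Optional guard (not in the suggested fix): refuse to fall back silently.
  if [[ ! -f "$script" ]]; then
    echo "bench script not found: $script" >&2
    return 1
  fi
  bash "$script"
}
```

With SPEC_DECODING=mtp, dsr1_1k1k now resolves to benchmarks/single_node/dsr1_fp8_mi300x_mtp.sh. Without it, the path is unchanged.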

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

# If the machine runs a MEC FW older than 177, RCCL
# cannot reclaim some memory. Disable to avoid crashes.
version=$(rocm-smi --showfw | grep MEC | head -n 1 | awk '{print $NF}')
if [[ "$version" == "" || $version -lt 177 ]]; then
    export HSA_NO_SCRATCH_RECLAIM=1
fi

export SGLANG_USE_AITER=1
export SGLANG_AITER_MLA_PERSIST=1
export SGLANG_ENABLE_SPEC_V2=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor

set -x
python3 -m sglang.launch_server \
    --model-path=$MODEL --host=0.0.0.0 --port=$PORT --trust-remote-code \
    --tensor-parallel-size=$TP \
    --ep-size $EP_SIZE \
    --mem-fraction-static=0.8 \
    --cuda-graph-max-bs=128 \
    --chunked-prefill-size=131072 \
    --num-continuous-decode-steps=4 \
    --max-prefill-tokens=131072 \
    --kv-cache-dtype fp8_e4m3 \
    --attention-backend aiter \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --disable-radix-cache $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts $(( $CONC * 10 )) \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
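The MEC firmware gate near the top of this script can be factored into a testable function, as sketched below. The function name and the MEC_FW_VERSION input variable are hypothetical; the script itself reads the value from rocm-smi --showfw. The numeric check and the 10# base prefix are extra safeguards assumed here and are not part of the PR. The base prefix keeps a zero-padded value such as 0177 from being read as octal.

```shell
#!/usr/bin/env bash
# Hypothetical, testable form of the MEC firmware gate in the bench script.
set -euo pipefail

needs_no_scratch_reclaim() {
  local version=$1
  # Empty (unknown) versions are treated as old, as in the original check.
  [[ -z "$version" ]] && return 0
  # ASSUMPTION: non-numeric output is also treated as unknown, hence old.
  [[ "$version" =~ ^[0-9]+$ ]] || return 0
  # Force base 10 so leading zeros are not interpreted as octal.
  (( 10#$version < 177 ))
}

# MEC_FW_VERSION is a hypothetical input; the real script parses rocm-smi.
if needs_no_scratch_reclaim "${MEC_FW_VERSION:-}"; then
  export HSA_NO_SCRATCH_RECLAIM=1
fi
```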
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -3050,3 +3050,9 @@
  description:
    - "Update SGLang image from v0.5.11-cu130 (5d old) to v0.5.12-cu130"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1475

- config-keys:
    - dsr1-fp8-mi300x-sglang-mtp
  description:
    - "Add MTP/EAGLE speculative-decoding sibling of dsr1-fp8-mi300x-sglang on lmsysorg/sglang:v0.5.12-rocm700-mi30x"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1499