249 changes: 22 additions & 227 deletions .github/configs/nvidia-master.yaml

Large diffs are not rendered by default.

17 changes: 14 additions & 3 deletions benchmarks/single_node/dsr1_fp4_b300.sh
@@ -16,11 +16,22 @@ check_env_vars \
RESULT_FILENAME \
EP_SIZE

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

nvidia-smi

@@ -44,8 +55,8 @@ fi
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL --host 0.0.0.0 --port $PORT --trust-remote-code \
--tensor-parallel-size=$TP --data-parallel-size=1 \
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL_PATH --served-model-name $MODEL --host 0.0.0.0 --port $PORT --trust-remote-code \
--tensor-parallel-size $TP --data-parallel-size 1 \
--cuda-graph-max-bs 256 --max-running-requests 256 --mem-fraction-static 0.85 --kv-cache-dtype fp8_e4m3 \
--chunked-prefill-size 16384 \
--ep-size $EP_SIZE --quantization modelopt_fp4 --enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
17 changes: 14 additions & 3 deletions benchmarks/single_node/dsr1_fp8_b300.sh
@@ -16,13 +16,24 @@ check_env_vars \
RESULT_FILENAME \
EP_SIZE

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

Comment on lines 34 to 37
🔴 The reduced search-space for dsr1-fp8-b300-sglang is a single point at tp=4, isl=1024, osl=1024, but benchmarks/single_node/dsr1_fp8_b300.sh lines 47-51 contain a guard that exits with status 1 when TP=4 is combined with any shape other than ISL=8192/OSL=1024. Every run of this config will therefore exit before the server launches, defeating the wiring-verification goal stated in the PR description. Fix by changing the sweep point to tp: 8, switching to isl: 8192, or relaxing the script guard.

Extended reasoning...

What the bug is

The PR shrinks the dsr1-fp8-b300-sglang sweep in .github/configs/nvidia-master.yaml to one point:

- isl: 1024
  osl: 1024
  search-space:
  - { tp: 4, ep: 1, conc-start: 4, conc-end: 4 }

But the bench script the launcher resolves for this config, benchmarks/single_node/dsr1_fp8_b300.sh, contains an explicit guard:

elif [[ $TP -eq 4 ]]; then
  if [[ $ISL -ne 8192 ]] || [[ $OSL -ne 1024 ]]; then
    echo "TP=4 not yet supported for ISL=$ISL OSL=$OSL!"
    exit 1
  fi

With TP=4 and ISL=1024, the first test ($ISL -ne 8192) is already true, so the || short-circuits without evaluating the OSL test. The script prints the rejection message and exits with status 1 before sglang.launch_server is ever invoked.
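
The rejection can be reproduced in isolation. The sketch below wraps the TP=4 branch quoted above in a hypothetical function, check_tp_guard. The function name and the use of return instead of exit are assumptions made for illustration, and because this comment does not show how the real script treats TP values other than 4, they are simply accepted here.

```bash
#!/usr/bin/env bash
# Hypothetical stand-alone reproduction of the TP=4 guard quoted above.
# check_tp_guard is an illustrative name, not a function in the repository.
check_tp_guard() {
  local tp=$1 isl=$2 osl=$3
  if [[ $tp -eq 4 ]]; then
    # Same condition as the quoted guard: only ISL=8192/OSL=1024 passes.
    if [[ $isl -ne 8192 ]] || [[ $osl -ne 1024 ]]; then
      echo "TP=4 not yet supported for ISL=$isl OSL=$osl!"
      return 1   # the real script calls `exit 1` here
    fi
  fi
  # Assumption: every other TP value is accepted in this sketch.
  return 0
}

# The sweep point introduced by this PR.
check_tp_guard 4 1024 1024 || echo "rejected with status $?"
```

Run as written, this prints the rejection message followed by "rejected with status 1". Calling check_tp_guard 4 8192 1024 prints nothing and returns 0.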

Code path

  1. runners/launch_b300-nv.sh (agg branch) resolves the bench script as benchmarks/single_node/dsr1_fp8_b300_sglang.sh first, then falls back to dsr1_fp8_b300.sh (LEGACY_FW_SUFFIX is empty for sglang). Only dsr1_fp8_b300.sh exists, so that is what runs.
  2. .github/workflows/benchmark-tmpl.yml exports TP from the matrix entry, so TP=4 is passed into the script.
  3. The TP branch above runs and exits 1.
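
A minimal sketch of the lookup order described in step 1, assuming a plain file-existence check. resolve_bench_script and its arguments are hypothetical; the actual logic in runners/launch_b300-nv.sh is not reproduced in this comment and may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prefer <base>_<framework>.sh, else fall back to
# <base><legacy_suffix>.sh. Names and argument order are assumptions.
resolve_bench_script() {
  local dir=$1 base=$2 framework=$3 legacy_suffix=${4:-}
  local preferred="$dir/${base}_${framework}.sh"
  local legacy="$dir/${base}${legacy_suffix}.sh"
  if [[ -f $preferred ]]; then
    echo "$preferred"
  elif [[ -f $legacy ]]; then
    echo "$legacy"
  else
    echo "no bench script found for $base ($framework)" >&2
    return 1
  fi
}

# Demo in a scratch directory where only the legacy-named script exists,
# mirroring the situation described for dsr1_fp8_b300.
demo_dir=$(mktemp -d)
touch "$demo_dir/dsr1_fp8_b300.sh"
resolve_bench_script "$demo_dir" dsr1_fp8_b300 sglang ""
rm -rf "$demo_dir"
```

With an empty legacy suffix and no dsr1_fp8_b300_sglang.sh present, the helper prints the path ending in dsr1_fp8_b300.sh, which is the script containing the guard.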

Why pre-existing code doesn't prevent it

Before this PR the only ISL=1024/OSL=1024 entry under dsr1-fp8-b300-sglang was tp: 8, which is handled by the TP=8 branch above the guard. TP=4 only appeared under ISL=8192 — the one combination the guard explicitly permits. The PR's reduction removed the safe TP=8 point and replaced it with the exact (TP=4, ISL=1024) combination the guard rejects.

Impact

The PR description explicitly motivates the search-space reduction as wiring verification ("Reduce search-space to single (isl=1024, osl=1024, conc=4) point per config to verify model-path wiring end-to-end"). Because every job for dsr1-fp8-b300-sglang will exit 1 before launching the server, the model-path wiring for this config will not be verified at all: the script bails at the TP dispatch, before reaching the MODEL=… path read or sglang.launch_server --model-path $MODEL.

Step-by-step proof

  1. CI matrix produces a job with tp=4, ep=1, conc=4, isl=1024, osl=1024.
  2. benchmark-tmpl.yml exports TP=4 ISL=1024 OSL=1024 CONC=4 EP_SIZE=1 into the container.
  3. launch_b300-nv.sh (agg branch) selects benchmarks/single_node/dsr1_fp8_b300.sh and execs it.
  4. Script runs check_env_vars (passes), prints SLURM banner, then runs nvidia-smi.
  5. Control reaches the TP dispatch. $TP -eq 8 is false; $TP -eq 4 is true.
  6. Inside the TP=4 branch: $ISL -ne 8192 → true (ISL is 1024). Short-circuit OR makes the outer test true.
  7. Script prints TP=4 not yet supported for ISL=1024 OSL=1024! and exit 1.
  8. sglang.launch_server is never invoked; the universal MODEL rewrite the PR adds is never exercised.

Fix

Three options, any one of which unblocks the run:

  • Change the search-space entry to tp: 8 (matches the previous ISL=1024 sweep).
  • Change isl to 8192 (the one ISL the TP=4 branch permits).
  • Relax the guard in dsr1_fp8_b300.sh so TP=4 supports ISL=1024 (this would require validating the recipe at that shape, since the existing TP=4 tuning was only for 8192/1024).
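
To compare the first two options against the current sweep point, the combinations established in this comment can be tabulated. classify_point is a hypothetical helper: it encodes only what is stated above (tp=8 at 1024/1024 ran before this PR, and tp=4 is permitted only at 8192/1024) and reports every other shape as unknown rather than guessing.

```bash
#!/usr/bin/env bash
# Hypothetical classifier for (tp, isl, osl) sweep points, built only from
# the facts stated in this review comment.
classify_point() {
  local tp=$1 isl=$2 osl=$3
  case "$tp:$isl:$osl" in
    8:1024:1024 | 4:8192:1024) echo "ok" ;;        # known-good combinations
    4:*)                       echo "rejected" ;;  # TP=4 guard exits 1
    *)                         echo "unknown" ;;   # not covered by this comment
  esac
}

# Current sweep point, then the points from the first two fix options.
for point in "4 1024 1024" "8 1024 1024" "4 8192 1024"; do
  read -r tp isl osl <<< "$point"
  printf 'tp=%s isl=%s osl=%s -> %s\n' \
    "$tp" "$isl" "$osl" "$(classify_point "$tp" "$isl" "$osl")"
done
```

This prints "rejected" for the current point and "ok" for the other two. The third option, relaxing the guard, would turn the first line into "ok" only after the TP=4 recipe has been validated at ISL=1024.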

export SGL_ENABLE_JIT_DEEPGEMM=false
export SGLANG_ENABLE_FLASHINFER_GEMM=true
@@ -76,8 +87,8 @@ fi
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
--tensor-parallel-size=$TP --data-parallel-size=1 \
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL_PATH --served-model-name $MODEL --host 0.0.0.0 --port $PORT \
--tensor-parallel-size $TP --data-parallel-size 1 \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE --max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC --kv-cache-dtype fp8_e4m3 --chunked-prefill-size $CHUNKED_PREFILL_SIZE --max-prefill-tokens $MAX_PREFILL_TOKENS \
--enable-flashinfer-allreduce-fusion --scheduler-recv-interval $SCHEDULER_RECV_INTERVAL --disable-radix-cache \
25 changes: 18 additions & 7 deletions benchmarks/single_node/dsr1_fp8_b300_mtp.sh
@@ -16,13 +16,24 @@ check_env_vars \
RESULT_FILENAME \
EP_SIZE

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

export SGLANG_ENABLE_JIT_DEEPGEMM=false

@@ -70,11 +81,11 @@ start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
--model-path=$MODEL \
--host=0.0.0.0 \
--port=$PORT \
--tensor-parallel-size=$TP \
--data-parallel-size=1 \
--model-path $MODEL_PATH --served-model-name $MODEL \
--host 0.0.0.0 \
--port $PORT \
--tensor-parallel-size $TP \
--data-parallel-size 1 \
--cuda-graph-max-bs $CUDA_GRAPH_MAX_BATCH_SIZE \
--max-running-requests $MAX_RUNNING_REQUESTS \
--mem-fraction-static $MEM_FRAC_STATIC \
@@ -84,7 +95,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server \
--enable-flashinfer-allreduce-fusion \
--scheduler-recv-interval $SCHEDULER_RECV_INTERVAL \
--disable-radix-cache \
--fp8-gemm-backend=flashinfer_trtllm \
--fp8-gemm-backend flashinfer_trtllm \
--attention-backend trtllm_mla \
--stream-interval 30 \
--ep-size $EP_SIZE \
20 changes: 13 additions & 7 deletions benchmarks/single_node/dsv4_fp4_b300_sglang.sh
@@ -12,14 +12,20 @@ check_env_vars \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

# The B300 runner overrides MODEL to a pre-staged /data/models path, so skip
# `hf download`. Only fetch when MODEL looks like a HF repo ID.
if [[ "$MODEL" != /* ]]; then
    hf download "$MODEL"
if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi
@@ -172,7 +178,7 @@ fi

set -x
PYTHONNOUSERSITE=1 sglang serve \
--model-path $MODEL \
--model-path $MODEL_PATH --served-model-name $MODEL \
--host 0.0.0.0 \
--port $PORT \
--trust-remote-code \
20 changes: 13 additions & 7 deletions benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh
@@ -23,14 +23,20 @@ check_env_vars \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

# The B300 runner overrides MODEL to a pre-staged /data/models path, so skip
# `hf download`. Only fetch when MODEL looks like a HF repo ID.
if [[ "$MODEL" != /* ]]; then
    hf download "$MODEL"
if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi
@@ -121,7 +127,7 @@ fi

set -x
PYTHONNOUSERSITE=1 sglang serve \
--model-path $MODEL \
--model-path $MODEL_PATH --served-model-name $MODEL \
--host 0.0.0.0 \
--port $PORT \
--trust-remote-code \
18 changes: 13 additions & 5 deletions benchmarks/single_node/dsv4_fp4_b300_trt.sh
@@ -18,6 +18,18 @@ check_env_vars \
DP_ATTENTION \
EP_SIZE

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi
@@ -47,10 +59,6 @@ sanitize_slurm_mpi_env_for_trtllm
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

if [[ "$MODEL" != /* ]]; then
    hf download "$MODEL"
fi

nvidia-smi

SERVER_LOG="$PWD/server.log"
@@ -108,7 +116,7 @@ start_gpu_monitor --output "$PWD/gpu_metrics.csv"

set -x
SERVE_CMD=(
trtllm-serve "$MODEL" \
trtllm-serve "$MODEL_PATH" \
--host 0.0.0.0 \
--port "$PORT" \
--trust_remote_code \
18 changes: 13 additions & 5 deletions benchmarks/single_node/dsv4_fp4_b300_trt_mtp.sh
@@ -17,6 +17,18 @@ check_env_vars \
DP_ATTENTION \
EP_SIZE

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi
@@ -46,10 +58,6 @@ sanitize_slurm_mpi_env_for_trtllm
export NCCL_NVLS_ENABLE="${NCCL_NVLS_ENABLE:-0}"
echo "NCCL_NVLS_ENABLE: $NCCL_NVLS_ENABLE"

if [[ "$MODEL" != /* ]]; then
    hf download "$MODEL"
fi

nvidia-smi

SERVER_LOG="$PWD/server.log"
@@ -111,7 +119,7 @@ start_gpu_monitor --output "$PWD/gpu_metrics.csv"

set -x
SERVE_CMD=(
trtllm-serve "$MODEL" \
trtllm-serve "$MODEL_PATH" \
--host 0.0.0.0 \
--port "$PORT" \
--trust_remote_code \
15 changes: 13 additions & 2 deletions benchmarks/single_node/dsv4_fp4_b300_vllm.sh
@@ -17,13 +17,24 @@ check_env_vars \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
@@ -67,7 +78,7 @@ fi
start_gpu_monitor

set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
vllm serve "$MODEL_PATH" --served-model-name "$MODEL" --host 0.0.0.0 --port "$PORT" \
"${PARALLEL_ARGS[@]}" \
--pipeline-parallel-size 1 \
--kv-cache-dtype fp8 \
15 changes: 13 additions & 2 deletions benchmarks/single_node/dsv4_fp4_b300_vllm_mtp.sh
@@ -13,13 +13,24 @@ check_env_vars \
RANDOM_RANGE_RATIO \
RESULT_FILENAME

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
@@ -60,7 +71,7 @@ NUM_SPEC_TOKENS=2
start_gpu_monitor

set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
vllm serve "$MODEL_PATH" --served-model-name "$MODEL" --host 0.0.0.0 --port "$PORT" \
"${PARALLEL_ARGS[@]}" \
--pipeline-parallel-size 1 \
--kv-cache-dtype fp8 \
19 changes: 15 additions & 4 deletions benchmarks/single_node/glm5_fp4_b300.sh
@@ -16,13 +16,24 @@ check_env_vars \
RESULT_FILENAME \
EP_SIZE

# `hf download` creates the target dir if missing and is itself idempotent.
# When MODEL_PATH is unset (stand-alone runs), fall back to the default HF_HUB_CACHE and use the repo ID.
# Either way, MODEL_PATH is what the server is launched with.
if [[ -n "${MODEL_PATH:-}" ]]; then
    if [[ ! -d "$MODEL_PATH" || -z "$(ls -A "$MODEL_PATH" 2>/dev/null)" ]]; then
        hf download "$MODEL" --local-dir "$MODEL_PATH"
    fi
else
    hf download "$MODEL"
    export MODEL_PATH="$MODEL"
fi

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
@@ -38,9 +49,9 @@ fi
start_gpu_monitor

set -x
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.0.0.0 --port=$PORT \
PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path $MODEL_PATH --served-model-name $MODEL --host 0.0.0.0 --port $PORT \
--trust-remote-code \
--tensor-parallel-size=$TP \
--tensor-parallel-size $TP \
--data-parallel-size 1 --expert-parallel-size $EP_SIZE \
--disable-radix-cache \
--quantization modelopt_fp4 \
@@ -56,7 +67,7 @@ PYTHONNOUSERSITE=1 python3 -m sglang.launch_server --model-path=$MODEL --host=0.
--stream-interval 30 \
--scheduler-recv-interval 10 \
--tokenizer-worker-num 6 \
--tokenizer-path $MODEL $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &
--tokenizer-path $MODEL_PATH $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!
