6 changes: 5 additions & 1 deletion runners/launch_mi355x-amds.sh
@@ -187,7 +187,11 @@
LOCK_FILE="${SQUASH_FILE}.lock"

set -x
-salloc --partition=$PARTITION --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"
+# Exclude known-bad mi355x compute nodes (KLAUD_DEBUG §5.1 / §5.2):
+# mia1-p01-g09: pyxis broken (persistently fails to create container filesystem)
+# mia1-p01-g11: docker.sock permissions denied (cluster-cleanup step fails)
+# Both have been root-caused via #1431/#1432/#1440/#1441/#1443 sweep failures.
+salloc --partition=$PARTITION --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP --exclusive --cpus-per-task=128 --time=500 --no-shell --job-name="$RUNNER_NAME"

Check warning on line 194 in runners/launch_mi355x-amds.sh

Claude / Claude Code Review

Exclude list misses g12/g31 which share the same docker.sock failure as g11

Comment on lines +190 to +194

🟡 The new --exclude=mia1-p01-g09,mia1-p01-g11 only covers 1 of the 3 nodes that KLAUD_DEBUG.md §5.2 explicitly groups as sharing the docker.sock-permissions failure (mia1-p01-g11 / g12 / g31). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very next srun ... docker stop $(docker ps -a -q) (line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to --exclude=mia1-p01-g[09,11,12,31] (or comma-separated equivalent).

What the bug is

This PR adds --exclude=mia1-p01-g09,mia1-p01-g11 to the salloc on line 194 of runners/launch_mi355x-amds.sh, citing KLAUD_DEBUG §5.1 / §5.2 as the justification. However, §5.2 of that very file (lines 114-116) explicitly groups three nodes together as sharing the identical failure mode:

5.2 mia1-p01-g11 / g12 / g31 — docker socket perms

Symptom: mi355x jobs fail with permission denied while trying to connect to the docker API at unix:///var/run/docker.sock during the docker stop $(docker ps -a -q) cleanup step, cascading into SLURM job expiration.
Fix: ops needs to fix docker group / socket perms on these nodes. Recipe-level workaround: none.

The PR only excludes g11, leaving g12 and g31 reachable by SLURM with the documented identical defect.

Why existing code does not prevent this

The g19/g37 nodes from §5.1 don't need to be in --exclude because §5.1 says they are kept in State=DRAIN/DOWN by ops, so salloc won't allocate to them anyway. But §5.2 makes no such claim about g12/g31 — it explicitly states "Recipe-level workaround: none", meaning they are not drained at the SLURM level. The very fact that the PR had to add g09 to --exclude despite §5.1 calling it "persistently drained" demonstrates the drain state is unreliable in practice (consistent with §5.6 about DYNAMIC_NORM nodes auto-clearing DRAIN).

Impact / proof

Walk through the failure case:

  1. The runner script is invoked, e.g. via one of the affected PRs: #1431 (Update dsr1-fp4-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x), #1432 (Update dsr1-fp8-mi355x-sglang SGLang image to v0.5.12-rocm700-mi35x), #1440 (Update glm5-fp8-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517), #1441 ([Handoff to @Oseltamivir Claude /loop] Update glm5.1-fp4-mi355x-sglang SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517), or #1443 (Update qwen3.5-bf16-mi355x-sglang and mtp SGLang ROCm image to v0.5.12-rocm720-mi35x-20260517).
  2. salloc --partition=compute --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP ... is issued (line 194).
  3. SLURM picks an available node. Since g12 and g31 are not in --exclude and are not drained per §5.2, they remain in the allocation pool alongside healthy nodes.
  4. Suppose SLURM picks mia1-p01-g12.
  5. The next command runs: srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)" (line 197).
  6. On g12, docker ps -a -q fails with permission denied while trying to connect to the docker API at unix:///var/run/docker.sock — the exact symptom from §5.2.
  7. The cleanup step exits non-zero, cascading into SLURM job expiration — the identical failure the PR is trying to prevent for the g11 case.

The PR's empirical argument ("every failure landed on g09/g11 across 5 sweep PRs") is consistent with sampling luck on a ~12-node pool with several nodes already drained. With only ~5 trials, never landing on g12 or g31 has a non-trivial probability even if the underlying defect is present, and §5.2 explicitly says it is present.
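
To put a rough number on the sampling argument, here is a minimal sketch under a deliberately simplified model: each job lands on one node chosen uniformly and independently from the allocatable pool. The pool size of 8 is a hypothetical figure picked for illustration (the review only says "~12-node pool with several already drained"), and real SLURM scheduling is not uniform, so treat the result as an order-of-magnitude guide only.

```shell
#!/usr/bin/env bash
# Probability that `trials` independent, uniformly random single-node
# allocations all miss the `bad` defective nodes in a pool of `pool` nodes.
# ASSUMPTION: uniform, independent allocation; SLURM does not guarantee this.
miss_probability() {
  local pool=$1 bad=$2 trials=$3
  LC_ALL=C awk -v n="$pool" -v b="$bad" -v t="$trials" \
    'BEGIN { printf "%.3f\n", ((n - b) / n) ^ t }'
}

# Hypothetical pool of 8 allocatable nodes, 2 of them bad (g12, g31), 5 trials.
miss_probability 8 2 5   # (6/8)^5
```

Under these assumed numbers the chance of never hitting g12 or g31 in five runs is roughly one in four, which is why zero observed failures on those nodes is weak evidence that they are healthy.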

How to fix

Extend the exclude list to cover all three §5.2 nodes in addition to g09:

salloc --partition=$PARTITION --exclude=mia1-p01-g[09,11,12,31] ...

or equivalently --exclude=mia1-p01-g09,mia1-p01-g11,mia1-p01-g12,mia1-p01-g31. The inline comment should be updated correspondingly. This is a follow-up improvement rather than a regression — the PR strictly improves the baseline, so it does not need to block on this, but the documented gap is worth closing in the same change.
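
One way to keep the list and its comment from drifting apart is to hold the node names in a single array and derive the --exclude value from it. The following is a sketch only; the names BAD_NODES, EXCLUDE_LIST and join_by_comma are hypothetical and do not appear in runners/launch_mi355x-amds.sh.

```shell
#!/usr/bin/env bash
# Known-bad nodes named in this review:
#   g09           pyxis broken (§5.1)
#   g11/g12/g31   docker.sock permission denied (§5.2)
# BAD_NODES is a hypothetical variable, not part of the existing script.
BAD_NODES=(mia1-p01-g09 mia1-p01-g11 mia1-p01-g12 mia1-p01-g31)

# Join all arguments with commas, the separator salloc --exclude accepts.
join_by_comma() {
  local IFS=,
  printf '%s\n' "$*"
}

EXCLUDE_LIST=$(join_by_comma "${BAD_NODES[@]}")

# Print the flag rather than calling salloc, so the sketch runs anywhere.
echo "--exclude=$EXCLUDE_LIST"
```

The resulting string would then be passed to the existing salloc call as --exclude="$EXCLUDE_LIST", and adding or removing a node becomes a one-line change to the array.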

JOB_ID=$(squeue --name="$RUNNER_NAME" -h -o %A | head -n1)

srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"