-
Notifications
You must be signed in to change notification settings - Fork 175
[Klaud Cold] runners(mi355x): exclude broken nodes mia1-p01-g09 + mia1-p01-g11 #1498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 The new
--exclude=mia1-p01-g09,mia1-p01-g11only covers 1 of the 3 nodes thatKLAUD_DEBUG.md §5.2explicitly groups as sharing the docker.sock-permissions failure (mia1-p01-g11 / g12 / g31). §5.2 also states "Recipe-level workaround: none" — i.e. g12 and g31 are not drained at the SLURM level, so salloc can still land on them and the very nextsrun ... docker stop $(docker ps -a -q)(line 197) will hit the identical cascade this PR is trying to prevent. Consider extending to--exclude=mia1-p01-g[09,11,12,31](or comma-separated equivalent).Extended reasoning...
What the bug is
This PR adds
--exclude=mia1-p01-g09,mia1-p01-g11to thesallocon line 191 ofrunners/launch_mi355x-amds.sh, citingKLAUD_DEBUG §5.1 / §5.2as the justification. However, §5.2 of that very file (lines 114-116) explicitly groups three nodes together as sharing the identical failure mode:The PR only excludes
g11, leavingg12andg31reachable by SLURM with the documented identical defect.Why existing code does not prevent this
The
g19/g37nodes from §5.1 don't need to be in--excludebecause §5.1 says they are kept inState=DRAIN/DOWNby ops, sosallocwon't allocate to them anyway. But §5.2 makes no such claim aboutg12/g31— it explicitly states "Recipe-level workaround: none", meaning they are not drained at the SLURM level. The very fact that the PR had to addg09to--excludedespite §5.1 calling it "persistently drained" demonstrates the drain state is unreliable in practice (consistent with §5.6 about DYNAMIC_NORM nodes auto-clearing DRAIN).Impact / proof
Walk through the failure case:
salloc --partition=compute --exclude=mia1-p01-g09,mia1-p01-g11 --gres=gpu:$TP ...is issued (line 191).g12andg31are not in--excludeand are not drained per §5.2, they remain in the allocation pool alongside healthy nodes.mia1-p01-g12.srun --jobid=$JOB_ID bash -c "docker stop \$(docker ps -a -q)"(line 197).g12,docker ps -a -qfails withpermission denied while trying to connect to the docker API at unix:///var/run/docker.sock— the exact symptom from §5.2.g11case.The PR's empirical argument ("every failure landed on g09/g11 across 5 sweep PRs") is sampling luck on a ~12-node pool with several already drained. With only ~5 trials, observing
g12/g310 times has a non-trivial probability even if the underlying defect is present, and §5.2 explicitly says it is present.How to fix
Extend the exclude list to cover all three §5.2 nodes:
salloc --partition=$PARTITION --exclude=mia1-p01-g[09,11,12,31] ...or equivalently
--exclude=mia1-p01-g09,mia1-p01-g11,mia1-p01-g12,mia1-p01-g31. The inline comment should be updated correspondingly. This is a follow-up improvement rather than a regression — the PR strictly improves the baseline, so it does not need to block on this, but the documented gap is worth closing in the same change.