More robust restart detection #132
Open
adamdempsey90 wants to merge 3 commits into
…er has timed out.
Previously, we checked the actual return code of artemis to decide whether to restart. This can run into an issue under Slurm: once `srun` detects that one process has exited, it sends `SIGTERM` to all remaining processes, which causes them to return with code 143 instead of the artemis value.

Because of this, I have added a file-detection-based scheme that touches a file called `DO-NOT-RESTART` on startup. The only way to remove this file is for the driver to return with `timeout`, at which point the code removes the file. If you try to restart the code while this file is present, it will exit with a fatal error. Automatic resubmission should check for the existence of that file before restarting. I have updated the docs to reflect this.

I have tested this by hand and it works as expected.
Closes #131
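
For anyone wiring this into a resubmission script, here is a rough Python sketch of the sentinel-file logic described above. It is not the code in this PR. The function names, the `run_dir` argument, and the status strings other than `timeout` are placeholders chosen for illustration; only the `DO-NOT-RESTART` file name and the overall behaviour come from the description.

```python
import signal
from pathlib import Path

# File name taken from the PR description.
SENTINEL = "DO-NOT-RESTART"

# A process killed by SIGTERM is conventionally reported as 128 + signal number,
# which is where the 143 seen under srun comes from.
SIGTERM_EXIT_CODE = 128 + int(signal.SIGTERM)


def sentinel_path(run_dir):
    """Location of the sentinel file inside a (hypothetical) run directory."""
    return Path(run_dir) / SENTINEL


def on_startup(run_dir, is_restart):
    """Refuse to restart while the sentinel exists, then (re)create it."""
    path = sentinel_path(run_dir)
    if is_restart and path.exists():
        raise RuntimeError(f"{SENTINEL} is present in {run_dir}; refusing to restart")
    path.touch()


def on_exit(run_dir, status):
    """Remove the sentinel only when the driver stopped because of a timeout.

    The status strings are assumptions; the real driver has its own status type.
    """
    if status == "timeout":
        sentinel_path(run_dir).unlink(missing_ok=True)


def should_resubmit(run_dir):
    """Check a resubmission script can make instead of trusting the exit code."""
    return not sentinel_path(run_dir).exists()
```

A batch script would call something like `should_resubmit` after `srun` returns and resubmit only when it is true, regardless of whether the job step exited with 143 or with the artemis value.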
Background
Description of Changes
Checklist
// This file was created in part or in whole by generative AI