More robust restart detection #132

Open
adamdempsey90 wants to merge 3 commits into develop from dempsey/restarts

@adamdempsey90 adamdempsey90 commented May 26, 2026

Collaborator

Previously, we checked the return code of artemis to decide whether to restart. This breaks under Slurm: once srun detects that one process has exited, it sends SIGTERM to all remaining processes, which causes them to exit with code 143 instead of the artemis value.
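For context, 143 is the conventional shell encoding of "terminated by SIGTERM" (128 + 15). The sketch below is illustrative only and not part of this PR; the helper names `shell_exit_code` and `killed_by_sigterm` are hypothetical. It shows why a wrapper that inspects the exit status cannot recover the value artemis intended to return once the process has been terminated by a signal.

```python
import signal


def shell_exit_code(returncode: int) -> int:
    """Convert a subprocess-style return code to what a POSIX shell reports.

    Python's subprocess module reports a child killed by signal N as -N,
    while shells report 128 + N. Normal exit codes pass through unchanged.
    """
    if returncode < 0:
        return 128 + (-returncode)
    return returncode


def killed_by_sigterm(code: int) -> bool:
    """True if a shell-style exit code corresponds to SIGTERM (128 + 15)."""
    return code == 128 + signal.SIGTERM


# A process terminated by SIGTERM shows up as 143, regardless of the
# exit code the application would otherwise have returned.
print(shell_exit_code(-signal.SIGTERM))
print(killed_by_sigterm(143))
```

Because the signal overrides the application's own status, any restart logic keyed on the exit code alone is unreliable in this situation.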

Because of this, I have added a file-based detection scheme: on startup, the code touches a file called DO-NOT-RESTART. The file is removed only when the driver returns with a timeout, at which point the code deletes it. If you try to restart the code while this file is present, the run aborts with a fatal error. Automatic resubmission should check for the existence of that file before restarting. I have updated the docs to reflect this.
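A minimal sketch of the scheme described above, written in Python for illustration rather than taken from the artemis source. The file name DO-NOT-RESTART comes from this PR; the function names, and the assumption that the file lives in the run directory, are hypothetical.

```python
from pathlib import Path

SENTINEL = "DO-NOT-RESTART"  # file name from the PR description


def mark_started(run_dir: Path) -> None:
    """On startup: create the sentinel so a restart is blocked by default."""
    (run_dir / SENTINEL).touch()


def clear_after_timeout(run_dir: Path) -> None:
    """When the driver returns with a timeout: remove the sentinel."""
    (run_dir / SENTINEL).unlink(missing_ok=True)


def may_restart(run_dir: Path) -> bool:
    """Resubmission check: restart only if the sentinel is absent."""
    return not (run_dir / SENTINEL).exists()
```

An automatic resubmission script would call something like `may_restart` before queuing the next job, instead of inspecting the exit code of the previous one.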

I have tested this by hand and it works as expected.

Closes #131

Background

Description of Changes

Checklist

  • New features are documented
  • Tests added for bug fixes and new features
  • (@lanl.gov employees) Update copyright on changed files
  • Any contribution that was created or modified with the assistance of generative AI must have a comment disclosing this such as // This file was created in part or in whole by generative AI

@adamdempsey90 adamdempsey90 requested a review from pdmullen May 26, 2026 15:26
Development

Successfully merging this pull request may close these issues.

Exit codes