feat!: timeouts#457
Conversation
7182a85 to
111e53f
Compare
This is a problem; if this is something we are going to use for the benchmarking study we need this to work with singularity because CHTC only uses singularity/apptainer. |
|
I originally assumed this was a more esoteric PR to test the error-handling workflow. Though, based on the meeting just now, I'll look into a nice way to get this working with Singularity. |
There was a problem hiding this comment.
Here is my first round review. I like the empty files if an error occurs, and it is good to have an associated log explaining why.
(Adding this comment again here) My only issue with this is the definition of "error." If a parameter combination fails the heuristics check, I do not want the output to be empty. I want the output to reflect what was actually produced, so I do not have to rerun that combination even though it "failed" the heuristics. I should be able to freely update the heuristics and have that output counted if it now passes, without rerunning combinations that previously produced empty output. In short, a parameter combination failing the heuristics should not be classified as an error as defined in this PR.
This PR not working with singularity is a very big problem because of the chtc integration.
"ML requires at least one pathway, and failing pathways can break ML-work. How do we want to handle downstream analysis when errors occur." This seems like a separate problem that I will fix internally in the ML code (I want to still make figures if we only have one pathway or a set of empty pathways).
This perplexes me but from my tests we do not need --keep-going. I do not know my original intent here
Co-Authored-By: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
The small part here that is adaptable is that errors, in this PR, are only made in the |
|
This works with apptainer now, but this now touches an untested part of profiling, so that part needs a review from @jhiemstrawisc. |
ntalluri
left a comment
There was a problem hiding this comment.
here is half of a review on this pr
Co-Authored-By: Neha Talluri <78840540+ntalluri@users.noreply.github.com>
|
I've made a few changes:
The final bullet point here also means that the conditionals PR will now actually depend on this PR. |
|
does this need to use #464? |
|
No - I was thinking that the |
why did i not do this before???
this is why I didn't do this earlier...
|
I seem to have messed with imports when I went to pass in the entirety of |
jhiemstrawisc
left a comment
There was a problem hiding this comment.
A few comments sprinkled throughout. The bigger issue for me is that this doesn't appear to run in its current state. Using a relatively stock config/config.yaml I get a runtime error:
RuleException:
AttributeError in file "/Users/jhiemstra/Desktop/dev/spras/spras/Snakefile", line 302:
'int' object has no attribute 'timeout'
File "/Users/jhiemstra/Desktop/dev/spras/spras/Snakefile", line 302, in __rule_reconstruct
File "/Users/jhiemstra/Desktop/dev/spras/spras/spras/runner.py", line 58, in run
File "/Users/jhiemstra/Desktop/dev/spras/spras/spras/prm.py", line 81, in run_typeless
File "/Users/jhiemstra/Desktop/dev/spras/spras/spras/pathlinker.py", line 119, in run
Exiting because a job execution failed. Look below for error messages
WorkflowError:
At least one job did not complete successfully.
| - 10 | ||
| - 20 | ||
| - 70 | ||
| params: |
There was a problem hiding this comment.
Just a general note that changing the config file like this constitutes a breaking change -- it will require a new spras major release. Is it strictly necessary?
| # and we touch pathway_file still: Snakemake doesn't have optional files, so we output a 'artifact info' file, | ||
| # which contains the status (success/failure) of specific Snakemake jobs. | ||
| # We filter for the successful files (such as ones that didn't time out) with the `filter_successful` function. | ||
| Path(output.pathway_file).touch() |
There was a problem hiding this comment.
What are the consequences to the user here? This seems like it has the potential to confuse someone by producing empty pathway files alongside a new error reporting mechanism outside of Snakemake.
| ) | ||
|
|
||
| try: | ||
| container_obj.wait(timeout=timeout) |
There was a problem hiding this comment.
I brushed up on the docs for this API -- it seems like wait() should be returning the container's ultimate status:
Returns: The API’s response as a Python dictionary, including
the container’s exit code under the StatusCode attribute.
Should we be checking it?
| container_obj.stop() | ||
| client.close() | ||
| if timeout: raise TimeoutError(timeout) from err | ||
| else: raise RuntimeError("Timeout error but no timeout specified. Please file an issue with this error and stacktrace at https://github.com/Reed-CompBio/spras/issues/new.") from None |
There was a problem hiding this comment.
Should this catch other types of errors?
| if timeout: raise TimeoutError(timeout) from err | ||
| else: raise RuntimeError("Timeout error but no timeout specified. Please file an issue with this error and stacktrace at https://github.com/Reed-CompBio/spras/issues/new.") from None | ||
|
|
||
| out = container_obj.attach(stderr=True).decode('utf-8') |
There was a problem hiding this comment.
I think the completion of wait() implies the container is already stopped (either of its own accord or because the timeout was reached). What happens to this attach command if the container is already stopped?
|
|
||
| # As per unix `timeout`, this is the status if the command times out and --preserve-status is not initially specified | ||
| # (where the latter above holds). | ||
| if proc.returncode == 124: |
There was a problem hiding this comment.
Should this have handling for other error codes?
we also add mark_success as the non-negated analogue for testing.
Adds
timeoutto algorithms as a demonstration of passing through errors, and introduces run settings, breaking the original configuration design by introducing a newparamsoption. Closes #316.Caveat: ML requires at least one pathway, and failing pathways can break ML-work. How do we want to handle downstream analysis when errors occur (including in the future heuristic errors?)