feat!: timeouts by tristan-f-r · Pull Request #457 · Reed-CompBio/spras

tristan-f-r · 2026-01-13T14:29:13Z

Adds a timeout to algorithms as a demonstration of passing errors through the workflow, and introduces run settings. This breaks the original configuration design by introducing a new `params` option. Closes #316.

Caveat: ML requires at least one pathway, and failing pathways can break the ML work. How do we want to handle downstream analysis when errors occur (including future heuristic errors)?

read-the-docs-community · 2026-01-13T14:34:06Z

Documentation build overview

📚 spras | 🛠️ Build #32896638 | 📁 Comparing 01cd649 against latest (87b314c)

🔍 Preview build

9 files changed · + 2 added · ± 7 modified

ntalluri · 2026-02-05T16:14:41Z

This does not work with Singularity. Singularity has no equivalent of `docker wait`, so implementing timeouts there would probably require constantly polling a detached thread.

This is a problem: if we are going to use this for the benchmarking study, it needs to work with Singularity, because CHTC only uses Singularity/Apptainer.
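
One alternative to polling is to put the time limit on the client process that launches the container. The following is a minimal sketch under that assumption, not the approach taken in this PR: `run_with_timeout` is a hypothetical helper, and the sketch does not verify that killing a Singularity/Apptainer client process also stops everything it started.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> subprocess.CompletedProcess:
    """Run cmd and raise TimeoutError if it runs longer than timeout_s seconds.

    Hypothetical helper for illustration; cmd could be a container CLI invocation.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
    try:
        out, err = proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired as exc:
        # The child is still running: kill it, then reap it so no zombie is left behind.
        proc.kill()
        proc.communicate()
        raise TimeoutError(f"command exceeded {timeout_s} seconds") from exc
    return subprocess.CompletedProcess(cmd, proc.returncode, out, err)


if __name__ == "__main__":
    # A fast command finishes normally; a slow one would raise TimeoutError.
    result = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_s=10)
    print(result.returncode, result.stdout.strip())
```

Raising the built-in `TimeoutError` keeps the caller's handling the same regardless of which container backend produced the timeout.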

tristan-f-r · 2026-02-05T17:25:31Z

I originally assumed this was a more esoteric PR to test the error-handling workflow. However, based on the meeting just now, I'll look into a nice way to get this working with Singularity.

ntalluri

Here is my first-round review. I like the empty files if an error occurs, and it is good to have an associated log explaining why.
(Adding this comment again here) My only issue with this is the definition of "error." If a parameter combination fails the heuristics check, I do not want the output to be empty. I want the output to reflect what was actually produced, so I do not have to rerun that combination even though it "failed" the heuristics. I should be able to freely update the heuristics and have that output counted if it now passes, without rerunning combinations that previously produced empty output. In short, a parameter combination failing the heuristics should not be classified as an error as defined in this PR.

This PR not working with Singularity is a very big problem because of the CHTC integration.

"ML requires at least one pathway, and failing pathways can break ML-work. How do we want to handle downstream analysis when errors occur." This seems like a separate problem that I will fix internally in the ML code (I want to still make figures if we only have one pathway or a set of empty pathways).

tristan-f-r · 2026-04-23T20:27:11Z

> My only issue with this is the definition of "error." If a parameter combination fails the heuristics check, I do not want the output to be empty. I want the output to reflect what was actually produced, so I do not have to rerun that combination even though it "failed" the heuristics.

The adaptable part here is that, in this PR, errors are only produced in the `reconstruct` rule. We can define other errors in some `heuristics` rule and rewire other rules to depend on the success of `heuristics` instead of `reconstruct`: where the input rule above says `resource_info = rules.reconstruct.output.resource_info`, we would say `rules.heuristics.output.resource_info` instead.

tristan-f-r · 2026-04-25T22:13:29Z

This works with apptainer now, but this now touches an untested part of profiling, so that part needs a review from @jhiemstrawisc.

ntalluri

Here is half of a review on this PR.

tristan-f-r · 2026-05-04T03:00:55Z

I've made a few changes:

- Following your line of thought that errors are important objects to document, I've pushed a commit that moves some of the error object handling out of the Snakefile into an `errors.py`. We also use pydantic to avoid arbitrary JSON handling (and to give SPRAS automation users an easy way to parse our error objects). This was also motivated by a type error I encountered with `mark_error`, which is why I've opted to make this type-safe.
- I've also renamed resource -> artifact, as I found the former a little more confusing, though the latter isn't an optimal name either: see the top-level comment on `errors.py`.
- `timeout` is now per-run, with optional specification for an entire algorithm. This refactor was intentionally done for conditionals, though it can also be useful for many other configuration settings. This also introduces a new key, `params`, under all runs. We mark the PR as breaking because of this.

The final bullet point here also means that the conditionals PR will now actually depend on this PR.
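
To make the per-run versus per-algorithm precedence concrete, here is a minimal sketch. The name `RunSettings` comes from this thread, but its fields, the seconds unit, and the `resolve_settings` helper are assumptions for illustration. The PR itself uses pydantic, whereas this sketch uses a plain dataclass to stay dependency-free.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunSettings:
    # Assumed field: timeout in seconds, where None means "no limit".
    timeout: Optional[int] = None


def resolve_settings(algorithm_level: RunSettings, run_level: RunSettings) -> RunSettings:
    """Hypothetical helper: a per-run value wins, else fall back to the algorithm-wide one."""
    timeout = run_level.timeout if run_level.timeout is not None else algorithm_level.timeout
    return RunSettings(timeout=timeout)


if __name__ == "__main__":
    algorithm_default = RunSettings(timeout=600)
    print(resolve_settings(algorithm_default, RunSettings()))            # inherits 600
    print(resolve_settings(algorithm_default, RunSettings(timeout=30)))  # run overrides to 30
```

Comparing against `None` rather than relying on truthiness keeps an explicit per-run value from being silently replaced by the algorithm-wide default.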

ntalluri · 2026-05-06T15:06:58Z

does this need to use #464?

tristan-f-r · 2026-05-06T19:31:52Z

No - I was thinking that the RunSettings here can be more generally applicable to #464.

tristan-f-r · 2026-05-07T17:03:42Z

I seem to have broken the imports when I switched to passing in the entire `run_settings` instead of just `timeout`. I'll fix this 👍.

jhiemstrawisc

A few comments sprinkled throughout. The bigger issue for me is that this doesn't appear to run in its current state. Using a relatively stock config/config.yaml I get a runtime error:

RuleException:
AttributeError in file "/Users/jhiemstra/Desktop/dev/spras/spras/Snakefile", line 302:
'int' object has no attribute 'timeout'
  File "/Users/jhiemstra/Desktop/dev/spras/spras/Snakefile", line 302, in __rule_reconstruct
  File "/Users/jhiemstra/Desktop/dev/spras/spras/spras/runner.py", line 58, in run
  File "/Users/jhiemstra/Desktop/dev/spras/spras/spras/prm.py", line 81, in run_typeless
  File "/Users/jhiemstra/Desktop/dev/spras/spras/spras/pathlinker.py", line 119, in run
Exiting because a job execution failed. Look below for error messages
WorkflowError:
At least one job did not complete successfully.

jhiemstrawisc · 2026-05-27T15:36:07Z

-          - 10
-          - 20
-          - 70
+        params:


Just a general note that changing the config file like this constitutes a breaking change -- it will require a new spras major release. Is it strictly necessary?

jhiemstrawisc · 2026-05-27T17:42:12Z

+            # and we touch pathway_file still: Snakemake doesn't have optional files, so we output a 'artifact info' file,
+            # which contains the status (success/failure) of specific Snakemake jobs.
+            # We filter for the successful files (such as ones that didn't time out) with the `filter_successful` function.  
+            Path(output.pathway_file).touch()


What are the consequences to the user here? This seems like it has the potential to confuse someone by producing empty pathway files alongside a new error reporting mechanism outside of Snakemake.
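
For readers following the mechanism under discussion, the sketch below shows one way a status sidecar file next to an always-present pathway file could work. The function names `mark_error`, `mark_success`, `is_error`, and `filter_successful` appear in this PR, but the signatures and the JSON layout used here are assumptions, not the PR's actual implementation.

```python
import json
from pathlib import Path


def mark_error(info_file: str, reason: str) -> None:
    """Record that the job producing this artifact failed (assumed JSON layout)."""
    Path(info_file).write_text(json.dumps({"status": "error", "reason": reason}))


def mark_success(info_file: str) -> None:
    """Record that the job producing this artifact succeeded."""
    Path(info_file).write_text(json.dumps({"status": "success"}))


def is_error(info_file: str) -> bool:
    """Return True if the sidecar file records a failure."""
    return json.loads(Path(info_file).read_text())["status"] == "error"


def filter_successful(artifacts: list[tuple[str, str]]) -> list[str]:
    """Keep only pathway files whose sidecar info file does not record an error.

    Each item is a (pathway_file, info_file) pair; this pairing is an assumption.
    """
    return [pathway for pathway, info in artifacts if not is_error(info)]
```

Under this layout an empty pathway file is only meaningful together with its info file, which is the source of the potential confusion raised above.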

jhiemstrawisc · 2026-05-27T18:04:55Z

+    )
+
+    try:
+        container_obj.wait(timeout=timeout)


I brushed up on the docs for this API -- it seems like wait() should be returning the container's ultimate status:

Returns: The API’s response as a Python dictionary, including the container’s exit code under the StatusCode attribute.

Should we be checking it?
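
A minimal sketch of such a check, operating on a dictionary shaped like the documented return value of `wait()`. The helper name `check_wait_result` is hypothetical, and the optional `Error` entry is treated as an assumption about what some API versions include.

```python
def check_wait_result(result: dict) -> int:
    """Validate a dict shaped like the result of the Docker SDK's container wait().

    Raises RuntimeError if the container reported an error or a non-zero exit code;
    otherwise returns the exit code (0).
    """
    # Assumption: some API versions add an "Error" entry; handle it if present.
    error = result.get("Error")
    if error:
        raise RuntimeError(f"container reported an error: {error}")

    status = int(result["StatusCode"])
    if status != 0:
        raise RuntimeError(f"container exited with status {status}")
    return status


if __name__ == "__main__":
    print(check_wait_result({"StatusCode": 0}))
```

In the diff above, this would amount to keeping the value returned by `container_obj.wait(timeout=timeout)` and passing it to a check like this one before reading the logs.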

jhiemstrawisc · 2026-05-27T18:15:01Z

+        container_obj.stop()
+        client.close()
+        if timeout: raise TimeoutError(timeout) from err
+        else: raise RuntimeError("Timeout error but no timeout specified. Please file an issue with this error and stacktrace at https://github.com/Reed-CompBio/spras/issues/new.") from None


Should this catch other types of errors?

jhiemstrawisc · 2026-05-27T18:19:07Z

+        if timeout: raise TimeoutError(timeout) from err
+        else: raise RuntimeError("Timeout error but no timeout specified. Please file an issue with this error and stacktrace at https://github.com/Reed-CompBio/spras/issues/new.") from None
+
+    out = container_obj.attach(stderr=True).decode('utf-8')


I think the completion of wait() implies the container is already stopped (either of its own accord or because the timeout was reached). What happens to this attach command if the container is already stopped?

jhiemstrawisc · 2026-05-27T18:20:09Z

+
+    # As per unix `timeout`, this is the status if the command times out and --preserve-status is not initially specified
+    # (where the latter above holds).
+    if proc.returncode == 124:


Should this have handling for other error codes?
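
As a starting point for that question, the sketch below maps the exit statuses documented for GNU coreutils `timeout` to messages. The helper name and message strings are hypothetical. Note that the wrapped command's own exit status passes through otherwise, so a command that itself exits with one of these codes would be misclassified.

```python
# Exit statuses documented for GNU coreutils `timeout`.
TIMEOUT_EXIT_CODES = {
    124: "command timed out",
    125: "the timeout command itself failed",
    126: "command was found but could not be invoked",
    127: "command was not found",
    137: "command (or timeout) was sent SIGKILL (128 + 9)",
}


def describe_exit(returncode: int) -> str:
    """Hypothetical helper: turn a `timeout`-wrapped process's exit status into a message."""
    if returncode == 0:
        return "success"
    return TIMEOUT_EXIT_CODES.get(returncode, f"command failed with exit status {returncode}")


if __name__ == "__main__":
    for code in (0, 124, 127, 3):
        print(code, describe_exit(code))
```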

tristan-f-r added 4 commits January 12, 2026 09:41

feat: timeout

f81a33e

feat: snakemake err checkpoint

0342b5c

fix: use timeout correctly

841d242

fix: filter files w/ errors

75fd7f1

tristan-f-r added the enhancement label (New feature or request) Jan 13, 2026

tristan-f-r added 2 commits January 13, 2026 07:22

fix: correct timeout order

7abd709

fix(cytoscape): specify optional timeout

e07c961

tristan-f-r added the P-high label (This is a blocker for many PRs/issues/features) Jan 13, 2026

tristan-f-r added 2 commits January 13, 2026 16:03

chore(Snakefile): decheckpointify reconstruct

d5b7e18

perf(Snakefile): make is_error check not consume the entire file

111e53f

tristan-f-r force-pushed the timeout-arg branch from 7182a85 to 111e53f January 13, 2026 20:23

github-actions Bot added the merge-conflict label (This PR has merge conflicts.) Jan 31, 2026

Merge branch 'main' into timeout-arg

c2febff

github-actions Bot removed the merge-conflict label (This PR has merge conflicts.) Jan 31, 2026

github-actions Bot added the merge-conflict label (This PR has merge conflicts.) Mar 18, 2026

ntalluri reviewed Apr 23, 2026

tristan-f-r and others added 4 commits April 23, 2026 19:31

docs: timeout

cc46eed

docs: clarification on container_obj

83c5ed0

docs: remove the strange comment

71c1976

This perplexes me but from my tests we do not need --keep-going. I do not know my original intent here

refactor: use mark_error and is_error more often

6e60afe

Co-Authored-By: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

tristan-f-r added 2 commits April 23, 2026 15:36

Merge branch 'umain' into timeout-arg

7ce0580

style: fmt

699ddca

github-actions Bot removed the merge-conflict label (This PR has merge conflicts.) Apr 23, 2026

tristan-f-r added 3 commits April 25, 2026 02:18

docs: on errors

4e3c28f

fix: tests and such

a322f4d

feat: singularity timeouts

641608f

tristan-f-r requested a review from jhiemstrawisc April 25, 2026 22:13

tristan-f-r added 2 commits April 25, 2026 23:29

fix: don't use capture_output and stderr in the same command

208eb4a

fix: use correct variable reference for reconstruct

7a0c4f3

ntalluri reviewed Apr 30, 2026

Comment thread docs/design/errors.rst Outdated

Comment thread docs/timeout.rst Outdated

Comment thread docs/timeout.rst Outdated

Comment thread spras/containers.py Outdated

Comment thread spras/containers.py

refactor: move errors to be pydantic, add duration cmt

fe68153

Co-Authored-By: Neha Talluri <78840540+ntalluri@users.noreply.github.com>

tristan-f-r added 3 commits May 4, 2026 03:04

style: fmt

071d4bd

fix: regenerate fordevs

e42c3f0

feat: RunSettings

28cfec0

tristan-f-r changed the title ~~feat: timeouts~~ feat!: timeouts May 4, 2026

tristan-f-r added 3 commits May 4, 2026 09:51

test(test_config): use correct config

5291335

test(generate_inputs): fix config

efa45fc

docs(algorithms): fix doc ordering

cae2edd

tristan-f-r added the tuning label (Workflow-spanning algorithm tuning) May 4, 2026

refactor: move params index grabbing into func

0475d45

ntalluri mentioned this pull request May 4, 2026

Review Queue #466

Open

tristan-f-r mentioned this pull request May 5, 2026

Add override mechanism for algorithm containers, transfer explicit .sif images with HTCondor #464

Open

tristan-f-r mentioned this pull request May 7, 2026

feat: conditional runs #471

Open

tristan-f-r added 2 commits May 7, 2026 08:06

fix: actually pass in run settings

91fe5eb

why did i not do this before???

fix: no cyclic imports

2eb0d57

this is why I didn't do this earlier...

ntalluri reviewed May 7, 2026

Comment thread config/config.yaml

tristan-f-r added 3 commits May 7, 2026 17:07

docs(timeout): use nested params

15d00a6

fix: properly define validate_duration

73f7b8d

fix(Snakefile): properly fetch run settings

4a95d79

jhiemstrawisc reviewed May 27, 2026

fix: correct incorrect function use

01cd649

we also add mark_success as the non-negated analogue for testing.
