Preemption resilience by raistlin · Pull Request #8 · pausan/mocha-distributed

raistlin · 2026-06-03T15:57:32Z

No description provided.

Split the single 7-day TTL into a short in-flight claim TTL and a long tombstone/result TTL so runners on preemptible infrastructure (GKE spot, etc.) can recover interrupted tests.

- MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME (default 600s) controls the TTL of the per-test claim key set in beforeEach.
- A keepalive refreshes the claim every claim_ttl/3 seconds so a slow test never lets its claim expire under it.
- afterEach promotes the claim key to MOCHA_DISTRIBUTED_EXPIRATION_TIME (7d) as a tombstone, preventing replacement runners from re-running an already-completed test.
- SIGTERM/SIGINT handler DELs the in-flight claim if we still own it, so a replacement pod can immediately re-claim and run the test. Hard kills fall back to the short claim TTL.

Result/count writes still use the 7d TTL, so external result viewers are unaffected.
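The claim lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeStore`, `claimTest`, `refreshClaim` and `tombstone` are hypothetical names, the store is an in-memory stand-in for the Redis commands involved (SET NX EX, EXPIRE, TTL), and the current time is passed in explicitly so the behaviour is easy to follow.

```javascript
// In-memory stand-in for the Redis commands used below (hypothetical, only so
// the sketch runs without a server). Times are in seconds.
class FakeStore {
  constructor() {
    this.entries = new Map(); // key -> { value, expiresAt }
  }
  _live(key, now) {
    const entry = this.entries.get(key);
    if (entry && entry.expiresAt <= now) {
      this.entries.delete(key); // lazily drop expired keys
      return undefined;
    }
    return entry;
  }
  // SET key value NX EX ttl: succeeds only if the key does not exist.
  setNxEx(key, value, ttlSeconds, now) {
    if (this._live(key, now)) return false;
    this.entries.set(key, { value, expiresAt: now + ttlSeconds });
    return true;
  }
  // EXPIRE key ttl: resets the remaining lifetime of an existing key.
  expire(key, ttlSeconds, now) {
    const entry = this._live(key, now);
    if (!entry) return false;
    entry.expiresAt = now + ttlSeconds;
    return true;
  }
  // TTL key: remaining seconds, or -2 when the key does not exist.
  ttl(key, now) {
    const entry = this._live(key, now);
    return entry ? entry.expiresAt - now : -2;
  }
}

const CLAIM_TTL = 600;                  // short in-flight claim (default above)
const RESULT_TTL = 7 * 24 * 3600;       // long tombstone/result TTL (7d)
const KEEPALIVE_PERIOD = CLAIM_TTL / 3; // 200 s with the default claim TTL

// beforeEach: try to claim the test; false means another runner owns it.
function claimTest(store, key, runnerId, now) {
  return store.setNxEx(key, runnerId, CLAIM_TTL, now);
}

// Keepalive tick: push the claim's expiry out by another CLAIM_TTL.
function refreshClaim(store, key, now) {
  return store.expire(key, CLAIM_TTL, now);
}

// afterEach: promote the claim to a long-lived tombstone.
function tombstone(store, key, now) {
  return store.expire(key, RESULT_TTL, now);
}
```

With these defaults a claim survives two consecutive missed keepalive ticks before it lapses, and once it does lapse the same SET NX call succeeds for a different runner.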

Five new cases in test/test-preemption.js, sharing the require.cache redis-mock pattern from test-retry.js:

- claim TTL split: SET NX uses the short claim TTL (default 600s) and the claim key is promoted to the long TTL (7d) tombstone after the test completes.
- env override: MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME is honoured.
- skipped test: when another runner owns the claim, no tombstone and no result row are written.
- keepalive: setInterval is patched to capture the keepalive callback; firing it during the test produces extra expire() calls with the short TTL, and the interval period is claim_ttl/3 seconds.
- SIGTERM: invoking the installed SIGTERM listener (with process.exit stubbed) DELs the in-flight claim key.
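The setInterval-patching technique from the keepalive case might look something like the sketch below. It is not the code in test/test-preemption.js: `startKeepalive` and `captureInterval` are hypothetical names, and passing the period to setInterval in milliseconds is an assumption.

```javascript
// Hypothetical code under test: schedules a keepalive every claim_ttl/3 seconds.
function startKeepalive(refresh, claimTtlSeconds) {
  return setInterval(refresh, (claimTtlSeconds / 3) * 1000);
}

// Test helper: temporarily replace the global setInterval so the callback and
// period are captured instead of scheduled, then restore the original.
function captureInterval(run) {
  const original = globalThis.setInterval;
  const captured = [];
  globalThis.setInterval = (callback, periodMs) => {
    captured.push({ callback, periodMs });
    return captured.length; // dummy timer handle
  };
  try {
    run();
  } finally {
    globalThis.setInterval = original;
  }
  return captured;
}

let refreshes = 0;
const [timer] = captureInterval(() => {
  startKeepalive(() => { refreshes += 1; }, 600);
});

// Fire the keepalive by hand, as if two periods had elapsed.
timer.callback();
timer.callback();
```

Capturing the callback keeps the test deterministic: no real timer is ever scheduled, and the test decides exactly when a keepalive tick happens.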

pausan

Let me see if I get this PR.

Basically, instead of each key living for 7 days after a worker claims ownership, the idea is for the key to live for a shorter time and for all workers to check for "liveness" every X period. If the worker that owned the key dies (e.g., a spot/preemptible machine going away), the liveness check lets another worker pick the test up later and finish the execution. Did I understand correctly?

If so, the only thing I'm missing is: will workers wait for each other, or wait at the end? That is, if a worker dies after the other workers have gone through all their jobs, what happens to that job? Are we assuming it will be picked up by a new worker created on a new spot/preemptible machine, thanks to this liveness check? Is the expiration time set on the assumption that spawning a new machine takes longer than the preemption time, so that by the time the test comes up again the new worker can pick it up?

raistlin · 2026-07-03T05:40:50Z

Mostly correct:

Each key is claimed for a short period (10 minutes). The claim is refreshed every ~200 seconds (claim TTL / 3) as long as the test runner remains alive. When the test runner finishes executing the test or is shut down, it should release the lock gracefully. If it does not (e.g., when spot instances receive a kill -9), the lock is released automatically when the claim TTL expires, allowing another runner to pick up the test.
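The graceful-release path can be sketched as below. This is an illustration under assumptions rather than the handler in this PR: `releaseIfOwner` and `makeShutdownHandler` are hypothetical names, the store is a Map-backed stand-in, and the exit code of 1 is arbitrary.

```javascript
// Map-backed stand-in for the store (hypothetical; a real runner would talk to Redis).
const claims = new Map();
const store = {
  get: (key) => claims.get(key),
  del: (key) => claims.delete(key),
};

// Delete the claim only if this runner still owns it. If the claim already
// expired and another runner re-claimed the key, leave it alone.
function releaseIfOwner(store, key, runnerId) {
  if (store.get(key) !== runnerId) return false;
  store.del(key);
  return true;
}

// Build a signal handler that releases the in-flight claim, then exits.
// `exit` is injected so a test can stub it instead of killing the process.
function makeShutdownHandler(store, getCurrentKey, runnerId, exit) {
  return () => {
    const key = getCurrentKey();
    if (key) releaseIfOwner(store, key, runnerId);
    exit(1);
  };
}

// Wiring, assuming `currentKey` tracks the test currently in flight:
// const onSignal = makeShutdownHandler(store, () => currentKey, 'runner-A', process.exit);
// process.once('SIGTERM', onSignal);
// process.once('SIGINT', onSignal);
```

Against a real Redis server the separate read and delete are not atomic, so a stricter version would likely do the ownership check and the delete in a single server-side script. A SIGKILL cannot be handled at all, which is why the short claim TTL remains the fallback.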

When a test completes, or fails after all retry attempts, its key should go back to the previous 7-day expiration behavior.

We have been running our end-to-end tests with this commit for a while (via monkey-patching), and I can share some feedback:
We are seeing improved behavior, with fewer tests omitted on each run than before, but it is not yet perfect. I suspect this is because workers terminate without waiting for other tests to finish. I expected workers to coordinate and wait for one another, but it appears they don't. Ideally, workers should not exit as long as there is any key that has not been tombstoned.

I’m not sure whether these additional improvements should be included in this PR or handled in a new one.
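One possible shape for the "do not exit while any key is not tombstoned" idea is sketched below. Nothing like this exists in the PR: `pendingTests` and `waitForAllTombstoned` are hypothetical names, and the sketch assumes a runner can list every test key and tell a tombstone from a live claim, neither of which the discussion specifies.

```javascript
// Hypothetical: stateOf(key) returns 'done' for a tombstoned test, 'claimed'
// for a live in-flight claim, and undefined when the key is unclaimed or its
// claim has expired.
function pendingTests(allKeys, stateOf) {
  return allKeys.filter((key) => stateOf(key) !== 'done');
}

// Poll until nothing is pending or the deadline passes.
// Resolves with the keys that were still pending when it gave up.
async function waitForAllTombstoned(allKeys, stateOf, options = {}) {
  const { pollMs = 5000, timeoutMs = 15 * 60 * 1000 } = options;
  const deadline = Date.now() + timeoutMs;
  let pending = pendingTests(allKeys, stateOf);
  while (pending.length > 0 && Date.now() < deadline) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    pending = pendingTests(allKeys, stateOf);
  }
  return pending;
}
```

Waiting alone does not finish an abandoned test. On each poll, a waiting worker would also need to try claiming and running any pending test whose claim has expired, and the timeout bounds how long a worker lingers if that never happens.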

raistlin added 2 commits June 3, 2026 17:37

pausan reviewed Jul 2, 2026

raistlin wants to merge 2 commits into pausan:master from raistlin:preemption-resilience
