Preemption resilience #8

Open
raistlin wants to merge 2 commits into pausan:master from raistlin:preemption-resilience


@raistlin raistlin commented Jun 3, 2026

Contributor

No description provided.

raistlin added 2 commits June 3, 2026 17:37
Split the single 7-day TTL into a short in-flight claim TTL and a
long tombstone/result TTL so runners on preemptible infrastructure
(GKE spot, etc.) can recover interrupted tests.

- MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME (default 600s) controls the
  TTL of the per-test claim key set in beforeEach.
- A keepalive refreshes the claim every claim_ttl/3 seconds so a slow
  test never lets its claim expire under it.
- afterEach promotes the claim key to MOCHA_DISTRIBUTED_EXPIRATION_TIME
  (7d) as a tombstone, preventing replacement runners from re-running
  an already-completed test.
- SIGTERM/SIGINT handler DELs the in-flight claim if we still own it,
  so a replacement pod can immediately re-claim and run the test.
  Hard kills fall back to the short claim TTL.

Result/count writes still use the 7d TTL, so external result viewers
are unaffected.
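
The claim lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FakeStore` is a hypothetical in-memory stand-in for the Redis calls the description mentions (SET NX with a TTL, EXPIRE), the clock is advanced by hand, and the two TTLs are hard-coded to the stated defaults instead of being read from the environment variables.

```javascript
// Hypothetical in-memory stand-in for Redis, with a fake clock in seconds.
class FakeStore {
  constructor() {
    this.now = 0;
    this.entries = new Map(); // key -> { value, expiresAt }
  }
  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now) {
      this.entries.delete(key); // lazily drop expired keys
      return null;
    }
    return entry.value;
  }
  // Like SET key value NX EX ttl: only succeeds if the key is absent.
  setNX(key, value, ttl) {
    if (this.get(key) !== null) return false;
    this.entries.set(key, { value, expiresAt: this.now + ttl });
    return true;
  }
  // Like EXPIRE key ttl: resets the remaining lifetime of a live key.
  expire(key, ttl) {
    if (this.get(key) === null) return false;
    this.entries.get(key).expiresAt = this.now + ttl;
    return true;
  }
}

// Defaults taken from the description; the PR reads them from env vars.
const CLAIM_TTL = 600;                 // short in-flight claim, seconds
const TOMBSTONE_TTL = 7 * 24 * 3600;   // long tombstone/result TTL, seconds
const KEEPALIVE_PERIOD = CLAIM_TTL / 3;

// beforeEach: try to take the claim with the short TTL.
const claim = (store, key, runnerId) => store.setNX(key, runnerId, CLAIM_TTL);

// keepalive tick: refresh the short TTL while we still own the claim.
const keepalive = (store, key, runnerId) =>
  store.get(key) === runnerId && store.expire(key, CLAIM_TTL);

// afterEach: promote the claim to the long TTL so it acts as a tombstone.
const tombstone = (store, key) => store.expire(key, TOMBSTONE_TTL);

const store = new FakeStore();
const key = 'suite:test-1';

const first = claim(store, key, 'runner-a');   // runner-a wins the claim
const second = claim(store, key, 'runner-b');  // runner-b must skip the test

store.now += KEEPALIVE_PERIOD;                 // t = 200s
keepalive(store, key, 'runner-a');             // claim now expires at t = 800s

store.now = 800;                               // runner-a was hard-killed
const reclaimed = claim(store, key, 'runner-b');

tombstone(store, key);                         // runner-b finished the test
store.now += 24 * 3600;                        // one day later
const rerun = claim(store, key, 'runner-c');   // blocked by the tombstone
```

With the defaults, a hard-killed runner holds its claim for at most 600 seconds after its last refresh, while a completed test stays blocked for the full 7 days.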

Five new cases in test/test-preemption.js, sharing the require.cache
redis-mock pattern from test-retry.js:

- claim TTL split: SET NX uses the short claim TTL (default 600s) and
  the claim key is promoted to the long TTL (7d) tombstone after the
  test completes.
- env override: MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME is honoured.
- skipped test: when another runner owns the claim, no tombstone and
  no result row are written.
- keepalive: setInterval is patched to capture the keepalive callback;
  firing it during the test produces extra expire() calls with the
  short TTL, and the interval period is claim_ttl/3 seconds.
- SIGTERM: invoking the installed SIGTERM listener (with process.exit
  stubbed) DELs the in-flight claim key.
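
The keepalive case relies on replacing `setInterval` so the test can fire the callback by hand. A rough sketch of that technique, under assumptions: `startKeepalive` is a hypothetical function under test, not the module's real entry point, and the claim TTL of 600 seconds is the default quoted above.

```javascript
// Hypothetical function under test: schedules a periodic claim refresh.
function startKeepalive(refresh, claimTtlSeconds) {
  // setInterval takes milliseconds, so convert claim_ttl/3 seconds.
  return setInterval(refresh, (claimTtlSeconds / 3) * 1000);
}

// Patch the global so no real timer is created and the callback is captured.
const realSetInterval = global.setInterval;
const captured = [];
global.setInterval = (callback, periodMs) => {
  captured.push({ callback, periodMs });
  return captured.length; // fake timer handle
};

let refreshCalls = 0;
try {
  startKeepalive(() => { refreshCalls += 1; }, 600);
  // Fire the captured callback manually, as if two periods had elapsed.
  captured[0].callback();
  captured[0].callback();
} finally {
  global.setInterval = realSetInterval; // always restore the real timer
}
```

Capturing the callback keeps the test fast and deterministic, since nothing waits on a real 200-second timer.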

@pausan pausan left a comment

Owner

Let me see if I get this PR.

Basically, instead of having each key live for 7 days (after a worker claims ownership), the idea is for the key to live for a shorter time and for all workers to check for "liveness" every X period, so that if the worker that owned the key dies (e.g. a spot/preemptible machine going away), another worker picks it up later and finishes the execution. Did I understand correctly?

If so, the only thing I'm missing is: will workers wait for each other at the end? Meaning, if a worker dies after the other workers have gone through all their jobs, what happens to that job? Are we assuming it will be picked up by a new worker created on a new spot/preemptible machine, thanks to this liveness check? Is the expiration time set on the assumption that spawning a new machine takes longer than the preemption time, so that by the time the test is executed again the new worker will pick it up?

raistlin commented Jul 3, 2026

Contributor Author

Mostly correct:

Each key is claimed for a short period (10 minutes by default). The claim is refreshed every claim_ttl/3 seconds (about 200 seconds with the default) for as long as the test runner remains alive. When the test runner finishes executing the test or dies, it should release the lock gracefully. If it does not (e.g., when a spot instance receives a kill -9), the lock is released automatically once the claim TTL expires, allowing another runner to pick up the test.
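
To illustrate the graceful-release path, here is a minimal sketch of a signal handler that deletes the claim only when this runner still owns it. Everything here is an assumption for illustration: the `claims` Map stands in for Redis, `releaseIfOwner` and `installSignalHandlers` are hypothetical names, and `exit` is injectable so the handler can be invoked without ending the process.

```javascript
// Hypothetical stand-in for Redis: claim key -> owning runner id.
const claims = new Map();

// Delete the claim only if this runner still owns it.
function releaseIfOwner(key, runnerId) {
  if (claims.get(key) !== runnerId) return false;
  claims.delete(key);
  return true;
}

// Install SIGTERM/SIGINT listeners that release the in-flight claim.
function installSignalHandlers(getCurrentKey, runnerId, exit = process.exit) {
  const handler = () => {
    const key = getCurrentKey(); // null when no test is in flight
    if (key) releaseIfOwner(key, runnerId);
    exit();
  };
  process.on('SIGTERM', handler);
  process.on('SIGINT', handler);
  return handler;
}

claims.set('suite:test-1', 'runner-a'); // our in-flight claim
claims.set('suite:test-2', 'runner-b'); // someone else's claim

let exitCalls = 0;
const handler = installSignalHandlers(
  () => 'suite:test-1',
  'runner-a',
  () => { exitCalls += 1; } // stubbed exit, as in the SIGTERM test case
);

handler(); // simulate the signal by invoking the listener directly
const foreignReleased = releaseIfOwner('suite:test-2', 'runner-a');

// Clean up the listeners installed by this sketch.
process.removeListener('SIGTERM', handler);
process.removeListener('SIGINT', handler);
```

Against a real Redis server, a separate GET followed by DEL is not atomic, so the ownership check and the delete would need to run as a single step (for example in a server-side script). A kill -9 never reaches this handler, which is why the short claim TTL remains the fallback.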

When a test completes or fails after all retry attempts, it should restore the previous 7-day expiration behavior.

We have had our end-to-end tests running with this commit for a while (via monkey-patching), and I can share some feedback:
We are seeing improved behavior: fewer tests are omitted on each run than before, but it is not yet perfect. I suspect this is because workers terminate without waiting for other tests to finish. I expected workers to coordinate and wait for one another, but it appears they don't. Ideally, a worker should not exit while any key has not yet been tombstoned.

I’m not sure whether these additional improvements should be included in this PR or handled in a new one.
