Preemption resilience #8
Split the single 7-day TTL into a short in-flight claim TTL and a long tombstone/result TTL so runners on preemptible infrastructure (GKE spot, etc.) can recover interrupted tests.

- MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME (default 600s) controls the TTL of the per-test claim key set in beforeEach.
- A keepalive refreshes the claim every claim_ttl/3 seconds so a slow test never lets its claim expire under it.
- afterEach promotes the claim key to MOCHA_DISTRIBUTED_EXPIRATION_TIME (7d) as a tombstone, preventing replacement runners from re-running an already-completed test.
- SIGTERM/SIGINT handler DELs the in-flight claim if we still own it, so a replacement pod can immediately re-claim and run the test. Hard kills fall back to the short claim TTL.

Result/count writes still use the 7d TTL, so external result viewers are unaffected.
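The claim lifecycle described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `FakeStore` is a hypothetical in-memory stand-in for the Redis calls (SET NX, EXPIRE, DEL), and `claimTtlSeconds`, `beginTest`, `finishTest` and `releaseOnSignal` are made-up names.

```javascript
// Hypothetical sketch: every name below is illustrative, not the library's API.
const TOMBSTONE_TTL = 7 * 24 * 3600; // 7 days, in seconds

// Minimal in-memory stand-in for the Redis operations the description mentions.
class FakeStore {
  constructor() { this.data = new Map(); } // key -> { value, ttl }
  setNX(key, value, ttl) {
    if (this.data.has(key)) return false; // another runner already holds the claim
    this.data.set(key, { value, ttl });
    return true;
  }
  get(key) { const entry = this.data.get(key); return entry ? entry.value : null; }
  expire(key, ttl) {
    const entry = this.data.get(key);
    if (entry) entry.ttl = ttl;
    return Boolean(entry);
  }
  ttl(key) { const entry = this.data.get(key); return entry ? entry.ttl : -2; }
  del(key) { return this.data.delete(key); }
}

// Read the short claim TTL from the environment, defaulting to 600 seconds.
function claimTtlSeconds(env = process.env) {
  const raw = Number(env.MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME);
  return Number.isFinite(raw) && raw > 0 ? raw : 600;
}

// beforeEach: claim the test with the short TTL and start the keepalive.
// Returns null when another runner owns the claim (the test is skipped).
function beginTest(store, key, runnerId, claimTtl) {
  if (!store.setNX(key, runnerId, claimTtl)) return null;
  const timer = setInterval(() => store.expire(key, claimTtl), (claimTtl / 3) * 1000);
  timer.unref(); // the keepalive alone should not keep the process running
  return timer;
}

// afterEach: stop the keepalive and promote the claim to a long-lived tombstone.
function finishTest(store, key, timer) {
  clearInterval(timer);
  store.expire(key, TOMBSTONE_TTL);
}

// SIGTERM/SIGINT: drop the claim only if this runner still owns it.
// Against a real Redis this check-then-delete would need to be atomic.
function releaseOnSignal(store, key, runnerId, timer) {
  clearInterval(timer);
  if (store.get(key) === runnerId) store.del(key);
}
```

In this sketch a hard kill never reaches `releaseOnSignal`, so the key simply keeps its short TTL, which is the fallback the description mentions.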
Five new cases in test/test-preemption.js, sharing the require.cache redis-mock pattern from test-retry.js:

- claim TTL split: SET NX uses the short claim TTL (default 600s) and the claim key is promoted to the long TTL (7d) tombstone after the test completes.
- env override: MOCHA_DISTRIBUTED_CLAIM_EXPIRATION_TIME is honoured.
- skipped test: when another runner owns the claim, no tombstone and no result row are written.
- keepalive: setInterval is patched to capture the keepalive callback; firing it during the test produces extra expire() calls with the short TTL, and the interval period is claim_ttl/3 seconds.
- SIGTERM: invoking the installed SIGTERM listener (with process.exit stubbed) DELs the in-flight claim key.
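The keepalive case relies on patching setInterval so the test can fire the callback by hand. A minimal sketch of that technique, assuming a hypothetical `captureIntervals` helper and a stand-in `startKeepalive` function (neither is taken from test/test-preemption.js), might look like this:

```javascript
// Hypothetical helper: replace the global setInterval with a recorder,
// run fn, then restore the original even if fn throws.
function captureIntervals(fn) {
  const original = globalThis.setInterval;
  const captured = [];
  globalThis.setInterval = (callback, ms) => {
    captured.push({ callback, ms });
    return { unref() {} }; // inert handle: nothing is actually scheduled
  };
  try {
    fn();
  } finally {
    globalThis.setInterval = original;
  }
  return captured;
}

// Stand-in for the code under test: refresh the claim every claim_ttl/3 seconds.
function startKeepalive(expire, claimTtl) {
  return setInterval(() => expire(claimTtl), (claimTtl / 3) * 1000);
}

// Record each expire() call instead of talking to Redis.
const expireCalls = [];
const [keepalive] = captureIntervals(() =>
  startKeepalive((ttl) => expireCalls.push(ttl), 600)
);

// Fire the captured callback manually, as if two keepalive periods had elapsed.
keepalive.callback();
keepalive.callback();
```

Capturing the callback keeps the test deterministic: no real timer runs, yet both the period (`keepalive.ms`) and the TTL passed to each refresh (`expireCalls`) can be asserted directly.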
pausan
left a comment
Let me see if I get this PR.
Basically, instead of each key living for 7 days after a worker claims ownership, the idea is for the key to live for a shorter time and for all workers to check for "liveness" every X period, so that if the owning worker dies (e.g. a spot/preemptible machine dying), the liveness check lets another worker pick the test up later and finish the execution. Did I understand correctly?
If so, the only thing I'm missing is: will workers wait for each other, or wait at the end? That is, if a worker dies after the other workers have gone through all their jobs, what happens to that job? Are we assuming it will be picked up by a new worker created on a new spot/preemptible machine, thanks to this liveness check? Is the expiration time chosen on the assumption that spawning a new machine takes longer than the preemption time, so that by the time the test is executed again the new worker will pick it up?
Mostly correct: each key is claimed for a short period (10 minutes). The claim is refreshed every ~200 seconds (claim_ttl/3 with the default 600s TTL) as long as the test runner remains alive. When the test runner finishes executing the test or dies, it should release the lock gracefully. If it does not (e.g., when spot instances receive a [...]

When a test completes or fails after all retry attempts, it should restore the previous 7-day expiration behavior.

We have been running our end-to-end tests with this commit for a while (via monkey-patching), and I can share some feedback:

I'm not sure whether these additional improvements should be included in this PR or handled in a new one.