fix(scheduler): gate is_due on last attempt to stop retry-storm #170
Merged
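…etry-storm) A failing job was retried on every tick with no backoff: is_due compared only against the most recent *successful* run (get_last_successful_run), so a job that always fails was judged due on every tick inside its schedule window. Observed: fewer-permission-prompts-weekly failed 3746 times in a row (once every 60 seconds, all day every Monday). Fix: gate on the last *attempt* (any status: success/failed/running).
- db.py: add get_last_run(job_id) (no status filter; returns the most recent row).
- cli.py tick: last_runs now feeds is_due from get_last_run (previously get_last_successful_run).
- service.py: is_due parameter renamed last_success → last_run; the docstring states "once a job has been attempted within the same schedule period it is no longer due; a failed job waits for the next period to rerun". The date comparison logic is unchanged.
- get_last_successful_run is kept (still usable for status reports and similar).
tests:
- test_service: add TestIsDueRetryStorm (failed-today → not due, failed-yesterday → due, weekly failed-today → not due, running-today → not due, to prevent duplicate launches).
- test_db: get_last_run returns the most recent row (including failed), contrasted with get_last_successful_run, which still returns only successes.
make ci is all green (1061 passed).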
Final Aggregated Review: PR #170 (scheduler retry-storm fix)
Mode
group-review (3/3 voices: Claude [code-reviewer + silent-failure-hunter] / Codex / agy)
Consensus / verified
Critical (must fix)
C1. Dependency gating regression across ticks (Codex; lead-verified REAL)
C2. Crash / failure becomes a SILENT period-skip; skips never surfaced (silent-failure-hunter)
Important
I1. Transient failure on weekly/monthly/quarterly silently parks a whole period (silent-failure-hunter)
I2. (agy) previous-period 'running' row → potential double-run: LARGELY MITIGATED
Refuted (lead)
Actionable NIT (clean up)
Verdict
NEEDS_CHANGES: C1 is a real correctness regression (dependency chains), C2/I1 are real silent
Voices unavailable
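Background
While investigating failures of the fewer-permission-prompts-weekly job, a retry-storm bug turned up in the scheduler: is_due() compares only against the most recent successful run (db.get_last_successful_run), so a job that always fails is judged due on every tick (~60 seconds) inside its schedule window. It is retried continuously with no backoff.
Observed: fewer-permission-prompts-weekly failed 3746 times in a row (since 2026-05-25; once every 60 seconds, all day every Monday), flooding the job_runs table and .runtime/logs/.
Fix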
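Gate on the last attempt (any status: success / failed / running) rather than only the last success. Once a job has been attempted within the current schedule period, it is no longer due; a failed job reruns only when the next schedule period begins.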
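As a rough illustration of this change in semantics, here is a minimal sketch. The job_runs schema, the period_key helper, and the exact is_due signature are assumptions made for illustration; the real db.py and service.py may differ, and the real is_due also checks the schedule window, which this sketch omits.

```python
import sqlite3
from datetime import date, datetime
from typing import Optional

# Hypothetical schema; the actual job_runs table may have other columns.
SCHEMA = """
CREATE TABLE job_runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    job_id TEXT NOT NULL,
    status TEXT NOT NULL,      -- 'success' | 'failed' | 'running'
    started_at TEXT NOT NULL   -- ISO 8601 timestamp
)
"""


def get_last_successful_run(conn: sqlite3.Connection, job_id: str) -> Optional[datetime]:
    """Old input to is_due: the newest run whose status is 'success'."""
    row = conn.execute(
        "SELECT started_at FROM job_runs "
        "WHERE job_id = ? AND status = 'success' "
        "ORDER BY started_at DESC, id DESC LIMIT 1",
        (job_id,),
    ).fetchone()
    return datetime.fromisoformat(row[0]) if row else None


def get_last_run(conn: sqlite3.Connection, job_id: str) -> Optional[datetime]:
    """New input to is_due: the newest run, with no status filter."""
    row = conn.execute(
        "SELECT started_at FROM job_runs WHERE job_id = ? "
        "ORDER BY started_at DESC, id DESC LIMIT 1",
        (job_id,),
    ).fetchone()
    return datetime.fromisoformat(row[0]) if row else None


def period_key(frequency: str, day: date) -> tuple:
    """Identify the schedule period a date belongs to (assumed helper)."""
    if frequency == "daily":
        return (day.year, day.month, day.day)
    if frequency == "weekly":
        iso = day.isocalendar()
        return (iso[0], iso[1])  # ISO year and ISO week number
    raise ValueError(f"unsupported frequency: {frequency}")


def is_due(frequency: str, now: datetime, last_run: Optional[datetime]) -> bool:
    """Due only if no attempt has been recorded in the current period."""
    if last_run is None:
        return True
    return period_key(frequency, last_run.date()) != period_key(frequency, now.date())
```

With a single failed row for the current day, get_last_successful_run returns None, so the old call pattern reports the job as due on every tick. Feeding get_last_run into is_due instead reports it as not due until the next period starts.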
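- db.py: add get_last_run(job_id) (no status filter; returns the most recent row).
- cli.py tick: last_runs now feeds is_due from get_last_run (previously get_last_successful_run).
- service.py: is_due parameter renamed last_success → last_run; the docstring spells out the new semantics. The date comparison logic is unchanged.
- get_last_successful_run is kept (still usable for status reports and similar).
Trade-offs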
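A failed job is no longer retried immediately; it waits for the next schedule period (daily → the next day, weekly → the next week). The impact on daily jobs is minimal, and this is far better than a storm every 60 seconds. If immediate retries with a bounded count and a backoff interval are needed later, they can be added on top of this change in a separate PR.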
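For reference, a bounded retry with backoff of the kind mentioned here might look roughly like the sketch below. This is not part of the PR: the retry_allowed function, its parameters, and the exponential schedule are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def retry_allowed(
    failed_attempts: int,
    last_attempt: Optional[datetime],
    now: datetime,
    max_retries: int = 3,
    base_delay: timedelta = timedelta(minutes=5),
) -> bool:
    """Decide whether a job that failed in the current period may run again.

    failed_attempts counts failures already recorded in the current period.
    """
    if last_attempt is None or failed_attempts == 0:
        return True  # nothing has failed yet in this period
    if failed_attempts > max_retries:
        return False  # retry budget spent; wait for the next period
    # Exponential backoff: base_delay, then 2x, then 4x, and so on.
    delay = base_delay * (2 ** (failed_attempts - 1))
    return now - last_attempt >= delay
```

Under these assumed defaults, a job gets at most three retries per period, spaced at least 5, 10, and 20 minutes after the preceding failure, instead of one attempt every tick.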
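Verification
- make ci is all green: 1061 passed (+6 new tests).
- test_service (TestIsDueRetryStorm): failed-today → not due, failed-yesterday → due, weekly failed-today → not due, running-today → not due (prevents duplicate launches).
- test_db: get_last_run returns the most recent row (including failed), contrasted with get_last_successful_run, which still returns only successes.
Notes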
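Separately (in the local .runtime/schedules.json, not part of this PR), the misconfigured job has been disabled as a stopgap. It was registered incorrectly in the first place: fewer-permission-prompts is a built-in Claude Code skill, not a skill under the repo's skills/ directory, so the runner cannot find skills/<name>/SKILL.md. What this PR fixes is the general robustness problem: the scheduler should not retry-storm on any failing job.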