Skip to content

fix(scheduler): gate is_due on last attempt to stop retry-storm#170

Merged
howie merged 1 commit into
mainfrom
worktree-scheduler-retry-backoff
Jun 24, 2026
Merged

fix(scheduler): gate is_due on last attempt to stop retry-storm#170
howie merged 1 commit into
mainfrom
worktree-scheduler-retry-backoff

Conversation

@howie

@howie howie commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

背景

排查 fewer-permission-prompts-weekly job 失敗時發現 scheduler 的 retry-storm bug:is_due() 只比對最近一次成功執行(db.get_last_successful_run),所以一個永遠失敗的 job 在它的排程窗口內每個 tick(~60 秒)都被判為 due → 不斷重試、無 backoff。

實測:fewer-permission-prompts-weekly 連續失敗 3746 次(2026-05-25 起;每週一整天每 60 秒一次),灌爆 job_runs 表與 .runtime/logs/

修法

以最後一次嘗試(任何狀態 success / failed / running)為準,而非僅最後一次成功。同一排程週期內已嘗試過就不再 due;失敗的 job 等到下一個排程週期才重跑。

  • db.py:新增 get_last_run(job_id)(不過濾 status,回傳最新一筆)。
  • cli.py tick:last_runs 改用 get_last_runis_due(原 get_last_successful_run)。
  • service.pyis_due 參數 last_successlast_run、docstring 載明新語意。日期比對邏輯不變。
  • get_last_successful_run 保留(status 報表等仍可用)。

取捨

失敗的 job 不再即時重試,而是等下一個排程週期(daily→隔天、weekly→隔週)。這對 daily 影響極小,且遠優於每 60 秒 storm。若未來需要「有限次數 + 間隔 backoff」的即時重試,可另案在此基礎上加。

驗證

  • make ci 全綠:1061 passed(+6 新測試)。
  • regression(test_service TestIsDueRetryStorm:failed-today → not due、failed-yesterday → due、weekly failed-today → not due、running-today → not due(防重複啟動)。
  • test_dbget_last_run 回傳最新(含 failed)、對照 get_last_successful_run 仍只回成功。

備註

已另外(本機 .runtime/schedules.json,不在此 PR)把該 misconfigured job 停用止血——它本就註冊錯誤(fewer-permission-prompts 是 Claude Code 內建 skill,非 repo skills/ skill,runner 找不到 skills/<name>/SKILL.md)。本 PR 修的是 scheduler 不該對任何失敗 job retry-storm 的通用 robustness。

…etry-storm)

failing job 會每個 tick 重試、無 backoff:is_due 只比對最近一次*成功*
(get_last_successful_run),永遠失敗的 job 在排程窗口內每個 tick 都判為 due。
實測 fewer-permission-prompts-weekly 連續失敗 3746 次(每週一整天每 60 秒一次)。

修法:以最後一次*嘗試*(任何狀態 success/failed/running)為準。

- db.py:新增 get_last_run(job_id)(不過濾 status,回傳最新一筆)。
- cli.py tick:last_runs 改用 get_last_run 餵 is_due(原本 get_last_successful_run)。
- service.py:is_due 參數 last_success → last_run、docstring 載明「同一排程週期內
  已嘗試過就不再 due,失敗 job 等下一週期才重跑」。日期比對邏輯不變。
- get_last_successful_run 保留(status 報表等仍可用)。

tests:
- test_service:新增 TestIsDueRetryStorm(failed-today→not due、failed-yesterday→due、
  weekly failed-today→not due、running-today→not due 防重複啟動)。
- test_db:get_last_run 回傳最新(含 failed)、對照 get_last_successful_run 仍只回成功。

make ci 全綠(1061 passed)。
@howie howie added the enhancement New feature or request label Jun 23, 2026
@howie

howie commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Final Aggregated Review — PR #170 (scheduler retry-storm fix)

Mode

group-review (3/3 voices: Claude [code-reviewer + silent-failure-hunter] / Codex / agy)

Consensus / verified Critical (must fix)

C1. Dependency gating regression across ticks (Codex — lead-verified REAL)

  • File: tasks/scheduler/cli.py:82
  • The downstream-skip guard is dep in due_ids and dep not in completed_ok. It only blocks a
    downstream job when the failed upstream is in the current due set. The fix makes a
    failed/attempted upstream non-due (its today-dated run record removes it from due_ids).
    So when upstream and downstream have different times, or downstream becomes due in a later
    tick than the one where upstream failed, dep in due_ids is False → downstream runs with
    stale/missing upstream output
    . Old code kept the failed upstream due (storm) which, as a
    side effect, kept the dependency gate working.
  • Lead verification: read cli.py:78-103 — confirmed the gate keys off due_ids membership, not
    per-period dep success. Real correctness regression for any depends_on chain.
  • Fix direction: gate downstream on the upstream's actual status this period (query last run
    per period; skip downstream if any dep is not success-this-period), independent of due_ids.

C2. Crash / failure becomes a SILENT period-skip; skips never surfaced (silent-failure-hunter)

  • File: tasks/scheduler/service.py:30-39, cli.py:63,70
  • New semantics treat any same-period attempt (failed / running) as "done", so a job that
    crashed mid-run (record_start committed, process killed before record_finish) is silently
    skipped for the rest of the period. is_due returning False emits nothing; the tick exits
    silently. Old code re-ran a crashed job next tick. Net: the storm fix converts a crash into a
    silent multi-hour/period skip.
  • Fix direction (minimum): surface a warning when a job is skipped because of a same-period
    failed/running record, so a wedged/failed job is visible. cleanup_stale_runs() return value
    is also discarded (cli.py:63) — echo it when > 0.

Important

I1. Transient failure on weekly/monthly/quarterly silently parks a whole period (silent-failure-hunter)

  • A single transient blip (e.g. ACP gateway momentarily down) on a quarterly job now silently
    skips ~90 days with no surfaced signal. Period-skip is too blunt for long-period jobs.
  • Fix direction: bounded same-period retry (N attempts with a min interval) instead of
    all-or-nothing; or at least surface the "failed, dormant until next period" state.

I2. (agy) previous-period 'running' row → potential double-run — LARGELY MITIGATED

  • agy: a job still 'running' from yesterday → date < today True → is_due True → double-run.
  • Lead assessment: cleanup_stale_runs() runs first (cli.py:63) and flips >2h 'running' rows to
    timeout; all jobs have timeout_seconds ≤ 1800s, so a genuine >24h run is impossible. Mostly
    theoretical. agy's suggested fix (if status=='running': return False) is WRONG — it would make
    a stale-running job never run again. Do NOT apply that; cleanup already handles it. Optionally
    add a defensive test. Downgraded from Critical.

Refuted (lead)

  • agy [Important] timezone mismatch: all writers use now.isoformat() (local naive) and
    is_due uses datetime.now() (local naive) — consistent, no UTC storage anywhere. REFUTED.

Actionable NIT (clean up)

  • N1 (code-reviewer + agy): get_last_successful_run is now dead production code. Given the storm
    history, add a comment "retained for reporting; do NOT feed into is_due" OR remove it + its tests.
  • N2 (code-reviewer + agy): missing monthly/quarterly retry-storm test (the monthly|bimonthly| quarterly branch has distinct Y/M/D equality logic, untested for failed-same-day).
  • N3 (code-reviewer): stale comment at db.py cleanup ("1 小時" — actually 1h × multiplier = 2h).
  • N4 (silent-failure-hunter): ORDER BY started_at DESC tiebreak nondeterministic on same-second
    rows; consider , id DESC. Pre-existing (get_last_successful_run has it too). Minor.

Verdict

NEEDS_CHANGES — C1 is a real correctness regression (dependency chains), C2/I1 are real silent
failure modes introduced by the period-skip approach. The simple "attempted → not due" design is
too blunt. Two fix directions below; this is a design decision for the human.

Voices unavailable

  • none (3/3 active)

@howie howie merged commit 77af75b into main Jun 24, 2026
1 check passed
@howie howie deleted the worktree-scheduler-retry-backoff branch June 24, 2026 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant