fix(scheduler): gate is_due on last attempt to stop retry-storm by howie · Pull Request #170 · heyu-ai/yibi-stack

howie · 2026-06-23T12:10:35Z

背景

排查 fewer-permission-prompts-weekly job 失敗時發現 scheduler 的 retry-storm bug：is_due() 只比對最近一次成功執行（db.get_last_successful_run），所以一個永遠失敗的 job 在它的排程窗口內每個 tick（~60 秒）都被判為 due → 不斷重試、無 backoff。

實測：fewer-permission-prompts-weekly 連續失敗 3746 次（2026-05-25 起；每週一整天每 60 秒一次），灌爆 job_runs 表與 .runtime/logs/。

修法

以最後一次嘗試（任何狀態 success / failed / running）為準，而非僅最後一次成功。同一排程週期內已嘗試過就不再 due；失敗的 job 等到下一個排程週期才重跑。

db.py：新增 get_last_run(job_id)（不過濾 status，回傳最新一筆）。
cli.py tick：last_runs 改用 get_last_run 餵 is_due（原 get_last_successful_run）。
service.py：is_due 參數 last_success → last_run、docstring 載明新語意。日期比對邏輯不變。
get_last_successful_run 保留（status 報表等仍可用）。

取捨

失敗的 job 不再即時重試，而是等下一個排程週期（daily→隔天、weekly→隔週）。這對 daily 影響極小，且遠優於每 60 秒 storm。若未來需要「有限次數 + 間隔 backoff」的即時重試，可另案在此基礎上加。

驗證

make ci 全綠：1061 passed（+6 新測試）。
regression（test_service TestIsDueRetryStorm）：failed-today → not due、failed-yesterday → due、weekly failed-today → not due、running-today → not due（防重複啟動）。
test_db：get_last_run 回傳最新（含 failed）、對照 get_last_successful_run 仍只回成功。

備註

已另外（本機 .runtime/schedules.json，不在此 PR）把該 misconfigured job 停用止血——它本就註冊錯誤（fewer-permission-prompts 是 Claude Code 內建 skill，非 repo skills/ skill，runner 找不到 skills/<name>/SKILL.md）。本 PR 修的是 scheduler 不該對任何失敗 job retry-storm 的通用 robustness。

…etry-storm) failing job 會每個 tick 重試、無 backoff：is_due 只比對最近一次*成功* （get_last_successful_run），永遠失敗的 job 在排程窗口內每個 tick 都判為 due。實測 fewer-permission-prompts-weekly 連續失敗 3746 次（每週一整天每 60 秒一次）。修法：以最後一次*嘗試*（任何狀態 success/failed/running）為準。 - db.py：新增 get_last_run(job_id)（不過濾 status，回傳最新一筆）。 - cli.py tick：last_runs 改用 get_last_run 餵 is_due（原本 get_last_successful_run）。 - service.py：is_due 參數 last_success → last_run、docstring 載明「同一排程週期內已嘗試過就不再 due，失敗 job 等下一週期才重跑」。日期比對邏輯不變。 - get_last_successful_run 保留（status 報表等仍可用）。 tests： - test_service：新增 TestIsDueRetryStorm（failed-today→not due、failed-yesterday→due、 weekly failed-today→not due、running-today→not due 防重複啟動）。 - test_db：get_last_run 回傳最新（含 failed）、對照 get_last_successful_run 仍只回成功。 make ci 全綠（1061 passed）。

howie · 2026-06-23T13:21:52Z

Final Aggregated Review — PR #170 (scheduler retry-storm fix)

Mode

group-review (3/3 voices: Claude [code-reviewer + silent-failure-hunter] / Codex / agy)

Consensus / verified Critical (must fix)

C1. Dependency gating regression across ticks (Codex — lead-verified REAL)

File: tasks/scheduler/cli.py:82
The downstream-skip guard is dep in due_ids and dep not in completed_ok. It only blocks a
downstream job when the failed upstream is in the current due set. The fix makes a
failed/attempted upstream non-due (its today-dated run record removes it from due_ids).
So when upstream and downstream have different times, or downstream becomes due in a later
tick than the one where upstream failed, dep in due_ids is False → downstream runs with
stale/missing upstream output. Old code kept the failed upstream due (storm) which, as a
side effect, kept the dependency gate working.
Lead verification: read cli.py:78-103 — confirmed the gate keys off due_ids membership, not
per-period dep success. Real correctness regression for any depends_on chain.
Fix direction: gate downstream on the upstream's actual status this period (query last run
per period; skip downstream if any dep is not success-this-period), independent of due_ids.

C2. Crash / failure becomes a SILENT period-skip; skips never surfaced (silent-failure-hunter)

File: tasks/scheduler/service.py:30-39, cli.py:63,70
New semantics treat any same-period attempt (failed / running) as "done", so a job that
crashed mid-run (record_start committed, process killed before record_finish) is silently
skipped for the rest of the period. is_due returning False emits nothing; the tick exits
silently. Old code re-ran a crashed job next tick. Net: the storm fix converts a crash into a
silent multi-hour/period skip.
Fix direction (minimum): surface a warning when a job is skipped because of a same-period
failed/running record, so a wedged/failed job is visible. cleanup_stale_runs() return value
is also discarded (cli.py:63) — echo it when > 0.

Important

I1. Transient failure on weekly/monthly/quarterly silently parks a whole period (silent-failure-hunter)

A single transient blip (e.g. ACP gateway momentarily down) on a quarterly job now silently
skips ~90 days with no surfaced signal. Period-skip is too blunt for long-period jobs.
Fix direction: bounded same-period retry (N attempts with a min interval) instead of
all-or-nothing; or at least surface the "failed, dormant until next period" state.

I2. (agy) previous-period 'running' row → potential double-run — LARGELY MITIGATED

agy: a job still 'running' from yesterday → date < today True → is_due True → double-run.
Lead assessment: cleanup_stale_runs() runs first (cli.py:63) and flips >2h 'running' rows to
timeout; all jobs have timeout_seconds ≤ 1800s, so a genuine >24h run is impossible. Mostly
theoretical. agy's suggested fix (if status=='running': return False) is WRONG — it would make
a stale-running job never run again. Do NOT apply that; cleanup already handles it. Optionally
add a defensive test. Downgraded from Critical.

Refuted (lead)

agy [Important] timezone mismatch: all writers use now.isoformat() (local naive) and
is_due uses datetime.now() (local naive) — consistent, no UTC storage anywhere. REFUTED.

Actionable NIT (clean up)

N1 (code-reviewer + agy): get_last_successful_run is now dead production code. Given the storm
history, add a comment "retained for reporting; do NOT feed into is_due" OR remove it + its tests.
N2 (code-reviewer + agy): missing monthly/quarterly retry-storm test (the monthly|bimonthly| quarterly branch has distinct Y/M/D equality logic, untested for failed-same-day).
N3 (code-reviewer): stale comment at db.py cleanup ("1 小時" — actually 1h × multiplier = 2h).
N4 (silent-failure-hunter): ORDER BY started_at DESC tiebreak nondeterministic on same-second
rows; consider , id DESC. Pre-existing (get_last_successful_run has it too). Minor.

Verdict

NEEDS_CHANGES — C1 is a real correctness regression (dependency chains), C2/I1 are real silent
failure modes introduced by the period-skip approach. The simple "attempted → not due" design is
too blunt. Two fix directions below; this is a design decision for the human.

Voices unavailable

none (3/3 active)

howie added the enhancement New feature or request label Jun 23, 2026

howie merged commit 77af75b into main Jun 24, 2026
1 check passed

howie deleted the worktree-scheduler-retry-backoff branch June 24, 2026 06:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(scheduler): gate is_due on last attempt to stop retry-storm#170

fix(scheduler): gate is_due on last attempt to stop retry-storm#170
howie merged 1 commit into
mainfrom
worktree-scheduler-retry-backoff

howie commented Jun 23, 2026

Uh oh!

howie commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

howie commented Jun 23, 2026

背景

修法

取捨

驗證

備註

Uh oh!

howie commented Jun 23, 2026

Final Aggregated Review — PR #170 (scheduler retry-storm fix)

Mode

Consensus / verified Critical (must fix)

C1. Dependency gating regression across ticks (Codex — lead-verified REAL)

C2. Crash / failure becomes a SILENT period-skip; skips never surfaced (silent-failure-hunter)

Important

I1. Transient failure on weekly/monthly/quarterly silently parks a whole period (silent-failure-hunter)

I2. (agy) previous-period 'running' row → potential double-run — LARGELY MITIGATED

Refuted (lead)

Actionable NIT (clean up)

Verdict

Voices unavailable

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant