Add max total crash limit to stop relaunching persistently failing pr… #1436
kezhangMS wants to merge 2 commits into
…ocesses
The existing crash protection only stops relaunching a process if it exits more than 10 times within 60 seconds. This misses the case where a process fails at a slower but persistent rate (e.g. 2 times per minute), which causes it to be relaunched indefinitely. Add a total crash counter (max 20) alongside the existing per-minute rolling window so that persistently failing processes are eventually stopped regardless of crash rate.
Triage note (2026-05): Thanks — the rationale (slow but persistent crash loops bypass the per-minute rate limit and run forever) is correct and the fix is the right shape. Two small things for the next round:
Will queue for an internal buddy build once you reply or push an update.
Thanks for the comments. Updated the total crash threshold from 20 to 50.
Pull request overview
Adds a cumulative crash counter to ProcessMonitor::Run so that processes which fail at a slow but persistent rate are eventually given up on, in addition to the existing per-minute rolling-window limit. Also replaces magic numbers in the log message with named constants.
Changes:
- Introduces a totalCrashes map and a c_maxTotalCrashes constant to track lifetime crash counts per command.
- Refactors the per-minute threshold to a named constant, c_maxCrashesPerMinute, and updates the log message.
- Adds a new log/stop branch when total crashes exceed the lifetime threshold.
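The combination of the two limits can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the `CrashTracker` class, its method name, and the member names are hypothetical, and only the constant names and thresholds (10 per minute, 50 in total) are taken from the discussion above.

```cpp
#include <chrono>
#include <cstddef>
#include <deque>
#include <map>
#include <string>

// Thresholds as described in this PR; everything else here is hypothetical.
constexpr std::size_t c_maxCrashesPerMinute = 10;
constexpr std::size_t c_maxTotalCrashes = 50;

// Hypothetical helper, not the actual ProcessMonitor implementation.
class CrashTracker
{
public:
    using Clock = std::chrono::steady_clock;

    // Records one crash of `command` at time `now`. Returns true if the
    // process may be relaunched, false if either limit has been exceeded.
    bool RecordCrashAndShouldRelaunch(const std::string& command, Clock::time_point now)
    {
        auto& recent = m_recentCrashes[command];
        recent.push_back(now);

        // Drop crashes that have fallen out of the 60-second rolling window.
        while (!recent.empty() && now - recent.front() > std::chrono::seconds(60))
        {
            recent.pop_front();
        }

        // The lifetime counter is never reset, so slow crash loops still add up.
        const std::size_t total = ++m_totalCrashes[command];

        if (recent.size() > c_maxCrashesPerMinute)
        {
            return false; // Crashing too fast.
        }

        if (total > c_maxTotalCrashes)
        {
            return false; // Crashing slowly but persistently.
        }

        return true;
    }

private:
    std::map<std::string, std::deque<Clock::time_point>> m_recentCrashes;
    std::map<std::string, std::size_t> m_totalCrashes;
};
```

Passing the timestamp in as a parameter, rather than reading the clock inside the method, keeps the sketch easy to exercise with synthetic crash schedules.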
…ocesses
The existing crash protection only stops relaunching a process if it
exits more than 10 times within 60 seconds. This misses the case where
a process fails at a slower but persistent rate (e.g. 2 times per
minute), which causes it to be relaunched indefinitely.
Add a total crash counter (max 50) alongside the existing per-minute
rolling window so that persistently failing processes are eventually
stopped regardless of crash rate.
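
As a rough worked example of what the total limit means in practice, the arithmetic below estimates how long a steadily crashing process keeps being relaunched. It assumes crashes are evenly spaced and that the process is stopped on the first crash that exceeds the limit; the function name `SecondsUntilGivenUp` is hypothetical and does not appear in the PR.

```cpp
#include <cstddef>

// Assumed to mirror the total limit named in this PR.
constexpr std::size_t c_maxTotalCrashes = 50;

// Seconds from the first crash until the crash that exceeds the total limit,
// for a process that crashes at a fixed interval (hypothetical helper).
constexpr std::size_t SecondsUntilGivenUp(std::size_t secondsBetweenCrashes)
{
    // The (c_maxTotalCrashes + 1)-th crash is the first to exceed the limit,
    // and it happens c_maxTotalCrashes intervals after the first crash.
    return c_maxTotalCrashes * secondsBetweenCrashes;
}

// Two crashes per minute (one every 30 seconds): stopped after 25 minutes.
static_assert(SecondsUntilGivenUp(30) == 1500, "expected 1500 seconds");
```

Under these assumptions, the example rate from the description (2 crashes per minute) is cut off after about 25 minutes instead of being relaunched indefinitely.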