Skip to content

Add max total crash limit to stop relaunching persistently failing pr…#1436

Open
kezhangMS wants to merge 2 commits into
microsoft:mainfrom
kezhangMS:totalCrashLimit
Open

Add max total crash limit to stop relaunching persistently failing pr…#1436
kezhangMS wants to merge 2 commits into
microsoft:mainfrom
kezhangMS:totalCrashLimit

Conversation

@kezhangMS
Copy link
Copy Markdown
Member

@kezhangMS kezhangMS commented Apr 10, 2026

…ocesses

The existing crash protection only stops relaunching a process if it
exits more than 10 times within 60 seconds. This misses the case where
a process fails at a slower but persistent rate (e.g. 2 times per
minute), which causes it to be relaunched indefinitely.

Add a total crash counter (max 50) alongside the existing per-minute
rolling window so that persistently failing processes are eventually
stopped regardless of crash rate.

…ocesses

The existing crash protection only stops relaunching a process if it
  exits more than 10 times within 60 seconds. This misses the case where
  a process fails at a slower but persistent rate (e.g. 2 times per
  minute), which causes it to be relaunched indefinitely.

  Add a total crash counter (max 20) alongside the existing per-minute
  rolling window so that persistently failing processes are eventually
  stopped regardless of crash rate.
@benhillis benhillis added the bug Something isn't working label May 17, 2026
@benhillis
Copy link
Copy Markdown
Member

Triage note (2026-05): Thanks — the rationale (slow but persistent crash loops bypass the per-minute rate limit and run forever) is correct and the fix is the right shape. Two small things for the next round:

  1. c_maxTotalCrashes = 20 over the lifetime of a WSLGd process is quite tight — a long-lived session that crashes Weston a few times across, say, a network change, a sleep/wake, and an rdp-disconnect over the course of a workday could plausibly hit it. Consider 50 or making it a function of uptime. Happy to take 20 if you''ve actually seen the pathology you''re guarding against trigger inside that budget.
  2. Please add a brief log line when the total cap is reached that distinguishes it from the per-minute cap in /mnt/wslg/stderr.log — your LOG_INFO does that, good.

Will queue for an internal buddy build once you reply or push an update.

Copilot AI review requested due to automatic review settings May 18, 2026 18:18
@kezhangMS
Copy link
Copy Markdown
Member Author

Thanks for the comments. Updated the total crash threshold from 20 to 50.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a cumulative crash counter to ProcessMonitor::Run so that processes which fail at a slow but persistent rate are eventually given up on, in addition to the existing per-minute rolling-window limit. Also replaces magic numbers in the log message with named constants.

Changes:

  • Introduces totalCrashes map and c_maxTotalCrashes constant to track lifetime crash counts per command.
  • Refactors the per-minute threshold to a named constant c_maxCrashesPerMinute and updates the log message.
  • Adds a new log/stop branch when total crashes exceed the lifetime threshold.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread WSLGd/ProcessMonitor.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants