Skip to content

Fix two flaky Windows unit tests in jobsupervisor#440

Draft
neddp wants to merge 1 commit into
mainfrom
fix-windows-test-cleanup-race
Draft

Fix two flaky Windows unit tests in jobsupervisor#440
neddp wants to merge 1 commit into
mainfrom
fix-windows-test-cleanup-race

Conversation

@neddp

@neddp neddp commented Jun 15, 2026

Copy link
Copy Markdown
Member

Summary

Two independent flaky tests in the Windows test-unit CI job.

1. Windows Job Supervisor — file handle race in test cleanup

After Stop() + RemoveAllJobs(), the Windows SCM marks services as deleted but the underlying WinSW/pipe.exe process may still hold file handles on job-service-wrapper.exe and its log files briefly. logDir (in AfterEach) and TempDir (in AfterSuite) used hard Expect calls that failed immediately on Access is denied / file is being used by another process. Wrap them in Eventually(..., 60s) to match the existing jobDir pattern.

Symptoms:

[FAILED] Timed out after 60.000s.
unlinkat ...job-3-.../job-service-wrapper.exe: Access is denied.
  In [AfterEach] at: windows_job_supervisor_test.go:489

unlinkat .../job-service-wrapper.err.log: The process cannot access the file because it is being used by another process.
  In [AfterSuite] at: windows_job_supervisor_test.go:68

2. Monit Job Supervisor — fake clock race in StopAndWait timer test

The test uses the same timer for waiting for pending and waiting for services to stop used a racy Eventually(WatcherCount) + Increment pattern. Between the Eventually seeing count=2 and Increment running, the goroutine could transition through checkServices() (where count drops to 1), causing Increment to miss the stop-loop Sleep watcher and leave the goroutine blocked on a fake timer that would never fire. Replace both pairs with advanceTime(), which atomically waits for the required watcher count before incrementing.

Symptoms:

[FAILED] Timed out after 1.001s.
When passed a matcher, ReceiveMatcher's channel *must* receive something.
  In [It] at: monit_job_supervisor_test.go:394

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 71324b8d-8a37-4e0c-8ccd-cce97999a6b6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-windows-test-cleanup-race

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@neddp neddp force-pushed the fix-windows-test-cleanup-race branch from 2633b35 to 1343eca Compare June 15, 2026 05:39
@neddp neddp changed the title Fix flaky Windows test cleanup: retry RemoveAll until file handles are released Fix two flaky Windows unit tests in jobsupervisor Jun 15, 2026
Windows Job Supervisor (AfterEach/AfterSuite):
After Stop()+RemoveAllJobs(), the Windows SCM marks services deleted but the
underlying WinSW/pipe.exe process may still hold file handles on
job-service-wrapper.exe and its log files briefly. With multiple services
and flapping restart cycles on a busy CI runner, this can exceed 60 seconds.
Increase the Eventually timeout to 2 minutes with a 1s poll interval, and
apply the same pattern to logDir (AfterEach) and TempDir (AfterSuite) which
previously used a hard Expect.

Monit Job Supervisor (StopAndWait timer test):
The test used a racy Eventually(WatcherCount)+Increment pattern. Between the
Eventually seeing count=2 and Increment running, the goroutine could transition
through checkServices() (where count drops to 1), causing Increment to miss the
stop-loop Sleep watcher and leave the goroutine blocked on a fake timer that
would never fire. Replace both pairs with advanceTime(), which atomically waits
for the required watcher count before incrementing.
@neddp neddp force-pushed the fix-windows-test-cleanup-race branch from 1343eca to 5b0beed Compare June 15, 2026 06:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

1 participant