#253: fix provider-workload gate ENOENT race (scope_failed flake) by seungpyoson · Pull Request #254 · seungpyoson/relay

seungpyoson · 2026-06-26T04:25:56Z

What

Fixes a real concurrency race in the production provider-workload lease
(scripts/lib/review-workload.mjs, byte-synced to every plugin/relay copy)
that flaked the concurrent-Grok smoke (6 !== 7).

Root cause

acquireProviderWorkloadGate acquires the mkdir-based gate in two non-atomic
steps: mkdirSync(gateDir) then writeFileSync(gateDir/owner.json, {flag:"wx"}).
Under concurrent same-provider source-bearing runs a racing release/reclaim
removes gateDir between the steps, so the owner write throws ENOENT. The
outer catch only retried EEXIST (if (error?.code !== "EEXIST") throw error),
so ENOENT escaped acquireProviderWorkloadLease and the Grok reviewer catch-all
mislabeled the loser scope_failed instead of provider_workload_blocked — a job
that is neither completed nor blocked, breaking the partition assertion.

Captured ground truth:
status:"failed", error_code:"scope_failed", error_message:"ENOENT: ... open '.../grok-web.json.gate/owner.json'".

Fix

Retry ENOENT (gate vanished mid-acquire) as a deadline-bounded acquisition
retry, alongside EEXIST. Any other error still throws (fail-loud preserved).
Move the Date.now() >= deadline check above the reclaim continue, so the
retry path is hard deadline-bounded and can never hot-loop.
Byte-synced to all 10 plugin/relay copies.

Tests

New deterministic fail-on-revert regression in
tests/unit/review-workload.test.mjs: it rewrites the module's node:fs import
to a shim that removes gateDir on the first owner.json write and throws the
exact ENOENT, then asserts the real acquireProviderWorkloadLease still acquires.
Revert the fix → the test throws ENOENT (# fail 1); with the fix → # pass 1.

Verification (local)

Fail-on-revert proven by hand (revert → fail, restore → pass).
npm run lint clean (incl. the full sync … --check chain; all copies identical).
Flaky smoke stressed 20× → 20/20, zero scope_failed / 6 !== 7.
npm run test:full → 3004 tests, 0 fail.

Out of scope (tracked, not fixed here)

The secondary state-index/list retention flake (false !== true) in the same
smoke is a different subsystem (persistRecord / withStateLock / cmdList)
— documented as a follow-up on Grok provider-workload gate: ENOENT-on-owner-write race mislabeled scope_failed (concurrent lease) #253 to keep this PR MECE.
The reviewer catch-all mislabel is left untouched: once the lease stops throwing
ENOENT it is unreachable for this case.

Closes #253

🤖 Generated with Claude Code

acquireProviderWorkloadGate won the gate via mkdirSync then wrote owner.json; a concurrent release/reclaim removing the gate dir between those two steps threw ENOENT, which the outer catch re-threw (only EEXIST was retried). It escaped acquireProviderWorkloadLease and the Grok reviewer catch-all mislabeled the loser scope_failed instead of provider_workload_blocked, flaking the concurrent-Grok smoke (6 !== 7). Treat ENOENT (gate vanished mid-acquire) as a deadline-bounded retry alongside EEXIST, and move the deadline check above the reclaim-continue so the retry path can never hot-loop. Any other error still throws (fail-loud). Synced to all plugin/relay copies. Adds a deterministic fail-on-revert regression injecting the exact mkdir->owner-write ENOENT race. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request updates the gate acquisition logic in several review-workload.mjs files to handle ENOENT errors and reorder the gate reclamation check, alongside adding a unit test for gate directory races. However, catching ENOENT without verifying if it originated from mkdirSync can cause a tight 100% CPU hot-loop if the parent directory is missing, because the reclamation check will also fail with ENOENT and trigger an immediate loop continuation without sleeping. It is recommended to check error.syscall === "mkdir" to sleep before continuing, and to avoid unnecessary optional chaining on the guaranteed non-null error object.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

…ning (#253) Review follow-up (gemini HIGH): catching ENOENT broadly in the gate-race fix could 100%-CPU hot-loop until the deadline when the gate's PARENT (lock root) is missing — mkdirSync(gateDir) throws ENOENT (syscall "mkdir"), tryReclaim's renameSync also ENOENTs and returns true, so the loop continues with no sleep and the next mkdir fails identically. Deadline-bounded, but a multi-second CPU burn in the lock-root-deleted degraded state (concurrent teardown / tmp cleaner). Distinguish structural ENOENT (syscall "mkdir") from the mid-acquire race ENOENT (owner-file open): re-establish the lock root and sleep-pace, so a vanished root self-heals on the next iteration and can never spin. The race path is unchanged. Adds a deterministic fail-on-revert regression injecting the exact missing-root mkdir ENOENT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

greptile-apps

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

sonarqubecloud · 2026-06-26T11:59:06Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

Resolve review-workload.mjs conflict by composing #242's counting-semaphore gate with #254's gate-race fix (now in main): - broaden the gate catch to swallow EEXIST OR ENOENT (fail-loud on all else) - on structural mkdir-ENOENT (lock root vanished), recreate dirname(gateDir) — the caller-provided lockRoot in #242's design — NOT main's env-derived lockRoot(env), which is undefined in #242 (a silently-swallowed ReferenceError that would hot-loop to the deadline) - keep #242's capture seam on the reclaim call (gateDir, env, capture) Port #254's two gate-race tests to #242's lease API (concurrencyKey/limit/ lockRoot + RELAY_WORKLOAD_TEST_MODE), rewrite the shim's ./process-identity.mjs import to the real module, and add existsSync/readdirSync to the fs shims. Both fail-on-revert proven against the two fixes above. 10 byte-identical copies regenerated via sync; lint green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… merge The #242 bring-current (1af8ef0) composed #254's forced deadline->recreate->reclaim gate ordering with #242's two reclaim-success tests that pinned GATE_TIMEOUT_MS="1". With a 1ms budget the deadline (now+1ms) can fire before reclaim is reached, returning {ok:false} -> ~5% flake (reproduced 3/60). Production default is 5000ms, so production never misfires; purely a test-budget artifact. Test-only fix: single-source RECLAIM_GATE_TIMEOUT_MS="250" const (documented, guarded against being lowered back to "1") for the two reclaim-success tests. Production review-workload.mjs unchanged. The age-guard (gateAgeMs<=gateTimeoutMs) is off the reclaim-success path -- both tests write a reclaimable owner so gateOwnerActive returns false and reclaim runs before the guard -- so the bump cannot weaken reclaim correctness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

#234: replace the single-flight relay lock with a bounded per-provider counting semaphore so multiple source-bearing reviews run concurrently under a limit. - Counting-semaphore lease: slot files + inner mkdir count-then-claim gate - Composes #254's gate-race ENOENT fix (deadline → recreate → reclaim ordering) - Identity-free block messages (§8), exit-listener cleanup, 11-copy canonical byte-sync - Brought current to post-#252 main (1af8ef0) - Reclaim-test flake fixed test-only (256c5ea): GATE_TIMEOUT_MS "1" → single-source "250" Fix-delta re-review: unanimous SHIP (Kimi/Claude/GPT/GLM), CI 8/8 + SonarCloud green. Closes #234 Follow-ups tracked in #239

gemini-code-assist Bot reviewed Jun 26, 2026

View reviewed changes

Test User and others added 2 commits June 26, 2026 13:51

Merge main into fix/253-grok-workload-gate-race (bring current)

125cc9f

seungpyoson marked this pull request as ready for review June 26, 2026 11:57

greptile-apps Bot reviewed Jun 26, 2026

View reviewed changes

seungpyoson merged commit cc6f3ba into main Jun 26, 2026
8 checks passed

seungpyoson deleted the fix/253-grok-workload-gate-race branch June 26, 2026 11:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#253: fix provider-workload gate ENOENT race (scope_failed flake)#254

#253: fix provider-workload gate ENOENT race (scope_failed flake)#254
seungpyoson merged 3 commits into
mainfrom
fix/253-grok-workload-gate-race

seungpyoson commented Jun 26, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot left a comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

seungpyoson commented Jun 26, 2026

What

Root cause

Fix

Tests

Verification (local)

Out of scope (tracked, not fixed here)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

sonarqubecloud Bot commented Jun 26, 2026

Quality Gate passed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant