#253: fix provider-workload gate ENOENT race (scope_failed flake)#254
Conversation
acquireProviderWorkloadGate won the gate via mkdirSync then wrote owner.json; a concurrent release/reclaim removing the gate dir between those two steps threw ENOENT, which the outer catch re-threw (only EEXIST was retried). It escaped acquireProviderWorkloadLease and the Grok reviewer catch-all mislabeled the loser scope_failed instead of provider_workload_blocked, flaking the concurrent-Grok smoke (6 !== 7). Treat ENOENT (gate vanished mid-acquire) as a deadline-bounded retry alongside EEXIST, and move the deadline check above the reclaim-continue so the retry path can never hot-loop. Any other error still throws (fail-loud). Synced to all plugin/relay copies. Adds a deterministic fail-on-revert regression injecting the exact mkdir->owner-write ENOENT race. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request updates the gate acquisition logic in several review-workload.mjs files to handle ENOENT errors and reorder the gate reclamation check, alongside adding a unit test for gate directory races. However, catching ENOENT without verifying if it originated from mkdirSync can cause a tight 100% CPU hot-loop if the parent directory is missing, because the reclamation check will also fail with ENOENT and trigger an immediate loop continuation without sleeping. It is recommended to check error.syscall === "mkdir" to sleep before continuing, and to avoid unnecessary optional chaining on the guaranteed non-null error object.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…ning (#253) Review follow-up (gemini HIGH): catching ENOENT broadly in the gate-race fix could 100%-CPU hot-loop until the deadline when the gate's PARENT (lock root) is missing — mkdirSync(gateDir) throws ENOENT (syscall "mkdir"), tryReclaim's renameSync also ENOENTs and returns true, so the loop continues with no sleep and the next mkdir fails identically. Deadline-bounded, but a multi-second CPU burn in the lock-root-deleted degraded state (concurrent teardown / tmp cleaner). Distinguish structural ENOENT (syscall "mkdir") from the mid-acquire race ENOENT (owner-file open): re-establish the lock root and sleep-pace, so a vanished root self-heals on the next iteration and can never spin. The race path is unchanged. Adds a deterministic fail-on-revert regression injecting the exact missing-root mkdir ENOENT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|
Resolve review-workload.mjs conflict by composing #242's counting-semaphore gate with #254's gate-race fix (now in main): - broaden the gate catch to swallow EEXIST OR ENOENT (fail-loud on all else) - on structural mkdir-ENOENT (lock root vanished), recreate dirname(gateDir) — the caller-provided lockRoot in #242's design — NOT main's env-derived lockRoot(env), which is undefined in #242 (a silently-swallowed ReferenceError that would hot-loop to the deadline) - keep #242's capture seam on the reclaim call (gateDir, env, capture) Port #254's two gate-race tests to #242's lease API (concurrencyKey/limit/ lockRoot + RELAY_WORKLOAD_TEST_MODE), rewrite the shim's ./process-identity.mjs import to the real module, and add existsSync/readdirSync to the fs shims. Both fail-on-revert proven against the two fixes above. 10 byte-identical copies regenerated via sync; lint green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge The #242 bring-current (1af8ef0) composed #254's forced deadline->recreate->reclaim gate ordering with #242's two reclaim-success tests that pinned GATE_TIMEOUT_MS="1". With a 1ms budget the deadline (now+1ms) can fire before reclaim is reached, returning {ok:false} -> ~5% flake (reproduced 3/60). Production default is 5000ms, so production never misfires; purely a test-budget artifact. Test-only fix: single-source RECLAIM_GATE_TIMEOUT_MS="250" const (documented, guarded against being lowered back to "1") for the two reclaim-success tests. Production review-workload.mjs unchanged. The age-guard (gateAgeMs<=gateTimeoutMs) is off the reclaim-success path -- both tests write a reclaimable owner so gateOwnerActive returns false and reclaim runs before the guard -- so the bump cannot weaken reclaim correctness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
#234: replace the single-flight relay lock with a bounded per-provider counting semaphore so multiple source-bearing reviews run concurrently under a limit. - Counting-semaphore lease: slot files + inner mkdir count-then-claim gate - Composes #254's gate-race ENOENT fix (deadline → recreate → reclaim ordering) - Identity-free block messages (§8), exit-listener cleanup, 11-copy canonical byte-sync - Brought current to post-#252 main (1af8ef0) - Reclaim-test flake fixed test-only (256c5ea): GATE_TIMEOUT_MS "1" → single-source "250" Fix-delta re-review: unanimous SHIP (Kimi/Claude/GPT/GLM), CI 8/8 + SonarCloud green. Closes #234 Follow-ups tracked in #239



What
Fixes a real concurrency race in the production provider-workload lease
(
scripts/lib/review-workload.mjs, byte-synced to every plugin/relay copy)that flaked the concurrent-Grok smoke (
6 !== 7).Root cause
acquireProviderWorkloadGateacquires the mkdir-based gate in two non-atomicsteps:
mkdirSync(gateDir)thenwriteFileSync(gateDir/owner.json, {flag:"wx"}).Under concurrent same-provider source-bearing runs a racing release/reclaim
removes
gateDirbetween the steps, so the owner write throws ENOENT. Theouter
catchonly retriedEEXIST(if (error?.code !== "EEXIST") throw error),so ENOENT escaped
acquireProviderWorkloadLeaseand the Grok reviewer catch-allmislabeled the loser
scope_failedinstead ofprovider_workload_blocked— a jobthat is neither
completednorblocked, breaking the partition assertion.Captured ground truth:
status:"failed", error_code:"scope_failed", error_message:"ENOENT: ... open '.../grok-web.json.gate/owner.json'".Fix
retry, alongside
EEXIST. Any other error still throws (fail-loud preserved).Date.now() >= deadlinecheck above the reclaimcontinue, so theretry path is hard deadline-bounded and can never hot-loop.
Tests
tests/unit/review-workload.test.mjs: it rewrites the module'snode:fsimportto a shim that removes
gateDiron the firstowner.jsonwrite and throws theexact ENOENT, then asserts the real
acquireProviderWorkloadLeasestill acquires.Revert the fix → the test throws ENOENT (
# fail 1); with the fix →# pass 1.Verification (local)
npm run lintclean (incl. the fullsync … --checkchain; all copies identical).scope_failed/6 !== 7.npm run test:full→ 3004 tests, 0 fail.Out of scope (tracked, not fixed here)
listretention flake (false !== true) in the samesmoke is a different subsystem (
persistRecord/withStateLock/cmdList)— documented as a follow-up on Grok provider-workload gate: ENOENT-on-owner-write race mislabeled scope_failed (concurrent lease) #253 to keep this PR MECE.
ENOENT it is unreachable for this case.
Closes #253
🤖 Generated with Claude Code