Skip to content

#253: fix provider-workload gate ENOENT race (scope_failed flake)#254

Merged
seungpyoson merged 3 commits into
mainfrom
fix/253-grok-workload-gate-race
Jun 26, 2026
Merged

#253: fix provider-workload gate ENOENT race (scope_failed flake)#254
seungpyoson merged 3 commits into
mainfrom
fix/253-grok-workload-gate-race

Conversation

@seungpyoson

Copy link
Copy Markdown
Owner

What

Fixes a real concurrency race in the production provider-workload lease
(scripts/lib/review-workload.mjs, byte-synced to every plugin/relay copy)
that flaked the concurrent-Grok smoke (6 !== 7).

Root cause

acquireProviderWorkloadGate acquires the mkdir-based gate in two non-atomic
steps: mkdirSync(gateDir) then writeFileSync(gateDir/owner.json, {flag:"wx"}).
Under concurrent same-provider source-bearing runs a racing release/reclaim
removes gateDir between the steps, so the owner write throws ENOENT. The
outer catch only retried EEXIST (if (error?.code !== "EEXIST") throw error),
so ENOENT escaped acquireProviderWorkloadLease and the Grok reviewer catch-all
mislabeled the loser scope_failed instead of provider_workload_blocked — a job
that is neither completed nor blocked, breaking the partition assertion.

Captured ground truth:
status:"failed", error_code:"scope_failed", error_message:"ENOENT: ... open '.../grok-web.json.gate/owner.json'".

Fix

  • Retry ENOENT (gate vanished mid-acquire) as a deadline-bounded acquisition
    retry, alongside EEXIST. Any other error still throws (fail-loud preserved).
  • Move the Date.now() >= deadline check above the reclaim continue, so the
    retry path is hard deadline-bounded and can never hot-loop.
  • Byte-synced to all 10 plugin/relay copies.

Tests

  • New deterministic fail-on-revert regression in
    tests/unit/review-workload.test.mjs: it rewrites the module's node:fs import
    to a shim that removes gateDir on the first owner.json write and throws the
    exact ENOENT, then asserts the real acquireProviderWorkloadLease still acquires.
    Revert the fix → the test throws ENOENT (# fail 1); with the fix → # pass 1.

Verification (local)

  • Fail-on-revert proven by hand (revert → fail, restore → pass).
  • npm run lint clean (incl. the full sync … --check chain; all copies identical).
  • Flaky smoke stressed 20× → 20/20, zero scope_failed / 6 !== 7.
  • npm run test:full → 3004 tests, 0 fail.

Out of scope (tracked, not fixed here)

Closes #253

🤖 Generated with Claude Code

acquireProviderWorkloadGate won the gate via mkdirSync then wrote owner.json; a concurrent release/reclaim removing the gate dir between those two steps threw ENOENT, which the outer catch re-threw (only EEXIST was retried). It escaped acquireProviderWorkloadLease and the Grok reviewer catch-all mislabeled the loser scope_failed instead of provider_workload_blocked, flaking the concurrent-Grok smoke (6 !== 7).

Treat ENOENT (gate vanished mid-acquire) as a deadline-bounded retry alongside EEXIST, and move the deadline check above the reclaim-continue so the retry path can never hot-loop. Any other error still throws (fail-loud). Synced to all plugin/relay copies.

Adds a deterministic fail-on-revert regression injecting the exact mkdir->owner-write ENOENT race.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the gate acquisition logic in several review-workload.mjs files to handle ENOENT errors and reorder the gate reclamation check, alongside adding a unit test for gate directory races. However, catching ENOENT without verifying if it originated from mkdirSync can cause a tight 100% CPU hot-loop if the parent directory is missing, because the reclamation check will also fail with ENOENT and trigger an immediate loop continuation without sleeping. It is recommended to check error.syscall === "mkdir" to sleep before continuing, and to avoid unnecessary optional chaining on the guaranteed non-null error object.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread plugins/agy/scripts/lib/review-workload.mjs
Comment thread plugins/api-reviewers/scripts/lib/review-workload.mjs
Comment thread plugins/claude/scripts/lib/review-workload.mjs
Comment thread plugins/gemini/scripts/lib/review-workload.mjs
Comment thread plugins/grok/scripts/lib/review-workload.mjs
Comment thread relay/relay-agy/scripts/lib/review-workload.mjs
Comment thread relay/relay-gemini/scripts/lib/review-workload.mjs
Comment thread relay/relay-grok/scripts/lib/review-workload.mjs
Comment thread relay/relay-kimi/scripts/lib/review-workload.mjs
Comment thread scripts/lib/review-workload.mjs
Test User and others added 2 commits June 26, 2026 13:51
…ning (#253)

Review follow-up (gemini HIGH): catching ENOENT broadly in the gate-race fix could 100%-CPU hot-loop until the deadline when the gate's PARENT (lock root) is missing — mkdirSync(gateDir) throws ENOENT (syscall "mkdir"), tryReclaim's renameSync also ENOENTs and returns true, so the loop continues with no sleep and the next mkdir fails identically. Deadline-bounded, but a multi-second CPU burn in the lock-root-deleted degraded state (concurrent teardown / tmp cleaner).

Distinguish structural ENOENT (syscall "mkdir") from the mid-acquire race ENOENT (owner-file open): re-establish the lock root and sleep-pace, so a vanished root self-heals on the next iteration and can never spin. The race path is unchanged.

Adds a deterministic fail-on-revert regression injecting the exact missing-root mkdir ENOENT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@seungpyoson seungpyoson marked this pull request as ready for review June 26, 2026 11:57

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@seungpyoson seungpyoson merged commit cc6f3ba into main Jun 26, 2026
8 checks passed
@seungpyoson seungpyoson deleted the fix/253-grok-workload-gate-race branch June 26, 2026 11:58
@sonarqubecloud

Copy link
Copy Markdown

seungpyoson pushed a commit that referenced this pull request Jun 26, 2026
Resolve review-workload.mjs conflict by composing #242's counting-semaphore
gate with #254's gate-race fix (now in main):
- broaden the gate catch to swallow EEXIST OR ENOENT (fail-loud on all else)
- on structural mkdir-ENOENT (lock root vanished), recreate dirname(gateDir) —
  the caller-provided lockRoot in #242's design — NOT main's env-derived
  lockRoot(env), which is undefined in #242 (a silently-swallowed ReferenceError
  that would hot-loop to the deadline)
- keep #242's capture seam on the reclaim call (gateDir, env, capture)

Port #254's two gate-race tests to #242's lease API (concurrencyKey/limit/
lockRoot + RELAY_WORKLOAD_TEST_MODE), rewrite the shim's ./process-identity.mjs
import to the real module, and add existsSync/readdirSync to the fs shims.
Both fail-on-revert proven against the two fixes above. 10 byte-identical
copies regenerated via sync; lint green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
seungpyoson pushed a commit that referenced this pull request Jun 26, 2026
… merge

The #242 bring-current (1af8ef0) composed #254's forced deadline->recreate->reclaim gate ordering with #242's two reclaim-success tests that pinned GATE_TIMEOUT_MS="1". With a 1ms budget the deadline (now+1ms) can fire before reclaim is reached, returning {ok:false} -> ~5% flake (reproduced 3/60). Production default is 5000ms, so production never misfires; purely a test-budget artifact.

Test-only fix: single-source RECLAIM_GATE_TIMEOUT_MS="250" const (documented, guarded against being lowered back to "1") for the two reclaim-success tests. Production review-workload.mjs unchanged. The age-guard (gateAgeMs<=gateTimeoutMs) is off the reclaim-success path -- both tests write a reclaimable owner so gateOwnerActive returns false and reclaim runs before the guard -- so the bump cannot weaken reclaim correctness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
seungpyoson added a commit that referenced this pull request Jun 26, 2026
#234: replace the single-flight relay lock with a bounded per-provider counting
  semaphore so multiple source-bearing reviews run concurrently under a limit.

  - Counting-semaphore lease: slot files + inner mkdir count-then-claim gate
  - Composes #254's gate-race ENOENT fix (deadline → recreate → reclaim ordering)
  - Identity-free block messages (§8), exit-listener cleanup, 11-copy canonical byte-sync
  - Brought current to post-#252 main (1af8ef0)
  - Reclaim-test flake fixed test-only (256c5ea): GATE_TIMEOUT_MS "1" → single-source "250"

  Fix-delta re-review: unanimous SHIP (Kimi/Claude/GPT/GLM), CI 8/8 + SonarCloud green.

  Closes #234
  Follow-ups tracked in #239
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Grok provider-workload gate: ENOENT-on-owner-write race mislabeled scope_failed (concurrent lease)

1 participant