#234: bounded capability-driven concurrency for relay reviews#242
Conversation
Promote capturePidInfo({pid,starttime,argv0}) out of kimi's identity.mjs
into scripts/lib/process-identity.mjs so the #234 counting-semaphore in
review-workload.mjs can import one canonical (pid,starttime,argv0) liveness
check instead of a weaker kill(pid,0). Adds sync-process-identity.mjs, a
lint:sync chain entry, and a plugin-copies-in-sync assertion so the lib is
packaged into all 5 plugins before any consumer imports it. kimi/identity.mjs
now re-exports from the shared lib. Includes the #234 implementation plan.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
currentBootId's darwin path used kern.boottime (= wall - uptime), which an NTP/clock step shifts WITHOUT a reboot. The boot id is the sole safe trigger for clearing an unverifiable (capture_error) slot (design §6, invariant #4): 'boot id differs => reboot proven => holder gone'. A clock-derived id breaks that guarantee, so a clock step would falsely prove a reboot and reclaim a live holder's slot. Switch darwin to kern.bootsessionuuid (regenerated only at boot, clock-independent), keeping kern.boottime as the ancient-macOS fallback. New darwin test asserts the source is the uuid (no 'sec ='). Also corrects the #234 plan from the same depth-check: Task 2's reclaim scan must use a 4-state classifyHolder (alive|dead|unverifiable|foreign), not the boolean holderActive, which collapses alive/unverifiable/foreign and so cannot tell which occupied slots are boot-id-reclaimable. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generalize the single-flight provider lock into a bounded, capability-driven
counting semaphore (spec §4/§6/§8):
- acquireProviderWorkloadLease now takes {concurrencyKey, limit, lockRoot, ...};
source-bearing calls with no concurrencyKey throw (no provider-name fallback).
- All-slot accounting under the per-key gate; lowest-free-index claim.
- 4-state classifyHolder (alive|dead|unverifiable|foreign) drives reclaim:
dead reclaimed always; unverifiable reclaimed only on a stale boot id;
foreign/alive occupied. invalid_pid => dead (preserves pre-#234 pidAlive
semantics; capture_error => unverifiable, fail-closed). classifyHolder gains
a capture-injection seam so the unverifiable branch is testable on any host.
- lockRoot is obeyed verbatim (no internal RELAY_PROVIDER_WORKLOAD_LOCK_DIR
re-read); limit must be a positive int; capacity is counts-only.
- Legacy <key>.json read as a slot-0-equivalent for one-time migration.
Targeted tests green: review-workload + process-identity (22 pass, 0 fail,
darwin-sandbox skips only). Re-verified on a host where the acquire tests
actually run, which surfaced and fixed a classifier regression codex could not
see under its sandbox (all acquire tests skipped there).
WIP: the full suite is intentionally red until Task 6 converts the 5
source-bearing consumers to the new admission context (spec §11 — the plumbing
lands atomically; an unconverted route has no key and must fail closed).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a declarative CONCURRENCY_FACTS table and resolveConcurrencyAdmission()
to provider-route-policy.mjs (spec §5/§5.1/§5.2). The resolver turns a
route's capability fact into the {concurrencyKey, limit, lockRoot} admission
context that Task 2's acquireProviderWorkloadLease consumes:
- shared_state -> concurrencyKey = sha256(st_dev:st_ino of the resolved dir),
NO provider prefix (same dir => same key); limit forced to 1 (declaredLimit>1
throws); lockRoot host-stable + non-overridable (ignores
RELAY_PROVIDER_WORKLOAD_LOCK_DIR except under RELAY_WORKLOAD_TEST_MODE).
- stateless -> concurrencyKey = provider.route; limit = min(declared, envCap)
cap-only; lockRoot host-stable, env-overridable.
- unknown category / unresolvable shared_state identity / non-directory =>
throw (fail closed, no provider-name fallback).
Lock root is versioned (.../relay/locks/v2) so the provider->identity key
migration cannot collide with stale locks. grok-web is classified shared_state
(D2). Facts seeded conservatively at limit 1 (Task 7 opts DeepSeek/GLM to 4).
provider-route-policy tests: 48 pass, 0 fail, 0 skip. lint:sync exit 0.
WIP: full suite still red until Task 6 wires consumers to the resolver.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k 6)
Convert all five source-bearing review consumers (api-reviewer, claude/
gemini/kimi companions, grok-web-reviewer) from the single-flight workload
lock to the bounded capability-driven counting-semaphore admission path
(resolveConcurrencyAdmission -> acquireProviderWorkloadLease), with a
fail-loud null-lease invariant on every consumer.
Close the §8/§9 holder-identity disclosure at the class level — at both the
producer (review-workload.mjs blockResult/inspectSlot, counts-only
{active_count, limit}) and the persistence normalizer (job-record.mjs
normalizeProviderWorkloadDiagnostic x3, {reason, capacity}); no holder
job_id/provider/pid/host/cwd/lock_file is ever returned or persisted.
grok-web identity (GW-1/2/3): mode keys only on an explicit operator
endpoint (not an incidental admin key or a dead probe arg); explicit-endpoint
identity uses a transparent, deterministic admin=default marker when no
operator GROK2API_ADMIN_KEY is supplied — never a crash on the local-tunnel
happy path, never a fabricated identity, while real operator keys hash
distinctly so distinct accounts never silently merge.
gemini identity keys on GEMINI_CONFIG_DIR (the var the spawned CLI honors and
relay forwards), matching the CLAUDE_CONFIG_DIR pattern.
Tests: behavioral GW-1/2/3 + admission-identity unit tests, counts-only
job-record goldens, §9 no-leak smoke goldens; relay artifacts regenerated.
Full suite 2624 tests / 2614 pass / 0 fail / 10 skip; lint:sync --check clean.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add cross-process invariant tests for the bounded counting semaphore that in-process tests cannot prove: at most `limit` holders co-hold (peak ≤ limit from a timestamped cross-process event stream), a SIGKILLed holder frees exactly one slot (dead reclaimed, live sibling preserved), and reboot-id reclaim (an unverifiable slot with a STALE boot id is reclaimed; with the CURRENT boot id it fails closed). The reboot-id branch is reachable only via a `capture_error` (unverifiable) holder — an invalid pid classifies as `dead` and would silently no-op the test. Thread an optional `capture` seam through acquireProviderWorkloadLease → inspectSlot → classifyHolder (undefined → default capturePidInfo, so zero production behavior change) and inject a capture_error-throwing function. The negative case (CURRENT boot ⇒ blocked) fails unless the seam is wired, so the test self-validates that it exercises `unverifiable`, not `dead`. Synced to all 5 plugin copies + relay artifacts; lint:sync --check clean; existing review-workload suite 16/16 unchanged; full suite 2630 pass / 0 fail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cks (Task 5) Regression-pin §7 at HEAD via the public evaluateSourcePacketPolicy entrypoint (the composed path, not private helpers): - An admission-blocked retry never requires resend: sweep every PRE_TARGET_NOT_SENT_ERROR_CODES (incl. provider_workload_blocked) with source_content_transmission:"not_sent" and status:"failed" — the sent-fact must win first, so no resend gate fires. - A genuinely-sent post-transmission failure STILL gates (no over-relaxation): representative SOURCE_SEND_BLOCKING_FAILURES under both sent and may_be_sent, plus the generic status:"failed" path. - Record disagreement (source_sent:true + not_sent) gates conservatively and never silently auto-sends. Tests-only; provider-route-policy suite 52/52; full suite 2630 pass / 0 fail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k 7) DeepSeek and GLM direct_api routes are stateless (pure fetch, no shared local state), so raise their CONCURRENCY_FACTS limit from the conservative single-flight 1 to the D2 default of 4. Their per-route env caps (RELAY_DEEPSEEK_CONCURRENCY_LIMIT / RELAY_GLM_CONCURRENCY_LIMIT) can only LOWER the bound via Math.min, never raise it. custom direct_api stays at limit 1 — a user-defined endpoint's capacity is unknown until proven. Synced to all 5 plugin copies + 3 relay artifacts (lint:sync --check clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a declarative concurrency_budget section to the provider parity table mirroring CONCURRENCY_FACTS exactly (8 rows: shared_state limit 1 for claude/gemini/kimi/grok subscription + grok-web subscription_web; stateless limit 4 for deepseek/glm direct_api; limit 1 for custom direct_api), plus a "concurrency budget" policy area documenting the bounded counting-semaphore design. Schema: add concurrency_budget to required + properties (array of the new $defs.concurrency_fact). docs-contracts test derives the expected rows from CONCURRENCY_FACTS and deep-equals against the table so a future fact change cannot silently drift the doc, with headline spot-pins. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… hardening) Adversarial-review findings on the #234 counting semaphore: F3 — the gate owner was not a first-class holder: its payload recorded only pid/hostname/token, so classifyHolder() skipped the starttime/argv0 PID-reuse checks (a reused pid read as "alive") and the gate had no boot_id, so an unverifiable (capture_error) crashed owner could never be reclaimed — a regression from main's process.kill liveness. Gate owner now records starttime/argv0/boot_id and reclaims with the SAME dead / unverifiable-stale-boot semantics as slots, with the capture seam threaded through. Foreign-host handling is now consistent with slots (occupied, not wall-clock reclaimed). F4 — currentBootId's constant `unknown-<hostname>` fallback could never prove a reboot, so an unverifiable slot recorded with it deadlocked. Safe fix only (NO per-process nonce, which would reclaim live holders): emit a one-time warning and treat `unknown-`-prefixed boot ids as operator-cleanup-only in shouldReclaimUnverifiable (fail-closed). #2 — dedup slot.index when counting so the legacy lock and slot-0 can't double-count the same logical index. Tests: gate PID-reuse + stale-boot reclaim, cap-LOWERING over-admission across high-index slots (the core security invariant), legacy+slot-0 coexistence, foreign-host accounting, invalid-limit fail-closed. Synced to all 8 packaged/relay copies (lint:sync --check clean). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… 5 consumers
F7 — the counts-only strip (drop holder identity, keep {reason, capacity}) was
single-source only for claude/gemini/kimi (via job-record's
normalizeProviderWorkloadDiagnostic). api-reviewer and grok-web forwarded
diagnostics.provider_workload VERBATIM into the persisted+printed record,
bypassing the boundary. No live leak today (the producer emits counts only), but
any future producer attaching a holder would leak through 2 of 5 consumers — the
exact composed-path disclosure class to prevent.
Extract sanitizeProviderWorkloadDiagnostic into the shared privacy-redaction lib
(single source of truth); job-record now delegates to it (inline logic removed,
DRY) and api-reviewer + grok-web route through it instead of forwarding verbatim.
Cross-consumer negative test injects a holder {pid,host,cwd,lock_file,token,...}
and asserts all 5 consumers persist only {reason, capacity} with no identity.
Also: make api-reviewer's CLI entry guard realpath-robust. A prior refactor added
`resolve(argv[1]) === fileURLToPath(import.meta.url)`, which mismatches when the
script is spawned via a symlinked path (macOS /tmp -> /private/tmp) — runCli()
never fired, emitting empty stdout and failing every spawning test. isDirectCliEntry()
now compares canonical real paths; the dispatcher test asserts the guard property
robustly. Synced to all packaged/relay copies (lint:sync --check clean).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… hardening) F9 — pin that a malformed/zero RELAY_*_CONCURRENCY_LIMIT env cap is ignored and the declared stateless limit (4) stands (never-raise invariant intact); behavior unchanged, contract now guarded. T13 — prove the shared_state concurrency key is dev:inode identity, not a path string (two symlink-equivalent dirs resolve to the same key). T14 — actually validate docs/provider-parity-table.json concurrency_budget rows against the concurrency_fact JSON-schema $def (additionalProperties:false, category enum, limit>=1), which previously existed but was unenforced. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fail-on-revert validation found the F3 reclaim tests (which pre-seed a gate owner with starttime/argv0/boot_id) pass even when the production WRITE drops those fields — they exercise the read/classify path, not the write. Add a write-side assertion that acquireProviderWorkloadGate's owner payload records starttime/argv0/boot_id, so removing them (which would let a reused pid read "alive" and defeat stale-boot reclaim) now fails a test. Verified green->red on revert. Consistent with the existing write-path source assertion (atomic publish). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…test gaps Self-review of the remediation (a64cfc1..f2b56a8) caught a HIGH-severity over-admission I had introduced: the slot scan deduped occupied files by index, so the legacy lock (k.json, index 0) and slot-0.json — DISTINCT holders — collapsed to one count. At limit>1 this under-counted active leases and admitted past the limit. Reproduced: two distinct live holders at limit 2 returned ok:true and leased a third. Class fix: admission counts every occupied file per-holder; the index-keyed occupiedIndices set is retained ONLY for picking a free slot index, never for counting. Synced to all 8 copies. Test gaps closed: - review-workload: distinct index-0 holders (legacy lock + slot-0) both count toward the limit (green->red on dedup re-add). - review-workload: stale foreign-host gate owner is never reclaimed (wall-clock reclaim removed; fail-closed). - process-identity: boot-id fallback is the constant unknown-<hostname> sentinel + operator warning (rejects prefix-drop and per-process nonce). - smoke: removed vacuous holder===undefined assertion; the real counts-only disclosure guard lives in the job-record cross-consumer test. Full suite: 2648 pass / 0 fail / 10 skip (darwin ps gates). lint:sync clean. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request replaces the global single-flight provider workload lock with a bounded, capability-driven, file-based counting semaphore, allowing safe concurrent execution of source-bearing relays. It introduces a shared liveness and boot-id tracking module (process-identity.mjs) to handle process reclaim, defines route-specific concurrency limits, and updates all five companion plugins to enforce these budgets. A critical issue was identified in api-reviewer.mjs where fileURLToPath is used in isDirectCliEntry() without being imported from node:url, which will cause a runtime ReferenceError.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…ne + Sonar CPD) Task 1 (98b6591) introduced the synced process-identity.mjs lib but only half-registered it in the CI guards, surfacing as two red checks on PR #242 — both invisible locally because the branch had never run CI before: - Coverage gate: Task 1 re-baselined the new process-identity.mjs copies (at 0%) but left kimi/identity.mjs pinned at its pre-move 92.03%. Moving capturePidInfo into the shared lib shrank kimi/identity.mjs to a 59.32% re-export shim, tripping the baseline-regression check. The capability did NOT lose coverage — it stays tested via process-identity.test.mjs (canonical) and the claude/gemini inline identity.mjs copies (92.03%, target-enforced >=85%). Corrected the stale entry to the true 59.32/80/60. - SonarCloud: process-identity.mjs was the only #234 synced lib missing from sonar.cpd.exclusions, so its 5 analyzed copies counted as 18% duplicated new code (limit 3%). Added the 6 entries matching every sibling synced lib. Class fix: added the 6 process-identity.mjs paths to the enforced CPD list in ci-workflow.test.mjs, so a future synced lib left out of the Sonar exclusions fails a unit test instead of silently failing SonarCloud on a PR. Verified: coverage gate exit 0, 0 regressions, target met (85.00%); ci-workflow Sonar-CPD test passes; full unit suite 2044 pass / 0 fail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…xport from process-identity.mjs
claude/gemini identity.mjs were 242-line byte-identical copies carrying
inline capturePidInfo/captureLinux/captureDarwin — the same code already
single-sourced in process-identity.mjs and consumed via re-export by kimi.
Converts both to the kimi re-export shim so all three plugins now share one
capturePidInfo implementation. Export surface (newJobId/capturePidInfo/
attachPidCapture/verifyPidInfo) unchanged.
Class fix for the drift that hid Task 1's divergence: the byte-identical
guard in plugin-copies-in-sync.test.mjs only covered claude+gemini, so when
Task 1 moved capturePidInfo out of kimi alone the copies silently diverged.
Moves identity.mjs from CLAUDE_GEMINI_VERBATIM_FILES to VERBATIM_FILES so
the guard now asserts byte-identity across claude/gemini/kimi.
- plugins/{claude,gemini}/scripts/lib/identity.mjs -> re-export shim
- relay/relay-gemini/scripts/lib/identity.mjs -> rebuilt artifact
- tests/unit/plugin-copies-in-sync.test.mjs -> widen drift guard to kimi
- scripts/ci/coverage-baseline.json -> ratchet all 3 identity.mjs to
measured 96.61/86.67/100 (now target-enforced >=85), replacing the
loose post-move values (claude 48.45, gemini 25.77, kimi 59.32)
Verified: full suite incl smoke 2648 tests 0 fail; coverage gate target
met 85.00% 0 regressions; drift guard byte-identical across all 3.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…essage channel, identity single-source Kimi + GPT/Codex adversarial review of PR #242. Fixes the confirmed security/correctness/completeness findings; low nits deferred to #239 (items 8-10). Fail-closed: - process-identity.mjs: unsupported platform throws capture_error (was process_gone) so classifyHolder returns unverifiable (occupied/fail-closed), never dead/reclaimable. 'Cannot inspect' is not 'proven gone'; a process_gone prefix made every holder reclaimable on an unsupported platform (fail-open over-admission + contract violation). [GPT #2] - review-workload.mjs tryReclaimProviderWorkloadGate: an unreadable (EACCES) or corrupt-JSON gate owner is now timeout-gated like a missing owner, matching inspectSlot's fail-closed treatment of an unparseable slot. Was reclaimed immediately, evicting a live holder whose owner.json briefly could not be read. [Kimi B.1] §8 disclosure channel (single class: error_message bypassed the sanitizer): - blockResult message is now identity-free; the shared_state concurrency key is an account-identity hash whose slug leaked into error_message. [Kimi C.1] - concurrencyAdmissionBlockedExecution extracted to one sanitized export in review-workload.mjs; all 5 source-bearing consumers route through it. Drops the raw error.message (a shared_state ENOENT carries an absolute config-dir path) and removes the 5-way duplication. [Kimi C.2] Single-source completeness: - New canonical scripts/lib/identity.mjs + scripts/ci/sync-identity.mjs, wired into lint:sync, sonar.cpd.exclusions, and a ci-workflow guard. Completes the dedup (the shim was byte-identical-guarded but had no canonical/sync). [Kimi F.1] Test robustness/quality: - review-workload-multiprocess.test.mjs: ps-preflight skip on the two real- liveness acquire tests; they hard-failed in ps-denied sandboxes instead of skipping. Runs on Linux CI / darwin-with-ps. [GPT #1] - Behavioral gate-owner-write test reading the real on-disk owner.json via the now-exported acquireProviderWorkloadGate, replacing a source-regex guard. [Lens-G] - New tests for every fix; lint:sync clean. Verified: full suite incl smoke 2653 tests / 0 fail / 10 skip; coverage gate target met 85.00% / 0 regressions. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR #242 CI flaked: smoke (claude) failed with ENOTEMPTY rmdir during cleanup of the 'cancel during OAuth preflight' test. Root cause: the test waited only for the durable JobRecord to reach a terminal status, then removed the temp data dir while the detached background worker was still flushing post-record tail writes (lease release, sidecar removal, lifecycle jsonl) -- rmSync raced a live writer. The Task-6 maxRetries 5->20 bump had treated the symptom, not the cause. Class fix (not the instance): a deterministic worker-process-exit barrier (waitForProcessExit) before cleanup, on every launching background test. - claude-companion.smoke: barrier on the failing test + 7 other launching-background tests; replace two setTimeout(250) 'settle' band-aids with the barrier. - gemini-companion.smoke: add the missing waitForProcessExit helper and the barrier on its 8 launching-background tests. - kimi-companion.smoke / invariants.test.mjs already used the barrier; preflight-reject tests (no detached worker) are untouched. Verified: claude smoke 139/139 (was 138 pass / 1 fail), gemini smoke 100/100, lint:sync clean, full suite 2643/0, coverage 85.00% met. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Brings #242 (bounded capability-driven concurrency) current with main (9e1298e, post-#233). Resolved 7 conflicts: - .sonarcloud.properties / package.json: union (main's #233 CPD entry + symlink lint check; #242's process-identity/identity sync checks + CPD entries). - grok / api-reviewers smoke: took main's env-helper refactor (grokSmokeEnv / apiReviewersSmokeEnv) and injected #242's load-bearing RELAY_WORKLOAD_TEST_MODE into the helpers (the merged review-workload.mjs throws without it). - gemini smoke: kept #242's smokeEnv (with test mode) + heldGeminiWorkloadLease, and main's smokeEnv(dataDir,{...}) call form + GEMINI_BINARY override. - job-record.test.mjs: import union (api-reviewer + grok + agy record builders). - coverage-baseline.json: per-metric min for kimi/identity.mjs; added AGY process-identity.mjs entry (0/0/0, matching the other provider copies). Propagated #242's new synced libs (process-identity/identity) to the AGY copy and rebuilt relay outputs. Verified: npm run lint exit 0; full suite 2964 pass / 0 fail; coverage gate with CI env (CODEX_PLUGIN_FULL_TESTS=1 CODEX_PLUGIN_SKIP_SMOKE=1 COVERAGE_ENFORCE_TARGET=1) target met 85.00%, no regressions.
…nd to gemini (#250) The first cut of #250 used a finally-first barrier with .catch() on the cancel tests -- a variant that diverged from the proven #234/#242 pattern and swallowed the 5s fail-loud timeout. Rework to the exact #242 form: end-of-try placement, plain await (fail-loud), a 30s *_SMOKE_POLL_TIMEOUT_MS budget on the cancel/status-poll tests, and the redundant setTimeout(250) cushions replaced rather than duplicated. Extend the same barrier to the gemini companion smoke suite, which shares the identical detached-worker teardown race. The resulting barriers are byte-identical to the companion-smoke barriers currently carried by PR #242 (verified: git diff feat/234 shows zero changed waitForProcessExit lines), so #242 dedups them automatically on its next main merge. This makes #250 the single source for the companion-smoke barrier class and removes the duplicative barrier work from the concurrent-relays PR. Class coverage: claude (9 tests) and gemini (helper + 8 tests) fixed here; kimi already barriered on main; agy is foreground-only (rejects --background, no detached worker, no race); identity-resume has no background launches. Test-only. Validated: lint (incl. sync checks), claude 138/138, gemini 100/100, sequential stress 10x claude + 5x gemini = 0 ENOTEMPTY (flake reproduced ~50%/run unfixed). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve review-workload.mjs conflict by composing #242's counting-semaphore gate with #254's gate-race fix (now in main): - broaden the gate catch to swallow EEXIST OR ENOENT (fail-loud on all else) - on structural mkdir-ENOENT (lock root vanished), recreate dirname(gateDir) — the caller-provided lockRoot in #242's design — NOT main's env-derived lockRoot(env), which is undefined in #242 (a silently-swallowed ReferenceError that would hot-loop to the deadline) - keep #242's capture seam on the reclaim call (gateDir, env, capture) Port #254's two gate-race tests to #242's lease API (concurrencyKey/limit/ lockRoot + RELAY_WORKLOAD_TEST_MODE), rewrite the shim's ./process-identity.mjs import to the real module, and add existsSync/readdirSync to the fs shims. Both fail-on-revert proven against the two fixes above. 10 byte-identical copies regenerated via sync; lint green. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge The #242 bring-current (1af8ef0) composed #254's forced deadline->recreate->reclaim gate ordering with #242's two reclaim-success tests that pinned GATE_TIMEOUT_MS="1". With a 1ms budget the deadline (now+1ms) can fire before reclaim is reached, returning {ok:false} -> ~5% flake (reproduced 3/60). Production default is 5000ms, so production never misfires; purely a test-budget artifact. Test-only fix: single-source RECLAIM_GATE_TIMEOUT_MS="250" const (documented, guarded against being lowered back to "1") for the two reclaim-success tests. Production review-workload.mjs unchanged. The age-guard (gateAgeMs<=gateTimeoutMs) is off the reclaim-success path -- both tests write a reclaimable owner so gateOwnerActive returns false and reclaim runs before the guard -- so the bump cannot weaken reclaim correctness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.
|



Summary
Replaces the single-flight per-provider workload lock with a bounded, capability-driven counting semaphore, so independent relay reviews can run concurrently up to a per-provider limit instead of serializing globally.
limit=1is byte-for-byte equivalent to today's behavior.owner.jsontoken + hardlink lease) into per-key slot files<key>.slot-N.json. The engine holds zero provider knowledge — it consumes only a resolved{ concurrencyKey, limit, lockRoot }.shared_stateroutes (claude/gemini/kimi/grok subscription + grok-web) stay atlimit 1, keyed on resolved-dir inode identity (not provider name);direct_apistateless routes (DeepSeek/GLM) default tolimit 4(env-cap-only viaMath.min);customstayslimit 1.process-identity.mjs({pid, starttime, argv0}+ clock-independent boot-session id).capture_error ≠ death— fail closed; reclaim only on confirmed-dead pid, pid-reuse mismatch, or a stale boot id.{reason, capacity}; holder identity (pid/host/cwd/job_id/lock_file/argv0) goes to debug logs only. Centralized through one sharedsanitizeProviderWorkloadDiagnosticso all 5 consumers strip identically.provider_workload_blocked→not_sent→ no false resend gate; pinned with tests.Decisions
External panel (GPT-5.2-pro / Claude / Kimi / GLM) unanimous: shared-state keys on realpath inode identity + host-stable lock root (D1); stateless default
limit 4, env cap-only (D2); single-host local-disk, cross-host removed (D3). Global aggregate cap deliberately deferred with a compose seam — tracked in #239.Verification
psplatform gates; they run on CI Linux).NPMTEST_EXIT=0.sync-*+build-relay-plugin+lint:sync --check(20 checks) clean. Canonical libs synced to all 8 plugin/relay copies.slot-0into one count, admitting past the limit atlimit>1). Reproduced, fixed (count per-holder), and guarded with a fail-on-revert test (f830fe9).Follow-ups
Non-blocking hardening + deferred scope consolidated in #239 (global aggregate cap, grok-web concurrency proof before raising its limit, endpoint-string normalization, custom-limit raise, cross-consumer disclosure regression, and the fragile direct-CLI-entry guard shared by the grok scripts).
Closes #234.
🤖 Generated with Claude Code