Skip to content

#234: bounded capability-driven concurrency for relay reviews#242

Merged
seungpyoson merged 22 commits into
mainfrom
feat/234-concurrent-relays
Jun 26, 2026
Merged

#234: bounded capability-driven concurrency for relay reviews#242
seungpyoson merged 22 commits into
mainfrom
feat/234-concurrent-relays

Conversation

@seungpyoson

Copy link
Copy Markdown
Owner

Summary

Replaces the single-flight per-provider workload lock with a bounded, capability-driven counting semaphore, so independent relay reviews can run concurrently up to a per-provider limit instead of serializing globally. limit=1 is byte-for-byte equivalent to today's behavior.

  • Generalizes the existing gate + lease (mkdir-mutex + owner.json token + hardlink lease) into per-key slot files <key>.slot-N.json. The engine holds zero provider knowledge — it consumes only a resolved { concurrencyKey, limit, lockRoot }.
  • Capability-driven limits: shared_state routes (claude/gemini/kimi/grok subscription + grok-web) stay at limit 1, keyed on resolved-dir inode identity (not provider name); direct_api stateless routes (DeepSeek/GLM) default to limit 4 (env-cap-only via Math.min); custom stays limit 1.
  • Liveness-based reclaim via the new shared process-identity.mjs ({pid, starttime, argv0} + clock-independent boot-session id). capture_error ≠ death — fail closed; reclaim only on confirmed-dead pid, pid-reuse mismatch, or a stale boot id.
  • §8 info-disclosure: external/persisted diagnostics are counts-only {reason, capacity}; holder identity (pid/host/cwd/job_id/lock_file/argv0) goes to debug logs only. Centralized through one shared sanitizeProviderWorkloadDiagnostic so all 5 consumers strip identically.
  • Resend-guard (P0: Support multiple concurrent source-bearing relays per provider #234 item 3): provider_workload_blockednot_sent → no false resend gate; pinned with tests.

Decisions

External panel (GPT-5.2-pro / Claude / Kimi / GLM) unanimous: shared-state keys on realpath inode identity + host-stable lock root (D1); stateless default limit 4, env cap-only (D2); single-host local-disk, cross-host removed (D3). Global aggregate cap deliberately deferred with a compose seam — tracked in #239.

Verification

  • Full suite: 2648 pass / 0 fail / 10 skip (the 10 are the by-design macOS ps platform gates; they run on CI Linux). NPMTEST_EXIT=0.
  • sync-* + build-relay-plugin + lint:sync --check (20 checks) clean. Canonical libs synced to all 8 plugin/relay copies.
  • Two internal adversarial-review rounds. The second round reviewed my own remediation and caught a HIGH-severity over-admission I had introduced (an index-dedup that collapsed the legacy lock and slot-0 into one count, admitting past the limit at limit>1). Reproduced, fixed (count per-holder), and guarded with a fail-on-revert test (f830fe9).
  • Fail-on-revert validated for the over-admission, boot-id-reclaim safety, §8 disclosure strip, and the gate-owner write-payload guards.

Follow-ups

Non-blocking hardening + deferred scope consolidated in #239 (global aggregate cap, grok-web concurrency proof before raising its limit, endpoint-string normalization, custom-limit raise, cross-consumer disclosure regression, and the fragile direct-CLI-entry guard shared by the grok scripts).

Closes #234.

🤖 Generated with Claude Code

Test User and others added 14 commits June 22, 2026 03:43
Promote capturePidInfo({pid,starttime,argv0}) out of kimi's identity.mjs
into scripts/lib/process-identity.mjs so the #234 counting-semaphore in
review-workload.mjs can import one canonical (pid,starttime,argv0) liveness
check instead of a weaker kill(pid,0). Adds sync-process-identity.mjs, a
lint:sync chain entry, and a plugin-copies-in-sync assertion so the lib is
packaged into all 5 plugins before any consumer imports it. kimi/identity.mjs
now re-exports from the shared lib. Includes the #234 implementation plan.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
currentBootId's darwin path used kern.boottime (= wall - uptime), which an
NTP/clock step shifts WITHOUT a reboot. The boot id is the sole safe trigger
for clearing an unverifiable (capture_error) slot (design §6, invariant #4):
'boot id differs => reboot proven => holder gone'. A clock-derived id breaks
that guarantee, so a clock step would falsely prove a reboot and reclaim a
live holder's slot. Switch darwin to kern.bootsessionuuid (regenerated only
at boot, clock-independent), keeping kern.boottime as the ancient-macOS
fallback. New darwin test asserts the source is the uuid (no 'sec =').

Also corrects the #234 plan from the same depth-check: Task 2's reclaim scan
must use a 4-state classifyHolder (alive|dead|unverifiable|foreign), not the
boolean holderActive, which collapses alive/unverifiable/foreign and so
cannot tell which occupied slots are boot-id-reclaimable.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generalize the single-flight provider lock into a bounded, capability-driven
counting semaphore (spec §4/§6/§8):
- acquireProviderWorkloadLease now takes {concurrencyKey, limit, lockRoot, ...};
  source-bearing calls with no concurrencyKey throw (no provider-name fallback).
- All-slot accounting under the per-key gate; lowest-free-index claim.
- 4-state classifyHolder (alive|dead|unverifiable|foreign) drives reclaim:
  dead reclaimed always; unverifiable reclaimed only on a stale boot id;
  foreign/alive occupied. invalid_pid => dead (preserves pre-#234 pidAlive
  semantics; capture_error => unverifiable, fail-closed). classifyHolder gains
  a capture-injection seam so the unverifiable branch is testable on any host.
- lockRoot is obeyed verbatim (no internal RELAY_PROVIDER_WORKLOAD_LOCK_DIR
  re-read); limit must be a positive int; capacity is counts-only.
- Legacy <key>.json read as a slot-0-equivalent for one-time migration.

Targeted tests green: review-workload + process-identity (22 pass, 0 fail,
darwin-sandbox skips only). Re-verified on a host where the acquire tests
actually run, which surfaced and fixed a classifier regression codex could not
see under its sandbox (all acquire tests skipped there).

WIP: the full suite is intentionally red until Task 6 converts the 5
source-bearing consumers to the new admission context (spec §11 — the plumbing
lands atomically; an unconverted route has no key and must fail closed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a declarative CONCURRENCY_FACTS table and resolveConcurrencyAdmission()
to provider-route-policy.mjs (spec §5/§5.1/§5.2). The resolver turns a
route's capability fact into the {concurrencyKey, limit, lockRoot} admission
context that Task 2's acquireProviderWorkloadLease consumes:
- shared_state -> concurrencyKey = sha256(st_dev:st_ino of the resolved dir),
  NO provider prefix (same dir => same key); limit forced to 1 (declaredLimit>1
  throws); lockRoot host-stable + non-overridable (ignores
  RELAY_PROVIDER_WORKLOAD_LOCK_DIR except under RELAY_WORKLOAD_TEST_MODE).
- stateless -> concurrencyKey = provider.route; limit = min(declared, envCap)
  cap-only; lockRoot host-stable, env-overridable.
- unknown category / unresolvable shared_state identity / non-directory =>
  throw (fail closed, no provider-name fallback).
Lock root is versioned (.../relay/locks/v2) so the provider->identity key
migration cannot collide with stale locks. grok-web is classified shared_state
(D2). Facts seeded conservatively at limit 1 (Task 7 opts DeepSeek/GLM to 4).

provider-route-policy tests: 48 pass, 0 fail, 0 skip. lint:sync exit 0.

WIP: full suite still red until Task 6 wires consumers to the resolver.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k 6)

Convert all five source-bearing review consumers (api-reviewer, claude/
gemini/kimi companions, grok-web-reviewer) from the single-flight workload
lock to the bounded capability-driven counting-semaphore admission path
(resolveConcurrencyAdmission -> acquireProviderWorkloadLease), with a
fail-loud null-lease invariant on every consumer.

Close the §8/§9 holder-identity disclosure at the class level — at both the
producer (review-workload.mjs blockResult/inspectSlot, counts-only
{active_count, limit}) and the persistence normalizer (job-record.mjs
normalizeProviderWorkloadDiagnostic x3, {reason, capacity}); no holder
job_id/provider/pid/host/cwd/lock_file is ever returned or persisted.

grok-web identity (GW-1/2/3): mode keys only on an explicit operator
endpoint (not an incidental admin key or a dead probe arg); explicit-endpoint
identity uses a transparent, deterministic admin=default marker when no
operator GROK2API_ADMIN_KEY is supplied — never a crash on the local-tunnel
happy path, never a fabricated identity, while real operator keys hash
distinctly so distinct accounts never silently merge.

gemini identity keys on GEMINI_CONFIG_DIR (the var the spawned CLI honors and
relay forwards), matching the CLAUDE_CONFIG_DIR pattern.

Tests: behavioral GW-1/2/3 + admission-identity unit tests, counts-only
job-record goldens, §9 no-leak smoke goldens; relay artifacts regenerated.
Full suite 2624 tests / 2614 pass / 0 fail / 10 skip; lint:sync --check clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add cross-process invariant tests for the bounded counting semaphore that
in-process tests cannot prove: at most `limit` holders co-hold (peak ≤ limit
from a timestamped cross-process event stream), a SIGKILLed holder frees
exactly one slot (dead reclaimed, live sibling preserved), and reboot-id
reclaim (an unverifiable slot with a STALE boot id is reclaimed; with the
CURRENT boot id it fails closed).

The reboot-id branch is reachable only via a `capture_error` (unverifiable)
holder — an invalid pid classifies as `dead` and would silently no-op the
test. Thread an optional `capture` seam through acquireProviderWorkloadLease →
inspectSlot → classifyHolder (undefined → default capturePidInfo, so zero
production behavior change) and inject a capture_error-throwing function. The
negative case (CURRENT boot ⇒ blocked) fails unless the seam is wired, so the
test self-validates that it exercises `unverifiable`, not `dead`.

Synced to all 5 plugin copies + relay artifacts; lint:sync --check clean;
existing review-workload suite 16/16 unchanged; full suite 2630 pass / 0 fail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…cks (Task 5)

Regression-pin §7 at HEAD via the public evaluateSourcePacketPolicy entrypoint
(the composed path, not private helpers):
- An admission-blocked retry never requires resend: sweep every
  PRE_TARGET_NOT_SENT_ERROR_CODES (incl. provider_workload_blocked) with
  source_content_transmission:"not_sent" and status:"failed" — the sent-fact
  must win first, so no resend gate fires.
- A genuinely-sent post-transmission failure STILL gates (no over-relaxation):
  representative SOURCE_SEND_BLOCKING_FAILURES under both sent and may_be_sent,
  plus the generic status:"failed" path.
- Record disagreement (source_sent:true + not_sent) gates conservatively and
  never silently auto-sends.

Tests-only; provider-route-policy suite 52/52; full suite 2630 pass / 0 fail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…k 7)

DeepSeek and GLM direct_api routes are stateless (pure fetch, no shared
local state), so raise their CONCURRENCY_FACTS limit from the conservative
single-flight 1 to the D2 default of 4. Their per-route env caps
(RELAY_DEEPSEEK_CONCURRENCY_LIMIT / RELAY_GLM_CONCURRENCY_LIMIT) can only
LOWER the bound via Math.min, never raise it. custom direct_api stays at
limit 1 — a user-defined endpoint's capacity is unknown until proven.

Synced to all 5 plugin copies + 3 relay artifacts (lint:sync --check clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a declarative concurrency_budget section to the provider parity table
mirroring CONCURRENCY_FACTS exactly (8 rows: shared_state limit 1 for
claude/gemini/kimi/grok subscription + grok-web subscription_web; stateless
limit 4 for deepseek/glm direct_api; limit 1 for custom direct_api), plus a
"concurrency budget" policy area documenting the bounded counting-semaphore
design.

Schema: add concurrency_budget to required + properties (array of the new
$defs.concurrency_fact). docs-contracts test derives the expected rows from
CONCURRENCY_FACTS and deep-equals against the table so a future fact change
cannot silently drift the doc, with headline spot-pins.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… hardening)

Adversarial-review findings on the #234 counting semaphore:

F3 — the gate owner was not a first-class holder: its payload recorded only
pid/hostname/token, so classifyHolder() skipped the starttime/argv0 PID-reuse
checks (a reused pid read as "alive") and the gate had no boot_id, so an
unverifiable (capture_error) crashed owner could never be reclaimed — a
regression from main's process.kill liveness. Gate owner now records
starttime/argv0/boot_id and reclaims with the SAME dead / unverifiable-stale-boot
semantics as slots, with the capture seam threaded through. Foreign-host handling
is now consistent with slots (occupied, not wall-clock reclaimed).

F4 — currentBootId's constant `unknown-<hostname>` fallback could never prove a
reboot, so an unverifiable slot recorded with it deadlocked. Safe fix only (NO
per-process nonce, which would reclaim live holders): emit a one-time warning and
treat `unknown-`-prefixed boot ids as operator-cleanup-only in
shouldReclaimUnverifiable (fail-closed).

#2 — dedup slot.index when counting so the legacy lock and slot-0 can't
double-count the same logical index.

Tests: gate PID-reuse + stale-boot reclaim, cap-LOWERING over-admission across
high-index slots (the core security invariant), legacy+slot-0 coexistence,
foreign-host accounting, invalid-limit fail-closed. Synced to all 8 packaged/relay
copies (lint:sync --check clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… 5 consumers

F7 — the counts-only strip (drop holder identity, keep {reason, capacity}) was
single-source only for claude/gemini/kimi (via job-record's
normalizeProviderWorkloadDiagnostic). api-reviewer and grok-web forwarded
diagnostics.provider_workload VERBATIM into the persisted+printed record,
bypassing the boundary. No live leak today (the producer emits counts only), but
any future producer attaching a holder would leak through 2 of 5 consumers — the
exact composed-path disclosure class to prevent.

Extract sanitizeProviderWorkloadDiagnostic into the shared privacy-redaction lib
(single source of truth); job-record now delegates to it (inline logic removed,
DRY) and api-reviewer + grok-web route through it instead of forwarding verbatim.
Cross-consumer negative test injects a holder {pid,host,cwd,lock_file,token,...}
and asserts all 5 consumers persist only {reason, capacity} with no identity.

Also: make api-reviewer's CLI entry guard realpath-robust. A prior refactor added
`resolve(argv[1]) === fileURLToPath(import.meta.url)`, which mismatches when the
script is spawned via a symlinked path (macOS /tmp -> /private/tmp) — runCli()
never fired, emitting empty stdout and failing every spawning test. isDirectCliEntry()
now compares canonical real paths; the dispatcher test asserts the guard property
robustly. Synced to all packaged/relay copies (lint:sync --check clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… hardening)

F9 — pin that a malformed/zero RELAY_*_CONCURRENCY_LIMIT env cap is ignored and
the declared stateless limit (4) stands (never-raise invariant intact); behavior
unchanged, contract now guarded.

T13 — prove the shared_state concurrency key is dev:inode identity, not a path
string (two symlink-equivalent dirs resolve to the same key).

T14 — actually validate docs/provider-parity-table.json concurrency_budget rows
against the concurrency_fact JSON-schema $def (additionalProperties:false,
category enum, limit>=1), which previously existed but was unenforced.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fail-on-revert validation found the F3 reclaim tests (which pre-seed a gate owner
with starttime/argv0/boot_id) pass even when the production WRITE drops those
fields — they exercise the read/classify path, not the write. Add a write-side
assertion that acquireProviderWorkloadGate's owner payload records
starttime/argv0/boot_id, so removing them (which would let a reused pid read
"alive" and defeat stale-boot reclaim) now fails a test. Verified green->red on
revert. Consistent with the existing write-path source assertion (atomic publish).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…test gaps

Self-review of the remediation (a64cfc1..f2b56a8) caught a HIGH-severity
over-admission I had introduced: the slot scan deduped occupied files by
index, so the legacy lock (k.json, index 0) and slot-0.json — DISTINCT
holders — collapsed to one count. At limit>1 this under-counted active
leases and admitted past the limit. Reproduced: two distinct live holders
at limit 2 returned ok:true and leased a third.

Class fix: admission counts every occupied file per-holder; the
index-keyed occupiedIndices set is retained ONLY for picking a free slot
index, never for counting. Synced to all 8 copies.

Test gaps closed:
- review-workload: distinct index-0 holders (legacy lock + slot-0) both
  count toward the limit (green->red on dedup re-add).
- review-workload: stale foreign-host gate owner is never reclaimed
  (wall-clock reclaim removed; fail-closed).
- process-identity: boot-id fallback is the constant unknown-<hostname>
  sentinel + operator warning (rejects prefix-drop and per-process nonce).
- smoke: removed vacuous holder===undefined assertion; the real
  counts-only disclosure guard lives in the job-record cross-consumer test.

Full suite: 2648 pass / 0 fail / 10 skip (darwin ps gates). lint:sync clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request replaces the global single-flight provider workload lock with a bounded, capability-driven, file-based counting semaphore, allowing safe concurrent execution of source-bearing relays. It introduces a shared liveness and boot-id tracking module (process-identity.mjs) to handle process reclaim, defines route-specific concurrency limits, and updates all five companion plugins to enforce these budgets. A critical issue was identified in api-reviewer.mjs where fileURLToPath is used in isDirectCliEntry() without being imported from node:url, which will cause a runtime ReferenceError.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread plugins/api-reviewers/scripts/api-reviewer.mjs
Test User and others added 5 commits June 22, 2026 18:18
…ne + Sonar CPD)

Task 1 (98b6591) introduced the synced process-identity.mjs lib but only
half-registered it in the CI guards, surfacing as two red checks on PR #242 —
both invisible locally because the branch had never run CI before:

- Coverage gate: Task 1 re-baselined the new process-identity.mjs copies
  (at 0%) but left kimi/identity.mjs pinned at its pre-move 92.03%. Moving
  capturePidInfo into the shared lib shrank kimi/identity.mjs to a 59.32%
  re-export shim, tripping the baseline-regression check. The capability did
  NOT lose coverage — it stays tested via process-identity.test.mjs (canonical)
  and the claude/gemini inline identity.mjs copies (92.03%, target-enforced
  >=85%). Corrected the stale entry to the true 59.32/80/60.

- SonarCloud: process-identity.mjs was the only #234 synced lib missing from
  sonar.cpd.exclusions, so its 5 analyzed copies counted as 18% duplicated new
  code (limit 3%). Added the 6 entries matching every sibling synced lib.

Class fix: added the 6 process-identity.mjs paths to the enforced CPD list in
ci-workflow.test.mjs, so a future synced lib left out of the Sonar exclusions
fails a unit test instead of silently failing SonarCloud on a PR.

Verified: coverage gate exit 0, 0 regressions, target met (85.00%); ci-workflow
Sonar-CPD test passes; full unit suite 2044 pass / 0 fail.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…xport from process-identity.mjs

claude/gemini identity.mjs were 242-line byte-identical copies carrying
inline capturePidInfo/captureLinux/captureDarwin — the same code already
single-sourced in process-identity.mjs and consumed via re-export by kimi.
Converts both to the kimi re-export shim so all three plugins now share one
capturePidInfo implementation. Export surface (newJobId/capturePidInfo/
attachPidCapture/verifyPidInfo) unchanged.

Class fix for the drift that hid Task 1's divergence: the byte-identical
guard in plugin-copies-in-sync.test.mjs only covered claude+gemini, so when
Task 1 moved capturePidInfo out of kimi alone the copies silently diverged.
Moves identity.mjs from CLAUDE_GEMINI_VERBATIM_FILES to VERBATIM_FILES so
the guard now asserts byte-identity across claude/gemini/kimi.

- plugins/{claude,gemini}/scripts/lib/identity.mjs -> re-export shim
- relay/relay-gemini/scripts/lib/identity.mjs -> rebuilt artifact
- tests/unit/plugin-copies-in-sync.test.mjs -> widen drift guard to kimi
- scripts/ci/coverage-baseline.json -> ratchet all 3 identity.mjs to
  measured 96.61/86.67/100 (now target-enforced >=85), replacing the
  loose post-move values (claude 48.45, gemini 25.77, kimi 59.32)

Verified: full suite incl smoke 2648 tests 0 fail; coverage gate target
met 85.00% 0 regressions; drift guard byte-identical across all 3.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…essage channel, identity single-source

Kimi + GPT/Codex adversarial review of PR #242. Fixes the confirmed
security/correctness/completeness findings; low nits deferred to #239 (items 8-10).

Fail-closed:
- process-identity.mjs: unsupported platform throws capture_error (was
  process_gone) so classifyHolder returns unverifiable (occupied/fail-closed),
  never dead/reclaimable. 'Cannot inspect' is not 'proven gone'; a process_gone
  prefix made every holder reclaimable on an unsupported platform (fail-open
  over-admission + contract violation). [GPT #2]
- review-workload.mjs tryReclaimProviderWorkloadGate: an unreadable (EACCES) or
  corrupt-JSON gate owner is now timeout-gated like a missing owner, matching
  inspectSlot's fail-closed treatment of an unparseable slot. Was reclaimed
  immediately, evicting a live holder whose owner.json briefly could not be read. [Kimi B.1]

§8 disclosure channel (single class: error_message bypassed the sanitizer):
- blockResult message is now identity-free; the shared_state concurrency key is
  an account-identity hash whose slug leaked into error_message. [Kimi C.1]
- concurrencyAdmissionBlockedExecution extracted to one sanitized export in
  review-workload.mjs; all 5 source-bearing consumers route through it. Drops the
  raw error.message (a shared_state ENOENT carries an absolute config-dir path)
  and removes the 5-way duplication. [Kimi C.2]

Single-source completeness:
- New canonical scripts/lib/identity.mjs + scripts/ci/sync-identity.mjs, wired
  into lint:sync, sonar.cpd.exclusions, and a ci-workflow guard. Completes the
  dedup (the shim was byte-identical-guarded but had no canonical/sync). [Kimi F.1]

Test robustness/quality:
- review-workload-multiprocess.test.mjs: ps-preflight skip on the two real-
  liveness acquire tests; they hard-failed in ps-denied sandboxes instead of
  skipping. Runs on Linux CI / darwin-with-ps. [GPT #1]
- Behavioral gate-owner-write test reading the real on-disk owner.json via the
  now-exported acquireProviderWorkloadGate, replacing a source-regex guard. [Lens-G]
- New tests for every fix; lint:sync clean.

Verified: full suite incl smoke 2653 tests / 0 fail / 10 skip; coverage gate
target met 85.00% / 0 regressions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PR #242 CI flaked: smoke (claude) failed with ENOTEMPTY rmdir during
cleanup of the 'cancel during OAuth preflight' test. Root cause: the
test waited only for the durable JobRecord to reach a terminal status,
then removed the temp data dir while the detached background worker was
still flushing post-record tail writes (lease release, sidecar removal,
lifecycle jsonl) -- rmSync raced a live writer. The Task-6 maxRetries
5->20 bump had treated the symptom, not the cause.

Class fix (not the instance): a deterministic worker-process-exit
barrier (waitForProcessExit) before cleanup, on every launching
background test.

- claude-companion.smoke: barrier on the failing test + 7 other
  launching-background tests; replace two setTimeout(250) 'settle'
  band-aids with the barrier.
- gemini-companion.smoke: add the missing waitForProcessExit helper and
  the barrier on its 8 launching-background tests.
- kimi-companion.smoke / invariants.test.mjs already used the barrier;
  preflight-reject tests (no detached worker) are untouched.

Verified: claude smoke 139/139 (was 138 pass / 1 fail), gemini smoke
100/100, lint:sync clean, full suite 2643/0, coverage 85.00% met.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Brings #242 (bounded capability-driven concurrency) current with main (9e1298e,
post-#233). Resolved 7 conflicts:
- .sonarcloud.properties / package.json: union (main's #233 CPD entry + symlink
  lint check; #242's process-identity/identity sync checks + CPD entries).
- grok / api-reviewers smoke: took main's env-helper refactor (grokSmokeEnv /
  apiReviewersSmokeEnv) and injected #242's load-bearing RELAY_WORKLOAD_TEST_MODE
  into the helpers (the merged review-workload.mjs throws without it).
- gemini smoke: kept #242's smokeEnv (with test mode) + heldGeminiWorkloadLease,
  and main's smokeEnv(dataDir,{...}) call form + GEMINI_BINARY override.
- job-record.test.mjs: import union (api-reviewer + grok + agy record builders).
- coverage-baseline.json: per-metric min for kimi/identity.mjs; added AGY
  process-identity.mjs entry (0/0/0, matching the other provider copies).
Propagated #242's new synced libs (process-identity/identity) to the AGY copy and
rebuilt relay outputs.

Verified: npm run lint exit 0; full suite 2964 pass / 0 fail; coverage gate with
CI env (CODEX_PLUGIN_FULL_TESTS=1 CODEX_PLUGIN_SKIP_SMOKE=1 COVERAGE_ENFORCE_TARGET=1)
target met 85.00%, no regressions.
seungpyoson pushed a commit that referenced this pull request Jun 26, 2026
…nd to gemini (#250)

The first cut of #250 used a finally-first barrier with .catch() on the cancel
tests -- a variant that diverged from the proven #234/#242 pattern and swallowed
the 5s fail-loud timeout. Rework to the exact #242 form: end-of-try placement,
plain await (fail-loud), a 30s *_SMOKE_POLL_TIMEOUT_MS budget on the
cancel/status-poll tests, and the redundant setTimeout(250) cushions replaced
rather than duplicated. Extend the same barrier to the gemini companion smoke
suite, which shares the identical detached-worker teardown race.

The resulting barriers are byte-identical to the companion-smoke barriers
currently carried by PR #242 (verified: git diff feat/234 shows zero changed
waitForProcessExit lines), so #242 dedups them automatically on its next main
merge. This makes #250 the single source for the companion-smoke barrier class
and removes the duplicative barrier work from the concurrent-relays PR.

Class coverage: claude (9 tests) and gemini (helper + 8 tests) fixed here; kimi
already barriered on main; agy is foreground-only (rejects --background, no
detached worker, no race); identity-resume has no background launches.

Test-only. Validated: lint (incl. sync checks), claude 138/138, gemini 100/100,
sequential stress 10x claude + 5x gemini = 0 ENOTEMPTY (flake reproduced
~50%/run unfixed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Test User and others added 3 commits June 26, 2026 19:56
Resolve review-workload.mjs conflict by composing #242's counting-semaphore
gate with #254's gate-race fix (now in main):
- broaden the gate catch to swallow EEXIST OR ENOENT (fail-loud on all else)
- on structural mkdir-ENOENT (lock root vanished), recreate dirname(gateDir) —
  the caller-provided lockRoot in #242's design — NOT main's env-derived
  lockRoot(env), which is undefined in #242 (a silently-swallowed ReferenceError
  that would hot-loop to the deadline)
- keep #242's capture seam on the reclaim call (gateDir, env, capture)

Port #254's two gate-race tests to #242's lease API (concurrencyKey/limit/
lockRoot + RELAY_WORKLOAD_TEST_MODE), rewrite the shim's ./process-identity.mjs
import to the real module, and add existsSync/readdirSync to the fs shims.
Both fail-on-revert proven against the two fixes above. 10 byte-identical
copies regenerated via sync; lint green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… merge

The #242 bring-current (1af8ef0) composed #254's forced deadline->recreate->reclaim gate ordering with #242's two reclaim-success tests that pinned GATE_TIMEOUT_MS="1". With a 1ms budget the deadline (now+1ms) can fire before reclaim is reached, returning {ok:false} -> ~5% flake (reproduced 3/60). Production default is 5000ms, so production never misfires; purely a test-budget artifact.

Test-only fix: single-source RECLAIM_GATE_TIMEOUT_MS="250" const (documented, guarded against being lowered back to "1") for the two reclaim-success tests. Production review-workload.mjs unchanged. The age-guard (gateAgeMs<=gateTimeoutMs) is off the reclaim-success path -- both tests write a reclaimable owner so gateOwnerActive returns false and reclaim runs before the guard -- so the bump cannot weaken reclaim correctness.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@seungpyoson seungpyoson marked this pull request as ready for review June 26, 2026 15:07

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your free trial has ended. If you'd like to continue receiving code reviews, you can add a payment method here.

@seungpyoson seungpyoson merged commit 02d6d75 into main Jun 26, 2026
15 checks passed
@seungpyoson seungpyoson deleted the feat/234-concurrent-relays branch June 26, 2026 15:09
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

P0: Support multiple concurrent source-bearing relays per provider

1 participant