fix(daemon): prevent two daemons from running at once (split-brain)#104
Conversation
Two daemons could run at once, split-braining the system: one process owned the CLI socket while another owned the WebSocket port the extension connects to. The CLI then talked to an extension-less daemon and reported "no extensions connected" / "context '<id>' not found", even though the extension was connected — just to the other daemon. Root cause: the singleton election was driven only by the pid file, which is advisory and racy. The one OS-exclusive resource (the WS listen port) was treated as optional — a daemon that failed to bind it logged "continuing without WebSocket" and kept running, after unconditionally unlinking and rebinding the CLI socket. Fix: bind the WS port BEFORE claiming the CLI socket and treat it as the authoritative singleton token. If the bind fails, another singleton is already live, so exit instead of becoming a second, extension-less daemon. Only the process that holds the port may clear/rebind the socket. - decideSingletonGate(): tested pure decision for the win/lose outcomes - index.ts: acquire WS port first; fatal on loss; remove the unconditional socket unlink so a loser can never hijack the CLI socket
ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe daemon startup now enforces a single shared WebSocket port via early singleton gating. Early WS binding is attempted upfront, and ChangesWebSocket Singleton Gating and Daemon Startup
Sequence DiagramsequenceDiagram
participant Daemon
participant WSServer as WebSocket Server
participant Gate as Singleton Gate
participant CLISocket as CLI Socket
Daemon->>WSServer: startWsServer()
WSServer-->>Daemon: port acquired or failed
Daemon->>Gate: decideSingletonGate(wsPortAcquired)
alt Port acquired
Gate-->>Daemon: { action: serve }
Daemon->>CLISocket: Bind CLI socket
CLISocket-->>Daemon: Listening
else Port unavailable
Gate-->>Daemon: { action: exit, exitCode: 0 }
Daemon->>Daemon: Exit process
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
daemon/index.ts (1)
901-901: 💤 Low valueConsider more generic log message for WS binding failures.
The message assumes EADDRINUSE (port already in use), but other errors like EACCES (permission denied) are possible. While
wsBindError.messageis appended for debugging, a more generic prefix would be clearer.♻️ Optional refinement
- log(`ws port ${WS_PORT} already held by another daemon — ${singletonGate.reason}${wsBindError ? ` (${wsBindError.message})` : ""}`) + log(`failed to acquire ws port ${WS_PORT} — ${singletonGate.reason}${wsBindError ? ` (${wsBindError.message})` : ""}`)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@daemon/index.ts` at line 901, Change the hardcoded "already held" log to a more generic bind-failure message: update the log call that references WS_PORT, singletonGate.reason and wsBindError to say something like "failed to bind ws port" (or "WS port bind failure") and then append singletonGate.reason and the wsBindError.message (if present) for details; ensure the prefix no longer assumes EADDRINUSE so it covers EACCES and other errors while preserving the existing diagnostic info.test/daemon-lifecycle.test.ts (1)
95-105: 💤 Low valueTest coverage for singleton gate exit scenarios is good.
Both standalone and native daemon exit cases are verified with correct exitCode. Optionally, you could also verify the
reasonfield differs between standalone and native modes for completeness.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/daemon-lifecycle.test.ts` around lines 95 - 105, Add assertions to the existing tests that also verify the returned decision.reason differs between standalone and native modes: call decideSingletonGate for both wsPortAcquired: false with standalone: true and standalone: false, then assert decision.reason (or decision.reason?.message) contains the expected distinct text for each case (e.g., mentions "standalone" vs "native" or whichever unique strings are produced by decideSingletonGate). Keep the existing action and exitCode assertions and only add these extra expect(...) checks to the two tests referencing decideSingletonGate.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@daemon/index.ts`:
- Line 901: Change the hardcoded "already held" log to a more generic
bind-failure message: update the log call that references WS_PORT,
singletonGate.reason and wsBindError to say something like "failed to bind ws
port" (or "WS port bind failure") and then append singletonGate.reason and the
wsBindError.message (if present) for details; ensure the prefix no longer
assumes EADDRINUSE so it covers EACCES and other errors while preserving the
existing diagnostic info.
In `@test/daemon-lifecycle.test.ts`:
- Around line 95-105: Add assertions to the existing tests that also verify the
returned decision.reason differs between standalone and native modes: call
decideSingletonGate for both wsPortAcquired: false with standalone: true and
standalone: false, then assert decision.reason (or decision.reason?.message)
contains the expected distinct text for each case (e.g., mentions "standalone"
vs "native" or whichever unique strings are produced by decideSingletonGate).
Keep the existing action and exitCode assertions and only add these extra
expect(...) checks to the two tests referencing decideSingletonGate.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b77a42b0-436b-4c49-9f6b-a5ca6c859eb7
📒 Files selected for processing (3)
daemon/index.tsdaemon/lifecycle.tstest/daemon-lifecycle.test.ts
Resolve the daemon/index.ts conflict by taking main's post-0.17.2 structure (the 0.17.2 release rewrote the daemon and added the native relay architecture). The atomic WS-port singleton gate is reapplied onto that structure in the next commit. daemon/lifecycle.ts (decideSingletonGate + SingletonGateDecision) and the unit tests auto-merged cleanly from the original PR commit 5e3b498 — @vitorfhc's authorship on the core helper and tests is preserved.
…architecture Reapplies @vitorfhc's WS-port singleton-gate fix (PR Hacker-Valley-Media#104) onto the post-0.17.2 daemon. The 0.17.2 release rewrote daemon/index.ts and added the native relay architecture, so the original diff no longer applied cleanly; the fix's intent is unchanged and the split-brain bug it targets is still live on main. - Extract the WS server into startWsServer() and acquire the WS port BEFORE the CLI socket — the port is the one OS-exclusive singleton token (validated against Bun docs/source + Linux/BSD man pages: Bun binds exclusively by default; macOS/Windows ignore reusePort; Linux defaults it off). - decideSingletonGate(): a daemon that loses the WS-port race exits cleanly instead of "continuing without WebSocket" as a second, extension-less singleton. - Only the WS-port holder may clear/rebind the CLI socket (removed the unconditional unlink that let a loser hijack it); scoped to the non-Windows branch. - Stop the WS server if the CLI socket listen then fails. decideSingletonGate + its unit tests auto-merged from the original commit 5e3b498, so @vitorfhc's authorship on the core helper and tests is preserved. Verified: typecheck clean, bun test 477 pass / 0 fail, capability-blind audit passes, and a live two-daemon test confirms the duplicate exits without hijacking the socket (split-brain reproduced on main, gone with this change). Co-authored-by: Vitor Falcao <vitorfhcosta@gmail.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Thanks for this, @vitorfhc — sharp diagnosis. Before integrating I validated every technical claim in the PR against primary sources (Bun docs/issues/blog + the
Your branch had gone stale against
Verification on the merged result:
The PR is now |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@daemon/index.ts`:
- Around line 1197-1231: The code allows WebSocket clients to self-assert native
delegation privileges by simply sending a NATIVE_REGISTER_TYPE request without
any credential verification, then immediately honoring privileged
NATIVE_DELEGATE_TYPE actions via the forwardDelegateToBridge function. Before
setting the __native flag in the NATIVE_REGISTER_TYPE handler and before
processing the NATIVE_DELEGATE_TYPE request in the subsequent block, add
validation logic to require a non-self-asserted capability or nonce credential
from the injected agent path to verify that the WebSocket client is genuinely
authorized to perform native operations, rather than allowing any client to
claim these privileges.
- Around line 538-551: The socket file cleanup via clearDaemonRuntimeFiles() is
invoked within bootstrapDaemonRole() before the WS port bind attempt, creating a
race condition where the socket could be cleared before WS acquisition is
confirmed. Move the socket cleanup operation to occur after successful WS port
acquisition and singleton gate confirmation (around the existing conditional
socket cleanup near line 1144-1146) to ensure the WS token is proven before
removing the socket file and prevent competing processes from acquiring the
token and orphaned socket path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@daemon/index.ts`:
- Around line 1197-1231: The code allows WebSocket clients to self-assert native
delegation privileges by simply sending a NATIVE_REGISTER_TYPE request without
any credential verification, then immediately honoring privileged
NATIVE_DELEGATE_TYPE actions via the forwardDelegateToBridge function. Before
setting the __native flag in the NATIVE_REGISTER_TYPE handler and before
processing the NATIVE_DELEGATE_TYPE request in the subsequent block, add
validation logic to require a non-self-asserted capability or nonce credential
from the injected agent path to verify that the WebSocket client is genuinely
authorized to perform native operations, rather than allowing any client to
claim these privileges.
- Around line 538-551: The socket file cleanup via clearDaemonRuntimeFiles() is
invoked within bootstrapDaemonRole() before the WS port bind attempt, creating a
race condition where the socket could be cleared before WS acquisition is
confirmed. Move the socket cleanup operation to occur after successful WS port
acquisition and singleton gate confirmation (around the existing conditional
socket cleanup near line 1144-1146) to ensure the WS token is proven before
removing the socket file and prevent competing processes from acquiring the
token and orphaned socket path.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4d70540c-eefd-4ce8-abb1-1ee99c071c26
📒 Files selected for processing (1)
daemon/index.ts
🛑 Comments failed to post (2)
daemon/index.ts (2)
538-551:
⚠️ Potential issue | 🟠 Major🧩 Analysis chain
🏁 Script executed:
#!/bin/bash # Description: Inspect pre-gate runtime cleanup to verify it cannot unlink the daemon IPC socket. rg -n -C8 'function clearDaemonRuntimeFiles|const clearDaemonRuntimeFiles|clearDaemonRuntimeFiles|socketPath|SOCKET_PATH|unlink' daemon/lifecycle.ts daemon/index.tsRepository: Hacker-Valley-Media/Interceptor
Length of output: 17074
Pre-gate socket cleanup violates singleton ordering; move socket removal after WS acquisition.
The code calls
bootstrapDaemonRole()at line 538, which invokesclearDaemonRuntimeFiles(deps, decision.reason)at line 526. This helper removesSOCKET_PATH(confirmed indaemon/lifecycle.ts:85), but only after the WS port bind is attempted at line 546. This ordering inverts the PR objective: the socket file must not be removed until after the WS singleton token is proven. A process that clears the socket early, fails the WS bind, and then exits will allow a competing process to acquire the WS token and claim the now-cleared socket path before losing the WS race completes.Move
clearDaemonRuntimeFiles()to after the WS gate succeeds (around line 1144–1146, where the socket file is already cleared again conditionally).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@daemon/index.ts` around lines 538 - 551, The socket file cleanup via clearDaemonRuntimeFiles() is invoked within bootstrapDaemonRole() before the WS port bind attempt, creating a race condition where the socket could be cleared before WS acquisition is confirmed. Move the socket cleanup operation to occur after successful WS port acquisition and singleton gate confirmation (around the existing conditional socket cleanup near line 1144-1146) to ensure the WS token is proven before removing the socket file and prevent competing processes from acquiring the token and orphaned socket path.
1197-1231:
⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDon’t let WebSocket clients self-assert native delegation privileges.
NATIVE_REGISTER_TYPEmarks the socket as native, andNATIVE_DELEGATE_TYPEthen forwards directly to the bridge before context validation. Require a non-self-asserted capability/nonce from the injected agent path before setting__nativeor honoring delegated bridge actions.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@daemon/index.ts` around lines 1197 - 1231, The code allows WebSocket clients to self-assert native delegation privileges by simply sending a NATIVE_REGISTER_TYPE request without any credential verification, then immediately honoring privileged NATIVE_DELEGATE_TYPE actions via the forwardDelegateToBridge function. Before setting the __native flag in the NATIVE_REGISTER_TYPE handler and before processing the NATIVE_DELEGATE_TYPE request in the subsequent block, add validation logic to require a non-self-asserted capability or nonce credential from the injected agent path to verify that the WebSocket client is genuinely authorized to perform native operations, rather than allowing any client to claim these privileges.
|
@vitorfhc that last comment was AI-assisted. I wanted to personally drop a THANK YOU for the contribution. This will help a lot with stability. |
…l flags Bumps every surface to 0.17.3. The headline is a daemon stability fix that hardens multi-context / multi-browser routing, plus first-class install support for Chrome's sibling channels. Daemon single-instance gate (Hacker-Valley-Media#104) - Fixes a split-brain where two daemons could run at once — one owning the CLI socket, another owning the WebSocket port the extension connects to — so the CLI talked to an extension-less daemon and reported "context '<id>' not found (no extensions connected)" even though the extension was connected (just to the other daemon). - The WebSocket listen port is now the atomic singleton token: it is bound BEFORE the CLI socket, and a daemon that loses that race exits cleanly instead of surviving as a second, extension-less daemon that unlinks and hijacks the CLI socket. The pid-file election is kept as a fast path; the port bind is the authoritative backstop. - Stress-validated: hundreds of concurrent duplicate-daemon spawns, a full kill-and-respawn thundering herd, and ~120 concurrent tab-opens across three live extensions — zero split-brains, zero socket hijacks, zero mis-routes. (PR by @vitorfhc; reapplied onto the 0.17.2 relay architecture.) Chrome channel install flags (Hacker-Valley-Media#98) - scripts/install.sh adds --chrome-beta, --chrome-canary, --chrome-dev, and --chrome-for-testing as first-class install targets (Darwin), so you can run stable + Beta side by side, each routed to a distinct --context. - Correct native-messaging host directory for Chrome for Testing (Chrome 146+ uses a dedicated ~/Library/.../Google/ChromeForTesting dir). (PR by @trillium.) Internal / CI - bash -n shell-syntax gate over scripts/*.sh (Hacker-Valley-Media#108); install-modes test updates for the channel changes + first-installed default (Hacker-Valley-Media#109). Rollup for users updating from 0.16.9 - This release also carries the full 0.17.x line: Electron/Chromium CDP app control, the Native Runtime Agent, and the capability-blind Extension Fabric. Version: 0.17.2 -> 0.17.3 (package.json + extension manifest; the build stamps the CLI / daemon / bridge from package.json). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Problem
A CLI command targeting a named context fails with
context '<id>' not found (no extensions connected)even though the extension is installed and connected.interceptor contextsis empty.The cause is two daemons running at once, each owning half the system:
:19222/tmp/interceptor.sock(the CLI's socket)The CLI talks to the socket → reaches Daemon B → sees zero extensions. The extension is connected to Daemon A → orphaned from the CLI. Split brain.
Root cause
The singleton election was driven only by the pid file, which is advisory and racy. The one genuinely OS-exclusive resource — the WebSocket listen port — was treated as optional:
:19222loggedcontinuing without WebSocketand kept running anyway.unlinked and rebound the CLI socket, stealing it from the real daemon.So a process that lost the only real race survived as a half-daemon: it owned the socket (and the CLI) but had no extension.
Fix
Make the WebSocket port the atomic singleton token, acquired before the CLI socket:
:19222first. The OS guarantees exactly one binder.The pid-file election is kept as a fast path; the port bind is the authoritative backstop that closes the race.
Verification
decideSingletonGate()— new pure helper with unit tests for the win/lose outcomes.--standalonedaemon while the first holds the port → the second exits and the first's socket inode is unchanged (not hijacked). Confirmed both for the pid-file fast path and for the stale-pid-file case that actually triggered the bug:bun testdaemon/transport suite: 35 pass, 0 fail.Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests