Skip to content

fix(codex): bound bridge app-server stalls#209

Open
yui-stingray wants to merge 1 commit into
fujibee:mainfrom
yui-stingray:codex/bridge-timeouts-watch-limit
Open

fix(codex): bound bridge app-server stalls#209
yui-stingray wants to merge 1 commit into
fujibee:mainfrom
yui-stingray:codex/bridge-timeouts-watch-limit

Conversation

@yui-stingray

Copy link
Copy Markdown

Summary

  • add bounded Codex bridge app-server connect/upgrade and JSON-RPC request timeouts
  • apply request timeout cleanup to both stdio and direct WebSocket app-server clients
  • stop after configurable consecutive real watch-once failures while keeping exit 2 as normal re-arm behavior
  • clear failed watch spawn state and fail the bridge when an app-server request times out during event handling
  • isolate Codex bridge Bats project roots under each test temp dir

Closes #195.

Behavior notes

New knobs:

  • --connect-timeout-ms, AGMSG_CODEX_BRIDGE_CONNECT_TIMEOUT_MS, default 10000
  • --request-timeout-ms, AGMSG_CODEX_BRIDGE_REQUEST_TIMEOUT_MS, default 30000
  • --watch-failure-limit, AGMSG_CODEX_BRIDGE_WATCH_FAILURE_LIMIT, default 3

0 disables the corresponding timeout/limit.

A request timeout inside an app-server event handler now intentionally terminates the bridge instead of only logging and continuing. With the new timeout behavior, continuing after a timed-out process/spawn or turn/start can leave the bridge alive but unable to monitor correctly, for example with a non-null watchHandle and no actual watch process. Failing fast gives a clear error instead of a silent pseudo-monitor stall.

Validation

  • node --check scripts/drivers/types/codex/codex-bridge.js
  • git diff --check
  • bats --print-output-on-failure tests/test_codex_bridge.bats -> 22/22
  • bats --print-output-on-failure -f 'codex' tests/test_delivery.bats -> 10/10
  • timeout 240s bats --print-output-on-failure tests/ -> reached 168/393 before the outer timeout; the changed Codex bridge section passed (33-54), Codex delivery checks passed (157-163), and the only observed failure before timeout was the existing unrelated delivery set monitor: existing settings with single-quoted hook commands stays valid JSON (#134) malformed JSON case.

Review notes

This was checked with separate read-only review passes for approach, test design, implementation diff, and final readiness. The remaining practical risk is live Codex app-server variance; the added tests use fake stdio/WebSocket app-servers to cover the protocol stall/failure paths without touching real db/, teams/, or run/ state.

@fujibee

fujibee commented Jun 25, 2026

Copy link
Copy Markdown
Owner

@yui-stingray

Thanks — bounding the app-server stalls and the JSON-RPC request timeouts is the right direction and addresses the core of #195. One blocking item before merge:

The re-arm after a sub-limit watch-once failure is still fire-and-forget. In onProcessExited, when watch-once exits non-zero but below --watch-failure-limit (default 3), the re-arm is setTimeout(() => this.armWatch().catch(err => console.error(...)), 5000). With the new request timeout in place, if that delayed process/spawn request times out on an app-server stall, the catch only logs — the bridge does not exit, watchHandle stays null, and the process keeps running with no watcher. That reintroduces the #195 "hung, no detection" state through a different path. The new re-arm spawn request timeout exits test exercises the exitCode === 2 branch (await this.armWatch()), so this non-zero-failure branch isn't covered.

Suggested direction: route the delayed re-arm through the same fatal path as the awaited call, or count an armWatch timeout as a watch failure toward the limit and shut down on reaching it — the point is to never leave watchHandle null with no pending timer.

What's solid: the request timeouts apply symmetrically to the stdio and direct-WS clients, pending-request cleanup looks right, the WS handshake timeout clears its timer / destroys the socket / rejects pending, and routing async handler errors to a fatal shutdown is a good call.

Heads-up on merge order (not a change to this diff): a storage change on main is moving the bridge's max_id handling to opaque string equality — on rebase, please keep that rather than restoring the numeric lastWakeMaxId = 0 / parseMaxId([0-9]+) / maxId <= 0 form.

@yui-stingray yui-stingray force-pushed the codex/bridge-timeouts-watch-limit branch from 464a978 to 85a9e4d Compare June 25, 2026 08:08
@yui-stingray

Copy link
Copy Markdown
Author

Thanks for the precise catch. Updated in 85a9e4d.

What changed:

  • The sub-limit non-zero watch failure path now schedules re-arm via a tracked watchRearmTimer.
  • The delayed armWatch() rejection is routed through the same fatal process/exited handler path, so a timed-out delayed process/spawn shuts the bridge down instead of logging and continuing with no watcher.
  • armWatch() and shutdown() clear pending re-arm timers to avoid stale timers or duplicate re-arms.
  • Added a regression test that exercises exitCode: 1 below --watch-failure-limit, then a delayed re-arm process/spawn timeout.

Validation:

  • node --check scripts/drivers/types/codex/codex-bridge.js
  • git diff --check
  • bats --print-output-on-failure --filter "delayed re-arm" tests/test_codex_bridge.bats -> 1/1
  • bats --print-output-on-failure tests/test_codex_bridge.bats -> 23/23
  • bats --print-output-on-failure -f "codex" tests/test_delivery.bats -> 10/10

I also rebased onto current origin/main (f142f07). I did not touch the current numeric max_id handling because the opaque string equality change is not on this base yet; when rebasing after the storage change lands, I will preserve that newer opaque handling rather than restoring the numeric form.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Codex bridge can hang indefinitely on app-server stalls and watch failures

2 participants