fix(codex): bound bridge app-server stalls by yui-stingray · Pull Request #209 · fujibee/agmsg

yui-stingray · 2026-06-24T01:34:53Z

Summary

add bounded Codex bridge app-server connect/upgrade and JSON-RPC request timeouts
apply request timeout cleanup to both stdio and direct WebSocket app-server clients
stop after configurable consecutive real watch-once failures while keeping exit 2 as normal re-arm behavior
clear failed watch spawn state and fail the bridge when an app-server request times out during event handling
isolate Codex bridge Bats project roots under each test temp dir

Closes #195.

Behavior notes

New knobs:

--connect-timeout-ms, AGMSG_CODEX_BRIDGE_CONNECT_TIMEOUT_MS, default 10000
--request-timeout-ms, AGMSG_CODEX_BRIDGE_REQUEST_TIMEOUT_MS, default 30000
--watch-failure-limit, AGMSG_CODEX_BRIDGE_WATCH_FAILURE_LIMIT, default 3

0 disables the corresponding timeout/limit.

A request timeout inside an app-server event handler now intentionally terminates the bridge instead of only logging and continuing. With the new timeout behavior, continuing after a timed-out process/spawn or turn/start can leave the bridge alive but unable to monitor correctly, for example with a non-null watchHandle and no actual watch process. Failing fast gives a clear error instead of a silent pseudo-monitor stall.

Validation

node --check scripts/drivers/types/codex/codex-bridge.js
git diff --check
bats --print-output-on-failure tests/test_codex_bridge.bats -> 22/22
bats --print-output-on-failure -f 'codex' tests/test_delivery.bats -> 10/10
timeout 240s bats --print-output-on-failure tests/ -> reached 168/393 before the outer timeout; the changed Codex bridge section passed (33-54), Codex delivery checks passed (157-163), and the only observed failure before timeout was the existing unrelated delivery set monitor: existing settings with single-quoted hook commands stays valid JSON (#134) malformed JSON case.

Review notes

This was checked with separate read-only review passes for approach, test design, implementation diff, and final readiness. The remaining practical risk is live Codex app-server variance; the added tests use fake stdio/WebSocket app-servers to cover the protocol stall/failure paths without touching real db/, teams/, or run/ state.

fujibee · 2026-06-25T07:40:58Z

@yui-stingray

Thanks — bounding the app-server stalls and the JSON-RPC request timeouts is the right direction and addresses the core of #195. One blocking item before merge:

The re-arm after a sub-limit watch-once failure is still fire-and-forget. In onProcessExited, when watch-once exits non-zero but below --watch-failure-limit (default 3), the re-arm is setTimeout(() => this.armWatch().catch(err => console.error(...)), 5000). With the new request timeout in place, if that delayed process/spawn request times out on an app-server stall, the catch only logs — the bridge does not exit, watchHandle stays null, and the process keeps running with no watcher. That reintroduces the #195 "hung, no detection" state through a different path. The new re-arm spawn request timeout exits test exercises the exitCode === 2 branch (await this.armWatch()), so this non-zero-failure branch isn't covered.

Suggested direction: route the delayed re-arm through the same fatal path as the awaited call, or count an armWatch timeout as a watch failure toward the limit and shut down on reaching it — the point is to never leave watchHandle null with no pending timer.

What's solid: the request timeouts apply symmetrically to the stdio and direct-WS clients, pending-request cleanup looks right, the WS handshake timeout clears its timer / destroys the socket / rejects pending, and routing async handler errors to a fatal shutdown is a good call.

Heads-up on merge order (not a change to this diff): a storage change on main is moving the bridge's max_id handling to opaque string equality — on rebase, please keep that rather than restoring the numeric lastWakeMaxId = 0 / parseMaxId([0-9]+) / maxId <= 0 form.

yui-stingray · 2026-06-25T08:08:45Z

Thanks for the precise catch. Updated in 85a9e4d.

What changed:

The sub-limit non-zero watch failure path now schedules re-arm via a tracked watchRearmTimer.
The delayed armWatch() rejection is routed through the same fatal process/exited handler path, so a timed-out delayed process/spawn shuts the bridge down instead of logging and continuing with no watcher.
armWatch() and shutdown() clear pending re-arm timers to avoid stale timers or duplicate re-arms.
Added a regression test that exercises exitCode: 1 below --watch-failure-limit, then a delayed re-arm process/spawn timeout.

Validation:

node --check scripts/drivers/types/codex/codex-bridge.js
git diff --check
bats --print-output-on-failure --filter "delayed re-arm" tests/test_codex_bridge.bats -> 1/1
bats --print-output-on-failure tests/test_codex_bridge.bats -> 23/23
bats --print-output-on-failure -f "codex" tests/test_delivery.bats -> 10/10

I also rebased onto current origin/main (f142f07). I did not touch the current numeric max_id handling because the opaque string equality change is not on this base yet; when rebasing after the storage change lands, I will preserve that newer opaque handling rather than restoring the numeric form.

fix(codex): bound bridge app-server stalls

85a9e4d

yui-stingray force-pushed the codex/bridge-timeouts-watch-limit branch from 464a978 to 85a9e4d Compare June 25, 2026 08:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(codex): bound bridge app-server stalls#209

fix(codex): bound bridge app-server stalls#209
yui-stingray wants to merge 1 commit into
fujibee:mainfrom
yui-stingray:codex/bridge-timeouts-watch-limit

yui-stingray commented Jun 24, 2026

Uh oh!

fujibee commented Jun 25, 2026

Uh oh!

yui-stingray commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yui-stingray commented Jun 24, 2026

Summary

Behavior notes

Validation

Review notes

Uh oh!

fujibee commented Jun 25, 2026

Uh oh!

yui-stingray commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants