fix(codex): bound bridge app-server stalls #209
Thanks. Bounding the app-server stalls and the JSON-RPC request timeouts is the right direction and addresses the core of #195.

One blocking item before merge: the re-arm after a sub-limit watch-once failure is still fire-and-forget. In …

Suggested direction: route the delayed re-arm through the same fatal path as the awaited call, or count an …

What's solid: the request timeouts apply symmetrically to the stdio and direct-WS clients, pending-request cleanup looks right, the WS handshake timeout clears its timer / destroys the socket / rejects pending, and routing async handler errors to a fatal shutdown is a good call.

Heads-up on merge order (not a change to this diff): a storage change on main is moving the bridge's …
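To illustrate the suggested direction, here is a minimal sketch of a re-arm loop in which the immediate call and the delayed re-arm share one entry point, so a rejection always reaches the fatal handler. All names (`createRearmer`, `startWatchOnce`, `fatal`, `failureLimit`, `delayMs`) are hypothetical and are not taken from `codex-bridge.js`; treating every non-`2` exit code as a failure is an assumption made for illustration.

```javascript
// Hypothetical sketch: none of these names come from codex-bridge.js.
function createRearmer({ startWatchOnce, fatal, failureLimit = 3, delayMs = 1000 }) {
  let failures = 0; // consecutive non-2 exits

  // Single entry point. Both the immediate re-arm and the delayed re-arm
  // call arm(), so a rejected promise is never dropped on the floor.
  function arm() {
    Promise.resolve()
      .then(startWatchOnce) // resolves with the watch-once exit code
      .then(handleExit)
      .catch(fatal);        // spawn errors and limit errors both end up here
  }

  function handleExit(code) {
    if (code === 2) {
      // Assumed meaning: exit 2 is the normal "re-arm now" signal.
      failures = 0;
      arm();
      return;
    }
    failures += 1;
    // A limit of 0 disables the failure limit.
    if (failureLimit > 0 && failures >= failureLimit) {
      throw new Error(`watch-once failed ${failures} times in a row`);
    }
    // Sub-limit failure: re-arm later, through the same guarded path.
    setTimeout(arm, delayMs);
  }

  return { arm };
}
```

Because `arm()` owns its own `.catch(fatal)`, scheduling it with `setTimeout` does not create an unobserved promise, which is the fire-and-forget hazard described above.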
464a978 to 85a9e4d
Thanks for the precise catch. Updated in 85a9e4d. What changed:
Validation:
I also rebased onto current …
Summary
`watch-once` failures while keeping exit `2` as normal re-arm behavior

Closes #195.
Behavior notes
New knobs:
- `--connect-timeout-ms`, `AGMSG_CODEX_BRIDGE_CONNECT_TIMEOUT_MS`, default `10000`
- `--request-timeout-ms`, `AGMSG_CODEX_BRIDGE_REQUEST_TIMEOUT_MS`, default `30000`
- `--watch-failure-limit`, `AGMSG_CODEX_BRIDGE_WATCH_FAILURE_LIMIT`, default `3`

`0` disables the corresponding timeout/limit.

A request timeout inside an app-server event handler now intentionally terminates the bridge instead of only logging and continuing. With the new timeout behavior, continuing after a timed-out `process/spawn` or `turn/start` can leave the bridge alive but unable to monitor correctly, for example with a non-null `watchHandle` and no actual watch process. Failing fast gives a clear error instead of a silent pseudo-monitor stall.

Validation
- `node --check scripts/drivers/types/codex/codex-bridge.js`
- `git diff --check`
- `bats --print-output-on-failure tests/test_codex_bridge.bats` -> 22/22
- `bats --print-output-on-failure -f 'codex' tests/test_delivery.bats` -> 10/10
- `timeout 240s bats --print-output-on-failure tests/` -> reached 168/393 before the outer timeout; the changed Codex bridge section passed (33-54), Codex delivery checks passed (157-163), and the only observed failure before timeout was the existing unrelated `delivery set monitor: existing settings with single-quoted hook commands stays valid JSON (#134)` malformed JSON case.

Review notes
This was checked with separate read-only review passes for approach, test design, implementation diff, and final readiness. The remaining practical risk is live Codex app-server variance; the added tests use fake stdio/WebSocket app-servers to cover the protocol stall/failure paths without touching real `db/`, `teams/`, or `run/` state.
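For readers who want a concrete picture of the request-timeout behavior described in the behavior notes, the following is a rough sketch of a JSON-RPC client that bounds each request and cleans up its pending entry. The class `RpcClient`, its `send` callback, and `handleMessage` are hypothetical names chosen for illustration and do not reflect the actual structure of `codex-bridge.js`; only the `30000` default and the "`0` disables" rule come from this PR.

```javascript
// Hypothetical sketch: class and method names are not from codex-bridge.js.
class RpcClient {
  constructor(send, { requestTimeoutMs = 30000 } = {}) {
    this.send = send;                 // function that writes one serialized message
    this.requestTimeoutMs = requestTimeoutMs;
    this.nextId = 1;
    this.pending = new Map();         // id -> { resolve, reject, timer }
  }

  request(method, params) {
    const id = this.nextId++;
    return new Promise((resolve, reject) => {
      let timer = null;
      if (this.requestTimeoutMs > 0) { // 0 disables the timeout
        timer = setTimeout(() => {
          // Drop the entry first so a late reply is ignored.
          this.pending.delete(id);
          reject(new Error(`${method} timed out after ${this.requestTimeoutMs}ms`));
        }, this.requestTimeoutMs);
      }
      this.pending.set(id, { resolve, reject, timer });
      this.send(JSON.stringify({ jsonrpc: '2.0', id, method, params }));
    });
  }

  // Feed one parsed response object received from the server.
  handleMessage(msg) {
    const entry = this.pending.get(msg.id);
    if (!entry) return;               // unknown id, e.g. a reply that arrived after its timeout
    this.pending.delete(msg.id);
    clearTimeout(entry.timer);        // clearTimeout(null) is a harmless no-op
    if (msg.error) entry.reject(new Error(msg.error.message));
    else entry.resolve(msg.result);
  }
}
```

A caller that treats the rejection as fatal, rather than logging and continuing, avoids the half-alive state described above, where a handle is recorded but no watch process exists. A fake server that never answers, in the spirit of the tests mentioned here, is simply a `send` callback that records the message and never calls `handleMessage`.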