Codex bridge can hang indefinitely on app-server stalls and watch failures #195

Description

@yui-stingray

Summary

On the latest main (e3031b8, v1.1.0), the Codex bridge has --loaded-timeout and --turn-timeout, but three failure paths are still unbounded:

  1. The TCP connection to --app-server ws://... succeeds, but the WebSocket upgrade never completes.
  2. WebSocket upgrade succeeds, but a JSON-RPC request such as initialize never receives a response.
  3. watch-once repeatedly exits non-zero; the bridge logs the failure and re-arms forever.

The reproductions below are not Windows-specific; they exercise transport and guardrail behavior in the bridge itself.

Reproduction Evidence

I reproduced this against a temp copy of the scripts so the repo's real teams/, db/, and run/ were not touched.

1. WebSocket handshake stall

Fake app-server: listens on 127.0.0.1, accepts the TCP connection, reads data, and never sends the HTTP 101 WebSocket upgrade response.
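
For reference, a stalling server of this kind can be sketched in a few lines of Node. This is a minimal illustration, not the exact script used for the reproduction; `createStallingServer` is a hypothetical name.

```javascript
// Sketch of a fake app-server that accepts TCP connections but never
// answers the WebSocket upgrade. All names here are hypothetical.
const net = require('node:net');

function createStallingServer() {
  return net.createServer((socket) => {
    // Read and discard the client's HTTP upgrade request. Nothing is ever
    // written back, so the client never sees "101 Switching Protocols".
    socket.on('data', () => {});
    // Ignore resets when the client is killed by the outer timeout.
    socket.on('error', () => {});
  });
}
```

To use it, call `createStallingServer().listen(0, '127.0.0.1', ...)` and pass the port from `server.address().port` to the bridge via --app-server.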

Bridge command shape:

timeout 4s node scripts/drivers/types/codex/codex-bridge.js \
  --project "$tmp/proj" --team team --name alice --thread thread-existing \
  --app-server "ws://127.0.0.1:$port" --timeout 1 --interval 1

Observed:

status=124
stdout=
stderr=

The bridge never failed on its own and printed nothing; the outer timeout killed it (exit status 124).

2. JSON-RPC request stall

Fake app-server: completes the WebSocket upgrade, then ignores JSON-RPC frames, so initialize never receives a response.
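
A server with this behavior can be sketched with only Node's standard library. This is an illustrative sketch under assumptions, not the script used above; `computeAccept` and `createSilentWsServer` are hypothetical names. The accept-key derivation follows RFC 6455.

```javascript
// Sketch of a fake app-server that completes the WebSocket upgrade and
// then ignores every frame. Function names are hypothetical.
const http = require('node:http');
const crypto = require('node:crypto');

// Fixed GUID defined by RFC 6455 for deriving Sec-WebSocket-Accept.
const WS_GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

function computeAccept(secWebSocketKey) {
  return crypto
    .createHash('sha1')
    .update(secWebSocketKey + WS_GUID)
    .digest('base64');
}

function createSilentWsServer() {
  // Plain HTTP requests get 426 Upgrade Required.
  const server = http.createServer((req, res) => {
    res.writeHead(426);
    res.end();
  });
  server.on('upgrade', (req, socket) => {
    const accept = computeAccept(req.headers['sec-websocket-key']);
    socket.write(
      'HTTP/1.1 101 Switching Protocols\r\n' +
        'Upgrade: websocket\r\n' +
        'Connection: Upgrade\r\n' +
        `Sec-WebSocket-Accept: ${accept}\r\n\r\n`
    );
    // Swallow all frames: initialize is received but never answered.
    socket.on('data', () => {});
    socket.on('error', () => {});
  });
  return server;
}
```

As with the handshake case, the server would be started with `listen(0, '127.0.0.1')` and its port passed to --app-server.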

Observed with the same bridge command shape:

status=124
stdout=
stderr=

Again, the bridge only stopped because of the outer timeout.

3. Repeated watch failure loop

Fake app-server: responds to initialize, thread/resume, and process/spawn, then sends process/exited with exitCode: 1 every time the watch process is armed.

Observed with timeout 8s:

status=124
stderr:
codex-bridge: resumed thread thread-existing
codex-bridge: armed team/alice
codex-bridge: watch-once failed with exit 1: fake watch failure
codex-bridge: armed team/alice
codex-bridge: watch-once failed with exit 1: fake watch failure
failure_count=2

The bridge kept running after the repeated failures and was killed by the outer timeout.
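
One way to bound this loop is a counter of consecutive non-zero exits that resets on success. The sketch below is an assumption about how such a guard could look, not existing bridge code; the class name and the limit value are hypothetical.

```javascript
// Sketch of a consecutive-failure guard for the watch-once re-arm loop.
// Hypothetical helper; not part of the current bridge.
class ConsecutiveFailureLimit {
  constructor(maxFailures) {
    if (!Number.isInteger(maxFailures) || maxFailures < 1) {
      throw new RangeError('maxFailures must be a positive integer');
    }
    this.maxFailures = maxFailures;
    this.count = 0;
  }

  // Record one watch-once exit code. Returns true when the limit is
  // reached and the caller should stop re-arming.
  record(exitCode) {
    if (exitCode === 0) {
      this.count = 0; // a success breaks the streak
      return false;
    }
    this.count += 1;
    return this.count >= this.maxFailures;
  }
}
```

The re-arm path would call `record(exitCode)` after each process/exited notification and exit with an error once it returns true.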

Expected Behavior

The bridge should bound these stalls and fail explicitly, for example:

  • WebSocket handshake timeout, with a clear error message.
  • JSON-RPC request timeout for app-server requests, with cleanup of pending requests.
  • A configurable consecutive watch failure limit, so persistent watch-once failures stop the bridge instead of re-arming forever.
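
As a rough illustration of the request-timeout item, pending JSON-RPC requests could be tracked in a map where each entry owns a timer, so a timeout both rejects the caller and removes the entry. This is a minimal sketch with hypothetical names (`PendingRequests`, `create`, `settle`, `rejectAll`); it does not reflect how the bridge currently tracks requests.

```javascript
// Sketch of per-request timeouts with cleanup of pending entries.
// Hypothetical helper; names and shape are assumptions.
class PendingRequests {
  constructor(timeoutMs) {
    this.timeoutMs = timeoutMs;
    this.nextId = 1;
    this.pending = new Map(); // id -> { resolve, reject, timer }
  }

  // Register a request. The promise resolves via settle() or rejects
  // after timeoutMs, and the entry is removed in both cases.
  create(method) {
    const id = this.nextId++;
    const promise = new Promise((resolve, reject) => {
      const timer = setTimeout(() => {
        this.pending.delete(id);
        reject(new Error(`${method} timed out after ${this.timeoutMs} ms`));
      }, this.timeoutMs);
      this.pending.set(id, { resolve, reject, timer });
    });
    return { id, promise };
  }

  // Resolve a request when a response with a matching id arrives.
  // Returns false for unknown or already timed-out ids.
  settle(id, result) {
    const entry = this.pending.get(id);
    if (!entry) return false;
    clearTimeout(entry.timer);
    this.pending.delete(id);
    entry.resolve(result);
    return true;
  }

  // Reject everything still pending, e.g. when the socket closes.
  rejectAll(error) {
    for (const entry of this.pending.values()) {
      clearTimeout(entry.timer);
      entry.reject(error);
    }
    this.pending.clear();
  }
}
```

The same timer pattern could wrap the WebSocket upgrade itself: start a timer when the socket connects, clear it when the 101 response arrives, and destroy the socket with a clear error if it fires first.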

Notes

This is complementary to the existing --loaded-timeout and --turn-timeout; those guard different parts of the bridge lifecycle and did not cover the reproductions above.
