Socket REPL can wedge after client disconnects immediately after prompt #364

Description

@DevGiuDev

Summary

The socket REPL on mobile targets (confirmed on Android) is generally usable, and reconnecting does work in normal conditions.

However, there is a teardown race: if a client disconnects too quickly after an evaluation completes (even after seeing the prompt again), the REPL can become wedged.

In that state:

  • the original REPL session remains blocked waiting on a promise
  • it still holds the global *compiler-state lock
  • subsequent REPL connections are accepted by the socket server but block before they can evaluate anything

So this is not a general “reconnection is broken” issue. It is a disconnect / teardown race that can poison later sessions.

Environment

  • Repo: Tensegritics/ClojureDart
  • Tested against: b62e1f9
  • Platform where reproduced: Android device (flutter run -d <android-device>)
  • A desktop REPL parsing bug was also confirmed independently, but this report is specifically about the teardown race on mobile.

Reproduction

Control case: reconnect works

  1. Start the app with:
    clj -M:cljd flutter -d <android-device>
  2. Connect to the printed REPL port.
  3. Evaluate:
    (+ 1 2)
  4. Wait a few seconds after the prompt returns.
  5. Close the socket.
  6. Reconnect.
  7. Evaluate:
    (+ 10 20)

Result: works correctly. The banner is shown again and the evaluation returns 30.

Failing case: immediate disconnect wedges later sessions

  1. Start the app with:
    clj -M:cljd flutter -d <android-device>
  2. Connect to the REPL socket.
  3. Evaluate either a trivial form or a slightly more active one, e.g.:
    (+ 1 2)
    or
    (pick!)
  4. As soon as the response/prompt returns, close the socket immediately.
  5. Reconnect to the REPL port.
  6. Try to evaluate another trivial form:
    (+ 10 20)

Result:

  • reconnect may still show the banner
  • the next evaluation hangs and produces no response
  • later sessions are also blocked

Evidence

Thread dump

After reproducing the failing case, jstack shows:

"Clojure Connection CLJD repl 1" ... WAITING (parking)
  at clojure.core$promise$reify__8625.deref(core.clj:7257)
  at cljd.build$repl$fn__6008.invoke(build.clj:259)
  - locked <0x...> (a clojure.lang.Atom)

"Clojure Connection CLJD repl 2" ... BLOCKED (on object monitor)
  at cljd.build$repl$fn__6008.invoke(build.clj:258)
  - waiting to lock <0x...> (a clojure.lang.Atom)

This strongly suggests the first REPL connection is waiting on @p while still holding the *compiler-state lock, and all later sessions block on that same lock.

Process log

The process log also shows a broken pipe while handling REPL output:

Exception in thread "Thread-18" java.net.SocketException: Broken pipe
...
 at cljd.build$compile_cli$fn__6192.invoke(build.clj:586)

Likely root cause

In clj/src/cljd/build.clj, repl currently does this:

(locking *compiler-state
  (let [p (promise)
        _ (swap! *repl-states assoc-in [repltag :ack!] #(deliver p %))
        _ (eval-to-repl repltag expr-or-throwable *compiler-state trigger-reload p)
        str-or-throwable @p]
    ...))

So the code waits for the ack promise inside locking *compiler-state.

If the client disconnects during teardown and the ack path never completes cleanly, the REPL session can remain blocked on @p while still holding the lock. That then blocks all later sessions.
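The suspected pattern can be reproduced in isolation with plain Clojure. This is a minimal sketch, not the actual cljd.build code: `state`, `ack`, `session-1`, and `session-2` are hypothetical stand-ins for the compiler-state atom, the ack promise, and two REPL connections.

```clojure
;; Minimal sketch of the suspected pattern, NOT the actual cljd.build code.
;; All names here are hypothetical stand-ins.
(def state (atom {}))   ; stands in for the shared compiler-state atom
(def ack (promise))     ; stands in for the per-evaluation ack promise

;; Session 1 takes the monitor, then waits for an ack that has not arrived.
(def session-1
  (future
    (locking state
      (deref ack))))

(Thread/sleep 200)      ; give session 1 time to take the lock

;; Session 2 needs the same monitor, so it cannot make progress.
(def session-2
  (future
    (locking state
      :evaluated)))

;; A bounded deref shows that session 2 is stuck behind session 1.
(def wedged-result (deref session-2 500 :blocked))

;; Delivering the ack releases session 1, and session 2 then completes.
(deliver ack "ok")
(def released-result (deref session-2 2000 :blocked))
```

Here `wedged-result` is `:blocked` and `released-result` is `:evaluated`. In the failing case described above, the equivalent of `(deliver ack ...)` apparently never happens, so the second session would stay blocked indefinitely.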

Proposed fix

Move the @p wait outside the locking *compiler-state block so the global compiler-state lock is only held during setup / compile / trigger-reload, not during the potentially unbounded wait for the ack:

(let [p (promise)
      _ (locking *compiler-state
          (swap! *repl-states assoc-in [repltag :ack!] #(deliver p %))
          (eval-to-repl repltag expr-or-throwable *compiler-state trigger-reload p))
      str-or-throwable @p]
  ...)

This does not unstick a session whose client has already disconnected, but it should prevent one stuck session from poisoning all later ones.
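The effect of the restructuring can be checked in isolation with plain Clojure. This is a sketch under assumptions, not a patch for build.clj: `state` and `eval-with-ack` are hypothetical names, and the bounded `deref` is an optional extra that is not part of the proposal above. It only shows one way a session whose ack is lost could eventually return.

```clojure
;; Sketch only: `state` and `eval-with-ack` are hypothetical names.
(def state (atom {}))   ; stands in for the shared compiler-state atom

(defn eval-with-ack
  "Does the setup under the lock, then waits for the ack outside it.
  Returns :no-ack if nothing is delivered within timeout-ms."
  [ack timeout-ms]
  (locking state
    ;; placeholder for the setup / compile / trigger-reload work
    (swap! state update :evals (fnil inc 0)))
  ;; The lock is already released here, so a slow or lost ack
  ;; cannot block other sessions.
  (deref ack timeout-ms :no-ack))

;; Session 1: its ack is never delivered (the client went away).
(def lost-ack (promise))
(def session-1 (future (eval-with-ack lost-ack 1000)))

(Thread/sleep 100)      ; session 1 is now waiting on its ack

;; Session 2: its ack arrives normally, and it is not blocked by session 1.
(def good-ack (promise))
(deliver good-ack "30")
(def session-2-result
  (deref (future (eval-with-ack good-ack 1000)) 500 :blocked))

;; Session 1 gives up after its timeout instead of waiting forever.
(def session-1-result (deref session-1 3000 :blocked))
```

Here `session-2-result` is `"30"` even though session 1 is still waiting, and `session-1-result` is `:no-ack` once the timeout expires. Whether a timeout is appropriate for the real REPL, and what value it should have, is a separate design question.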

Expected behavior

If a client disconnects abruptly or very quickly after an evaluation, that specific REPL session may fail, but later REPL sessions should still be able to connect and evaluate forms.

Actual behavior

A quick disconnect can wedge one REPL session in a way that blocks later sessions globally.
