Surface FFI panic events instead of silently dropping them #343

Open. MaxHeimbrock wants to merge 1 commit into main from max/panic-event-handling

Summary

FfiEvent.Panic is the FFI layer's global unrecoverable-error notification, and Unity dropped it on the floor (DispatchEvent had case Panic: break; — a stub from an early protocol sync that was never wired). This PR surfaces panics through the disconnect path apps already handle.

What panic events are

The FFI sends Panic { message } (no room handle — it's global) from three kinds of places:

  1. A spawned task panicked: watch_panic wraps essentially every FFI task (room event forwarding, stream tasks, the connect task, ...) and forwards the JoinError. The Rust comment: "Recommended behaviour is to exit the process."
  2. A panic while handling a synchronous request (catch_unwind in cabi.rs).
  3. Deliberate send_panic for state errors with no callback left to carry them — currently the ready-handshake timeout, which force-closes a room after ConnectCallback already went out.

After a panic the FFI's state is unreliable: a room whose event task died silently stops receiving events (including Disconnected itself — a zombie room), and in-flight async requests never complete (hung YieldInstructions).

Caveat worth knowing: release FFI binaries build with panic = "abort" (workspace Cargo.toml), so genuine Rust panics crash the process before any event is sent. In shipped builds this wiring covers the deliberate send_panic paths — including the ready-timeout zombie-room case — and it covers everything in local debug builds.

What this does

Since "exit the process" is not acceptable in Unity (Editor / shipped games), the panic becomes an observable disconnect:

  • Log loudly via Utils.Error (works in Editor console and player logs; previously the only trace was a Rust-side log::error! that is invisible unless LK_VERBOSE is enabled).
  • Cancel all pending callbacks so awaiting instructions (e.g. a mid-flight ConnectInstruction) resolve with IsError instead of hanging forever. Done before raising the event so requests issued by user handlers reacting to the panic (reconnect attempts) are not swept up.
  • New internal FfiClient.PanicReceived event; each connected Room subscribes in OnConnect and tears itself down through the existing sequence: Disconnected, DisconnectedWithReason(UnknownReason), cleanup.
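
The ordering in the steps above (log, cancel what is pending, then raise the event) can be sketched in plain C#. This is a minimal model, not the SDK's code: `PendingRequest`, `PanicDispatchSketch`, `Register`, and `OnPanic` are hypothetical names, and `Console.Error` stands in for `Utils.Error`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for a pending async FFI request (not the SDK's type).
public sealed class PendingRequest
{
    public bool IsDone { get; private set; }
    public bool IsError { get; private set; }

    // Resolve as an error so anything awaiting this request stops waiting.
    public void Cancel()
    {
        IsDone = true;
        IsError = true;
    }
}

// Hypothetical model of the panic branch of an event dispatcher.
public sealed class PanicDispatchSketch
{
    private readonly Dictionary<ulong, PendingRequest> _pending =
        new Dictionary<ulong, PendingRequest>();
    private ulong _nextId = 1;

    public event Action<string> PanicReceived;

    public int PendingCount => _pending.Count;

    // Track a new in-flight request until it completes or is cancelled.
    public PendingRequest Register()
    {
        var request = new PendingRequest();
        _pending[_nextId++] = request;
        return request;
    }

    public void OnPanic(string message)
    {
        // 1. Log loudly (assumption: Console.Error replaces the SDK logger here).
        Console.Error.WriteLine("FFI panic: " + message);

        // 2. Cancel everything that was in flight before the panic.
        var inFlight = new List<PendingRequest>(_pending.Values);
        _pending.Clear();
        foreach (var request in inFlight)
        {
            request.Cancel();
        }

        // 3. Only now notify subscribers, so requests they register
        //    while reacting (e.g. a reconnect) are not cancelled.
        PanicReceived?.Invoke(message);
    }
}
```

In this model, a request registered inside a `PanicReceived` handler stays pending after `OnPanic` returns, while every request registered beforehand ends with `IsError` set. Swapping steps 2 and 3 would cancel the handler's new request as well, which is the case the PR's ordering avoids.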

Deliberate trade-off: Panic carries no room identity and the FFI declares its whole state unreliable, so all live rooms are torn down — conservative, but it converts silent zombie state into the reconnect flow apps already implement.
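
A rough sketch of the "tear down every live room" behaviour, under stated assumptions: `FfiClientSketch`, `RoomSketch`, and `DisconnectReasonSketch` are hypothetical stand-ins, and the point at which `IsConnected` flips to false is a guess made for illustration, not taken from the SDK.

```csharp
using System;

// Hypothetical reason enum; only the value named in the PR is modelled.
public enum DisconnectReasonSketch { UnknownReason }

// Hypothetical global event source standing in for FfiClient.PanicReceived.
public static class FfiClientSketch
{
    public static event Action<string> PanicReceived;

    public static void RaisePanic(string message) => PanicReceived?.Invoke(message);
}

// Hypothetical room that subscribes on connect and tears itself down on panic.
public sealed class RoomSketch
{
    public bool IsConnected { get; private set; }

    public event Action Disconnected;
    public event Action<DisconnectReasonSketch> DisconnectedWithReason;

    public void OnConnect()
    {
        IsConnected = true;
        FfiClientSketch.PanicReceived += OnPanic;
    }

    private void OnPanic(string message)
    {
        // The panic carries no room handle, so every connected room reacts.
        if (!IsConnected) return;

        Disconnected?.Invoke();
        DisconnectedWithReason?.Invoke(DisconnectReasonSketch.UnknownReason);
        Cleanup();
    }

    private void Cleanup()
    {
        // Assumption: cleanup marks the room disconnected and unsubscribes,
        // so a later panic does not raise the events a second time.
        IsConnected = false;
        FfiClientSketch.PanicReceived -= OnPanic;
    }
}
```

Because the event is global, a single `RaisePanic` call reaches every subscribed room; this is the conservative side of the trade-off, since rooms that might still have been healthy are disconnected too.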

Tests

  • EditMode PanicEventTests: drives RouteFfiEvent with a synthetic panic from a non-main thread; asserts main-thread marshalling, PanicReceived payload, pending-callback cancellation (not completion), and entry removal.
  • PlayMode PanicRoomTeardownTests (E2E): connects a real room, injects a synthetic panic through RouteFfiEvent, asserts Disconnected / DisconnectedWithReason(UnknownReason) fire and IsConnected goes false.
  • LateJoinTrackSubscriptionTests re-run as a connect-path regression guard.
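
The main-thread marshalling assertion from the EditMode test can be illustrated without Unity. The sketch below is an assumption-heavy model: `MainThreadQueueSketch` and `MarshallingCheck` are hypothetical, and a plain queue drained by the calling thread stands in for whatever mechanism the SDK uses to hop back to Unity's main thread.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical queue: any thread may post work, one thread drains it.
public sealed class MainThreadQueueSketch
{
    private readonly ConcurrentQueue<Action> _queue = new ConcurrentQueue<Action>();

    public void Post(Action action) => _queue.Enqueue(action);

    // Runs all queued actions on the calling thread; returns how many ran.
    public int Drain()
    {
        int ran = 0;
        while (_queue.TryDequeue(out var action))
        {
            action();
            ran++;
        }
        return ran;
    }
}

public static class MarshallingCheck
{
    // Posts a handler from a worker thread, drains on the calling thread,
    // and reports whether the handler observed the calling thread's id.
    public static bool HandlerRunsOnCallingThread()
    {
        var queue = new MainThreadQueueSketch();
        int callingThreadId = Thread.CurrentThread.ManagedThreadId;
        int handlerThreadId = -1;

        // Simulates the FFI delivering a panic from a non-main thread.
        var worker = new Thread(() =>
            queue.Post(() => handlerThreadId = Thread.CurrentThread.ManagedThreadId));
        worker.Start();
        worker.Join();

        queue.Drain();
        return handlerThreadId == callingThreadId;
    }
}
```

The handler is created on the worker thread but only executes inside `Drain`, so the thread id it records is that of the draining thread. A test in this style can assert that without depending on timing.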

All green locally against livekit-server --dev.

Follow-ups (not in this PR)

  • Rust side: emit a proper Disconnected room event on the ready-timeout path, and/or reconsider the 15s ROOM_EVENT_READY_TIMEOUT.
  • Upstream question: whether release builds should move to panic = "unwind" so real panics become observable events instead of process crashes (binary-size trade-off).

🤖 Generated with Claude Code

FfiEvent.Panic is the FFI layer's unrecoverable-error notification: a
background task died (watch_panic), a request handler panicked, or the
FFI hit a state error with no callback left to carry it (e.g. the
ready-handshake timeout force-closing a room after ConnectCallback).
Unity's DispatchEvent had `case Panic: break` — a stub from an early
protocol sync — so all of these were fully silent: rooms kept looking
connected while their event pipeline was dead, and in-flight requests
hung forever.

DispatchEvent now logs the panic, cancels all pending callbacks (so
awaiting instructions resolve with IsError instead of hanging; done
before raising PanicReceived so requests issued by reacting user
handlers are not swept up), and raises a new internal
FfiClient.PanicReceived event. Each connected Room subscribes and tears
itself down through the disconnect path apps already handle:
Disconnected / DisconnectedWithReason(UnknownReason) + cleanup.

Note that release FFI binaries build with panic=abort, so genuine Rust
panics crash the process before an event can be sent; this wiring
covers debug builds and the deliberate send_panic error paths, which
include the ready-timeout zombie-room case.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>