Align Relay's MatrixRTC call path with Element Call interop #134
Open
rexbron wants to merge 17 commits into
Document how affected users can capture Activity Log JSON exports
filtered to the Call category, what each diagnostic signal in the export
means (credential discovery, token exchange, key distribution, frame
routing), and how to file useful reports. Also covers the unified-log
fallback via `log stream` for harder cases and notes which data is and
isn't safe to share publicly.
Assisted-By: Claude Sonnet 4.7
Engineering reference under docs/internal/ that maps every Relay
call-path function to its Element Call / matrix-js-sdk counterpart. Each
entry cites the exact source line in the upstream reference and the
matching range in Relay, and flags deviations confirmed against MSCs,
source code, or real-world user traces. Acts as the worklog for the
rtc-element-call-alignment branch.
Assisted-By: Claude Sonnet 4.7
User trace 97853C31 showed an Activity Log that ended at "Connected to
call" with no further events: the trace stops the moment Relay finished
its setup checklist, even though the user's call kept running (and
failing) for minutes afterward. Without post-connect events, every
"connected but no media" or "call dropped silently" report is
undiagnosable from the JSON export alone.
Wire the LiveKit RoomDelegate callbacks that produce diagnostic signals
through to the Activity Log:
- didUpdateConnectionState .disconnected (with previous state)
- didFailToConnectWithError (SFU rejected the initial connect)
- didDisconnectWithError (mid-call drop with cause)
- didSubscribeTrack (now lands in the JSON, not just os_log)
- didFailToSubscribeTrack (firewall/NAT/codec failures)
- local didPublishTrack (proves our own media went out)
- didUpdateE2EEState (per-track cryptor failures, encrypted rooms)
The describe(_:) helper provides stable labels for LiveKit
ConnectionState values that won't drift if the enum gains cases.
Assisted-By: Claude Sonnet 4.7
The previous fall-back path used `try?` to swallow every v2 error
and silently retry against legacy `/sfu/get`. That hid two
diagnostic signals:
- The actual reason v2 rejected us (Matrix `errcode` + `error`).
- That the fall-back happened at all.
Replace `try?` with a `do/catch` that:
- Parses Matrix-style `{errcode, error}` envelopes from the response
body so users see `M_BAD_JSON: The request body is missing 'room_id'
or 'slot_id'` instead of generic `tokenExchangeFailed`.
- Logs the v2 failure to both os_log and the Activity Log before
trying legacy, so the silent fall-back is visible after the fact.
- Carries the structured detail through a new
`LiveKitCredentialError.tokenExchangeRejected` case that surfaces
status, errcode, message, and which endpoint failed.
This unblocks self-diagnosis for the upcoming `slot_id` and v2
identity fixes — once those land, this logging will confirm v2 is
healthy without requiring users to hand-curate os_log traces.
Assisted-By: Claude Sonnet 4.7
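The envelope parsing described in this commit can be sketched roughly as follows. This is an illustration only: `MatrixErrorEnvelope`, `TokenExchangeRejection`, and `makeRejection` are hypothetical names, not the branch's actual `LiveKitCredentialError.tokenExchangeRejected` implementation, and the description format is an assumption.

```swift
import Foundation

/// Matrix-style error body, e.g. {"errcode": "M_BAD_JSON", "error": "..."}.
struct MatrixErrorEnvelope: Decodable {
    let errcode: String
    let error: String?
}

/// Hypothetical stand-in for the structured rejection described above.
struct TokenExchangeRejection: Error, CustomStringConvertible {
    let endpoint: String
    let status: Int
    let errcode: String?
    let message: String?

    var description: String {
        // Fall back to the bare HTTP status when the body was not a Matrix envelope.
        let code = errcode ?? "HTTP \(status)"
        let detail = message.map { ": \($0)" } ?? ""
        return "\(endpoint) rejected token exchange (\(code)\(detail))"
    }
}

/// Builds a rejection from a failed response; a non-JSON body yields nil fields.
func makeRejection(endpoint: String, status: Int, body: Data) -> TokenExchangeRejection {
    let envelope = try? JSONDecoder().decode(MatrixErrorEnvelope.self, from: body)
    return TokenExchangeRejection(
        endpoint: endpoint,
        status: status,
        errcode: envelope?.errcode,
        message: envelope?.error
    )
}
```

A caller would log `makeRejection(...).description` before attempting the other endpoint, so the fall-back leaves a trace.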
Two coupled changes that must ship together. Shipping only the first
moves users from "fails to connect" to "connects but no media."
Item 1 — slot_id
================
The v2 endpoint in lk-jwt-service requires `slot_id` in the request
body; `SFURequest.Validate()` returns HTTP 400 M_BAD_JSON ("missing
'room_id' or 'slot_id'") if absent. Relay never sent it, so the v2
attempt failed and we silently fell back to /sfu/get. Hardcode
"m.call#ROOM" to match Element Call's
`getLiveKitJWTWithDelayDelegation`.
Item 2 — v2 LiveKit identity routing
====================================
On v2, lk-jwt-service issues a JWT whose `sub` claim is
`unpaddedBase64(sha256(json_marshal([matrixID, claimedDeviceID,
memberID])))` (per `helper.go::LiveKitIdentityFor`). LiveKit uses that
as the participant identity. On legacy, it's `<user>:<server>:<device>`.
The frame cryptor routes frames to remote peers' decoders by exact
string match on identity, so keying the cryptor under the legacy shape
silently breaks every frame on v2. Four sites needed updating:
- `CallEncryptionService.liveKitIdentity(matrixID:claimedDeviceID:
memberID:)` ports the lk-jwt-service algorithm. Inputs are all ASCII
(Matrix IDs, device IDs, UUIDs), so JSONSerialization is byte-
identical to Go's `json.Marshal` and the resulting hash matches.
- `CallViewModel.connect` now keys the local cryptor under
`room.localParticipant.identity?.stringValue` instead of
`<userID>:<deviceID>`. That's the JWT sub claim regardless of v1/v2.
If LiveKit hasn't assigned an identity (shouldn't happen), the
connect now fails fast with `CallViewModelError.
missingLocalParticipantIdentity` rather than silently misrouting.
- `CallWidgetBridge.handleIncomingToDevice` registers each inbound
encryption key under every plausible LiveKit identity for that peer
— both the legacy `<sender>:<device>` and the v2 hash. The cryptor
ring stores per-(participantId, index) entries and matches by
identity, so registering both is safe and works against peers
regardless of which endpoint they took.
- `CallViewModel.redistributeKey` previously parsed the LiveKit
identity by `:` to recover (userId, deviceId). On v2 the identity
has no colons and the parse fails, so new peers never got our key.
Drop the parse entirely; re-fetch `m.call.member` state and
broadcast to all current targets. Matches Element Call's
`RTCEncryptionManager` behaviour on membership change.
Assisted-By: Claude Sonnet 4.7
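A minimal sketch of the v2 identity derivation described in Item 2, under stated assumptions: the function name `liveKitIdentitySketch` is hypothetical, the base64 alphabet is assumed to be the standard one with padding stripped, and `CryptoKit` is assumed to be available (Apple platforms). It is not the branch's actual `CallEncryptionService.liveKitIdentity` code.

```swift
import Foundation
import CryptoKit  // Apple platforms only; other platforms would need a substitute SHA-256

/// Sketch of: unpaddedBase64(sha256(json([matrixID, claimedDeviceID, memberID]))).
func liveKitIdentitySketch(matrixID: String,
                           claimedDeviceID: String,
                           memberID: String) throws -> String {
    // Serialize a compact JSON array of the three strings.
    // `.withoutEscapingSlashes` matters: by default JSONSerialization writes "/"
    // as "\/", which Go's json.Marshal does not do.
    let json = try JSONSerialization.data(
        withJSONObject: [matrixID, claimedDeviceID, memberID],
        options: [.withoutEscapingSlashes]
    )
    let digest = SHA256.hash(data: json)
    // Assumption: standard base64 alphabet, trailing "=" padding removed.
    let encoded = Data(digest).base64EncodedString()
    return encoded.trimmingCharacters(in: CharacterSet(charactersIn: "="))
}
```

One caveat on the byte-identical claim: Go's `json.Marshal` escapes `<`, `>`, and `&` by default, and `JSONSerialization` escapes `/` unless told otherwise, so the two encoders only agree when inputs avoid those characters or the options are set as above.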
When the homeserver advertises no MatrixRTC SFU (neither the unstable
transports endpoint nor `.well-known org.matrix.msc4143.rtc_foci` is
configured) but a call is already in progress in the room, walk the
existing `m.call.member` state events and pick a peer's
`foci_preferred[0].livekit_service_url`.
Matches Element Call / matrix-js-sdk's third-fallback discovery
behaviour. Previously Relay would throw `sfuURLNotFound` and refuse
to join in this scenario, even though the SFU the existing
participants were using was right there in room state.
`discoverSFUURL` now takes the roomID so it can issue the
`/rooms/{id}/state` request; the previous parameterless form is gone.
Assisted-By: Claude Sonnet 4.7
Replace the raw-REST `/rooms/{id}/state` walk with
`RoomInfo.activeRoomCallParticipants` from the Rust SDK. The SDK list
is user-level only (no device IDs), so each user's device list becomes
`["*"]` — the to-device wildcard — and the SDK fans the Olm-encrypted
key payload out to all of that user's devices.
Matches `matrix-js-sdk/src/matrixrtc/ToDeviceKeyTransport.ts`. Some
warmed-up Olm sessions go to devices that aren't in the call, but the
AES key is per-call and only consumed by a LiveKit cryptor that
expects it — so the extra sessions are wasted, not unsafe. Element
Call accepts the same trade-off.
Removes the delegated-homeserver URL risk and the bespoke state-key
parser (`_<userId>_<deviceId>_m.call`) that filtered out any peer
using a non-Element-X key shape.
Assisted-By: Claude Sonnet 4.7
Adding `slot_id` flipped Relay onto lk-jwt-service's v2 /get_token,
which assigns LiveKit identities as `unpadded_base64(sha256(...))`. But
matrix-js-sdk's `CallMembership.parseFromEvent` reads our legacy
`org.matrix.msc3401.call.member` event under `MembershipKind.Session`,
where `rtcBackendIdentity` is hardcoded to `${sender}:${device_id}` —
the plain-concat form, not hashed. Peers running Element Call /
Element X / Element Web looked for that colon identity, never found us
on LiveKit, and dropped our video.
Invert the order in `fetchLiveKitToken`: try legacy first, fall forward
to v2 only when legacy fails. v2 path stays plumbed for when Relay also
publishes MSC4143 sticky `m.rtc.member` events (tracked separately).
Also surface `m.call.member` send failures in the Activity Log with a
power-level-aware hint. The most common failure shape — M_FORBIDDEN
from rooms whose `power_levels.events.org.matrix.msc3401.call.member`
defaults to `state_default` (50) instead of the override Relay sets at
room creation — silently locks non-admin participants out of E2EE call
media (no membership event in room state → peers don't send keys →
black tiles).
And rename track-kind logs from `publication.kind.rawValue` (integer)
to a named form (`audio`/`video`/`none`), and start writing remote
`didPublishTrack` events to the Activity Log.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
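The power-level check behind such a hint could look roughly like this. It is a sketch with hypothetical names (`PowerLevelsSketch`, `callMemberSendHint`) and an invented message format; it relies only on the Matrix rule that a state event needs `events[type]` if present, else `state_default` (50 when absent), and that a user's level is `users[id]`, else `users_default` (0 when absent).

```swift
import Foundation

/// Minimal view of an `m.room.power_levels` content object (only the fields used here).
struct PowerLevelsSketch {
    var events: [String: Int] = [:]
    var stateDefault: Int = 50   // default when `state_default` is absent
    var users: [String: Int] = [:]
    var usersDefault: Int = 0    // default when `users_default` is absent
}

/// Returns a human-readable hint when the user cannot send the call-member
/// state event, or nil when their power level is sufficient.
func callMemberSendHint(userID: String,
                        levels: PowerLevelsSketch,
                        eventType: String = "org.matrix.msc3401.call.member") -> String? {
    let required = levels.events[eventType] ?? levels.stateDefault
    let actual = levels.users[userID] ?? levels.usersDefault
    guard actual < required else { return nil }
    return "Sending \(eventType) needs power level \(required); \(userID) has \(actual)."
}
```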
`discoverSFUURL` previously preferred local SFU discovery (MSC4143
transports endpoint / `.well-known`) and only consulted peers'
`foci_preferred` as a last-resort fallback. That breaks federated calls:
when another homeserver's user starts the call first, they advertise
their SFU under `focus_active.focus_selection == "oldest_membership"`,
but a later-joining Relay client would ignore that and connect to its
own homeserver's SFU instead, splitting the call across two SFUs and
stranding both sides on "waiting for media".
Reorder discovery so the existing-call SFU wins. Refactor
`fetchSFUFromCallMembers` to pick the *oldest* surviving membership
rather than the first one found, and to drop expired
(`created_ts + expires < now`) and tombstoned (empty content) entries so
a stale leftover can't outvote the live participants. Local discovery
stays as the bootstrap path for the first joiner.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
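The selection rule in this commit (drop tombstones, drop expired entries, take the oldest survivor) can be sketched as below. `MembershipSketch` and `oldestLiveSFU` are hypothetical names, the model is trimmed to three fields, and skipping memberships that advertise no focus is an assumption rather than confirmed behaviour of `fetchSFUFromCallMembers`.

```swift
import Foundation

/// Hypothetical, trimmed-down model of one call-membership state event.
struct MembershipSketch {
    let createdTs: Int64      // ms since epoch
    let expires: Int64        // lifetime in ms
    let serviceURL: String?   // foci_preferred[0].livekit_service_url, if any
}

/// `nil` entries model tombstones (state events with empty content).
func oldestLiveSFU(in events: [MembershipSketch?], nowMs: Int64) -> String? {
    let candidates = events
        .compactMap { $0 }                               // drop tombstones
        .filter { $0.createdTs + $0.expires >= nowMs }   // drop expired entries
        .filter { $0.serviceURL != nil }                 // assumption: ignore entries without a focus
    // The oldest surviving membership decides which SFU is used.
    let oldest = candidates.min(by: { $0.createdTs < $1.createdTs })
    return oldest?.serviceURL
}
```

When this returns `nil`, the caller would fall through to local discovery, which matches the "bootstrap path for the first joiner" described above.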
Plumb `RoomInfo.hasRoomCall` from the SDK through to the room toolbar.
When a call is in progress, the call button flips from icon-only to
icon + "Join Call" label, rendered in the app's accent color so the
state change is visible at a glance, and the confirmation dialog
re-words to "Join" instead of "Start".
- Add `hasRoomCall: Bool` to `RelayInterface.RoomSummary`
- Mirror `info.hasRoomCall` into the observable summary inside
`RoomListManager.applyRoomInfo`
- Branch the toolbar button label/foreground style and confirmation
copy on `currentRoom?.hasRoomCall`
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
subpop (Owner) reviewed on Jun 15, 2026 and left a comment:
A lot of additional debug logging; good stuff! But the refactor that automatically routes ActivityLog events to Console should allow you to unify a lot of the redundant lines (logger.log followed by activityLog.log).
I suspect the conflicts in this PR are a result of the changes made to unify logging.
Two related threads:
1. Unify logging on the Activity Log.
Drop every `logger.X("[RTC]…")` call in the four call-path files —
`LiveKitCredentialService`, `CallEncryptionService`, `CallWidgetBridge`,
and `CallViewModel`. Where an Activity Log entry already covered the
same information, the logger line was redundant; where only a logger
line existed, port the content into a new Activity Log entry first.
The Activity Log auto-routes events to Console, so anything previously
only visible there now lands in both Console and the exported
Activity Log JSON.
`CallEncryptionService.makeHKDFKeyProvider` now returns
`(provider, hkdfInstalled, fallbackReason)`, and `setRawKey` now
returns an optional failure reason string instead of os_log'ing
internally — both move responsibility for surfacing the outcome to
the caller, which has access to `activityLog` and the surrounding
call context.
`CallViewModel.startHeartbeat` now takes `activityLog` and `roomID`
so heartbeat refresh failures still land in the Activity Log without
a local Logger.
2. Fix federation E2EE key-distribution race.
When Relay is the first joiner and a federated peer arrives later,
LiveKit's `participantDidConnect` fires before the peer's
`m.call.member` event reaches the SDK. Our existing
`redistributeKey(to:)` then calls `fetchCallTargets`, which reads
`RoomInfo.activeRoomCallParticipants` — empty at that moment — and
exits without sending. The peer never receives our key, and Element
Call / Element X show our tile as black.
Add a second trigger that fires after the m.call.member event
actually arrives. `CallWidgetBridge` exposes a new
`onCallMemberStateChanged` callback invoked when the widget driver
delivers any inbound `org.matrix.msc3401.call.member` event;
`CallViewModel` wires it to `redistributeKeyOnMembershipChange()`,
which re-fetches targets and re-sends the key — but only when the
*user set* differs from the previous snapshot, so periodic
heartbeats from existing peers don't cause us to spam Olm-encrypted
to-device payloads.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
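The "only when the user set differs" guard from the second thread can be sketched as a small snapshot comparison. `KeyRedistributionGate` is a hypothetical name used for illustration; the branch's `redistributeKeyOnMembershipChange()` may track its snapshot differently.

```swift
/// Remembers the last set of call participants and reports whether a new
/// snapshot differs, so heartbeat refreshes from known peers are ignored.
struct KeyRedistributionGate {
    private var lastUsers: Set<String> = []

    /// Returns true (and records the new snapshot) only when the user set changed.
    mutating func shouldResend(currentUsers: Set<String>) -> Bool {
        guard currentUsers != lastUsers else { return false }
        lastUsers = currentUsers
        return true
    }
}
```

Note that this sketch also fires when a user leaves, since the set differs in that case too; whether a resend is wanted then is a design choice the commit message does not spell out.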
On poor connections the call window's "Joining call…" spinner can sit
unchanged for 20+ seconds while the homeserver round-trips a
`m.call.member` state event or LiveKit attaches media. The user has no
way to tell whether something is wrong or whether the network is just
slow.
Add a `connectingPhase: String?` to `CallViewModelProtocol` and update
it inside `CallViewModel.connect` as the connect path moves through
LiveKit attach, encryption prep, membership publish, key distribution,
and media start. The view layer reveals the phase label after a 300ms
delay (and only if the same phase is still current), so on a fast
network nothing flashes on screen, but a stalled step on bad wifi gets a
concrete description after a third of a second.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The troubleshooting guide is now shipped as a Help Book bundle inside
the app rather than as a markdown file in the repo. Replaces
docs/troubleshooting-calls.md with Relay/Resources/Relay.help, registers
the bundle via the new CFBundleHelpBookFolder and CFBundleHelpBookName
Info.plist keys, and adds a light system-font stylesheet that matches
macOS Help Viewer conventions in both light and dark mode.
Also drops the in-repo docs/internal/rtc-element-call-diff.md
engineering note (maintainer prefers agent-maintained notes stay out of
the tree).
Note: the Relay.help folder needs to be added to the Relay target in
Xcode as a Folder Reference inside Copy Bundle Resources so the build
copies it verbatim. The project.pbxproj edit that wires this up is not
included here.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ignment
# Conflicts:
# RelayKit/Call/CallEncryptionService.swift
# RelayKit/Call/CallViewModel.swift
# RelayKit/Call/CallWidgetBridge.swift
# RelayKit/Call/LiveKitCredentialService.swift
Follow-up to the upstream/main merge: the "LiveKit identity mismatch"
diagnostic block referenced `encryptionService.userID` directly, but
`encryptionService` is an optional on this branch. Wrap the local
identity construction in `encryptionService.map { ... }` and gate the
warning behind both the LiveKit identity and the Matrix-side identity
being present.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
subpop (Owner) reviewed on Jun 16, 2026 and left a comment:
Nice! The only thing I'd change in the Help Book is the bundle identifier so that it matches the main app's bundle ID. I'll give this a test run and see how it behaves too.
Co-authored-by: Link Dupont <subpop@users.noreply.github.com>
Co-authored-by: Link Dupont <subpop@users.noreply.github.com>
Summary
This branch audits Relay's MatrixRTC call path against the published MSCs
(MSC4143, MSC4195), Element Call's `livekit/openIDSFU.ts`,
matrix-js-sdk's `matrixrtc/` module, and lk-jwt-service's request
handling, then closes the gaps that were causing federated-call
failures, encrypted-call media stalls, and credential-exchange
dead-ends. It also adds runtime instrumentation so future field reports
show what's actually happening post-connect, and surfaces ongoing-call
state in the room toolbar.
Engineering reference for every deviation lived in
`docs/internal/rtc-element-call-diff.md` (since moved out of tree),
including a design note for the follow-on MSC4143 sticky-event
dual-publish work.
Notable fixes (in order on the branch)
drops, subscribe failures, publish events, and per-track E2EE state
transitions now write Activity Log entries. Field reports for
"connected but no media" were previously ending at `Connected to
call` with nothing actionable after.
silently disappears into the legacy fallback — Matrix `errcode` and
message are logged before the next attempt.
the current legacy-first path — see below).
MSC4143 transports endpoint are both unavailable, then promoted to the
primary discovery path so `focus_selection: oldest_membership` is
honored — without that, federated calls split across two SFUs and
media never reaches the other side.
REST, using `RoomInfo.activeRoomCallParticipants` and the
to-device `"*"` wildcard.
Relay onto lk-jwt-service v2, which assigns hashed LiveKit identities.
matrix-js-sdk's `CallMembership.parseFromEvent` reads our legacy
`org.matrix.msc3401.call.member` event under
`MembershipKind.Session` where the expected identity is plain
`${sender}:${device_id}` (not the hash). Peers couldn't reconcile
our LiveKit participant with our Matrix membership → black tiles.
Reverted to legacy-first; v2 plumbing stays for the day Relay also
publishes MSC4143 sticky `m.rtc.member` events.
a power-level-aware hint. The most common failure mode in the wild —
M_FORBIDDEN because a room's
`power_levels.events.org.matrix.msc3401.call.member` defaults to
`state_default` (50) — silently locked non-admin participants out
of E2EE call media until now.
`RoomInfo.hasRoomCall` into the toolbar button so an ongoing call
flips it from icon-only to "Join Call" in accent blue, with
"Join" copy in the confirmation dialog. Fixes #124 (Detect active call in a room).
Known follow-up
Tracked as a design note + task: MSC4143 sticky-event dual-publish.
Once Relay also publishes the sticky `m.rtc.member` event with a
stable UUID threaded through `/get_token`, the credential path can
flip back to v2-first and we drop the legacy-only dependency.
Test plan
(power-levels override is set at creation) — both ends see video
and audio.
room — both directions of video subscribe successfully.
`state_default` for `org.matrix.msc3401.call.member`. If the
Relay client has insufficient PL, the Activity Log shows the
power-level-aware error.
client on homeserver B joins. Both end up on homeserver A's SFU
(per `focus_selection: oldest_membership`) and exchange media.
in a call in the selected room; falls back to icon-only when no
ongoing call.
transitions, `didPublishTrack` (local and remote),
`didSubscribeTrack`, and clean disconnect.
🤖 Generated with Claude Code