Align Relay's MatrixRTC call path with Element Call interop #134

Open
rexbron wants to merge 17 commits into subpop:main from rexbron:rtc-element-call-alignment

@rexbron rexbron commented Jun 14, 2026

Contributor

Summary

This branch audits Relay's MatrixRTC call path against the published MSCs
(MSC4143, MSC4195), Element Call's livekit/openIDSFU.ts, matrix-js-sdk's
matrixrtc/ module, and lk-jwt-service's request handling, then closes the
gaps that were causing federated-call failures, encrypted-call media
stalls, and credential-exchange dead-ends. It also adds runtime
instrumentation so future field reports show what's actually happening
post-connect, and surfaces ongoing-call state in the room toolbar.

The engineering reference for every deviation, including a design note
for the follow-on MSC4143 sticky-event dual-publish work, originally
lived in docs/internal/rtc-element-call-diff.md. It has since been
moved out of tree.

Notable fixes (in order on the branch)

  • Post-connect Activity Log instrumentation. Disconnects, mid-call
    drops, subscribe failures, publish events, and per-track E2EE state
    transitions now write Activity Log entries. Field reports for
    "connected but no media" previously ended at `Connected to
    call` with nothing actionable afterward.
  • v2 `/get_token` error surfacing. A v2 rejection no longer
    silently disappears into the legacy fallback — Matrix `errcode` and
    message are logged before the next attempt.
  • `/get_token` slot_id + v2 identity helpers added (but inert on
    the current legacy-first path — see below).
  • Peer `foci_preferred` SFU discovery when `.well-known` and the
    MSC4143 transports endpoint are both unavailable, then promoted to the
    primary discovery path so `focus_selection: oldest_membership` is
    honored — without that, federated calls split across two SFUs and
    media never reaches the other side.
  • `fetchCallTargets` sourced from SDK room state instead of raw
    REST, using `RoomInfo.activeRoomCallParticipants` and the
    to-device `"*"` wildcard.
  • Legacy-first credential path restored. Adding `slot_id` flipped
    Relay onto lk-jwt-service v2, which assigns hashed LiveKit identities.
    matrix-js-sdk's `CallMembership.parseFromEvent` reads our legacy
    `org.matrix.msc3401.call.member` event under
    `MembershipKind.Session` where the expected identity is plain
    `${sender}:${device_id}` (not the hash). Peers couldn't reconcile
    our LiveKit participant with our Matrix membership → black tiles.
    Reverted to legacy-first; v2 plumbing stays for the day Relay also
    publishes MSC4143 sticky `m.rtc.member` events.
  • `m.call.member` send failures surfaced in the Activity Log with
    a power-level-aware hint. The most common failure mode in the wild —
    M_FORBIDDEN because a room's
    `power_levels.events.org.matrix.msc3401.call.member` defaults to
    `state_default` (50) — silently locked non-admin participants out
    of E2EE call media until now.
  • Ongoing-call state in the toolbar. Plumbs
    `RoomInfo.hasRoomCall` into the toolbar button so an ongoing call
    flips it from icon-only to "Join Call" in accent blue, with
    "Join" copy in the confirmation dialog. Fixes #124 (Detect active call in a room).
  • Initial Help Book. Converts the MatrixRTC troubleshooting guide into Help Book format. Starts to address #137 (Look into adding an in-app Help Book).

Known follow-up

Tracked as a design note + task: MSC4143 sticky-event dual-publish.
Once Relay also publishes the sticky `m.rtc.member` event with a
stable UUID threaded through `/get_token`, the credential path can
flip back to v2-first and we drop the legacy-only dependency.

Test plan

  • Encrypted call between two Relay clients in a Relay-created room
    (power-levels override is set at creation) — both ends see video
    and audio.
  • Encrypted call between Relay and Element X in a Relay-created
    room — both directions of video subscribe successfully.
  • Encrypted call in an Element-X-created room where sending
    `org.matrix.msc3401.call.member` requires a power level of at
    least `state_default`. If the Relay client's power level is too
    low, the Activity Log shows the power-level-aware error.
  • Federated call: Element X on homeserver A starts a call, Relay
    client on homeserver B joins. Both end up on homeserver A's SFU
    (per `focus_selection: oldest_membership`) and exchange media.
  • Toolbar shows "Join Call" in accent blue when another user is
    in a call in the selected room; falls back to icon-only when no
    ongoing call.
  • Activity Log on a successful call contains: connection state
    transitions, `didPublishTrack` (local and remote),
    `didSubscribeTrack`, and clean disconnect.

🤖 Generated with Claude Code

rexbron and others added 10 commits June 13, 2026 07:27
Document how affected users can capture Activity Log JSON exports
filtered to the Call category, what each diagnostic signal in the
export means (credential discovery, token exchange, key
distribution, frame routing), and how to file useful reports. Also
covers the unified-log fallback via `log stream` for harder cases
and notes which data is and isn't safe to share publicly.

Assisted-By: Claude Sonnet 4.7
Engineering reference under docs/internal/ that maps every Relay
call-path function to its Element Call / matrix-js-sdk counterpart.
Each entry cites the exact source line in the upstream reference and
the matching range in Relay, and flags deviations confirmed against
MSCs, source code, or real-world user traces. Acts as the worklog for
the rtc-element-call-alignment branch.

Assisted-By: Claude Sonnet 4.7
User trace 97853C31 showed an Activity Log that ended at "Connected
to call" with no further events: the trace stops the moment Relay
finished its setup checklist, even though the user's call kept
running (and failing) for minutes afterward. Without post-connect
events, every "connected but no media" or "call dropped silently"
report is undiagnosable from the JSON export alone.

Wire the LiveKit RoomDelegate callbacks that produce diagnostic
signals through to the Activity Log:

- didUpdateConnectionState .disconnected (with previous state)
- didFailToConnectWithError (SFU rejected the initial connect)
- didDisconnectWithError (mid-call drop with cause)
- didSubscribeTrack (now lands in the JSON, not just os_log)
- didFailToSubscribeTrack (firewall/NAT/codec failures)
- local didPublishTrack (proves our own media went out)
- didUpdateE2EEState (per-track cryptor failures, encrypted rooms)

The describe(_:) helper provides stable labels for LiveKit
ConnectionState values that won't drift if the enum gains cases.

Assisted-By: Claude Sonnet 4.7
The previous fall-back path used `try?` to swallow every v2 error
and silently retry against legacy `/sfu/get`. That hid two
diagnostic signals:

- The actual reason v2 rejected us (Matrix `errcode` + `error`).
- That the fall-back happened at all.

Replace `try?` with a `do/catch` that:

- Parses Matrix-style `{errcode, error}` envelopes from the response
  body so users see `M_BAD_JSON: The request body is missing 'room_id'
  or 'slot_id'` instead of generic `tokenExchangeFailed`.
- Logs the v2 failure to both os_log and the Activity Log before
  trying legacy, so the silent fall-back is visible after the fact.
- Carries the structured detail through a new
  `LiveKitCredentialError.tokenExchangeRejected` case that surfaces
  status, errcode, message, and which endpoint failed.

This unblocks self-diagnosis for the upcoming `slot_id` and v2
identity fixes — once those land, this logging will confirm v2 is
healthy without requiring users to hand-curate os_log traces.
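
The envelope parsing described above could look roughly like the following. This is a minimal sketch, not the code on the branch: `MatrixErrorEnvelope`, `TokenExchangeError`, `makeRejection`, and `logLine(for:)` are hypothetical names, and the real `LiveKitCredentialError.tokenExchangeRejected` case may carry different associated values.

```swift
import Foundation

// Matrix-style error envelope: {"errcode": "...", "error": "..."}.
struct MatrixErrorEnvelope: Decodable {
    let errcode: String
    let error: String
}

// Hypothetical stand-in for LiveKitCredentialError.tokenExchangeRejected;
// the real case's associated values are not shown in this PR.
enum TokenExchangeError: Error, Equatable {
    case rejected(status: Int, errcode: String?, message: String?, endpoint: String)
}

/// Builds a structured rejection from an HTTP status and response body.
/// errcode/message stay nil when the body is not a Matrix envelope.
func makeRejection(status: Int, body: Data, endpoint: String) -> TokenExchangeError {
    let envelope = try? JSONDecoder().decode(MatrixErrorEnvelope.self, from: body)
    return .rejected(status: status,
                     errcode: envelope?.errcode,
                     message: envelope?.error,
                     endpoint: endpoint)
}

/// One-line description suitable for a log entry.
func logLine(for error: TokenExchangeError) -> String {
    switch error {
    case let .rejected(status, errcode, message, endpoint):
        let detail = [errcode, message].compactMap { $0 }.joined(separator: ": ")
        let origin = "HTTP \(status) from \(endpoint)"
        return detail.isEmpty ? origin : "\(detail) (\(origin))"
    }
}
```

Keeping the status and endpoint in the error even when the body is not JSON means a proxy error page still produces a usable log line.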

Assisted-By: Claude Sonnet 4.7
Two coupled changes that must ship together. Shipping only the first
moves users from "fails to connect" to "connects but no media."

Item 1 — slot_id
================

The v2 endpoint in lk-jwt-service requires `slot_id` in the request
body; `SFURequest.Validate()` returns HTTP 400 M_BAD_JSON ("missing
'room_id' or 'slot_id'") if absent. Relay never sent it, so the v2
attempt failed and we silently fell back to /sfu/get. Hardcode
"m.call#ROOM" to match Element Call's
`getLiveKitJWTWithDelayDelegation`.

Item 2 — v2 LiveKit identity routing
====================================

On v2, lk-jwt-service issues a JWT whose `sub` claim is
`unpaddedBase64(sha256(json_marshal([matrixID, claimedDeviceID,
memberID])))` (per `helper.go::LiveKitIdentityFor`). LiveKit uses that
as the participant identity. On legacy, it's `<user>:<server>:<device>`.

The frame cryptor routes frames to remote peers' decoders by exact
string match on identity, so keying the cryptor under the legacy shape
silently breaks every frame on v2. Four sites needed updating:

- `CallEncryptionService.liveKitIdentity(matrixID:claimedDeviceID:
  memberID:)` ports the lk-jwt-service algorithm. Inputs are all ASCII
  (Matrix IDs, device IDs, UUIDs), so JSONSerialization is byte-
  identical to Go's `json.Marshal` and the resulting hash matches.

- `CallViewModel.connect` now keys the local cryptor under
  `room.localParticipant.identity?.stringValue` instead of
  `<userID>:<deviceID>`. That's the JWT sub claim regardless of v1/v2.
  If LiveKit hasn't assigned an identity (shouldn't happen), the
  connect now fails fast with `CallViewModelError.
  missingLocalParticipantIdentity` rather than silently misrouting.

- `CallWidgetBridge.handleIncomingToDevice` registers each inbound
  encryption key under every plausible LiveKit identity for that peer
  — both the legacy `<sender>:<device>` and the v2 hash. The cryptor
  ring stores per-(participantId, index) entries and matches by
  identity, so registering both is safe and works against peers
  regardless of which endpoint they took.

- `CallViewModel.redistributeKey` previously parsed the LiveKit
  identity by `:` to recover (userId, deviceId). On v2 the identity
  has no colons and the parse fails, so new peers never got our key.
  Drop the parse entirely; re-fetch `m.call.member` state and
  broadcast to all current targets. Matches Element Call's
  `RTCEncryptionManager` behaviour on membership change.
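
Registering one inbound key under several candidate identities can be modelled as below. This is a simplified sketch: `KeySlot` and `KeyRing` are hypothetical types, not LiveKit's key provider API, and the v2 hashed identity is assumed to be computed elsewhere and passed in as a plain string.

```swift
// Hypothetical model of a cryptor ring keyed by (participant identity, key index).
struct KeySlot: Hashable {
    let participantID: String
    let index: Int
}

struct KeyRing {
    private var slots: [KeySlot: [UInt8]] = [:]

    init() {}

    /// Stores the same key under every plausible identity for one peer.
    mutating func register(key: [UInt8], index: Int, identities: [String]) {
        for identity in identities {
            slots[KeySlot(participantID: identity, index: index)] = key
        }
    }

    /// Looks a key up by exact identity match, as the frame cryptor does.
    func key(for participantID: String, index: Int) -> [UInt8]? {
        slots[KeySlot(participantID: participantID, index: index)]
    }
}
```

Because lookup is an exact match on identity, the unused entry is never hit. It only costs one extra dictionary slot per peer.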

Assisted-By: Claude Sonnet 4.7
When the homeserver advertises no MatrixRTC SFU (neither the unstable
transports endpoint nor `.well-known org.matrix.msc4143.rtc_foci` is
configured) but a call is already in progress in the room, walk the
existing `m.call.member` state events and pick a peer's
`foci_preferred[0].livekit_service_url`.

Matches Element Call / matrix-js-sdk's third-fallback discovery
behaviour. Previously Relay would throw `sfuURLNotFound` and refuse
to join in this scenario, even though the SFU the existing
participants were using was right there in room state.

`discoverSFUURL` now takes the roomID so it can issue the
`/rooms/{id}/state` request; the previous parameterless form is gone.

Assisted-By: Claude Sonnet 4.7
Replace the raw-REST `/rooms/{id}/state` walk with
`RoomInfo.activeRoomCallParticipants` from the Rust SDK. The SDK list
is user-level only (no device IDs), so each user's device list becomes
`["*"]` — the to-device wildcard — and the SDK fans the Olm-encrypted
key payload out to all of that user's devices.

Matches `matrix-js-sdk/src/matrixrtc/ToDeviceKeyTransport.ts`. Some
warmed-up Olm sessions go to devices that aren't in the call, but the
AES key is per-call and only consumed by a LiveKit cryptor that
expects it — so the extra sessions are wasted, not unsafe. Element
Call accepts the same trade-off.

Removes the delegated-homeserver URL risk and the bespoke state-key
parser (`_<userId>_<deviceId>_m.call`) that filtered out any peer
using a non-Element-X key shape.
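
A minimal sketch of the target construction, assuming the SDK hands back a plain list of user IDs. `callTargets(forParticipants:)` is a hypothetical helper name, not the real `fetchCallTargets` signature.

```swift
/// Maps each active call participant to the to-device wildcard device list.
/// Duplicate user IDs collapse into a single entry.
func callTargets(forParticipants userIDs: [String]) -> [String: [String]] {
    var targets: [String: [String]] = [:]
    for userID in userIDs {
        targets[userID] = ["*"]  // "*" = every device of this user
    }
    return targets
}
```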

Assisted-By: Claude Sonnet 4.7
Adding `slot_id` flipped Relay onto lk-jwt-service's v2 /get_token,
which assigns LiveKit identities as `unpadded_base64(sha256(...))`. But
matrix-js-sdk's `CallMembership.parseFromEvent` reads our legacy
`org.matrix.msc3401.call.member` event under `MembershipKind.Session`,
where `rtcBackendIdentity` is hardcoded to `${sender}:${device_id}` —
the plain-concat form, not hashed. Peers running Element Call /
Element X / Element Web looked for that colon identity, never found us
on LiveKit, and dropped our video.

Invert the order in `fetchLiveKitToken`: try legacy first, fall forward
to v2 only when legacy fails. v2 path stays plumbed for when Relay also
publishes MSC4143 sticky `m.rtc.member` events (tracked separately).
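
The inverted order can be sketched as follows. This is an illustration under assumptions: `Credentials`, `FetchError`, and the closure-based `fetchToken` are hypothetical, and the real `fetchLiveKitToken` is presumably async and talks to the network.

```swift
// Hypothetical credential type; the real one is not shown in this PR.
struct Credentials: Equatable {
    let jwt: String
    let endpoint: String
}

enum FetchError: Error {
    case failed(String)
}

/// Tries the legacy endpoint first and falls forward to v2 only on failure.
/// The legacy failure is logged before the v2 attempt so it stays visible.
func fetchToken(
    legacy: () throws -> Credentials,
    v2: () throws -> Credentials,
    log: (String) -> Void
) throws -> Credentials {
    do {
        return try legacy()
    } catch {
        log("legacy /sfu/get failed: \(error); trying v2 /get_token")
        return try v2()
    }
}
```

If both attempts fail, the v2 error propagates to the caller, and the legacy failure has already been logged.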

Also surface `m.call.member` send failures in the Activity Log with a
power-level-aware hint. The most common failure shape — M_FORBIDDEN
from rooms whose `power_levels.events.org.matrix.msc3401.call.member`
defaults to `state_default` (50) instead of the override Relay sets at
room creation — silently locks non-admin participants out of E2EE call
media (no membership event in room state → peers don't send keys →
black tiles).
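
The power-level check behind the hint can be sketched like this. `PowerLevels` and `callMemberHint` are hypothetical names. The defaults (50 for `state_default`, 0 for `users_default`) follow the Matrix spec for a room that has a power-levels event.

```swift
// Simplified view of m.room.power_levels content (hypothetical type).
struct PowerLevels {
    var events: [String: Int] = [:]
    var stateDefault: Int = 50
    var users: [String: Int] = [:]
    var usersDefault: Int = 0
}

let callMemberEventType = "org.matrix.msc3401.call.member"

/// Returns a hint when the user cannot send the call-member state event,
/// or nil when their power level is sufficient.
func callMemberHint(levels: PowerLevels, userID: String) -> String? {
    let required = levels.events[callMemberEventType] ?? levels.stateDefault
    let actual = levels.users[userID] ?? levels.usersDefault
    guard actual < required else { return nil }
    return "Sending \(callMemberEventType) requires power level \(required), but \(userID) has \(actual)."
}
```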

And rename track-kind logs from `publication.kind.rawValue` (integer)
to a named form (`audio`/`video`/`none`), and start writing remote
`didPublishTrack` events to the Activity Log.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`discoverSFUURL` previously preferred local SFU discovery (MSC4143
transports endpoint / `.well-known`) and only consulted peers'
`foci_preferred` as a last-resort fallback. That breaks federated
calls: when another homeserver's user starts the call first, they
advertise their SFU under `focus_active.focus_selection ==
"oldest_membership"`, but a later-joining Relay client would ignore
that and connect to its own homeserver's SFU instead — splitting the
call across two SFUs and stranding both sides on "waiting for media".

Reorder discovery so the existing-call SFU wins. Refactor
`fetchSFUFromCallMembers` to pick the *oldest* surviving membership
rather than the first one found, and to drop expired
(`created_ts + expires < now`) and tombstoned (empty content)
entries so a stale leftover can't outvote the live participants.
Local discovery stays as the bootstrap path for the first joiner.
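
The selection rule can be sketched as below. `CallMembership` and `oldestLiveSFU` are hypothetical names used for illustration. A `nil` service URL stands in for a tombstoned (empty-content) event, and timestamps are assumed to be milliseconds.

```swift
// Hypothetical, flattened view of one m.call.member state event.
struct CallMembership {
    let createdTs: Int64      // created_ts, ms since epoch
    let expires: Int64        // lifetime in ms
    let serviceURL: String?   // nil models a tombstoned entry
}

/// Picks the SFU advertised by the oldest membership that is neither
/// expired (created_ts + expires < now) nor tombstoned.
func oldestLiveSFU(in memberships: [CallMembership], now: Int64) -> String? {
    let live = memberships.filter { membership in
        membership.serviceURL != nil && membership.createdTs + membership.expires >= now
    }
    let oldest = live.min { $0.createdTs < $1.createdTs }
    return oldest?.serviceURL
}
```

A `nil` result means no live call was found, which is when local discovery would take over for the first joiner.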

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Plumb `RoomInfo.hasRoomCall` from the SDK through to the room toolbar.
When a call is in progress, the call button flips from icon-only to
icon + "Join Call" label, rendered in the app's accent color so the
state change is visible at a glance, and the confirmation dialog
re-words to "Join" instead of "Start".

- Add `hasRoomCall: Bool` to `RelayInterface.RoomSummary`
- Mirror `info.hasRoomCall` into the observable summary inside
  `RoomListManager.applyRoomInfo`
- Branch the toolbar button label/foreground style and confirmation
  copy on `currentRoom?.hasRoomCall`
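
The branching can be sketched without any UI framework as follows. `CallButtonAppearance` and `callButtonAppearance(hasRoomCall:)` are hypothetical names. The real toolbar code presumably applies these values to a view.

```swift
// Hypothetical value type describing how the call button should render.
struct CallButtonAppearance: Equatable {
    let title: String?        // nil means icon-only
    let usesAccentColor: Bool
    let confirmVerb: String   // wording used in the confirmation dialog
}

/// Derives the button appearance from `currentRoom?.hasRoomCall`.
/// A missing room is treated the same as "no ongoing call".
func callButtonAppearance(hasRoomCall: Bool?) -> CallButtonAppearance {
    if hasRoomCall ?? false {
        return CallButtonAppearance(title: "Join Call", usesAccentColor: true, confirmVerb: "Join")
    }
    return CallButtonAppearance(title: nil, usesAccentColor: false, confirmVerb: "Start")
}
```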

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@subpop subpop left a comment

Owner

A lot of additional debug logging; good stuff! But the refactor that automatically routes ActivityLog events to Console should allow you to unify a lot of the redundant lines (logger.log followed by activityLog.log).

I suspect the conflicts in this PR are a result of the changes made to unify logging.

Comment thread docs/troubleshooting-calls.md Outdated
Comment thread docs/internal/rtc-element-call-diff.md Outdated
Comment thread docs/internal/rtc-element-call-diff.md Outdated
Comment thread RelayKit/Call/CallViewModel.swift Outdated
rexbron and others added 5 commits June 15, 2026 09:15
Two related threads:

1. Unify logging on the Activity Log.

   Drop every `logger.X("[RTC]…")` call in the four call-path files —
   `LiveKitCredentialService`, `CallEncryptionService`, `CallWidgetBridge`,
   and `CallViewModel`. Where an Activity Log entry already covered the
   same information, the logger line was redundant; where only a logger
   line existed, port the content into a new Activity Log entry first.
   The Activity Log auto-routes events to Console, so anything previously
   only visible there now lands in both Console and the exported
   Activity Log JSON.

   `CallEncryptionService.makeHKDFKeyProvider` now returns
   `(provider, hkdfInstalled, fallbackReason)`, and `setRawKey` now
   returns an optional failure reason string instead of os_log'ing
   internally — both move responsibility for surfacing the outcome to
   the caller, which has access to `activityLog` and the surrounding
   call context.

   `CallViewModel.startHeartbeat` now takes `activityLog` and `roomID`
   so heartbeat refresh failures still land in the Activity Log without
   a local Logger.

2. Fix federation E2EE key-distribution race.

   When Relay is the first joiner and a federated peer arrives later,
   LiveKit's `participantDidConnect` fires before the peer's
   `m.call.member` event reaches the SDK. Our existing
   `redistributeKey(to:)` then calls `fetchCallTargets`, which reads
   `RoomInfo.activeRoomCallParticipants` — empty at that moment — and
   exits without sending. The peer never receives our key, and Element
   Call / Element X show our tile as black.

   Add a second trigger that fires after the m.call.member event
   actually arrives. `CallWidgetBridge` exposes a new
   `onCallMemberStateChanged` callback invoked when the widget driver
   delivers any inbound `org.matrix.msc3401.call.member` event;
   `CallViewModel` wires it to `redistributeKeyOnMembershipChange()`,
   which re-fetches targets and re-sends the key — but only when the
   *user set* differs from the previous snapshot, so periodic
   heartbeats from existing peers don't cause us to spam Olm-encrypted
   to-device payloads.
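
The user-set guard can be sketched as below. `KeyRedistributionGate` is a hypothetical helper, not a type from the branch. It only shows the snapshot comparison, not the actual key send.

```swift
/// Remembers the last set of call participants and reports whether a new
/// snapshot differs, so heartbeat re-sends of m.call.member are ignored.
final class KeyRedistributionGate {
    private var lastUsers: Set<String> = []

    /// Returns true (and updates the snapshot) only when the user set changed.
    func shouldRedistribute(currentUsers: [String]) -> Bool {
        let current = Set(currentUsers)
        guard current != lastUsers else { return false }
        lastUsers = current
        return true
    }
}
```

Comparing sets rather than arrays means ordering changes and duplicate device entries for the same user do not trigger a resend.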

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
On poor connections the call window's "Joining call…" spinner can sit
unchanged for 20+ seconds while the homeserver round-trips a
`m.call.member` state event or LiveKit attaches media. The user has no
way to tell whether something is wrong or whether the network is just
slow.

Add a `connectingPhase: String?` to `CallViewModelProtocol` and update
it inside `CallViewModel.connect` as the connect path moves through
LiveKit attach, encryption prep, membership publish, key distribution,
and media start. The view layer reveals the phase label after a 300ms
delay (and only if the same phase is still current), so on a fast
network nothing flashes on screen — but a stalled step on bad wifi
gets a concrete description after a third of a second.
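
The delayed-reveal rule can be modelled deterministically as below. `PhaseTracker` is a hypothetical type that takes explicit timestamps in place of a view-layer timer, so it illustrates the rule and is not the view code itself.

```swift
import Foundation

/// Tracks the current connect phase and when it started, and exposes the
/// label only once the same phase has been current for `revealDelay`.
struct PhaseTracker {
    private(set) var phase: String?
    private var since: TimeInterval = 0
    let revealDelay: TimeInterval = 0.3

    init() {}

    /// Records a phase change; repeated updates with the same phase keep
    /// the original start time.
    mutating func update(_ newPhase: String?, at now: TimeInterval) {
        guard newPhase != phase else { return }
        phase = newPhase
        since = now
    }

    /// Returns the label to show at `now`, or nil if it should stay hidden.
    func visibleLabel(at now: TimeInterval) -> String? {
        guard let phase = phase, now - since >= revealDelay else { return nil }
        return phase
    }
}
```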

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The troubleshooting guide is now shipped as a Help Book bundle inside
the app rather than as a markdown file in the repo. Replaces
docs/troubleshooting-calls.md with Relay/Resources/Relay.help,
registers the bundle via the new CFBundleHelpBookFolder and
CFBundleHelpBookName Info.plist keys, and adds a light system-font
stylesheet that matches macOS Help Viewer conventions in both light
and dark mode.

Also drops the in-repo docs/internal/rtc-element-call-diff.md
engineering note (maintainer prefers agent-maintained notes stay
out of the tree).

Note: the Relay.help folder needs to be added to the Relay target
in Xcode as a Folder Reference inside Copy Bundle Resources so the
build copies it verbatim. The project.pbxproj edit that wires this
up is not included here.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ignment

# Conflicts:
#	RelayKit/Call/CallEncryptionService.swift
#	RelayKit/Call/CallViewModel.swift
#	RelayKit/Call/CallWidgetBridge.swift
#	RelayKit/Call/LiveKitCredentialService.swift
Follow-up to the upstream/main merge: the "LiveKit identity mismatch"
diagnostic block referenced `encryptionService.userID` directly, but
`encryptionService` is an optional on this branch. Wrap the local
identity construction in `encryptionService.map { ... }` and gate the
warning behind both the LiveKit identity and the Matrix-side identity
being present.
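
The optional handling can be sketched as follows. `EncryptionService` here is a hypothetical two-field stand-in (the `deviceID` property is an assumption), and `identityMismatchWarning` is an illustrative helper name.

```swift
// Hypothetical stand-in for the optional encryption service.
struct EncryptionService {
    let userID: String
    let deviceID: String
}

/// Builds the Matrix-side identity only when the service exists, and warns
/// only when both identities are present and differ.
func identityMismatchWarning(liveKitIdentity: String?,
                             encryptionService: EncryptionService?) -> String? {
    let matrixIdentity = encryptionService.map { "\($0.userID):\($0.deviceID)" }
    guard let liveKit = liveKitIdentity,
          let matrix = matrixIdentity,
          liveKit != matrix else { return nil }
    return "LiveKit identity mismatch: \(liveKit) vs \(matrix)"
}
```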

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@subpop subpop left a comment

Owner

Nice! The only thing I'd change in the Help Book is the bundle identifier so that it matches the main app's bundle ID. I'll give this a test run and see how it behaves too.

Comment thread Relay/Resources/Relay.help/Contents/Info.plist Outdated
Comment thread Relay/Info.plist Outdated
rexbron and others added 2 commits June 16, 2026 07:27
Co-authored-by: Link Dupont <subpop@users.noreply.github.com>
Co-authored-by: Link Dupont <subpop@users.noreply.github.com>
Development

Successfully merging this pull request may close these issues.

Detect active call in a room

2 participants