Custom STT bridge ignores per-segment is_user/person_id and collapses speakers to Speaker 1 #7982

@omi-discord-vector

Summary

With custom_stt=enabled, a self-hosted custom STT service can return per-segment speaker data that already matches TranscriptSegment, including speaker, speaker_id, is_user, and person_id. However, the app-side bridge does not preserve that information end-to-end.

Current behavior

  • The backend correctly bypasses speech-profile speaker identification in custom STT mode.
  • However, the app-side custom STT bridge appears to:
    • force is_user: false for forwarded segments,
    • ignore upstream person_id,
    • derive speaker identity only from the configured schema speaker field,
    • and effectively collapse all segments to speaker_id = 0 / Speaker 1 when the response schema does not explicitly map the speaker field.
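
To make the reported symptoms concrete, here is a small Python model of the behavior described above. It is an illustration only and is not taken from the app's source: the function name forward_segment_as_reported, the speaker_field parameter, and the way a numeric ID is parsed from the speaker label are all assumptions.

```python
from typing import Any, Optional


def forward_segment_as_reported(
    raw: dict[str, Any], speaker_field: Optional[str] = None
) -> dict[str, Any]:
    """Hypothetical model of the reported bridge behavior (not the app's code)."""
    speaker_id = 0  # fallback when the schema does not map a speaker field
    if speaker_field and speaker_field in raw:
        # Assumed parsing: take the digits of a label such as "SPEAKER_01".
        digits = "".join(ch for ch in str(raw[speaker_field]) if ch.isdigit())
        speaker_id = int(digits) if digits else 0
    return {
        "text": raw.get("text", ""),
        "start": raw.get("start", 0.0),
        "end": raw.get("end", 0.0),
        "speaker": f"SPEAKER_{speaker_id:02d}",
        "speaker_id": speaker_id,
        "is_user": False,   # forced; the upstream value is dropped
        "person_id": None,  # the upstream value is ignored
    }
```

Under this model, two segments from different speakers both come out with speaker_id 0 and is_user false whenever no speaker field is mapped, which matches the "Speaker 1" collapse described in this issue.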

Expected behavior

For custom STT, if the provider already returns segment objects compatible with TranscriptSegment, the bridge should preserve and forward:

  • speaker
  • speaker_id
  • is_user
  • person_id
  • start
  • end
  • text
  • translations (if present)
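
A minimal sketch of the pass-through requested above, assuming the provider's segment is already a dict with TranscriptSegment-compatible keys. The names PRESERVED_FIELDS and forward_segment_preserving are hypothetical and do not refer to existing app code.

```python
from typing import Any

# Fields the bridge is expected to forward unchanged when present.
PRESERVED_FIELDS = (
    "speaker", "speaker_id", "is_user", "person_id", "start", "end", "text",
)


def forward_segment_preserving(raw: dict[str, Any]) -> dict[str, Any]:
    """Copy the listed fields as-is; never overwrite is_user or person_id."""
    out = {name: raw[name] for name in PRESERVED_FIELDS if name in raw}
    if "translations" in raw:  # optional field, forwarded only if present
        out["translations"] = raw["translations"]
    return out
```

Provider-specific keys outside this list, such as a segment "id", are dropped in this sketch; whether the bridge should keep them is a separate decision.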

Additionally, the response schema should support fields such as:

  • segments_is_user_field
  • segments_person_id_field
  • ideally explicit support for segments_speaker_id_field when speaker label and numeric speaker ID are both present
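
The proposed schema fields could work roughly as follows when a provider uses its own key names. This is a sketch under assumptions: the ResponseSchema class, the segments_speaker_field name, and the map_speaker_fields helper are hypothetical; only segments_is_user_field, segments_person_id_field, and segments_speaker_id_field come from the proposal above.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ResponseSchema:
    """Hypothetical schema: each value names the provider key to read."""
    segments_speaker_field: Optional[str] = None      # assumed existing mapping
    segments_speaker_id_field: Optional[str] = None   # proposed
    segments_is_user_field: Optional[str] = None      # proposed
    segments_person_id_field: Optional[str] = None    # proposed


def map_speaker_fields(raw: dict[str, Any], schema: ResponseSchema) -> dict[str, Any]:
    """Read speaker-related values through the schema, with neutral defaults."""
    def pick(field: Optional[str], default: Any) -> Any:
        return raw.get(field, default) if field else default

    return {
        "speaker": pick(schema.segments_speaker_field, None),
        "speaker_id": pick(schema.segments_speaker_id_field, None),
        "is_user": bool(pick(schema.segments_is_user_field, False)),
        "person_id": pick(schema.segments_person_id_field, None),
    }
```

For example, a provider that reports the device owner under an "owner" key would set segments_is_user_field="owner"; unmapped fields fall back to defaults instead of being guessed.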

Example provider response

{
  "segments": [
    {
      "id": "0",
      "text": "...",
      "speaker": "SPEAKER_00",
      "speaker_id": 0,
      "is_user": true,
      "person_id": null,
      "start": 0.0,
      "end": 8.0,
      "translations": []
    },
    {
      "id": "1",
      "text": "...",
      "speaker": "SPEAKER_01",
      "speaker_id": 1,
      "is_user": false,
      "person_id": null,
      "start": 8.0,
      "end": 14.0,
      "translations": []
    }
  ]
}

Repro

  1. Enable custom STT (custom_stt=enabled).
  2. Use a self-hosted transcription service that returns distinct per-segment speakers and a correct boolean is_user value.
  3. Start a conversation with at least two speakers.
  4. Observe that the app shows every segment as Speaker 1, and the device owner is not rendered as the user.
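
When reproducing, it can help to confirm that the provider output really carries distinct speakers before comparing it with what the app shows. The helper below is a hypothetical sanity check, not part of the app or backend; it assumes the provider response is a JSON string with a top-level "segments" list shaped like the example above.

```python
import json
from typing import Any


def summarize_speakers(payload: str) -> dict[str, Any]:
    """Summarize speaker data in a provider response (JSON string)."""
    segments = json.loads(payload).get("segments", [])
    speaker_ids = {
        seg["speaker_id"] for seg in segments if seg.get("speaker_id") is not None
    }
    return {
        "speaker_ids": sorted(speaker_ids),
        # Count only segments explicitly marked as the device owner.
        "user_segments": sum(1 for seg in segments if seg.get("is_user") is True),
        "total_segments": len(segments),
    }
```

If the summary reports more than one speaker ID and at least one user segment while the app still renders every segment as Speaker 1, the information is being lost after the provider, which points at the bridge.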

Notes

The reporter is happy to test a fix and can also send a PR.

Metadata

Labels: capture (Layer: Audio recording, device pairing, BLE), p2 (Priority: Important (score 14-21))
