Summary
When using custom_stt=enabled, a self-hosted custom STT service can return per-segment speaker data that already matches TranscriptSegment, including speaker, speaker_id, is_user, and person_id. But the app bridge currently does not preserve that information end-to-end.
Current behavior
- The backend correctly bypasses speech-profile speaker identification in custom STT mode.
- However, the app-side custom STT bridge appears to:
  - force is_user: false for forwarded segments,
  - ignore upstream person_id,
  - derive speaker identity only from the configured schema speaker field,
  - and effectively collapse all segments to speaker_id = 0 / Speaker 1 when the response schema does not explicitly map the speaker field.
Expected behavior
For custom STT, if the provider already returns segment objects compatible with TranscriptSegment, the bridge should preserve and forward:
- speaker
- speaker_id
- is_user
- person_id
- start
- end
- text
- translations (if present)
Additionally, the response schema should support fields such as:
- segments_is_user_field
- segments_person_id_field
- ideally explicit support for segments_speaker_id_field when speaker label and numeric speaker ID are both present
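The mapping requested above could look roughly like the following Python sketch. It is an illustration only, not the app's actual bridge code: the function names, the schema dictionary layout, and segments_speaker_field are assumptions, and only the three proposed field names come from this issue.

```python
from typing import Any, Optional

# Hypothetical schema: maps TranscriptSegment attributes to provider keys.
# segments_is_user_field, segments_person_id_field and
# segments_speaker_id_field are the names proposed in this issue;
# segments_speaker_field is an assumed name for the existing speaker mapping.
DEFAULT_SCHEMA = {
    "segments_speaker_field": "speaker",
    "segments_speaker_id_field": "speaker_id",
    "segments_is_user_field": "is_user",
    "segments_person_id_field": "person_id",
}


def speaker_id_from_label(label: Optional[str]) -> int:
    """Fallback: parse the trailing digits of a label such as 'SPEAKER_01'."""
    digits = ""
    for ch in reversed(label or ""):
        if not ch.isdigit():
            break
        digits = ch + digits
    return int(digits) if digits else 0


def map_segment(raw: dict[str, Any],
                schema: dict[str, str] = DEFAULT_SCHEMA) -> dict[str, Any]:
    """Forward a provider segment without discarding speaker metadata."""
    speaker = raw.get(schema["segments_speaker_field"])
    speaker_id = raw.get(schema["segments_speaker_id_field"])
    if speaker_id is None:
        # Only derive the ID from the label when the provider did not send one.
        speaker_id = speaker_id_from_label(speaker)
    speaker_id = int(speaker_id)
    return {
        "text": raw.get("text", ""),
        "speaker": speaker if speaker is not None else f"SPEAKER_{speaker_id:02d}",
        "speaker_id": speaker_id,
        # Preserve the upstream flag instead of forcing False.
        "is_user": bool(raw.get(schema["segments_is_user_field"], False)),
        "person_id": raw.get(schema["segments_person_id_field"]),
        "start": raw.get("start", 0.0),
        "end": raw.get("end", 0.0),
        "translations": raw.get("translations", []),
    }
```

With this shape, a provider that already sends speaker_id, is_user, and person_id gets them forwarded unchanged, while a provider that only sends a label still ends up with distinct speaker IDs instead of everything collapsing to 0.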
Example provider response
{
  "segments": [
    {
      "id": "0",
      "text": "...",
      "speaker": "SPEAKER_00",
      "speaker_id": 0,
      "is_user": true,
      "person_id": null,
      "start": 0.0,
      "end": 8.0,
      "translations": []
    },
    {
      "id": "1",
      "text": "...",
      "speaker": "SPEAKER_01",
      "speaker_id": 1,
      "is_user": false,
      "person_id": null,
      "start": 8.0,
      "end": 14.0,
      "translations": []
    }
  ]
}
Repro
- Enable custom STT (custom_stt=enabled).
- Use a self-hosted transcription service that returns distinct per-segment speakers and correct boolean is_user.
- Start a conversation with at least two speakers.
- Observe that the app shows every segment as Speaker 1, and the device owner is not rendered as the user.
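For anyone reproducing this without a real transcription backend, a stub server along these lines may be enough. It is a minimal sketch using only the Python standard library: it ignores the uploaded audio and always returns the example response from this issue. The request format the app actually sends, the URL path, and the port are assumptions, and StubSTTHandler and make_server are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Canned response mirroring the example provider response in this issue.
RESPONSE = {
    "segments": [
        {"id": "0", "text": "...", "speaker": "SPEAKER_00", "speaker_id": 0,
         "is_user": True, "person_id": None, "start": 0.0, "end": 8.0,
         "translations": []},
        {"id": "1", "text": "...", "speaker": "SPEAKER_01", "speaker_id": 1,
         "is_user": False, "person_id": None, "start": 8.0, "end": 14.0,
         "translations": []},
    ]
}


class StubSTTHandler(BaseHTTPRequestHandler):
    """Answers every POST with the canned two-speaker response."""

    def do_POST(self) -> None:
        # Read and discard the request body (assumed to be the audio upload).
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        body = json.dumps(RESPONSE).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args) -> None:
        # Silence per-request logging to keep the console readable.
        pass


def make_server(host: str = "127.0.0.1", port: int = 8080) -> HTTPServer:
    """Build the stub server; the caller decides when to start serving."""
    return HTTPServer((host, port), StubSTTHandler)
```

Calling make_server().serve_forever() starts the stub. To reach it from a phone, the host would need to be an address the device can see rather than 127.0.0.1. If the app still shows every segment as Speaker 1 against this response, the metadata is being lost in the bridge rather than in the provider.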
Notes
The reporter is happy to test a fix and can also send a PR.