Describe the bug
Offline WAL uploads currently preserve 60-second transfer chunks and then run pre-recorded STT on VAD-derived segments, instead of reconstructing a longer continuous recording/session before ASR. This can make offline transcription substantially worse than live /v4/listen, especially for short utterances, multi-speaker context, language detection, and users relying on specific transcription settings.
The current backend does merge timestamp-adjacent transcript segments into a conversation after STT, but the STT request itself is still performed per VAD segment/window. That means the recognizer does not get the same continuous context as live streaming capture.
To Reproduce
Steps to reproduce the behavior:
- On iOS, record a continuous conversation long enough to create multiple offline WAL files.
- Let the app sync the offline recordings through /v2/sync-local-files.
- Compare the generated transcript against a live /v4/listen capture of similar audio with the same user language/transcription settings.
- Observe lower offline recognition quality and fragmented results, especially around short chunks, language-sensitive audio, or custom-STT users.
Current behavior
- iOS WAL chunking uses 60-second transfer chunks (chunkSizeInSeconds = 60, sdcardChunkSizeSecs = 60).
- uploadLocalFilesV2() sends only files and optional conversation_id; it does not include a recording manifest or the live-session settings snapshot (language, stt_service, conversation_timeout, custom STT mode, etc.).
- Backend /v2/sync-local-files saves raw files and runs decode -> VAD -> STT -> LLM in an async job.
- VAD segmentation happens per uploaded WAV, and process_segment() calls pre-recorded STT on each resulting segment.
- The backend later appends timestamp-adjacent transcript segments into the nearest conversation and reprocesses summaries, but this happens after ASR and cannot recover recognition quality lost from short ASR windows.
- Offline standard Omi STT does read current server-side transcription preferences (language, single_language_mode, vocabulary), but it does not use the exact settings snapshot used by the client at recording time.
- Custom STT users are explicitly not equivalent offline: live mode forwards third-party results as suggested_transcript, while offline manual sync uses Omi server STT with a confirmation prompt.
- Limitless ZIP import defaults to language=en unless the caller passes a language; the app currently calls it without passing the user's language.
Relevant code pointers:
- app/lib/services/wals/wal.dart: chunkSizeInSeconds = 60, sdcardChunkSizeSecs = 60
- app/lib/backend/http/api/conversations.dart: uploadLocalFilesV2() posts files to /v2/sync-local-files without multipart fields
- app/lib/backend/http/shared.dart: multipart helper already supports fields, so a manifest/settings field can be added without changing upload mechanics
- backend/routers/sync.py: v2 async pipeline saves raw files, then decode/VAD/STT/LLM
- backend/routers/sync.py: retrieve_vad_segments() merges VAD regions only within each decoded file, then exports per-segment WAVs
- backend/routers/sync.py: process_segment() calls prerecorded(...) per segment and only later merges into conversations by timestamp
- backend/database/conversations.py: get_closest_conversation_to_timestamps() uses a fixed +/-2 minute overlap lookup for sync merging
- app/lib/services/sockets/transcription_service.dart and backend/routers/transcribe.py: live /v4/listen carries explicit language, stt_service, conversation_timeout, speaker assignment, and VAD-gate parameters
- app/lib/services/sockets/composite_transcription_socket.dart and backend/routers/transcribe.py: custom STT live path forwards suggested_transcript
- app/lib/backend/http/api/imports.dart and app/lib/pages/settings/import_history_page.dart: Limitless import defaults to language=en in the API wrapper and the UI does not pass the user language
Expected behavior
Offline transcription should be comparable to live capture for the same audio and user settings.
Suggested direction:
- Keep 60-second WAL files as transfer/retry units. Do not simply make the client upload much larger files as the first fix, because that increases BLE/download/upload retry cost.
- Add a multipart manifest field to /v2/sync-local-files. Include per-file metadata such as filename, timer start, duration, codec, sample rate, channels, device/source, original storage, chunk index, and conversation id if known.
- Add a transcription settings snapshot to the upload, or resolve an equivalent server-side snapshot consistently: language, single_language_mode, vocabulary, stt_service, conversation_timeout, custom_stt_mode.
- In the backend async job, reconstruct logical sessions from the manifest before ASR. Group chunks by time continuity and split sessions only when the gap exceeds the conversation timeout.
- Run ASR on session-level windows, not isolated 60-second/VAD fragments. For long sessions, use provider-safe 5-10 minute windows with 5-10 second overlap, then dedupe by word timestamps.
- Preserve global timestamps when stitching results so existing conversation merge/reprocess logic can still be reused.
- Normalize speaker IDs across windows using speaker embeddings or a session-level diarization pass so SPEAKER_00 does not mean different people in different ASR windows.
- Keep backward compatibility for old clients that do not send a manifest by falling back to filename timestamps.
- Decide explicitly for custom STT users: either keep the current "offline uses Omi STT" behavior with clear UX, or add a separate batch custom-STT path so offline can match live custom STT.
- Pass the user's language into Limitless import instead of defaulting to English.
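A minimal sketch of the session reconstruction and windowing steps from the list above, written in Python to match the backend. Everything here is hypothetical: the Chunk fields stand in for whatever the manifest ends up carrying, group_into_sessions() and plan_windows() are illustrative names rather than existing functions in backend/routers/sync.py, and the 300-second window and 5-second overlap are just one point inside the 5-10 minute / 5-10 second ranges suggested above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Chunk:
    """One uploaded WAL file, as a hypothetical manifest entry might describe it."""
    filename: str
    timer_start: float  # Unix seconds at which the chunk begins
    duration: float     # seconds of audio in the chunk

    @property
    def end(self) -> float:
        return self.timer_start + self.duration


def group_into_sessions(chunks: List[Chunk], conversation_timeout: float) -> List[List[Chunk]]:
    """Group chunks by time continuity; start a new session only when the
    silence since the current session's latest end exceeds the timeout."""
    sessions: List[List[Chunk]] = []
    for chunk in sorted(chunks, key=lambda c: c.timer_start):
        if sessions:
            session_end = max(c.end for c in sessions[-1])
            if chunk.timer_start - session_end <= conversation_timeout:
                sessions[-1].append(chunk)
                continue
        sessions.append([chunk])
    return sessions


def plan_windows(start: float, end: float,
                 window: float = 300.0, overlap: float = 5.0) -> List[Tuple[float, float]]:
    """Cover [start, end] with ASR windows of at most `window` seconds,
    each overlapping the previous one by `overlap` seconds."""
    if window <= overlap:
        raise ValueError("window must be longer than overlap")
    windows: List[Tuple[float, float]] = []
    cursor = start
    while True:
        window_end = min(cursor + window, end)
        windows.append((cursor, window_end))
        if window_end >= end:
            break
        cursor = window_end - overlap  # step back so adjacent windows share audio
    return windows


# Three contiguous 60-second chunks plus one recorded much later.
chunks = [
    Chunk("c.wav", 120.0, 60.0),
    Chunk("a.wav", 0.0, 60.0),
    Chunk("b.wav", 60.0, 60.0),
    Chunk("d.wav", 1000.0, 60.0),
]
sessions = group_into_sessions(chunks, conversation_timeout=120.0)
session_windows = [plan_windows(s[0].timer_start, max(c.end for c in s)) for s in sessions]
```

With a 120-second timeout, the first three chunks form one session and the late chunk forms its own. Because the windows are expressed in global time, the 60-second files can remain the transfer and retry unit while ASR sees longer spans.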
Screenshots
Not applicable.
user ID (can we access the user info to validate the bug?):
N/A. This is a pipeline-level issue found by inspecting the current iOS and backend code paths.
Smartphone + device (please complete the following information):
- Device: iPhone / Omi device WAL sync path
- OS: iOS
- Browser: N/A
- App Version: current main
- Device version: Omi / Limitless offline recording sources
Additional context
There is already partial infrastructure for this:
- The app has enough WAL metadata locally (timerStart, seconds, codec, sampleRate, channel, device, conversationId, etc.).
- The multipart helper already supports form fields.
- Backend v2 sync already has an async job model, GCS staging, Cloud Tasks dispatch, chronological assignment protection, and summary reprocessing after merged segments.
- Existing audio playback grouping uses timestamp-based chunk grouping, which is a useful precedent but is only for playback artifacts, not ASR reconstruction.
The main missing piece is an ASR-before-merge reconstruction layer: transfer chunks should be reassembled into longer recognition windows before pre-recorded STT, while still preserving the existing upload/retry and conversation merge behavior.
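If recognition windows overlap, as the suggested direction proposes, each overlap region is transcribed twice and the results have to be deduplicated by word timestamps while keeping global time. The sketch below shows one possible heuristic, a cut at the midpoint of each overlap; it is not taken from the existing backend. The Word type, the stitch_windows() name, and the assumption that the STT provider returns word timestamps relative to the start of each window are all illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Word:
    text: str
    start: float  # seconds; window-relative on input, global on output
    end: float


def stitch_windows(windows: List[Tuple[float, float, List[Word]]]) -> List[Word]:
    """Merge per-window words into one list with global timestamps.

    Each window is (global_start, global_end, words). Inside an overlap, a cut
    is placed at its midpoint: the earlier window keeps words whose midpoint
    falls before the cut, the later window keeps the rest.
    """
    windows = sorted(windows, key=lambda w: w[0])
    stitched: List[Word] = []
    for i, (w_start, w_end, words) in enumerate(windows):
        # Cut shared with the previous window (none for the first window).
        lower = float("-inf") if i == 0 else (w_start + windows[i - 1][1]) / 2
        # Cut shared with the next window (none for the last window).
        upper = float("inf") if i == len(windows) - 1 else (windows[i + 1][0] + w_end) / 2
        for word in words:
            g_start = w_start + word.start  # shift to global time
            g_end = w_start + word.end
            midpoint = (g_start + g_end) / 2
            if lower <= midpoint < upper:
                stitched.append(Word(word.text, g_start, g_end))
    return stitched


# Two windows sharing the 295-300 s region; "hello world" is heard in both.
first = (0.0, 300.0, [Word("hello", 296.0, 296.4), Word("world", 298.0, 298.5)])
second = (295.0, 595.0, [Word("hello", 1.0, 1.4), Word("world", 3.0, 3.5),
                         Word("again", 10.0, 10.5)])
result = stitch_windows([second, first])
```

Real ASR timestamps jitter between runs, so a word that straddles the cut could still be duplicated or dropped; a production version would likely need a tolerance or text comparison near the cut. Because the output uses global timestamps, the existing timestamp-based conversation merge could in principle consume it unchanged.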