Offline WAL uploads should reconstruct sessions before ASR and use live transcription settings #8006

@waffensam

Describe the bug
Offline WAL uploads currently preserve 60-second transfer chunks and then run pre-recorded STT on VAD-derived segments, instead of reconstructing a longer continuous recording/session before ASR. This can make offline transcription substantially worse than live /v4/listen, especially for short utterances, multi-speaker context, language detection, and users relying on specific transcription settings.

The current backend does merge timestamp-adjacent transcript segments into a conversation after STT, but the STT request itself is still performed per VAD segment/window. That means the recognizer does not get the same continuous context as live streaming capture.

To Reproduce
Steps to reproduce the behavior:

  1. On iOS, record a continuous conversation long enough to create multiple offline WAL files.
  2. Let the app sync the offline recordings through /v2/sync-local-files.
  3. Compare the generated transcript against a live /v4/listen capture of similar audio with the same user language/transcription settings.
  4. Observe lower offline recognition quality and fragmented results, especially around short chunks, language-sensitive audio, or custom-STT users.

Current behavior

  • iOS WAL chunking uses 60-second transfer chunks (chunkSizeInSeconds = 60, sdcardChunkSizeSecs = 60).
  • uploadLocalFilesV2() sends only files and optional conversation_id; it does not include a recording manifest or the live-session settings snapshot (language, stt_service, conversation_timeout, custom STT mode, etc.).
  • Backend /v2/sync-local-files saves raw files and runs decode -> VAD -> STT -> LLM in an async job.
  • VAD segmentation happens per uploaded WAV, and process_segment() calls pre-recorded STT on each resulting segment.
  • The backend later appends timestamp-adjacent transcript segments into the nearest conversation and reprocesses summaries, but this happens after ASR and cannot recover recognition quality lost from short ASR windows.
  • Offline standard Omi STT does read current server-side transcription preferences (language, single_language_mode, vocabulary), but it does not use the exact settings snapshot used by the client at recording time.
  • Custom STT users explicitly do not get equivalent behavior offline: live mode forwards third-party results as suggested_transcript, while offline manual sync uses Omi server STT with a confirmation prompt.
  • Limitless ZIP import defaults to language=en unless the caller passes a language; the app currently calls it without passing the user's language.

Relevant code pointers:

  • app/lib/services/wals/wal.dart: chunkSizeInSeconds = 60, sdcardChunkSizeSecs = 60
  • app/lib/backend/http/api/conversations.dart: uploadLocalFilesV2() posts files to /v2/sync-local-files without multipart fields
  • app/lib/backend/http/shared.dart: multipart helper already supports fields, so a manifest/settings field can be added without changing upload mechanics
  • backend/routers/sync.py: v2 async pipeline saves raw files, then decode/VAD/STT/LLM
  • backend/routers/sync.py: retrieve_vad_segments() merges VAD regions only within each decoded file, then exports per-segment WAVs
  • backend/routers/sync.py: process_segment() calls prerecorded(...) per segment and only later merges into conversations by timestamp
  • backend/database/conversations.py: get_closest_conversation_to_timestamps() uses a fixed +/-2 minute overlap lookup for sync merging
  • app/lib/services/sockets/transcription_service.dart and backend/routers/transcribe.py: live /v4/listen carries explicit language, stt_service, conversation_timeout, speaker assignment, and VAD-gate parameters
  • app/lib/services/sockets/composite_transcription_socket.dart and backend/routers/transcribe.py: custom STT live path forwards suggested_transcript
  • app/lib/backend/http/api/imports.dart and app/lib/pages/settings/import_history_page.dart: Limitless import defaults to language=en in the API wrapper and the UI does not pass the user language
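
For illustration, here is one possible shape for the manifest/settings field that the multipart helper could carry. This is only a sketch written in Python for readability (the real client is Dart). The class names, the `version` key, the JSON layout, and all field types are assumptions; only the field names themselves come from the metadata and settings listed in this issue.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional


@dataclass
class ChunkEntry:
    # Hypothetical per-file manifest entry; types are guesses.
    filename: str
    timer_start: float      # epoch seconds when the chunk started recording
    duration: float         # seconds of audio in the chunk
    codec: str
    sample_rate: int
    channels: int
    chunk_index: int
    conversation_id: Optional[str] = None


@dataclass
class SettingsSnapshot:
    # Hypothetical snapshot of the settings in effect at recording time.
    language: str
    single_language_mode: bool
    stt_service: str
    conversation_timeout: int   # seconds
    custom_stt_mode: bool = False
    vocabulary: list = field(default_factory=list)


def build_manifest_field(chunks, settings):
    """Serialize the manifest as one JSON string for a multipart form field."""
    ordered = sorted(chunks, key=lambda c: c.timer_start)
    return json.dumps({
        "version": 1,  # assumed: lets the backend detect old or missing manifests
        "settings": asdict(settings),
        "chunks": [asdict(c) for c in ordered],
    })
```

Sending the manifest as a single JSON string keeps the upload mechanics unchanged: the files still go up as 60-second parts, and the backend can ignore the field entirely for old clients.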

Expected behavior
Offline transcription should be comparable to live capture for the same audio and user settings.

Suggested direction:

  1. Keep 60-second WAL files as transfer/retry units. Do not simply make the client upload much larger files as the first fix, because that increases BLE/download/upload retry cost.
  2. Add a multipart manifest field to /v2/sync-local-files. Include per-file metadata such as filename, timer start, duration, codec, sample rate, channels, device/source, original storage, chunk index, and conversation id if known.
  3. Add a transcription settings snapshot to the upload, or resolve an equivalent server-side snapshot consistently: language, single_language_mode, vocabulary, stt_service, conversation_timeout, custom_stt_mode.
  4. In the backend async job, reconstruct logical sessions from the manifest before ASR. Group chunks by time continuity and split sessions only when the gap exceeds the conversation timeout.
  5. Run ASR on session-level windows, not isolated 60-second/VAD fragments. For long sessions, use provider-safe 5-10 minute windows with 5-10 second overlap, then dedupe by word timestamps.
  6. Preserve global timestamps when stitching results so existing conversation merge/reprocess logic can still be reused.
  7. Normalize speaker IDs across windows using speaker embeddings or a session-level diarization pass so SPEAKER_00 does not mean different people in different ASR windows.
  8. Keep backward compatibility for old clients that do not send a manifest by falling back to filename timestamps.
  9. Decide explicitly for custom STT users: either keep the current "offline uses Omi STT" behavior with clear UX, or add a separate batch custom-STT path so offline can match live custom STT.
  10. Pass the user's language into Limitless import instead of defaulting to English.
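
Step 4 above (session reconstruction) could look roughly like the following. This is a minimal sketch, not the backend's actual code: `Chunk` and `group_into_sessions` are hypothetical names, and it assumes each manifest entry provides a start time in epoch seconds and a duration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    # Hypothetical view of one manifest entry.
    filename: str
    timer_start: float  # epoch seconds
    duration: float     # seconds


def group_into_sessions(chunks, conversation_timeout):
    """Group chunks into sessions by time continuity.

    A new session starts only when the gap between the end of the current
    session and the start of the next chunk exceeds conversation_timeout.
    """
    sessions = []
    session_end = None
    for chunk in sorted(chunks, key=lambda c: c.timer_start):
        chunk_end = chunk.timer_start + chunk.duration
        if session_end is None or chunk.timer_start - session_end > conversation_timeout:
            sessions.append([chunk])
            session_end = chunk_end
        else:
            sessions[-1].append(chunk)
            # max() tolerates overlapping or out-of-order chunk boundaries.
            session_end = max(session_end, chunk_end)
    return sessions
```

Because grouping depends only on timestamps and durations, the same function could serve the fallback path in step 8 once start times are recovered from filenames.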

Screenshots
Not applicable.

User ID (can we access the user info to validate the bug?):
N/A. This is a pipeline-level issue found by inspecting the current iOS and backend code paths.

Smartphone + device (please complete the following information):

  • Device: iPhone / Omi device WAL sync path
  • OS: iOS
  • Browser: N/A
  • App Version: current main
  • Device version: Omi / Limitless offline recording sources

Additional context
There is already partial infrastructure for this:

  • The app has enough WAL metadata locally (timerStart, seconds, codec, sampleRate, channel, device, conversationId, etc.).
  • The multipart helper already supports form fields.
  • Backend v2 sync already has an async job model, GCS staging, Cloud Tasks dispatch, chronological assignment protection, and summary reprocessing after merged segments.
  • Existing audio playback grouping uses timestamp-based chunk grouping, which is a useful precedent but is only for playback artifacts, not ASR reconstruction.

The main missing piece is an ASR-before-merge reconstruction layer: transfer chunks should be reassembled into longer recognition windows before pre-recorded STT, while still preserving the existing upload/retry and conversation merge behavior.
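
A rough sketch of the windowing and stitching part of that layer follows. It plans overlapping windows over a reconstructed session, then dedupes words in each overlap by timestamp. Everything here is an assumption made for illustration: the function names, the `(text, start, end)` word tuples with times relative to the window start, and the rule that each overlap is cut at its midpoint so a word belongs to whichever window owns the word's own midpoint.

```python
def plan_windows(session_duration, window=300.0, overlap=5.0):
    """Return (start, end) offsets in seconds that cover the session.

    Each window begins `overlap` seconds before the previous one ends.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must be >= 0 and smaller than window")
    windows = []
    start = 0.0
    while True:
        end = min(start + window, session_duration)
        windows.append((start, end))
        if end >= session_duration:
            break
        start = end - overlap
    return windows


def stitch_words(windows, words_per_window, session_start=0.0):
    """Merge per-window word lists into one list with global timestamps.

    words_per_window[i] holds (text, start, end) tuples relative to the
    start of windows[i]. Words duplicated in an overlap are kept only from
    the window that owns the word's midpoint.
    """
    stitched = []
    last = len(windows) - 1
    for i, ((w_start, w_end), words) in enumerate(zip(windows, words_per_window)):
        # Ownership boundaries sit at the midpoint of each overlap region.
        lo = float("-inf") if i == 0 else (w_start + windows[i - 1][1]) / 2
        hi = float("inf") if i == last else (windows[i + 1][0] + w_end) / 2
        for text, start, end in words:
            mid = w_start + (start + end) / 2
            if lo <= mid < hi:
                offset = session_start + w_start
                stitched.append((text, offset + start, offset + end))
    return stitched
```

Passing the session's start time as `session_start` yields global timestamps, so the existing timestamp-based conversation merge could consume the stitched output unchanged. Speaker ID normalization across windows is not covered by this sketch.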
