Offline WAL uploads should reconstruct sessions before ASR and use live transcription settings

**Describe the bug**
Offline WAL uploads currently preserve 60-second transfer chunks and then run pre-recorded STT on VAD-derived segments, instead of reconstructing a longer continuous recording/session before ASR. This can make offline transcription substantially worse than live `/v4/listen`, especially for short utterances, multi-speaker audio, language detection, and users who rely on specific transcription settings.

The current backend does merge timestamp-adjacent transcript segments into a conversation after STT, but the STT request itself is still performed per VAD segment/window. That means the recognizer does not get the same continuous context as live streaming capture.

**To Reproduce**
Steps to reproduce the behavior:
1. On iOS, record a continuous conversation long enough to create multiple offline WAL files.
2. Let the app sync the offline recordings through `/v2/sync-local-files`.
3. Compare the generated transcript against a live `/v4/listen` capture of similar audio with the same user language/transcription settings.
4. Observe lower offline recognition quality and fragmented results, especially around short chunks, language-sensitive audio, or custom-STT users.

**Current behavior**
- iOS WAL chunking uses 60-second transfer chunks (`chunkSizeInSeconds = 60`, `sdcardChunkSizeSecs = 60`).
- `uploadLocalFilesV2()` sends only files and optional `conversation_id`; it does not include a recording manifest or the live-session settings snapshot (`language`, `stt_service`, `conversation_timeout`, custom STT mode, etc.).
- Backend `/v2/sync-local-files` saves raw files and runs decode -> VAD -> STT -> LLM in an async job.
- VAD segmentation happens per uploaded WAV, and `process_segment()` calls pre-recorded STT on each resulting segment.
- The backend later appends timestamp-adjacent transcript segments into the nearest conversation and reprocesses summaries, but this happens after ASR and cannot recover recognition quality lost from short ASR windows.
- Offline standard Omi STT does read current server-side transcription preferences (`language`, `single_language_mode`, `vocabulary`), but it does not use the exact settings snapshot used by the client at recording time.
- Custom STT users explicitly get different behavior offline: live mode forwards third-party results as `suggested_transcript`, while offline manual sync uses Omi server STT with a confirmation prompt.
- Limitless ZIP import defaults to `language=en` unless the caller passes a language; the app currently calls it without passing the user's language.

Relevant code pointers:
- `app/lib/services/wals/wal.dart`: `chunkSizeInSeconds = 60`, `sdcardChunkSizeSecs = 60`
- `app/lib/backend/http/api/conversations.dart`: `uploadLocalFilesV2()` posts files to `/v2/sync-local-files` without multipart fields
- `app/lib/backend/http/shared.dart`: multipart helper already supports `fields`, so a manifest/settings field can be added without changing upload mechanics
- `backend/routers/sync.py`: v2 async pipeline saves raw files, then decode/VAD/STT/LLM
- `backend/routers/sync.py`: `retrieve_vad_segments()` merges VAD regions only within each decoded file, then exports per-segment WAVs
- `backend/routers/sync.py`: `process_segment()` calls `prerecorded(...)` per segment and only later merges into conversations by timestamp
- `backend/database/conversations.py`: `get_closest_conversation_to_timestamps()` uses a fixed +/-2 minute overlap lookup for sync merging
- `app/lib/services/sockets/transcription_service.dart` and `backend/routers/transcribe.py`: live `/v4/listen` carries explicit `language`, `stt_service`, `conversation_timeout`, speaker assignment, and VAD-gate parameters
- `app/lib/services/sockets/composite_transcription_socket.dart` and `backend/routers/transcribe.py`: custom STT live path forwards `suggested_transcript`
- `app/lib/backend/http/api/imports.dart` and `app/lib/pages/settings/import_history_page.dart`: Limitless import defaults to `language=en` in the API wrapper and the UI does not pass the user language

**Expected behavior**
Offline transcription should be comparable to live capture for the same audio and user settings.

Suggested direction:
1. Keep 60-second WAL files as transfer/retry units. Do not simply make the client upload much larger files as the first fix, because that increases BLE/download/upload retry cost.
2. Add a multipart `manifest` field to `/v2/sync-local-files`. Include per-file metadata such as filename, timer start, duration, codec, sample rate, channels, device/source, original storage, chunk index, and conversation id if known.
3. Add a transcription settings snapshot to the upload, or resolve an equivalent server-side snapshot consistently: `language`, `single_language_mode`, `vocabulary`, `stt_service`, `conversation_timeout`, `custom_stt_mode`.
4. In the backend async job, reconstruct logical sessions from the manifest before ASR. Group chunks by time continuity and split sessions only when the gap exceeds the conversation timeout.
5. Run ASR on session-level windows, not isolated 60-second/VAD fragments. For long sessions, use provider-safe 5-10 minute windows with 5-10 second overlap, then dedupe by word timestamps.
6. Preserve global timestamps when stitching results so existing conversation merge/reprocess logic can still be reused.
7. Normalize speaker IDs across windows using speaker embeddings or a session-level diarization pass so `SPEAKER_00` does not mean different people in different ASR windows.
8. Keep backward compatibility for old clients that do not send a manifest by falling back to filename timestamps.
9. Decide explicitly for custom STT users: either keep the current "offline uses Omi STT" behavior with clear UX, or add a separate batch custom-STT path so offline can match live custom STT.
10. Pass the user's language into Limitless import instead of defaulting to English.
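
Step 4 of the direction above could look roughly like the following. This is a minimal sketch, not existing backend code: the `Chunk` type, its field names, and `group_into_sessions()` are hypothetical, and it assumes each manifest entry carries a start time and a duration in seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical manifest entry; field names are illustrative only.
    filename: str
    start: float     # chunk start, seconds since epoch
    duration: float  # chunk length, seconds


def group_into_sessions(chunks: List[Chunk], timeout: float) -> List[List[Chunk]]:
    """Group chunks by time continuity; start a new session when the gap
    between the end of the current session and the next chunk exceeds `timeout`."""
    sessions: List[List[Chunk]] = []
    session_end = 0.0
    for chunk in sorted(chunks, key=lambda c: c.start):
        chunk_end = chunk.start + chunk.duration
        if not sessions or chunk.start - session_end > timeout:
            sessions.append([chunk])
            session_end = chunk_end
        else:
            sessions[-1].append(chunk)
            # max() guards against a chunk that ends before an earlier one.
            session_end = max(session_end, chunk_end)
    return sessions
```

Passing the snapshot's `conversation_timeout` as `timeout` would keep the split rule aligned with live capture, assuming that value is expressed in seconds.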

**Screenshots**
Not applicable.

**user ID (can we access the user info to validate the bug?):**
N/A. This is a pipeline-level issue found by inspecting the current iOS and backend code paths.

**Smartphone + device (please complete the following information):**
 - Device: iPhone / Omi device WAL sync path
 - OS: iOS
 - Browser: N/A
 - App Version: current `main`
 - Device version: Omi / Limitless offline recording sources

**Additional context**
There is already partial infrastructure for this:
- The app has enough WAL metadata locally (`timerStart`, `seconds`, `codec`, `sampleRate`, `channel`, `device`, `conversationId`, etc.).
- The multipart helper already supports form fields.
- Backend v2 sync already has an async job model, GCS staging, Cloud Tasks dispatch, chronological assignment protection, and summary reprocessing after merged segments.
- Existing audio playback grouping uses timestamp-based chunk grouping, which is a useful precedent but is only for playback artifacts, not ASR reconstruction.
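
As an illustration of how the backend might consume that metadata, here is a hedged sketch of parsing a JSON `manifest` form field. The JSON layout, the `ManifestEntry` type, and `parse_manifest()` are assumptions; only the key names are taken from the WAL metadata listed above. Returning `None` lets the caller fall back to the legacy path for old clients.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ManifestEntry:
    # Hypothetical server-side view of one uploaded WAL file.
    filename: str
    timer_start: float
    seconds: float
    codec: str
    sample_rate: int
    channel: int
    device: Optional[str] = None
    conversation_id: Optional[str] = None


def parse_manifest(raw: Optional[str]) -> Optional[List[ManifestEntry]]:
    """Parse the manifest field; return None if it is absent or malformed
    so the caller can use the existing filename-based behavior instead."""
    if not raw:
        return None
    try:
        items = json.loads(raw)
        return [
            ManifestEntry(
                filename=str(item["filename"]),
                timer_start=float(item["timerStart"]),
                seconds=float(item["seconds"]),
                codec=str(item["codec"]),
                sample_rate=int(item["sampleRate"]),
                channel=int(item["channel"]),
                device=item.get("device"),
                conversation_id=item.get("conversationId"),
            )
            for item in items
        ]
    except (ValueError, KeyError, TypeError, AttributeError):
        # Covers invalid JSON, missing keys, and wrongly typed values.
        return None
```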

The main missing piece is an ASR-before-merge reconstruction layer: transfer chunks should be reassembled into longer recognition windows before pre-recorded STT, while still preserving the existing upload/retry and conversation merge behavior.
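
One possible shape for the recognition-window step is sketched below, under stated assumptions: `plan_windows()` and `stitch()` are hypothetical helpers, word timings are already expressed in global session seconds, and duplicates in overlap regions are resolved by a simple midpoint-ownership rule (each word is kept only from the window whose owned interval contains the word's midpoint). Other dedupe strategies would work as well.

```python
from typing import List, Tuple

Window = Tuple[float, float]    # (start, end) in session seconds
Word = Tuple[float, float, str]  # (start, end, text) in session seconds


def plan_windows(total: float, window: float = 300.0, overlap: float = 5.0) -> List[Window]:
    """Cover [0, total] with windows of at most `window` seconds that
    overlap their neighbor by `overlap` seconds."""
    if window <= overlap:
        raise ValueError("window must be longer than overlap")
    windows: List[Window] = []
    start = 0.0
    while start < total:
        end = min(start + window, total)
        windows.append((start, end))
        if end >= total:
            break
        start = end - overlap
    return windows


def stitch(windows: List[Window], words_per_window: List[List[Word]]) -> List[Word]:
    """Merge per-window ASR words, dropping duplicates from overlap regions.
    Ownership boundaries sit at the middle of each overlap."""
    stitched: List[Word] = []
    last = len(windows) - 1
    for i, ((w_start, w_end), words) in enumerate(zip(windows, words_per_window)):
        lo = float("-inf") if i == 0 else (windows[i - 1][1] + w_start) / 2
        hi = float("inf") if i == last else (w_end + windows[i + 1][0]) / 2
        for start, end, text in words:
            mid = (start + end) / 2
            if lo <= mid < hi:
                stitched.append((start, end, text))
    return stitched
```

Because the stitched words keep global timestamps, the existing timestamp-based conversation merge could in principle consume them unchanged.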

