feat(wave4): reliability — threadpools, timeouts, WAL, batching, OCR guard #15
Merged
Wave 4 of 5 from .gstack/qa-reports/PLAN.md. This wave fixes the structural
weaknesses the tech audit called out as "threats to actually works":
unbounded timeouts, event-loop-blocking inference, OOM-able embedding
batches, reads blocking on writes in SQLite, silent no-op ingestions for
scanned PDFs, conversation splits when a meta event drops, and startup
deadlocks on slow Ollama.
## Event loop stays responsive
- Embedding backend grows an aembed() that offloads model.encode() to the
default threadpool (SentenceTransformers) or runs inline (HashEmbedding
is trivial). VectorStoreManager grows aquery / aquery_across_notebooks /
aquery_document_summaries that use it. RAGService.prepare_prompt and
prepare_prompt_cross_notebook are now truly async; a 200–500ms encode no
longer stalls health checks and concurrent sidebar refreshes.
- LlamaCppBackend.generate / stream_generate run llama_cpp's blocking calls
via loop.run_in_executor. Streams pull chunks on a worker thread and yield
on the event loop. Without this, every second of inference froze every
other request.
- OllamaBackend.stream_generate now uses a finite httpx.Timeout
(connect 5s, read 300s, write 10s, pool 5s). Previous timeout=None held
the connection open indefinitely when Ollama hung.
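
For reviewers who want the offloading pattern in isolation, here is a minimal sketch using only the standard library. `blocking_encode` and `blocking_stream` are hypothetical stand-ins for `model.encode()` and llama_cpp's blocking generator; they are not the code in this PR. The heartbeat task only shows that the loop keeps running while the worker thread is busy.

```python
import asyncio
import time
from typing import AsyncIterator, Callable, Iterator, List


def blocking_encode(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for model.encode(): synchronous and slow."""
    time.sleep(0.05)
    return [[float(len(t))] for t in texts]


async def aembed(texts: List[str]) -> List[List[float]]:
    """Run the blocking encode on the loop's default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_encode, texts)


def blocking_stream() -> Iterator[str]:
    """Hypothetical stand-in for a blocking token generator."""
    for token in ("Hel", "lo", "!"):
        time.sleep(0.01)
        yield token


_DONE = object()  # sentinel returned by next() once the iterator is exhausted


async def astream(make_iter: Callable[[], Iterator[str]]) -> AsyncIterator[str]:
    """Pull each chunk on a worker thread, then yield it on the event loop."""
    loop = asyncio.get_running_loop()
    iterator = make_iter()
    while True:
        chunk = await loop.run_in_executor(None, next, iterator, _DONE)
        if chunk is _DONE:
            return
        yield chunk


async def demo():
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the loop gets control while work runs elsewhere.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.005)

    beat = asyncio.create_task(heartbeat())
    vectors = await aembed(["a", "bcd"])
    text = "".join([chunk async for chunk in astream(blocking_stream)])
    beat.cancel()
    return vectors, text, ticks
```

The timeout in the third bullet can be written as `httpx.Timeout(connect=5.0, read=300.0, write=10.0, pool=5.0)`; httpx accepts this form without a default because all four phases are set explicitly.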
## OOM and re-upload guards
- VectorStoreManager.add_chunks embeds in batches of 64 instead of handing
the whole document to model.encode in one call. A 500-page PDF no longer
OOMs sentence-transformers.
- add_chunks also uses collection.upsert instead of add, so re-uploading
a file succeeds cleanly instead of failing mid-batch with
DuplicateIDError and leaving the collection in an inconsistent state.
- Scanned PDFs with no embedded text now raise a clear DocumentLoaderError
("run it through OCR first") instead of silently ingesting 0 chunks and
serving "no relevant documents found" on every subsequent query.
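
The three guards above can be sketched together as follows. This is an illustration under stated assumptions, not the PR's code: `InMemoryCollection` is a hypothetical stand-in that mirrors only the keyword arguments of Chroma's `collection.upsert`, the `doc_id:index` id scheme is invented for the example, and `ensure_extractable` is a hypothetical helper name.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

BATCH_SIZE = 64  # batch size named in the first bullet


class DocumentLoaderError(Exception):
    """Raised when a document yields no usable text."""


def ensure_extractable(pages: Sequence[str]) -> None:
    """Fail loudly when every page is empty, as with a scanned PDF."""
    if not any(page.strip() for page in pages):
        raise DocumentLoaderError("No embedded text found; run it through OCR first.")


def batched(items: Sequence[str], size: int) -> Iterator[Tuple[int, Sequence[str]]]:
    """Yield (start_index, slice) pairs of at most `size` items."""
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


class InMemoryCollection:
    """Hypothetical stand-in with upsert semantics: an existing id is overwritten."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[List[float], str]] = {}

    def upsert(self, ids: List[str], embeddings: List[List[float]],
               documents: List[str]) -> None:
        for id_, embedding, document in zip(ids, embeddings, documents):
            self.rows[id_] = (embedding, document)


def add_chunks(collection: InMemoryCollection, doc_id: str, chunks: Sequence[str],
               encode: Callable[[List[str]], List[List[float]]]) -> int:
    """Embed and upsert in fixed-size batches; return the number of encode calls."""
    calls = 0
    for start, batch in batched(chunks, BATCH_SIZE):
        texts = list(batch)
        # Assumed id scheme: stable per (document, chunk index), so a re-upload
        # maps onto the same ids and overwrites instead of colliding.
        ids = [f"{doc_id}:{start + i}" for i in range(len(texts))]
        collection.upsert(ids=ids, embeddings=encode(texts), documents=texts)
        calls += 1
    return calls
```

With 150 chunks the encoder sees batches of 64, 64, and 22, so peak memory follows the batch size and not the document size. Running `add_chunks` a second time on the same input leaves 150 rows.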
## SQLite stays unblocked
- All three stores (notebook_store, conversation_store, metrics_store) now
set PRAGMA journal_mode = WAL and PRAGMA busy_timeout = 5000. Readers no
longer block on writers — the sidebar refresh during stream completion
used to stall under the default DELETE journal.
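
A minimal sketch of the connection setup, using Python's standard `sqlite3` module. The `connect` helper, the table name, and the file name are illustrative assumptions; the two PRAGMA statements are the ones named in the bullet above.

```python
import os
import sqlite3
import tempfile
from typing import Tuple


def connect(path: str) -> sqlite3.Connection:
    """Open a connection with WAL journaling and a 5-second busy timeout."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = WAL").fetchone()
    conn.execute("PRAGMA busy_timeout = 5000").fetchone()
    return conn


def demo() -> Tuple[str, int, int]:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.db")

        writer = connect(path)
        writer.execute("CREATE TABLE notes (body TEXT)")
        writer.execute("INSERT INTO notes VALUES ('committed')")
        writer.commit()

        reader = connect(path)
        mode = reader.execute("PRAGMA journal_mode").fetchone()[0]
        timeout_ms = reader.execute("PRAGMA busy_timeout").fetchone()[0]

        # The writer now holds an open, uncommitted write transaction.
        writer.execute("INSERT INTO notes VALUES ('pending')")
        # The reader still proceeds and sees only the committed row.
        visible = reader.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

        writer.rollback()
        reader.close()
        writer.close()
        return mode, timeout_ms, visible
```

WAL is a property of the database file, so it persists across connections. `busy_timeout` is per connection and has to be set each time one is opened.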
## Data integrity on stream
- Backend now echoes conversation_id on the `done` event as well as `meta`.
If the meta event drops (parse error, transient network), the frontend
still learns the id via done and doesn't create a second conversation on
the next send. useChat honors both, guarded by `!activeConversationId`,
so the id is set only when it is missing.
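
The guard logic is sketched below in Python for consistency with the backend examples; the real consumer is the `useChat` hook on the frontend. The event shapes and the function names `stream_events` and `learn_conversation_id` are assumptions for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Optional

Event = Dict[str, str]


def stream_events(conversation_id: str, tokens: List[str]) -> Iterator[Event]:
    """Emit the id on both the first (meta) and last (done) event."""
    yield {"type": "meta", "conversation_id": conversation_id}
    for token in tokens:
        yield {"type": "token", "text": token}
    yield {"type": "done", "conversation_id": conversation_id}


def learn_conversation_id(events: Iterable[Event],
                          active: Optional[str] = None) -> Optional[str]:
    """Adopt an id from meta or done, but only while none is set."""
    for event in events:
        if event["type"] in ("meta", "done"):
            incoming = event.get("conversation_id")
            if not active and incoming:  # mirrors the !activeConversationId guard
                active = incoming
    return active
```

If the meta event is lost, the id is still recovered from `done`. An id that is already active is never overwritten.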
## Startup and shipping correctness
- Ollama model resolution is deferred to an asyncio lifespan, so a slow
/api/tags round-trip at boot no longer delays FastAPI startup or
trips Electron's waitForBackend timeout. app.py imports are consolidated
at the top of the file (previously an E402 violation).
- Config split: llm_context_window default 4096 (was 2048), llm_max_tokens
default 1024 (was 2048). Previously both were 2048 — the same knob was
used for "how much context the model can hold" AND "how many tokens to
generate", which quietly truncated long RAG prompts AND capped replies.
- Chroma telemetry disabled via Settings(anonymized_telemetry=False).
Keeps the offline-first promise and stops filling logs with "Failed to
send telemetry event ClientCreateCollectionEvent" warnings.
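
One way to keep a slow model lookup off the startup path is sketched below. It assumes the lookup is launched as a background task inside the lifespan context, which this description does not confirm. `resolve_model`, `model_task`, and the returned model name are hypothetical, and a `SimpleNamespace` stands in for the FastAPI app so the sketch runs without dependencies.

```python
import asyncio
import contextlib
import time
from types import SimpleNamespace


async def resolve_model() -> str:
    """Hypothetical stand-in for a slow /api/tags round-trip."""
    await asyncio.sleep(0.2)
    return "example-model"


@contextlib.asynccontextmanager
async def lifespan(app):
    # Start the lookup without awaiting it, so startup completes immediately.
    app.state.model_task = asyncio.create_task(resolve_model())
    try:
        yield
    finally:
        # On shutdown, cancel the lookup if it is still running.
        app.state.model_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await app.state.model_task


async def demo():
    app = SimpleNamespace(state=SimpleNamespace())
    started = time.perf_counter()
    async with lifespan(app):
        startup_seconds = time.perf_counter() - started
        # A request handler would await the task the first time it needs the model.
        model = await app.state.model_task
    return startup_seconds, model
```

With FastAPI, a context manager of this shape is passed as `FastAPI(lifespan=lifespan)`.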
## Frontend hardening
- api.ts request() accepts an optional timeoutMs and composes caller
signals with a default 30s AbortController. Prior code could hang on
any stalled endpoint forever. Uploads get their own 5-minute budget
rather than the default.
- Network errors surface through the errorMessages humanizer that
Wave 3 added, so the user sees "The backend isn't reachable…" instead
of a raw "Failed to fetch".
## Verified
- cd apps/desktop && npm run build — clean (272 modules, 864ms)
- cd backend && uv run ruff check . — clean
- cd backend && uv run pytest -q — 33 passed
## Intentionally not in this PR (for a later wave or follow-up)
- 4.1 Backend-crash recovery IPC — bigger Electron main-process refactor
- 4.3 Exponential backoff for 503/ECONNREFUSED — fiddly retry-policy work
- 4.9 Stream-to-disk upload + size limit — requires multipart-streaming
rewrite at the FastAPI boundary
- 4.13 Per-conversation asyncio lock — needs a store-level primitive
- 4.15 list_documents via summaries — risks regressing docs that failed
summary generation; needs a unified doc registry first
- 4.16 Document summary LLM deferred to background — needs a task queue
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Context
Wave 4 of 5 from PLAN.md. This one targets the structural weaknesses the tech audit flagged as "threats to actually works": unbounded timeouts, event-loop-blocking inference, OOM-able embedding batches, reads blocking on writes in SQLite, silent no-op ingestions for scanned PDFs, conversation splits when a meta event drops, startup deadlocks on slow Ollama.
Intentionally not in this PR
These are planned for a follow-up or Wave 4b depending on priority.
🤖 Generated with Claude Code