Skip to content

feat(wave4): reliability — threadpools, timeouts, WAL, batching, OCR guard#15

Merged
vikranthreddimasu merged 1 commit into
mainfrom
feat/wave4-reliability
Apr 18, 2026
Merged

feat(wave4): reliability — threadpools, timeouts, WAL, batching, OCR guard#15
vikranthreddimasu merged 1 commit into
mainfrom
feat/wave4-reliability

Conversation

@vikranthreddimasu

Copy link
Copy Markdown
Owner

Context

Wave 4 of 5 from PLAN.md. This one targets the structural weaknesses the tech audit flagged as "threats to actually works": unbounded timeouts, event-loop-blocking inference, OOM-able embedding batches, reads blocking on writes in SQLite, silent no-op ingestions for scanned PDFs, conversation splits when a meta event drops, startup deadlocks on slow Ollama.

Event loop stays responsive

  • `aembed` on the EmbeddingBackend protocol. `SentenceTransformersBackend` offloads `model.encode()` to the default threadpool (a 200–500ms encode was stalling concurrent health checks and sidebar refreshes). `HashEmbeddingBackend` runs inline.
  • Async query variants. `VectorStoreManager.aquery` / `aquery_across_notebooks` / `aquery_document_summaries` use `aembed`. RAGService `prepare_prompt` / `prepare_prompt_cross_notebook` and `LlamaIndexRAGService.query` now actually async all the way down.
  • LlamaCpp backend. `generate` and `stream_generate` use `loop.run_in_executor` for the blocking `llama.create_completion` calls. Streams pull chunks on a worker thread and yield on the event loop. Previously every second of inference froze all other requests.
  • Ollama stream timeout. `httpx.AsyncClient(timeout=httpx.Timeout(connect=5, read=300, write=10, pool=5))` instead of `timeout=None`. A hung Ollama instance no longer pins the connection forever.

OOM and re-upload guards

  • Batched embedding. `add_chunks` embeds in batches of 64 instead of one mega-call. A 500-page PDF no longer OOMs sentence-transformers.
  • Upsert, not add. `collection.upsert` makes re-uploading a file succeed cleanly. The old path failed mid-batch with `DuplicateIDError` and left Chroma half-updated.
  • OCR guard for scanned PDFs. `_load_pdf` now raises a clear `DocumentLoaderError` ("run it through OCR first") if extracted text is below a minimum threshold. Before, a scanned PDF "succeeded" with 0 chunks and queries returned "no relevant documents found" forever.

SQLite stays unblocked

  • `PRAGMA journal_mode = WAL` + `busy_timeout = 5000` in `notebook_store`, `conversation_store`, `metrics_store`. Readers don't block on writers. The sidebar refresh during stream completion used to stall under default DELETE journal.

Data integrity on stream

  • `conversation_id` echoed on `done` events too. If the `meta` event drops (parse error, transient glitch), the frontend still learns the id via done and doesn't fork the thread. `useChat` guards with `!activeConversationId` so it only sets when missing.

Startup correctness

  • Lazy Ollama resolution. Model resolution moved to an `asyncio` lifespan; a slow `/api/tags` round-trip at boot no longer delays FastAPI startup or trips Electron's `waitForBackend` timeout. `app.py` imports consolidated at the top (were E402-violating before).
  • Split token config defaults. `llm_context_window` 4096 (was 2048), `llm_max_tokens` 1024 (was 2048). One knob was doing two jobs — it capped replies AND silently truncated long RAG prompts.
  • Chroma telemetry off. `Settings(anonymized_telemetry=False)`. Offline-first is the whole promise; also stops the recurring "Failed to send telemetry event" log noise.

Frontend hardening

  • API timeouts. `request()` composes any caller-provided `AbortSignal` with a default 30s timeout. Uploads get their own 5-minute budget via `uploadDocument`. Previously a stalled endpoint hung the UI indefinitely.
  • User-facing error. Timeout errors flow through the Wave-3 `humanizeError` translator so the user sees "The request took too long…" instead of an `AbortError` stack dive.

Verification

  • `cd apps/desktop && npm run build` — clean (272 modules, 864ms)
  • `cd backend && uv run ruff check .` — clean
  • `cd backend && uv run pytest -q` — 33 passed

Intentionally not in this PR

  • 4.1 Backend-crash recovery IPC (bigger Electron refactor)
  • 4.3 Exponential backoff for 503/ECONNREFUSED (fiddly retry policy)
  • 4.9 Stream-to-disk upload + size limit (multipart streaming rewrite)
  • 4.13 Per-conversation asyncio lock (needs a store-level primitive)
  • 4.15 `list_documents` via summaries collection (risks regressing docs that failed summary)
  • 4.16 Defer summary LLM to a background task queue

These are planned for a follow-up or Wave 4b depending on priority.

🤖 Generated with Claude Code

…guard

Wave 4 of 5 from .gstack/qa-reports/PLAN.md. The structural weaknesses the
tech audit called out as "threats to actually works" — unbounded timeouts,
event-loop-blocking inference, OOM-able embedding batches, reads blocking
on writes in SQLite, silent no-op ingestions for scanned PDFs, conversation
splits when a meta event drops, startup deadlocks on slow Ollama.

## Event-loop stays responsive now

- Embedding backend grows an aembed() that offloads model.encode() to the
  default threadpool (SentenceTransformers) or runs inline (HashEmbedding
  is trivial). VectorStoreManager grows aquery / aquery_across_notebooks /
  aquery_document_summaries that use it. RAGService.prepare_prompt and
  prepare_prompt_cross_notebook are now truly async; a 200–500ms encode no
  longer stalls health checks and concurrent sidebar refreshes.
- LlamaCppBackend.generate / stream_generate run llama_cpp's blocking calls
  via loop.run_in_executor. Streams pull chunks on a worker thread and yield
  on the event loop. Without this, every second of inference froze every
  other request.
- OllamaBackend.stream_generate now uses a finite httpx.Timeout
  (connect 5s, read 300s, write 10s, pool 5s). Previous timeout=None held
  the connection open indefinitely when Ollama hung.

## OOM and re-upload guards

- VectorStoreManager.add_chunks embeds in batches of 64 instead of handing
  the whole document to model.encode in one call. A 500-page PDF no longer
  OOMs sentence-transformers.
- add_chunks also uses collection.upsert instead of add, so re-uploading
  a file succeeds cleanly instead of failing mid-batch with
  DuplicateIDError and leaving the collection in an inconsistent state.
- Scanned PDFs with no embedded text now raise a clear DocumentLoaderError
  ("run it through OCR first") instead of silently ingesting 0 chunks and
  serving "no relevant documents found" on every subsequent query.

## SQLite stays unblocked

- All three stores (notebook_store, conversation_store, metrics_store) now
  set PRAGMA journal_mode = WAL and PRAGMA busy_timeout = 5000. Readers no
  longer block on writers — the sidebar refresh during stream completion
  used to stall under the default DELETE journal.

## Data integrity on stream

- Backend now echoes conversation_id on the `done` event as well as `meta`.
  If the meta event drops (parse error, transient network), the frontend
  still learns the id via done and doesn't create a second conversation on
  the next send. useChat honors both with a `!activeConversationId` guard
  so we only set when it's missing.

## Startup and shipping correctness

- Ollama model resolution deferred to an asyncio lifespan so a slow
  /api/tags round-trip at boot no longer delays FastAPI startup and
  trips Electron's waitForBackend timeout. app.py imports consolidated
  at the top of the file (was E402-violating before).
- Config split: llm_context_window default 4096 (was 2048), llm_max_tokens
  default 1024 (was 2048). Previously both were 2048 — the same knob was
  used for "how much context the model can hold" AND "how many tokens to
  generate", which quietly truncated long RAG prompts AND capped replies.
- Chroma telemetry disabled via Settings(anonymized_telemetry=False).
  Keeps the offline-first promise and stops filling logs with "Failed to
  send telemetry event ClientCreateCollectionEvent" warnings.

## Frontend hardening

- api.ts request() accepts an optional timeoutMs and composes caller
  signals with a default 30s AbortController. Prior code could hang on
  any stalled endpoint forever. Uploads get their own 5-minute budget
  rather than the default.
- Network errors surface through the errorMessages humanizer that
  Wave 3 added, so the user sees "The backend isn't reachable…" instead
  of a raw "Failed to fetch".

## Verified

- cd apps/desktop && npm run build — clean (272 modules, 864ms)
- cd backend && uv run ruff check . — clean
- cd backend && uv run pytest -q — 33 passed

## Intentionally not in this PR (for a later wave or follow-up)

- 4.1 Backend-crash recovery IPC — bigger Electron main-process refactor
- 4.3 Exponential backoff for 503/ECONNREFUSED — fiddly retry-policy work
- 4.9 Stream-to-disk upload + size limit — requires multipart-streaming
  rewrite at the FastAPI boundary
- 4.13 Per-conversation asyncio lock — needs a store-level primitive
- 4.15 list_documents via summaries — risks regressing docs that failed
  summary generation; needs a unified doc registry first
- 4.16 Document summary LLM deferred to background — needs a task queue

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vikranthreddimasu vikranthreddimasu merged commit 78b5ccc into main Apr 18, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant