Describe the bug
After the Windows desktop KG/main-thread bottleneck fix in PR #8008, local KG context retrieval appears fast, but chat still has significant latency that seems to come from the backend/tool path rather than the desktop KG path.
Observed frontend timings:
fetchStart → headers: ~1.8s–2.6s
firstSseLine → firstAssistantChunk: ~1.9s–3.3s on tool-heavy prompts
PR #8008 addresses a separate confirmed desktop-side issue where KG graph writes could block Electron main-thread IPC. After that fix, KG query/status timings are consistently low, so the remaining delay appears to be elsewhere in /v2/messages.
To Reproduce
Steps to reproduce the behavior:
- Run the Windows desktop app.
- Open the chat / floating ask overlay.
- Ask a simple prompt such as: What do you see?
- Ask a tool-heavy/context prompt such as: What did I just discuss?
- Measure frontend timing around /v2/messages, especially:
  - request: fetchStart
  - response: headers, firstSseLine, firstAssistantChunk
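For anyone reproducing this outside the renderer, the marks above could be captured with a small helper like the one below. This is only a sketch: `StreamMarks`, `consume_sse`, and the `is_assistant_chunk` predicate are hypothetical names, not part of the omi codebase, and how an assistant chunk is recognised in the SSE payload is an assumption that has to be adapted to the real event format.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional


@dataclass
class StreamMarks:
    """Named timestamps (in seconds) collected for one /v2/messages request."""
    clock: Callable[[], float] = time.perf_counter
    marks: Dict[str, float] = field(default_factory=dict)

    def mark(self, name: str) -> None:
        # Only the first occurrence of a mark is kept.
        if name not in self.marks:
            self.marks[name] = self.clock()

    def delta_ms(self, start: str, end: str) -> Optional[float]:
        # Returns None if either mark was never recorded.
        if start not in self.marks or end not in self.marks:
            return None
        return (self.marks[end] - self.marks[start]) * 1000.0


def consume_sse(lines: Iterable[str], marks: StreamMarks,
                is_assistant_chunk: Callable[[str], bool]) -> None:
    """Drain an SSE line stream, recording firstSseLine and firstAssistantChunk."""
    for line in lines:
        if not line.strip():
            continue  # blank lines only separate SSE events
        marks.mark("firstSseLine")
        if is_assistant_chunk(line):  # assumption: caller knows the payload shape
            marks.mark("firstAssistantChunk")
```

Call `marks.mark("fetchStart")` just before sending the request and `marks.mark("headers")` as soon as the HTTP client hands back the response object, then pass the line iterator to `consume_sse`. `delta_ms("fetchStart", "headers")` and `delta_ms("firstSseLine", "firstAssistantChunk")` then give the two intervals reported in this issue.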
Current behavior
The local KG path is fast after PR #8008, but /v2/messages still shows backend-looking latency:
fetchStart → headers: ~1.8s–2.6s
firstSseLine → firstAssistantChunk: ~1.9s–3.3s on tool-heavy prompts
From code inspection, possible areas include:
auth / rate limit / quota / subscription
get_chat_session
add_message_to_chat_session
add_message
get_available_app_by_id
get_messages(limit=10)
StreamingResponse returned
prompt prep
Anthropic first_event
conversation / memory tool calls
second Anthropic call
first assistant text
For context, the KG path after PR #8008 appears healthy:
kg queryNodes: ~0.5–2.7ms
db.queryKgNodes: ~0.3–2.0ms
kg status: ~0.1–0.5ms
So this issue is likely separate from the desktop KG bottleneck.
Expected behavior
After local desktop context is available quickly, the chat backend should begin streaming with lower delay. Ideally:
fetchStart → headers should be closer to sub-second or clearly explained by measured backend spans
firstSseLine → firstAssistantChunk should not take multiple seconds unless a specific tool/model span is responsible
The goal is not to guess a fix, but to identify the largest measured backend/tool span before submitting a backend optimization PR.
Screenshots
N/A. Timing logs are more useful than screenshots for this issue.
user ID (can we access the user info to validate the bug?):
Can provide privately if needed. I would prefer not to post the user ID publicly in the issue.
Smartphone + device (please complete the following information):
- Device: Windows desktop
- OS: Windows 11
- Browser: Electron desktop app
- App Version: local dev build from BasedHardware/omi
- Device version: Windows desktop
Additional context
Related PR: #8008
PR #8008 fixes a confirmed Windows desktop KG/main-thread bottleneck by moving KG graph writes off the Electron main thread. Before that fix, the renderer could wait over 1s on kg:queryNodes even when the actual SQLite query was only a few milliseconds, indicating main-thread queueing/blocking.
This new issue is for the remaining /v2/messages backend/tool latency.
It would help if a maintainer could provide or run sanitized backend timing traces for:
request_received
auth / rate_limit
subscription / quota checks
get_chat_session
add_message_to_chat_session
add_message
get_available_app_by_id
get_messages
StreamingResponse returned
stream.generator_start
prompt_prep
Anthropic first_event
first think event
tool call durations
second Anthropic call
first assistant text
stream done
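To make the request concrete, the trace points above could be emitted with something as small as the following. This is a sketch only: `SpanRecorder` is a hypothetical helper, not existing backend code, and the span names are simply the labels from the list above. It records names and durations only, so the output stays sanitized.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


class SpanRecorder:
    """Collects (name, duration_ms) pairs for one request."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.spans: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Recorded even if the wrapped code raises.
            self.spans.append((name, (self._clock() - start) * 1000.0))

    def largest(self) -> Tuple[str, float]:
        """Return the slowest span, which is the one worth optimizing first."""
        return max(self.spans, key=lambda s: s[1])

    def report(self) -> str:
        """One log line with names and durations only (no user data)."""
        return " ".join(f"{name}={ms:.1f}ms" for name, ms in self.spans)
```

Each step would be wrapped as `with rec.span("get_chat_session"): ...`, and `rec.report()` logged once at stream done. Spans inside the streaming generator (prompt_prep, Anthropic first_event, tool calls) would need the recorder passed into the generator, since they run after the response object has been returned.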
Questions:
- Is there a staging backend endpoint contributors can point the desktop app to?
- Can a maintainer run a trace-only backend branch and share sanitized /v2/messages timing logs?
- Does /v2/messages return response headers before the generator starts, or are headers effectively delayed until first body bytes are yielded?
- Are quota/subscription checks known to be a meaningful part of chat TTFB?
- Are conversation/memory tool calls expected to be serial, or would parallelizing independent read-only tools be acceptable?
- Would it be acceptable to remove write-read-back dependencies before stream start, for example by appending the current user message in memory while persisting it separately?
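To illustrate the serial-versus-parallel question, here is a minimal sketch using `asyncio.gather`. The tool names and `fake_tool` are stand-ins invented for the example; they are not the real conversation/memory tools, and this only applies if those tools are independent and read-only.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict

ToolFn = Callable[[], Awaitable[str]]


async def fake_tool(name: str, delay_s: float) -> str:
    """Hypothetical read-only tool; the sleep stands in for I/O latency."""
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def run_serial(tools: Dict[str, ToolFn]) -> Dict[str, str]:
    # Each tool waits for the previous one: total time is the sum of latencies.
    return {name: await fn() for name, fn in tools.items()}


async def run_parallel(tools: Dict[str, ToolFn]) -> Dict[str, str]:
    # Independent tools run concurrently: total time is roughly the slowest one.
    results = await asyncio.gather(*(fn() for fn in tools.values()))
    return dict(zip(tools.keys(), results))


if __name__ == "__main__":
    tools: Dict[str, ToolFn] = {
        "conversations": lambda: fake_tool("conversations", 0.05),
        "memories": lambda: fake_tool("memories", 0.05),
    }
    for runner in (run_serial, run_parallel):
        t0 = time.perf_counter()
        asyncio.run(runner(tools))
        print(runner.__name__, f"{(time.perf_counter() - t0) * 1000:.0f}ms")
```

Whether this helps in practice depends on the measured tool call durations. If one tool dominates, parallelizing the rest saves little.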