
Profile remaining /v2/messages backend latency after Windows KG main-thread fix (#8008) #8010

@karthikyeluripati


Describe the bug
After the Windows desktop KG/main-thread bottleneck fix in PR #8008, local KG context retrieval appears fast, but chat still has significant latency that seems to come from the backend/tool path rather than the desktop KG path.

Observed frontend timings:

fetchStart → headers: ~1.8s–2.6s
firstSseLine → firstAssistantChunk: ~1.9s–3.3s on tool-heavy prompts

PR #8008 addresses a separate confirmed desktop-side issue where KG graph writes could block Electron main-thread IPC. After that fix, KG query/status timings are consistently low, so the remaining delay appears to be elsewhere in /v2/messages.

To Reproduce
Steps to reproduce the behavior:

  1. Run the Windows desktop app.

  2. Open the chat / floating ask overlay.

  3. Ask a simple prompt such as:

    • What do you see?
  4. Ask a tool-heavy/context prompt such as:

    • What did I just discuss?
  5. Measure frontend timing around /v2/messages, especially:

    • request fetchStart
    • response headers
    • firstSseLine
    • firstAssistantChunk
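For anyone reproducing these numbers outside the Electron renderer, the markers in step 5 can be turned into the two intervals reported above with a small helper. This is only a sketch: `measure_stream`, the `is_assistant_chunk` predicate, and the result key names are hypothetical, and the real /v2/messages stream format is not assumed here, so the caller has to supply the rule that recognizes an assistant chunk.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    lines: Iterable[str],
    is_assistant_chunk: Callable[[str], bool],
    fetch_start: float,
    headers_at: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Consume streamed lines and report the intervals between timing markers.

    fetch_start and headers_at are timestamps the caller recorded with the
    same clock: just before sending the request and when headers arrived.
    """
    first_line_at: Optional[float] = None
    first_chunk_at: Optional[float] = None

    for line in lines:
        if not line.strip():
            continue  # blank lines only separate events; they are not markers
        now = clock()
        if first_line_at is None:
            first_line_at = now  # firstSseLine
        if is_assistant_chunk(line):
            first_chunk_at = now  # firstAssistantChunk
            break

    def delta_ms(start: Optional[float], end: Optional[float]) -> Optional[float]:
        if start is None or end is None:
            return None
        return (end - start) * 1000.0

    return {
        "fetchStart_to_headers_ms": delta_ms(fetch_start, headers_at),
        "headers_to_firstSseLine_ms": delta_ms(headers_at, first_line_at),
        "firstSseLine_to_firstAssistantChunk_ms": delta_ms(first_line_at, first_chunk_at),
    }
```

Passing the line iterator of any streaming HTTP client as `lines` should work, as long as `fetch_start` and `headers_at` come from the same clock as the `clock` argument.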

Current behavior
The local KG path is fast after PR #8008, but /v2/messages still shows backend-looking latency:

fetchStart → headers: ~1.8s–2.6s
firstSseLine → firstAssistantChunk: ~1.9s–3.3s on tool-heavy prompts

From code inspection, possible areas include:

auth / rate limit / quota / subscription
get_chat_session
add_message_to_chat_session
add_message
get_available_app_by_id
get_messages(limit=10)
StreamingResponse returned
prompt prep
Anthropic first_event
conversation / memory tool calls
second Anthropic call
first assistant text

For context, the KG path after PR #8008 appears healthy:

kg queryNodes: ~0.5–2.7ms
db.queryKgNodes: ~0.3–2.0ms
kg status: ~0.1–0.5ms

So this issue is likely separate from the desktop KG bottleneck.

Expected behavior
After local desktop context is available quickly, the chat backend should begin streaming with lower delay. Ideally:

fetchStart → headers should be closer to sub-second or clearly explained by measured backend spans
firstSseLine → firstAssistantChunk should not take multiple seconds unless a specific tool/model span is responsible

The goal is not to guess a fix, but to identify the largest measured backend/tool span before submitting a backend optimization PR.

Screenshots
N/A. Timing logs are more useful than screenshots for this issue.

user ID (can we access the user info to validate the bug?):
Can provide privately if needed. I would prefer not to post the user ID publicly in the issue.

Smartphone + device (please complete the following information):

  • Device: Windows desktop
  • OS: Windows 11
  • Browser: Electron desktop app
  • App Version: local dev build from BasedHardware/omi
  • Device version: Windows desktop

Additional context
Related PR: #8008

PR #8008 fixes a confirmed Windows desktop KG/main-thread bottleneck by moving KG graph writes off the Electron main thread. Before that fix, the renderer could wait over 1s on kg:queryNodes even when the actual SQLite query was only a few milliseconds, indicating main-thread queueing/blocking.

This new issue is for the remaining /v2/messages backend/tool latency.

It would help if a maintainer could provide or run sanitized backend timing traces for:

request_received
auth / rate_limit
subscription / quota checks
get_chat_session
add_message_to_chat_session
add_message
get_available_app_by_id
get_messages
StreamingResponse returned
stream.generator_start
prompt_prep
Anthropic first_event
first think event
tool call durations
second Anthropic call
first assistant text
stream done
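A trace-only branch could collect the spans listed above with something as small as the sketch below. `SpanRecorder` is a hypothetical helper, not existing backend code, and the span names are just the labels from this issue; where each `with` block would go in the real handler is left to whoever knows the code path.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


class SpanRecorder:
    """Collects (name, duration_ms) pairs for one request."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        self.spans: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped code raises.
            self.spans.append((name, (self._clock() - start) * 1000.0))

    def slowest(self) -> Tuple[str, float]:
        """Return the single largest span, which is what this issue asks for."""
        return max(self.spans, key=lambda s: s[1])

    def report(self) -> str:
        """One line per span, largest first, with no user data included."""
        ordered = sorted(self.spans, key=lambda s: s[1], reverse=True)
        return "\n".join(f"{name}: {ms:.1f}ms" for name, ms in ordered)
```

Usage would look like `with rec.span("get_chat_session"): ...` around each step, then logging `rec.report()` at `stream done`. Because the output contains only span names and durations, it should be safe to share as a sanitized trace.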

Questions:

  1. Is there a staging backend endpoint contributors can point the desktop app to?
  2. Can a maintainer run a trace-only backend branch and share sanitized /v2/messages timing logs?
  3. Does /v2/messages return response headers before the generator starts, or are headers effectively delayed until first body bytes are yielded?
  4. Are quota/subscription checks known to be a meaningful part of chat TTFB?
  5. Are conversation/memory tool calls expected to be serial, or would parallelizing independent read-only tools be acceptable?
  6. Would it be acceptable to remove write-read-back dependencies before stream start, for example by appending the current user message in memory while persisting it separately?
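To make question 5 concrete, here is a minimal sketch of serial versus concurrent execution of independent read-only tools using `asyncio.gather`. The tool names and delays are invented stand-ins (`fake_tool` just sleeps), and nothing here claims the current backend runs tools serially; that is exactly what the requested traces would show.

```python
import asyncio
import time
from typing import Awaitable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


async def fake_tool(name: str, delay_s: float) -> str:
    """Stand-in for a read-only tool call (hypothetical; sleeps instead of doing I/O)."""
    await asyncio.sleep(delay_s)
    return f"{name}:ok"


async def run_serial(tools: Sequence[Tuple[str, float]]) -> List[str]:
    # Each call waits for the previous one, so total time is the sum of delays.
    return [await fake_tool(name, delay) for name, delay in tools]


async def run_parallel(tools: Sequence[Tuple[str, float]]) -> List[str]:
    # gather preserves input order, so results line up with the serial version,
    # and total time is roughly the slowest single call.
    return list(await asyncio.gather(*(fake_tool(name, delay) for name, delay in tools)))


async def timed(awaitable: Awaitable[T]) -> Tuple[T, float]:
    """Await something and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await awaitable
    return result, time.perf_counter() - start
```

This only helps when the tools are truly independent and read-only. If one tool's input depends on another's output, or a tool writes state, the serial order would need to stay.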
