Profile remaining /v2/messages backend latency after Windows KG main-thread fix (#8008)

**Describe the bug**
After the Windows desktop KG/main-thread bottleneck fix in PR #8008, local KG context retrieval appears fast, but chat still has significant latency that seems to come from the backend/tool path rather than the desktop KG path.

Observed frontend timings:

```txt
fetchStart → headers: ~1.8s–2.6s
firstSseLine → firstAssistantChunk: ~1.9s–3.3s on tool-heavy prompts
```

PR #8008 addresses a separate confirmed desktop-side issue where KG graph writes could block Electron main-thread IPC. After that fix, KG query/status timings are consistently low, so the remaining delay appears to be elsewhere in `/v2/messages`.

**To Reproduce**
Steps to reproduce the behavior:

1. Run the Windows desktop app.
2. Open the chat / floating ask overlay.
3. Ask a simple prompt such as:

   * `What do you see?`
4. Ask a tool-heavy/context prompt such as:

   * `What did I just discuss?`
5. Measure frontend timing around `/v2/messages`, especially:

   * request `fetchStart`
   * response `headers`
   * `firstSseLine`
   * `firstAssistantChunk`
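For clarity on what the intervals mean, here is a minimal sketch of the arithmetic, written in Python only for illustration (the desktop app would record these marks in the renderer). The helper name `phase_durations`, the `ORDER` list, and the example timestamps are hypothetical and are not measurements from this issue.

```python
from typing import Dict

# The four marks listed above, in the order they are expected to occur.
ORDER = ["fetchStart", "headers", "firstSseLine", "firstAssistantChunk"]


def phase_durations(marks_ms: Dict[str, float]) -> Dict[str, float]:
    """Return elapsed milliseconds between each pair of consecutive marks."""
    durations = {}
    for start, end in zip(ORDER, ORDER[1:]):
        durations[f"{start} -> {end}"] = marks_ms[end] - marks_ms[start]
    return durations


# Illustrative timestamps only (milliseconds since fetchStart).
example = {
    "fetchStart": 0.0,
    "headers": 2000.0,
    "firstSseLine": 2050.0,
    "firstAssistantChunk": 4500.0,
}

for phase, ms in phase_durations(example).items():
    print(f"{phase}: {ms:.0f}ms")
```

Reporting all three consecutive intervals, rather than only the two quoted in this issue, makes it easier to see whether any time is lost between `headers` and `firstSseLine`.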

**Current behavior**
The local KG path is fast after PR #8008, but `/v2/messages` still shows latency that appears to originate in the backend:

```txt
fetchStart → headers: ~1.8s–2.6s
firstSseLine → firstAssistantChunk: ~1.9s–3.3s on tool-heavy prompts
```

From code inspection, possible areas include:

```txt
auth / rate limit / quota / subscription
get_chat_session
add_message_to_chat_session
add_message
get_available_app_by_id
get_messages(limit=10)
StreamingResponse returned
prompt prep
Anthropic first_event
conversation / memory tool calls
second Anthropic call
first assistant text
```

For context, the KG path after PR #8008 appears healthy:

```txt
kg queryNodes: ~0.5–2.7ms
db.queryKgNodes: ~0.3–2.0ms
kg status: ~0.1–0.5ms
```

So this issue is likely separate from the desktop KG bottleneck.

**Expected behavior**
Once local desktop context is available quickly, the chat backend should begin streaming with less delay. Ideally:

```txt
fetchStart → headers should be closer to sub-second or clearly explained by measured backend spans
firstSseLine → firstAssistantChunk should not take multiple seconds unless a specific tool/model span is responsible
```

The goal is not to guess a fix, but to identify the largest measured backend/tool span before submitting a backend optimization PR.

**Screenshots**
N/A. Timing logs are more useful than screenshots for this issue.

**user ID (can we access the user info to validate the bug?):**
Can provide privately if needed. I would prefer not to post the user ID publicly in the issue.

**Smartphone + device (please complete the following information):**

* Device: Windows desktop
* OS: Windows 11
* Browser: Electron desktop app
* App Version: local dev build from BasedHardware/omi
* Device version: Windows desktop

**Additional context**
Related PR: #8008

PR #8008 fixes a confirmed Windows desktop KG/main-thread bottleneck by moving KG graph writes off the Electron main thread. Before that fix, the renderer could wait over 1s on `kg:queryNodes` even when the actual SQLite query took only a few milliseconds, indicating main-thread queueing/blocking.

This new issue is for the remaining `/v2/messages` backend/tool latency.

It would help if a maintainer could provide or run sanitized backend timing traces for:

```txt
request_received
auth / rate_limit
subscription / quota checks
get_chat_session
add_message_to_chat_session
add_message
get_available_app_by_id
get_messages
StreamingResponse returned
stream.generator_start
prompt_prep
Anthropic first_event
first think event
tool call durations
second Anthropic call
first assistant text
stream done
```
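One possible way to collect spans like these is a small per-request recorder, sketched below. This is an assumption about how a trace-only branch could look, not the backend's existing instrumentation: `SpanRecorder` and its methods are hypothetical, and the `time.sleep` calls stand in for the real steps.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator, List, Tuple


class SpanRecorder:
    """Collects (name, duration_ms) pairs for a single request."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock  # injectable so the recorder can be tested
        self.spans: List[Tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = self._clock()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raises.
            self.spans.append((name, (self._clock() - start) * 1000.0))

    def largest(self) -> Tuple[str, float]:
        """Return the slowest recorded span."""
        return max(self.spans, key=lambda item: item[1])

    def report(self) -> str:
        return "\n".join(f"{name}: {ms:.1f}ms" for name, ms in self.spans)


recorder = SpanRecorder()
with recorder.span("get_chat_session"):
    time.sleep(0.01)  # stand-in for the real call
with recorder.span("get_messages"):
    time.sleep(0.03)  # stand-in for the real call
print(recorder.report())
print("largest:", recorder.largest()[0])
```

Because the spans carry only a name and a duration, the resulting log should be safe to share without user content, provided span names never embed IDs or message text.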

Questions:

1. Is there a staging backend endpoint contributors can point the desktop app to?
2. Can a maintainer run a trace-only backend branch and share sanitized `/v2/messages` timing logs?
3. Does `/v2/messages` return response headers before the generator starts, or are headers effectively delayed until first body bytes are yielded?
4. Are quota/subscription checks known to be a meaningful part of chat TTFB?
5. Are conversation/memory tool calls expected to be serial, or would parallelizing independent read-only tools be acceptable?
6. Would it be acceptable to remove write-read-back dependencies before stream start, for example by appending the current user message in memory while persisting it separately?
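To make question 5 concrete, the sketch below contrasts serial and parallel execution of independent read-only tools using `asyncio.gather`. The tool functions, the `TOOLS` mapping, and the 50ms sleeps are hypothetical stand-ins; the real tool implementations may be synchronous or have ordering dependencies, which this sketch does not cover.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Tuple

ToolFn = Callable[[], Awaitable[str]]


async def search_conversations() -> str:
    await asyncio.sleep(0.05)  # stand-in for a read-only lookup
    return "conversations"


async def search_memories() -> str:
    await asyncio.sleep(0.05)  # stand-in for a read-only lookup
    return "memories"


TOOLS: Dict[str, ToolFn] = {
    "conversations": search_conversations,
    "memories": search_memories,
}


async def run_serial(tools: Dict[str, ToolFn]) -> Dict[str, str]:
    """Await each tool in turn; total latency is the sum of all tools."""
    return {name: await fn() for name, fn in tools.items()}


async def run_parallel(tools: Dict[str, ToolFn]) -> Dict[str, str]:
    """Start all tools at once; total latency is roughly the slowest tool."""
    names = list(tools)
    results = await asyncio.gather(*(tools[name]() for name in names))
    return dict(zip(names, results))


def timed(runner, tools: Dict[str, ToolFn]) -> Tuple[Dict[str, str], float]:
    """Run one strategy to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(runner(tools))
    return result, time.perf_counter() - start


serial_result, serial_s = timed(run_serial, TOOLS)
parallel_result, parallel_s = timed(run_parallel, TOOLS)
print(f"serial: {serial_s * 1000:.0f}ms, parallel: {parallel_s * 1000:.0f}ms")
```

`asyncio.gather` only overlaps work that actually awaits; if a tool makes a blocking call, it would need to run in a worker thread (for example via `asyncio.to_thread`) to benefit.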

