Skip to content

fix(serving): unify UI latency/output and stop /no_think echoes#190

Merged
wesleyscholl merged 2 commits into
mainfrom
claude/squish-ui-performance-o11tbo
Jun 30, 2026
Merged

fix(serving): unify UI latency/output and stop /no_think echoes#190
wesleyscholl merged 2 commits into
mainfrom
claude/squish-ui-performance-o11tbo

Conversation

@konjoinfinity

Copy link
Copy Markdown
Collaborator

Summary

The SquishBar chat UI and the VS Code extension were far slower to respond than the web UI on the same model, and they leaked raw reasoning / tool-call JSON into the chat. Separately, models echoed the /no_think soft-switch directive into their replies. This makes every client behave like the web UI — fast, streamed, and clean — and guarantees /no_think-style directives never reach a client.

Root causes

  • VS Code (slow + leaked JSON/think): every message ran through the tool-calling loop, which always sent tools. The server forces stream=False whenever tools are present (server.py), so the client waited for the entire generation before seeing anything. The web UI's plain chat sends no tools → streams → fast.
  • SquishBar (slow + raw <think> blocks): the streaming chat path did no <think> handling, dumping the model's chain-of-thought straight into the bubble (the web UI hides it). It also sent no max_tokens cap.
  • /no_think echo: reasoning was disabled by literally injecting the string /no_think into the system message; models that don't honour the soft-switch parroted it back.

Fixes

  • VS Code: new squish.agentMode setting (default off). Off = plain streaming chat (fast TTFT, matches web UI / SquishBar); on = the multi-round tool loop. Text-mode tool-call detection is gated behind agent mode so a JSON-looking reply is never reinterpreted as a tool call.
  • SquishBar: a streaming ThinkFilter strips <think>…</think> from the token stream (mirroring the web UI), and the request now sends max_tokens/temperature.
  • Server: disable reasoning via the chat template's enable_thinking=False flag (_apply_chat_template) instead of injecting literal text. New strip_think_directives() in serving/tool_calling.py is applied on the /v1/chat/completions (streaming + non-streaming) and Ollama-compat (/api/generate, /api/chat) responses, so /think / /no_think / /nothink never reach any client regardless of model or endpoint.

Type of change

  • Bug fix
  • New feature
  • Performance improvement
  • Documentation update
  • Refactor / cleanup

Checklist

  • pytest tests/ passes locally (serving + ollama + openai-compat + server-unit suites green; full MLX suite needs Apple Silicon)
  • ruff check reports no errors on changed files (ruff format clean on new files)
  • No hardcoded absolute paths
  • No model weights, eval output files, or log files staged for commit
  • Changes are scoped to one logical concern (UI parity + directive leak)
  • Performance-sensitive changes include a before/after squish bench run — N/A; the speedup is the server no longer forcing non-streaming for VS Code and reasoning no longer being dumped/echoed (no kernel-path change). Best validated live on Apple Silicon.

Validation

  • New tests/serving/test_strip_think_directives.py (directive removal + false-positive guards).
  • New VS Code test: plain-chat mode streams without tools.
  • VS Code: tsc compiles clean; 180/180 jest tests pass.

Related issues

🤖 Generated with Claude Code


Generated by Claude Code

claude added 2 commits June 30, 2026 23:16
The SquishBar chat and VS Code extension responded far slower than the web
UI and leaked raw reasoning/JSON, and models echoed the `/no_think`
soft-switch into replies. Make every client behave like the web UI.

- VS Code: add `squish.agentMode` (default off). Plain chat now streams
  without tools, so the server no longer forces a non-streaming response
  (fast TTFT); tools are offered only in agent mode. Text-mode tool-call
  detection is also gated behind agent mode so a JSON-looking reply is never
  reinterpreted as a tool call.
- SquishBar: strip `<think>…</think>` reasoning from the streamed tokens
  (matching the web UI, which hides reasoning) and send max_tokens/temperature
  so plain chat responds predictably.
- Server: disable reasoning via the chat template's `enable_thinking=False`
  flag instead of injecting a literal `/no_think` string that weaker models
  parrot back. Add `strip_think_directives()` and apply it on the
  chat-completions and Ollama-compat responses so `/think`/`/no_think`/
  `/nothink` never reach any client, whatever the model or endpoint.

Tests: new unit tests for strip_think_directives; new VS Code test for the
plain-chat (no-tools) fast path. Full serving suites + 180 extension tests
green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WUTL48HAEHhqn9ai76tEHs
The directive/enable_thinking changes pushed server.py from 6087 to 6111
lines, tripping the Wave 120/123-126 line-count guards (ceiling <6100).
Condense the added comments and docstring — no logic change — back to 6098.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WUTL48HAEHhqn9ai76tEHs
@wesleyscholl wesleyscholl marked this pull request as ready for review June 30, 2026 23:40
@wesleyscholl wesleyscholl merged commit e366708 into main Jun 30, 2026
17 checks passed
@wesleyscholl wesleyscholl deleted the claude/squish-ui-performance-o11tbo branch June 30, 2026 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants