fix(serving): unify UI latency/output and stop /no_think echoes by konjoinfinity · Pull Request #190 · konjoai/squish

konjoinfinity · 2026-06-30T23:17:04Z

Summary

The SquishBar chat UI and the VS Code extension were far slower to respond than the web UI on the same model, and they leaked raw reasoning / tool-call JSON into the chat. Separately, models echoed the /no_think soft-switch directive into their replies. This PR makes every client behave like the web UI (fast, streamed, and clean) and guarantees /no_think-style directives never reach a client.

Root causes

VS Code (slow + leaked JSON/think): every message ran through the tool-calling loop, which always sent tools. The server forces stream=False whenever tools are present (server.py), so the client waited for the entire generation before seeing anything. The web UI's plain chat sends no tools → streams → fast.
SquishBar (slow + raw <think> blocks): the streaming chat path did no <think> handling, dumping the model's chain-of-thought straight into the bubble (the web UI hides it). It also sent no max_tokens cap.
/no_think echo: reasoning was disabled by literally injecting the string /no_think into the system message; models that don't honour the soft-switch parroted it back.
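
To make the VS Code root cause concrete, here is a minimal sketch of the rule described above, in which a non-empty tools list overrides the client's streaming request. The function name `effective_stream` is hypothetical and is not taken from server.py; only the behaviour (tools present means no streaming) comes from this PR.

```python
from typing import Any, Optional


def effective_stream(requested_stream: bool,
                     tools: Optional[list[dict[str, Any]]]) -> bool:
    """Return whether the response would actually be streamed.

    Hypothetical helper: models the described rule that any non-empty
    tools list forces a non-streaming response.
    """
    if tools:
        # Tools present: the whole generation is returned in one piece.
        return False
    return requested_stream


# Plain chat (no tools) keeps streaming; the old VS Code path did not.
plain_chat = effective_stream(True, None)
tool_loop = effective_stream(True, [{"type": "function"}])
```

Under this rule, a client that always attaches tools can never get a streamed reply, which matches the slow time-to-first-token seen in the extension.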

Fixes

VS Code: new squish.agentMode setting (default off). Off = plain streaming chat (fast TTFT, matches web UI / SquishBar); on = the multi-round tool loop. Text-mode tool-call detection is gated behind agent mode so a JSON-looking reply is never reinterpreted as a tool call.
SquishBar: a streaming ThinkFilter strips <think>…</think> from the token stream (mirroring the web UI), and the request now sends max_tokens/temperature.
Server: disable reasoning via the chat template's enable_thinking=False flag (_apply_chat_template) instead of injecting literal text. New strip_think_directives() in serving/tool_calling.py is applied on the /v1/chat/completions (streaming + non-streaming) and Ollama-compat (/api/generate, /api/chat) responses, so /think / /no_think / /nothink never reach any client regardless of model or endpoint.
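
The streaming `<think>` filter described for SquishBar can be sketched as follows. This is an illustrative Python version under stated assumptions, not the SquishBar code: the class layout, the `feed`/`flush` method names, and the helper `_partial_suffix_len` are all hypothetical. The point it shows is that a tag may be split across token chunks, so the filter has to hold back any suffix that could be the start of a tag.

```python
class ThinkFilter:
    """Strip <think>...</think> spans from a stream of text chunks."""

    OPEN = "<think>"
    CLOSE = "</think>"

    def __init__(self) -> None:
        self._buf = ""          # text not yet emitted (may end in a partial tag)
        self._in_think = False  # True while inside a <think> block

    def feed(self, chunk: str) -> str:
        """Consume one chunk and return the text that is safe to display."""
        self._buf += chunk
        out = []
        while self._buf:
            tag = self.CLOSE if self._in_think else self.OPEN
            idx = self._buf.find(tag)
            if idx != -1:
                # Full tag found: emit text before an opening tag,
                # discard text before a closing tag, then toggle state.
                if not self._in_think:
                    out.append(self._buf[:idx])
                self._buf = self._buf[idx + len(tag):]
                self._in_think = not self._in_think
                continue
            # No full tag: hold back a suffix that could be the start of one.
            keep = _partial_suffix_len(self._buf, tag)
            cut = len(self._buf) - keep
            if not self._in_think:
                out.append(self._buf[:cut])
            self._buf = self._buf[cut:]
            break
        return "".join(out)

    def flush(self) -> str:
        """Return held-back visible text at end of stream."""
        rest = "" if self._in_think else self._buf
        self._buf = ""
        return rest


def _partial_suffix_len(text: str, tag: str) -> int:
    """Length of the longest suffix of text that is a proper prefix of tag."""
    for n in range(min(len(text), len(tag) - 1), 0, -1):
        if text.endswith(tag[:n]):
            return n
    return 0


# Tags split across chunk boundaries are still removed.
f = ThinkFilter()
chunks = ["Hi <th", "ink>secret</thi", "nk> there"]
visible = "".join(f.feed(c) for c in chunks) + f.flush()
```

In this sketch an unterminated `<think>` block is dropped at `flush()`; whether the real filter does the same is not stated in the PR.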

Checklist

pytest tests/ passes locally (serving + ollama + openai-compat + server-unit suites green; full MLX suite needs Apple Silicon)
ruff check reports no errors on changed files (ruff format clean on new files)
No hardcoded absolute paths
No model weights, eval output files, or log files staged for commit
Changes are scoped to one logical concern (UI parity + directive leak)
Performance-sensitive changes include a before/after squish bench run — N/A; the speedup is the server no longer forcing non-streaming for VS Code and reasoning no longer being dumped/echoed (no kernel-path change). Best validated live on Apple Silicon.

Validation

New tests/serving/test_strip_think_directives.py (directive removal + false-positive guards).
New VS Code test: plain-chat mode streams without tools.
VS Code: tsc compiles clean; 180/180 jest tests pass.
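
The kind of directive removal and false-positive guard these tests cover might look like the sketch below. It is an assumption-laden illustration, not the `strip_think_directives()` implementation in serving/tool_calling.py: the regular expression and its exact matching rules are hypothetical, chosen so that a standalone `/think`, `/no_think`, or `/nothink` token is removed while look-alikes such as a path segment or the word "/thinking" are left alone.

```python
import re

# Hypothetical pattern: a directive must not be preceded by a word
# character or "/", and must not be followed by one, so path segments
# and longer words are not touched. Trailing spaces/tabs are consumed.
_DIRECTIVE = re.compile(r"(?<![\w/])/(?:no_?think|think)(?![\w/])[ \t]*")


def strip_think_directives(text: str) -> str:
    """Remove standalone /think, /no_think and /nothink tokens from text."""
    return _DIRECTIVE.sub("", text)


# Directive removal.
cleaned = strip_think_directives("/no_think Hello")

# False-positive guards: none of these contain a standalone directive.
untouched = [
    "see docs/think/intro",
    "/thinking aloud",
    "I think so",
]
```

A stricter or looser pattern is a design choice: this one would still strip `/think` from a string like `/think.py`, which the real tests may or may not treat as a false positive.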

🤖 Generated with Claude Code

The SquishBar chat and VS Code extension responded far slower than the web UI and leaked raw reasoning/JSON, and models echoed the `/no_think` soft-switch into replies. Make every client behave like the web UI.

- VS Code: add `squish.agentMode` (default off). Plain chat now streams without tools, so the server no longer forces a non-streaming response (fast TTFT); tools are offered only in agent mode. Text-mode tool-call detection is also gated behind agent mode so a JSON-looking reply is never reinterpreted as a tool call.
- SquishBar: strip `<think>…</think>` reasoning from the streamed tokens (matching the web UI, which hides reasoning) and send max_tokens/temperature so plain chat responds predictably.
- Server: disable reasoning via the chat template's `enable_thinking=False` flag instead of injecting a literal `/no_think` string that weaker models parrot back. Add `strip_think_directives()` and apply it on the chat-completions and Ollama-compat responses so `/think`/`/no_think`/`/nothink` never reach any client, whatever the model or endpoint.

Tests: new unit tests for strip_think_directives; new VS Code test for the plain-chat (no-tools) fast path. Full serving suites + 180 extension tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WUTL48HAEHhqn9ai76tEHs

The directive/enable_thinking changes pushed server.py from 6087 to 6111 lines, tripping the Wave 120/123-126 line-count guards (ceiling <6100). Condense the added comments and docstring (no logic change) back to 6098.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WUTL48HAEHhqn9ai76tEHs

claude added 2 commits June 30, 2026 23:16

wesleyscholl marked this pull request as ready for review June 30, 2026 23:40

wesleyscholl merged commit e366708 into main Jun 30, 2026
17 checks passed

wesleyscholl deleted the claude/squish-ui-performance-o11tbo branch June 30, 2026 23:43
