fix(serving): unify UI latency/output and stop /no_think echoes #190
Merged
The SquishBar chat and VS Code extension responded far slower than the web UI and leaked raw reasoning/JSON, and models echoed the `/no_think` soft-switch into replies. Make every client behave like the web UI.

- VS Code: add `squish.agentMode` (default off). Plain chat now streams without tools, so the server no longer forces a non-streaming response (fast TTFT); tools are offered only in agent mode. Text-mode tool-call detection is also gated behind agent mode so a JSON-looking reply is never reinterpreted as a tool call.
- SquishBar: strip `<think>…</think>` reasoning from the streamed tokens (matching the web UI, which hides reasoning) and send max_tokens/temperature so plain chat responds predictably.
- Server: disable reasoning via the chat template's `enable_thinking=False` flag instead of injecting a literal `/no_think` string that weaker models parrot back. Add `strip_think_directives()` and apply it on the chat-completions and Ollama-compat responses so `/think`/`/no_think`/`/nothink` never reach any client, whatever the model or endpoint.

Tests: new unit tests for strip_think_directives; new VS Code test for the plain-chat (no-tools) fast path. Full serving suites + 180 extension tests green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WUTL48HAEHhqn9ai76tEHs
The directive/enable_thinking changes pushed server.py from 6087 to 6111 lines, tripping the Wave 120/123-126 line-count guards (ceiling <6100). Condense the added comments and docstring (no logic change) back to 6098.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WUTL48HAEHhqn9ai76tEHs
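To make the server-side directive stripping from the first commit concrete, here is a minimal sketch of what a function like `strip_think_directives()` could look like. It is not the code from `serving/tool_calling.py`, which is not shown here: the regex, the `_DIRECTIVE_RE` name, and the whitespace handling are assumptions chosen to illustrate "directive removal + false-positive guards".

```python
import re

# Assumed pattern (not taken from the PR): match /think, /no_think and /nothink
# only when they stand alone. The lookbehind/lookahead reject matches that are
# glued to a word or a path segment, e.g. "/usr/think/data" or "/thinking".
# Leading spaces/tabs are consumed so removal does not leave a double space.
_DIRECTIVE_RE = re.compile(
    r"[ \t]*(?<![\w/])/(?:no_?)?think(?![\w/])",
    re.IGNORECASE,
)


def strip_think_directives(text: str) -> str:
    """Remove soft-switch directives from model output (illustrative sketch)."""
    cleaned, count = _DIRECTIVE_RE.subn("", text)
    # Only trim the edges when something was actually removed, so replies
    # without directives pass through byte-for-byte.
    return cleaned.strip() if count else text


if __name__ == "__main__":
    print(strip_think_directives("/no_think Sure, here you go."))
    print(strip_think_directives("The file lives in /usr/think/data"))
```

A streaming endpoint would need extra care, because a directive can be split across two chunks; this sketch only covers a complete string.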
Summary
The SquishBar chat UI and the VS Code extension were far slower to respond than the web UI on the same model, and they leaked raw reasoning / tool-call JSON into the chat. Separately, models echoed the `/no_think` soft-switch directive into their replies. This makes every client behave like the web UI (fast, streamed, and clean) and guarantees `/no_think`-style directives never reach a client.

Root causes

- VS Code (slow): plain chat requests were sent with `tools`. The server forces `stream=False` whenever tools are present (`server.py`), so the client waited for the entire generation before seeing anything. The web UI's plain chat sends no tools → streams → fast.
- SquishBar (leaked `<think>` blocks): the streaming chat path did no `<think>` handling, dumping the model's chain-of-thought straight into the bubble (the web UI hides it). It also sent no `max_tokens` cap.
- `/no_think` echo: reasoning was disabled by literally injecting the string `/no_think` into the system message; models that don't honour the soft-switch parroted it back.

Fixes

- VS Code: new `squish.agentMode` setting (default off). Off = plain streaming chat (fast TTFT, matches web UI / SquishBar); on = the multi-round tool loop. Text-mode tool-call detection is gated behind agent mode so a JSON-looking reply is never reinterpreted as a tool call.
- SquishBar: `ThinkFilter` strips `<think>…</think>` from the token stream (mirroring the web UI), and the request now sends `max_tokens`/`temperature`.
- Server: reasoning is disabled via the chat template's `enable_thinking=False` flag (`_apply_chat_template`) instead of injecting literal text. New `strip_think_directives()` in `serving/tool_calling.py` is applied on the `/v1/chat/completions` (streaming + non-streaming) and Ollama-compat (`/api/generate`, `/api/chat`) responses, so `/think` / `/no_think` / `/nothink` never reach any client regardless of model or endpoint.

Type of change

Checklist

- `pytest tests/` passes locally (serving + ollama + openai-compat + server-unit suites green; full MLX suite needs Apple Silicon)
- `ruff check` reports no errors on changed files (`ruff format` clean on new files)
- `squish bench` run: N/A; the speedup is the server no longer forcing non-streaming for VS Code and reasoning no longer being dumped/echoed (no kernel-path change). Best validated live on Apple Silicon.

Validation

- `tests/serving/test_strip_think_directives.py` (directive removal + false-positive guards).
- `tsc` compiles clean; 180/180 jest tests pass.

Related issues
🤖 Generated with Claude Code