fix(sse): capture web-search answers, strip citation markers, avoid premature stop#3
fix(sse): capture web-search answers, strip citation markers, avoid premature stop#3Leslielu wants to merge 1 commit into
Conversation
…premature stop
Web-search ("finance"/"search" augmented) responses were returning empty or
truncated content, and citation markers leaked into the output. Three distinct
SSE-parsing bugs were involved:
1. Batch JSON Patch with no top-level path was dropped. The search/finance
answer text streams as {"o":"patch","v":[{"p":"/message/content/parts/0",
"o":"append","v":"..."}]} and a continuation form {"v":[{...}]} with only a
top-level "v" list. Neither was handled, so the answer text was discarded.
Added a branch that processes any top-level "v" list of sub-patches (non
content-part items are filtered by the path check).
2. Premature stream termination on search. Any /message/status ->
finished_successfully patch emitted finish_reason="stop". The web-search
tool message reports finished_successfully *before* the answer is generated,
so in streaming mode the consumer broke early and closed the upstream
connection before the answer streamed (no [DONE]). Non-streaming tolerated it
by draining the whole stream. Now we track the current message's
role/recipient/content_type and only treat finished_successfully as terminal
for the visible assistant answer (role=assistant, recipient=all, text), via
_term_finish().
3. Citation markers leaked. ChatGPT wraps inline citations in private-use-area
control chars (citeturn0finance0), surfacing as stray
"citeturn0finance0" text. Added strip_citation_markers() (non-streaming) and
a stateful CitationStripper (streaming, safe across split chunks) handling
both the PUA-delimited and bare-text forms, with no false positives.
All fixes applied to both the sync and async parsers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
📝 WalkthroughWalkthrough本 PR 为 ChatGPT 流式输出引入引用和导航标记清理功能。在 SSE 流解析层新增 PUA 正则匹配器、增量清理类和动态终止判断逻辑,同时扩展消息提取以处理新型 JSON Patch 格式,最后在客户端中应用清理功能到汇总和流式增量。 Changes引用标记清理与 SSE 流解析
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@app/chatgpt/sse.py`:
- Around line 192-219: The batch JSON Patch handling uses stale context
variables cur_role/cur_recipient/cur_ct when calling _term_finish(), which can
misattribute finish_reason for patches belonging to other messages; update the
block that yields a finished message to derive role/recipient/ct from the patch
itself (e.g., from data fields or from the per-key accumulator entry in
accumulators keyed by conv_id:msg_id) and pass those extracted values into
_term_finish() (fall back to empty/defaults only if the patch/accumulator lacks
them); reference the variables/functions accumulators, key, data, sub_patch,
_term_finish, ChatMessage and ensure the finish_reason is computed using the
patch-scoped context rather than cur_role/cur_recipient/cur_ct.
- Around line 38-62: CitationStripper.feed currently only strips PUA markers and
then runs _CITATION_TEXT_RE.sub on the incoming chunk, which misses bare-text
citation tokens split across chunks; add a small rolling buffer on the
CitationStripper instance (e.g., self._tail_buf) and in feed prepend that buffer
to the current chunk before processing with _CITATION_TEXT_RE and the PUA logic,
then after processing keep only the final N characters (N = max possible
citation token length, derived from the regex for _CITATION_TEXT_RE) in
self._tail_buf for the next call; ensure CitationStripper.feed still returns
only the cleaned output for the current chunk (excluding the kept tail) and
update/reset the buffer when markers are closed or fully consumed, and note
strip_citation_markers remains the downstream full-text fallback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 7c77a436-e7af-4a5c-9d47-807cbf10d310
📒 Files selected for processing (2)
app/chatgpt/client.pyapp/chatgpt/sse.py
| class CitationStripper: | ||
| """Stateful stripper for streamed deltas: drops PUA-delimited citation | ||
| markers even when a marker is split across chunks, plus the bare text form.""" | ||
|
|
||
| def __init__(self) -> None: | ||
| self._in_marker = False | ||
|
|
||
| def feed(self, chunk: str) -> str: | ||
| if not chunk: | ||
| return chunk | ||
| if self._in_marker or "\ue200" in chunk: | ||
| out = [] | ||
| for ch in chunk: | ||
| if self._in_marker: | ||
| if ch == "\ue201": | ||
| self._in_marker = False | ||
| continue | ||
| if ch == "\ue200": | ||
| self._in_marker = True | ||
| continue | ||
| if "\ue200" <= ch <= "\ue20f": | ||
| continue | ||
| out.append(ch) | ||
| chunk = "".join(out) | ||
| return _CITATION_TEXT_RE.sub("", chunk) |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
裸文本引用标记在跨 chunk 分割时可能无法被清理。
CitationStripper.feed() 在字符处理后调用 _CITATION_TEXT_RE.sub("", chunk),但如果裸文本引用标记(如 citeturn0finance0)恰好被分割在两个 chunk 之间(如 chunk1="cite", chunk2="turn0finance0"),则两个 chunk 都不会匹配正则,导致泄漏。
不过根据下游代码(client.py:446),非流式场景会对聚合后的完整内容再次调用 strip_citation_markers(),可以兜底清理。如果裸文本分割确实在生产中出现,可考虑在 CitationStripper 中增加一个小的 buffer 来处理跨 chunk 的文本匹配。
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/chatgpt/sse.py` around lines 38 - 62, CitationStripper.feed currently
only strips PUA markers and then runs _CITATION_TEXT_RE.sub on the incoming
chunk, which misses bare-text citation tokens split across chunks; add a small
rolling buffer on the CitationStripper instance (e.g., self._tail_buf) and in
feed prepend that buffer to the current chunk before processing with
_CITATION_TEXT_RE and the PUA logic, then after processing keep only the final N
characters (N = max possible citation token length, derived from the regex for
_CITATION_TEXT_RE) in self._tail_buf for the next call; ensure
CitationStripper.feed still returns only the cleaned output for the current
chunk (excluding the kept tail) and update/reset the buffer when markers are
closed or fully consumed, and note strip_citation_markers remains the downstream
full-text fallback.
| # --- Batch JSON Patch with no top-level path --- | ||
| # Two forms, both used by the web-search / finance answer flow: | ||
| # {"o": "patch", "v": [{"p": "/message/content/parts/0", "o": "append", "v": "..."}]} | ||
| # {"v": [{"p": "/message/content/parts/0", "o": "append", "v": "..."}]} (continuation) | ||
| # Non-content list items (e.g. search_result_group) are filtered by the path check below. | ||
| if "p" not in data and isinstance(data.get("v"), list): | ||
| conv_id = data.get("conversation_id", "") | ||
| msg_id = data.get("message_id", "") | ||
| key = f"{conv_id}:{msg_id}" | ||
| if key not in accumulators: | ||
| accumulators[key] = {"text": "", "message_id": msg_id, "conversation_id": conv_id} | ||
| for sub_patch in data["v"]: | ||
| if not isinstance(sub_patch, dict): | ||
| continue | ||
| sub_val = sub_patch.get("v") | ||
| sub_path = sub_patch.get("p", "") | ||
| if isinstance(sub_val, str) and "/content/parts/" in sub_path: | ||
| accumulators[key]["text"] += sub_val | ||
| yield ChatMessage( | ||
| message_id=msg_id, conversation_id=conv_id, | ||
| role="assistant", content=sub_val, | ||
| ) | ||
| elif sub_path == "/message/status" and sub_val == "finished_successfully": | ||
| yield ChatMessage( | ||
| message_id=msg_id, conversation_id=conv_id, | ||
| role="assistant", content="", finish_reason=_term_finish(cur_role, cur_recipient, cur_ct), | ||
| ) | ||
| continue |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial | 💤 Low value
批量 JSON Patch 块中的消息上下文可能与当前消息不匹配。
在批量 JSON Patch 处理块中(lines 192-219),cur_role/cur_recipient/cur_ct 用于计算 _term_finish(),但这些变量是由之前处理的完整消息更新设置的。如果批量 patch 来自不同的消息(例如在 tool 消息之后),可能会导致 _term_finish() 基于过期的上下文做出错误判断。
不过由于 _term_finish() 对空字符串默认返回 "stop",且批量 patch 通常紧随其对应的消息上下文,实际影响可能有限。如果后续发现流式场景有异常终止,可考虑从 batch patch 本身提取或验证消息身份。
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@app/chatgpt/sse.py` around lines 192 - 219, The batch JSON Patch handling
uses stale context variables cur_role/cur_recipient/cur_ct when calling
_term_finish(), which can misattribute finish_reason for patches belonging to
other messages; update the block that yields a finished message to derive
role/recipient/ct from the patch itself (e.g., from data fields or from the
per-key accumulator entry in accumulators keyed by conv_id:msg_id) and pass
those extracted values into _term_finish() (fall back to empty/defaults only if
the patch/accumulator lacks them); reference the variables/functions
accumulators, key, data, sub_patch, _term_finish, ChatMessage and ensure the
finish_reason is computed using the patch-scoped context rather than
cur_role/cur_recipient/cur_ct.
背景
启用联网(web search / finance)的回答出现两类问题:正文为空或被截断,以及引用标记泄漏到输出(如
citeturn0finance0)。排查后发现是 SSE 解析里的三个独立 bug。三个根因 & 修复
1. 无顶层 path 的批量补丁被丢弃(空/截断)
搜索回答的正文以批量补丁形式下发,有两种形态:
{"o":"patch","v":[{"p":"/message/content/parts/0","o":"append","v":"..."}]} {"v":[{"p":"/message/content/parts/0","o":"append","v":"..."}]} // 连续形式,仅顶层 v原解析器两种都没处理,正文被直接丢弃。新增分支:处理任何「顶层
v是 sub-patch 列表」的事件(非正文项如search_result_group由/content/parts/路径判断过滤掉)。2. 搜索场景下流式被提前掐断
解析器对任何
/message/status → finished_successfully都发finish_reason="stop"。但搜索工具消息会先于正文 finish——流式消费端收到这个早到的stop就提前 break,在正文生成前关闭了上游连接(抓包里根本没有[DONE])。非流式因为会把整个流读完所以不受影响。修复:跟踪「当前消息」的
role/recipient/content_type,只有当它是用户可见的助手回答(role=assistant、recipient=all、text)时才把finished_successfully当作终止(新增_term_finish())。3. 引用标记泄漏
ChatGPT 用私有区控制符包裹内联引用(
citeturn0finance0),渲染出来就是裸的citeturn0finance0。新增:strip_citation_markers():非流式在正文汇总处清洗;CitationStripper:流式有状态清洗,跨分片安全;以上修复 sync 与 async 两套解析器都已覆盖。
验证
stop、无 PUA / 无 cite 残留如需要我可以补单元测试。
Summary by CodeRabbit
发布说明