Skip to content

fix(sse): capture web-search answers, strip citation markers, avoid premature stop#3

Open
Leslielu wants to merge 1 commit into
69gg:mainfrom
Leslielu:fix/sse-search-citation-truncation
Open

fix(sse): capture web-search answers, strip citation markers, avoid premature stop#3
Leslielu wants to merge 1 commit into
69gg:mainfrom
Leslielu:fix/sse-search-citation-truncation

Conversation

@Leslielu

@Leslielu Leslielu commented Jun 4, 2026

Copy link
Copy Markdown

背景

启用联网(web search / finance)的回答出现两类问题:正文为空或被截断,以及引用标记泄漏到输出(如 citeturn0finance0)。排查后发现是 SSE 解析里的三个独立 bug。

三个根因 & 修复

1. 无顶层 path 的批量补丁被丢弃(空/截断)

搜索回答的正文以批量补丁形式下发,有两种形态:

{"o":"patch","v":[{"p":"/message/content/parts/0","o":"append","v":"..."}]}
{"v":[{"p":"/message/content/parts/0","o":"append","v":"..."}]}   // 连续形式,仅顶层 v

原解析器两种都没处理,正文被直接丢弃。新增分支:处理任何「顶层 v 是 sub-patch 列表」的事件(非正文项如 search_result_group/content/parts/ 路径判断过滤掉)。

2. 搜索场景下流式被提前掐断

解析器对任何 /message/status → finished_successfully 都发 finish_reason="stop"。但搜索工具消息会先于正文 finish——流式消费端收到这个早到的 stop 就提前 break,在正文生成前关闭了上游连接(抓包里根本没有 [DONE])。非流式因为会把整个流读完所以不受影响。

修复:跟踪「当前消息」的 role/recipient/content_type,只有当它是用户可见的助手回答(role=assistant、recipient=all、text)时才把 finished_successfully 当作终止(新增 _term_finish())。

3. 引用标记泄漏

ChatGPT 用私有区控制符包裹内联引用(citeturn0finance0),渲染出来就是裸的 citeturn0finance0。新增:

  • strip_citation_markers():非流式在正文汇总处清洗;
  • CitationStripper:流式有状态清洗,跨分片安全;
  • 同时覆盖 PUA 包裹形式与裸文本形式,且无误伤(例如 "I cite my sources ... turning point" 不受影响)。

以上修复 sync 与 async 两套解析器都已覆盖。

验证

场景 结果
流式 + 搜索 ✅ 完整正文、正常 stop、无 PUA / 无 cite 残留
非流式 + 搜索 ✅ 完整、干净
非搜索流式 ✅ 正常终止,无回归

如需要我可以补单元测试。

Summary by CodeRabbit

发布说明

  • Bug 修复
    • 优化了聊天流式输出的清洁度,改进了引用标记处理
    • 增强了流式消息传输的准确性和消息完成判断的可靠性

…premature stop

Web-search ("finance"/"search" augmented) responses were returning empty or
truncated content, and citation markers leaked into the output. Three distinct
SSE-parsing bugs were involved:

1. Batch JSON Patch with no top-level path was dropped. The search/finance
   answer text streams as {"o":"patch","v":[{"p":"/message/content/parts/0",
   "o":"append","v":"..."}]} and a continuation form {"v":[{...}]} with only a
   top-level "v" list. Neither was handled, so the answer text was discarded.
   Added a branch that processes any top-level "v" list of sub-patches (non
   content-part items are filtered by the path check).

2. Premature stream termination on search. Any /message/status ->
   finished_successfully patch emitted finish_reason="stop". The web-search
   tool message reports finished_successfully *before* the answer is generated,
   so in streaming mode the consumer broke early and closed the upstream
   connection before the answer streamed (no [DONE]). Non-streaming tolerated it
   by draining the whole stream. Now we track the current message's
   role/recipient/content_type and only treat finished_successfully as terminal
   for the visible assistant answer (role=assistant, recipient=all, text), via
   _term_finish().

3. Citation markers leaked. ChatGPT wraps inline citations in private-use-area
   control chars (citeturn0finance0), surfacing as stray
   "citeturn0finance0" text. Added strip_citation_markers() (non-streaming) and
   a stateful CitationStripper (streaming, safe across split chunks) handling
   both the PUA-delimited and bare-text forms, with no false positives.

All fixes applied to both the sync and async parsers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

本 PR 为 ChatGPT 流式输出引入引用和导航标记清理功能。在 SSE 流解析层新增 PUA 正则匹配器、增量清理类和动态终止判断逻辑,同时扩展消息提取以处理新型 JSON Patch 格式,最后在客户端中应用清理功能到汇总和流式增量。

Changes

引用标记清理与 SSE 流解析

Layer / File(s) Summary
引用标记清理库
app/chatgpt/sse.py (lines 10–79)
新增 PUA 相关正则、strip_citation_markers() 同步函数和 CitationStripper 类支持跨 chunk 增量清理;新增 _term_finish() 根据消息角色/recipient/内容类型动态计算终止原因。
SSE 流解析:消息上下文与动态终止
app/chatgpt/sse.py (lines 151, 304, 326, 501, 638, 659)
extract_chat_messagesasync_extract_chat_messages 中初始化消息上下文变量(cur_role/cur_recipient/cur_ct),在消息完成事件时用 _term_finish() 计算动态 finish_reason 替代固定 "stop"
SSE 流解析:批量 JSON Patch 处理与上下文赋值
app/chatgpt/sse.py (lines 192–220, 364–368, 538–563, 693–696)
在两处消息提取函数中新增无顶层 path 的批量 Patch 处理,按 path 过滤 "/content/parts/" 文本增量;在消息跳过前赋值上下文变量以支撑后续 finish_reason 计算。
客户端集成:引用标记清理
app/chatgpt/client.py (lines 29, 443–446, 472–479)
导入清理模块,chat() 对汇总内容执行 strip_citation_markers()chat_stream() 为流式增量创建 CitationStripper 实例并过滤空内容。

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 引用标记悄悄隐,
流式增量逐次清,
SSE 调整终止因,
Patch 批量新解析成,
干净输出展眼前!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed 标题清晰准确地总结了本次变更的三个主要修复:修复SSE解析以捕获网络搜索答案、剥离引用标记、避免过早终止流。标题与代码变更内容完全对应。
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/chatgpt/sse.py`:
- Around line 192-219: The batch JSON Patch handling uses stale context
variables cur_role/cur_recipient/cur_ct when calling _term_finish(), which can
misattribute finish_reason for patches belonging to other messages; update the
block that yields a finished message to derive role/recipient/ct from the patch
itself (e.g., from data fields or from the per-key accumulator entry in
accumulators keyed by conv_id:msg_id) and pass those extracted values into
_term_finish() (fall back to empty/defaults only if the patch/accumulator lacks
them); reference the variables/functions accumulators, key, data, sub_patch,
_term_finish, ChatMessage and ensure the finish_reason is computed using the
patch-scoped context rather than cur_role/cur_recipient/cur_ct.
- Around line 38-62: CitationStripper.feed currently only strips PUA markers and
then runs _CITATION_TEXT_RE.sub on the incoming chunk, which misses bare-text
citation tokens split across chunks; add a small rolling buffer on the
CitationStripper instance (e.g., self._tail_buf) and in feed prepend that buffer
to the current chunk before processing with _CITATION_TEXT_RE and the PUA logic,
then after processing keep only the final N characters (N = max possible
citation token length, derived from the regex for _CITATION_TEXT_RE) in
self._tail_buf for the next call; ensure CitationStripper.feed still returns
only the cleaned output for the current chunk (excluding the kept tail) and
update/reset the buffer when markers are closed or fully consumed, and note
strip_citation_markers remains the downstream full-text fallback.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7c77a436-e7af-4a5c-9d47-807cbf10d310

📥 Commits

Reviewing files that changed from the base of the PR and between bd146ee and 4941230.

📒 Files selected for processing (2)
  • app/chatgpt/client.py
  • app/chatgpt/sse.py

Comment thread app/chatgpt/sse.py
Comment on lines +38 to +62
class CitationStripper:
"""Stateful stripper for streamed deltas: drops PUA-delimited citation
markers even when a marker is split across chunks, plus the bare text form."""

def __init__(self) -> None:
self._in_marker = False

def feed(self, chunk: str) -> str:
if not chunk:
return chunk
if self._in_marker or "\ue200" in chunk:
out = []
for ch in chunk:
if self._in_marker:
if ch == "\ue201":
self._in_marker = False
continue
if ch == "\ue200":
self._in_marker = True
continue
if "\ue200" <= ch <= "\ue20f":
continue
out.append(ch)
chunk = "".join(out)
return _CITATION_TEXT_RE.sub("", chunk)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

裸文本引用标记在跨 chunk 分割时可能无法被清理。

CitationStripper.feed() 在字符处理后调用 _CITATION_TEXT_RE.sub("", chunk),但如果裸文本引用标记(如 citeturn0finance0)恰好被分割在两个 chunk 之间(如 chunk1="cite", chunk2="turn0finance0"),则两个 chunk 都不会匹配正则,导致泄漏。

不过根据下游代码(client.py:446),非流式场景会对聚合后的完整内容再次调用 strip_citation_markers(),可以兜底清理。如果裸文本分割确实在生产中出现,可考虑在 CitationStripper 中增加一个小的 buffer 来处理跨 chunk 的文本匹配。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/chatgpt/sse.py` around lines 38 - 62, CitationStripper.feed currently
only strips PUA markers and then runs _CITATION_TEXT_RE.sub on the incoming
chunk, which misses bare-text citation tokens split across chunks; add a small
rolling buffer on the CitationStripper instance (e.g., self._tail_buf) and in
feed prepend that buffer to the current chunk before processing with
_CITATION_TEXT_RE and the PUA logic, then after processing keep only the final N
characters (N = max possible citation token length, derived from the regex for
_CITATION_TEXT_RE) in self._tail_buf for the next call; ensure
CitationStripper.feed still returns only the cleaned output for the current
chunk (excluding the kept tail) and update/reset the buffer when markers are
closed or fully consumed, and note strip_citation_markers remains the downstream
full-text fallback.

Comment thread app/chatgpt/sse.py
Comment on lines +192 to +219
# --- Batch JSON Patch with no top-level path ---
# Two forms, both used by the web-search / finance answer flow:
# {"o": "patch", "v": [{"p": "/message/content/parts/0", "o": "append", "v": "..."}]}
# {"v": [{"p": "/message/content/parts/0", "o": "append", "v": "..."}]} (continuation)
# Non-content list items (e.g. search_result_group) are filtered by the path check below.
if "p" not in data and isinstance(data.get("v"), list):
conv_id = data.get("conversation_id", "")
msg_id = data.get("message_id", "")
key = f"{conv_id}:{msg_id}"
if key not in accumulators:
accumulators[key] = {"text": "", "message_id": msg_id, "conversation_id": conv_id}
for sub_patch in data["v"]:
if not isinstance(sub_patch, dict):
continue
sub_val = sub_patch.get("v")
sub_path = sub_patch.get("p", "")
if isinstance(sub_val, str) and "/content/parts/" in sub_path:
accumulators[key]["text"] += sub_val
yield ChatMessage(
message_id=msg_id, conversation_id=conv_id,
role="assistant", content=sub_val,
)
elif sub_path == "/message/status" and sub_val == "finished_successfully":
yield ChatMessage(
message_id=msg_id, conversation_id=conv_id,
role="assistant", content="", finish_reason=_term_finish(cur_role, cur_recipient, cur_ct),
)
continue

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial | 💤 Low value

批量 JSON Patch 块中的消息上下文可能与当前消息不匹配。

在批量 JSON Patch 处理块中(lines 192-219),cur_role/cur_recipient/cur_ct 用于计算 _term_finish(),但这些变量是由之前处理的完整消息更新设置的。如果批量 patch 来自不同的消息(例如在 tool 消息之后),可能会导致 _term_finish() 基于过期的上下文做出错误判断。

不过由于 _term_finish() 对空字符串默认返回 "stop",且批量 patch 通常紧随其对应的消息上下文,实际影响可能有限。如果后续发现流式场景有异常终止,可考虑从 batch patch 本身提取或验证消息身份。

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@app/chatgpt/sse.py` around lines 192 - 219, The batch JSON Patch handling
uses stale context variables cur_role/cur_recipient/cur_ct when calling
_term_finish(), which can misattribute finish_reason for patches belonging to
other messages; update the block that yields a finished message to derive
role/recipient/ct from the patch itself (e.g., from data fields or from the
per-key accumulator entry in accumulators keyed by conv_id:msg_id) and pass
those extracted values into _term_finish() (fall back to empty/defaults only if
the patch/accumulator lacks them); reference the variables/functions
accumulators, key, data, sub_patch, _term_finish, ChatMessage and ensure the
finish_reason is computed using the patch-scoped context rather than
cur_role/cur_recipient/cur_ct.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant