fix(sse): capture web-search answers, strip citation markers, avoid premature stop by Leslielu · Pull Request #3 · 69gg/gpt2api

Leslielu · 2026-06-04T09:26:52Z

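Background

Responses with web access enabled (web search / finance) showed two kinds of problems: the body was empty or truncated, and citation markers leaked into the output (e.g. citeturn0finance0). Investigation traced both to three independent bugs in the SSE parser.

Three root causes & fixes

1. Batch patches with no top-level path were dropped (empty/truncated output)

The body of a search answer is delivered as batch patches, which come in two forms:

{"o":"patch","v":[{"p":"/message/content/parts/0","o":"append","v":"..."}]}
{"v":[{"p":"/message/content/parts/0","o":"append","v":"..."}]}   // continuation form, top-level v only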

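The original parser handled neither form, so the body was discarded outright. A new branch handles any event whose top-level v is a list of sub-patches (non-body items such as search_result_group are filtered out by the /content/parts/ path check).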
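As a rough illustration of the filtering rule described above, here is a minimal standalone sketch. The helper name `iter_batch_text` and the metadata path in the sample event are hypothetical and exist only for this example; the PR's actual branch lives inside the parser loop and also maintains accumulators.

```python
import json
from typing import Iterator


def iter_batch_text(event: dict) -> Iterator[str]:
    """Yield answer-text fragments from a batch patch with no top-level path."""
    subs = event.get("v")
    if "p" in event or not isinstance(subs, list):
        return  # not a batch patch; other parser branches would handle it
    for sub in subs:
        if not isinstance(sub, dict):
            continue
        value = sub.get("v")
        path = sub.get("p", "")
        # Only string values aimed at a content part count as answer text.
        if isinstance(value, str) and "/content/parts/" in path:
            yield value


# First form: explicit "o": "patch" wrapper.
full = json.loads(
    '{"o":"patch","v":[{"p":"/message/content/parts/0","o":"append","v":"Hello"}]}'
)
# Continuation form: only a top-level "v". The second sub-patch uses a
# made-up metadata path to stand in for a non-body item.
cont = json.loads(
    '{"v":[{"p":"/message/content/parts/0","o":"append","v":", world"},'
    '{"p":"/message/metadata/example","o":"append","v":[{"type":"search_result_group"}]}]}'
)

answer = "".join(iter_batch_text(full)) + "".join(iter_batch_text(cont))
print(answer)  # Hello, world
```

The non-body sub-patch is skipped twice over here: its path lacks `/content/parts/` and its value is not a string.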

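2. Streaming was cut off early in search scenarios

The parser emitted finish_reason="stop" for any /message/status → finished_successfully. But the search tool message finishes before the answer body does: the streaming consumer received this early stop, broke out of its loop, and closed the upstream connection before the body was generated (the packet capture contains no [DONE] at all). Non-streaming mode was unaffected because it reads the whole stream to the end.

Fix: track the role/recipient/content_type of the "current message", and treat finished_successfully as terminal only when that message is the user-visible assistant answer (role=assistant, recipient=all, text). This is done by the new _term_finish().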
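The decision rule above can be sketched as a small pure function. This is an assumption-laden stand-in, not the PR's `_term_finish()` (whose body is not shown on this page): the name `term_finish`, the `None` return for non-terminal messages, and the `"tool"` role in the usage lines are all illustrative. The empty-context fallback to `"stop"` follows the reviewer's remark later in this thread.

```python
from typing import Optional


def term_finish(role: str, recipient: str, content_type: str) -> Optional[str]:
    """Return "stop" only when the finished message is the visible answer."""
    # With no context at all, keep the old behaviour and stop.
    if not (role or recipient or content_type):
        return "stop"
    is_visible_answer = (
        role == "assistant" and recipient == "all" and content_type == "text"
    )
    # Assumption: None means "not terminal, keep reading the stream".
    return "stop" if is_visible_answer else None


print(term_finish("assistant", "all", "text"))  # stop
print(term_finish("tool", "all", "text"))       # None
```

With a rule like this, a tool message that reports finished_successfully no longer produces a stop, so a streaming consumer that breaks on the first stop keeps reading until the answer itself finishes.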

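3. Citation markers leaked

ChatGPT wraps inline citations in private-use-area (PUA) control characters (citeturn0finance0), which render as a bare citeturn0finance0. Added:

strip_citation_markers(): non-streaming cleanup, applied where the body is aggregated;
CitationStripper: stateful streaming cleanup, safe across chunk boundaries;
Both the PUA-wrapped form and the bare-text form are covered, with no false positives (for example, "I cite my sources ... turning point" is left untouched).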
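For the non-streaming case, a whole-text cleanup might look like the sketch below. It assumes U+E200 opens a marker and U+E201 closes it, as in the `CitationStripper` diff later in this review; the bare-text pattern and the name `strip_markers` are guesses made for illustration, since the PR's actual `_CITATION_TEXT_RE` is not shown here.

```python
import re

# Assumption: U+E200 opens a marker, U+E201 closes it.
_PUA_MARKER_RE = re.compile("\ue200[^\ue201]*\ue201")
# Any leftover control characters from the same PUA range.
_PUA_STRAY_RE = re.compile("[\ue200-\ue20f]")
# Hypothetical pattern for the bare-text form, e.g. "citeturn0finance0".
_BARE_CITE_RE = re.compile(r"cite(?:turn\d+[a-z]+\d+)+")


def strip_markers(text: str) -> str:
    """Remove PUA-wrapped and bare-text citation markers from a full string."""
    text = _PUA_MARKER_RE.sub("", text)
    text = _PUA_STRAY_RE.sub("", text)
    return _BARE_CITE_RE.sub("", text)


wrapped = "Shares rose 2%\ue200cite\ue202turn0finance0\ue201."
bare = "Shares rose 2%citeturn0finance0."
prose = "I cite my sources ... turning point"

print(strip_markers(wrapped))  # Shares rose 2%.
print(strip_markers(bare))     # Shares rose 2%.
print(strip_markers(prose))    # I cite my sources ... turning point
```

Requiring digits directly after `turn` is what keeps ordinary words such as "cite" and "turning" intact.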

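All of the fixes above are applied to both the sync and async parsers.

Verification

Scenario	Result
Streaming + search	✅ Complete body, normal `stop`, no PUA / cite residue
Non-streaming + search	✅ Complete and clean
Streaming without search	✅ Terminates normally, no regression

I can add unit tests if needed.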

Summary by CodeRabbit

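Release notes

Bug fixes
- Cleaner chat streaming output, with improved handling of citation markers
- More accurate streaming message delivery and more reliable detection of message completion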

…premature stop

Web-search ("finance"/"search" augmented) responses were returning empty or truncated content, and citation markers leaked into the output. Three distinct SSE-parsing bugs were involved:

1. Batch JSON Patch with no top-level path was dropped. The search/finance answer text streams as {"o":"patch","v":[{"p":"/message/content/parts/0", "o":"append","v":"..."}]} and a continuation form {"v":[{...}]} with only a top-level "v" list. Neither was handled, so the answer text was discarded. Added a branch that processes any top-level "v" list of sub-patches (non content-part items are filtered by the path check).

2. Premature stream termination on search. Any /message/status -> finished_successfully patch emitted finish_reason="stop". The web-search tool message reports finished_successfully *before* the answer is generated, so in streaming mode the consumer broke early and closed the upstream connection before the answer streamed (no [DONE]). Non-streaming tolerated it by draining the whole stream. Now we track the current message's role/recipient/content_type and only treat finished_successfully as terminal for the visible assistant answer (role=assistant, recipient=all, text), via _term_finish().

3. Citation markers leaked. ChatGPT wraps inline citations in private-use-area control chars (citeturn0finance0), surfacing as stray "citeturn0finance0" text. Added strip_citation_markers() (non-streaming) and a stateful CitationStripper (streaming, safe across split chunks) handling both the PUA-delimited and bare-text forms, with no false positives.

All fixes applied to both the sync and async parsers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

coderabbitai · 2026-06-04T09:27:04Z

📝 Walkthrough

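This PR adds cleanup of citation and navigation markers to ChatGPT streaming output. At the SSE parsing layer it adds PUA regex matchers, an incremental cleanup class, and dynamic termination logic. It also extends message extraction to handle a new JSON Patch format, and applies the cleanup in the client to both aggregated content and streaming deltas.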

Changes

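Citation marker cleanup and SSE stream parsing

Layer / File(s)	Summary
Citation marker cleanup library `app/chatgpt/sse.py` (lines 10–79)	Adds PUA-related regexes, the synchronous `strip_citation_markers()` function, and the `CitationStripper` class for incremental cleanup across chunks; adds `_term_finish()`, which computes the finish reason dynamically from the message role/recipient/content type.
SSE stream parsing: message context and dynamic termination `app/chatgpt/sse.py` (lines 151, 304, 326, 501, 638, 659)	Initializes the message context variables (`cur_role`/`cur_recipient`/`cur_ct`) in `extract_chat_messages` and `async_extract_chat_messages`; on message-completion events, computes a dynamic finish_reason with `_term_finish()` instead of the fixed `"stop"`.
SSE stream parsing: batch JSON Patch handling and context assignment `app/chatgpt/sse.py` (lines 192–220, 364–368, 538–563, 693–696)	Adds handling for batch patches with no top-level path in both message extraction functions, filtering text deltas by the `"/content/parts/"` path; assigns the context variables before a message is skipped so that the later finish_reason computation can use them.
Client integration: citation marker cleanup `app/chatgpt/client.py` (lines 29, 443–446, 472–479)	Imports the cleanup helpers; `chat()` runs `strip_citation_markers()` on the aggregated content, and `chat_stream()` creates a `CitationStripper` instance for streaming deltas and filters out empty content.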

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

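🐰 Citation markers quietly fade,
Streaming deltas cleaned one by one,
SSE tunes its finish reason,
Batch patches parsed anew,
Clean output comes into view!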

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
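Title check	✅ Passed	The title clearly and accurately summarizes the three main fixes in this change: fixing SSE parsing to capture web-search answers, stripping citation markers, and avoiding premature stream termination. The title matches the code changes exactly.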
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@app/chatgpt/sse.py`:
- Around line 192-219: The batch JSON Patch handling uses stale context
variables cur_role/cur_recipient/cur_ct when calling _term_finish(), which can
misattribute finish_reason for patches belonging to other messages; update the
block that yields a finished message to derive role/recipient/ct from the patch
itself (e.g., from data fields or from the per-key accumulator entry in
accumulators keyed by conv_id:msg_id) and pass those extracted values into
_term_finish() (fall back to empty/defaults only if the patch/accumulator lacks
them); reference the variables/functions accumulators, key, data, sub_patch,
_term_finish, ChatMessage and ensure the finish_reason is computed using the
patch-scoped context rather than cur_role/cur_recipient/cur_ct.
- Around line 38-62: CitationStripper.feed currently only strips PUA markers and
then runs _CITATION_TEXT_RE.sub on the incoming chunk, which misses bare-text
citation tokens split across chunks; add a small rolling buffer on the
CitationStripper instance (e.g., self._tail_buf) and in feed prepend that buffer
to the current chunk before processing with _CITATION_TEXT_RE and the PUA logic,
then after processing keep only the final N characters (N = max possible
citation token length, derived from the regex for _CITATION_TEXT_RE) in
self._tail_buf for the next call; ensure CitationStripper.feed still returns
only the cleaned output for the current chunk (excluding the kept tail) and
update/reset the buffer when markers are closed or fully consumed, and note
strip_citation_markers remains the downstream full-text fallback.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7c77a436-e7af-4a5c-9d47-807cbf10d310

📥 Commits

Reviewing files that changed from the base of the PR and between bd146ee and 4941230.

📒 Files selected for processing (2)

app/chatgpt/client.py
app/chatgpt/sse.py

coderabbitai · 2026-06-04T09:32:03Z

+class CitationStripper:
+    """Stateful stripper for streamed deltas: drops PUA-delimited citation
+    markers even when a marker is split across chunks, plus the bare text form."""
+
+    def __init__(self) -> None:
+        self._in_marker = False
+
+    def feed(self, chunk: str) -> str:
+        if not chunk:
+            return chunk
+        if self._in_marker or "\ue200" in chunk:
+            out = []
+            for ch in chunk:
+                if self._in_marker:
+                    if ch == "\ue201":
+                        self._in_marker = False
+                    continue
+                if ch == "\ue200":
+                    self._in_marker = True
+                    continue
+                if "\ue200" <= ch <= "\ue20f":
+                    continue
+                out.append(ch)
+            chunk = "".join(out)
+        return _CITATION_TEXT_RE.sub("", chunk)


🧹 Nitpick | 🔵 Trivial | 💤 Low value

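Bare-text citation markers may not be cleaned when split across chunks.

CitationStripper.feed() calls _CITATION_TEXT_RE.sub("", chunk) after the per-character pass, but if a bare-text citation marker (such as citeturn0finance0) happens to be split between two chunks (e.g. chunk1="cite", chunk2="turn0finance0"), neither chunk matches the regex and the marker leaks.

That said, according to the downstream code (client.py:446), the non-streaming path calls strip_citation_markers() again on the full aggregated content, which serves as a fallback. If bare-text splitting does show up in production, consider adding a small buffer to CitationStripper to handle text matches that span chunks.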
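One way the suggested buffer could work is to hold back any trailing text that might still grow into a marker, rather than keeping a fixed number of characters. The sketch below covers only the bare-text form and would sit after the PUA pass. The class name, the single-group token shape `citeturn<digits><letters><digits>`, and the `flush()` method are assumptions for illustration, not the PR's code.

```python
import re
from typing import Iterable, List

# Hypothetical bare-text token shape, e.g. "citeturn0finance0".
_BARE_CITE_RE = re.compile(r"citeturn\d+[a-z]+\d+")


def _prefix_pattern(parts: List[str]) -> str:
    """Build a regex matching any non-empty prefix of the given sequence."""
    pattern = ""
    for part in reversed(parts):
        pattern = f"{part}(?:{pattern})?" if pattern else part
    return pattern


# Matches a suffix of the text that could still grow into a full token.
_PARTIAL_RE = re.compile(
    _prefix_pattern(list("citeturn") + [r"\d+", r"[a-z]+\d*"]) + r"\Z"
)


class BufferedCitationStripper:
    def __init__(self) -> None:
        self._tail = ""

    def feed(self, chunk: str) -> str:
        text = self._tail + chunk
        match = _PARTIAL_RE.search(text)
        # Hold back a possibly incomplete token until the next chunk arrives.
        self._tail = text[match.start():] if match else ""
        ready = text[:match.start()] if match else text
        return _BARE_CITE_RE.sub("", ready)

    def flush(self) -> str:
        """Release whatever is still held back at end of stream."""
        tail, self._tail = self._tail, ""
        return _BARE_CITE_RE.sub("", tail)


def clean_stream(chunks: Iterable[str]) -> str:
    stripper = BufferedCitationStripper()
    return "".join(stripper.feed(c) for c in chunks) + stripper.flush()


print(clean_stream(["Shares rose 2%cite", "turn0fin", "ance0 today."]))
# Shares rose 2% today.
```

The cost is a little latency: an ordinary word ending in "c", "ci", "cit" or "cite" is delayed by one chunk, and the caller has to call `flush()` when the stream ends.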

coderabbitai · 2026-06-04T09:32:03Z

+        # --- Batch JSON Patch with no top-level path ---
+        # Two forms, both used by the web-search / finance answer flow:
+        #   {"o": "patch", "v": [{"p": "/message/content/parts/0", "o": "append", "v": "..."}]}
+        #   {"v": [{"p": "/message/content/parts/0", "o": "append", "v": "..."}]}   (continuation)
+        # Non-content list items (e.g. search_result_group) are filtered by the path check below.
+        if "p" not in data and isinstance(data.get("v"), list):
+            conv_id = data.get("conversation_id", "")
+            msg_id = data.get("message_id", "")
+            key = f"{conv_id}:{msg_id}"
+            if key not in accumulators:
+                accumulators[key] = {"text": "", "message_id": msg_id, "conversation_id": conv_id}
+            for sub_patch in data["v"]:
+                if not isinstance(sub_patch, dict):
+                    continue
+                sub_val = sub_patch.get("v")
+                sub_path = sub_patch.get("p", "")
+                if isinstance(sub_val, str) and "/content/parts/" in sub_path:
+                    accumulators[key]["text"] += sub_val
+                    yield ChatMessage(
+                        message_id=msg_id, conversation_id=conv_id,
+                        role="assistant", content=sub_val,
+                    )
+                elif sub_path == "/message/status" and sub_val == "finished_successfully":
+                    yield ChatMessage(
+                        message_id=msg_id, conversation_id=conv_id,
+                        role="assistant", content="", finish_reason=_term_finish(cur_role, cur_recipient, cur_ct),
+                    )
+            continue


🧹 Nitpick | 🔵 Trivial | 💤 Low value

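The message context in the batch JSON Patch block may not match the current message.

In the batch JSON Patch handling block (lines 192-219), cur_role/cur_recipient/cur_ct are used to compute _term_finish(), but these variables are set by the previously processed full message update. If the batch patch belongs to a different message (for example, one that follows a tool message), _term_finish() may make the wrong decision based on stale context.

However, because _term_finish() returns "stop" by default for empty strings, and a batch patch usually follows its own message context immediately, the practical impact is probably limited. If abnormal termination is later observed in streaming, consider extracting or verifying the message identity from the batch patch itself.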
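A patch-scoped context could be kept in a small lookup keyed the same way as the accumulators (conv_id:msg_id). The following is only a sketch under that assumption: `MessageContexts`, `decide`, the sample keys, and the `"tool"` role are hypothetical, and `decide` merely stands in for `_term_finish()`, whose real body is not shown in this review. It also assumes batch events actually carry a usable message_id.

```python
from typing import Callable, Dict, Optional, Tuple

Context = Tuple[str, str, str]  # (role, recipient, content_type)
EMPTY: Context = ("", "", "")


class MessageContexts:
    """Remembers each message's context under its conv_id:msg_id key."""

    def __init__(self) -> None:
        self._by_key: Dict[str, Context] = {}

    def remember(self, key: str, role: str, recipient: str, content_type: str) -> None:
        self._by_key[key] = (role, recipient, content_type)

    def finish_reason(
        self, key: str, decide: Callable[[str, str, str], Optional[str]]
    ) -> Optional[str]:
        # Fall back to the empty context when this message was never seen.
        return decide(*self._by_key.get(key, EMPTY))


def decide(role: str, recipient: str, content_type: str) -> Optional[str]:
    # Stand-in for _term_finish(): stop for the visible answer or empty context.
    if not (role or recipient or content_type):
        return "stop"
    visible = (role, recipient, content_type) == ("assistant", "all", "text")
    return "stop" if visible else None


contexts = MessageContexts()
contexts.remember("conv1:tool1", "tool", "all", "text")      # hypothetical tool message
contexts.remember("conv1:ans1", "assistant", "all", "text")  # visible answer

tool_reason = contexts.finish_reason("conv1:tool1", decide)
answer_reason = contexts.finish_reason("conv1:ans1", decide)
print(tool_reason, answer_reason)  # None stop
```

Because the lookup is per message, a finished_successfully patch for the tool message cannot borrow the answer's context, or the other way round.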

Leslielu wants to merge 1 commit into 69gg:main from Leslielu:fix/sse-search-citation-truncation
