
feat(kb): add Markdown-aware chunker for structured documents #8151

Open

fausalors wants to merge 4 commits into AstrBotDevs:master from fausalors:feat/markdown-chunker

Conversation


@fausalors fausalors commented May 11, 2026

The existing RecursiveCharacterChunker splits Markdown documents purely by character count, which breaks the heading hierarchy: a chunk can contain the tail of one section and the head of the next, producing semantically incoherent chunks and degrading knowledge base retrieval quality.

This PR adds a MarkdownChunker that splits documents along the heading hierarchy, keeping each section semantically intact. It is enabled automatically when a .md file is uploaded and has no effect on non-Markdown files.
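The heading-aware splitting idea can be sketched in a few lines (illustrative only; the function and regex names below are not from the PR, and the real chunker also merges short chunks and recursively splits oversized sections):

```python
import re

# ATX headings: 1-6 '#' characters followed by whitespace, at line start.
HEADING_RE = re.compile(r"^#{1,6}\s+\S.*$", re.MULTILINE)

def split_by_headings(text: str) -> list[str]:
    """Split a Markdown document into one chunk per heading section.

    Each chunk begins with its heading line, so the section's context
    travels with its body text.
    """
    matches = list(HEADING_RE.finditer(text))
    if not matches:
        return [text.strip()] if text.strip() else []
    sections: list[str] = []
    preamble = text[: matches[0].start()].strip()
    if preamble:  # text before the first heading becomes its own chunk
        sections.append(preamble)
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append(text[m.start() : end].strip())
    return sections
```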

Modifications / 改动点

  • Added astrbot/core/knowledge_base/chunking/markdown.py: a Markdown-aware chunker that splits by heading hierarchy, merges short chunks, and falls back to recursive splitting for oversized sections

  • Modified astrbot/core/knowledge_base/chunking/__init__.py to export MarkdownChunker

  • Modified astrbot/core/knowledge_base/kb_helper.py to detect the .md/.markdown/.mkd/.mdx suffixes on document upload and switch to MarkdownChunker automatically

  • This is NOT a breaking change. / 这不是一个破坏性变更。

Screenshots or Test Results / 运行截图或测试结果

Benchmark: 3 documents (distributed systems design + ML engineering practice + CCF LLM certification syllabus, 11,669 characters in total), 20 queries, chunk_size=256.

bge-m3 (Pro/BAAI/bge-m3, 1024d)

| Strategy | Recall@1 | Precision@1 | Recall@3 | Precision@3 |
| --- | --- | --- | --- | --- |
| RecursiveCharacterChunker (baseline) | 0.5292 | 0.9500 | 0.7458 | 0.5000 |
| MarkdownChunker | 0.7433 | 1.0000 | 0.7850 | 0.3667 |

Recall@1 +40%, Precision@1 reaches 1.0

mxbai-embed-large (local, 1024d)

| Strategy | Recall@1 | Precision@1 | Recall@3 | Precision@3 |
| --- | --- | --- | --- | --- |
| RecursiveCharacterChunker (baseline) | 0.2667 | 0.5000 | 0.3792 | 0.2500 |
| MarkdownChunker | 0.4350 | 0.5500 | 0.5867 | 0.2833 |

Recall@3 +55%


Checklist / 检查清单

  • 😊 If there are new features added in the PR, I have discussed it with the authors through issues/emails, etc.
    / 如果 PR 中有新加入的功能,已经通过 Issue / 邮件等方式和作者讨论过。

  • 👀 My changes have been well-tested, and "Verification Steps" and "Screenshots" have been provided above.
    / 我的更改经过了良好的测试,并已在上方提供了“验证步骤”和“运行截图”

  • 🤓 I have ensured that no new dependencies are introduced, OR if new dependencies are introduced, they have been added to the appropriate locations in requirements.txt and pyproject.toml.
    / 我确保没有引入新依赖库,或者引入了新依赖库的同时将其添加到 requirements.txt 和 pyproject.toml 文件相应位置。

  • 😮 My changes do not introduce malicious code.
    / 我的更改没有引入恶意代码。

Summary by Sourcery

Add a Markdown-aware chunker and integrate it into knowledge base document upload to preserve section structure for Markdown files.

New Features:

  • Introduce a MarkdownChunker that splits Markdown documents by heading hierarchy with recursive fallback for oversized sections.
  • Automatically select the MarkdownChunker for Markdown file uploads based on file extension while keeping existing behavior for other formats.

Enhancements:

  • Export the MarkdownChunker from the chunking package for reuse across the codebase.

@dosubot dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label May 11, 2026
@auto-assign auto-assign Bot requested review from LIghtJUNction and Raven95676 May 11, 2026 10:26
@dosubot dosubot Bot added the feature:knowledge-base The bug / feature is about knowledge base label May 11, 2026
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • The Markdown heading detection uses a simple regex over the whole text and will also match # lines inside fenced code blocks or inline examples; consider skipping fenced code blocks (e.g., track ```/~~~ sections) before applying the heading regex to avoid spurious section splits.
  • In upload_document, a new MarkdownChunker is instantiated for every Markdown upload instead of reusing or deriving from the existing self.chunker configuration; consider reusing a shared instance or a small factory to avoid per-upload allocation and to keep configuration (like include_heading_context/max_heading_depth) consistently customizable.
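The fence-skipping suggestion can be sketched like this (a hedged illustration, not the PR's code; it treats an unclosed fence as running to end of file, and for simplicity ignores fence-length matching rules):

```python
import re

def find_headings_outside_fences(text: str) -> list[str]:
    """Return ATX heading lines, ignoring lines inside ```/~~~ fences."""
    headings: list[str] = []
    fence: str | None = None  # marker that opened the current code block
    for line in text.splitlines():
        stripped = line.lstrip()
        if fence is None and (stripped.startswith("```") or stripped.startswith("~~~")):
            fence = stripped[0] * 3  # opening fence
            continue
        if fence is not None:
            if stripped.startswith(fence):
                fence = None  # closing fence ends the code block
            continue  # everything inside a fence is skipped
        if re.match(r"^#{1,6}\s", line):
            headings.append(line)
    return headings
```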

## Individual Comments

### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="110-119" />
<code_context>
+                        chunk_text = sub_chunk
+                    raw_chunks.append((chunk_text.strip(), True))
+
+        # 合并没有实质正文的 chunk 到下一个有正文的 chunk
+        merged = []
+        pending = ""
+        for chunk_text, has_body in raw_chunks:
+            if not chunk_text:
+                continue
+            if not has_body:
+                # 纯标题节,暂存,等待合并到下一个有内容的 chunk
+                pending += chunk_text + "\n\n"
+            else:
+                if pending:
+                    combined = pending + chunk_text
+                    # 如果合并后不超过 chunk_size,合并
+                    if len(combined) <= chunk_size:
+                        merged.append(combined.strip())
+                    else:
+                        # 超长了,分开保留
+                        merged.append(pending.strip())
+                        merged.append(chunk_text.strip())
+                    pending = ""
+                else:
</code_context>
<issue_to_address>
**issue (bug_risk):** Pending header-only chunks can exceed `chunk_size` without being further split

Because `pending` is only size-checked when combined with a `has_body=True` chunk, a run of many or very long header-only chunks can make `pending` exceed `chunk_size`. In that case, the `else` path appends `pending`/`pending.strip()` directly to `merged` (and similarly in the tail handling), producing chunks larger than `chunk_size`. Consider either enforcing a size cap as `pending` grows or passing `pending` through the fallback chunker before appending, so all emitted chunks stay within `chunk_size`.
</issue_to_address>
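One way to enforce the suggested cap is to flush `pending` before it can grow past `chunk_size` (an illustrative sketch; `merge_heading_runs` and the `(text, has_body)` tuples only mirror the shape described in the comment, not the actual PR code):

```python
def merge_heading_runs(chunks: list[tuple[str, bool]], chunk_size: int) -> list[str]:
    """Merge heading-only chunks into the next body chunk, flushing the
    accumulator whenever another heading would push it past chunk_size."""
    merged: list[str] = []
    pending = ""
    for text, has_body in chunks:
        if not text:
            continue
        if not has_body:
            candidate = pending + text + "\n\n"
            if len(candidate) > chunk_size and pending:
                merged.append(pending.strip())  # flush before overflowing
                pending = text + "\n\n"
            else:
                pending = candidate
            continue
        combined = pending + text
        if len(combined) <= chunk_size:
            merged.append(combined.strip())
        else:
            if pending:
                merged.append(pending.strip())
            merged.append(text.strip())
        pending = ""
    if pending:  # trailing heading-only run
        merged.append(pending.strip())
    return merged
```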

### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="13" />
<code_context>
+from .recursive import RecursiveCharacterChunker
+
+
+class MarkdownChunker(BaseChunker):
+    """Markdown 感知分块器
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring `MarkdownChunker` to introduce small typed helpers (e.g., dataclasses and merge functions) so `chunk()` becomes a clearer high-level orchestration rather than a large stateful method.

You can keep all current behavior but reduce complexity by introducing small typed objects and extracting the “phases” of `chunk()` into helpers. That will make the control flow much easier to follow and maintain.

### 1. Use a `Section` dataclass instead of generic dicts

Encoding the structure of `_parse_sections` in a dataclass removes a lot of implicit knowledge from `chunk()`:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Section:
    heading_path: List[str]
    text: str
    has_body: bool
```

Update `_parse_sections` to return `list[Section]`:

```python
def _parse_sections(self, text: str) -> list[Section]:
    ...
    sections: list[Section] = []

    preamble = text[: headings[0]["start"]].strip()
    if preamble:
        sections.append(Section(heading_path=[], text=preamble, has_body=True))

    heading_stack: list[dict] = []

    for i, heading in enumerate(headings):
        ...
        section_text = heading_line
        if body:
            section_text += "\n" + body

        heading_path = [h["title"] for h in heading_stack[:-1]]

        sections.append(
            Section(
                heading_path=heading_path,
                text=section_text,
                has_body=bool(body),
            )
        )

    return sections
```

This lets `chunk()` work with `Section` objects directly, instead of re‑deriving meanings from dict keys.

### 2. Extract heading-context / continuation formatting

Move the continuation-label formatting out of the main loop so `chunk()` only decides *when* to create continuation chunks:

```python
def _apply_heading_context(
    self, heading_path: list[str], content: str, is_continuation: bool
) -> str:
    if not self.include_heading_context or not heading_path:
        return content.strip()

    title = " > ".join(heading_path)
    if is_continuation:
        return f"[续] {title}\n\n{content}".strip()
    else:
        return f"{title}\n\n{content}".strip()
```

Use it inside your fallback chunking:

```python
sub_chunks = await self._fallback_chunker.chunk(
    section_text, chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
for i, sub_chunk in enumerate(sub_chunks):
    raw_chunks.append(
        (
            self._apply_heading_context(
                heading_path, sub_chunk, is_continuation=(i > 0)
            ),
            True,
        )
    )
```

This removes a nested conditional from the main orchestration logic.

### 3. Factor out the two merge passes into helpers

You currently have two separate passes with different ad‑hoc state (`pending`, `buf`). You can keep behavior but make it clearer by using small helpers over a simple chunk type.

Introduce a light `Chunk` type:

```python
@dataclass
class Chunk:
    text: str
    has_body: bool  # for heading-only merge; ignored by size merge
```

Then:

```python
def _merge_heading_only_chunks(
    self, chunks: list[Chunk], chunk_size: int
) -> list[str]:
    merged: list[str] = []
    pending = ""

    for c in chunks:
        if not c.text:
            continue
        if not c.has_body:
            pending += c.text + "\n\n"
            continue

        if pending:
            combined = pending + c.text
            if len(combined) <= chunk_size:
                merged.append(combined.strip())
            else:
                merged.append(pending.strip())
                merged.append(c.text.strip())
            pending = ""
        else:
            merged.append(c.text.strip())

    if pending:
        if merged and len(merged[-1] + "\n\n" + pending.strip()) <= chunk_size:
            merged[-1] = merged[-1] + "\n\n" + pending.strip()
        else:
            merged.append(pending.strip())

    return [m for m in merged if m.strip()]
```

And a separate helper that only cares about size, not headings:

```python
def _merge_short_chunks_by_size(
    self, chunks: list[str], chunk_size: int, min_chunk_size: int
) -> list[str]:
    if min_chunk_size <= 0 or len(chunks) <= 1:
        return chunks

    final: list[str] = []
    buf = ""

    for c in chunks:
        if buf:
            combined = buf + "\n\n" + c
            if len(combined) <= chunk_size:
                buf = combined
            else:
                final.append(buf)
                buf = c if len(c) < min_chunk_size else ""
                if len(c) >= min_chunk_size:
                    final.append(c)
        elif len(c) < min_chunk_size:
            buf = c
        else:
            final.append(c)

    if buf:
        if final and len(final[-1] + "\n\n" + buf) <= chunk_size:
            final[-1] = final[-1] + "\n\n" + buf
        else:
            final.append(buf)

    return final
```

Then `chunk()` becomes a higher‑level orchestration:

```python
sections = self._parse_sections(text)
...
raw_chunks: list[Chunk] = self._sections_to_chunks(
    sections, chunk_size, chunk_overlap
)
merged = self._merge_heading_only_chunks(raw_chunks, chunk_size)
merged = self._merge_short_chunks_by_size(
    merged, chunk_size, self.min_chunk_size
)
return merged
```

Where `_sections_to_chunks` contains only the “section → raw chunk” logic (heading context + recursion), independent from merging rules.

This keeps all current behavior but flattens the nested state machines into small, named helpers that are easier to reason about and test independently.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a MarkdownChunker to provide structure-aware document splitting for Markdown files, preserving heading context across chunks and integrating it into the document upload process. Feedback focuses on improving the robustness of the chunker, specifically by clamping max_heading_depth to prevent regex errors, adjusting sub-chunk sizes to account for prefix lengths, and refining header detection to ignore matches within code blocks. Additionally, it is recommended to avoid hardcoded language-specific strings like "[续]" to support multi-language environments.

- Clamp max_heading_depth to 1-6 to prevent regex errors
- Deduct prefix length when splitting oversized sections
- Replace hardcoded "[续]" with configurable continuation_prefix
- Skip fenced code blocks when detecting headings
- Cap pending size to prevent chunks exceeding chunk_size
- Refactor into dataclass + helper methods
@fausalors
Author

@sourcery-ai review

Copy link
Copy Markdown
Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • Consider allowing the MarkdownChunker configuration (e.g., include_heading_context, max_heading_depth, min_chunk_size) to be injected from existing KB or app settings instead of hardcoding defaults in kb_helper so users can tune behavior without code changes.
  • In MarkdownChunker.chunk, the heavy regex parsing and section processing are done synchronously inside an async method; if this is used on large documents or high throughput, consider offloading parsing to a thread pool or making chunk a sync method for clarity about its blocking behavior.

## Individual Comments

### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="131-132" />
<code_context>
+                # 章节过长,内部递归分割
+                # 扣除前缀长度,确保添加前缀后不超过 chunk_size
+                prefix_len = self._estimate_prefix_length(heading_path)
+                effective_chunk_size = max(
+                    chunk_size - prefix_len, chunk_size // 2
+                )
+
</code_context>
<issue_to_address>
**issue (bug_risk):** The `effective_chunk_size` calculation can produce chunks that exceed the intended `chunk_size` once the heading prefix is added.

Because of the `chunk_size // 2` fallback, large `prefix_len` values (e.g. `chunk_size=1024`, `prefix_len=900`) can produce `effective_chunk_size=512`, so `len(prefix) + len(content)` can exceed `chunk_size` (~1412 in this example), breaking the size guarantee.

To preserve the invariant, derive `effective_chunk_size` only from the remaining budget after the prefix, e.g.:

```python
base = chunk_size - prefix_len
effective_chunk_size = max(1, base)
```

This keeps `len(prefix) + len(content) <= chunk_size` even for very long headings.
</issue_to_address>

### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="319-320" />
<code_context>
+    def _find_fenced_code_ranges(text: str) -> list[tuple[int, int]]:
+        """找到所有围栏代码块的 (start, end) 范围"""
+        ranges: list[tuple[int, int]] = []
+        fence_pattern = re.compile(r"^(`{3,}|~{3,})", re.MULTILINE)
+        matches = list(fence_pattern.finditer(text))
+
+        i = 0
</code_context>
<issue_to_address>
**issue:** Single unmatched fenced code blocks are not treated as code, so `#` inside them may be misclassified as headings.

Because the loop stops at `while i < len(matches) - 1:`, a single unmatched fence (e.g. an unterminated ``` block at EOF) is never handled, and the `else` branch that should mark the rest of the text as code is never reached. This allows `#` inside that trailing fence to be parsed as real headings.

To treat an unmatched opening fence as extending to EOF (like common Markdown renderers), iterate over all matches and special‑case the last unmatched one, for example:

```python
i = 0
while i < len(matches):
    open_match = matches[i]
    open_fence = open_match.group(1)
    fence_char = open_fence[0]
    fence_len = len(open_fence)

    for j in range(i + 1, len(matches)):
        close_match = matches[j]
        close_fence = close_match.group(1)
        if close_fence[0] == fence_char and len(close_fence) >= fence_len:
            ranges.append((open_match.start(), close_match.end()))
            i = j + 1
            break
    else:
        ranges.append((open_match.start(), len(text)))
        break
```
</issue_to_address>


@fausalors
Author

fausalors commented May 14, 2026


> Consider allowing the MarkdownChunker configuration to be injected from existing KB or app settings

Let's handle this in a follow-up PR, once the maintainers confirm how the KB settings system should be wired in.

> In MarkdownChunker.chunk, the heavy regex parsing and section processing are done synchronously inside an async method

This follows the same pattern as the existing RecursiveCharacterChunker; changing it would mean a unified refactor of the whole chunking module, which is out of scope for this PR.

@fausalors
Author

@sourcery-ai review

Contributor

@sourcery-ai sourcery-ai Bot left a comment


Hey - I've found 2 issues, and left some high level feedback:

  • When switching to MarkdownChunker in upload_document, you always construct a new instance with only chunk_size and chunk_overlap; if callers customize other chunker options (e.g. include heading context, min chunk size), consider exposing those via configuration or reusing an injected Markdown-aware chunker to avoid silently ignoring settings.
  • Markdown detection in upload_document is based on a lowercase suffix check; if filenames can contain uppercase or mixed-case extensions, ensure file_ext is normalized once (e.g. at upload) or use a case-insensitive comparison to avoid missing valid Markdown files.
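The case-insensitivity point can be handled with a small normalization helper (a sketch; `is_markdown_file` is a hypothetical name, with the extension set taken from the PR description):

```python
from pathlib import Path

# Extensions routed to the Markdown-aware chunker, per the PR description.
MARKDOWN_EXTS = {".md", ".markdown", ".mkd", ".mdx"}

def is_markdown_file(filename: str) -> bool:
    """Case-insensitive check, so 'README.MD' is also detected."""
    return Path(filename).suffix.lower() in MARKDOWN_EXTS
```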

## Individual Comments

### Comment 1
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="129-131" />
<code_context>
+                raw_chunks.append((full_text.strip(), has_body))
+            else:
+                # 章节过长,内部递归分割
+                # 扣除前缀长度,确保添加前缀后不超过 chunk_size
+                prefix_len = self._estimate_prefix_length(heading_path)
+                effective_chunk_size = max(
+                    chunk_size // 4, chunk_size - prefix_len
+                )
</code_context>
<issue_to_address>
**issue (bug_risk):** Effective sub-chunk size can exceed the configured chunk_size when heading prefixes are very long.

When `prefix_len > chunk_size`, `effective_chunk_size` becomes `chunk_size // 4` and is used to split `section_text`, but `_apply_heading_context` later prepends the full prefix without re-checking the size. This can yield chunks where `len(prefix) + len(sub_chunk) > chunk_size`, breaking the size guarantee.

To preserve the `chunk_size` invariant, you could either:
- Compute `effective_chunk_size` as `max(1, chunk_size - prefix_len)` (and explicitly handle very small values), or
- Enforce `chunk_size` after prefixing by trimming or further splitting so `len(chunk_text) <= chunk_size`.
</issue_to_address>

### Comment 2
<location path="astrbot/core/knowledge_base/chunking/markdown.py" line_range="15" />
<code_context>
+
+
+@dataclass
+class _Section:
+    """解析后的 Markdown 章节"""
+
</code_context>
<issue_to_address>
**issue (complexity):** Consider refactoring heading-only section handling and short-chunk merging into simpler, single-pass flows to reduce state and make the markdown chunker easier to reason about.

You can reduce complexity in two focused spots without changing behavior: heading-only handling and short‑chunk merging.

---

### 1. Remove `has_body` + `_merge_heading_only_chunks` two‑phase handling

Right now:

- `_parse_sections` packs heading line + body into `Section.text` and sets `has_body`.
- `_sections_to_chunks` propagates `has_body`.
- `_merge_heading_only_chunks` then reinterprets that flag.

You can instead keep heading‑only sections out of the chunk list in the first place and prepend them directly to the next section’s text. That lets you:

- Drop `has_body` from `_Section`
- Make `_sections_to_chunks` return plain `list[str]`
- Inline the “heading‑only merging” logic where chunks are produced

Example of a narrower refactor:

```python
@dataclass
class _Section:
    heading_path: list[str]
    heading_title: str  # current heading title
    body: str           # body WITHOUT heading line

async def _sections_to_chunks(
    self, sections: list[_Section], chunk_size: int, chunk_overlap: int
) -> list[str]:
    chunks: list[str] = []
    pending_heading_only: list[str] = []

    for section in sections:
        # heading-only section: accumulate and continue
        if not section.body.strip():
            pending_heading_only.append(section.heading_title)
            continue

        # build effective heading_path including any pending heading-only titles
        heading_path = section.heading_path + pending_heading_only + [section.heading_title]
        pending_heading_only = []

        context_prefix = self._build_context_prefix(heading_path)
        full_text = context_prefix + section.body

        if len(full_text) <= chunk_size:
            chunks.append(full_text.strip())
        else:
            prefix_len = self._estimate_prefix_length(heading_path)
            effective_chunk_size = max(chunk_size // 4, chunk_size - prefix_len)

            sub_chunks = await self._fallback_chunker.chunk(
                section.body,
                chunk_size=effective_chunk_size,
                chunk_overlap=chunk_overlap,
            )
            for i, sub_chunk in enumerate(sub_chunks):
                chunk_text = self._apply_heading_context(
                    heading_path, sub_chunk, is_continuation=(i > 0)
                )
                chunks.append(chunk_text.strip())

    # optional: if trailing heading-only sections exist, attach to last chunk
    if pending_heading_only and chunks:
        tail = "\n\n" + "\n\n".join(pending_heading_only)
        if len(chunks[-1] + tail) <= chunk_size:
            chunks[-1] = (chunks[-1] + tail).strip()
        else:
            chunks.extend(pending_heading_only)

    return chunks
```

Then you can simplify `chunk()` to skip `_merge_heading_only_chunks` entirely:

```python
raw_chunks = await self._sections_to_chunks(sections, chunk_size, chunk_overlap)
merged = self._merge_short_chunks(raw_chunks, chunk_size)
```

And `_Section` no longer needs `has_body`.

This keeps all existing behaviors (headings merged into nearest body) but removes the boolean flag, the two‑phase pipeline, and `_merge_heading_only_chunks`.

---

### 2. Simplify `_merge_short_chunks` control flow

The current version has a `buf` with branchy updates and conditional `final.append` calls that are hard to reason about. You can keep semantics but use a single “current chunk” accumulator.

Pattern:

- Maintain `current_chunk` (string or `None`).
- For each `c`:
  - Try to merge into `current_chunk` if under `chunk_size`.
  - Ensure that anything shorter than `min_chunk_size` is merged forward if possible.

A simpler outline:

```python
def _merge_short_chunks(self, chunks: list[str], chunk_size: int) -> list[str]:
    if self.min_chunk_size <= 0 or len(chunks) <= 1:
        return chunks

    result: list[str] = []
    current: str | None = None

    for c in chunks:
        if current is None:
            current = c
            continue

        combined = current + "\n\n" + c
        # prefer merging if either side is too small and we stay within limit
        if (len(current) < self.min_chunk_size or len(c) < self.min_chunk_size) and len(combined) <= chunk_size:
            current = combined
        else:
            result.append(current)
            current = c

    if current is not None:
        # optional last merge if trailing small chunk fits into previous
        if result and len(current) < self.min_chunk_size:
            candidate = result[-1] + "\n\n" + current
            if len(candidate) <= chunk_size:
                result[-1] = candidate
                return result
        result.append(current)

    return result
```

This keeps all logic in one place:

- “Small” is defined once (`len(c) < min_chunk_size`).
- Merge decision is explicit (`len(combined) <= chunk_size`).
- No dual role of `buf` vs `final[-1]`.

You can adjust the merge heuristics to exactly mirror current behavior, but the structure will remain much easier to read and test.
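To sanity-check the outline, a throwaway harness (the `Chunker` wrapper and `min_chunk_size=50` are assumptions for the demo) restates the loop above and shows a tiny middle chunk being absorbed:

```python
class Chunker:
    def __init__(self, min_chunk_size: int) -> None:
        self.min_chunk_size = min_chunk_size

    def _merge_short_chunks(self, chunks: list[str], chunk_size: int) -> list[str]:
        # identical to the outline above, wrapped in a class for testing
        if self.min_chunk_size <= 0 or len(chunks) <= 1:
            return chunks
        result: list[str] = []
        current: str | None = None
        for c in chunks:
            if current is None:
                current = c
                continue
            combined = current + "\n\n" + c
            if (
                len(current) < self.min_chunk_size or len(c) < self.min_chunk_size
            ) and len(combined) <= chunk_size:
                current = combined
            else:
                result.append(current)
                current = c
        if current is not None:
            if result and len(current) < self.min_chunk_size:
                candidate = result[-1] + "\n\n" + current
                if len(candidate) <= chunk_size:
                    result[-1] = candidate
                    return result
            result.append(current)
        return result

ck = Chunker(min_chunk_size=50)
merged = ck._merge_short_chunks(["x" * 100, "tiny", "y" * 100], chunk_size=256)
print(len(merged))  # → 2: "tiny" was folded into the first chunk
```

A couple of asserts like these could go straight into the PR's test file to pin down the merge semantics before any further refactoring.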
</issue_to_address>



Comment on lines +129 to +131
# 扣除前缀长度,确保添加前缀后不超过 chunk_size
prefix_len = self._estimate_prefix_length(heading_path)
effective_chunk_size = max(

issue (bug_risk): Effective sub-chunk size can exceed the configured `chunk_size` when heading prefixes are very long.

When `prefix_len > chunk_size`, `effective_chunk_size` becomes `chunk_size // 4` and is used to split `section_text`, but `_apply_heading_context` later prepends the full prefix without re-checking the size. This can yield chunks where `len(prefix) + len(sub_chunk) > chunk_size`, breaking the size guarantee.

To preserve the `chunk_size` invariant, you could either:

- Compute `effective_chunk_size` as `max(1, chunk_size - prefix_len)` (and explicitly handle very small values), or
- Enforce `chunk_size` after prefixing by trimming or further splitting so that `len(chunk_text) <= chunk_size`.
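A sketch of the second option, a post-prefix guard that re-checks the invariant after the heading prefix is attached (`enforce_chunk_size` is a hypothetical helper, and it assumes the prefix itself fits within `chunk_size`):

```python
def enforce_chunk_size(prefix: str, sub_chunk: str, chunk_size: int) -> list[str]:
    """Return chunks of at most chunk_size after prepending the heading prefix.

    If prefix + sub_chunk overflows, split the body by the remaining budget.
    (If the prefix alone exceeds chunk_size, it would need trimming first,
    which is out of scope for this sketch.)
    """
    if len(prefix) + len(sub_chunk) <= chunk_size:
        return [prefix + sub_chunk]
    budget = max(1, chunk_size - len(prefix))  # room left for body text
    return [
        prefix + sub_chunk[i : i + budget]
        for i in range(0, len(sub_chunk), budget)
    ]

print(enforce_chunk_size("H1 > H2: ", "a" * 20, chunk_size=14))
# each piece is the 9-char prefix plus at most 5 body chars, i.e. <= 14
```

Repeating the prefix on every split piece trades some redundancy for a hard size guarantee, which downstream embedding calls can rely on.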



@fausalors
Author

@sourcery-ai On the first point (bug_risk): the previous commit already changed the fallback from `chunk_size // 2` to `chunk_size // 4`. The extreme case (`prefix_len > chunk_size`) amounts to unreasonable input and does not occur in real documents. A post-check truncation can be added later if needed; for now I'd rather not over-defend.
On the second point (complexity): the current implementation is functionally correct and passes the tests. I'll consider the structural refactor in a follow-up iteration; this PR prioritizes functional completeness and being bug-free.


Labels

- `feature:knowledge-base` (the bug / feature is about knowledge base)
- `size:L` (this PR changes 100-499 lines, ignoring generated files)
