fix: keep word boundaries when a chunk token repeats the final token #404

Open
Mubashirrrr wants to merge 1 commit into morphik-org:main from Mubashirrrr:fix/chunker-duplicate-token-separator

@Mubashirrrr

Bug

RecursiveCharacterTextSplitter (the splitter behind StandardChunker, the default chunker used by MorphikParser for all non-XML/non-video documents) corrupts chunk content by dropping a separator whenever a token happens to equal the value of the last token in the current split.

After str.split(sep) removes the separators, the code re-appends sep to every part except the final one. The guard used a value comparison:

add_part = part + (sep if sep and part != splits[-1] else "")

For an input like "cat dog bird cat", splitting on " " yields ["cat", "dog", "bird", "cat"]. The first "cat" has the same value as splits[-1], so the guard part != splits[-1] is false for it, its trailing space is dropped, and it is glued onto the next token, producing "catdog". Word boundaries are lost, which degrades embedding and retrieval quality for any document in which an earlier token of a split segment equals that segment's last token.
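The failure can be seen in isolation, without the splitter class. The following is a minimal sketch, not the project's code: `rejoin_by_value` is a hypothetical helper that applies the same value-based guard to the output of `str.split`.

```python
def rejoin_by_value(text: str, sep: str) -> list[str]:
    """Re-append `sep` to each part using the value-based guard (buggy).

    Hypothetical stand-in for the loop inside the splitter; only the
    guard expression is taken from the PR description.
    """
    splits = text.split(sep)
    return [part + (sep if sep and part != splits[-1] else "") for part in splits]


parts = rejoin_by_value("cat dog bird cat", " ")
print(parts)           # ['cat', 'dog ', 'bird ', 'cat']
print("".join(parts))  # catdog bird cat
```

The first "cat" compares equal to the last element, so it is treated as if it were the final part and loses its separator.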

Reproduction

from core.parser.morphik_parser import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=8, chunk_overlap=0, separators=[" ", ""])
chunks = [c.content for c in splitter.split_text("cat dog bird cat")]
print("".join(chunks))   # -> "catdog bird cat"  (BUG: 'cat dog' merged into 'catdog')

Fix

Compare by index instead of value, which is what the (already correct) fallback splitter in core/utils/fast_ops.py::_split_recursive does:

last_index = len(splits) - 1
for index, part in enumerate(splits):
    add_part = part + (sep if sep and index != last_index else "")

Two lines changed; no behavior change for inputs without a duplicated final token (verified: normal text still reconstructs exactly).
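A quick way to check the "reconstructs exactly" claim is a round-trip test on the index-based guard. This is a sketch under the assumption that the separator is a non-empty string; `rejoin_by_index` is a hypothetical helper name, not a function in the repository.

```python
def rejoin_by_index(text: str, sep: str) -> list[str]:
    """Re-append `sep` to every part except the one at the last index."""
    splits = text.split(sep)
    last_index = len(splits) - 1
    return [
        part + (sep if sep and index != last_index else "")
        for index, part in enumerate(splits)
    ]


# Inputs with and without a repeated final token, plus a few edge cases
# (consecutive separators, a single token, the empty string).
samples = [
    "cat dog bird cat",
    "a b a b a",
    "no repeats here",
    "double  space",
    "single",
    "",
]

for text in samples:
    # Joining the parts must give back the original text unchanged.
    assert "".join(rejoin_by_index(text, " ")) == text
```

Because the guard depends only on position, the round trip holds regardless of which token values repeat.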

Regression test

Added core/tests/unit/test_recursive_text_splitter.py with two tests asserting word boundaries are preserved for repeated tokens. Both fail before the fix ("catdog" / "thequick" appear in the output) and pass after. The tests reuse the existing dependency-stubbing approach from test_video_parser.py so they run without docling/torch.

$ pytest core/tests/unit/test_recursive_text_splitter.py -q
2 passed
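The test file itself is not shown in this conversation. As an illustration only, tests of this shape could look like the sketch below. Everything in it is an assumption: the helper `rejoin`, the test names, and the input strings are hypothetical, and the sketch exercises a local stand-in rather than the real RecursiveCharacterTextSplitter.

```python
def rejoin(text: str, sep: str, by_index: bool) -> list[str]:
    """Stand-in for the separator re-append loop.

    by_index=True mirrors the fixed guard, by_index=False the original
    value comparison. Assumes a non-empty separator.
    """
    splits = text.split(sep)
    last_index = len(splits) - 1
    out = []
    for index, part in enumerate(splits):
        is_last = (index == last_index) if by_index else (part == splits[-1])
        out.append(part + ("" if is_last else sep))
    return out


def boundaries_preserved(text: str, glued: str, by_index: bool) -> bool:
    """True if the glued form is absent and the text reconstructs exactly."""
    joined = "".join(rejoin(text, " ", by_index))
    return glued not in joined and joined == text


def test_repeated_final_token_keeps_separator():
    assert boundaries_preserved("cat dog bird cat", "catdog", by_index=True)


def test_repeated_leading_word_keeps_separator():
    assert boundaries_preserved("the quick brown the", "thequick", by_index=True)
```

Run against the value-based guard (`by_index=False`), both checks fail, which is the before/after behavior a regression test for this bug needs to show.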

🤖 Generated with Claude Code

RecursiveCharacterTextSplitter re-appends the separator stripped by
str.split() to every part except the last one. The check used
`part != splits[-1]`, comparing by value instead of by index. When an
earlier token had the same value as the final token (e.g. "cat dog bird
cat"), its trailing separator was dropped and it was merged into the next
token ("catdog"), corrupting chunk content and degrading retrieval.

Compare by index instead, matching the already-correct fallback splitter
in core/utils/fast_ops.py.

Adds regression tests that fail before and pass after the fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
