fix: keep word boundaries when a chunk token repeats the final token by Mubashirrrr · Pull Request #404 · morphik-org/morphik-core

Mubashirrrr · 2026-06-03T09:57:31Z

Bug

RecursiveCharacterTextSplitter (the splitter behind StandardChunker, the default chunker used by MorphikParser for all non-XML/non-video documents) corrupts chunk content by dropping a separator whenever a token happens to equal the value of the last token in the current split.

After str.split(sep) removes the separators, the code re-appends sep to every part except the final one. The guard used a value comparison:

add_part = part + (sep if sep and part != splits[-1] else "")

So for an input like "cat dog bird cat", splitting on " " yields ["cat", "dog", "bird", "cat"]. The first "cat" equals splits[-1] by value, so part != splits[-1] is false even though it is not the final part; its trailing space is dropped and it gets glued onto the next token, producing "catdog". Word boundaries are lost, which degrades embedding/retrieval quality for any document in which an earlier token of a split segment has the same value as that segment's last token.
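To see the value comparison misfire in isolation, here is a minimal standalone sketch. `reattach_by_value` is a hypothetical helper written for this illustration, not the function in morphik-core; it only mirrors the guard quoted above and assumes a non-empty separator.

```python
def reattach_by_value(text: str, sep: str) -> list[str]:
    """Hypothetical stand-in mirroring the buggy guard (assumes sep is non-empty)."""
    splits = text.split(sep)
    # Buggy: the separator is skipped for ANY part whose value equals the last part,
    # not only for the part that actually sits at the last position.
    return [part + (sep if sep and part != splits[-1] else "") for part in splits]


parts = reattach_by_value("cat dog bird cat", " ")
print(parts)           # ['cat', 'dog ', 'bird ', 'cat']
print("".join(parts))  # catdog bird cat
```

The first element comes back as "cat" with no trailing space, which is the missing boundary the reproduction below shows end to end.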

Reproduction

from core.parser.morphik_parser import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=8, chunk_overlap=0, separators=[" ", ""])
chunks = [c.content for c in splitter.split_text("cat dog bird cat")]
print("".join(chunks))   # -> "catdog bird cat"  (BUG: 'cat dog' merged into 'catdog')

Fix

Compare by index instead of value, which is what the (already correct) fallback splitter in core/utils/fast_ops.py::_split_recursive does:

last_index = len(splits) - 1
for index, part in enumerate(splits):
    add_part = part + (sep if sep and index != last_index else "")

Two lines changed; no behavior change for inputs without a duplicated final token (verified: normal text still reconstructs exactly).
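One way to check the "reconstructs exactly" claim is a round-trip property: joining the re-attached parts should reproduce the input. The sketch below uses hypothetical helpers `reattach` and `round_trips` written for this illustration (they are not morphik-core APIs) and assumes a non-empty separator.

```python
def reattach(text: str, sep: str, by_index: bool) -> list[str]:
    """Hypothetical helper: split on a non-empty sep, then re-append it to all but the final part."""
    splits = text.split(sep)
    last_index = len(splits) - 1
    out = []
    for index, part in enumerate(splits):
        if by_index:
            keep_sep = index != last_index   # fixed guard: position-based
        else:
            keep_sep = part != splits[-1]    # old guard: value-based
        out.append(part + (sep if keep_sep else ""))
    return out


def round_trips(text: str, by_index: bool, sep: str = " ") -> bool:
    """True when joining the re-attached parts gives back the original text."""
    return "".join(reattach(text, sep, by_index)) == text


for sample in ["cat dog bird cat", "the quick the", "no repeats here"]:
    print(repr(sample), round_trips(sample, by_index=False), round_trips(sample, by_index=True))
```

In this sketch the index-based guard round-trips every sample, while the value-based guard fails only on the samples where an earlier token repeats the final one.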

Regression test

Added core/tests/unit/test_recursive_text_splitter.py with two tests asserting word boundaries are preserved for repeated tokens. Both fail before the fix ("catdog" / "thequick" appear in the output) and pass after. The tests reuse the existing dependency-stubbing approach from test_video_parser.py so they run without docling/torch.

$ pytest core/tests/unit/test_recursive_text_splitter.py -q
2 passed

🤖 Generated with Claude Code

RecursiveCharacterTextSplitter re-appends the separator stripped by str.split() to every part except the last one. The check used `part != splits[-1]`, comparing by value instead of by index. When an earlier token had the same value as the final token (e.g. "cat dog bird cat"), its trailing separator was dropped and it was merged into the next token ("catdog"), corrupting chunk content and degrading retrieval.

Compare by index instead, matching the already-correct fallback splitter in core/utils/fast_ops.py. Adds regression tests that fail before and pass after the fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

CLAassistant · 2026-06-03T09:57:43Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Mubashirrrr wants to merge 1 commit into morphik-org:main from Mubashirrrr:fix/chunker-duplicate-token-separator
