fix: keep word boundaries when a chunk token repeats the final token #404
Open
Mubashirrrr wants to merge 1 commit into
RecursiveCharacterTextSplitter re-appends the separator stripped by
str.split() to every part except the last one. The check used
`part != splits[-1]`, comparing by value instead of by index. When an
earlier token had the same value as the final token (e.g. "cat dog bird
cat"), its trailing separator was dropped and it was merged into the next
token ("catdog"), corrupting chunk content and degrading retrieval.
Compare by index instead, matching the already-correct fallback splitter
in core/utils/fast_ops.py.
Adds regression tests that fail before and pass after the fix.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bug
`RecursiveCharacterTextSplitter` (the splitter behind `StandardChunker`, the default chunker used by `MorphikParser` for all non-XML/non-video documents) corrupts chunk content by dropping a separator whenever a token happens to equal the value of the last token in the current split.
After `str.split(sep)` removes the separators, the code re-appends `sep` to every part except the final one. The guard used a value comparison, `part != splits[-1]`, rather than a position check.
So for an input like `"cat dog bird cat"`, splitting on `" "` yields `["cat", "dog", "bird", "cat"]`. For the first `"cat"`, `part == splits[-1]` is true, so its trailing space is dropped and it gets glued onto the next token, producing `"catdog"`. Word boundaries are lost, which degrades embedding/retrieval quality for any document containing a repeated token equal to the last token of a split segment.
Reproduction
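A minimal sketch of the failure, assuming a hypothetical helper `reattach_by_value` that mirrors the value-based guard described above. It is a stand-in for illustration, not the actual Morphik splitter code:

```python
def reattach_by_value(text: str, sep: str) -> list:
    """Hypothetical stand-in for the buggy guard (not the real Morphik code)."""
    splits = text.split(sep)
    # Bug: every part whose VALUE equals the last part is treated as "the last
    # one", so it does not get its separator back.
    return [part + (sep if part != splits[-1] else "") for part in splits]


pieces = reattach_by_value("cat dog bird cat", " ")
print(pieces)           # ['cat', 'dog ', 'bird ', 'cat']
print("".join(pieces))  # catdog bird cat
```

The first `"cat"` loses its trailing space only because it happens to match the final token; its position in the list plays no role in the check.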
Fix
Compare by index instead of value, which is what the (already correct) fallback splitter in `core/utils/fast_ops.py::_split_recursive` does.
Two lines changed; no behavior change for inputs without a duplicated final token (verified: normal text still reconstructs exactly).
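The index-based guard can be sketched as follows. The helper name `reattach_by_index` is hypothetical and the code only illustrates the idea; the real change lives in the splitter itself and may be shaped differently:

```python
def reattach_by_index(text: str, sep: str) -> list:
    """Hypothetical sketch of the index-based guard (not the real Morphik code)."""
    splits = text.split(sep)
    last = len(splits) - 1
    # Only the part at the final POSITION skips the separator, regardless of
    # whether earlier parts share its value.
    return [part + (sep if i != last else "") for i, part in enumerate(splits)]


# A repeated final token keeps its word boundaries.
repeated = "cat dog bird cat"
assert "".join(reattach_by_index(repeated, " ")) == repeated

# Text without a duplicated final token still reconstructs exactly.
normal = "the quick brown fox"
assert "".join(reattach_by_index(normal, " ")) == normal
```

Because `str.split(sep)` followed by re-appending `sep` to all but the last part is a lossless round trip, joining the pieces back together should always reproduce the input, which makes exact reconstruction a convenient property to assert.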
Regression test
Added `core/tests/unit/test_recursive_text_splitter.py` with two tests asserting word boundaries are preserved for repeated tokens. Both fail before the fix (`"catdog"` / `"thequick"` appear in the output) and pass after. The tests reuse the existing dependency-stubbing approach from `test_video_parser.py` so they run without docling/torch.
🤖 Generated with Claude Code