fix(fts): split ICU tokens on punctuation#7005

Closed
Xuanwo wants to merge 1 commit into main from xuanwo/split-icu-underscore

Collaborator

@Xuanwo Xuanwo commented May 29, 2026

ICU became the default FTS tokenizer, but it can keep punctuation inside a word segment, whereas the simple tokenizer splits on every non-alphanumeric character. This change keeps ICU's multilingual word segmentation and then applies the same intra-segment delimiter rule as the simple tokenizer to each segment.

This restores the expected matching for terms such as `foo_bar` and aligns the handling of apostrophes, hyphens, and dots with the simple tokenizer.
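The delimiter rule described above can be sketched roughly as follows. This is an illustration, not the code from this PR: the function names `split_segment` and `split_segments` are hypothetical, the ICU word segmentation step is represented only by an input slice of already-segmented words, and "non-alphanumeric" is assumed to mean `!char::is_alphanumeric`.

```rust
/// Hypothetical helper: split one word segment wherever a character is
/// not alphanumeric, dropping the empty pieces left by adjacent delimiters.
/// Assumption: this mirrors the simple tokenizer's delimiter rule.
pub fn split_segment(segment: &str) -> Vec<&str> {
    segment
        .split(|c: char| !c.is_alphanumeric())
        .filter(|part| !part.is_empty())
        .collect()
}

/// Hypothetical helper: apply the delimiter rule to segments that a word
/// segmenter (ICU in the PR, not called here) has already produced.
pub fn split_segments<'a>(segments: &[&'a str]) -> Vec<&'a str> {
    segments.iter().copied().flat_map(split_segment).collect()
}
```

Under these assumptions, a segment such as `foo_bar` yields the tokens `foo` and `bar`, so a query for `bar` can match it, while segments without punctuation pass through unchanged.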

@github-actions github-actions Bot added the bug Something isn't working label May 29, 2026
@Xuanwo Xuanwo closed this May 29, 2026
codecov Bot commented May 29, 2026

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-tokenizer/src/icu.rs | 97.87% | 0 Missing and 1 partial ⚠️ |
