fix(fts): split ICU tokens on punctuation by Xuanwo · Pull Request #7005 · lance-format/lance

Xuanwo · 2026-05-29T17:59:45Z

ICU became the default FTS tokenizer, but it can keep punctuation inside word segments, whereas the simple tokenizer splits on every non-alphanumeric character. This change keeps ICU's multilingual word segmentation while applying the simple tokenizer's delimiter rule within each segment.

This restores expected matching for terms such as foo_bar and aligns the handling of apostrophes, hyphens, and dots with the simple tokenizer.
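As a rough illustration of the intra-segment rule described above, here is a minimal sketch using only the Rust standard library. It is not the code in rust/lance-tokenizer/src/icu.rs: the helper name `split_segment` is hypothetical, the input is assumed to be a single word segment already produced by ICU, and `char::is_alphanumeric` is assumed to be a reasonable stand-in for the simple tokenizer's delimiter test.

```rust
/// Hypothetical helper (not part of lance): split one word segment on every
/// non-alphanumeric character, the way the PR description says the simple
/// tokenizer behaves. Empty pieces between adjacent delimiters are dropped.
fn split_segment(segment: &str) -> Vec<&str> {
    segment
        // Any character that is not a Unicode letter or digit is a delimiter.
        .split(|c: char| !c.is_alphanumeric())
        // Leading, trailing, or repeated delimiters yield empty pieces; skip them.
        .filter(|piece| !piece.is_empty())
        .collect()
}
```

Under these assumptions, `split_segment("foo_bar")` yields `foo` and `bar`, so a query for either part can match, while a segment with no punctuation passes through unchanged. Because the split runs after segmentation, the multilingual word boundaries that ICU finds are left intact.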

codecov · 2026-05-29T18:40:31Z

Codecov Report

❌ Patch coverage is 97.87234% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-tokenizer/src/icu.rs	97.87%	0 Missing and 1 partial ⚠️

fix(fts): split ICU tokens on punctuation

04bc385

github-actions Bot added the bug Something isn't working label May 29, 2026

Xuanwo closed this May 29, 2026

Xuanwo wants to merge 1 commit into main from xuanwo/split-icu-underscore