Segmenter proof of concept #7898
Merged
Manishearth
approved these changes
May 5, 2026
Manishearth
left a comment
Member
Approving so progress can be made. A cursory code review LGTM; I think we can do a proper code review once the experimentation is done.
icu_provider::data_marker!(
    /// `SegmenterBreakLineV2`
    SegmenterBreakLineV2,
    "segmenter/break/line/v2",
Member
observation: BreakLineV2 sounds a bit strange but it's fine
Member
Author
The current one is called SegmenterBreakLineV1. I don't want to invent new names.
// A map from Unicode scalar values to their segmentation classes
#[cfg_attr(feature = "serde", serde(borrow))]
pub classes: CodePointTrie<'data, Class>,
// A dense map of states
Member
for future: eventually we should have detailed docs on what all of these entries really are (map of states from where to where?)
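To make the "from where to where" question concrete, here is a minimal sketch of how a class map plus a dense state table can drive segmentation. It is not the layout used in this PR, which is not shown in the thread. Everything in it is an assumption made for illustration: the three-variant `Class` enum, the `class_of` function standing in for the `CodePointTrie` lookup, the three states, the `BREAK` flag bit, and the toy rule "break before the first non-space after a run of spaces".

```rust
/// Hypothetical segmentation classes (the PR's real `Class` type is not shown here).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
enum Class {
    Other = 0,
    Letter = 1,
    Space = 2,
}

const NUM_CLASSES: usize = 3;

/// Stand-in for the trie lookup: maps a Unicode scalar value to its class.
fn class_of(c: char) -> Class {
    if c == ' ' {
        Class::Space
    } else if c.is_alphabetic() {
        Class::Letter
    } else {
        Class::Other
    }
}

/// High bit of a table entry: a break is allowed before the current character.
const BREAK: u8 = 0x80;

/// Dense transition table, indexed by `state * NUM_CLASSES + class`.
/// The low bits of each entry hold the next state.
/// States (hypothetical): 0 = start, 1 = in text, 2 = after space.
const TABLE: [u8; 3 * NUM_CLASSES] = [
    // Other      Letter     Space
    1,            1,         2, // from state 0 (start)
    1,            1,         2, // from state 1 (in text)
    1 | BREAK,    1 | BREAK, 2, // from state 2 (after space)
];

/// Returns the byte offsets of break opportunities, ending with `text.len()`.
fn break_offsets(text: &str) -> Vec<usize> {
    let mut state = 0usize;
    let mut out = Vec::new();
    for (i, c) in text.char_indices() {
        let entry = TABLE[state * NUM_CLASSES + class_of(c) as usize];
        if entry & BREAK != 0 {
            out.push(i);
        }
        state = (entry & !BREAK) as usize;
    }
    out.push(text.len());
    out
}
```

With this toy table, `break_offsets("ab  cd e")` yields `[4, 7, 8]`: a break before `c`, a break before `e`, and the end of the text. Each row answers "from this state, on this class, which state comes next, and may we break here", which is the kind of description the eventual docs could give for the real table.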
sffc
reviewed
May 6, 2026
sffc
left a comment
Member
Praise: Easy to read, few lines of code. Some thoughts:
- Final API shape TBD. We probably want LineSegmenterBorrowed.
- Word and Sentence segmenters have locale-specific tailorings. Do we have a plan to handle those? (OK if the answer is, we'll figure it out later)
sffc
approved these changes
May 6, 2026
This uses the data from unicode-org/unicodetools#1321.
Changelog
N/A