Segmenter proof of concept #7898

Merged
robertbastian merged 5 commits into unicode-org:main from robertbastian:seg on May 6, 2026

@robertbastian robertbastian commented Apr 21, 2026

Member

This uses the data from unicode-org/unicodetools#1321.

Changelog

N/A

@robertbastian robertbastian force-pushed the seg branch 8 times, most recently from a1732e7 to f55e29a Compare April 27, 2026 11:20
@robertbastian robertbastian added and then removed the discuss-priority label (Discuss at the next ICU4X meeting) Apr 30, 2026
@robertbastian robertbastian force-pushed the seg branch 6 times, most recently from 38268c9 to 29f317b Compare May 5, 2026 13:55
@robertbastian robertbastian marked this pull request as ready for review May 5, 2026 13:55
@robertbastian robertbastian force-pushed the seg branch 2 times, most recently from 4aca56c to 37cf393 Compare May 5, 2026 14:29

@Manishearth Manishearth left a comment

Member

Approving so progress can be made. Cursory code review lgtm, I think we can do a proper code review once this is done being experimented with.

Comment thread components/segmenter/src/provider/mod.rs
icu_provider::data_marker!(
/// `SegmenterBreakLineV2`
SegmenterBreakLineV2,
"segmenter/break/line/v2",

Member

observation: BreakLineV2 sounds a bit strange but it's fine

Member Author

The current one is called SegmenterBreakLineV1. I don't want to invent new names.

// A map from Unicode scalar values to their segmentation classes
#[cfg_attr(feature = "serde", serde(borrow))]
pub classes: CodePointTrie<'data, Class>,
// A dense map of states

Member

for future: eventually we should have detailed docs on what all of these entries really are (map of states from where to where?)
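
For readers unfamiliar with the general technique, here is a minimal sketch of how a per-character class map and a dense state table can drive a segmenter. It is not the data layout or algorithm of this PR: the three classes, the four states, the single break rule (allow a break before the first non-space character after a run of spaces), and the names `class_of`, `toy_table`, and `break_offsets` are all hypothetical, and a plain function stands in for the `CodePointTrie` so the sketch has no dependencies.

```rust
// Hypothetical toy classes; the real data defines its own class set.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Class {
    Letter = 0,
    Space = 1,
    Other = 2,
}

const NUM_CLASSES: usize = 3;
// Toy states: 0 = start, 1 = after letter, 2 = after space, 3 = after other.
const NUM_STATES: usize = 4;
const START: usize = 0;
const AFTER_SPACE: usize = 2;

// Stand-in for the `classes` map (assumption: a function instead of a trie).
fn class_of(c: char) -> Class {
    if c.is_alphanumeric() {
        Class::Letter
    } else if c.is_whitespace() {
        Class::Space
    } else {
        Class::Other
    }
}

// One cell of the dense table: the next state, and whether a break is
// allowed immediately before the character that was just classified.
#[derive(Clone, Copy)]
struct Transition {
    next: u8,
    break_before: bool,
}

// Builds the dense table, flattened row-major: index = state * NUM_CLASSES + class.
fn toy_table() -> Vec<Transition> {
    let mut table = Vec::with_capacity(NUM_STATES * NUM_CLASSES);
    for state in 0..NUM_STATES {
        // Iteration order must match the enum discriminants above.
        for class in [Class::Letter, Class::Space, Class::Other] {
            table.push(Transition {
                // In this toy machine the next state only records the last class seen.
                next: class as u8 + 1,
                break_before: state == AFTER_SPACE && class != Class::Space,
            });
        }
    }
    table
}

// Returns the byte offsets at which the toy rule allows a break.
fn break_offsets(text: &str) -> Vec<usize> {
    let table = toy_table();
    let mut state = START;
    let mut breaks = Vec::new();
    for (offset, c) in text.char_indices() {
        let t = table[state * NUM_CLASSES + class_of(c) as usize];
        if t.break_before {
            breaks.push(offset);
        }
        state = t.next as usize;
    }
    breaks
}
```

With these toy rules, `break_offsets("hello world, ok")` yields `[6, 13]`. The point of the dense layout is that each character costs one class lookup and one table index, with no branching on rule logic at run time.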

@sffc sffc left a comment

Member

Praise: Easy to read, few lines of code. Some thoughts:

  • Final API shape TBD. We probably want LineSegmenterBorrowed.
  • Word and Sentence segmenters have locale-specific tailorings. Do we have a plan to handle those? (OK if the answer is, we'll figure it out later)

@robertbastian robertbastian merged commit f75a9f2 into unicode-org:main May 6, 2026
34 checks passed
@robertbastian robertbastian deleted the seg branch May 6, 2026 09:37

3 participants