Update grapheme cluster width calculation #837

Merged: ima1zumi merged 1 commit into ruby:master from tompng:grapheme_cluster_width on Jul 15, 2025

tompng commented on Jul 7, 2025

Implemented rules

Width of a Nonspacing Mark or Enclosing Mark is 0
Width of the character just after a Zero Width Joiner (ZWJ) is 0
Width of a Hangul character with GraphemeClusterBreak=V or T is 0, because it should be preceded by L or LV
Other characters: sum of the East Asian width of each character
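
The four rules above can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the code in this pull request: `rough_east_asian_width`, `WIDE_RANGES`, `HANGUL_V_T` and `sketch_cluster_width` are hypothetical names, the wide-range table is a small hand-picked approximation of East Asian width data, and treating the ZWJ itself as zero-width is an assumption that the rules do not state.

```ruby
# Hypothetical stand-in for an East Asian width lookup.
# A real table would be generated from the Unicode data files.
WIDE_RANGES = [
  0x1100..0x115F,   # Hangul Jamo leading consonants (L)
  0x2E80..0x303E,   # CJK radicals and punctuation
  0x3041..0x33FF,   # Hiragana, Katakana, CJK compatibility
  0x3400..0x4DBF,   # CJK Extension A
  0x4E00..0x9FFF,   # CJK Unified Ideographs
  0xAC00..0xD7A3,   # Hangul syllables (LV, LVT)
  0xF900..0xFAFF,   # CJK compatibility ideographs
  0xFF01..0xFF60,   # Fullwidth forms
  0x1F300..0x1F64F, # Emoji (partial)
  0x1F900..0x1F9FF  # Emoji (partial)
].freeze

def rough_east_asian_width(ord)
  WIDE_RANGES.any? { |range| range.cover?(ord) } ? 2 : 1
end

ZWJ = 0x200D
# Hangul Jamo with GraphemeClusterBreak=V or T (main blocks only, an approximation).
HANGUL_V_T = [0x1160..0x11FF, 0xD7B0..0xD7C6, 0xD7CB..0xD7FB].freeze

def sketch_cluster_width(cluster)
  after_zwj = false
  cluster.each_char.sum do |char|
    ord = char.ord
    if after_zwj
      after_zwj = false
      0 # character just after ZWJ
    elsif ord == ZWJ
      after_zwj = true
      0 # assumption: the joiner itself takes no columns
    elsif char.match?(/\p{Mn}|\p{Me}/) || HANGUL_V_T.any? { |range| range.cover?(ord) }
      0 # Nonspacing Mark, Enclosing Mark, Hangul V or T
    else
      rough_east_asian_width(ord)
    end
  end
end
```

With this sketch, `sketch_cluster_width("e\u0301")` is 1, `sketch_cluster_width("\u1100\u1161\u11A8")` (L+V+T) is 2, and a pair of regional indicators sums to 2.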

Relation with grapheme cluster break rules

GB1 - GB5

Break at the start and end of text, unless the text is empty.
Do not break between a CR and LF. Otherwise, break before and after controls.

Not related to width calculation

GB6, GB7, GB8

Do not break Hangul syllable or other conjoining sequences.
GB6 L × (L | V | LV | LVT)
GB7 (LV | V) × (V | T)
GB8 (LVT | T) × T

LV and LVT are the NFC-normalized forms; L+V and L+V+T are the NFD-normalized forms.
The East Asian width of L, LV, and LVT is 2. L+V and L+V+T should also be 2 columns wide, so V and T are treated as zero-width.
The width of a standalone V or T varies across terminal emulators (0 or 1).
The width of sequences such as L+V+V+T+T, or of an L+V+T that has no glyph, also varies across terminal emulators.
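
As a quick check of the NFC/NFD relationship described above, the following snippet uses only Ruby's built-in `String#unicode_normalize` and `String#grapheme_clusters`. The specific jamo (U+1100, U+1161, U+11A8) were picked for illustration.

```ruby
nfd = "\u1100\u1161\u11A8"        # L + V + T as conjoining jamo
nfc = nfd.unicode_normalize(:nfc) # one precomposed LVT syllable

p nfc.codepoints.map { |cp| format("U+%04X", cp) } # => ["U+AC01"]
p nfd.grapheme_clusters.size                        # => 1
p nfc.grapheme_clusters.size                        # => 1
```

Both forms are a single grapheme cluster and render as the same syllable, which is why V and T have to contribute 0 for the two forms to get the same width.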

GB9, GB9a, GB9b, GB9c

GB9 × (Extend | ZWJ)
GB9a × SpacingMark
GB9b Prepend ×
GB9c \p{InCB=Consonant} [ \p{InCB=Extend} \p{InCB=Linker} ]* \p{InCB=Linker} [ \p{InCB=Extend} \p{InCB=Linker} ]* × \p{InCB=Consonant}

Extend: Nonspacing Mark | Enclosing Mark | Spacing Mark | halfwidth-dakuten | halfwidth-handakuten | Modifier_Symbol | Format
Linker: Nonspacing Mark

Nonspacing Mark and Enclosing Mark: zero-width
Character after ZWJ: zero-width
Other: follow the East Asian width
Nothing is done for Prepend, because its width varies across terminal emulators.

GB11

Do not break within emoji modifier sequences or emoji zwj sequences.
GB11 \p{Extended_Pictographic} Extend* ZWJ × \p{Extended_Pictographic}

Assume that the combined emoji exists, and assume that Extend is a Nonspacing Mark.
Character after ZWJ: zero-width
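
To make the ZWJ rule concrete, here is a small sketch that marks which code points of an emoji ZWJ sequence follow a joiner. The sequence (man, ZWJ, woman, ZWJ, girl) and the variable names are chosen for illustration only.

```ruby
family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}" # man ZWJ woman ZWJ girl
codepoints = family.codepoints

p family.grapheme_clusters.size # => 1
p codepoints.size               # => 5

# true where the code point comes directly after a ZWJ (U+200D)
after_zwj = [false] + codepoints.each_cons(2).map { |prev, _| prev == 0x200D }
p after_zwj # => [false, false, true, false, true]
```

Only the first emoji is left to contribute its East Asian width, so the whole cluster is measured like one wide character, provided the joiners themselves are also counted as zero-width.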

GB12, GB13

Do not break within emoji flag sequences. That is, do not break between regional indicator (RI) symbols if there is an odd number of RI characters before the break point.
GB12 sot (RI RI)* RI × RI
GB13 [^RI] (RI RI)* RI × RI

Flag emoji width varies across terminal emulators and their configurations, so it cannot be handled for now.
The calculated width changes from 1 to 2 in this pull request.
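
The pairing behaviour of GB12 and GB13 can be observed with Ruby's built-in `String#grapheme_clusters`. The two flags below (regional indicators J P and U S) are an arbitrary example.

```ruby
flags = "\u{1F1EF}\u{1F1F5}\u{1F1FA}\u{1F1F8}" # RI J, RI P, RI U, RI S
clusters = flags.grapheme_clusters

p clusters.size                                # => 2
p clusters.map { |cluster| cluster.codepoints.size } # => [2, 2]
```

Each cluster holds two regional indicators. Assuming each indicator counts as width 1, summing per character gives 2 for a flag.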

tompng force-pushed the grapheme_cluster_width branch from 959cbad to 99dc2a3 on July 8, 2025 19:43

ima1zumi left a comment

LGTM, thanks!

ima1zumi merged commit d0f09ee into ruby:master on Jul 15, 2025
44 checks passed
tompng deleted the grapheme_cluster_width branch on July 15, 2025 16:39