Update line segmenter to Unicode 17#8041

Open
makotokato wants to merge 5 commits into unicode-org:main from makotokato:lb17-2

Conversation

makotokato commented Jun 5, 2026

Member

Update of the previous PR. I have passed the monkey test, which is 32,000 tests.

Changelog

  • Update Line segmenter to Unicode 17

# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
# Generated from unicodetools that is replace SA_Mn, SA_Mc and SAmMnmMc with original SA to handle

Member

I don't think this is necessary, as spec_test now uses new_for_non_complex_scripts

Member

Ah it fails the unaltered test file:

  | A | E | Code pt. | Line_Break         | General_Category   | East_Asian_Width | Literal
  | ÷ | ÷ |     0E31 |    Complex_Context |    Nonspacing_Mark |          Neutral | ั
  | × | × |     0308 |     Combining_Mark |    Nonspacing_Mark |        Ambiguous | ̈
😭| ÷ | × |     0E31 |    Complex_Context |    Nonspacing_Mark |          Neutral | ั
Test case #14562

It should really pass it
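
A minimal sketch of why the test data expects no break before the second U+0E31, assuming the default LB1 resolution in UAX #14 (SA with General_Category Mn or Mc resolves to CM, any other SA to AL). The enum and function names are hypothetical and are not the ICU4X API:

```rust
/// The few Line_Break classes this sketch needs (abbreviations follow UAX #14).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LineBreak {
    AL,
    CM,
    NS,
    AI,
    CJ,
    SA,
    SG,
    XX,
}

/// Only the General_Category values that matter for LB1.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GeneralCategory {
    Mn,
    Mc,
    Other,
}

/// Default LB1 resolution, with no tailoring and no dictionary support
/// (hypothetical helper, written for illustration only).
fn resolve_lb1(lb: LineBreak, gc: GeneralCategory) -> LineBreak {
    match (lb, gc) {
        // Complex-context marks behave like ordinary combining marks.
        (LineBreak::SA, GeneralCategory::Mn | GeneralCategory::Mc) => LineBreak::CM,
        // Every other complex-context character behaves like a letter.
        (LineBreak::SA, _) => LineBreak::AL,
        (LineBreak::AI | LineBreak::SG | LineBreak::XX, _) => LineBreak::AL,
        (LineBreak::CJ, _) => LineBreak::NS,
        (other, _) => other,
    }
}
```

Under this reading, U+0E31 (Complex_Context, Nonspacing_Mark) resolves to CM, so LB9 attaches it to the preceding U+0308 and the expected column shows × rather than ÷.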

Member

I pushed a commit that handles this discrepancy in the test, so we can check in the unaltered test data

eggrobin commented Jun 5, 2026

Member

"I have passed the monkey test, which is 32,000 tests."

Unfortunately, for such a large state machine, it looks like this is not enough. I tried with a million strings and got a first failure on test case #136268; see the output below. A more realistic and minimal test case would be the string "endings (« ‐s », « ‐x », etc.)", with an unwanted break after "endings (« ‐".
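
The failing row (Space, then Unambiguous_Hyphen, then Alphabetic, expected ×) looks like an instance of rule LB20a, "do not break after a word-initial hyphen". The sketch below encodes that rule as I read it in the Unicode 17 version of UAX #14; the enum and function are hypothetical illustrations, not ICU4X code, and they ignore combining-mark handling (LB9/LB10) and every other rule:

```rust
/// Line_Break classes relevant to LB20a, plus a catch-all.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lb {
    AL,
    HL,
    HY,
    HH,
    SP,
    BK,
    CR,
    LF,
    NL,
    ZW,
    CB,
    GL,
    Other,
}

/// Returns true when LB20a forbids a break between `hyphen` and `after`.
/// `before` is the class preceding the hyphen, or `None` at start of text.
/// Assumed rule shape: ( sot | BK | CR | LF | NL | SP | ZW | CB | GL ) ( HY | HH ) × ( AL | HL )
fn lb20a_forbids_break(before: Option<Lb>, hyphen: Lb, after: Lb) -> bool {
    // The hyphen is word-initial if nothing, or a break/space-like class, precedes it.
    let word_initial = matches!(
        before,
        None | Some(Lb::BK | Lb::CR | Lb::LF | Lb::NL | Lb::SP | Lb::ZW | Lb::CB | Lb::GL)
    );
    word_initial && matches!(hyphen, Lb::HY | Lb::HH) && matches!(after, Lb::AL | Lb::HL)
}
```

For the realistic string, the sequence after « is SP, a hyphen, then a letter, so `lb20a_forbids_break(Some(Lb::SP), Lb::HH, Lb::AL)` is true and no break should be reported after the hyphen.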

---- run_line_break_random_test stdout ----
  | A | E | Code pt. | Line_Break         | General_Category   | East_Asian_Width | Literal
  | ÷ | ÷ |     0FD9 |               Glue |  Other_Punctuation |          Neutral | ࿙
  | × | × |     1907 |         Alphabetic |       Other_Letter |          Neutral | ᤇ
  | × | × |     FE56 |        Exclamation |  Other_Punctuation |             Wide | ﹖
  | ÷ | ÷ |     FE6A |    Postfix_Numeric |  Other_Punctuation |             Wide | ﹪
  | × | × |    16F87 |     Combining_Mark |       Spacing_Mark |          Neutral | 𖾇
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     300E |   Open_Punctuation |   Open_Punctuation |             Wide | 『
  | × | × |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |    1F3FC |         E_Modifier |    Modifier_Symbol |             Wide | 🏼
  | ÷ | ÷ |    536F0 |            Unknown |         Unassigned |          Neutral | 񓛰
  | ÷ | ÷ |     AA22 |       Aksara_Start |       Other_Letter |          Neutral | ꨢ
  | × | × |     060D |      Infix_Numeric |  Other_Punctuation |          Neutral | ؍
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     A976 |                 JL |       Other_Letter |             Wide | ꥶ
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     FF01 |        Exclamation |  Other_Punctuation |        Fullwidth | !
  | × | × |     1803 |        Exclamation |  Other_Punctuation |          Neutral | ᠃
  | × | × |     2047 |         Nonstarter |  Other_Punctuation |          Neutral | ⁇
  | ÷ | ÷ |    1F8E3 |            Unknown |         Unassigned |          Neutral | 🣣
  | × | × |     FB2C |      Hebrew_Letter |       Other_Letter |          Neutral | שּׁ
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1FE3F |        Ideographic |         Unassigned |          Neutral | 🸿
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |    1F3FF |         E_Modifier |    Modifier_Symbol |             Wide | 🏿
  | ÷ | ÷ |     2662 |         Alphabetic |       Other_Symbol |          Neutral | ♢
  | × | × |    1F8EB |            Unknown |         Unassigned |          Neutral | 🣫
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     FE3B |   Open_Punctuation |   Open_Punctuation |             Wide | ︻
  | × | × |     20BB |    Postfix_Numeric |    Currency_Symbol |          Neutral | ₻
  | × | × |    1F198 |          Ambiguous |       Other_Symbol |             Wide | 🆘
  | × | × |     0DDA |     Combining_Mark |       Spacing_Mark |          Neutral | ේ
  | ÷ | ÷ |    1193F |     Aksara_Prebase |       Other_Letter |          Neutral | 𑤿
  | × | × |     17D6 |         Nonstarter |  Other_Punctuation |          Neutral | ៖
  | ÷ | ÷ |    1F5E4 |        Ideographic |       Other_Symbol |          Neutral | 🗤
  | ÷ | ÷ |    1F0BE |        Ideographic |       Other_Symbol |          Neutral | 🂾
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     060D |      Infix_Numeric |  Other_Punctuation |          Neutral | ؍
  | ÷ | ÷ |    119E2 |       Break_Before |  Other_Punctuation |          Neutral | 𑧢
  | × | × |     005D |  Close_Parenthesis |  Close_Punctuation |           Narrow | ]
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     261D |             E_Base |       Other_Symbol |          Neutral | ☝
  | × | × |    1F67B |         Nonstarter |       Other_Symbol |          Neutral | 🙻
  | ÷ | ÷ |    16115 |       Aksara_Start |       Other_Letter |          Neutral | 𖄕
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |    16FF2 |         Nonstarter |    Modifier_Letter |             Wide | 𖿲
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1FC6D |        Ideographic |         Unassigned |          Neutral | 🱭
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | ÷ | ÷ |     A9D8 |       Aksara_Start |     Decimal_Number |          Neutral | ꧘
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     203D |         Nonstarter |  Other_Punctuation |          Neutral | ‽
  | ÷ | ÷ |    104A9 |            Numeric |     Decimal_Number |          Neutral | 𐒩
  | × | × |     20AF |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₯
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     2603 |        Ideographic |       Other_Symbol |          Neutral | ☃
  | ÷ | ÷ |     A875 |       Break_Before |  Other_Punctuation |          Neutral | ꡵
  | × | × |     2E08 |          Quotation |  Other_Punctuation |          Neutral | ⸈
  | × | × |     2E02 |          Quotation | Initial_Punctuation |          Neutral | ⸂
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | ÷ | ÷ |     270C |             E_Base |       Other_Symbol |          Neutral | ✌
  | × | × |     0F0D |        Exclamation |  Other_Punctuation |          Neutral | །
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |    1F931 |             E_Base |       Other_Symbol |             Wide | 🤱
  | ÷ | ÷ |     00A5 |     Prefix_Numeric |    Currency_Symbol |           Narrow | ¥
  | × | × |     2E56 |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹖
  | × | × |    10AF6 |        Inseparable |  Other_Punctuation |          Neutral | 𐫶
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | × | × |     3015 |  Close_Punctuation |  Close_Punctuation |             Wide | 〕
  | ÷ | ÷ |     FE59 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹙
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     2034 |    Postfix_Numeric |  Other_Punctuation |          Neutral | ‴
  | × | × |     22EF |        Inseparable |        Math_Symbol |          Neutral | ⋯
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | × | × |     2E56 |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹖
  | × | × |     A015 |         Nonstarter |    Modifier_Letter |             Wide | ꀕ
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | × | × |     2762 |        Exclamation |       Other_Symbol |          Neutral | ❢
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     2049 |         Nonstarter |  Other_Punctuation |          Neutral | ⁉
  | ÷ | ÷ |    1F18E |          Ambiguous |       Other_Symbol |             Wide | 🆎
  | ÷ | ÷ |    1F3FE |         E_Modifier |    Modifier_Symbol |             Wide | 🏾
  | ÷ | ÷ |    1F486 |             E_Base |       Other_Symbol |             Wide | 💆
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |    10292 |         Alphabetic |       Other_Letter |          Neutral | 𐊒
  | ÷ | ÷ |    1FC4B |        Ideographic |         Unassigned |          Neutral | 🱋
  | ÷ | ÷ |    11067 |       Aksara_Start |     Decimal_Number |          Neutral | 𑁧
  | ÷ | ÷ |    1FA6C |        Ideographic |       Other_Symbol |          Neutral | 🩬
  | × | × |    11C44 |        Break_After |  Other_Punctuation |          Neutral | 𑱄
  | ÷ | ÷ |    112F7 |            Numeric |     Decimal_Number |          Neutral | 𑋷
  | × | × |     31FB | Conditional_Japanese_Starter |       Other_Letter |             Wide | ㇻ
  | × | × |     302E |     Combining_Mark |       Spacing_Mark |             Wide | 〮
  | ÷ | ÷ |    1F590 |             E_Base |       Other_Symbol |          Neutral | 🖐
  | ÷ | ÷ |     02DF |       Break_Before |    Modifier_Symbol |        Ambiguous | ˟
  | × | × |     31F5 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ㇵ
  | ÷ | ÷ |     FF05 |    Postfix_Numeric |  Other_Punctuation |        Fullwidth | %
  | ÷ | ÷ |     20CE |     Prefix_Numeric |         Unassigned |          Neutral | ⃎
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     11BD |                 JT |       Other_Letter |          Neutral | ᆽ
  | ÷ | ÷ |    1F19A |          Ambiguous |       Other_Symbol |             Wide | 🆚
  | × | × |     2029 |    Mandatory_Break | Paragraph_Separator |          Neutral | 

  | ÷ | ÷ |     00AB |          Quotation | Initial_Punctuation |          Neutral | «
  | × | × |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |     261D |             E_Base |       Other_Symbol |          Neutral | ☝
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1F1FB | Regional_Indicator |       Other_Symbol |          Neutral | 🇻
  | × | × |     2029 |    Mandatory_Break | Paragraph_Separator |          Neutral | 

  | ÷ | ÷ |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | × | × |     00BB |          Quotation |  Final_Punctuation |          Neutral | »
  | × | × |     D7C3 |                 JV |       Other_Letter |          Neutral | ퟃ
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |    10AF6 |        Inseparable |  Other_Punctuation |          Neutral | 𐫶
  | × | × |    16FE0 |         Nonstarter |    Modifier_Letter |             Wide | 𖿠
  | × | × |     2E0B |          Quotation |  Other_Punctuation |          Neutral | ⸋
  | × | × |     302C |     Combining_Mark |    Nonspacing_Mark |             Wide | 〬
  | × | × |    1325D |  Close_Punctuation |       Other_Letter |          Neutral | 𓉝
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     1400 | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ᐀
  | ÷ | ÷ |     1BF3 |       Virama_Final |       Spacing_Mark |          Neutral | ᯳
  | ÷ | ÷ |    113D1 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑏑
  | × | × |     0D42 |     Combining_Mark |    Nonspacing_Mark |          Neutral | ൂ
  | × | × |    1343F |  Close_Punctuation |             Format |          Neutral | 𓐿
  | ÷ | ÷ |    1F8E5 |            Unknown |         Unassigned |          Neutral | 🣥
  | ÷ | ÷ |    38390 |        Ideographic |         Unassigned |             Wide | 𸎐
  | × | × |     2E56 |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹖
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     02CC |       Break_Before |    Modifier_Letter |          Neutral | ˌ
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |     3083 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ゃ
  | ÷ | ÷ |    1F888 |            Unknown |         Unassigned |          Neutral | 🢈
  | ÷ | ÷ |    2C046 |        Ideographic |       Other_Letter |             Wide | 𬁆
  | ÷ | ÷ |    1FF46 |        Ideographic |         Unassigned |          Neutral | 🽆
  | ÷ | ÷ |    8E722 |            Unknown |         Unassigned |          Neutral | 򎜢
  | ÷ | ÷ |     115A |                 JL |       Other_Letter |             Wide | ᅚ
  | × | × |     2026 |        Inseparable |  Other_Punctuation |        Ambiguous | …
  | ÷ | ÷ |     2664 |          Ambiguous |       Other_Symbol |        Ambiguous | ♤
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     2E5A |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹚
  | × | × |     05E3 |      Hebrew_Letter |       Other_Letter |          Neutral | ף
  | ÷ | ÷ |    1F3FD |         E_Modifier |    Modifier_Symbol |             Wide | 🏽
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |    1F8FC |            Unknown |         Unassigned |          Neutral | 🣼
  | × | × |     FE2D |               Glue |    Nonspacing_Mark |          Neutral | ︭
  | × | × |    1CF71 |         Alphabetic |       Other_Symbol |          Neutral | 𜽱
  | ÷ | ÷ |     3018 |   Open_Punctuation |   Open_Punctuation |             Wide | 〘
  | × | × |     C97C |                 H3 |       Other_Letter |             Wide | 쥼
  | ÷ | ÷ |     FE41 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹁
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     2E0C |          Quotation | Initial_Punctuation |          Neutral | ⸌
  | × | × |     2563 |          Ambiguous |       Other_Symbol |        Ambiguous | ╣
  | × | × |     1B5A |        Break_After |  Other_Punctuation |          Neutral | ᭚
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     05BE | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ־
  | × | × |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | × | × |     FE11 |  Close_Punctuation |  Other_Punctuation |             Wide | ︑
  | × | × |     2039 |          Quotation | Initial_Punctuation |          Neutral | ‹
  | × | × |     11A0 |                 JV |       Other_Letter |          Neutral | ᆠ
  | ÷ | ÷ |    1F3FC |         E_Modifier |    Modifier_Symbol |             Wide | 🏼
  | ÷ | ÷ |    172CF |        Ideographic |       Other_Letter |             Wide | 𗋏
  | × | × |     000C |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |     115B |                 JL |       Other_Letter |             Wide | ᅛ
  | × | × |     201C |          Quotation | Initial_Punctuation |        Ambiguous | “
  | × | × |     1BF3 |       Virama_Final |       Spacing_Mark |          Neutral | ᯳
  | ÷ | ÷ |     FE41 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹁
  | × | × |     2E1D |          Quotation |  Final_Punctuation |          Neutral | ⸝
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | ÷ | ÷ |     20B8 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₸
  | × | × |    1F184 |          Ambiguous |       Other_Symbol |        Ambiguous | 🆄
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     2E17 | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ⸗
  | × | × |     3017 |  Close_Punctuation |  Close_Punctuation |             Wide | 〗
  | ÷ | ÷ |     2E55 |   Open_Punctuation |   Open_Punctuation |          Neutral | ⹕
  | × | × |     FF1F |        Exclamation |  Other_Punctuation |        Fullwidth | ?
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |    11941 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑥁
  | ÷ | ÷ |     1154 |                 JL |       Other_Letter |             Wide | ᅔ
  | ÷ | ÷ |     11F4 |                 JT |       Other_Letter |          Neutral | ᇴ
  | ÷ | ÷ |     B9AC |                 H2 |       Other_Letter |             Wide | 리
  | × | × |    11FDF |    Postfix_Numeric |    Currency_Symbol |          Neutral | 𑿟
  | ÷ | ÷ |     D057 |                 H3 |       Other_Letter |             Wide | 큗
  | × | × |     060B |    Postfix_Numeric |    Currency_Symbol |          Neutral | ؋
  | ÷ | ÷ |     2E3A |         Break_Both |   Dash_Punctuation |          Neutral | ⸺
  | × | × |     2044 |      Infix_Numeric |        Math_Symbol |          Neutral | ⁄
  | × | × |     FE57 |        Exclamation |  Other_Punctuation |             Wide | ﹗
  | ÷ | ÷ |     CFA8 |                 H2 |       Other_Letter |             Wide | 쾨
  | × | × |     2E0D |          Quotation |  Final_Punctuation |          Neutral | ⸍
  | × | × |     2014 |         Break_Both |   Dash_Punctuation |        Ambiguous | —
  | × | × |     2012 | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ‒
  | ÷ | ÷ |    1F6B5 |             E_Base |       Other_Symbol |             Wide | 🚵
  | × | × |     005D |  Close_Parenthesis |  Close_Punctuation |           Narrow | ]
  | ÷ | ÷ |     C41C |                 H3 |       Other_Letter |             Wide | 쐜
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | × | × |     061D |        Exclamation |  Other_Punctuation |          Neutral | ؝
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     061E |        Exclamation |  Other_Punctuation |          Neutral | ؞
  | ÷ | ÷ |     19D4 |            Numeric |     Decimal_Number |          Neutral | ᧔
  | ÷ | ÷ |     D7B3 |                 JV |       Other_Letter |          Neutral | ힳ
  | ÷ | ÷ |    2A9D7 |        Ideographic |       Other_Letter |             Wide | 𪧗
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     06D4 |        Exclamation |  Other_Punctuation |          Neutral | ۔
  | ÷ | ÷ |     261D |             E_Base |       Other_Symbol |          Neutral | ☝
  | × | × |     275E |          Quotation |       Other_Symbol |          Neutral | ❞
  | × | × |    1F1FE | Regional_Indicator |       Other_Symbol |          Neutral | 🇾
  | ÷ | ÷ |     05D4 |      Hebrew_Letter |       Other_Letter |          Neutral | ה
  | × | × |     2771 |  Close_Punctuation |  Close_Punctuation |          Neutral | ❱
  | × | × |     00AD |        Break_After |             Format |        Ambiguous | ­
  | ÷ | ÷ |     11A1 |                 JV |       Other_Letter |          Neutral | ᆡ
  | × | × |    11FE0 |    Postfix_Numeric |    Currency_Symbol |          Neutral | 𑿠
  | ÷ | ÷ |     D2D8 |                 H3 |       Other_Letter |             Wide | 틘
  | ÷ | ÷ |    1F93E |             E_Base |       Other_Symbol |             Wide | 🤾
  | × | × |     07F8 |      Infix_Numeric |  Other_Punctuation |          Neutral | ߸
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     2033 |    Postfix_Numeric |  Other_Punctuation |        Ambiguous | ″
  | × | × |    1F14B |          Ambiguous |       Other_Symbol |        Ambiguous | 🅋
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     1180 |                 JV |       Other_Letter |          Neutral | ᆀ
  | ÷ | ÷ |     00D7 |          Ambiguous |        Math_Symbol |        Ambiguous | ×
  | × | × |     2E56 |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹖
  | × | × |    16FF1 |     Combining_Mark |       Spacing_Mark |             Wide | 𖿱
  | ÷ | ÷ |     FF05 |    Postfix_Numeric |  Other_Punctuation |        Fullwidth | %
  | ÷ | ÷ |     0F03 |       Break_Before |       Other_Symbol |          Neutral | ༃
  | × | × |    49062 |            Unknown |         Unassigned |          Neutral | 񉁢
  | ÷ | ÷ |    1F5E7 |        Ideographic |       Other_Symbol |          Neutral | 🗧
  | ÷ | ÷ |    11665 |       Break_Before |  Other_Punctuation |          Neutral | 𑙥
  | × | × |     2CFE |        Exclamation |  Other_Punctuation |          Neutral | ⳾
  | ÷ | ÷ |    3952E |        Ideographic |         Unassigned |             Wide | 𹔮
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |     11EE |                 JT |       Other_Letter |          Neutral | ᇮ
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |    1343C |   Open_Punctuation |             Format |          Neutral | 𓐼
  | × | × |     118A |                 JV |       Other_Letter |          Neutral | ᆊ
  | ÷ | ÷ |     FE43 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹃
  | × | × |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | × | × |     2E0A |          Quotation |  Final_Punctuation |          Neutral | ⸊
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     FE56 |        Exclamation |  Other_Punctuation |             Wide | ﹖
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | × | × |     037E |      Infix_Numeric |  Other_Punctuation |          Neutral | ;
  | ÷ | ÷ |    1F0D4 |        Ideographic |       Other_Symbol |          Neutral | 🃔
  | × | × |     0029 |  Close_Parenthesis |  Close_Punctuation |           Narrow | )
  | ÷ | ÷ |     1BF3 |       Virama_Final |       Spacing_Mark |          Neutral | ᯳
  | ÷ | ÷ |     005B |   Open_Punctuation |   Open_Punctuation |           Narrow | [
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |     0964 |        Break_After |  Other_Punctuation |          Neutral | ।
  | × | × |     203A |          Quotation |  Final_Punctuation |          Neutral | ›
  | × | × |    1F679 |         Nonstarter |       Other_Symbol |          Neutral | 🙹
  | ÷ | ÷ |    1F8FC |            Unknown |         Unassigned |          Neutral | 🣼
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     A9C0 |             Virama |       Spacing_Mark |          Neutral | ꧀
  | ÷ | ÷ |     20C8 |     Prefix_Numeric |         Unassigned |          Neutral | ⃈
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | × | × |    113CE |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𑏎
  | × | × |    16FF3 |         Nonstarter |    Modifier_Letter |             Wide | 𖿳
  | ÷ | ÷ |    1F1ED | Regional_Indicator |       Other_Symbol |          Neutral | 🇭
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     3005 |         Nonstarter |    Modifier_Letter |             Wide | 々
  | ÷ | ÷ |     20B2 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₲
  | × | × |     0029 |  Close_Parenthesis |  Close_Punctuation |           Narrow | )
  | ÷ | ÷ |     B5F1 |                 H3 |       Other_Letter |             Wide | 뗱
  | × | × |     2E1D |          Quotation |  Final_Punctuation |          Neutral | ⸝
  | × | × |     D7D6 |                 JT |       Other_Letter |          Neutral | ퟖ
  | ÷ | ÷ |    1F3CB |             E_Base |       Other_Symbol |          Neutral | 🏋
  | ÷ | ÷ |     2536 |          Ambiguous |       Other_Symbol |        Ambiguous | ┶
  | × | × |     FB2F |      Hebrew_Letter |       Other_Letter |          Neutral | אָ
  | × | × |    18C3C |         Alphabetic |       Other_Letter |             Wide | 𘰼
  | ÷ | ÷ |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | ÷ | ÷ |    1F85E |            Unknown |         Unassigned |          Neutral | 🡞
  | ÷ | ÷ |     2014 |         Break_Both |   Dash_Punctuation |        Ambiguous | —
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | ÷ | ÷ |     A9C0 |             Virama |       Spacing_Mark |          Neutral | ꧀
  | ÷ | ÷ |     300A |   Open_Punctuation |   Open_Punctuation |             Wide | 《
  | × | × |     AA53 |       Aksara_Start |     Decimal_Number |          Neutral | ꩓
  | × | × |     000C |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |    18C76 |         Alphabetic |       Other_Letter |             Wide | 𘱶
  | × | × |    18B80 |         Alphabetic |       Other_Letter |             Wide | 𘮀
  | ÷ | ÷ |    1193F |     Aksara_Prebase |       Other_Letter |          Neutral | 𑤿
  | ÷ | ÷ |     FF05 |    Postfix_Numeric |  Other_Punctuation |        Fullwidth | %
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |    11339 |             Aksara |       Other_Letter |          Neutral | 𑌹
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |     0601 |            Numeric |             Format |          Neutral | ؁
  | ÷ | ÷ |    1F1FF | Regional_Indicator |       Other_Symbol |          Neutral | 🇿
  | ÷ | ÷ |    1FD36 |        Ideographic |         Unassigned |          Neutral | 🴶
  | × | × |    1337B |  Close_Punctuation |       Other_Letter |          Neutral | 𓍻
  | ÷ | ÷ |    11D57 |            Numeric |     Decimal_Number |          Neutral | 𑵗
  | ÷ | ÷ |    1FEDA |        Ideographic |         Unassigned |          Neutral | 🻚
  | × | × |     3047 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ぇ
  | × | × |     FE55 |         Nonstarter |  Other_Punctuation |             Wide | ﹕
  | × | × |    1B151 | Conditional_Japanese_Starter |       Other_Letter |             Wide | 𛅑
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     2E5C |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹜
  | ÷ | ÷ |     2993 |   Open_Punctuation |   Open_Punctuation |          Neutral | ⦓
  | × | × |    1F194 |          Ambiguous |       Other_Symbol |             Wide | 🆔
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     FFE6 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | ₩
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |    1F18E |          Ambiguous |       Other_Symbol |             Wide | 🆎
  | ÷ | ÷ |     301A |   Open_Punctuation |   Open_Punctuation |             Wide | 〚
  | × | × |     2E02 |          Quotation | Initial_Punctuation |          Neutral | ⸂
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | × | × |     2012 | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ‒
😭| ÷ | × |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     2515 |          Ambiguous |       Other_Symbol |        Ambiguous | ┕
  | × | × |     0029 |  Close_Parenthesis |  Close_Punctuation |           Narrow | )
  | ÷ | ÷ |     AD81 |                 H3 |       Other_Letter |             Wide | 궁
  | ÷ | ÷ |     118C |                 JV |       Other_Letter |          Neutral | ᆌ
  | ÷ | ÷ |     26C1 |        Ideographic |       Other_Symbol |          Neutral | ⛁
  | × | × |     303C |         Nonstarter |       Other_Letter |             Wide | 〼
  | ÷ | ÷ |     AC72 |                 H3 |       Other_Letter |             Wide | 걲
  | ÷ | ÷ |    1E143 |            Numeric |     Decimal_Number |          Neutral | 𞅃
  | ÷ | ÷ |     C68C |                 H3 |       Other_Letter |             Wide | 욌
  | ÷ | ÷ |     8E7B |        Ideographic |       Other_Letter |             Wide | 蹻
  | × | × |     207E |  Close_Punctuation |  Close_Punctuation |          Neutral | ⁾
  | ÷ | ÷ |     270C |             E_Base |       Other_Symbol |          Neutral | ✌
  | ÷ | ÷ |    1F474 |             E_Base |       Other_Symbol |             Wide | 👴
  | × | × |     000B |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |    1CF0E |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𜼎
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     2026 |        Inseparable |  Other_Punctuation |        Ambiguous | …
  | × | × |     302A |     Combining_Mark |    Nonspacing_Mark |             Wide | 〪
  | ÷ | ÷ |     2E3A |         Break_Both |   Dash_Punctuation |          Neutral | ⸺
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |    B08B7 |            Unknown |         Unassigned |          Neutral | 򰢷
  | × | × |    18BD7 |         Alphabetic |       Other_Letter |             Wide | 𘯗
  | ÷ | ÷ |    1166C |       Break_Before |  Other_Punctuation |          Neutral | 𑙬
  | × | × |     05BE | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ־
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |    1191F |             Aksara |       Other_Letter |          Neutral | 𑤟
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | ÷ | ÷ |     D770 |                 H3 |       Other_Letter |             Wide | 흰
  | ÷ | ÷ |     270D |             E_Base |       Other_Symbol |          Neutral | ✍
  | × | × |     2025 |        Inseparable |  Other_Punctuation |        Ambiguous | ‥
  | × | × |    1F679 |         Nonstarter |       Other_Symbol |          Neutral | 🙹
  | × | × |    1DA4F |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𝩏
  | ÷ | ÷ |    1F3FE |         E_Modifier |    Modifier_Symbol |             Wide | 🏾
  | ÷ | ÷ |     2014 |         Break_Both |   Dash_Punctuation |        Ambiguous | —
  | ÷ | ÷ |     20A1 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₡
  | × | × |    18B07 |         Alphabetic |       Other_Letter |             Wide | 𘬇
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     30FC | Conditional_Japanese_Starter |    Modifier_Letter |             Wide | ー
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |   10D402 |            Unknown |        Private_Use |        Ambiguous | 􍐂
  | ÷ | ÷ |     3014 |   Open_Punctuation |   Open_Punctuation |             Wide | 〔
  | × | × |    1F196 |          Ambiguous |       Other_Symbol |             Wide | 🆖
  | × | × |     05DF |      Hebrew_Letter |       Other_Letter |          Neutral | ן
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1F8F8 |            Unknown |         Unassigned |          Neutral | 🣸
  | ÷ | ÷ |     A9C0 |             Virama |       Spacing_Mark |          Neutral | ꧀
  | ÷ | ÷ |     115E |                 JL |       Other_Letter |             Wide | ᅞ
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |    8A0BA |            Unknown |         Unassigned |          Neutral | 򊂺
  | × | × |    16F97 |         Alphabetic |    Modifier_Letter |          Neutral | 𖾗
  | × | × |     000B |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |    11326 |             Aksara |       Other_Letter |          Neutral | 𑌦
  | × | × |     302B |     Combining_Mark |    Nonspacing_Mark |             Wide | 〫
  | ÷ | ÷ |    1F930 |             E_Base |       Other_Symbol |             Wide | 🤰
  | ÷ | ÷ |    1F44D |             E_Base |       Other_Symbol |             Wide | 👍
  | × | × |     000B |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |    1F01C |        Ideographic |       Other_Symbol |          Neutral | 🀜
  | ÷ | ÷ |    18C3F |         Alphabetic |       Other_Letter |             Wide | 𘰿
  | ÷ | ÷ |    1F575 |             E_Base |       Other_Symbol |          Neutral | 🕵
  | × | × |     2E5D | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ⹝
  | ÷ | ÷ |    1F8E3 |            Unknown |         Unassigned |          Neutral | 🣣
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     FF01 |        Exclamation |  Other_Punctuation |        Fullwidth | !
  | × | × |     AA48 |        Break_After |       Other_Letter |          Neutral | ꩈ
  | × | × |    11340 |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𑍀
  | ÷ | ÷ |     FF04 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | $
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     26F0 |          Ambiguous |       Other_Symbol |        Ambiguous | ⛰
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |    1F645 |             E_Base |       Other_Symbol |             Wide | 🙅
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |     FB32 |      Hebrew_Letter |       Other_Letter |          Neutral | גּ
  | × | × |    18CA9 |         Alphabetic |       Other_Letter |             Wide | 𘲩
  | ÷ | ÷ |     FE59 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹙
  | × | × |     FF01 |        Exclamation |  Other_Punctuation |        Fullwidth | !
  | × | × |     2048 |         Nonstarter |  Other_Punctuation |          Neutral | ⁈
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     3002 |  Close_Punctuation |  Other_Punctuation |             Wide | 。
  | × | × |     FE16 |        Exclamation |  Other_Punctuation |             Wide | ︖
  | × | × |     2044 |      Infix_Numeric |        Math_Symbol |          Neutral | ⁄
  | × | × |     000C |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |     2010 | Unambiguous_Hyphen |   Dash_Punctuation |        Ambiguous | ‐
  | × | × |     302B |     Combining_Mark |    Nonspacing_Mark |             Wide | 〫
  | × | × |     3099 |     Combining_Mark |    Nonspacing_Mark |             Wide | ゙
  | ÷ | ÷ |     FF05 |    Postfix_Numeric |  Other_Punctuation |        Fullwidth | %
  | × | × |    18C7E |         Alphabetic |       Other_Letter |             Wide | 𘱾
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |    1F85A |            Unknown |         Unassigned |          Neutral | 🡚
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |     D7FB |                 JT |       Other_Letter |          Neutral | ퟻ
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |    1F574 |             E_Base |       Other_Symbol |          Neutral | 🕴
  | ÷ | ÷ |    1F18E |          Ambiguous |       Other_Symbol |             Wide | 🆎
  | ÷ | ÷ |     A875 |       Break_Before |  Other_Punctuation |          Neutral | ꡵
  | × | × |     1AB9 |     Combining_Mark |    Nonspacing_Mark |          Neutral | ᪹
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |    11B09 |       Break_Before |  Other_Punctuation |          Neutral | 𑬉
  | × | × |     2986 |  Close_Punctuation |  Close_Punctuation |           Narrow | ⦆
  | × | × |     2CFE |        Exclamation |  Other_Punctuation |          Neutral | ⳾
  | ÷ | ÷ |    1F2A3 |        Ideographic |         Unassigned |          Neutral | 🊣
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     30A7 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ェ
  | ÷ | ÷ |     D7BC |                 JV |       Other_Letter |          Neutral | ힼ
  | ÷ | ÷ |    1193E |             Virama |    Nonspacing_Mark |          Neutral | 𑤾
  | ÷ | ÷ |     FE41 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹁
  | × | × |    1F195 |          Ambiguous |       Other_Symbol |             Wide | 🆕
  | × | × |     FF04 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | $
  | × | × |     2CFE |        Exclamation |  Other_Punctuation |          Neutral | ⳾
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     AA0C |       Aksara_Start |       Other_Letter |          Neutral | ꨌ
  | ÷ | ÷ |    1F291 |        Ideographic |         Unassigned |          Neutral | 🊑
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |     1B32 |             Aksara |       Other_Letter |          Neutral | ᬲ
  | × | × |     31F0 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ㇰ
  | ÷ | ÷ |     CAFC |                 H3 |       Other_Letter |             Wide | 쫼
  | ÷ | ÷ |    1F3FD |         E_Modifier |    Modifier_Symbol |             Wide | 🏽
  | × | × |    1ECAC |    Postfix_Numeric |       Other_Symbol |          Neutral | 𞲬
  | ÷ | ÷ |    1F5A8 |        Ideographic |       Other_Symbol |          Neutral | 🖨
  | × | × |     2025 |        Inseparable |  Other_Punctuation |        Ambiguous | ‥
  | ÷ | ÷ |     1B44 |             Virama |       Spacing_Mark |          Neutral | ᭄
  | ÷ | ÷ |     117C |                 JV |       Other_Letter |          Neutral | ᅼ
  | ÷ | ÷ |    1F8F5 |            Unknown |         Unassigned |          Neutral | 🣵
  | ÷ | ÷ |    1134D |             Virama |       Spacing_Mark |          Neutral | 𑍍
  | ÷ | ÷ |    1F3FF |         E_Modifier |    Modifier_Symbol |             Wide | 🏿
  | × | × |    13438 |  Close_Punctuation |             Format |          Neutral | 𓐸
  | ÷ | ÷ |    301B5 |        Ideographic |       Other_Letter |             Wide | 𰆵
  | ÷ | ÷ |     02C8 |       Break_Before |    Modifier_Letter |          Neutral | ˈ
  | × | × |     300F |  Close_Punctuation |  Close_Punctuation |             Wide | 』
  | ÷ | ÷ |    1F3FB |         E_Modifier |    Modifier_Symbol |             Wide | 🏻
  | ÷ | ÷ |     116B |                 JV |       Other_Letter |          Neutral | ᅫ
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |    1F24F |        Ideographic |         Unassigned |          Neutral | 🉏
  | ÷ | ÷ |     B944 |                 H3 |       Other_Letter |             Wide | 륄
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |    1610C |       Aksara_Start |       Other_Letter |          Neutral | 𖄌
  | × | × |     3019 |  Close_Punctuation |  Close_Punctuation |             Wide | 〙
  | ÷ | ÷ |    E948E |            Unknown |         Unassigned |          Neutral | 󩒎
  | × | × |     2E34 |        Break_After |  Other_Punctuation |          Neutral | ⸴
  | × | × |     17D6 |         Nonstarter |  Other_Punctuation |          Neutral | ៖
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     D7C3 |                 JV |       Other_Letter |          Neutral | ퟃ
  | × | × |     FF69 | Conditional_Japanese_Starter |       Other_Letter |        Halfwidth | ゥ
  | ÷ | ÷ |     270D |             E_Base |       Other_Symbol |          Neutral | ✍
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | × | × |    11A54 |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𑩔
  | ÷ | ÷ |     29DA |   Open_Punctuation |   Open_Punctuation |          Neutral | ⧚
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     303C |         Nonstarter |       Other_Letter |             Wide | 〼
  | ÷ | ÷ |    1134D |             Virama |       Spacing_Mark |          Neutral | 𑍍
  | × | × |     2007 |               Glue |    Space_Separator |          Neutral |  
  | × | × |     05EF |      Hebrew_Letter |       Other_Letter |          Neutral | ׯ
  | × | × |     FF05 |    Postfix_Numeric |  Other_Punctuation |        Fullwidth | %
  | × | × |    1F8F5 |            Unknown |         Unassigned |          Neutral | 🣵
Test case #136268
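
In the dump above, the 😭 marker flags a row where the A and E columns disagree. Below is a minimal sketch of that comparison, assuming A is the segmenter's actual result and E the expected one. `Break` and `first_mismatch` are hypothetical names used for illustration, not part of the ICU4X test harness.

```rust
/// Break decision before a code point, as printed in the A and E columns.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Break {
    /// Shown as ÷ (a break is allowed here).
    Allowed,
    /// Shown as × (no break here).
    Prohibited,
}

/// Returns the index of the first row where the actual and expected
/// decisions differ, or `None` if the compared rows all agree.
/// Assumption: both slices hold one entry per code point, in order.
#[allow(dead_code)]
fn first_mismatch(actual: &[Break], expected: &[Break]) -> Option<usize> {
    actual
        .iter()
        .zip(expected)
        .position(|(a, e)| a != e)
}
```

The returned index corresponds to the row that would carry the 😭 marker.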

@eggrobin

eggrobin commented Jun 9, 2026

Member

The next unhappy monkey is #245722. With one failure every hundred thousand tests (which are generated at a rate of a million per hour using the very slow naïve implementation), this is going to become impractical to debug much further…
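
As a back-of-the-envelope check of the rates quoted above (one failure per hundred thousand tests, a million tests per hour), here is a small sketch. `hours_per_failure` is a hypothetical helper, and the rates are taken from the comment rather than measured.

```rust
/// Average wall-clock hours between failures, given how many tests run
/// between failures and how many tests are generated per hour.
/// Assumption: failures are spread evenly, so this is only an average.
#[allow(dead_code)]
fn hours_per_failure(tests_between_failures: f64, tests_per_hour: f64) -> f64 {
    tests_between_failures / tests_per_hour
}
```

With the quoted figures, `hours_per_failure(100_000.0, 1_000_000.0)` comes to a tenth of an hour per failure on average. The same helper shows how the wait scales if failures become rarer.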

---- run_line_break_random_test stdout ----
  | A | E | Code pt. | Line_Break         | General_Category   | East_Asian_Width | Literal
  | ÷ | ÷ |     1400 | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ᐀
  | × | × |     2018 |          Quotation | Initial_Punctuation |        Ambiguous | ‘
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
😭| × | ÷ |    13A34 |         Alphabetic |       Other_Letter |          Neutral | 𓨴
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | × | × |     2E07 |          Quotation |  Other_Punctuation |          Neutral | ⸇
  | × | × |    11662 |       Break_Before |  Other_Punctuation |          Neutral | 𑙢
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | × | × |     000B |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |     FB21 |      Hebrew_Letter |       Other_Letter |          Neutral | ﬡ
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | × | × |     201F |          Quotation | Initial_Punctuation |          Neutral | ‟
  | × | × |    13287 |  Close_Punctuation |       Other_Letter |          Neutral | 𓊇
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | ÷ | ÷ |    1F84F |            Unknown |         Unassigned |          Neutral | 🡏
  | × | × |     20CB |     Prefix_Numeric |         Unassigned |          Neutral | ⃋
  | × | × |     0952 |     Combining_Mark |    Nonspacing_Mark |          Neutral | ॒
  | × | × |   1094A7 |            Unknown |        Private_Use |        Ambiguous | 􉒧
  | ÷ | ÷ |     1BD6 |       Aksara_Start |       Other_Letter |          Neutral | ᯖ
  | ÷ | ÷ |    1132C |             Aksara |       Other_Letter |          Neutral | 𑌬
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     273B |         Alphabetic |       Other_Symbol |          Neutral | ✻
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |     26F9 |             E_Base |       Other_Symbol |        Ambiguous | ⛹
  | × | × |    1DA69 |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𝩩
  | × | × |     FE11 |  Close_Punctuation |  Other_Punctuation |             Wide | ︑
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | ÷ | ÷ |     1806 |       Break_Before |   Dash_Punctuation |          Neutral | ᠆
  | × | × |     309E |         Nonstarter |    Modifier_Letter |             Wide | ゞ
  | ÷ | ÷ |    1F3FF |         E_Modifier |    Modifier_Symbol |             Wide | 🏿
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     007B |   Open_Punctuation |   Open_Punctuation |           Narrow | {
  | × | × |     FE18 |  Close_Punctuation |  Close_Punctuation |             Wide | ︘
  | ÷ | ÷ |     00B9 |          Ambiguous |       Other_Number |        Ambiguous | ¹
  | × | × |     208E |  Close_Punctuation |  Close_Punctuation |          Neutral | ₎
  | ÷ | ÷ |    11F10 |             Aksara |       Other_Letter |          Neutral | 𑼐
  | ÷ | ÷ |     CF57 |                 H3 |       Other_Letter |             Wide | 콗
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     05DF |      Hebrew_Letter |       Other_Letter |          Neutral | ן
  | ÷ | ÷ |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | ÷ | ÷ |    11381 |       Aksara_Start |       Other_Letter |          Neutral | 𑎁
  | ÷ | ÷ |     2057 |    Postfix_Numeric |  Other_Punctuation |          Neutral | ⁗
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |    10415 |         Alphabetic |   Uppercase_Letter |          Neutral | 𐐕
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |    1F8EF |            Unknown |         Unassigned |          Neutral | 🣯
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | ÷ | ÷ |    1F46C |             E_Base |       Other_Symbol |             Wide | 👬
  | ÷ | ÷ |    11F02 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑼂
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |    23413 |        Ideographic |       Other_Letter |             Wide | 𣐓
  | ÷ | ÷ |    11F02 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑼂
  | × | × |     05BE | Unambiguous_Hyphen |   Dash_Punctuation |          Neutral | ־
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |    1FC23 |        Ideographic |         Unassigned |          Neutral | 🰣
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | × | × |     2025 |        Inseparable |  Other_Punctuation |        Ambiguous | ‥
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |    11A45 |       Break_Before |  Other_Punctuation |          Neutral | 𑩅
  | × | × |     FFE5 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | ¥
  | × | × |     005D |  Close_Parenthesis |  Close_Punctuation |           Narrow | ]
  | × | × |     060D |      Infix_Numeric |  Other_Punctuation |          Neutral | ؍
  | × | × |    F2B29 |            Unknown |        Private_Use |        Ambiguous | 󲬩
  | ÷ | ÷ |    1F01C |        Ideographic |       Other_Symbol |          Neutral | 🀜
  | ÷ | ÷ |     29DA |   Open_Punctuation |   Open_Punctuation |          Neutral | ⧚
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | ÷ | ÷ |     1184 |                 JV |       Other_Letter |          Neutral | ᆄ
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |    11337 |             Aksara |       Other_Letter |          Neutral | 𑌷
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |    1F590 |             E_Base |       Other_Symbol |          Neutral | 🖐
  | ÷ | ÷ |     300C |   Open_Punctuation |   Open_Punctuation |             Wide | 「
  | × | × |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | ÷ | ÷ |     20B9 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₹
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |     300A |   Open_Punctuation |   Open_Punctuation |             Wide | 《
  | × | × |    122C6 |         Alphabetic |       Other_Letter |          Neutral | 𒋆
  | ÷ | ÷ |     1BF3 |       Virama_Final |       Spacing_Mark |          Neutral | ᯳
  | ÷ | ÷ |    100A7 |         Alphabetic |       Other_Letter |          Neutral | 𐂧
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | ÷ | ÷ |    1F1F2 | Regional_Indicator |       Other_Symbol |          Neutral | 🇲
  | × | × |     2771 |  Close_Punctuation |  Close_Punctuation |          Neutral | ❱
  | × | × |     0307 |     Combining_Mark |    Nonspacing_Mark |        Ambiguous | ̇
  | ÷ | ÷ |    1F6B5 |             E_Base |       Other_Symbol |             Wide | 🚵
  | ÷ | ÷ |     02CC |       Break_Before |    Modifier_Letter |          Neutral | ˌ
  | × | × |     FF05 |    Postfix_Numeric |  Other_Punctuation |        Fullwidth | %
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     2010 | Unambiguous_Hyphen |   Dash_Punctuation |        Ambiguous | ‐
  | ÷ | ÷ |    13436 |               Glue |             Format |          Neutral | 𓐶
  | × | × |     C918 |                 H2 |       Other_Letter |             Wide | 줘
  | ÷ | ÷ |     20A1 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₡
  | × | × |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | ÷ | ÷ |     D05B |                 H3 |       Other_Letter |             Wide | 큛
  | ÷ | ÷ |    1F8FD |            Unknown |         Unassigned |          Neutral | 🣽
  | ÷ | ÷ |    1F474 |             E_Base |       Other_Symbol |             Wide | 👴
  | × | × |     202F |               Glue |    Space_Separator |          Neutral |  
  | × | × |     20A0 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₠
  | × | × |    34FCC |        Ideographic |         Unassigned |             Wide | 𴿌
  | × | × |     2025 |        Inseparable |  Other_Punctuation |        Ambiguous | ‥
  | ÷ | ÷ |     FE41 |   Open_Punctuation |   Open_Punctuation |             Wide | ﹁
  | × | × |    1F1FE | Regional_Indicator |       Other_Symbol |          Neutral | 🇾
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |     B3C4 |                 H2 |       Other_Letter |             Wide | 도
  | ÷ | ÷ |     27EE |   Open_Punctuation |   Open_Punctuation |          Neutral | ⟮
  | × | × |     2029 |    Mandatory_Break | Paragraph_Separator |          Neutral | 

  | ÷ | ÷ |     FE13 |         Nonstarter |  Other_Punctuation |             Wide | ︓
  | × | × |     203A |          Quotation |  Final_Punctuation |          Neutral | ›
  | × | × |     0F28 |            Numeric |     Decimal_Number |          Neutral | ༨
  | ÷ | ÷ |    11059 |        Ideographic |       Other_Number |          Neutral | 𑁙
  | ÷ | ÷ |    116C8 |            Numeric |     Decimal_Number |          Neutral | 𑛈
  | × | × |    1B151 | Conditional_Japanese_Starter |       Other_Letter |             Wide | 𛅑
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |    3179C |        Ideographic |       Other_Letter |             Wide | 𱞜
  | ÷ | ÷ |    1F569 |        Ideographic |       Other_Symbol |          Neutral | 🕩
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     FE57 |        Exclamation |  Other_Punctuation |             Wide | ﹗
  | ÷ | ÷ |     1BF3 |       Virama_Final |       Spacing_Mark |          Neutral | ᯳
  | ÷ | ÷ |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | × | × |     002E |      Infix_Numeric |  Other_Punctuation |           Narrow | .
  | × | × |    1F84A |            Unknown |         Unassigned |          Neutral | 🡊
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | ÷ | ÷ |    1193E |             Virama |    Nonspacing_Mark |          Neutral | 𑤾
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | × | × |     2010 | Unambiguous_Hyphen |   Dash_Punctuation |        Ambiguous | ‐
  | ÷ | ÷ |    1193F |     Aksara_Prebase |       Other_Letter |          Neutral | 𑤿
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | × | × |     2E10 |        Break_After |  Other_Punctuation |          Neutral | ⸐
  | × | × |    E0060 |     Combining_Mark |             Format |          Neutral | 󠁠
  | ÷ | ÷ |    1F8E1 |            Unknown |         Unassigned |          Neutral | 🣡
  | × | × |    13438 |  Close_Punctuation |             Format |          Neutral | 𓐸
  | ÷ | ÷ |     2B55 |          Ambiguous |       Other_Symbol |             Wide | ⭕
  | × | × |     2E0D |          Quotation |  Final_Punctuation |          Neutral | ⸍
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     20A9 |     Prefix_Numeric |    Currency_Symbol |        Halfwidth | ₩
  | × | × |    10AF6 |        Inseparable |  Other_Punctuation |          Neutral | 𐫶
  | ÷ | ÷ |     2212 |     Prefix_Numeric |        Math_Symbol |          Neutral | −
  | ÷ | ÷ |    1F1F4 | Regional_Indicator |       Other_Symbol |          Neutral | 🇴
  | × | × |     FE10 |  Close_Punctuation |  Other_Punctuation |             Wide | ︐
  | × | × |     31F4 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ㇴ
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     FF1F |        Exclamation |  Other_Punctuation |        Fullwidth | ?
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     11FA |                 JT |       Other_Letter |          Neutral | ᇺ
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |     215D |          Ambiguous |       Other_Number |        Ambiguous | ⅝
  | × | × |     2E05 |          Quotation |  Final_Punctuation |          Neutral | ⸅
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     0A1C |         Alphabetic |       Other_Letter |          Neutral | ਜ
  | ÷ | ÷ |     1115 |                 JL |       Other_Letter |             Wide | ᄕ
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     0C6E |            Numeric |     Decimal_Number |          Neutral | ౮
  | × | × |     2B55 |          Ambiguous |       Other_Symbol |             Wide | ⭕
  | ÷ | ÷ |    11F51 |       Aksara_Start |     Decimal_Number |          Neutral | 𑽑
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | × | × |     2E07 |          Quotation |  Other_Punctuation |          Neutral | ⸇
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     FE16 |        Exclamation |  Other_Punctuation |             Wide | ︖
  | ÷ | ÷ |    1F447 |             E_Base |       Other_Symbol |             Wide | 👇
  | × | × |     061E |        Exclamation |  Other_Punctuation |          Neutral | ؞
  | × | × |     2E5A |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹚
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | ÷ | ÷ |    11003 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑀃
  | ÷ | ÷ |     B669 |                 H3 |       Other_Letter |             Wide | 뙩
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | ÷ | ÷ |    1FC46 |        Ideographic |         Unassigned |          Neutral | 🱆
  | ÷ | ÷ |     300C |   Open_Punctuation |   Open_Punctuation |             Wide | 「
  | × | × |     FE56 |        Exclamation |  Other_Punctuation |             Wide | ﹖
  | ÷ | ÷ |     2014 |         Break_Both |   Dash_Punctuation |        Ambiguous | —
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     1198 |                 JV |       Other_Letter |          Neutral | ᆘ
  | ÷ | ÷ |     2757 |          Ambiguous |       Other_Symbol |             Wide | ❗
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     1178 |                 JV |       Other_Letter |          Neutral | ᅸ
  | ÷ | ÷ |     FFE6 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | ₩
  | × | × |     FE57 |        Exclamation |  Other_Punctuation |             Wide | ﹗
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |    1F575 |             E_Base |       Other_Symbol |          Neutral | 🕵
  | × | × |     302A |     Combining_Mark |    Nonspacing_Mark |             Wide | 〪
  | × | × |     20A7 |    Postfix_Numeric |    Currency_Symbol |          Neutral | ₧
  | × | × |     301B |  Close_Punctuation |  Close_Punctuation |             Wide | 〛
  | × | × |     2060 |        Word_Joiner |             Format |          Neutral | ⁠
  | × | × |     2CFF |        Break_After |  Other_Punctuation |          Neutral | ⳿
  | ÷ | ÷ |     3016 |   Open_Punctuation |   Open_Punctuation |             Wide | 〖
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     FE5B |   Open_Punctuation |   Open_Punctuation |             Wide | ﹛
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |    3CE44 |        Ideographic |         Unassigned |             Wide | 𼹄
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |    11A9F |       Break_Before |  Other_Punctuation |          Neutral | 𑪟
  | × | × |    1F3FD |         E_Modifier |    Modifier_Symbol |             Wide | 🏽
  | × | × |     1AC8 |     Combining_Mark |    Nonspacing_Mark |          Neutral | ᫈
  | ÷ | ÷ |     1108 |                 JL |       Other_Letter |             Wide | ᄈ
  | ÷ | ÷ |    1F192 |          Ambiguous |       Other_Symbol |             Wide | 🆒
  | × | × |     061E |        Exclamation |  Other_Punctuation |          Neutral | ؞
  | ÷ | ÷ |    1F3DD |        Ideographic |       Other_Symbol |          Neutral | 🏝
  | × | × |     000B |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |     2E01 |          Quotation |  Other_Punctuation |          Neutral | ⸁
  | × | × |     203A |          Quotation |  Final_Punctuation |          Neutral | ›
  | × | × |     1B31 |             Aksara |       Other_Letter |          Neutral | ᬱ
  | ÷ | ÷ |    114D4 |            Numeric |     Decimal_Number |          Neutral | 𑓔
  | ÷ | ÷ |     3010 |   Open_Punctuation |   Open_Punctuation |             Wide | 【
  | × | × |     2E57 |   Open_Punctuation |   Open_Punctuation |          Neutral | ⹗
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     0027 |          Quotation |  Other_Punctuation |           Narrow | '
  | × | × |     29DB |  Close_Punctuation |  Close_Punctuation |          Neutral | ⧛
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | × | × |     0F0F |        Exclamation |  Other_Punctuation |          Neutral | ༏
  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |     11A0 |                 JV |       Other_Letter |          Neutral | ᆠ
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |     FF04 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | $
  | ÷ | ÷ |    11382 |       Aksara_Start |       Other_Letter |          Neutral | 𑎂
  | ÷ | ÷ |     D7EE |                 JT |       Other_Letter |          Neutral | ퟮ
  | ÷ | ÷ |     A96A |                 JL |       Other_Letter |             Wide | ꥪ
  | ÷ | ÷ |     11E5 |                 JT |       Other_Letter |          Neutral | ᇥ
  | × | × |     FE24 |               Glue |    Nonspacing_Mark |          Neutral | ︤
  | × | × |     FE57 |        Exclamation |  Other_Punctuation |             Wide | ﹗
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     2E07 |          Quotation |  Other_Punctuation |          Neutral | ⸇
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     2E08 |          Quotation |  Other_Punctuation |          Neutral | ⸈
  | × | × |    1FF2E |        Ideographic |         Unassigned |          Neutral | 🼮
  | ÷ | ÷ |    11003 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑀃
  | × | × |     2024 |        Inseparable |  Other_Punctuation |        Ambiguous | ․
  | ÷ | ÷ |    1139B |             Aksara |       Other_Letter |          Neutral | 𑎛
  | ÷ | ÷ |     FF3B |   Open_Punctuation |   Open_Punctuation |        Fullwidth | [
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     20A5 |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₥
  | × | × |     D7B0 |                 JV |       Other_Letter |          Neutral | ힰ
  | ÷ | ÷ |    18BF8 |         Alphabetic |       Other_Letter |             Wide | 𘯸
  | ÷ | ÷ |     2E3B |         Break_Both |   Dash_Punctuation |          Neutral | ⸻
  | × | × |    10AF6 |        Inseparable |  Other_Punctuation |          Neutral | 𐫶
  | ÷ | ÷ |    1F3CC |             E_Base |       Other_Symbol |          Neutral | 🏌
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | ÷ | ÷ |    4C3AE |            Unknown |         Unassigned |          Neutral | 񌎮
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | ÷ | ÷ |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | ÷ | ÷ |     A985 |             Aksara |       Other_Letter |          Neutral | ꦅ
  | × | × |     037E |      Infix_Numeric |  Other_Punctuation |          Neutral | ;
  | × | × |    1F191 |          Ambiguous |       Other_Symbol |             Wide | 🆑
  | × | × |     FB4A |      Hebrew_Letter |       Other_Letter |          Neutral | תּ
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1F1EF | Regional_Indicator |       Other_Symbol |          Neutral | 🇯
  | × | × |     FE13 |         Nonstarter |  Other_Punctuation |             Wide | ︓
  | ÷ | ÷ |    376F5 |        Ideographic |         Unassigned |             Wide | 𷛵
  | ÷ | ÷ |    11668 |       Break_Before |  Other_Punctuation |          Neutral | 𑙨
  | × | × |     2044 |      Infix_Numeric |        Math_Symbol |          Neutral | ⁄
  | ÷ | ÷ |    1F590 |             E_Base |       Other_Symbol |          Neutral | 🖐
  | × | × |     2CFA |        Break_After |  Other_Punctuation |          Neutral | ⳺
  | × | × |     FF1A |         Nonstarter |  Other_Punctuation |        Fullwidth | :
  | ÷ | ÷ |     1BD9 |       Aksara_Start |       Other_Letter |          Neutral | ᯙ
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |    1F677 |          Quotation |       Other_Symbol |          Neutral | 🙷
  | × | × |     00A0 |               Glue |    Space_Separator |          Neutral |  
  | × | × |     C5EC |                 H2 |       Other_Letter |             Wide | 여
  | × | × |    1DA66 |     Combining_Mark |    Nonspacing_Mark |          Neutral | 𝩦
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1F3FB |         E_Modifier |    Modifier_Symbol |             Wide | 🏻
  | × | × |     002C |      Infix_Numeric |  Other_Punctuation |           Narrow | ,
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |     00A4 |     Prefix_Numeric |    Currency_Symbol |        Ambiguous | ¤
  | × | × |     2635 |         Alphabetic |       Other_Symbol |             Wide | ☵
  | ÷ | ÷ |    11F02 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑼂
  | × | × |    1107F |               Glue |    Nonspacing_Mark |          Neutral | 𑁿
  | × | × |    11391 |       Aksara_Start |       Other_Letter |          Neutral | 𑎑
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | × | × |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     3011 |  Close_Punctuation |  Close_Punctuation |             Wide | 】
  | × | × |     2029 |    Mandatory_Break | Paragraph_Separator |          Neutral | 

  | ÷ | ÷ |    1F3CB |             E_Base |       Other_Symbol |          Neutral | 🏋
  | × | × |     0F3B |  Close_Punctuation |  Close_Punctuation |          Neutral | ༻
  | ÷ | ÷ |    2BD31 |        Ideographic |       Other_Letter |             Wide | 𫴱
  | × | × |     0361 |               Glue |    Nonspacing_Mark |        Ambiguous | ͡
  | × | × |    1FF9C |        Ideographic |         Unassigned |          Neutral | 🾜
  | × | × |     0029 |  Close_Parenthesis |  Close_Punctuation |           Narrow | )
  | × | × |    1F88E |            Unknown |         Unassigned |          Neutral | 🢎
  | × | × |    1E5F6 |            Numeric |     Decimal_Number |          Neutral | 𞗶
  | × | × |     061D |        Exclamation |  Other_Punctuation |          Neutral | ؝
  | × | × |     201C |          Quotation | Initial_Punctuation |        Ambiguous | “
  | × | × |     FF6B | Conditional_Japanese_Starter |       Other_Letter |        Halfwidth | ォ
  | ÷ | ÷ |     D7EE |                 JT |       Other_Letter |          Neutral | ퟮ
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     000C |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |     2524 |          Ambiguous |       Other_Symbol |        Ambiguous | ┤
  | × | × |     FFE1 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | £
  | × | × |     2028 |    Mandatory_Break |     Line_Separator |          Neutral | 

  | ÷ | ÷ |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     27ED |  Close_Punctuation |  Close_Punctuation |           Narrow | ⟭
  | ÷ | ÷ |    4BDDA |            Unknown |         Unassigned |          Neutral | 񋷚
  | ÷ | ÷ |    1193F |     Aksara_Prebase |       Other_Letter |          Neutral | 𑤿
  | ÷ | ÷ |     FE6A |    Postfix_Numeric |  Other_Punctuation |             Wide | ﹪
  | ÷ | ÷ |    1F1EF | Regional_Indicator |       Other_Symbol |          Neutral | 🇯
  | × | × |     2E58 |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹘
  | × | × |     0360 |               Glue |    Nonspacing_Mark |        Ambiguous | ͠
  | × | × |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |    CCF5C |            Unknown |         Unassigned |          Neutral | 󌽜
  | × | × |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | ÷ | ÷ |    1F574 |             E_Base |       Other_Symbol |          Neutral | 🕴
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |    1F88A |            Unknown |         Unassigned |          Neutral | 🢊
  | × | × |    1F3B6 |         Alphabetic |       Other_Symbol |             Wide | 🎶
  | × | × |     FE16 |        Exclamation |  Other_Punctuation |             Wide | ︖
  | ÷ | ÷ |     1BF2 |       Virama_Final |       Spacing_Mark |          Neutral | ᯲
  | × | × |     1DFC |               Glue |    Nonspacing_Mark |          Neutral | ᷼
  | × | × |     270A |             E_Base |       Other_Symbol |             Wide | ✊
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     FE42 |  Close_Punctuation |  Close_Punctuation |             Wide | ﹂
  | ÷ | ÷ |    1F574 |             E_Base |       Other_Symbol |          Neutral | 🕴
  | × | × |     3019 |  Close_Punctuation |  Close_Punctuation |             Wide | 〙
  | ÷ | ÷ |    13286 |   Open_Punctuation |       Other_Letter |          Neutral | 𓊆
  | × | × |    1F1F9 | Regional_Indicator |       Other_Symbol |          Neutral | 🇹
  | ÷ | ÷ |     FE3B |   Open_Punctuation |   Open_Punctuation |             Wide | ︻
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | × | × |    1F1FB | Regional_Indicator |       Other_Symbol |          Neutral | 🇻
  | × | × |     2024 |        Inseparable |  Other_Punctuation |        Ambiguous | ․
  | × | × |     3000 |        Break_After |    Space_Separator |        Fullwidth |  
  | ÷ | ÷ |    18BC5 |         Alphabetic |       Other_Letter |             Wide | 𘯅
  | ÷ | ÷ |     1145 |                 JL |       Other_Letter |             Wide | ᅅ
  | × | × |     003A |      Infix_Numeric |  Other_Punctuation |           Narrow | :
  | ÷ | ÷ |    24519 |        Ideographic |       Other_Letter |             Wide | 𤔙
  | × | × |     2E0D |          Quotation |  Final_Punctuation |          Neutral | ⸍
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     D09A |                 H3 |       Other_Letter |             Wide | 킚
  | × | × |     2E1C |          Quotation | Initial_Punctuation |          Neutral | ⸜
  | × | × |     000D |    Carriage_Return |            Control |          Neutral |
  | ÷ | ÷ |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |    1F676 |          Quotation |       Other_Symbol |          Neutral | 🙶
  | × | × |     2049 |         Nonstarter |  Other_Punctuation |          Neutral | ⁉
  | ÷ | ÷ |     BBC0 |                 H2 |       Other_Letter |             Wide | 므
  | ÷ | ÷ |    11B02 |       Break_Before |  Other_Punctuation |          Neutral | 𑬂
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     A960 |                 JL |       Other_Letter |             Wide | ꥠ
  | × | × |     2E56 |  Close_Parenthesis |  Close_Punctuation |          Neutral | ⹖
  | ÷ | ÷ |     26F9 |             E_Base |       Other_Symbol |        Ambiguous | ⛹
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     31F7 | Conditional_Japanese_Starter |       Other_Letter |             Wide | ㇷ
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | ÷ | ÷ |     20A9 |     Prefix_Numeric |    Currency_Symbol |        Halfwidth | ₩
  | × | × |     2757 |          Ambiguous |       Other_Symbol |             Wide | ❗
  | × | × |     FE19 |        Inseparable |  Other_Punctuation |             Wide | ︙
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     25CC |         Alphabetic |       Other_Symbol |          Neutral | ◌
  | × | × |     2E08 |          Quotation |  Other_Punctuation |          Neutral | ⸈
  | × | × |     2014 |         Break_Both |   Dash_Punctuation |        Ambiguous | —
  | × | × |     1C3C |        Break_After |  Other_Punctuation |          Neutral | ᰼
  | ÷ | ÷ |    16FE4 |               Glue |    Nonspacing_Mark |             Wide | 𖿤
  | × | × |    1F3FB |         E_Modifier |    Modifier_Symbol |             Wide | 🏻
  | × | × |     2E53 |        Exclamation |  Other_Punctuation |          Neutral | ⹓
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |     20CC |     Prefix_Numeric |         Unassigned |          Neutral | ⃌
  | ÷ | ÷ |    11F2F |             Aksara |       Other_Letter |          Neutral | 𑼯
  | ÷ | ÷ |    18C0E |         Alphabetic |       Other_Letter |             Wide | 𘰎
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     26F9 |             E_Base |       Other_Symbol |        Ambiguous | ⛹
  | × | × |     FE10 |  Close_Punctuation |  Other_Punctuation |             Wide | ︐
  | ÷ | ÷ |     11F8 |                 JT |       Other_Letter |          Neutral | ᇸ
  | ÷ | ÷ |    11EF1 |       Aksara_Start |       Other_Letter |          Neutral | 𑻱
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     A9A2 |             Aksara |       Other_Letter |          Neutral | ꦢ
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     2E0C |          Quotation | Initial_Punctuation |          Neutral | ⸌
  | × | × |    13258 |   Open_Punctuation |       Other_Letter |          Neutral | 𓉘
  | × | × |    18CC3 |         Alphabetic |       Other_Letter |             Wide | 𘳃
  | × | × |    762B0 |            Unknown |         Unassigned |          Neutral | 񶊰
  | × | × |    1F888 |            Unknown |         Unassigned |          Neutral | 🢈
  | × | × |     200D |                ZWJ |             Format |          Neutral | ‍
  | × | × |     FFE0 |    Postfix_Numeric |    Currency_Symbol |        Fullwidth | ¢
  | × | × |     FE56 |        Exclamation |  Other_Punctuation |             Wide | ﹖
  | × | × |    11048 |        Break_After |  Other_Punctuation |          Neutral | 𑁈
  | ÷ | ÷ |     11A5 |                 JV |       Other_Letter |          Neutral | ᆥ
  | ÷ | ÷ |     05E2 |      Hebrew_Letter |       Other_Letter |          Neutral | ע
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |    10AF6 |        Inseparable |  Other_Punctuation |          Neutral | 𐫶
  | × | × |     0020 |              Space |    Space_Separator |           Narrow |
  | ÷ | ÷ |     2E04 |          Quotation | Initial_Punctuation |          Neutral | ⸄
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     000C |    Mandatory_Break |            Control |          Neutral |

  | ÷ | ÷ |    113D1 |     Aksara_Prebase |       Other_Letter |          Neutral | 𑏑
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |     2049 |         Nonstarter |  Other_Punctuation |          Neutral | ⁉
  | × | × |    16FF0 |     Combining_Mark |       Spacing_Mark |             Wide | 𖿰
  | × | × |     203D |         Nonstarter |  Other_Punctuation |          Neutral | ‽
  | ÷ | ÷ |     FB24 |      Hebrew_Letter |       Other_Letter |          Neutral | ﬤ
  | × | × |     FFE1 |     Prefix_Numeric |    Currency_Symbol |        Fullwidth | £
  | × | × |     301E |  Close_Punctuation |  Close_Punctuation |             Wide | 〞
  | ÷ | ÷ |     3291 |        Ideographic |       Other_Symbol |             Wide | ㊑
  | × | × |     0361 |               Glue |    Nonspacing_Mark |        Ambiguous | ͡
  | × | × |    1FC8E |        Ideographic |         Unassigned |          Neutral | 🲎
  | × | × |     2771 |  Close_Punctuation |  Close_Punctuation |          Neutral | ❱
  | ÷ | ÷ |    1F3FF |         E_Modifier |    Modifier_Symbol |             Wide | 🏿
  | ÷ | ÷ |    11015 |             Aksara |       Other_Letter |          Neutral | 𑀕
  | ÷ | ÷ |     26F7 |        Ideographic |       Other_Symbol |        Ambiguous | ⛷
  | × | × |     A838 |    Postfix_Numeric |    Currency_Symbol |          Neutral | ꠸
  | × | × |     2E02 |          Quotation | Initial_Punctuation |          Neutral | ⸂
  | × | × |     002E |      Infix_Numeric |  Other_Punctuation |           Narrow | .
  | × | × |     000A |          Line_Feed |            Control |          Neutral |

  | ÷ | ÷ |    1193F |     Aksara_Prebase |       Other_Letter |          Neutral | 𑤿
  | ÷ | ÷ |    1FD9B |        Ideographic |         Unassigned |          Neutral | 🶛
  | × | × |     200B |            ZWSpace |             Format |          Neutral | ​
  | ÷ | ÷ |     07C0 |            Numeric |     Decimal_Number |          Neutral | ߀
  | × | × |     2057 |    Postfix_Numeric |  Other_Punctuation |          Neutral | ⁗
  | × | × |     203D |         Nonstarter |  Other_Punctuation |          Neutral | ‽
  | ÷ | ÷ |     1F82 |         Alphabetic |   Lowercase_Letter |          Neutral | ᾂ
  | ÷ | ÷ |     2E3A |         Break_Both |   Dash_Punctuation |          Neutral | ⸺
  | ÷ | ÷ |    58360 |            Unknown |         Unassigned |          Neutral | 񘍠
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     2567 |          Ambiguous |       Other_Symbol |        Ambiguous | ╧
  | ÷ | ÷ |     FE3F |   Open_Punctuation |   Open_Punctuation |             Wide | ︿
  | × | × |    14043 |         Alphabetic |       Other_Letter |          Neutral | 𔁃
  | ÷ | ÷ |    11669 |       Break_Before |  Other_Punctuation |          Neutral | 𑙩
  | × | × |     FEFF |        Word_Joiner |             Format |          Neutral | 
  | × | × |     A9B9 |     Combining_Mark |    Nonspacing_Mark |          Neutral | ꦹ
  | × | × |     D3B4 |                 H2 |       Other_Letter |             Wide | 펴
  | × | × |     2032 |    Postfix_Numeric |  Other_Punctuation |        Ambiguous | ′
  | ÷ | ÷ |    1FD53 |        Ideographic |         Unassigned |          Neutral | 🵓
  | ÷ | ÷ |     FB22 |      Hebrew_Letter |       Other_Letter |          Neutral | ﬢ
  | × | × |     FF0C |  Close_Punctuation |  Other_Punctuation |        Fullwidth | ,
  | × | × |     275F |          Quotation |       Other_Symbol |          Neutral | ❟
  | × | × |     1944 |        Exclamation |  Other_Punctuation |          Neutral | ᥄
  | ÷ | ÷ |    1F1F5 | Regional_Indicator |       Other_Symbol |          Neutral | 🇵
  | ÷ | ÷ |     20BC |     Prefix_Numeric |    Currency_Symbol |          Neutral | ₼
  | ÷ | ÷ |     1B44 |             Virama |       Spacing_Mark |          Neutral | ᭄
  | ÷ | ÷ |    1F1EF | Regional_Indicator |       Other_Symbol |          Neutral | 🇯
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     002F |      Break_Symbols |  Other_Punctuation |           Narrow | /
  | × | × |     0029 |  Close_Parenthesis |  Close_Punctuation |           Narrow | )
  | ÷ | ÷ |    1F46B |             E_Base |       Other_Symbol |             Wide | 👫
  | × | × |     FE15 |        Exclamation |  Other_Punctuation |             Wide | ︕
  | × | × |     002D |             Hyphen |   Dash_Punctuation |           Narrow | -
  | ÷ | ÷ |     27E6 |   Open_Punctuation |   Open_Punctuation |           Narrow | ⟦
  | × | × |     2014 |         Break_Both |   Dash_Punctuation |        Ambiguous | —
  | ÷ | ÷ |    1F3FD |         E_Modifier |    Modifier_Symbol |             Wide | 🏽
  | ÷ | ÷ |     05E2 |      Hebrew_Letter |       Other_Letter |          Neutral | ע
  | ÷ | ÷ |    1F6B4 |             E_Base |       Other_Symbol |             Wide | 🚴
  | × | × |     0085 |          Next_Line |            Control |          Neutral |
  | ÷ | ÷ |     D7E4 |                 JT |       Other_Letter |          Neutral | ퟤ
  | ÷ | ÷ |     FFFC |   Contingent_Break |       Other_Symbol |          Neutral | 
  | ÷ | ÷ |    1138B |       Aksara_Start |       Other_Letter |          Neutral | 𑎋
  | × | × |    1343B |               Glue |             Format |          Neutral | 𓐻
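
A dump like the one above can be scanned mechanically for disagreements. The following is a minimal sketch, assuming that the `A` column is the break status produced by the implementation and `E` is the status expected by the test data, and that a failing row may carry a marker before the first `|` (as in the earlier failing case). The names `Row`, `parse_row` and `mismatches` are hypothetical and are not part of ICU or ICU4X.

```python
from typing import Iterable, List, NamedTuple, Optional

BREAK_MARKS = ("÷", "×")  # "÷" = break opportunity, "×" = no break


class Row(NamedTuple):
    actual: str       # assumed meaning of column "A"
    expected: str     # assumed meaning of column "E"
    code_point: int
    line_break: str   # Line_Break property value as printed in the dump


def parse_row(line: str) -> Optional[Row]:
    """Parse one table row; return None for blank, header or wrapped lines."""
    cells = [cell.strip() for cell in line.split("|")]
    # cells[0] is whatever precedes the first "|" (blank or a failure marker).
    if len(cells) < 5:
        return None
    if cells[1] not in BREAK_MARKS or cells[2] not in BREAK_MARKS:
        return None
    try:
        code_point = int(cells[3], 16)
    except ValueError:
        return None
    return Row(cells[1], cells[2], code_point, cells[4])


def mismatches(lines: Iterable[str]) -> List[Row]:
    """Return the rows whose two break columns disagree."""
    rows = (parse_row(line) for line in lines)
    return [row for row in rows if row is not None and row.actual != row.expected]
```

Lines that are blank because a control character in the Literal column forced a line break are skipped rather than treated as errors.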

@makotokato

Member Author

@eggrobin, which git revision and parameters did you use to generate these tests?

@eggrobin

Member

I was somewhere near the current ICU main, but the revision does not matter: the monkey tests have not changed since July 2025, so release-78.2, which you are using, will produce the same results.

The command line was

.\icu4c\source\test\intltest\x64\Release\intltest.exe "rbbi/RBBITest/TestMonkey@type=line seed=1729 loop=5000000 scalars_only export=meow"

(The loop count gets divided by five for line breaking, so this generates a million tests. That said, if you are still getting an error every hundred thousand tests, a million is probably not going to be enough.)
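
To make the arithmetic above concrete, here is a rough sketch. It assumes, purely for illustration, that each generated test fails independently with a fixed probability; real failures are tied to specific rule interactions, so this is only an order-of-magnitude guide. The function names are hypothetical.

```python
def line_tests_generated(loop_count: int) -> int:
    """Number of line-break tests for a given loop= value.

    Per the comment above, the loop count is divided by five for line breaking.
    """
    return loop_count // 5


def prob_run_passes(num_tests: int, failure_rate: float) -> float:
    """Probability that a run of num_tests sees no failure at all.

    Assumes independent failures at a fixed per-test rate (an illustrative
    simplification, not a property of the monkey tests).
    """
    return (1.0 - failure_rate) ** num_tests


tests = line_tests_generated(5_000_000)
rate = 1 / 100_000  # "an error every hundred thousand tests"

print("tests generated:", tests)
print("expected failures in that run:", tests * rate)
print("chance a 32,000-test run passes anyway:", prob_run_passes(32_000, rate))
print("chance the full run passes anyway:", prob_run_passes(tests, rate))
```

Under these assumptions, a bug that shows up once per hundred thousand tests would slip through a 32,000-test run more often than not, which is why a clean short run says little. Bugs that are rarer still would need correspondingly longer runs.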

@sffc

sffc commented Jun 10, 2026

Member

Can someone characterize the nature of the tests that fail?

Which description is closer:

  1. Edge cases that occur 0.001% of the time (1/100k) and result in suboptimal but still intelligible output
  2. Flaws that occur frequently in certain types of text (emojis, certain languages, ...) and cause that text to be less intelligible

If the failures are closer to case (1), I personally have no problem landing and potentially shipping it with a documented "known issue". We would need consensus from the TC of course.

@eggrobin

Member

See my earlier comment #7823 (comment); the testing strategy is not capable of discerning these two categories.

It would take a lot of work to generate a meaningful sample of errors, analyse them (which requires deep knowledge of the algorithm and the reasons for the rules; in practice that probably means I would need to be the one doing it), and understand what kind of text would run into them. That is certainly even more work than fixing them, which is already more than I would want to do.

@Manishearth

Member

Stepping back: I don't think we should be spending time on fixing this right now. @robertbastian has been experimenting with a new approach for segmenters that is more maintainable (easier to upgrade), where we can get data from upstream. These experiments are showing promising results, and we hope to write up a proper proposal soon.

https://github.com/unicode-org/icu4x/tree/main/components/segmenter/src/neo

Having an updated Unicode 17 implementation is nice, but I would not consider it a priority if it is taking a lot of back and forth with the monkey tests.

@robertbastian

Member

More context in #7962

