Update line segmenter to Unicode 17 #8041
# © 2025 Unicode®, Inc.
# Unicode and the Unicode Logo are registered trademarks of Unicode, Inc. in the U.S. and other countries.
# For terms of use and license, see https://www.unicode.org/terms_of_use.html
#
# Generated from unicodetools that is replace SA_Mn, SA_Mc and SAmMnmMc with original SA to handle
I don't think this is necessary, as spec_test now uses new_for_non_complex_scripts
Ah it fails the unaltered test file:
| A | E | Code pt. | Line_Break | General_Category | East_Asian_Width | Literal
| ÷ | ÷ | 0E31 | Complex_Context | Nonspacing_Mark | Neutral | ั
| × | × | 0308 | Combining_Mark | Nonspacing_Mark | Ambiguous | ̈
😭| ÷ | × | 0E31 | Complex_Context | Nonspacing_Mark | Neutral | ั
Test case #14562
It should really pass it
I pushed a commit that handles this discrepancy in the test, so we can check in the unaltered test data
Unfortunately, for such a large state machine, it looks like this is not enough. I tried with a million strings, and got a first failure on test case #136268, see below (A more realistic and minimal test case would be |
The next unhappy monkey is #245722. With a failure every hundred thousand tests (which generate at a rate of a million per hour using the very slow naïve implementation), this is going to become impractical to debug much further…
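For a rough sense of what that failure rate implies, here is a minimal sketch in Rust. It assumes the monkey tests are independent and that each one fails with a constant probability of about 1 in 100,000; both are simplifying assumptions, not properties of the real test harness. The function names are hypothetical and are not part of ICU or ICU4X.

```rust
/// Probability that `n` independent tests all pass when each one fails
/// with probability `p`. (Assumption: failures are independent.)
fn prob_all_pass(p: f64, n: u64) -> f64 {
    (1.0 - p).powf(n as f64)
}

/// Smallest number of tests for which at least one failure shows up
/// with probability `confidence`, under the same independence assumption.
fn tests_needed(p: f64, confidence: f64) -> u64 {
    ((1.0 - confidence).ln() / (1.0 - p).ln()).ceil() as u64
}

fn main() {
    // Assumed rate: roughly one failure per hundred thousand tests.
    let p = 1.0 / 100_000.0;

    // Chance of a clean run for a few run sizes.
    for n in [32_000u64, 100_000, 1_000_000] {
        println!("{n} tests: P(no failure) = {:.5}", prob_all_pass(p, n));
    }

    // Run size needed to see a failure with 95% probability.
    println!("tests for 95% detection: {}", tests_needed(p, 0.95));
}
```

Under these assumptions, a clean run of a few tens of thousands of tests is still more likely than not even when the bug is present, so a short passing run is weak evidence that the state machine is correct.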
@eggrobin, what git revision and parameters did you use to create these tests?
I was somewhere near the current ICU main, but the revision does not matter, the monkey tests have not changed since July 2025, so release-78.2 which you are using will produce the same results. The command line was
(The loop count gets divided by five for line breaking, so this generates a million tests. That said, if you are still getting an error every hundred thousand tests, a million is probably not going to be enough.)
Can someone characterize the nature of the tests that fail? Which description is closer:
If the failures are closer to case (1), I personally have no problem landing and potentially shipping it with a documented "known issue". We would need consensus from the TC of course.
See my earlier comment #7823 (comment); the testing strategy is not capable of discerning these two categories. It would take a lot of work to generate a meaningful sample of errors, analyse them (which requires deep knowledge of the algorithm and the reasons for the rules, in practice that probably means I would need to be the one doing it), and understand what kind of text would run into them. Certainly even more than to fix them, and already that is more than I would want to do.
Stepping back: I don't think we should be spending time on fixing this right now. @robertbastian has been experimenting with a new approach for segmenters that is more maintainable (easier to upgrade) where we can get data from upstream. These experiments are showing promising results, and we hope to write up a proper proposal soon. https://github.com/unicode-org/icu4x/tree/main/components/segmenter/src/neo Having an updated Unicode 17 implementation is nice, but I would not consider it a priority if it is taking a lot of back and forth with the monkey tests.
More context in #7962
Update to my previous comment: I have now passed a monkey test run of 32,000 tests.
Changelog