Fix content classification minimum content check for CJK languages #716
i-anubhav-anand wants to merge 2 commits into
Japanese, Chinese, and Korean don't separate words with spaces, so `count( text, 'words' )` returns near-zero for these languages even for long paragraphs. This makes `hasEnoughContent` always false, blocking classification for CJK users. Detect CJK content and switch to `characters_excluding_spaces` counting with the same 150-unit threshold, which is meaningful for CJK text (≈ a short paragraph).
The regex literal contained a raw ideographic space (U+3000) as the start of its first character range, which ESLint's `no-irregular-whitespace` rule rejects. Escaped ranges are equivalent and easier to review: `\u3000-\u9FFF`, `\uAC00-\uD7FF`, `\uFF01-\uFF60`.
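As a rough illustration of the escaped form, the three ranges above can be combined into a single character class. This is a minimal sketch, not the PR's actual code: the exact pattern and flags used in the PR are not shown here, and `containsCjk` is a hypothetical helper added only for the example.

```typescript
// Assumed shape of the pattern: one character class built from the three
// escaped ranges named in the commit message. The PR's real regex may differ.
const CJK_REGEX = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// Hypothetical helper, for illustration only.
function containsCjk( text: string ): boolean {
	return CJK_REGEX.test( text );
}

// U+3000 (ideographic space) is the lower bound of the first range, so the
// escaped pattern still matches it without a raw irregular space in the source.
const ideographicSpace = String.fromCharCode( 0x3000 );

console.log( containsCjk( ideographicSpace ) ); // true
console.log( containsCjk( '日本語のテキスト' ) ); // true (kanji and kana)
console.log( containsCjk( '한국어' ) ); // true (Hangul syllables)
console.log( containsCjk( 'Plain English text' ) ); // false
```

Writing the bounds as `\uXXXX` escapes keeps every code point visible in review, whereas a raw U+3000 is indistinguishable from an ordinary space in most editors.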
@i-anubhav-anand This is another PR that duplicates effort in #581. Please review that PR and let's try to keep work there to avoid duplication.
Thanks @dkotter, you're right, this overlaps the existing PR you linked. Closing in favor of it to keep the work consolidated; happy to help review or iterate there instead. Apologies for the duplicated effort!
Problem
Fixes #571
Japanese, Chinese, and Korean don't separate words with spaces. `count( text, 'words' )` returns near-zero for CJK content, so `hasEnoughContent` was always `false` for CJK users: the classification panel was permanently disabled even for long articles.
Solution
Detect CJK content via a Unicode range regex and switch to `'characters_excluding_spaces'` counting. The same `MINIMUM_WORD_COUNT = 150` threshold is reused, which is meaningful in CJK context (≈ a short paragraph of ~150 characters).
Non-CJK content continues to use word-based counting as before.
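The switch described above might look roughly like the following. This is a sketch under stated assumptions rather than the PR's implementation: `countWords` and `countCharactersExcludingSpaces` are simplified stand-ins for `count( text, 'words' )` and `count( text, 'characters_excluding_spaces' )` so the example runs without dependencies, the regex shape is assumed from the ranges listed in the commit message, and the `>=` comparison is a guess at how the threshold is applied.

```typescript
const MINIMUM_WORD_COUNT = 150;

// Assumed pattern, built from the ranges listed in the commit message.
const CJK_REGEX = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// Simplified stand-in for count( text, 'words' ): splits on whitespace.
function countWords( text: string ): number {
	const trimmed = text.trim();
	return trimmed === '' ? 0 : trimmed.split( /\s+/ ).length;
}

// Simplified stand-in for count( text, 'characters_excluding_spaces' ).
function countCharactersExcludingSpaces( text: string ): number {
	return text.replace( /\s/g, '' ).length;
}

// Assumption: any CJK character switches the whole text to character
// counting, and the threshold is inclusive. The PR may handle both differently.
function hasEnoughContent( text: string ): boolean {
	const units = CJK_REGEX.test( text )
		? countCharactersExcludingSpaces( text )
		: countWords( text );
	return units >= MINIMUM_WORD_COUNT;
}
```

With these stand-ins, 150 CJK characters with no spaces count as one "word" under whitespace splitting but as 150 units under character counting, which is the gap the PR describes. One consequence of this particular sketch is that mixed-language text containing a single CJK character is measured in characters rather than words.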
Changes
`src/experiments/content-classification/components/useContentClassification.ts`
- `CJK_REGEX` constant
- `hasEnoughContent`: detect CJK and use character count instead of word count
Testing