Fix content classification minimum content check for CJK languages #716

Closed
i-anubhav-anand wants to merge 2 commits into WordPress:develop from i-anubhav-anand:fix/content-classification-cjk-word-count

i-anubhav-anand commented Jun 12, 2026

Contributor

Problem

Fixes #571

Japanese, Chinese, and Korean don't separate words with spaces. `count( text, 'words' )` returns near-zero for CJK content, so `hasEnoughContent` was always false for CJK users, and the classification panel stayed disabled even for long articles.
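
To illustrate the failure mode, here is a minimal TypeScript sketch. `naiveWordCount` is a hypothetical whitespace-splitting stand-in, not the `count` function this PR calls (its tokenization is more involved), and both sample sentences are illustrative only.

```typescript
// Hypothetical stand-in for a word counter: splits on runs of whitespace.
// This is NOT the `count( text, 'words' )` implementation referenced in the PR.
function naiveWordCount(text: string): number {
	const trimmed = text.trim();
	return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Illustrative samples: an English sentence and a Japanese sentence of similar meaning.
const english = 'The quick brown fox jumps over the lazy dog';
const japanese = '素早い茶色の狐がのろまな犬を飛び越える';

const englishWords = naiveWordCount(english);
const japaneseWords = naiveWordCount(japanese);
```

The Japanese sentence contains no whitespace, so a whitespace-delimited counter sees a single token no matter how long the text is.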

Solution

Detect CJK content via a Unicode range regex and switch to `'characters_excluding_spaces'` counting. The same `MINIMUM_WORD_COUNT = 150` threshold is reused; in CJK text, 150 characters is roughly a short paragraph, so the threshold stays meaningful.

Non-CJK content continues to use word-based counting as before.
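
The approach above might look roughly like the following sketch. It is not the code from this PR: `hasEnoughContent` is modeled here as a plain function, and `countWords` and `countCharactersExcludingSpaces` are hypothetical stand-ins for the two counting modes. The regex ranges are the ones quoted in the second commit message.

```typescript
const MINIMUM_WORD_COUNT = 150;

// Ranges as quoted in the commit message; the real constant may differ.
const CJK_REGEX = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// Hypothetical stand-in for word-based counting.
function countWords(text: string): number {
	const trimmed = text.trim();
	return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Hypothetical stand-in for 'characters_excluding_spaces' counting.
// `\s` also matches the ideographic space U+3000, so it is not counted.
function countCharactersExcludingSpaces(text: string): number {
	return Array.from(text.replace(/\s/g, '')).length;
}

// Assumption: a single CJK character anywhere in the text selects character counting.
function hasEnoughContent(text: string): boolean {
	const units = CJK_REGEX.test(text)
		? countCharactersExcludingSpaces(text)
		: countWords(text);
	return units >= MINIMUM_WORD_COUNT;
}
```

One consequence of detecting CJK with a single `test()` call, at least in this sketch, is that mixed-language text containing even one CJK character is measured in characters, so a mostly English post could pass the threshold with far fewer than 150 words.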

Changes

  • src/experiments/content-classification/components/useContentClassification.ts
    • Add CJK_REGEX constant
    • hasEnoughContent: detect CJK and use character count instead of word count

Testing

  1. Create a post with 150+ Japanese/Chinese/Korean characters
  2. Open the Content Classification panel — the Generate button should be enabled
  3. Verify English posts still require 150+ words before the button enables

i-anubhav-anand and others added 2 commits June 12, 2026 17:11
Japanese, Chinese, and Korean don't separate words with spaces, so
`count( text, 'words' )` returns near-zero for these languages even for
long paragraphs. This makes `hasEnoughContent` always false, blocking
classification for CJK users.

Detect CJK content and switch to `characters_excluding_spaces` counting
with the same 150-unit threshold, which is meaningful for CJK text
(≈ a short paragraph).

The regex literal contained a raw ideographic space (U+3000) as the
start of its first character range, which ESLint's
no-irregular-whitespace rule rejects. Escaped ranges are equivalent
and easier to review: \u3000-\u9FFF, \uAC00-\uD7FF, \uFF01-\uFF60.
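
As a sanity check on the equivalence claim, the sketch below compares the escaped character class with a raw-character one across every UTF-16 code unit. It assumes the original literal used the same range endpoints, which this page does not show; the raw form is assembled at runtime so the snippet itself contains no literal U+3000. `ch` and `compareOverBmp` are hypothetical helper names.

```typescript
// Escaped form, using the ranges quoted in the commit message.
const ESCAPED_CJK = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// Raw-character form, built from code units instead of typed characters.
// Assumption: same endpoints as the escaped form.
const ch = (codeUnit: number): string => String.fromCharCode(codeUnit);
const RAW_CJK = new RegExp(
	'[' +
		ch(0x3000) + '-' + ch(0x9fff) +
		ch(0xac00) + '-' + ch(0xd7ff) +
		ch(0xff01) + '-' + ch(0xff60) +
		']'
);

// Test both patterns against every single UTF-16 code unit (0x0000 to 0xFFFF).
function compareOverBmp(): { agree: boolean; matched: number } {
	let agree = true;
	let matched = 0;
	for (let unit = 0; unit <= 0xffff; unit++) {
		const s = String.fromCharCode(unit);
		const escapedHit = ESCAPED_CJK.test(s);
		if (escapedHit !== RAW_CJK.test(s)) {
			agree = false;
		}
		if (escapedHit) {
			matched++;
		}
	}
	return { agree, matched };
}

const comparison = compareOverBmp();
```

If `comparison.agree` is true, the two spellings accept exactly the same code units, and the escaped one simply avoids tripping `no-irregular-whitespace`.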
i-anubhav-anand marked this pull request as ready for review June 12, 2026 20:02

github-actions Bot commented Jun 12, 2026


The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: i-anubhav-anand <anubhav24@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Co-authored-by: t-hamano <wildworks@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

dkotter commented Jun 12, 2026

Collaborator

@i-anubhav-anand This is another PR that duplicates the effort in #581. Please review that PR, and let's try to keep the work there to avoid duplication.

@i-anubhav-anand

Contributor Author

Thanks @dkotter — you're right, this overlaps the existing PR you linked. Closing in favor of it to keep the work consolidated; happy to help review or iterate there instead. Apologies for the duplicated effort!

Development

Successfully merging this pull request may close these issues.

Content Classification: Make character count locale-aware

2 participants