Content Classification: Fix minimum-content threshold for CJK languages #728

Closed
i-anubhav-anand wants to merge 36 commits into WordPress:trunk from i-anubhav-anand:fix/content-classification-cjk-word-count

@i-anubhav-anand i-anubhav-anand commented Jun 15, 2026

Contributor

What?

Closes #571

Make the content classification minimum-content threshold locale-aware for CJK languages (Chinese, Japanese, Korean).

Why?

CJK text does not separate words with spaces, so count( text, 'words' ) from @wordpress/wordcount returns zero or close to it for Japanese and Chinese, however long the text is. With the 150-word threshold hard-coded, the "Suggest Tags/Categories" button stays permanently disabled for CJK users regardless of how much content they write.
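
To illustrate the failure mode, here is a minimal sketch that uses a hypothetical whitespace-based counter, countWordsByWhitespace, as a simplified stand-in for count( text, 'words' ). It is not the @wordpress/wordcount implementation, which also strips HTML and honors locale settings, but it shows why a space-delimited notion of "word" never approaches 150 for Japanese text.

```javascript
// Hypothetical stand-in for a whitespace-based word counter.
// Assumption: words are runs of non-whitespace separated by whitespace.
function countWordsByWhitespace( text ) {
	const trimmed = text.trim();
	return trimmed === '' ? 0 : trimmed.split( /\s+/ ).length;
}

// 30 repetitions of a 6-character Japanese phrase: 180 characters, no spaces.
const japanese = 'こんにちは。'.repeat( 30 );
const english = 'The quick brown fox jumps over the lazy dog.';

const japaneseWords = countWordsByWhitespace( japanese ); // 1
const englishWords = countWordsByWhitespace( english ); // 9

console.log( { japaneseWords, englishWords } );
```

Under this stand-in, the whole Japanese paragraph counts as a single "word", so a 150-word minimum cannot be met by adding more text.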

How?

  • Added a CJK_REGEX that matches CJK symbols and punctuation, kana, CJK Unified Ideographs, Hangul, and fullwidth forms (/[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/).
  • When the post content contains CJK characters, the threshold is evaluated with 'characters_excluding_spaces' instead of 'words'. The 150 minimum is reused; 150 characters is a reasonable proxy for "enough content" in CJK scripts.
  • Exported hasCJKContent from the hook so SuggestionPanel can adapt the hint text: "approximately 150 characters" for CJK content, "approximately 150 words" otherwise.

Note: instead of the locale-based _x() approach referenced in the issue, this uses runtime content detection, which works for multilingual posts and requires no translator intervention.
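
The approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: MIN_CONTENT_UNITS, countWords, and countCharactersExcludingSpaces are hypothetical names, the two counters are simplified stand-ins for the @wordpress/wordcount count() types, and the inclusive >= comparison is a guess at how the minimum is applied. Only CJK_REGEX (with the escaped ranges from the commit message), hasCJKContent, and hasEnoughContent are names taken from this PR.

```javascript
// Escaped ranges as quoted in the PR's commit message.
const CJK_REGEX = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// Hypothetical constant name; the PR reuses the existing minimum of 150.
const MIN_CONTENT_UNITS = 150;

// True when the text contains at least one character in the CJK ranges.
function hasCJKContent( text ) {
	return CJK_REGEX.test( text );
}

// Simplified stand-in for count( text, 'words' ).
function countWords( text ) {
	const trimmed = text.trim();
	return trimmed === '' ? 0 : trimmed.split( /\s+/ ).length;
}

// Simplified stand-in for count( text, 'characters_excluding_spaces' ).
// Array.from counts code points, so astral characters count once.
function countCharactersExcludingSpaces( text ) {
	return Array.from( text.replace( /\s/g, '' ) ).length;
}

// Pick the counting unit from the content itself, not from the site locale.
function hasEnoughContent( text ) {
	const units = hasCJKContent( text )
		? countCharactersExcludingSpaces( text )
		: countWords( text );
	return units >= MIN_CONTENT_UNITS; // assumption: threshold is inclusive
}

console.log( hasEnoughContent( 'こんにちは。'.repeat( 30 ) ) ); // true
console.log( hasEnoughContent( 'hello world' ) ); // false
```

One consequence of detecting on content is that a mostly English post containing a single CJK character is measured in characters as well, which makes the 150 minimum much easier to reach for such mixed posts.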

Use of AI Tools

AI assistance: Yes
Tool(s): Claude Code
Model(s): claude-sonnet-4-6
Used for: Implementation assistance; code reviewed and verified by me.

Testing Instructions

  1. Set site language to Japanese (Settings → General → Site Language).
  2. Open a post, paste こんにちは。 ~30 times in the content area.
  3. Open the Tags or Categories panel in the sidebar.
  4. Confirm the "Suggest Tags" button becomes enabled (previously always disabled for CJK).
  5. Verify the hint text says "approximately 150 characters" not "approximately 150 words".
  6. Switch back to English, add fewer than 150 words — confirm hint says "approximately 150 words".
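
The hint-text expectations in steps 5 and 6 can be sketched as a small function. getHintUnitText is a hypothetical name, and only the two quoted phrases come from this PR; the surrounding hint sentence is not reproduced here. In the real component the strings would presumably pass through a translation function such as __() from @wordpress/i18n, which is omitted to keep the sketch self-contained.

```javascript
// Escaped ranges as quoted in the PR's commit message.
const CJK_REGEX = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// Hypothetical helper: choose the unit wording shown in the hint
// based on whether the post content contains CJK characters.
function getHintUnitText( content ) {
	return CJK_REGEX.test( content )
		? 'approximately 150 characters'
		: 'approximately 150 words';
}

console.log( getHintUnitText( 'こんにちは。' ) ); // approximately 150 characters
console.log( getHintUnitText( 'Hello world' ) ); // approximately 150 words
```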

Screenshots or screencast

N/A — the change is functional (button enables/disables correctly) and the hint text adapts.

Changelog Entry

Fixed - Content Classification: minimum-content threshold now uses character count for CJK languages (Chinese, Japanese, Korean) so the "Suggest Tags/Categories" button correctly enables for CJK content. The hint text adapts to say "approximately 150 characters" for CJK content.

dependabot Bot and others added 30 commits May 28, 2026 08:38
Developer - Bump `tmp` from 0.2.5 to 0.2.7

Co-authored-by: dkotter <dkotter@git.wordpress.org>
… Next/Previous in media modal (WordPress#631)

Fixed - Alt Text Generation button becomes unresponsive after using Next/Previous in the media modal.

Unlinked contributors: kohcsi.

Co-authored-by: yogeshbhutkar <yogeshbhutkar@git.wordpress.org>
Co-authored-by: jeffpaul <jeffpaul@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Developer - Bump `codecov/codecov-action` from 6.0.0 to 6.0.1

Co-authored-by: dkotter <dkotter@git.wordpress.org>
…WordPress#635)

Developer - Bump `phpstan/php-8-stubs` from 0.4.34 to 0.4.35 and `phpstan/phpstan` from 2.1.54 to 2.1.55

Co-authored-by: dkotter <dkotter@git.wordpress.org>
…ess#605)

Changed - Ensure the Editorial Notes and Editorial Updates controls stay grouped together in the post editor sidebar

Co-authored-by: macayu17 <ayushhoff@git.wordpress.org>
Co-authored-by: jeffpaul <jeffpaul@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…lorer (WordPress#642)

Fixed - Added accessible labels to the provider and category filter dropdowns in the Abilities Explorer page.

Co-authored-by: Trushiv04 <trushiv@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…ss on alt text (WordPress#645)

Fixed - Lost focus when generating the alt text in image block inspector controls.

Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Fixed - Lost focus when toggling the connector approval state.

Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Fixed - Abilities Explorer schema validation.

Co-authored-by: ekamran <ekamran@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…rdPress#644)

Fixed - Lost focus after generating a title.

Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…trix (WordPress#637)

Fixed - Add descriptive accessible labels to approval matrix toggle controls.

Co-authored-by: ishitaj34 <ishitaj34@git.wordpress.org>
Co-authored-by: Trushiv04 <trushiv@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…es Explorer (WordPress#649)

Fixed - Added an accessible label to the ability test payload textarea in the Abilities Explorer.

Co-authored-by: Trushiv04 <trushiv@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Fixed - Excerpt generation post context payload.

Co-authored-by: ekamran <ekamran@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Changed - Return a consistent decorative flag from alt text generation results

Co-authored-by: yusufhay <yusufmudagal@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Fixed - Clear out the meta description suggestion when the modal closes

Co-authored-by: ekamran <ekamran@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Developer - Bump `phpstan/phpstan` from 2.1.55 to 2.2.1

Co-authored-by: dkotter <dkotter@git.wordpress.org>
…29dac8d2bf9a1e8493865fc97cd1c3c87b to 5e92f5e3c80d06126f22e83e4bb21221fbbd3e7f in the github-actions-updates group (WordPress#673)

Developer - Bump `WordPress/action-wp-playground-pr-preview` to latest version

Co-authored-by: dkotter <dkotter@git.wordpress.org>
Changed - Use explicit UTF-8 encoding for generated meta description character counts.

Co-authored-by: yusufhay <yusufmudagal@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…ress#669)

Fixed - Column reordering and hiding in the AI Request Logs table now persists instead of resetting to the default.

Unlinked contributors: alexWinterjuice.

Co-authored-by: Trushiv04 <trushiv@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Fixed - UI inconsistency on AI Request Logs page.

Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Developer - Add `@WordPress/ai-maintainers` team.

Co-authored-by: jeffpaul <jeffpaul@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…ent in AI Request Logs (WordPress#671)

Fixed - Summary statistics showing zero for short time periods on non-UTC MySQL servers.

Unlinked contributors: alexWinterjuice.

Co-authored-by: prasadkarmalkar <prasadkarmalkar@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…ess#647)

Fixed - Lost focus after generating images.

Co-authored-by: yogeshbhutkar <yogeshbhutkar@git.wordpress.org>
Co-authored-by: t-hamano <wildworks@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…(thoughtsTokenCount ignored) (WordPress#680)

Fixed - Ensuring thinking tokens are counted in request logs.

Co-authored-by: prasadkarmalkar <prasadkarmalkar@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Co-authored-by: riccardodicurti <riccardodicurti@git.wordpress.org>
Added - Manual refresh button to the AI Request Logs table header.

Co-authored-by: prasadkarmalkar <prasadkarmalkar@git.wordpress.org>
Co-authored-by: pbearne <pbearne@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
… suggestions (WordPress#663)

Fixed - Lost focus after running content resizing actions.

Co-authored-by: yogeshbhutkar <yogeshbhutkar@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
…l in Alt_Text_Generation (WordPress#688)

Fixed - Ensure the Ability schemas and outputs are valid JSON Schema for strict REST and MCP consumers.

Co-authored-by: the-hercules <thehercules@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Co-authored-by: WouterP0lman <wouterpolman@git.wordpress.org>
… template" off (normal editing mode), until reload (WordPress#694)

Fixed - Title generation button disappears after toggling "Show template" off.

Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Developer - Clarify AI Connector provider setup.

Co-authored-by: ekamran <ekamran@git.wordpress.org>
Co-authored-by: jeffpaul <jeffpaul@git.wordpress.org>
Fixed - Prevent accidental interactions and stale feedback in the Meta Description generation modal and improve focus handling.

Co-authored-by: yogeshbhutkar <yogeshbhutkar@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Infinite-Null and others added 6 commits June 11, 2026 09:58
…ordPress#698)

Fixed - Ensure focus isn't lost after generating an excerpt inline.

Co-authored-by: Infinite-Null <ankitkumarshah@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Changed - Show an error message immediately in the image generation UI when there's no AI Connector in place that supports image generation.

Co-authored-by: t-hamano <wildworks@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Co-authored-by: mindctrl <mindctrl@git.wordpress.org>
…dPress#703)

Developer - Removes the `ready_for_review` pull request event from the Test and Plugin Check GitHub Actions workflows.

Co-authored-by: Infinite-Null <ankitkumarshah@git.wordpress.org>
Co-authored-by: yogeshbhutkar <yogeshbhutkar@git.wordpress.org>
Co-authored-by: dkotter <dkotter@git.wordpress.org>
Japanese, Chinese, and Korean don't separate words with spaces, so
`count( text, 'words' )` returns near-zero for these languages even for
long paragraphs. This makes `hasEnoughContent` always false, blocking
classification for CJK users.

Detect CJK content and switch to `characters_excluding_spaces` counting
with the same 150-unit threshold, which is meaningful for CJK text
(≈ a short paragraph).

The regex literal contained a raw ideographic space (U+3000) as the
start of its first character range, which ESLint's
no-irregular-whitespace rule rejects. Escaped ranges are equivalent
and easier to review: \u3000-\u9FFF, \uAC00-\uD7FF, \uFF01-\uFF60.

When the post content contains CJK characters, the minimum threshold
is already checked against character count rather than word count.
The hint text now reflects this by saying "approximately 150 characters"
instead of "approximately 150 words" for CJK content.
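
As a quick check of the escaped ranges named in the commit message above, the following sketch tests one sample character per range plus two non-CJK controls. The classifySamples helper and the sample labels are illustrative only.

```javascript
// Escaped form of the character class, as listed in the commit message.
const CJK_REGEX = /[\u3000-\u9FFF\uAC00-\uD7FF\uFF01-\uFF60]/;

// One sample per range, plus controls that should not match.
const samples = {
	ideographicSpace: '\u3000', // start of the first range
	hiragana: 'あ', // U+3042, inside \u3000-\u9FFF
	han: '漢', // U+6F22, inside \u3000-\u9FFF
	hangul: '한', // U+D55C, inside \uAC00-\uD7FF
	fullwidthExclamation: '\uFF01', // start of the third range
	latinLetter: 'a',
	asciiSpace: ' ',
};

// Hypothetical helper: map each sample label to whether it matches.
function classifySamples( input ) {
	const result = {};
	for ( const [ label, character ] of Object.entries( input ) ) {
		result[ label ] = CJK_REGEX.test( character );
	}
	return result;
}

const matches = classifySamples( samples );
console.log( matches );
```

Because the first range starts at U+3000, text consisting only of ideographic spaces is also detected as CJK, which may or may not be the intended behavior.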
@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

If you're merging code through a pull request on GitHub, copy and paste the following into the bottom of the merge commit message.

Co-authored-by: yogeshbhutkar <yogeshbhutkar@git.wordpress.org>
Co-authored-by: macayu17 <ayushhoff@git.wordpress.org>
Co-authored-by: Trushiv04 <trushiv@git.wordpress.org>
Co-authored-by: hbhalodia <hbhalodia@git.wordpress.org>
Co-authored-by: ekamran <ekamran@git.wordpress.org>
Co-authored-by: ishitaj34 <ishitaj34@git.wordpress.org>
Co-authored-by: yusufhay <yusufmudagal@git.wordpress.org>
Co-authored-by: jeffpaul <jeffpaul@git.wordpress.org>
Co-authored-by: prasadkarmalkar <prasadkarmalkar@git.wordpress.org>
Co-authored-by: the-hercules <thehercules@git.wordpress.org>
Co-authored-by: Infinite-Null <ankitkumarshah@git.wordpress.org>
Co-authored-by: t-hamano <wildworks@git.wordpress.org>
Co-authored-by: i-anubhav-anand <anubhav24@git.wordpress.org>

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

dkotter commented Jun 15, 2026

Collaborator

One, this is opened against trunk, not develop. Two, this seems to just be a duplicate of #716, which we already asked to be closed as it was itself a duplicate. As such, closing this out, but let me know if I'm missing something here.

@dkotter dkotter closed this Jun 15, 2026