782 improve docx importing with recursive text splitting to avoid excessively large cells #894
Merged
LeviXIII merged 11 commits on Jun 29, 2026
Paragraphs in imported DOCX files are now recursively split into translator-friendly cells.

A new general-purpose text splitter (NewSourceUploader/utils/textSplitter.ts) takes a plain string and an ideal cell length, then bisects at the boundary nearest the midpoint. Boundaries are tried in three tiers: sentence ends (L1), sub-sentence stops such as commas, dashes, and ellipses (L2), and whitespace as a last resort (L3). Each tier has its own length threshold (×1.1 / ×1.5 / ×2.4) and a minimum side-length guard (×0.3) to prevent fragments. Multilingual punctuation is supported across Latin, CJK, Arabic, Urdu, Devanagari, Ethiopic, and several other scripts.

The DOCX importer uses this splitter via run-aware helpers that preserve inline formatting (bold, italic, font, color, etc.) even when a split falls mid-run. Cell metadata gains two optional fields, segmentIndex and segmentCount, so the round-trip exporter can recombine translated segments in order before writing them back to the original <w:p>, keeping the output DOCX structurally identical to the input.

The ideal cell length defaults to 160 characters and is user-adjustable via a collapsible Advanced Settings panel in the DOCX import UI.
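To make the tiered bisection concrete, here is a minimal sketch. It is not the code in textSplitter.ts: the names `TIERS`, `findCut`, and `splitText`, the simplified regexes, and the reading of each threshold as "this tier is only tried when the text is longer than ideal × threshold" are all assumptions made for illustration. The `0` shortcut follows the helper text quoted in the testing checklist.

```typescript
// Hypothetical sketch; names, regexes, and threshold semantics are assumptions.
type Tier = { name: "L1" | "L2" | "L3"; boundary: RegExp; threshold: number };

const MIN_SIDE_FACTOR = 0.3;

// Latin stops must be followed by whitespace, so "3.14159" and "1,000"
// never produce a split candidate inside the number.
const TIERS: Tier[] = [
  { name: "L1", boundary: /[.!?]+\s+|[。！？]+\s*/g, threshold: 1.1 },
  { name: "L2", boundary: /[,;:]\s+|[，、；：\u2014\u2026]\s*/g, threshold: 1.5 },
  { name: "L3", boundary: /\s+/g, threshold: 2.4 },
];

// Returns the cut offset nearest the midpoint, or null if no tier applies.
function findCut(text: string, ideal: number): number | null {
  const mid = text.length / 2;
  const minSide = ideal * MIN_SIDE_FACTOR;
  for (const tier of TIERS) {
    if (text.length <= ideal * tier.threshold) continue; // tier not triggered
    const re = new RegExp(tier.boundary.source, "g"); // fresh lastIndex
    let best: number | null = null;
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      const cut = m.index + m[0].length; // cut after the boundary match
      if (cut < minSide || text.length - cut < minSide) continue; // fragment guard
      if (best === null || Math.abs(cut - mid) < Math.abs(best - mid)) best = cut;
    }
    if (best !== null) return best;
  }
  return null;
}

// Recursively bisects until no tier yields an acceptable cut.
function splitText(text: string, ideal: number = 160): string[] {
  if (ideal <= 0) return [text]; // 0 disables splitting
  const cut = findCut(text, ideal);
  if (cut === null) return [text];
  return [
    ...splitText(text.slice(0, cut), ideal),
    ...splitText(text.slice(cut), ideal),
  ];
}
```

Because each cut keeps the boundary characters on the left-hand piece, joining the pieces reproduces the input exactly, which is the property a round-trip exporter would rely on.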
Adds an `advancedSettings` prop slot to UnifiedImporterForm, rendered as a collapsible panel below the file-selection card, and uses it from the DOCX importer to let the user override the ideal cell length used by the recursive paragraph splitter. Defaults to 160 characters.

Ports the intent of the old experiment-layout "Made Ideal Segment Length Button visible" commit onto the new UnifiedImporterForm-based DOCX form.

Made-with: Cursor
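As a rough illustration of how the panel's text input might be turned into the value passed to the splitter, the sketch below normalizes the raw field. `resolveIdealCellLength` is a hypothetical helper, not a function from this PR; falling back to the default for blank, non-numeric, or negative input is an assumption, while the 160 default and "0 disables splitting" come from the description and the checklist helper text.

```typescript
// Hypothetical helper; fallback rules for invalid input are assumptions.
const DEFAULT_IDEAL_CELL_LENGTH = 160;

function resolveIdealCellLength(raw: string | undefined): number {
  if (raw === undefined || raw.trim() === "") return DEFAULT_IDEAL_CELL_LENGTH;
  const n = Number(raw);
  // Reject NaN, Infinity, and negatives; 0 is kept because it disables splitting.
  if (!Number.isFinite(n) || n < 0) return DEFAULT_IDEAL_CELL_LENGTH;
  return Math.floor(n);
}
```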
473140b to f375caf
…xt-splitting-to-avoid-excessively-large-cells
When a BCP-47 locale is supplied (e.g. derived from <w:lang> on DOCX runs), the recursive splitter now sources sentence (L1) and word (L3) split candidates from Intl.Segmenter, falling back to the existing regex when the locale is missing or unsupported. Sub-sentence stops (L2) remain regex-only because Intl has no clause granularity. Thresholds, midpoint preference, and min-side guard are unchanged. The DOCX importer picks each paragraph's dominant run-level lang (weighted by content length) and passes it through, giving correct word boundaries for scripts without space-separated words (Thai, Khmer, Lao, Myanmar, CJK) and smarter sentence detection. Uses an opaque SegmenterHandle type rather than Intl.Segmenter directly, so the file compiles under both the webview tsconfig (Vite) and the root tsconfig (webpack/ts-loader) which lacks ES2022.Intl.
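The fallback behaviour described above can be sketched as follows. This is an illustration only: `SegmenterLike`, `makeSegmenter`, and `boundaryCuts` are hypothetical names, the fallback regexes are simplified, and the structural types merely imitate the idea of an opaque handle so the snippet compiles without the ES2022.Intl lib. `Intl.Segmenter` itself is the standard API; a malformed tag makes its constructor throw a RangeError, which the sketch treats as "use the regex".

```typescript
// Structural stand-ins (assumed names) so this compiles without lib ES2022.Intl.
type SegmentLike = { segment: string; index: number };
type SegmenterLike = { segment(input: string): Iterable<SegmentLike> };
type Granularity = "sentence" | "word";
type SegmenterCtor = new (
  locale: string,
  options: { granularity: Granularity }
) => SegmenterLike;

// Returns null when no locale is given, the runtime lacks Intl.Segmenter,
// or the tag is malformed.
function makeSegmenter(locale: string | undefined, granularity: Granularity): SegmenterLike | null {
  const ctor = (Intl as unknown as { Segmenter?: SegmenterCtor }).Segmenter;
  if (!locale || !ctor) return null;
  try {
    return new ctor(locale, { granularity });
  } catch {
    return null;
  }
}

// Collects candidate cut offsets (segment ends), excluding the end of the text.
function boundaryCuts(text: string, granularity: Granularity, locale?: string): number[] {
  const cuts: number[] = [];
  const segmenter = makeSegmenter(locale, granularity);
  if (segmenter) {
    for (const s of segmenter.segment(text)) {
      const end = s.index + s.segment.length;
      if (end < text.length) cuts.push(end);
    }
    return cuts;
  }
  // Regex fallback (simplified for this sketch).
  const re = granularity === "sentence" ? /[.!?]+\s+/g : /\s+/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    const end = m.index + m[0].length;
    if (end < text.length) cuts.push(end);
  }
  return cuts;
}
```

With word granularity the segmenter reports every segment end, including those around punctuation and whitespace, so a real splitter would presumably still filter candidates through the midpoint preference and min-side guard.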
Replaces the previous w:lang sniffing in the DOCX importer with the project's source language tag (from metadata.json), threaded through the wizard context. This gives Intl.Segmenter a stable, predictable locale for L1/L3 splitting that does not depend on whether the source DOCX (e.g. Google Docs exports) carries w:lang attributes.

- Provider now reads metadata.json and includes sourceLanguageTag in the projectInventory message.
- WizardContext / WizardState carry sourceLanguageTag through to importer components.
- DocxImporterForm reads it from wizardContext and passes it as a locale option to parseFile.
- parseFile / createCellsFromDocx accept the locale and forward it to splitTextIntoRanges; pickParagraphLocale (the w:lang tally helper) is removed.
LeviXIII (Contributor) approved these changes on Jun 29, 2026 and left a comment:
UI options are great (Advanced options) and work well.
Just a note on tables: they will create empty cells in the source. This is the same behaviour as before. When exporting on a round trip, everything works as it should and the tables are still in place (untranslated, as they are ignored). Things work as they did originally, and the cell splits are not reflected in the exported file.
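The reason cell splits do not show up in the exported file is that segments are recombined per paragraph before being written back. A minimal sketch of that step, under stated assumptions: `SegmentCell`, its `paragraphId` key, `recombineByParagraph`, and the single-space separator are all hypothetical; only the segmentIndex and segmentCount field names come from the PR description.

```typescript
// Hypothetical cell shape; only segmentIndex/segmentCount are named in the PR.
type SegmentCell = {
  paragraphId: string; // assumed key linking a cell back to its <w:p>
  text: string; // translated text for this segment
  segmentIndex?: number;
  segmentCount?: number;
};

// Groups cells by paragraph, orders them by segmentIndex, and joins the text.
// Unsplit cells (no segmentIndex) are treated as a single segment at index 0.
function recombineByParagraph(cells: SegmentCell[], separator: string = " "): Map<string, string> {
  const groups = new Map<string, SegmentCell[]>();
  for (const cell of cells) {
    const group = groups.get(cell.paragraphId) ?? [];
    group.push(cell);
    groups.set(cell.paragraphId, group);
  }
  const result = new Map<string, string>();
  groups.forEach((group, id) => {
    const ordered = [...group].sort(
      (a, b) => (a.segmentIndex ?? 0) - (b.segmentIndex ?? 0)
    );
    result.set(id, ordered.map((c) => c.text.trim()).join(separator));
  });
  return result;
}
```

For scripts that do not separate sentences with spaces, a caller could pass an empty separator instead.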
Closes #782
More we could do here; can update later.
Testing Checklist
… .docx imports, with an "Ideal cell length" box pre-filled to 160 and helper text ("Set to 0 to disable splitting").
Ideal cell length setting
Split-point quality (pecking order)
… .!?), not mid-sentence.
… 3.14159 are never split inside the number.
What should NOT split
Formatting
Round-trip export
Language / locale awareness
Downstream usability
Stability
… .docx (many long paragraphs) imports without freezing or performance regression.
Regression