782 improve docx importing with recursive text splitting to avoid excessively large cells by Luke-Bilhorn · Pull Request #894 · genesis-ai-dev/codex-editor

Luke-Bilhorn · 2026-04-21T03:03:57Z

Closes #782

There is more we could do here; that can come in a later update.

Testing Checklist

Confirm the "Advanced Settings" section appears for .docx imports, with an "Ideal cell length" box pre-filled to 160 and helper text ("Set to 0 to disable splitting").
Import a long multi-sentence paragraph (400+ chars) → it becomes multiple sentence-sized cells instead of one giant cell.
Import short paragraphs (under ~175 chars) → each stays as one cell, nothing split.
Read the new cells in order → text matches the original exactly, nothing lost or duplicated.
Confirm splitting only affects DOCX (the setting does not appear for other importers).

Ideal cell length setting

Set length to 80 vs 300 on the same file → 80 gives more/smaller cells, 300 gives fewer/larger cells.
Set length to 0 → splitting fully disabled, every paragraph stays one cell.
Box rejects negative/non-numeric input without breaking.
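
The input handling checked above could be sketched roughly as follows. This is only an illustration: `parseIdealCellLength` and the fall-back-to-default policy are assumptions made for this example, not the actual form code. Only the default of 160 and the "0 disables splitting" rule come from this PR.

```typescript
// Hypothetical helper; the name and the fallback policy are assumptions.
const DEFAULT_IDEAL_CELL_LENGTH = 160; // default stated in this PR

function parseIdealCellLength(raw: string): number {
    const trimmed = raw.trim();
    // Accept only plain non-negative integers; "-5", "abc", "1.5" and "" fall back to the default.
    if (!/^\d+$/.test(trimmed)) {
        return DEFAULT_IDEAL_CELL_LENGTH;
    }
    // A value of 0 is passed through and means "splitting disabled".
    return Number.parseInt(trimmed, 10);
}
```

Falling back to the default, rather than throwing, is one way to satisfy "rejects without breaking"; the real form may instead block the input or show a validation message.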

Split-point quality (pecking order)

Normal sentences → breaks land at sentence ends (. ! ?), not mid-sentence.
Run-on with commas but no periods → breaks land at commas/semicolons.
No punctuation at all → breaks land between words, never mid-word.
Numbers like 3.14159 are never split inside the number.
No awkwardly tiny leftover cells are produced.
Slightly-over-limit or comma-only paragraphs may stay as one cell — confirm this is expected, not a bug.

What should NOT split

Tables keep existing behavior (cells not run through the splitter) and still map correctly.
Empty paragraphs / empty table cells are preserved, not dropped.

Formatting

Bold/italic/colored text spanning a split → formatting preserved in each resulting cell.
Paragraph alignment/indent styling retained after splitting.
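
To make the "formatting spanning a split" case concrete, here is a minimal sketch of a run-aware cut. The `Run` shape and `splitRunsAtOffsets` are hypothetical names for this example; the real helpers in the DOCX importer may be structured quite differently. The idea shown is that a run cut in the middle is duplicated with the same formatting properties on both sides.

```typescript
// Hypothetical run shape; real DOCX runs carry more properties (font, size, etc.).
interface Run {
    text: string;
    bold?: boolean;
    italic?: boolean;
    color?: string;
}

// Splits runs at character offsets measured over the concatenated paragraph text.
// Assumes cutOffsets is sorted ascending. Cuts at 0 or at/after the end are ignored,
// and empty runs are dropped in this sketch.
function splitRunsAtOffsets(runs: Run[], cutOffsets: number[]): Run[][] {
    const segments: Run[][] = [[]];
    let cutIndex = 0;
    let pos = 0; // offset of the current run's first character in the paragraph
    for (const run of runs) {
        const runEnd = pos + run.text.length;
        let local = 0; // characters of this run already emitted
        while (cutIndex < cutOffsets.length && cutOffsets[cutIndex] < runEnd) {
            const cutLocal = cutOffsets[cutIndex] - pos;
            if (cutLocal > local) {
                // Left part of the run keeps every formatting property.
                segments[segments.length - 1].push({ ...run, text: run.text.slice(local, cutLocal) });
                local = cutLocal;
            }
            if (segments[segments.length - 1].length > 0) {
                segments.push([]); // start the next cell
            }
            cutIndex++;
        }
        if (local < run.text.length) {
            // Remainder (or the whole run) goes into the current cell.
            segments[segments.length - 1].push({ ...run, text: run.text.slice(local) });
        }
        pos = runEnd;
    }
    return segments;
}
```

For example, cutting `[{ text: "Hello ", bold: true }, { text: "world. Bye", italic: true }]` at offset 13 yields two cells, and the italic run appears in both, once as "world. " and once as "Bye".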

Round-trip export

Translate several split cells, then export → split pieces rejoin into the original single paragraph, in correct order.
Table-cell content still exports to the right place.
Export with some split cells left untranslated → result is coherent, no crash.
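
The recombination step exercised by these checks might look roughly like the sketch below. Only the `segmentIndex` and `segmentCount` fields come from this PR. The `paragraphId` key, the `SegmentCell` shape, the fallback to source text for untranslated pieces, and the single-space separator are all assumptions made for illustration.

```typescript
// Hypothetical cell shape; only segmentIndex/segmentCount are named in the PR.
interface SegmentCell {
    paragraphId: string;     // assumed key tying a cell back to its original <w:p>
    segmentIndex?: number;   // optional: unsplit cells omit it
    segmentCount?: number;
    source: string;
    translation?: string;
}

// Groups cells by paragraph, orders them by segmentIndex, and joins the pieces.
function recombineParagraphs(cells: SegmentCell[], separator: string = " "): Map<string, string> {
    const groups = new Map<string, SegmentCell[]>();
    for (const cell of cells) {
        const group = groups.get(cell.paragraphId) ?? [];
        group.push(cell);
        groups.set(cell.paragraphId, group);
    }
    const result = new Map<string, string>();
    groups.forEach((group, paragraphId) => {
        const ordered = group
            .slice()
            .sort((a, b) => (a.segmentIndex ?? 0) - (b.segmentIndex ?? 0));
        const pieces = ordered.map((cell) => {
            // Assumption: an untranslated piece falls back to its source text.
            const hasTranslation = cell.translation !== undefined && cell.translation.trim() !== "";
            return (hasTranslation ? (cell.translation as string) : cell.source).trim();
        });
        result.set(paragraphId, pieces.join(separator));
    });
    return result;
}
```

A space separator would be wrong for scripts written without spaces, so a real exporter would likely need to choose the separator per language or preserve the original whitespace at each cut.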

Language / locale awareness

English source: abbreviations like "Mr. Smith" don't break awkwardly when avoidable.
No-space scripts (Thai/Japanese/Chinese) source: long paragraphs still split into reasonable chunks.
Unusual/unsupported language tag → import still works (quiet fallback, no error).

Downstream usability

Each split cell can be drafted independently (AI/manual).
Each split cell can be validated independently.

Stability

Close and reopen the imported notebook → cell structure consistent.
Large .docx (many long paragraphs) imports without freezing or performance regression.

Regression

No console errors during import; progress reaches 100% and the notebook opens.
A normal short-paragraph document imports exactly as before.

Paragraphs in imported DOCX files are now recursively split into translator-friendly cells. A new general-purpose text splitter (NewSourceUploader/utils/textSplitter.ts) takes a plain string and an ideal cell length, then bisects at the boundary nearest the midpoint, trying three tiers in order: sentence ends (L1), sub-sentence stops like commas/dashes/ellipses (L2), and whitespace as a last resort (L3). Each tier has its own length threshold (×1.1 / ×1.5 / ×2.4) and a minimum side-length guard (×0.3) to prevent fragments. Multilingual punctuation is supported across Latin, CJK, Arabic, Urdu, Devanagari, Ethiopic, and several other scripts.

The DOCX importer uses this splitter via run-aware helpers that preserve inline formatting (bold, italic, font, color, etc.) even when a split falls mid-run. Cell metadata now contains the optional fields segmentIndex and segmentCount; split cells carry them so the round-trip exporter can recombine translated segments in order before writing them back to the original <w:p>, keeping the output DOCX structurally identical to the input.

The ideal cell length defaults to 160 characters and is user-adjustable via a collapsible Advanced Settings panel in the DOCX import UI.
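
As a rough illustration of the tiered midpoint bisection described above, here is a minimal sketch. It is not the code in textSplitter.ts: `splitRecursively` and `candidateOffsets` are made-up names, the regexes are simplified Latin-only stand-ins for the multilingual punctuation sets, and the reading that a tier is only tried once the text exceeds ideal length × that tier's threshold is an assumption. Only the multipliers (1.1, 1.5, 2.4, 0.3) are taken from the description.

```typescript
// Tier multipliers are from the PR description; the regexes are simplified stand-ins.
const TIERS: { boundary: RegExp; threshold: number }[] = [
    { boundary: /[.!?]+\s+/, threshold: 1.1 }, // L1: sentence ends
    { boundary: /[,;:]\s+/, threshold: 1.5 },  // L2: sub-sentence stops
    { boundary: /\s+/, threshold: 2.4 },       // L3: whitespace
];
const MIN_SIDE_FACTOR = 0.3;

// Offsets just past each boundary match, i.e. where the right-hand piece would start.
function candidateOffsets(text: string, boundary: RegExp): number[] {
    const re = new RegExp(boundary.source, "g");
    const offsets: number[] = [];
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
        offsets.push(m.index + m[0].length);
    }
    return offsets;
}

function splitRecursively(text: string, idealLength: number): string[] {
    if (idealLength <= 0) {
        return [text]; // 0 disables splitting
    }
    const minSide = idealLength * MIN_SIDE_FACTOR;
    const mid = text.length / 2;
    for (const tier of TIERS) {
        // Assumed rule: a tier applies only when the text exceeds its threshold.
        if (text.length <= idealLength * tier.threshold) {
            continue;
        }
        let best = -1;
        for (const offset of candidateOffsets(text, tier.boundary)) {
            // Min-side guard: skip cuts that would leave a tiny fragment on either side.
            if (offset < minSide || text.length - offset < minSide) {
                continue;
            }
            if (best === -1 || Math.abs(offset - mid) < Math.abs(best - mid)) {
                best = offset;
            }
        }
        if (best !== -1) {
            // Bisect at the candidate nearest the midpoint and recurse on both halves.
            return splitRecursively(text.slice(0, best), idealLength)
                .concat(splitRecursively(text.slice(best), idealLength));
        }
    }
    return [text]; // no acceptable boundary: keep the text as one cell
}
```

Because the pieces are plain slices of the input, joining them reproduces the original text exactly. Under these assumptions, with an ideal length of 160 a paragraph needs more than 176 characters before a sentence-end split is considered, more than 240 for a comma split, and more than 384 for a whitespace split, which would explain why slightly-over-limit or comma-only paragraphs can stay as one cell.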

Adds an `advancedSettings` prop slot to UnifiedImporterForm, rendered as a collapsible panel below the file-selection card, and uses it from the DOCX importer to let the user override the ideal cell length used by the recursive paragraph splitter. Defaults to 160 characters. Ports the intent of the old experiment-layout "Made Ideal Segment Length Button visible" commit onto the new UnifiedImporterForm-based DOCX form.

Made-with: Cursor

…xt-splitting-to-avoid-excessively-large-cells

When a BCP-47 locale is supplied (e.g. derived from <w:lang> on DOCX runs), the recursive splitter now sources sentence (L1) and word (L3) split candidates from Intl.Segmenter, falling back to the existing regex when the locale is missing or unsupported. Sub-sentence stops (L2) remain regex-only because Intl has no clause granularity. Thresholds, midpoint preference, and min-side guard are unchanged. The DOCX importer picks each paragraph's dominant run-level lang (weighted by content length) and passes it through, giving correct word boundaries for scripts without space-separated words (Thai, Khmer, Lao, Myanmar, CJK) and smarter sentence detection. Uses an opaque SegmenterHandle type rather than Intl.Segmenter directly, so the file compiles under both the webview tsconfig (Vite) and the root tsconfig (webpack/ts-loader) which lacks ES2022.Intl.
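
The Intl.Segmenter-with-fallback approach described above could be sketched as follows. This is an assumption-laden illustration, not the PR's code: `getSegmenter`, `boundaryOffsets`, and the exact shape of `SegmenterHandle` are invented here. It does follow the same idea of typing the segmenter opaquely so the file compiles even when the TypeScript lib lacks ES2022.Intl typings.

```typescript
// Opaque shapes so no Intl.Segmenter typings are required at compile time.
interface SegmentLike {
    segment: string;
    index: number;
}
interface SegmenterHandle {
    segment(input: string): Iterable<SegmentLike>;
}
type Granularity = "sentence" | "word";

// Returns a segmenter, or undefined when the locale is missing, the runtime has
// no Intl.Segmenter, or the constructor rejects the tag (e.g. a malformed tag).
function getSegmenter(locale: string | undefined, granularity: Granularity): SegmenterHandle | undefined {
    if (!locale) {
        return undefined;
    }
    const ctor = (Intl as unknown as {
        Segmenter?: new (locale: string, options: { granularity: Granularity }) => SegmenterHandle;
    }).Segmenter;
    if (!ctor) {
        return undefined;
    }
    try {
        return new ctor(locale, { granularity });
    } catch {
        return undefined; // quiet fallback, no error surfaced to the user
    }
}

// Candidate cut offsets: starts of segments when a segmenter is available,
// otherwise the ends of regex matches.
function boundaryOffsets(
    text: string,
    locale: string | undefined,
    granularity: Granularity,
    fallback: RegExp
): number[] {
    const offsets: number[] = [];
    const segmenter = getSegmenter(locale, granularity);
    if (segmenter) {
        for (const seg of segmenter.segment(text)) {
            if (seg.index > 0) {
                offsets.push(seg.index);
            }
        }
        return offsets;
    }
    const re = new RegExp(fallback.source, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
        if (m[0].length === 0) {
            re.lastIndex++; // guard against zero-length matches looping forever
            continue;
        }
        const end = m.index + m[0].length;
        if (end < text.length) {
            offsets.push(end);
        }
    }
    return offsets;
}
```

With word granularity and a Thai or Japanese locale, the segmenter supplies word starts where a whitespace regex would find nothing, which is the case the commit message calls out. The thresholds, midpoint preference, and min-side guard would then be applied to these offsets exactly as with regex candidates.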

Replaces the previous w:lang sniffing in the DOCX importer with the project's source language tag (from metadata.json), threaded through the wizard context. This gives Intl.Segmenter a stable, predictable locale for L1/L3 splitting that does not depend on whether the source DOCX (e.g. Google Docs exports) carries w:lang attributes.

- Provider now reads metadata.json and includes sourceLanguageTag in the projectInventory message.
- WizardContext / WizardState carry sourceLanguageTag through to importer components.
- DocxImporterForm reads it from wizardContext and passes it as a locale option to parseFile.
- parseFile / createCellsFromDocx accept the locale and forward it to splitTextIntoRanges; pickParagraphLocale (the w:lang tally helper) is removed.

…xt-splitting-to-avoid-excessively-large-cells

LeviXIII

UI options are great (Advanced options) and work well.

Just a note on tables: they will create empty cells in the source. This is the same behaviour as before. When exporting on a roundtrip, everything works as it should and the tables are still in place (without translation as they are ignored). Things are working as they did originally and the cell splits don't reflect in the exported file.

Luke-Bilhorn linked an issue Apr 21, 2026 that may be closed by this pull request

Improve docx importing with recursive text splitting to avoid excessively large cells #782

Closed

Luke-Bilhorn added 4 commits April 20, 2026 22:21

Added notes on splitting choices.

359d8b0

Changed "targetCellLength" to "idealCellLength"

ddb5734

Luke-Bilhorn force-pushed the 782-improve-docx-importing-with-recursive-text-splitting-to-avoid-excessively-large-cells branch from 473140b to f375caf Compare April 21, 2026 04:13

Luke-Bilhorn and others added 7 commits April 29, 2026 14:49

Merge branch 'main' into 782-improve-docx-importing-with-recursive-te…

7f5b29f

…xt-splitting-to-avoid-excessively-large-cells

Merge branch 'main' into 782-improve-docx-importing-with-recursive-te…

cc94efa

…xt-splitting-to-avoid-excessively-large-cells

Merge branch 'main' into 782-improve-docx-importing-with-recursive-te…

0a42a2b

…xt-splitting-to-avoid-excessively-large-cells

Merge branch 'main' into 782-improve-docx-importing-with-recursive-te…

e584050

…xt-splitting-to-avoid-excessively-large-cells

Merge branch 'main' into 782-improve-docx-importing-with-recursive-te…

1423ed1

…xt-splitting-to-avoid-excessively-large-cells

LeviXIII approved these changes Jun 29, 2026

LeviXIII merged commit 9bffc79 into main Jun 29, 2026
5 of 6 checks passed

LeviXIII deleted the 782-improve-docx-importing-with-recursive-text-splitting-to-avoid-excessively-large-cells branch June 29, 2026 15:36

LeviXIII merged 11 commits into main from 782-improve-docx-importing-with-recursive-text-splitting-to-avoid-excessively-large-cells
