HTML API: Preserve rawtext contents of IFRAME, NOEMBED, NOFRAMES, and XMP when serializing by sirreal · Pull Request #51 · sirreal/wordpress-develop

sirreal · 2026-06-11T16:02:14Z

The HTML Processor's serializer destroys the contents of four rawtext elements, so a serialize/re-parse cycle produces a different document than the one parsed:

IFRAME, NOEMBED, NOFRAMES: contents are dropped entirely. `<iframe>x</iframe>y` normalizes to `<iframe></iframe>y`, removing the text node from the tree.
XMP: contents are escaped with htmlspecialchars(). XMP is rawtext, so character references are never decoded on parse. As a result, `<xmp>1 < 2 & more</xmp>` normalizes to `<xmp>1 &lt; 2 &amp; more</xmp>`, which re-parses as the literal text `1 &lt; 2 &amp; more`.

Browsers keep these contents in the DOM and serialize them literally, per the HTML fragment serialization algorithm:

If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data literally.
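
The quoted rule can be modeled in a few lines. This is a minimal Python sketch for illustration only, not the WordPress serializer: the names `LITERAL_TEXT_PARENTS`, `serialize_text`, and `serialize_element` are hypothetical, NOSCRIPT is left out because it depends on the scripting flag, and the escaping branch is a simplification of the spec's text escaping.

```python
from html import escape

# Parents whose text children are appended literally, per the quoted rule.
# NOSCRIPT is omitted: it only qualifies when scripting is enabled.
LITERAL_TEXT_PARENTS = {
    "style", "script", "xmp", "iframe", "noembed", "noframes", "plaintext",
}


def serialize_text(parent_tag: str, data: str) -> str:
    """Serialize one text node, given the tag name of its parent element."""
    if parent_tag.lower() in LITERAL_TEXT_PARENTS:
        return data  # literal: no escaping at all
    # Ordinary text: escape &, <, > and U+00A0 (quotes are left alone).
    return escape(data, quote=False).replace("\u00a0", "&nbsp;")


def serialize_element(tag: str, data: str) -> str:
    """Serialize an element that has a single text child (hypothetical helper)."""
    return f"<{tag}>{serialize_text(tag, data)}</{tag}>"


print(serialize_element("xmp", "1 < 2 & more"))
print(serialize_element("p", "1 < 2 & more"))
```

In this model the XMP text comes back unchanged, while the same text under P is escaped; the behavior this PR removes amounts to taking the escaping branch for XMP.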

Compare in a browser:

`<iframe>x</iframe>y`: the text node exists in the parsed tree, and innerHTML round-trips it.
`<iframe></iframe>y`: current serializer output; the text node is gone.
`<xmp>1 < 2 & more</xmp>` vs. current output `<xmp>1 &lt; 2 &amp; more</xmp>`: different text contents.
`<html><frameset><noframes>x</noframes>`: NOFRAMES contents survive in frameset documents too.

This patch serializes all four literally, joining SCRIPT and STYLE. Output now matches PHP 8.4's WHATWG-compliant Dom\HTMLDocument byte-for-byte on the affected constructs (verified with ~30k differential fuzz inputs; the change fixes every reparse-fidelity failure in that run and introduces no new divergence).

Why literal emission is safe:

Rawtext scanning terminates at the first </tag followed by whitespace, /, or >, so parser-derived contents cannot contain their own closing tag; the appended closer cannot match early. (SCRIPT may contain double-escaped closers, but re-parsing identical bytes follows the same tokenization that produced them.)
Character references are never decoded in these contents (get_modifiable_text() returns raw bytes, NUL → U+FFFD), so no re-encoding is needed or correct.
set_modifiable_text() rejects these four elements, so only parser-derived text can ever reach the serializer.
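
The first point can be checked with a small model. The sketch below is Python for illustration, not the HTML API's tokenizer: `rawtext_contents`, `round_trips`, and `count_round_trip_failures` are hypothetical names, and SCRIPT's double-escaped states are deliberately not modeled. It scans for the first `</tag` followed by whitespace, `/`, or `>`, then confirms that contents derived that way survive having the closer appended and being scanned again.

```python
import random
import re


def rawtext_contents(after_open_tag: str, tag: str) -> str:
    """Return the rawtext contents: everything before the first closer.

    A closer is "</" + tag name (ASCII case-insensitive) followed by
    whitespace, "/", or ">". With no closer, contents run to end of input.
    """
    closer = re.compile("</" + re.escape(tag) + r"[\t\n\f\r />]", re.IGNORECASE)
    match = closer.search(after_open_tag)
    return after_open_tag if match is None else after_open_tag[: match.start()]


def round_trips(contents: str, tag: str) -> bool:
    """Emit contents literally plus the closer, re-scan, and compare."""
    serialized = contents + f"</{tag}>"
    return rawtext_contents(serialized, tag) == contents


def count_round_trip_failures(tag: str, samples: int = 2000, seed: int = 0) -> int:
    """Scan random noise and count parser-derived contents that fail to round-trip."""
    rng = random.Random(seed)
    alphabet = "</> &\n" + tag + tag.upper()
    failures = 0
    for _ in range(samples):
        noise = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 24)))
        derived = rawtext_contents(noise, tag)  # what a parser would hand back
        if not round_trips(derived, tag):
            failures += 1
    return failures


print(count_round_trip_failures("xmp"))
```

Text that did not come from the parser can break this property: `round_trips("x</iframe>y", "iframe")` is false in this model, which is why the third point (set_modifiable_text() rejecting these elements) matters.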

For reviewers to ratify: if the blanking was intentional mXSS hardening (these contents are inert in HTML5 browsers but parse as live markup in HTML4-era parsers such as libxml2), this patch reverses it. I believe that hardening was illusory: SCRIPT/STYLE — strictly more dangerous — always passed through literally, no core code feeds serializer output to a legacy parser, and sanitization is kses's job, not the serializer's. The behavior also contradicted the feature's stated goal of Dom\HTMLDocument parity ([r59076]).

Two commits: the dropped-contents fix (IFRAME/NOEMBED/NOFRAMES) and the XMP escaping fix.

To test: vendor/bin/phpunit --filter Tests_HtmlApi_WpHtmlProcessor_Serialize

Trac ticket: …

Fixes #50

Use of AI Tools

AI assistance: Yes
Tool(s): Claude Code
Model(s): Claude Fable 5
Used for: Diagnosis, implementation, tests, and differential fuzz verification against Dom\HTMLDocument; reviewed and edited by me.

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

…izing. The serializer dropped the raw text contents of these elements, removing parsed document content across a serialize/re-parse cycle. Per the HTML fragment serialization algorithm, their contents are emitted literally, as is already done for SCRIPT and STYLE. Raw text cannot contain its own closing tag, so literal emission cannot terminate the element early when re-parsing. See #65372.

XMP contents are raw text in which character references are never decoded. Escaping them changed document contents across a serialize/re-parse cycle: `<xmp>1 < 2</xmp>` serialized as `<xmp>1 &lt; 2</xmp>`, which re-parses as the literal text "1 &lt; 2". XMP contents now serialize literally like the other raw text elements, following the HTML fragment serialization algorithm. See #65372.

github-actions · 2026-06-11T16:02:32Z

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props desrosj, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

# Conflicts: # tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php

sirreal · 2026-06-12T11:55:36Z

Closing in favor of #54 and #55 (separating concerns)

sirreal added 2 commits June 11, 2026 17:37

sirreal marked this pull request as ready for review June 11, 2026 16:02

sirreal added a commit that referenced this pull request Jun 11, 2026

Merge PR #51: preserve rawtext contents

ce08149

# Conflicts: # tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php

sirreal closed this Jun 12, 2026

sirreal wants to merge 2 commits into trunk from html-api-normalize-restore-missing-text-content
