HTML API: Preserve rawtext contents of IFRAME, NOEMBED, NOFRAMES, and XMP when serializing #51

Closed
sirreal wants to merge 2 commits into trunk from html-api-normalize-restore-missing-text-content

@sirreal sirreal commented Jun 11, 2026

The HTML Processor's serializer destroys the contents of four rawtext elements, so a serialize/re-parse cycle produces a different document than the one parsed:

  • IFRAME, NOEMBED, NOFRAMES: contents are dropped entirely. <iframe>x</iframe>y normalizes to <iframe></iframe>y, removing the text node from the tree.
  • XMP: contents are escaped with htmlspecialchars(). XMP is rawtext — character references are never decoded on parse — so <xmp>1 < 2 &amp; more</xmp> normalizes to <xmp>1 &lt; 2 &amp;amp; more</xmp>, which re-parses as the literal text 1 &lt; 2 &amp;amp; more.
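The XMP case can be reproduced outside WordPress with a minimal sketch. It assumes Python's `html.escape` as a stand-in for `htmlspecialchars()`, and a hypothetical `parse_rawtext` helper that models the one relevant parser property: rawtext contents become the text node unchanged, with no character references decoded.

```python
import html


def parse_rawtext(contents: str) -> str:
    """Hypothetical model of rawtext parsing: contents are kept as-is,
    and character references are never decoded."""
    return contents


original = "1 < 2 &amp; more"  # contents of <xmp> as found in the source

# Stand-in for htmlspecialchars(): escapes &, <, and >.
escaped = html.escape(original, quote=False)

reparsed_after_escape = parse_rawtext(escaped)
reparsed_after_literal = parse_rawtext(original)

print(escaped)                             # 1 &lt; 2 &amp;amp; more
print(reparsed_after_escape == original)   # False: escaping changed the text
print(reparsed_after_literal == original)  # True: literal emission round-trips
```

Escaping is only lossless when the parser on the other side decodes what was encoded; for rawtext it does not, so the escaped bytes become the new contents.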

Browsers keep these contents in the DOM and serialize them literally, per the HTML fragment serialization algorithm:

If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data literally.
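The quoted rule for text nodes can be sketched as a small function. This is an illustration rather than the patch's code: the names `serialize_text`, `escape_text`, and `LITERAL_TEXT_PARENTS` are hypothetical, and the escaping branch follows the standard's text-mode escaping as I understand it (`&`, U+00A0, `<`, `>`).

```python
# Parents whose text children are appended literally, per the quoted rule.
LITERAL_TEXT_PARENTS = frozenset(
    {"style", "script", "xmp", "iframe", "noembed", "noframes", "plaintext"}
)


def escape_text(data: str) -> str:
    """Text-mode escaping; '&' must be replaced first."""
    return (
        data.replace("&", "&amp;")
        .replace("\u00a0", "&nbsp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


def serialize_text(parent_tag: str, data: str, scripting_enabled: bool = False) -> str:
    """Serialize one text node given the tag name of its parent element."""
    tag = parent_tag.lower()
    if tag in LITERAL_TEXT_PARENTS or (tag == "noscript" and scripting_enabled):
        return data  # append the node's data literally
    return escape_text(data)


print(serialize_text("XMP", "1 < 2 &amp; more"))  # 1 < 2 &amp; more
print(serialize_text("p", "1 < 2 & more"))        # 1 &lt; 2 &amp; more
```

The only decision point is the parent element, which is why the serializer fix amounts to adding the four element names to the literal branch.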

Compare in a browser:

This patch serializes all four literally, joining SCRIPT and STYLE. Output now matches PHP 8.4's WHATWG-compliant Dom\HTMLDocument byte-for-byte on the affected constructs (verified with ~30k differential fuzz inputs; the change fixes every reparse-fidelity failure in that run and introduces no new divergence).

Why literal emission is safe:

  • Rawtext scanning terminates at the first </tag followed by whitespace, /, or >, so parser-derived contents cannot contain their own closing tag; the appended closer cannot match early. (SCRIPT may contain double-escaped closers, but re-parsing identical bytes follows the same tokenization that produced them.)
  • Character references are never decoded in these contents (get_modifiable_text() returns raw bytes, NUL → U+FFFD), so no re-encoding is needed or correct.
  • set_modifiable_text() rejects these four elements, so only parser-derived text can ever reach the serializer.
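The first point, that parser-derived contents cannot contain their own closer, can be checked with a simplified scanner. This is a sketch under stated assumptions, not the HTML Processor's tokenizer: `closer_pattern`, `scan_rawtext`, and `serialize_rawtext` are hypothetical names, the closing tag is assumed to carry no attributes containing `>`, and an unclosed element simply treats the rest of the input as contents.

```python
import re


def closer_pattern(tag: str) -> re.Pattern:
    """Match '</tag' followed by HTML whitespace, '/', or '>' (case-insensitive)."""
    return re.compile(r"</" + re.escape(tag) + r"[\t\n\f\r />]", re.IGNORECASE)


def scan_rawtext(after_open_tag: str, tag: str) -> tuple[str, str]:
    """Split input positioned just after the opening tag into (contents, rest)."""
    match = closer_pattern(tag).search(after_open_tag)
    if match is None:
        return after_open_tag, ""  # simplification: unclosed element
    contents = after_open_tag[: match.start()]
    tag_end = after_open_tag.find(">", match.end() - 1)
    rest = "" if tag_end == -1 else after_open_tag[tag_end + 1 :]
    return contents, rest


def serialize_rawtext(tag: str, contents: str) -> str:
    """Emit the contents literally between an opening and a closing tag."""
    return f"<{tag}>{contents}</{tag}>"


# '</iframes>' is not a closer ('s' follows the tag name); '</IFRAME >' is.
contents, rest = scan_rawtext("a </iframes> b </IFRAME >c", "iframe")
serialized = serialize_rawtext("iframe", contents)
reparsed, _ = scan_rawtext(serialized[len("<iframe>"):], "iframe")

print(repr(contents), repr(rest))                         # 'a </iframes> b ' 'c'
print(closer_pattern("iframe").search(contents) is None)  # True
print(reparsed == contents)                               # True
```

Because scanning stops at the first match, the contents it returns are by construction free of any match, so the closer appended by the serializer is the first one a re-parse will find.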

For reviewers to ratify: if the blanking was intentional mXSS hardening (these contents are inert in HTML5 browsers but parse as live markup in HTML4-era parsers such as libxml2), this patch reverses it. I believe that hardening was illusory: SCRIPT/STYLE — strictly more dangerous — always passed through literally, no core code feeds serializer output to a legacy parser, and sanitization is kses's job, not the serializer's. The behavior also contradicted the feature's stated goal of Dom\HTMLDocument parity ([r59076]).

Two commits: the dropped-contents fix (IFRAME/NOEMBED/NOFRAMES) and the XMP escaping fix.

To test: vendor/bin/phpunit --filter Tests_HtmlApi_WpHtmlProcessor_Serialize

Trac ticket: …

Fixes #50

Use of AI Tools

AI assistance: Yes
Tool(s): Claude Code
Model(s): Claude Fable 5
Used for: Diagnosis, implementation, tests, and differential fuzz verification against Dom\HTMLDocument; reviewed and edited by me.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 2 commits June 11, 2026 17:37
…izing.

The serializer dropped the raw text contents of these elements, removing
parsed document content across a serialize/re-parse cycle. Per the HTML
fragment serialization algorithm, their contents are emitted literally,
as is already done for SCRIPT and STYLE.

Raw text cannot contain its own closing tag, so literal emission cannot
terminate the element early when re-parsing.

See #65372.

XMP contents are raw text in which character references are never
decoded. Escaping them changed document contents across a
serialize/re-parse cycle: `<xmp>1 < 2</xmp>` serialized as
`<xmp>1 &lt; 2</xmp>`, which re-parses as the literal text "1 &lt; 2".

XMP contents now serialize literally like the other raw text elements,
following the HTML fragment serialization algorithm.

See #65372.
@sirreal sirreal marked this pull request as ready for review June 11, 2026 16:02
@github-actions

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props desrosj, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

sirreal added a commit that referenced this pull request Jun 11, 2026
# Conflicts:
#	tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
sirreal commented Jun 12, 2026

Closing in favor of #54 and #55 (separating concerns)

@sirreal sirreal closed this Jun 12, 2026

Development

Successfully merging this pull request may close these issues.

HTML API: Raw-Text Serialization Data Loss in IFRAME, NOEMBED, and NOFRAMES