HTML API: Preserve rawtext contents of IFRAME, NOEMBED, NOFRAMES, and XMP when serializing #51
Closed
sirreal wants to merge 2 commits into
…izing. The serializer dropped the raw text contents of these elements, removing parsed document content across a serialize/re-parse cycle. Per the HTML fragment serialization algorithm, their contents are emitted literally, as is already done for SCRIPT and STYLE. Raw text cannot contain its own closing tag, so literal emission cannot terminate the element early when re-parsing. See #65372.
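The claim that raw text cannot contain its own closing tag follows from how a tokenizer ends a raw text element. A minimal Python sketch of that rule is below; it is a language-neutral illustration, not the WordPress implementation, and `rawtext_end` is a hypothetical helper name.

```python
import re


def rawtext_end(html: str, tag: str, start: int = 0) -> int:
    """Return the index where the closing tag of a raw text element begins, or -1."""
    # The closer is "</" + tag name (ASCII case-insensitive) followed by
    # whitespace, "/", or ">". Anything else (e.g. "</xmpx") is ordinary text.
    pattern = re.compile(r"</" + re.escape(tag) + r"[\t\n\f\r />]", re.IGNORECASE)
    match = pattern.search(html, start)
    return match.start() if match else -1


source = "<xmp>a </xmpx> b </XMP >c"
content_start = len("<xmp>")
end = rawtext_end(source, "xmp", content_start)
contents = source[content_start:end]
print(repr(contents))  # 'a </xmpx> b '

# Contents taken from a parse never contain a matching closer, so appending
# "</xmp>" after them cannot end the element early on re-parse.
print(rawtext_end(contents, "xmp"))  # -1
```

Because the scan stops at the first match, any text it returns is by construction free of a further match for the same tag.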
XMP contents are raw text in which character references are never decoded. Escaping them changed document contents across a serialize/re-parse cycle: `<xmp>1 < 2</xmp>` serialized as `<xmp>1 &lt; 2</xmp>`, which re-parses as the literal text "1 &lt; 2". XMP contents now serialize literally like the other raw text elements, following the HTML fragment serialization algorithm. See #65372.
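The serialization rule these commits rely on can be sketched as a choice made per text node, based on its parent element. The Python below is an illustration under assumptions: the function names are hypothetical, and the set lists only the six elements discussed in this PR (the specification's own list is, as far as I recall, slightly longer).

```python
# Parents whose text children are emitted literally: the four elements fixed
# in this PR plus SCRIPT and STYLE. Assumption: the spec's list has more entries.
LITERAL_TEXT_PARENTS = {"script", "style", "iframe", "noembed", "noframes", "xmp"}


def escape_text(text: str) -> str:
    # Escape "&" first so the ampersands produced below are not escaped again.
    return (
        text.replace("&", "&amp;")
        .replace("\u00a0", "&nbsp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
    )


def serialize_text_child(parent_tag: str, text: str) -> str:
    """Emit text literally under a raw text parent, escaped everywhere else."""
    if parent_tag.lower() in LITERAL_TEXT_PARENTS:
        return text
    return escape_text(text)


def serialize_element(tag: str, text: str) -> str:
    return f"<{tag}>{serialize_text_child(tag, text)}</{tag}>"


print(serialize_element("xmp", "1 < 2 & more"))  # <xmp>1 < 2 & more</xmp>
print(serialize_element("p", "1 < 2 & more"))    # <p>1 &lt; 2 &amp; more</p>
```

Escaping still applies to ordinary elements such as `P`; only the raw text parents bypass it.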
sirreal added a commit that referenced this pull request on Jun 11, 2026
# Conflicts: # tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
The HTML Processor's serializer destroys the contents of four rawtext elements, so a serialize/re-parse cycle produces a different document than the one parsed:

- `IFRAME`, `NOEMBED`, `NOFRAMES`: contents are dropped entirely. `<iframe>x</iframe>y` normalizes to `<iframe></iframe>y`, removing the text node from the tree.
- `XMP`: contents are escaped with `htmlspecialchars()`. XMP is rawtext (character references are never decoded on parse), so `<xmp>1 < 2 & more</xmp>` normalizes to `<xmp>1 &lt; 2 &amp; more</xmp>`, which re-parses as the literal text `1 &lt; 2 &amp; more`.

Browsers keep these contents in the DOM and serialize them literally, per the HTML fragment serialization algorithm.
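The XMP failure described above can be reproduced outside WordPress. In this Python sketch, `html.escape` stands in for `htmlspecialchars()` and `parse_xmp_text` is a hypothetical stand-in for a parser reading XMP contents without decoding character references; neither is the HTML API itself.

```python
import html
import re


def parse_xmp_text(markup: str) -> str:
    """Hypothetical helper: return XMP contents as parsed, with no entity decoding."""
    match = re.fullmatch(r"<xmp>(.*?)</xmp>", markup, re.DOTALL)
    if match is None:
        raise ValueError("expected a single <xmp> element")
    return match.group(1)


original = "1 < 2 & more"

# Old behavior: escape the contents before wrapping them in the element.
escaped = f"<xmp>{html.escape(original, quote=False)}</xmp>"
# Patched behavior: emit the contents literally.
literal = f"<xmp>{original}</xmp>"

print(parse_xmp_text(escaped))  # 1 &lt; 2 &amp; more
print(parse_xmp_text(literal))  # 1 < 2 & more
```

Only the literal form returns the original text after the round trip, which is the reparse-fidelity property this patch restores.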
Compare in a browser:

- `<iframe>x</iframe>y`: the text node exists in the parsed tree, and `innerHTML` round-trips it.
- `<iframe></iframe>y`: current serializer output; the text node is gone.
- `<xmp>1 < 2 & more</xmp>` vs. current output `<xmp>1 &lt; 2 &amp; more</xmp>`: different text contents.
- `<html><frameset><noframes>x</noframes>`: NOFRAMES contents survive in frameset documents too.

This patch serializes all four literally, joining `SCRIPT` and `STYLE`. Output now matches PHP 8.4's WHATWG-compliant `Dom\HTMLDocument` byte-for-byte on the affected constructs (verified with ~30k differential fuzz inputs; the change fixes every reparse-fidelity failure in that run and introduces no new divergence).

Why literal emission is safe:
- Rawtext ends at the first `</tag` followed by whitespace, `/`, or `>`, so parser-derived contents cannot contain their own closing tag; the appended closer cannot match early. (`SCRIPT` may contain double-escaped closers, but re-parsing identical bytes follows the same tokenization that produced them.)
- Character references are never decoded in rawtext (`get_modifiable_text()` returns raw bytes, NUL → U+FFFD), so no re-encoding is needed or correct.
- `set_modifiable_text()` rejects these four elements, so only parser-derived text can ever reach the serializer.

For reviewers to ratify: if the blanking was intentional mXSS hardening (these contents are inert in HTML5 browsers but parse as live markup in HTML4-era parsers such as libxml2), this patch reverses it. I believe that hardening was illusory: `SCRIPT`/`STYLE`, which are strictly more dangerous, always passed through literally; no core code feeds serializer output to a legacy parser; and sanitization is kses's job, not the serializer's. The behavior also contradicted the feature's stated goal of `Dom\HTMLDocument` parity ([r59076]).

Two commits: the dropped-contents fix (`IFRAME`/`NOEMBED`/`NOFRAMES`) and the `XMP` escaping fix.

To test: `vendor/bin/phpunit --filter Tests_HtmlApi_WpHtmlProcessor_Serialize`

Trac ticket: …
Fixes #50
Use of AI Tools
AI assistance: Yes
Tool(s): Claude Code
Model(s): Claude Fable 5
Used for: Diagnosis, implementation, tests, and differential fuzz verification against `Dom\HTMLDocument`; reviewed and edited by me.

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.