Generated: 2026-06-11
Summary
The HTML API serializer currently drops raw-text contents for three HTML elements:
- IFRAME
- NOEMBED
- NOFRAMES
This is a WordPress HTML API serialization bug. The parser can build a tree containing text inside these elements, but WP_HTML_Processor::serialize_token() emits the element with an empty body. Parsing the serialized output therefore produces a different tree from the original parse.
The dominant failure class is normalize-tree-changed: the initial WordPress tree and the comparison tree agree, but the tree produced after WordPress serialization and reparse no longer contains the affected text node.
Impact In The Run
In the four-lane run snapshot:
- Total normalize-tree-changed rows: 233,188
- Rows whose first changed path is text under IFRAME, NOEMBED, or NOFRAMES: 230,697
- Distinct affected signatures: 77,174
- Distinct affected families: 116
Breakdown by affected element:
| Element | Rows | Distinct signatures | Distinct families |
|---|---|---|---|
| NOFRAMES | 109,375 | 26,408 | 94 |
| IFRAME | 61,209 | 25,430 | 95 |
| NOEMBED | 60,113 | 25,336 | 94 |
Breakdown by parser mode:
| Mode | Rows | Distinct signatures | Distinct families |
|---|---|---|---|
| fragment-body | 123,379 | 49,659 | 108 |
| full-document | 107,318 | 27,515 | 92 |
These counts show that this is not a single edge case. The affected rows account for 230,697 of the 233,188 normalize-tree-changed rows in the run, about 98.9%.
Root Cause
The relevant code is in src/wp-includes/html-api/class-wp-html-processor.php, in WP_HTML_Processor::serialize_token().
Current logic:
```php
if ( $in_html && in_array( $tag_name, array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'STYLE', 'TEXTAREA', 'TITLE', 'XMP' ), true ) ) {
	$text = $this->get_modifiable_text();

	switch ( $tag_name ) {
		case 'IFRAME':
		case 'NOEMBED':
		case 'NOFRAMES':
			$text = '';
			break;

		case 'SCRIPT':
		case 'STYLE':
		case 'XMP':
			break;

		default:
			$text = self::serialize_decoded_text( $text );
	}

	$html .= "{$text}</{$qualified_name}>";
}
```
The serializer recognizes all three affected tags as self-contained raw-text-like elements, obtains their modifiable text, then discards that text. The final serialized HTML keeps the tags but removes the content.
The neighboring SCRIPT, STYLE, and XMP cases preserve raw text. TEXTAREA and TITLE serialize escaped decoded text. IFRAME, NOEMBED, and NOFRAMES should not be blanked during serialization because their text is part of the parsed tree.
Why This Is A Serializer Bug
The failing pattern is:
- Parse input with WP_HTML_Processor.
- Render or inspect the tree. The raw-text node exists.
- Serialize or normalize through the HTML API.
- Parse the serialized output again.
- The raw-text node is gone.
The parser is not simply disagreeing with another implementation. The loss occurs across WordPress's own serialize and reparse cycle. That makes the output non-idempotent and changes document contents.
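The cycle above can be reduced to a small check. The sketch below is illustrative only: text_survives_normalization() is a hypothetical helper, not part of WordPress, and it takes the normalizer as a callable so that it runs without WordPress loaded. The $reported_behavior closure is a stand-in that reproduces the NOEMBED output reported in this document; it is not the real serializer.

```php
<?php
/**
 * Hypothetical helper: reports whether a marker string is still present
 * after one pass through the supplied normalizer.
 *
 * @param string   $html      Input HTML.
 * @param string   $marker    Substring expected to survive, e.g. '>x<'.
 * @param callable $normalize Function mapping HTML to normalized HTML or null.
 */
function text_survives_normalization( string $html, string $marker, callable $normalize ): bool {
	$normalized = $normalize( $html );

	// A null result means the normalizer could not process the input.
	if ( ! is_string( $normalized ) ) {
		return false;
	}

	return false !== strpos( $normalized, $marker );
}

// Stand-in for the reported behavior: empties the NOEMBED body.
$reported_behavior = static function ( string $html ): ?string {
	return preg_replace( '~(<noembed>).*?(</noembed>)~s', '$1$2', $html );
};

var_dump( text_survives_normalization( '<noembed>x</noembed>y', '>x<', $reported_behavior ) ); // bool(false)
```

In a WordPress environment that provides WP_HTML_Processor::normalize(), passing array( 'WP_HTML_Processor', 'normalize' ) as the callable would exercise the real serializer. Once the text is preserved, the same call should return true.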
Problematic Examples
The following examples are direct, standalone reductions. "Current output" is what the serializer emits today. "Expected output" preserves the parsed raw-text content while still applying normal full-document or fragment wrapping.
Case Matrix
| Case | Input | Current output | Expected output |
|---|---|---|---|
| IFRAME in a fragment | <iframe>x</iframe>y | <iframe></iframe>y | <iframe>x</iframe>y |
| IFRAME in a full document body | <iframe>x</iframe>y | <html><head></head><body><iframe></iframe>y</body></html> | <html><head></head><body><iframe>x</iframe>y</body></html> |
| NOEMBED in a fragment | <noembed>x</noembed>y | <noembed></noembed>y | <noembed>x</noembed>y |
| NOEMBED in a full document body | a<noembed>x</noembed> | <html><head></head><body>a<noembed></noembed></body></html> | <html><head></head><body>a<noembed>x</noembed></body></html> |
| NOFRAMES in a fragment | <section><noframes>x</noframes>y</section> | <section><noframes></noframes>y</section> | <section><noframes>x</noframes>y</section> |
| NOFRAMES in a full document body | a<noframes>x</noframes> | <html><head></head><body>a<noframes></noframes></body></html> | <html><head></head><body>a<noframes>x</noframes></body></html> |
| NOFRAMES in a full document frameset | <html><frameset><noframes>x</noframes> | <html><head></head><frameset><noframes></noframes></frameset></html> | <html><head></head><frameset><noframes>x</noframes></frameset></html> |
Adjacent-Token Variants
When another token follows the affected element, the first visible tree difference may appear as that following token moving into the comparison position after the text is removed. The root problem is still the missing raw-text payload.
| Case | Input | Current output | Expected output |
|---|---|---|---|
| IFRAME before a following comment in a full document | <h3><div><small><dd><iframe>x</iframe><!----> | <html><head></head><body><h3><div><small><dd><iframe></iframe><!----></dd></small></div></h3></body></html> | <html><head></head><body><h3><div><small><dd><iframe>x</iframe><!----></dd></small></div></h3></body></html> |
| NOFRAMES before a bogus comment in a fragment | <section><noframes>x</noframes><!> | <section><noframes></noframes><!----></section> | <section><noframes>x</noframes><!----></section> |
Behavioral Consequences
This bug can change content in several ways:
- Text fallback inside IFRAME is removed.
- Fallback content inside NOEMBED is removed.
- NOFRAMES content is removed both in ordinary body parsing and in frameset parsing.
- A later sibling token can appear as the first difference after reparse, because the expected text node disappeared.
- Normalization is not idempotent for affected documents and fragments.
For example, the NOEMBED fragment from the case matrix, <noembed>x</noembed>y, currently serializes to <noembed></noembed>y.
After reparse, the NOEMBED element no longer contains the text node "x". This is a semantic content change, not only a formatting difference.
Recommended Fix
Preserve the get_modifiable_text() value for IFRAME, NOEMBED, and NOFRAMES instead of forcing it to an empty string.
Minimal shape:
```php
switch ( $tag_name ) {
	case 'SCRIPT':
	case 'STYLE':
	case 'XMP':
	case 'IFRAME':
	case 'NOEMBED':
	case 'NOFRAMES':
		break;

	default:
		$text = self::serialize_decoded_text( $text );
}
```
This keeps the existing raw-text preservation behavior for SCRIPT, STYLE, and XMP, and extends it to the three elements currently losing content.
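To make the two text-handling categories concrete, here is a standalone sketch of the proposed switch. It is a model, not the WordPress source: sketch_special_element_text() is a hypothetical name, and htmlspecialchars() stands in for the private serialize_decoded_text() method, whose exact escaping is assumed rather than confirmed here.

```php
<?php
/**
 * Hypothetical model of the proposed branch: decides what text is emitted
 * between the opening and closing tags of a special element.
 */
function sketch_special_element_text( string $tag_name, string $text ): string {
	switch ( $tag_name ) {
		// Raw-text elements: emit the modifiable text unchanged.
		case 'SCRIPT':
		case 'STYLE':
		case 'XMP':
		case 'IFRAME':
		case 'NOEMBED':
		case 'NOFRAMES':
			return $text;

		// TEXTAREA and TITLE: emit escaped text.
		// Assumption: htmlspecialchars() approximates serialize_decoded_text().
		default:
			return htmlspecialchars( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
	}
}

echo sketch_special_element_text( 'NOEMBED', 'x' ), "\n"; // x
echo sketch_special_element_text( 'TITLE', 'a<b' ), "\n"; // a&lt;b
```

The model has no branch that can return an empty string for non-empty input, which is the property the fix is meant to restore for IFRAME, NOEMBED, and NOFRAMES.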
Suggested Regression Tests
Add focused serializer tests around the affected tags and modes. These tests should not depend on generated artifacts.
```php
/**
 * @dataProvider data_raw_text_elements_preserved_in_fragments
 */
public function test_raw_text_element_contents_are_preserved_in_fragments( string $html ): void {
	$this->assertSame( $html, WP_HTML_Processor::normalize( $html ) );
}

public static function data_raw_text_elements_preserved_in_fragments(): array {
	return array(
		'IFRAME fragment'   => array( '<iframe>x</iframe>y' ),
		'NOEMBED fragment'  => array( '<noembed>x</noembed>y' ),
		'NOFRAMES fragment' => array( '<section><noframes>x</noframes>y</section>' ),
	);
}

/**
 * @dataProvider data_raw_text_elements_preserved_in_full_documents
 */
public function test_raw_text_element_contents_are_preserved_in_full_documents( string $html, string $expected ): void {
	$processor = WP_HTML_Processor::create_full_parser( $html );

	$this->assertNotNull( $processor );
	$this->assertSame( $expected, $processor->serialize() );
}

public static function data_raw_text_elements_preserved_in_full_documents(): array {
	return array(
		'IFRAME body'       => array(
			'<iframe>x</iframe>y',
			'<html><head></head><body><iframe>x</iframe>y</body></html>',
		),
		'NOEMBED body'      => array(
			'a<noembed>x</noembed>',
			'<html><head></head><body>a<noembed>x</noembed></body></html>',
		),
		'NOFRAMES body'     => array(
			'a<noframes>x</noframes>',
			'<html><head></head><body>a<noframes>x</noframes></body></html>',
		),
		'NOFRAMES frameset' => array(
			'<html><frameset><noframes>x</noframes>',
			'<html><head></head><frameset><noframes>x</noframes></frameset></html>',
		),
	);
}
```
Also add an adjacent-token regression so the test suite catches the confusing variant where the following token appears as the first tree difference:
```php
public function test_iframe_raw_text_is_preserved_before_following_comment(): void {
	$processor = WP_HTML_Processor::create_full_parser(
		'<h3><div><small><dd><iframe>x</iframe><!---->'
	);

	$this->assertNotNull( $processor );
	$this->assertSame(
		'<html><head></head><body><h3><div><small><dd><iframe>x</iframe><!----></dd></small></div></h3></body></html>',
		$processor->serialize()
	);
}
```
Acceptance Criteria
- IFRAME, NOEMBED, and NOFRAMES serialization preserves their raw-text contents.
- Existing SCRIPT, STYLE, XMP, TEXTAREA, and TITLE serialization behavior remains unchanged.
- Fragment normalization remains idempotent for the affected elements.
- Full-document serialization remains idempotent for the affected elements in both body and frameset contexts.
- Adjacent sibling tokens no longer become the first tree difference solely because affected raw text was dropped.