HTML API: Raw-Text Serialization Data Loss in IFRAME, NOEMBED, and NOFRAMES #50

@sirreal

Generated: 2026-06-11

Summary

The HTML API serializer currently drops raw-text contents for three HTML elements:

  • IFRAME
  • NOEMBED
  • NOFRAMES

This is a WordPress HTML API serialization bug. The parser can build a tree containing text inside these elements, but WP_HTML_Processor::serialize_token() emits the element with an empty body. Parsing the serialized output therefore produces a different tree from the original parse.

The dominant failure class is normalize-tree-changed: the initial WordPress tree and the comparison tree agree, but the tree produced after WordPress serialization and reparse no longer contains the affected text node.

Impact In The Run

In the four-lane run snapshot:

  • Total normalize-tree-changed rows: 233,188
  • Rows whose first changed path is text under IFRAME, NOEMBED, or NOFRAMES: 230,697
  • Distinct affected signatures: 77,174
  • Distinct affected families: 116

Breakdown by affected element:

Element  | Rows    | Distinct signatures | Distinct families
NOFRAMES | 109,375 | 26,408              | 94
IFRAME   | 61,209  | 25,430              | 95
NOEMBED  | 60,113  | 25,336              | 94

Breakdown by parser mode:

Mode          | Rows    | Distinct signatures | Distinct families
fragment-body | 123,379 | 49,659              | 108
full-document | 107,318 | 27,515              | 92

These counts show that this is not a single edge case. The affected rows account for 230,697 of the 233,188 normalize-tree-changed rows in the run, about 98.9%.

Root Cause

The relevant code is in src/wp-includes/html-api/class-wp-html-processor.php, in WP_HTML_Processor::serialize_token().

Current logic:

if ( $in_html && in_array( $tag_name, array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'STYLE', 'TEXTAREA', 'TITLE', 'XMP' ), true ) ) {
	$text = $this->get_modifiable_text();

	switch ( $tag_name ) {
		case 'IFRAME':
		case 'NOEMBED':
		case 'NOFRAMES':
			$text = '';
			break;

		case 'SCRIPT':
		case 'STYLE':
		case 'XMP':
			break;

		default:
			$text = self::serialize_decoded_text( $text );
	}

	$html .= "{$text}</{$qualified_name}>";
}

The serializer recognizes all three affected tags as self-contained raw-text-like elements, obtains their modifiable text, then discards that text. The final serialized HTML keeps the tags but removes the content.

The neighboring SCRIPT, STYLE, and XMP cases preserve raw text. TEXTAREA and TITLE serialize escaped decoded text. IFRAME, NOEMBED, and NOFRAMES should not be blanked during serialization because their text is part of the parsed tree.

Why This Is A Serializer Bug

The failing pattern is:

  1. Parse input with WP_HTML_Processor.
  2. Render or inspect the tree. The raw-text node exists.
  3. Serialize or normalize through the HTML API.
  4. Parse the serialized output again.
  5. The raw-text node is gone.

The parser is not simply disagreeing with another implementation. The loss occurs across WordPress's own serialize and reparse cycle. That makes the output non-idempotent and changes document contents.
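
The steps above can be sketched as a small check. This is a minimal sketch, not part of the HTML API: text_survives_normalization() and $toy_normalize are hypothetical names, and the toy normalizer only imitates the reported blanking with a regular expression so that the example runs without WordPress. In a WordPress environment, WP_HTML_Processor::normalize() could presumably be passed as the callable instead.

```php
<?php
/**
 * Reports whether a probe string from the input is still present after
 * one normalize (parse and serialize) pass.
 *
 * @param callable $normalize Maps an HTML string to normalized HTML, or null on failure.
 * @param string   $html      Input HTML.
 * @param string   $probe     Text expected to survive normalization.
 */
function text_survives_normalization( callable $normalize, string $html, string $probe ): bool {
	$normalized = $normalize( $html );

	// A null result means the input could not be processed at all.
	if ( null === $normalized ) {
		return false;
	}

	return false !== strpos( $normalized, $probe );
}

// Hypothetical stand-in that imitates the reported behavior: it empties the
// body of IFRAME, NOEMBED, and NOFRAMES and leaves everything else untouched.
$toy_normalize = static function ( string $html ): ?string {
	return preg_replace( '~<(iframe|noembed|noframes)>.*?</\1>~is', '<$1></$1>', $html );
};

var_dump( text_survives_normalization( $toy_normalize, '<noembed>x</noembed>y', 'x' ) ); // bool(false)
var_dump( text_survives_normalization( $toy_normalize, '<p>x</p>y', 'x' ) );             // bool(true)
```

A substring probe is cruder than comparing trees, but it is enough to flag the lost text node in the reductions listed in this report.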

Problematic Examples

The following examples are direct, standalone reductions. "Current output" is what the serializer emits today. "Expected output" preserves the parsed raw-text content while still applying normal full-document or fragment wrapping.

Case Matrix

Case | Input | Current output | Expected output
IFRAME in a fragment | <iframe>x</iframe>y | <iframe></iframe>y | <iframe>x</iframe>y
IFRAME in a full document body | <iframe>x</iframe>y | <html><head></head><body><iframe></iframe>y</body></html> | <html><head></head><body><iframe>x</iframe>y</body></html>
NOEMBED in a fragment | <noembed>x</noembed>y | <noembed></noembed>y | <noembed>x</noembed>y
NOEMBED in a full document body | a<noembed>x</noembed> | <html><head></head><body>a<noembed></noembed></body></html> | <html><head></head><body>a<noembed>x</noembed></body></html>
NOFRAMES in a fragment | <section><noframes>x</noframes>y</section> | <section><noframes></noframes>y</section> | <section><noframes>x</noframes>y</section>
NOFRAMES in a full document body | a<noframes>x</noframes> | <html><head></head><body>a<noframes></noframes></body></html> | <html><head></head><body>a<noframes>x</noframes></body></html>
NOFRAMES in a full document frameset | <html><frameset><noframes>x</noframes> | <html><head></head><frameset><noframes></noframes></frameset></html> | <html><head></head><frameset><noframes>x</noframes></frameset></html>

Adjacent-Token Variants

When another token follows the affected element, the first visible tree difference may appear as that following token moving into the comparison position after the text is removed. The root problem is still the missing raw-text payload.

Case | Input | Current output | Expected output
IFRAME before a following comment in a full document | <h3><div><small><dd><iframe>x</iframe><!----> | <html><head></head><body><h3><div><small><dd><iframe></iframe><!----></dd></small></div></h3></body></html> | <html><head></head><body><h3><div><small><dd><iframe>x</iframe><!----></dd></small></div></h3></body></html>
NOFRAMES before a bogus comment in a fragment | <section><noframes>x</noframes><!> | <section><noframes></noframes><!----></section> | <section><noframes>x</noframes><!----></section>
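
To see why the following token shows up as the first difference, it can help to compare flat token sequences. The sketch below is illustrative only: first_difference_index() is a hypothetical helper, and the token labels are hand-written approximations of what a tree walk would report, not output captured from the HTML API.

```php
<?php
/**
 * Returns the index of the first position where two token lists differ,
 * or null when they are identical.
 *
 * @param string[] $before Tokens from the original parse.
 * @param string[] $after  Tokens from the parse of the serialized output.
 */
function first_difference_index( array $before, array $after ): ?int {
	$length = max( count( $before ), count( $after ) );

	for ( $i = 0; $i < $length; $i++ ) {
		// A missing entry on either side also counts as a difference.
		if ( ( $before[ $i ] ?? null ) !== ( $after[ $i ] ?? null ) ) {
			return $i;
		}
	}

	return null;
}

// Hand-written token labels for: <section><noframes>x</noframes><!>
$before = array( 'SECTION', 'NOFRAMES', '#text "x"', '#comment' );
$after  = array( 'SECTION', 'NOFRAMES', '#comment' );

$index = first_difference_index( $before, $after );
echo "{$index}: {$before[ $index ]} vs {$after[ $index ]}\n"; // 2: #text "x" vs #comment
```

The comparison reports the comment at the differing position even though the comment itself is unchanged. The missing text node one step earlier is what shifts it.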

Behavioral Consequences

This bug can change content in several ways:

  • Text fallback inside IFRAME is removed.
  • Fallback content inside NOEMBED is removed.
  • NOFRAMES content is removed both in ordinary body parsing and in frameset parsing.
  • A later sibling token can appear as the first difference after reparse, because the expected text node disappeared.
  • Normalization is not idempotent for affected documents and fragments.

For example, this fragment:

<noembed>x</noembed>y

currently serializes to:

<noembed></noembed>y

After reparse, the NOEMBED element no longer contains the text node "x". This is a semantic content change, not merely a formatting difference.

Recommended Fix

Preserve the get_modifiable_text() value for IFRAME, NOEMBED, and NOFRAMES instead of forcing it to an empty string.

Minimal shape:

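switch ( $tag_name ) {
	// Emit the modifiable text verbatim for these elements.
	case 'SCRIPT':
	case 'STYLE':
	case 'XMP':
	case 'IFRAME':
	case 'NOEMBED':
	case 'NOFRAMES':
		break;

	// TEXTAREA and TITLE: escape the decoded text.
	default:
		$text = self::serialize_decoded_text( $text );
}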

This keeps the existing raw-text preservation behavior for SCRIPT, STYLE, and XMP, and extends it to the three elements currently losing content.
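
To make the behavioral difference concrete, the following standalone sketch models only the switch for self-contained elements. It is not the WordPress implementation: model_self_contained_body() and its $blank_legacy_elements flag are hypothetical, and htmlspecialchars() is assumed as a stand-in for serialize_decoded_text(), whose exact escaping rules are not shown in this report.

```php
<?php
/**
 * Models the body emitted for a self-contained element.
 *
 * @param string $tag_name              Uppercase tag name.
 * @param string $text                  Stand-in for the get_modifiable_text() value.
 * @param bool   $blank_legacy_elements True models current behavior, false the proposed fix.
 */
function model_self_contained_body( string $tag_name, string $text, bool $blank_legacy_elements ): string {
	switch ( $tag_name ) {
		case 'IFRAME':
		case 'NOEMBED':
		case 'NOFRAMES':
			// Current behavior discards the text; the proposed fix keeps it verbatim.
			return $blank_legacy_elements ? '' : $text;

		case 'SCRIPT':
		case 'STYLE':
		case 'XMP':
			return $text;

		default:
			// Assumed stand-in for serialize_decoded_text().
			return htmlspecialchars( $text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );
	}
}

// Print both behaviors side by side for a few representative tags.
foreach ( array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'TITLE' ) as $tag ) {
	$current  = model_self_contained_body( $tag, 'a<b', true );
	$proposed = model_self_contained_body( $tag, 'a<b', false );
	printf( "%-8s current=%-10s proposed=%s\n", $tag, var_export( $current, true ), var_export( $proposed, true ) );
}
```

In this model only the three affected tags change between the two modes, which matches the intent that SCRIPT, STYLE, XMP, TEXTAREA, and TITLE behavior stays as it is.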

Suggested Regression Tests

Add focused serializer tests around the affected tags and modes. These tests should not depend on generated artifacts.

/**
 * @dataProvider data_raw_text_elements_preserved_in_fragments
 */
public function test_raw_text_element_contents_are_preserved_in_fragments( string $html ): void {
	$this->assertSame( $html, WP_HTML_Processor::normalize( $html ) );
}

public static function data_raw_text_elements_preserved_in_fragments(): array {
	return array(
		'IFRAME fragment'   => array( '<iframe>x</iframe>y' ),
		'NOEMBED fragment'  => array( '<noembed>x</noembed>y' ),
		'NOFRAMES fragment' => array( '<section><noframes>x</noframes>y</section>' ),
	);
}

/**
 * @dataProvider data_raw_text_elements_preserved_in_full_documents
 */
public function test_raw_text_element_contents_are_preserved_in_full_documents( string $html, string $expected ): void {
	$processor = WP_HTML_Processor::create_full_parser( $html );

	$this->assertNotNull( $processor );
	$this->assertSame( $expected, $processor->serialize() );
}

public static function data_raw_text_elements_preserved_in_full_documents(): array {
	return array(
		'IFRAME body' => array(
			'<iframe>x</iframe>y',
			'<html><head></head><body><iframe>x</iframe>y</body></html>',
		),
		'NOEMBED body' => array(
			'a<noembed>x</noembed>',
			'<html><head></head><body>a<noembed>x</noembed></body></html>',
		),
		'NOFRAMES body' => array(
			'a<noframes>x</noframes>',
			'<html><head></head><body>a<noframes>x</noframes></body></html>',
		),
		'NOFRAMES frameset' => array(
			'<html><frameset><noframes>x</noframes>',
			'<html><head></head><frameset><noframes>x</noframes></frameset></html>',
		),
	);
}

Also add an adjacent-token regression so the test suite catches the confusing variant where the following token appears as the first tree difference:

public function test_iframe_raw_text_is_preserved_before_following_comment(): void {
	$processor = WP_HTML_Processor::create_full_parser(
		'<h3><div><small><dd><iframe>x</iframe><!---->'
	);

	$this->assertNotNull( $processor );
	$this->assertSame(
		'<html><head></head><body><h3><div><small><dd><iframe>x</iframe><!----></dd></small></div></h3></body></html>',
		$processor->serialize()
	);
}

Acceptance Criteria

  • IFRAME, NOEMBED, and NOFRAMES serialization preserves their raw-text contents.
  • Existing SCRIPT, STYLE, XMP, TEXTAREA, and TITLE serialization behavior remains unchanged.
  • Fragment normalization remains idempotent for the affected elements.
  • Full-document serialization remains idempotent for the affected elements in both body and frameset contexts.
  • Adjacent sibling tokens no longer become the first tree difference solely because affected raw text was dropped.
