# HTML API: Raw-Text Serialization Data Loss in `IFRAME`, `NOEMBED`, and `NOFRAMES`

Generated: 2026-06-11

## Summary

The HTML API serializer currently drops raw-text contents for three HTML elements:

- `IFRAME`
- `NOEMBED`
- `NOFRAMES`

This is a WordPress HTML API serialization bug. The parser can build a tree containing text inside these elements, but `WP_HTML_Processor::serialize_token()` emits the element with an empty body. Parsing the serialized output therefore produces a different tree from the original parse.

The dominant failure class is `normalize-tree-changed`: the initial WordPress tree and the comparison tree agree, but the tree produced after WordPress serialization and reparse no longer contains the affected text node.

## Impact In The Run

In the four-lane run snapshot:

- Total `normalize-tree-changed` rows: `233,188`
- Rows whose first changed path is text under `IFRAME`, `NOEMBED`, or `NOFRAMES`: `230,697`
- Distinct affected signatures: `77,174`
- Distinct affected families: `116`

Breakdown by affected element:

| Element | Rows | Distinct signatures | Distinct families |
|---|---:|---:|---:|
| `NOFRAMES` | 109,375 | 26,408 | 94 |
| `IFRAME` | 61,209 | 25,430 | 95 |
| `NOEMBED` | 60,113 | 25,336 | 94 |

Breakdown by parser mode:

| Mode | Rows | Distinct signatures | Distinct families |
|---|---:|---:|---:|
| `fragment-body` | 123,379 | 49,659 | 108 |
| `full-document` | 107,318 | 27,515 | 92 |

These counts show that this is not a single edge case. The affected rows are 230,697 of 233,188, or about 98.9% of the `normalize-tree-changed` volume in the run.
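As a quick consistency check on the figures above, the short script below re-adds the per-element and per-mode row counts and computes the affected share. It uses only numbers copied from this section; the variable names are illustrative.

```php
<?php
// Row counts copied from the tables in this section.
$total_changed = 233188;
$by_element    = array(
	'NOFRAMES' => 109375,
	'IFRAME'   => 61209,
	'NOEMBED'  => 60113,
);
$by_mode       = array(
	'fragment-body' => 123379,
	'full-document' => 107318,
);

$affected = array_sum( $by_element );

// Both breakdowns should partition the same set of affected rows.
var_dump( $affected === array_sum( $by_mode ) );

// Share of all normalize-tree-changed rows explained by the three elements.
printf(
	"%d of %d rows (%.2f%%)\n",
	$affected,
	$total_changed,
	100 * $affected / $total_changed
);
```

Both breakdowns sum to the same 230,697 rows, which is roughly 98.93% of the total.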

## Root Cause

The relevant code is in `src/wp-includes/html-api/class-wp-html-processor.php`, in `WP_HTML_Processor::serialize_token()`.

Current logic:

```php
if ( $in_html && in_array( $tag_name, array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'STYLE', 'TEXTAREA', 'TITLE', 'XMP' ), true ) ) {
	$text = $this->get_modifiable_text();

	switch ( $tag_name ) {
		case 'IFRAME':
		case 'NOEMBED':
		case 'NOFRAMES':
			$text = '';
			break;

		case 'SCRIPT':
		case 'STYLE':
		case 'XMP':
			break;

		default:
			$text = self::serialize_decoded_text( $text );
	}

	$html .= "{$text}</{$qualified_name}>";
}
```

The serializer recognizes all three affected tags as self-contained raw-text-like elements, obtains their modifiable text, then discards that text. The final serialized HTML keeps the tags but removes the content.

The neighboring `SCRIPT`, `STYLE`, and `XMP` cases preserve raw text. `TEXTAREA` and `TITLE` serialize escaped decoded text. `IFRAME`, `NOEMBED`, and `NOFRAMES` should not be blanked during serialization because their text is part of the parsed tree.
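To make the three branches easier to compare, here is a minimal stand-alone model of the body-text decision. It is a sketch and not the WordPress implementation: `model_raw_text_body()` and its `$blank_legacy_tags` flag are hypothetical names, and `htmlspecialchars()` is only an assumed stand-in for `serialize_decoded_text()`, whose exact escaping is not shown in this report.

```php
<?php
/**
 * Models which text ends up between the opening and closing tag.
 *
 * Hypothetical helper for illustration only. When $blank_legacy_tags is true
 * it mirrors the current switch; when false it mirrors the proposed behavior.
 */
function model_raw_text_body( string $tag_name, string $text, bool $blank_legacy_tags ): string {
	switch ( $tag_name ) {
		case 'IFRAME':
		case 'NOEMBED':
		case 'NOFRAMES':
			// Current behavior discards the text; the proposed behavior keeps it verbatim.
			return $blank_legacy_tags ? '' : $text;

		case 'SCRIPT':
		case 'STYLE':
		case 'XMP':
			// Raw text is emitted unchanged.
			return $text;

		default:
			// TEXTAREA and TITLE: decoded text is re-escaped (stand-in escaping, an assumption).
			return htmlspecialchars( $text, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );
	}
}

foreach ( array( 'IFRAME', 'NOEMBED', 'NOFRAMES', 'SCRIPT', 'TITLE' ) as $tag_name ) {
	printf(
		"%-8s current=%-12s proposed=%s\n",
		$tag_name,
		var_export( model_raw_text_body( $tag_name, 'a < b', true ), true ),
		var_export( model_raw_text_body( $tag_name, 'a < b', false ), true )
	);
}
```

Only the first three rows differ between the two columns, which matches the claim that the change is confined to `IFRAME`, `NOEMBED`, and `NOFRAMES`.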

## Why This Is A Serializer Bug

The failing pattern is:

1. Parse input with `WP_HTML_Processor`.
2. Render or inspect the tree. The raw-text node exists.
3. Serialize or normalize through the HTML API.
4. Parse the serialized output again.
5. The raw-text node is gone.

The parser is not simply disagreeing with another implementation. The loss occurs within WordPress's own serialize-and-reparse cycle, so the round trip does not preserve the parsed tree and the document's contents change.
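The five steps above can be sketched as a small reproduction script. It assumes WordPress 6.7 or later is already loaded (for example by running the file through WP-CLI's `wp eval-file`), and `list_legacy_raw_text()` is a hypothetical helper written for this illustration. Without WordPress the script only defines the helper and does nothing else.

```php
<?php
/**
 * Hypothetical helper: lists "TAG:text" for each IFRAME, NOEMBED, and
 * NOFRAMES token found while parsing $html as a body fragment.
 */
function list_legacy_raw_text( string $html ): array {
	$found     = array();
	$processor = WP_HTML_Processor::create_fragment( $html );
	if ( null === $processor ) {
		return $found;
	}

	while ( $processor->next_token() ) {
		$name = $processor->get_token_name();
		if (
			'#tag' === $processor->get_token_type() &&
			in_array( $name, array( 'IFRAME', 'NOEMBED', 'NOFRAMES' ), true )
		) {
			// The element's text is exposed as modifiable text on the tag token.
			$found[] = $name . ':' . $processor->get_modifiable_text();
		}
	}

	return $found;
}

// Guarded so the file is harmless when WordPress is not loaded.
if ( class_exists( 'WP_HTML_Processor' ) ) {
	$input      = '<noembed>x</noembed>y';
	$normalized = WP_HTML_Processor::normalize( $input ); // Steps 1 and 3: parse, then serialize.

	var_dump( list_legacy_raw_text( $input ) );                // Step 2: inspect the original parse.
	var_dump( $normalized );                                   // The serialized output.
	var_dump( list_legacy_raw_text( (string) $normalized ) );  // Steps 4 and 5: reparse and inspect.
}
```

If the two lists differ, the text was lost between the first parse and the reparse, which is the failure this report describes.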

## Problematic Examples

The following examples are direct, standalone reductions. "Current output" is what the serializer emits today. "Expected output" preserves the parsed raw-text content while still applying normal full-document or fragment wrapping.

### Case Matrix

| Case | Input | Current output | Expected output |
|---|---|---|---|
| `IFRAME` in a fragment | `<iframe>x</iframe>y` | `<iframe></iframe>y` | `<iframe>x</iframe>y` |
| `IFRAME` in a full document body | `<iframe>x</iframe>y` | `<html><head></head><body><iframe></iframe>y</body></html>` | `<html><head></head><body><iframe>x</iframe>y</body></html>` |
| `NOEMBED` in a fragment | `<noembed>x</noembed>y` | `<noembed></noembed>y` | `<noembed>x</noembed>y` |
| `NOEMBED` in a full document body | `a<noembed>x</noembed>` | `<html><head></head><body>a<noembed></noembed></body></html>` | `<html><head></head><body>a<noembed>x</noembed></body></html>` |
| `NOFRAMES` in a fragment | `<section><noframes>x</noframes>y</section>` | `<section><noframes></noframes>y</section>` | `<section><noframes>x</noframes>y</section>` |
| `NOFRAMES` in a full document body | `a<noframes>x</noframes>` | `<html><head></head><body>a<noframes></noframes></body></html>` | `<html><head></head><body>a<noframes>x</noframes></body></html>` |
| `NOFRAMES` in a full document frameset | `<html><frameset><noframes>x</noframes>` | `<html><head></head><frameset><noframes></noframes></frameset></html>` | `<html><head></head><frameset><noframes>x</noframes></frameset></html>` |

### Adjacent-Token Variants

When another token follows the affected element, the first visible tree difference may appear as that following token moving into the comparison position after the text is removed. The root problem is still the missing raw-text payload.

| Case | Input | Current output | Expected output |
|---|---|---|---|
| `IFRAME` before a following comment in a full document | `<h3><div><small><dd><iframe>x</iframe><!---->` | `<html><head></head><body><h3><div><small><dd><iframe></iframe><!----></dd></small></div></h3></body></html>` | `<html><head></head><body><h3><div><small><dd><iframe>x</iframe><!----></dd></small></div></h3></body></html>` |
| `NOFRAMES` before a bogus comment in a fragment | `<section><noframes>x</noframes><!>` | `<section><noframes></noframes><!----></section>` | `<section><noframes>x</noframes><!----></section>` |

## Behavioral Consequences

This bug can change content in several ways:

- Text fallback inside `IFRAME` is removed.
- Fallback content inside `NOEMBED` is removed.
- `NOFRAMES` content is removed both in ordinary body parsing and in frameset parsing.
- A later sibling token can appear as the first difference after reparse, because the expected text node disappeared.
- Serialization does not round-trip affected documents and fragments: input that is already in normalized form comes back with its text removed.

For example, this fragment:

```html
<noembed>x</noembed>y
```

currently serializes to:

```html
<noembed></noembed>y
```

After reparse, the `NOEMBED` element no longer contains the text node `"x"`. This is a semantic content change, not only a formatting difference.

## Recommended Fix

Preserve the `get_modifiable_text()` value for `IFRAME`, `NOEMBED`, and `NOFRAMES` instead of forcing it to an empty string.

Minimal shape:

```php
switch ( $tag_name ) {
	case 'SCRIPT':
	case 'STYLE':
	case 'XMP':
	case 'IFRAME':
	case 'NOEMBED':
	case 'NOFRAMES':
		break;

	default:
		$text = self::serialize_decoded_text( $text );
}
```

This keeps the existing raw-text preservation behavior for `SCRIPT`, `STYLE`, and `XMP`, and extends it to the three elements currently losing content.

## Suggested Regression Tests

Add focused serializer tests around the affected tags and modes. These tests should not depend on generated artifacts.

```php
/**
 * @dataProvider data_raw_text_elements_preserved_in_fragments
 */
public function test_raw_text_element_contents_are_preserved_in_fragments( string $html ): void {
	$this->assertSame( $html, WP_HTML_Processor::normalize( $html ) );
}

public static function data_raw_text_elements_preserved_in_fragments(): array {
	return array(
		'IFRAME fragment'   => array( '<iframe>x</iframe>y' ),
		'NOEMBED fragment'  => array( '<noembed>x</noembed>y' ),
		'NOFRAMES fragment' => array( '<section><noframes>x</noframes>y</section>' ),
	);
}

/**
 * @dataProvider data_raw_text_elements_preserved_in_full_documents
 */
public function test_raw_text_element_contents_are_preserved_in_full_documents( string $html, string $expected ): void {
	$processor = WP_HTML_Processor::create_full_parser( $html );

	$this->assertNotNull( $processor );
	$this->assertSame( $expected, $processor->serialize() );
}

public static function data_raw_text_elements_preserved_in_full_documents(): array {
	return array(
		'IFRAME body' => array(
			'<iframe>x</iframe>y',
			'<html><head></head><body><iframe>x</iframe>y</body></html>',
		),
		'NOEMBED body' => array(
			'a<noembed>x</noembed>',
			'<html><head></head><body>a<noembed>x</noembed></body></html>',
		),
		'NOFRAMES body' => array(
			'a<noframes>x</noframes>',
			'<html><head></head><body>a<noframes>x</noframes></body></html>',
		),
		'NOFRAMES frameset' => array(
			'<html><frameset><noframes>x</noframes>',
			'<html><head></head><frameset><noframes>x</noframes></frameset></html>',
		),
	);
}
```

Also add an adjacent-token regression so the test suite catches the confusing variant where the following token appears as the first tree difference:

```php
public function test_iframe_raw_text_is_preserved_before_following_comment(): void {
	// The trailing comment is the token that otherwise shows up as the first tree difference.
	$processor = WP_HTML_Processor::create_full_parser(
		'<h3><div><small><dd><iframe>x</iframe><!---->'
	);

	$this->assertNotNull( $processor );
	$this->assertSame(
		'<html><head></head><body><h3><div><small><dd><iframe>x</iframe><!----></dd></small></div></h3></body></html>',
		$processor->serialize()
	);
}
```

## Acceptance Criteria

- `IFRAME`, `NOEMBED`, and `NOFRAMES` serialization preserves their raw-text contents.
- Existing `SCRIPT`, `STYLE`, `XMP`, `TEXTAREA`, and `TITLE` serialization behavior remains unchanged.
- Fragment normalization remains idempotent for the affected elements.
- Full-document serialization remains idempotent for the affected elements in both body and frameset contexts.
- Adjacent sibling tokens no longer become the first tree difference solely because affected raw text was dropped.
