Html api fuzzer by sirreal · Pull Request #40 · sirreal/wordpress-develop

sirreal · 2026-06-09T21:12:54Z

Trac ticket:

Use of AI Tools

Yeah 🙂

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

- Removed separate print_script_data() method and wp_print_script_data() wrapper
- Removed action hooks from wp_footer and admin_print_footer_scripts
- Filter now runs during script processing in do_item()
- Data script tag is output immediately before each script tag
- Updated all tests to use wp_print_scripts instead of wp_print_script_data
- Added test to verify data tag appears before script tag

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

- Add JSON_INVALID_UTF8_SUBSTITUTE to json_encode flags
- This prevents failures on invalid UTF-8 by substituting with U+FFFD
- Provides defense-in-depth beyond wp_json_encode's sanity checking
- Document the flag's purpose in the inline comment

Co-authored-by: dmsnell <5431237+dmsnell@users.noreply.github.com>

This reverts commit d89d333.

The flag was added based on incorrect analysis. wp_json_encode() already handles invalid UTF-8 through its fallback mechanism (_wp_json_sanity_check), converting invalid bytes to "?".

The implementation now matches script modules exactly, using only:
- JSON_HEX_TAG
- JSON_UNESCAPED_SLASHES
- JSON_UNESCAPED_UNICODE (UTF-8 pages only)
- JSON_UNESCAPED_LINE_TERMINATORS (UTF-8 pages only)

Co-authored-by: dmsnell <5431237+dmsnell@users.noreply.github.com>

Without this flag, json_encode() returns false on invalid UTF-8, triggering wp_json_encode()'s expensive fallback mechanism (_wp_json_sanity_check), which recursively walks all data and re-encodes.

With JSON_INVALID_UTF8_SUBSTITUTE:
- Invalid UTF-8 bytes are substituted with U+FFFD (�) in a single pass
- No fallback overhead
- Standard, consistent behavior

This is the proper solution rather than relying on the fallback.

Co-authored-by: dmsnell <5431237+dmsnell@users.noreply.github.com>
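The single-pass substitution described in this commit message can be illustrated outside PHP. What follows is a minimal Python sketch, not the WordPress implementation: `encode_with_substitution` is a hypothetical helper, the input is assumed to arrive as raw bytes, and Python's `errors="replace"` decoding stands in for the effect of JSON_INVALID_UTF8_SUBSTITUTE.

```python
import json


def encode_with_substitution(raw: bytes) -> str:
    """JSON-encode a byte string, substituting U+FFFD for invalid UTF-8.

    Hypothetical helper for illustration only; it models the behavior the
    commit message describes and is not part of any WordPress API.
    """
    # Invalid sequences become U+FFFD while decoding, so there is no
    # failure followed by a second recursive walk over the data.
    text = raw.decode("utf-8", errors="replace")
    # ensure_ascii=False leaves non-ASCII characters unescaped in the output.
    return json.dumps(text, ensure_ascii=False)


valid = encode_with_substitution("café".encode("utf-8"))
invalid = encode_with_substitution(b"caf\xff")
```

Here `b"caf\xff"` ends in a byte that can never start a UTF-8 sequence, so the encoded string carries U+FFFD in its place rather than causing the encode to fail.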

Red TDD step: browser-verified expectation that classList-equivalent reads preserve NULL bytes in values set through the API; the U+0000 replacement belongs to the tokenizer, and document-sourced values already receive it in get_attribute(). See #65372.

class_list() received its NULL-byte replacement when reading raw class values; that replacement now happens in get_attribute() for values from the input document. Performing it on API-supplied values diverged from browsers, where classList preserves NULL bytes in values set via setAttribute(). See #65372.

Benchmark-guided: reading an attribute value applies up to three str_replace passes which doubled read cost for long values containing no bytes needing replacement. Guarding with strpos keeps the common case at two fast scans; values are typically free of CR and NULL. Benchmark (PHP 8.4, medians of 3): scanning 100-tag documents reading 3 attributes each, 2000 iterations: trunk 667ms, unguarded 714ms, guarded 699ms. Reading a 10.8KB clean attribute value 200k times: trunk 147ms, unguarded 313ms, guarded 258ms. The remaining cost is the unavoidable byte inspection. See #65372.
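The guard pattern can be sketched as follows. This is a Python model written under assumptions, not the patched PHP: `normalize_attribute_value` is a hypothetical name, membership tests stand in for strpos, and the three replacements (CRLF to LF, CR to LF, NULL to U+FFFD) are assumed to be the passes the message refers to.

```python
def normalize_attribute_value(value: str) -> str:
    """Apply input preprocessing only when a character needing it is present.

    Hypothetical helper; the replacement set is an assumption based on the
    commit message, not a copy of the WordPress code.
    """
    # Two cheap scans cover the common case: most values hold no CR or NULL.
    if "\r" not in value and "\x00" not in value:
        return value
    # Slow path: CRLF and lone CR become LF, then NULL becomes U+FFFD.
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    return value.replace("\x00", "\ufffd")


clean = "plain attribute value"
same = normalize_attribute_value(clean)
fixed = normalize_attribute_value("a\r\nb\rc\x00d")
```

On the fast path the input is returned untouched, which is where the saving over unconditional replacement passes would come from.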

Red TDD step from adversarial review: a named character reference without a terminating semicolon must decode when followed by a NULL byte or any non-ASCII byte. Replacing NULL with U+FFFD before decoding fed the decoder a multi-byte follower whose classification by ctype_alnum() depends on the process locale, suppressing valid decodes in attribute values, diverging from browsers and from trunk. See #65372.

The tokenizer replaces U+0000 NULL bytes as it consumes input, so a character reference without a terminating semicolon sees the raw NULL byte as its follower, which is unambiguous, and the reference decodes. Replacing before decoding handed the decoder U+FFFD's lead byte, whose ctype_alnum() classification depends on the process locale, wrongly suppressing the decode under UTF-8 locales. No character reference decodes into NULL, so replacing after decoding is equivalent for the value's own bytes and faithful to the tokenizer's order. See #65372.

Per the named-character-reference state, a semicolon-less reference is ambiguous only when followed by an ASCII alphanumeric or equals sign. ctype_alnum() classifies bytes 0x80 and above as alphanumeric under UTF-8 locales, wrongly suppressing decodes followed by any non-ASCII byte and making decoding depend on the process locale. See #65372.
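A locale-independent version of that check might look like the sketch below. It is written in Python for illustration; `is_ambiguous_follower` is a hypothetical name, and the function models only the follower test for semicolon-less references in attribute values, not the full decoder.

```python
def is_ambiguous_follower(byte: int) -> bool:
    """Return True when a semicolon-less reference must stay undecoded.

    Only ASCII alphanumerics and the equals sign count. Bytes 0x80 and
    above, and the NULL byte, never make the reference ambiguous.
    """
    return (
        0x30 <= byte <= 0x39      # 0-9
        or 0x41 <= byte <= 0x5A   # A-Z
        or 0x61 <= byte <= 0x7A   # a-z
        or byte == 0x3D           # =
    )


followed_by_letter = is_ambiguous_follower(ord("a"))
followed_by_null = is_ambiguous_follower(0x00)
followed_by_fffd_lead = is_ambiguous_follower(0xEF)  # first byte of U+FFFD in UTF-8
```

Because the ranges are spelled out as byte values, the result cannot change with the process locale.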

Red TDD step from adversarial review: next_tag() must match tag names in the same U+FFFD-replaced alphabet that get_tag() exposes, so the getter round-trips into queries, raw NULL spellings match nothing, and the Tag Processor agrees with the HTML Processor, whose queries already compare against the replaced token name. See #65372.

next_tag() compared sought tag names against raw document bytes while get_tag() returns names with NULL bytes replaced by U+FFFD, breaking the getter-to-query round trip and disagreeing with the HTML Processor's queries. Matching now happens in the exposed alphabet; the existing byte comparison is unchanged for names without NULL bytes, so the hot path costs the same. See #65372.
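The round trip can be modeled with a short sketch. It is Python, not the Tag Processor code: `exposed_tag_name` and `tag_name_matches` are hypothetical names, and the sketch assumes the getter exposes names ASCII-uppercased with NULL replaced by U+FFFD.

```python
import string

# ASCII-only case folding; non-ASCII characters are left untouched.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def exposed_tag_name(raw_name: str) -> str:
    """Name as a getter would expose it (assumed: uppercased, NULL replaced)."""
    return raw_name.translate(_ASCII_UPPER).replace("\x00", "\ufffd")


def tag_name_matches(raw_name: str, sought: str) -> bool:
    """Compare a sought name against a document name in the exposed alphabet."""
    sought = sought.translate(_ASCII_UPPER)
    if "\x00" not in raw_name:
        # Hot path: names without NULL need no replacement before comparing.
        return raw_name.translate(_ASCII_UPPER) == sought
    return exposed_tag_name(raw_name) == sought


raw = "di\x00v"
round_trip = tag_name_matches(raw, exposed_tag_name(raw))
raw_spelling = tag_name_matches(raw, "DI\x00V")
```

A query built from the getter's output matches the tag it came from, while a query spelled with a raw NULL matches nothing, since no exposed name contains one.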

From adversarial review: pins for class helpers over replaced source values, boolean attributes with NULL-byte names, verbatim prefix matching in get_attribute_names_with_prefix(), and HTML Processor end-tag matching across NULL and U+FFFD spellings (browser-verified: both spellings tokenize to the same name). Documents the @since 7.1.0 behavior on indirectly-affected getters and the known asymmetry of set_modifiable_text(), whose value reads back normalized unlike attribute values, which round-trip verbatim. See #65372.

Red TDD step: decoded carriage returns in text and attribute values must serialize as a numeric character reference so that normalized output is idempotent: a raw CR in serialized output would be normalized to a line feed when parsed again. The raw-CR attribute and class-update cases already pass through the preprocessing-correct getters and pin that behavior. See #65372.

The serializer emitted decoded carriage returns raw into text and attribute values, where input preprocessing turns them into line feeds on the next parse: normalized output never reached a fixed point for documents containing a carriage-return character reference. Escaping CR after htmlspecialchars() keeps the character through parse/serialize round trips. Attribute values are read through get_attribute(), whose input preprocessing guarantees that raw source carriage returns already arrive normalized to line feeds, so only genuinely decoded CRs are escaped. See #65372.
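The fixed-point argument can be checked with a small model. The sketch below is Python under stated assumptions, not the serializer itself: `serialize_text` and `parse_text` are hypothetical names, `html.escape` stands in for htmlspecialchars(), `html.unescape` stands in for character reference decoding, and `&#13;` is an assumed spelling for the escaped carriage return.

```python
import html


def serialize_text(decoded: str) -> str:
    """Escape decoded text, then escape any carriage return that survived."""
    escaped = html.escape(decoded)
    # Assumed spelling: "&#13;" keeps the CR out of the raw output.
    return escaped.replace("\r", "&#13;")


def parse_text(source: str) -> str:
    """Model of parsing: input preprocessing first, then reference decoding."""
    # Raw CRLF and lone CR in the source are normalized to LF.
    preprocessed = source.replace("\r\n", "\n").replace("\r", "\n")
    return html.unescape(preprocessed)


once = serialize_text("a\rb")
reparsed = parse_text(once)
twice = serialize_text(reparsed)
raw_cr_lost = parse_text("a\rb")
```

In this model a raw CR in the source is lost to preprocessing, while an escaped one survives the parse and serializes to the same output a second time.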

An attribute value set through set_attribute() may contain NULL bytes; serializing them as U+FFFD keeps normalized output idempotent, where browsers' innerHTML emits the raw byte and loses it to replacement on the next parse. This pins the behavior ahead of consolidating the serializer's NULL handling. See #65372.

The getters now expose tag and attribute names with NULL bytes already replaced by U+FFFD, leaving the serializer's name scrubbing dead, and the only live input to the per-attribute whole-buffer scrub was an API-supplied attribute value. That replacement moves into serialize_decoded_text() next to the carriage-return escaping, which exists for the same reason: emitting bytes the next parse would transform. UTF-8 scrubbing of qualified names remains, as invalid sequences can still reach serialization through source names. See #65372.

From adversarial review: pins that SCRIPT and STYLE contents serialize without escaping, where character references do not decode, and that serialize_token() output for modified class and NULL-containing attribute values parses back to the same decoded values. See #65372.


sirreal and others added 30 commits July 22, 2025 08:52

Apply patch from <select> parser PR

aaee111

html5lib/html5lib-tests#178

Update reset_insertion_mode_appropriately

9ab745d

Remove SELECT case and update comments numbers. The SELECT case was removed from the algorithm in the standard.

Deprecate the has_element_in_select_scope method

1acfce7

This was removed from the HTML standard

Update "normal" end tag handling to include SELECT

aac853d

Update INPUT handling to account for SELECT

89c113d

Update HR tag handling

f75a08f

Update SELECT insertion handling

099f013

Update OPTION, OPTGROUP insertion handling

7fe3598

Remove in_select and in_select_in_table step methods

20e0807

These insertion modes are removed from the standard.

Deprecate unused insertion mode constants

45edeae

Adjust html5lib-test expectation

446cdaa

See https://github.com/html5lib/html5lib-tests/pull/178/files?show-viewed-files=true&file-filters%5B%5D=#r2222195057

Add close option tag comments, do not handle cloning

0ed0a68

Update has_element_in_scope with SELECT

8beb128

Bail when parsing SELECTEDCONTENT requiring clone

6be6287

When SELECT > BUTTON > SELECTEDCONTENT is encountered, the selected option may need to be cloned into the SELECTEDCONTENT. The HTML processor does not support this action as it may require out of order processing.

Update html5lib-tests

9e79bb2

html5lib/html5lib-tests@5ad026d

Initial plan

661399e

Add script_data_{$handle} filter for classic scripts

2c0960c

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Fix test docblocks - remove TBD ticket annotations

f4afe13

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Implement script_data_{$handle} filter correctly - outputs JSON scrip…

d3cd233

…t tags separate from wp_localize_script Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Optimize print_script_data method - extract charset check outside loo…

4e490ee

…p and simplify iteration Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Remove redundant is_array check and improve test naming

14e4ff9

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Add comment explaining wp_print_inline_script_tag return parameter

3edd50e

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Address code review feedback

0fbee29

- Rename $data_tag to $script_data_tag for clarity
- Remove unnecessary (string) coercion from wp_json_encode
- Update @since tag to 7.0.0
- Remove empty line before closing brace

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>

Add test for xlink serialization

592c169

Serialize adjusted foreign attributes with prefixes

8df62fd

HTML API: Expand foreign attribute serialization coverage

0ac7e6d

sirreal added 30 commits June 11, 2026 18:29

HTML API: Add test for case-insensitive class update flushing.

5292c7d

Red TDD step from adversarial review: get_attribute( 'CLASS' ) returned a stale value when class updates were pending, because the flush guard compared the attribute name case-sensitively. See #65372.

HTML API: Flush class updates for any case spelling of "class".

8c26adf

Attribute lookups are ASCII-case-insensitive, but the pending-class flush in get_attribute() compared the requested name case-sensitively, returning a stale value for spellings like "CLASS". See #65372.

Merge PR #53: HTML API input preprocessing

5a3cbf1

# Conflicts:
# src/wp-includes/html-api/class-wp-html-tag-processor.php

Merge updated PR #42: decoded carriage return serialization

ce2af0e

# Conflicts:
# src/wp-includes/html-api/class-wp-html-processor.php
# tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php

Merge PR #51: preserve rawtext contents

ce08149

# Conflicts:
# tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php

Merge PR #17: classic script data filter

adbe354

# Conflicts:
# tests/phpunit/tests/dependencies/scripts.php

Document merged PR rationale

5c3b731

Fix script data test tag assertion

57d458d

Update merge note for script data test fix

088b775

Use Lexbor master for source oracle

63b2c4a

Improve HTML API fuzz terminal payload coverage

05cf434

Add syntax-heavy terminal payloads and a text-fragment profile for short standalone inputs, with smoke coverage for final generated contexts. See #65372.

Document br mutation oracle follow-up

1a14d7c

Expand known element normalization coverage

b269d6b

Add explicit comment form fuzz coverage

64f160c

Preserve Lexbor SVG adjusted names

e504903

Add in-process probe mode to fuzzer minimizer

b80297d

sirreal wants to merge 144 commits into trunk from html-api-fuzz


2 participants
