
Html api fuzzer #40

Draft
sirreal wants to merge 144 commits into trunk from html-api-fuzz

Conversation

sirreal commented Jun 9, 2026

Owner

Trac ticket:

Use of AI Tools

Yeah 🙂


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal and others added 30 commits July 22, 2025 08:52
Remove SELECT case and update comment numbers.

The SELECT case was removed from the algorithm in the standard.
This was removed from the HTML standard
These insertion modes are removed from the standard.
When SELECT > BUTTON > SELECTEDCONTENT is encountered, the selected
option may need to be cloned into the SELECTEDCONTENT. The HTML
processor does not support this action as it may require out of order
processing.
Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
…t tags separate from wp_localize_script

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
…p and simplify iteration

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
- Removed separate print_script_data() method and wp_print_script_data() wrapper
- Removed action hooks from wp_footer and admin_print_footer_scripts
- Filter now runs during script processing in do_item()
- Data script tag is output immediately before each script tag
- Updated all tests to use wp_print_scripts instead of wp_print_script_data
- Added test to verify data tag appears before script tag

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
- Rename $data_tag to $script_data_tag for clarity
- Remove unnecessary (string) coercion from wp_json_encode
- Update @since tag to 7.0.0
- Remove empty line before closing brace

Co-authored-by: sirreal <841763+sirreal@users.noreply.github.com>
- Add JSON_INVALID_UTF8_SUBSTITUTE to json_encode flags
- This prevents failures on invalid UTF-8 by substituting with U+FFFD
- Provides defense-in-depth beyond wp_json_encode's sanity checking
- Document the flag's purpose in the inline comment

Co-authored-by: dmsnell <5431237+dmsnell@users.noreply.github.com>
This reverts commit d89d333. The flag was added based on incorrect analysis.
wp_json_encode() already handles invalid UTF-8 through its fallback
mechanism (_wp_json_sanity_check), converting invalid bytes to "?".

The implementation now matches script modules exactly, using only:
- JSON_HEX_TAG
- JSON_UNESCAPED_SLASHES
- JSON_UNESCAPED_UNICODE (UTF-8 pages only)
- JSON_UNESCAPED_LINE_TERMINATORS (UTF-8 pages only)

Co-authored-by: dmsnell <5431237+dmsnell@users.noreply.github.com>
Without this flag, json_encode() returns false on invalid UTF-8, triggering
wp_json_encode()'s expensive fallback mechanism (_wp_json_sanity_check)
which recursively walks all data and re-encodes.

With JSON_INVALID_UTF8_SUBSTITUTE:
- Invalid UTF-8 bytes are substituted with U+FFFD (�) in a single pass
- No fallback overhead
- Standard, consistent behavior

This is the proper solution rather than relying on the fallback.

Co-authored-by: dmsnell <5431237+dmsnell@users.noreply.github.com>
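
The difference the flag makes can be seen with plain `json_encode()`. This is a minimal sketch, not the code from the commit: the `$data` payload is hypothetical, and it calls `json_encode()` directly rather than `wp_json_encode()` so that it runs outside WordPress.

```php
<?php
// Hypothetical payload containing an invalid UTF-8 byte (0xFF).
$data = array( 'title' => "caf\xFF" );

// Without the flag, json_encode() fails on invalid UTF-8 and returns false.
$plain = json_encode( $data );

// With the flag, the invalid byte is substituted with U+FFFD during encoding.
// JSON_UNESCAPED_UNICODE keeps the replacement character as raw UTF-8.
$substituted = json_encode( $data, JSON_INVALID_UTF8_SUBSTITUTE | JSON_UNESCAPED_UNICODE );

var_dump( $plain );
echo $substituted, "\n";
```

In this sketch `$plain` is `false`, which is the condition that would send `wp_json_encode()` into its fallback, while `$substituted` is a valid JSON string.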
sirreal added 30 commits June 11, 2026 18:29
Red TDD step: browser-verified expectation that classList-equivalent
reads preserve NULL bytes in values set through the API; the U+0000
replacement belongs to the tokenizer, and document-sourced values
already receive it in get_attribute().

See #65372.
class_list() received its NULL-byte replacement when reading raw class
values; that replacement now happens in get_attribute() for values
from the input document. Performing it on API-supplied values diverged
from browsers, where classList preserves NULL bytes in values set via
setAttribute().

See #65372.
Benchmark-guided: reading an attribute value applies up to three
str_replace passes, which doubled the read cost for long values
containing no bytes needing replacement. Guarding with strpos keeps the
common case at two fast scans; values are typically free of CR and NULL.

Benchmark (PHP 8.4, medians of 3): scanning 100-tag documents reading
3 attributes each, 2000 iterations: trunk 667ms, unguarded 714ms,
guarded 699ms. Reading a 10.8KB clean attribute value 200k times:
trunk 147ms, unguarded 313ms, guarded 258ms. The remaining cost is
the unavoidable byte inspection.

See #65372.
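
The guard described above might look roughly like the following. This is a sketch under assumptions: the function name `example_normalize_value()` is hypothetical, and the three replacements (CRLF, lone CR, NULL) are an assumed reading of the "up to three str_replace passes" rather than the actual code in `get_attribute()`.

```php
<?php
/**
 * Hypothetical helper: normalizes newlines and replaces NULL bytes,
 * skipping the replacement passes when the value needs none of them.
 */
function example_normalize_value( string $value ): string {
	// Fast path: two scans, one for CR and one for NULL. Most values contain neither.
	if ( false === strpos( $value, "\r" ) && false === strpos( $value, "\x00" ) ) {
		return $value;
	}

	// Slow path: the replacement passes run only when something needs replacing.
	$value = str_replace( "\r\n", "\n", $value );
	$value = str_replace( "\r", "\n", $value );
	return str_replace( "\x00", "\u{FFFD}", $value );
}
```

A clean value costs two `strpos()` scans and no allocation, which matches the "two fast scans" description in the commit message.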
Red TDD step from adversarial review: a named character reference
without a terminating semicolon must decode when followed by a NULL
byte or any non-ASCII byte. Replacing NULL with U+FFFD before decoding
fed the decoder a multi-byte follower whose classification by
ctype_alnum() depends on the process locale, suppressing valid decodes
in attribute values, diverging from browsers and from trunk.

See #65372.
The tokenizer replaces U+0000 NULL bytes as it consumes input, so a
character reference without a terminating semicolon sees the raw NULL
byte as its follower, which is unambiguous, and the reference decodes.
Replacing before decoding handed the decoder U+FFFD's lead byte, whose
ctype_alnum() classification depends on the process locale, wrongly
suppressing the decode under UTF-8 locales. No character reference
decodes into NULL, so replacing after decoding is equivalent for the
value's own bytes and faithful to the tokenizer's order.

See #65372.
Per the named-character-reference state, a semicolon-less reference is
ambiguous only when followed by an ASCII alphanumeric or equals sign.
ctype_alnum() classifies bytes 0x80 and above as alphanumeric under
UTF-8 locales, wrongly suppressing decodes followed by any non-ASCII
byte and making decoding depend on the process locale.

See #65372.
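
One way to make the follower check independent of the locale is to test against an explicit ASCII set. The sketch below assumes a hypothetical helper name, `example_is_ambiguous_follower()`, and is not the code from the commit.

```php
<?php
/**
 * Hypothetical helper: reports whether the byte following a semicolon-less
 * named character reference makes it ambiguous, i.e. whether the byte is an
 * ASCII alphanumeric or an equals sign.
 */
function example_is_ambiguous_follower( string $byte ): bool {
	$ascii_alnum = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

	// strspn() matches against the explicit byte set, so no locale is consulted.
	return '=' === $byte || 1 === strspn( $byte, $ascii_alnum );
}
```

With this check, a NULL byte or any byte at 0x80 and above is never treated as an ambiguous follower, so the reference decodes.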
Red TDD step from adversarial review: next_tag() must match tag names
in the same U+FFFD-replaced alphabet that get_tag() exposes, so the
getter round-trips into queries, raw NULL spellings match nothing, and
the Tag Processor agrees with the HTML Processor, whose queries
already compare against the replaced token name.

See #65372.
next_tag() compared sought tag names against raw document bytes while
get_tag() returns names with NULL bytes replaced by U+FFFD, breaking
the getter-to-query round trip and disagreeing with the HTML
Processor's queries. Matching now happens in the exposed alphabet; the
existing byte comparison is unchanged for names without NULL bytes, so
the hot path costs the same.

See #65372.
Red TDD step from adversarial review: get_attribute( 'CLASS' )
returned a stale value when class updates were pending, because the
flush guard compared the attribute name case-sensitively.

See #65372.
Attribute lookups are ASCII-case-insensitive, but the pending-class
flush in get_attribute() compared the requested name case-sensitively,
returning a stale value for spellings like "CLASS".

See #65372.
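
The fix amounts to comparing the requested name without regard to ASCII case. A minimal sketch, assuming a hypothetical helper `example_is_class_attribute()` rather than the actual guard in `get_attribute()`:

```php
<?php
/**
 * Hypothetical helper: decides whether a requested attribute name refers to
 * the class attribute, regardless of how the caller spelled its case.
 */
function example_is_class_attribute( string $name ): bool {
	// strcasecmp() returns 0 when the strings are equal ignoring case.
	return 0 === strcasecmp( 'class', $name );
}
```

A case-sensitive comparison such as `'class' === $name` would miss "CLASS" and skip the flush, which is the stale read the commit describes.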
From adversarial review: pins for class helpers over replaced source
values, boolean attributes with NULL-byte names, verbatim prefix
matching in get_attribute_names_with_prefix(), and HTML Processor
end-tag matching across NULL and U+FFFD spellings (browser-verified:
both spellings tokenize to the same name). Documents the @since 7.1.0
behavior on indirectly-affected getters and the known asymmetry of
set_modifiable_text(), whose value reads back normalized unlike
attribute values, which round-trip verbatim.

See #65372.
Red TDD step: decoded carriage returns in text and attribute values
must serialize as &#13; so that normalized output is idempotent: a
raw CR in serialized output would be normalized to a line feed when
parsed again. The raw-CR attribute and class-update cases already pass
by way of the preprocessing-correct getters and pin that behavior.

See #65372.
The serializer emitted decoded carriage returns raw into text and
attribute values, where input preprocessing turns them into line feeds
on the next parse: normalized output never reached a fixed point for
documents containing &#13;. Escaping CR after htmlspecialchars() keeps
the character through parse/serialize round trips. Attribute values
read through get_attribute(), whose input preprocessing guarantees raw
source carriage returns already arrive normalized to line feeds, so
only genuinely decoded CRs are escaped.

See #65372.
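
The ordering matters: escaping the carriage return before `htmlspecialchars()` would turn the `&` of `&#13;` into `&amp;`. A minimal sketch of the order described above, assuming a hypothetical function name and flag choice rather than the serializer's actual code:

```php
<?php
/**
 * Hypothetical helper: escapes decoded text for HTML output and then encodes
 * carriage returns so they survive input preprocessing on the next parse.
 */
function example_escape_decoded_text( string $decoded ): string {
	// First escape HTML syntax characters. This leaves "\r" untouched.
	$escaped = htmlspecialchars( $decoded, ENT_QUOTES | ENT_SUBSTITUTE | ENT_HTML5, 'UTF-8' );

	// Then encode CR as a character reference. Running this step last keeps
	// the ampersand of "&#13;" from being escaped a second time.
	return str_replace( "\r", '&#13;', $escaped );
}

echo example_escape_decoded_text( "a\rb & c" ), "\n";
```

A raw CR in the output would be normalized to a line feed by the next parser, whereas `&#13;` decodes back to a carriage return, so a second parse and serialize cycle produces the same output.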
An attribute value set through set_attribute() may contain NULL bytes;
serializing them as U+FFFD keeps normalized output idempotent, where
browsers' innerHTML emits the raw byte and loses it to replacement on
the next parse. This pins the behavior ahead of consolidating the
serializer's NULL handling.

See #65372.
The getters now expose tag and attribute names with NULL bytes already
replaced by U+FFFD, leaving the serializer's name scrubbing dead, and
the only live input to the per-attribute whole-buffer scrub was an
API-supplied attribute value. That replacement moves into
serialize_decoded_text() next to the carriage-return escaping, which
exists for the same reason: emitting bytes the next parse would
transform. UTF-8 scrubbing of qualified names remains, as invalid
sequences can still reach serialization through source names.

See #65372.
From adversarial review: pins that SCRIPT and STYLE contents serialize
without escaping, where character references do not decode, and that
serialize_token() output for modified class and NULL-containing
attribute values parses back to the same decoded values.

See #65372.
# Conflicts:
#	src/wp-includes/html-api/class-wp-html-tag-processor.php
# Conflicts:
#	src/wp-includes/html-api/class-wp-html-processor.php
#	tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
# Conflicts:
#	tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
# Conflicts:
#	tests/phpunit/tests/dependencies/scripts.php
Add syntax-heavy terminal payloads and a text-fragment profile for short standalone inputs, with smoke coverage for final generated contexts.

See #65372.

2 participants