19 changes: 19 additions & 0 deletions handoffs/README.md
@@ -0,0 +1,19 @@
# Fuzzing / testing work lanes

Self-contained handoff documents, one per independent lane of work. Each
can be picked up by a separate agent or contributor with no shared
context beyond the document itself.

| Lane | Doc | Shape of work |
|------|-----|---------------|
| Extend the UTF-8 encoding fuzzer | [extend-encoding-fuzzer.md](extend-encoding-fuzzer.md) | Add targets to an existing, working fuzzer |
| WP_HTML_Decoder fuzzer | [html-decoder-fuzzer.md](html-decoder-fuzzer.md) | New independent fuzzer, Dom\HTMLDocument oracle |
| WP_Token_Map property tests | [token-map-properties.md](token-map-properties.md) | PHPUnit property tests against a naive reference |
| Legacy UTF-8 helper divergence survey | [legacy-utf8-divergence-survey.md](legacy-utf8-divergence-survey.md) | One-shot documented survey, no continuous fuzzing |

Background: `tools/encoding-fuzz/` (this branch, commit `3cc3e64765`)
is a working differential fuzzer for `wp_is_valid_utf8()` /
`wp_scrub_utf8()` and their pure-PHP fallbacks. ~570k cases have run
clean against five independent oracles. Its architecture (deterministic
`(seed, case)` generation, oracle battery, worker/runner/replay/minimize,
mutation-tested harness) is the reference pattern for the other lanes.
175 changes: 175 additions & 0 deletions handoffs/extend-encoding-fuzzer.md
@@ -0,0 +1,175 @@
# Handoff: extend the UTF-8 encoding fuzzer with three new targets

## Status

All three sections DONE. The host fuzzer (`tools/encoding-fuzz/`) is
complete and working on branch `fuzz-encoder`; read its `README.md`
first. ~570k cases had run clean against the original targets before
this work started.

## Goal

Round out coverage of `src/wp-includes/compat-utf8.php` (plus one
html-api encoder) by adding:

1. `_wp_utf8_encode_fallback()` / `_wp_utf8_decode_fallback()`
differentials against the native `utf8_encode()` / `utf8_decode()`.
2. `wp_has_noncharacters()` / `_wp_has_noncharacters_fallback()`
differential — **after resolving the semantic question below**.
3. A one-shot exhaustive test of
`WP_HTML_Decoder::code_point_to_utf8_bytes()` (not fuzzing).

## 1. utf8_encode / utf8_decode fallbacks — DONE, premise corrected

**Implemented**, but a premise of this section was falsified during
implementation and the oracle design adapted (2026-06-10, PHP 8.4.21):

- The original claim "No known divergence going in" was wrong: the
earlier spot-probes missed it. Native `utf8_decode()` groups a
well-formed lead byte with its expected continuation length and emits
a single `?` for surrogates (`ED A0 80` → `?`), beyond-U+10FFFF
sequences (`F4 90 80 80` → `?`), 3-/4-byte overlongs, and a
well-formed lead before an invalid continuation (`C2 C0` → `?`),
where the fallback emits one `?` per maximal subpart (`???` etc.).
- That divergence is **intentional** in WordPress: the PHP 9 polyfill
in `compat.php` prefers `mb_convert_encoding()` (which uses maximal
subparts) over the fallback, and the #63863 PHPUnit tests assert
mb-equivalence. So "the native functions are the only ground truth"
was also wrong — WP's chosen ground truth is `mb_convert_encoding()`.
- Oracle design as built: `mb` (`mb_convert_encoding()`) is the primary
encode/decode oracle on arbitrary input; `native` is an encode oracle
on arbitrary input and a decode oracle on **valid input only**
(native ≡ mb on every valid code point, verified exhaustively). On
PHP 9+ `native` reports `oracle-unavailable` and is skipped. The
legacy divergence is pinned by hand-computed battery vectors.
- Round-trip `decode(encode(s)) === s`, encode-output-validity, the
smoke-test mutation variants (cp1252-confused encoder, identity
encoder, per-byte decoder, valid-input mangler, round-trip violator,
null-returning targets), and the `ENCODING_FUZZ_FAULT=encode-cp1252`
/ `decode-per-byte` end-to-end fault variants are all in place.
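
The round-trip and encode-output-validity checks can be sketched as below. This is a minimal illustration under stated assumptions, not the harness code: `latin1_to_utf8_sketch()` and `utf8_to_latin1_sketch()` are hypothetical stand-ins for the real targets, and `check_round_trip_sketch()` and its failure labels are invented names.

```php
<?php
// Hypothetical stand-in encoder: ISO-8859-1 bytes to UTF-8.
function latin1_to_utf8_sketch( string $bytes ): string {
	$out = '';
	for ( $i = 0, $n = strlen( $bytes ); $i < $n; $i++ ) {
		$b    = ord( $bytes[ $i ] );
		$out .= $b < 0x80
			? chr( $b )
			: chr( 0xC0 | ( $b >> 6 ) ) . chr( 0x80 | ( $b & 0x3F ) );
	}
	return $out;
}

// Hypothetical stand-in decoder; it only handles output of the encoder above.
function utf8_to_latin1_sketch( string $utf8 ): string {
	$out = '';
	for ( $i = 0, $n = strlen( $utf8 ); $i < $n; $i++ ) {
		$b = ord( $utf8[ $i ] );
		if ( $b < 0x80 ) {
			$out .= chr( $b );
		} else {
			// Two-byte sequence: lead byte C2 or C3, then one continuation byte.
			$next = ord( $utf8[ ++$i ] );
			$out .= chr( ( ( $b & 0x03 ) << 6 ) | ( $next & 0x3F ) );
		}
	}
	return $out;
}

// Returns null when both properties hold, or a short failure label.
function check_round_trip_sketch( callable $encode, callable $decode, string $input ): ?string {
	$encoded = $encode( $input );
	// preg_match() with the /u flag fails on ill-formed UTF-8 subjects.
	if ( 1 !== preg_match( '//u', $encoded ) ) {
		return 'encode-output-invalid';
	}
	if ( $decode( $encoded ) !== $input ) {
		return 'round-trip-violation';
	}
	return null;
}
```

Passing a deliberately broken callable (an identity encoder, a decoder that drops its input) should produce one of the two labels, which is the same idea as the mutation variants listed above.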

**Upstream finding, not fixed here:** the cited core test
`tests/phpunit/tests/formatting/deprecatedUtfEncodeDecode.php` has
vacuous invalid-input coverage — its surrogate branch interpolates
integers instead of `chr()` bytes (`"{$byte1}{$byte2}{$byte3}"`
produces ASCII digits), its single-quoted `'\x95'` data is literal
backslash text, and the `$i < 0xD800 || $i > 0xE000` boundary routes
valid U+E000 through the broken branch. It only ever asserts
mb-equivalence on valid input. Worth a follow-up patch on #63863.

## 2. wp_has_noncharacters — DONE via option (a); core decision still open

**Known divergence, confirmed empirically (2026-06-10):**

```php
$probe = "\xC0\xEF\xBF\xBE"; // invalid byte, then U+FFFE
wp_has_noncharacters( $probe ); // false — PCRE path: preg_match fails on ill-formed UTF-8
_wp_has_noncharacters_fallback( $probe ); // true — scan skips invalid spans, finds U+FFFE
```

**Implemented as option (a):** the fuzzer treats behavior as undefined
unless the input passes `wp_is_valid_utf8()`, and runs a three-way
differential on **valid inputs only**: `wp_has_noncharacters()` (PCRE
branch) vs `_wp_has_noncharacters_fallback()` vs a trivial
`mb_str_split()` / `mb_ord()` reference. The reference is
battery-verified at block boundaries, in the block interior, and at the
final two code points of every plane with their neighbors; the PCRE
class enumerates each plane by hand, so per-plane vectors are the
point. The probe above is
pinned as a fixed regression vector in the smoke test, so any semantic
change to either branch surfaces immediately. `BOUNDARY_CODE_POINTS`
in `lib/Generator.php` gained adjacent NON-noncharacters, a block
interior point, and mid-plane finals. Mutation variants: blind
detector, U+FDD0-block miss, over-eager detector; fault injection:
`ENCODING_FUZZ_FAULT=nonchars-miss-fdd0|nonchars-overeager` (one per
target).
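
A naive reference of the kind described above might look like the following. It is a sketch under assumptions: the function names are hypothetical, ext-mbstring is assumed to be loaded, and throwing on ill-formed input is one way to model the valid-only gate.

```php
<?php
// Pure-arithmetic predicate: U+FDD0..U+FDEF, plus the final two code
// points of every plane (U+FFFE/U+FFFF, U+1FFFE/U+1FFFF, ... U+10FFFF).
function is_noncharacter_sketch( int $cp ): bool {
	if ( $cp >= 0xFDD0 && $cp <= 0xFDEF ) {
		return true;
	}
	return $cp >= 0 && $cp <= 0x10FFFF && ( $cp & 0xFFFE ) === 0xFFFE;
}

// Naive string-level reference. Throws on ill-formed UTF-8, because
// behavior there is treated as undefined.
function has_noncharacters_reference_sketch( string $text ): bool {
	if ( ! mb_check_encoding( $text, 'UTF-8' ) ) {
		throw new InvalidArgumentException( 'undefined on ill-formed UTF-8' );
	}
	foreach ( mb_str_split( $text, 1, 'UTF-8' ) as $char ) {
		if ( is_noncharacter_sketch( mb_ord( $char, 'UTF-8' ) ) ) {
			return true;
		}
	}
	return false;
}
```

Keeping the predicate separate from the string scan lets the per-plane boundary vectors be checked without any encoding machinery in the way.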

**Still open upstream (option b path):** whether core should align the
implementations or document the undefined-on-invalid stance in the
`wp_has_noncharacters()` docblock. That needs a decision from the
function author (Trac discussion). Note for whoever picks that up: if
core aligns on PCRE semantics (false on any ill-formed input), the mb
reference oracle and its battery must be extended for ill-formed input
too — removing the valid-only gate alone is NOT sufficient, since the
reference throws on ill-formed input by design.

## 3. code_point_to_utf8_bytes — DONE; upstream finding documented

Implemented as `tools/encoding-fuzz/tests/code-point-to-utf8-exhaustive.php`
(standalone, not wired into `Bootstrap.php` — the class is required
only by this script, which parses cleanly with no other dependencies;
loading html-api code into every fuzz worker would buy nothing).
Every code point 0x0–0x10FFFF plus out-of-range probes, compared
against the pure-arithmetic `Generator::encode_code_point()` (the
independent oracle) with an additional `mb_chr( $cp, 'UTF-8' )`
consistency cross-check (the implementation is itself mb_chr-backed;
the cross-check would expose a bug shared between implementation and
arithmetic encoder). Surrogates and out-of-range values yield U+FFFD
as documented. Runs in ~0.4s, passes on PHP 8.4.21. The harness smoke
test executes it and proves its detection fires via the
`ENCODING_FUZZ_FAULT=codepoint-surrogate-qmark` broken variant.
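
For orientation, a pure-arithmetic encoder and an exhaustive sweep in the spirit of the script can be sketched as follows. The names `encode_code_point_sketch()` and `sweep_code_points_sketch()` are hypothetical; the real oracle is `Generator::encode_code_point()`, whose exact code is not reproduced here.

```php
<?php
// Arithmetic UTF-8 encoder; surrogates and out-of-range values yield U+FFFD.
function encode_code_point_sketch( int $cp ): string {
	if ( $cp < 0 || $cp > 0x10FFFF || ( $cp >= 0xD800 && $cp <= 0xDFFF ) ) {
		return "\xEF\xBF\xBD";
	}
	if ( $cp < 0x80 ) {
		return chr( $cp );
	}
	if ( $cp < 0x800 ) {
		return chr( 0xC0 | ( $cp >> 6 ) ) . chr( 0x80 | ( $cp & 0x3F ) );
	}
	if ( $cp < 0x10000 ) {
		return chr( 0xE0 | ( $cp >> 12 ) )
			. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
			. chr( 0x80 | ( $cp & 0x3F ) );
	}
	return chr( 0xF0 | ( $cp >> 18 ) )
		. chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
		. chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
		. chr( 0x80 | ( $cp & 0x3F ) );
}

// Compares $target against the arithmetic encoder for every code point
// plus a few out-of-range probes; returns the first few mismatches.
function sweep_code_points_sketch( callable $target, int $max_reported = 5 ): array {
	$mismatches = array();
	$compare    = static function ( int $cp ) use ( $target, $max_reported, &$mismatches ): void {
		if ( count( $mismatches ) < $max_reported && $target( $cp ) !== encode_code_point_sketch( $cp ) ) {
			$mismatches[] = $cp;
		}
	};
	for ( $cp = 0; $cp <= 0x10FFFF; $cp++ ) {
		$compare( $cp );
	}
	foreach ( array( -1, 0x110000, PHP_INT_MAX ) as $probe ) {
		$compare( $probe );
	}
	return $mismatches;
}
```

In a real run the target would be `WP_HTML_Decoder::code_point_to_utf8_bytes()`; a variant that emits `?` for surrogates should be reported starting at 0xD800, which mirrors the broken-variant proof described above.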

**Upstream finding (real bug — an unreleased trunk REGRESSION, not
fixed here):** the implementation is `mb_chr( $code_point )` with NO
explicit encoding, so it inherits `mb_internal_encoding()` — which
WordPress sets from `blog_charset` (`wp_set_internal_encoding()`,
`src/wp-includes/load.php`). On a non-UTF-8 site it returns raw legacy
bytes for mappable code points (e.g. `"\xE9"` for U+00E9 under
ISO-8859-1) while still returning UTF-8 U+FFFD for invalid ones,
contradicting its docblock. Aggravating facts for the upstream report:

- Introduced by [62424] (#65342, `@since 7.1.0`, unreleased): the
6.6.0 original was a pure-arithmetic encoder that always emitted
UTF-8 regardless of mbstring state. Fix-before-release territory.
- WP's own `_mb_chr()` polyfill in `compat.php` documents
`@param "UTF-8"|null $encoding Must be 'UTF-8' or null` and treats
null as UTF-8 — so mbstring-less hosts always emit UTF-8 while
mbstring hosts follow `blog_charset`. Same WordPress, divergent
output by extension presence.
- Named character references decode through the UTF-8 token map
regardless: on a latin1 site `&eacute;` → UTF-8 `C3 A9` but
`&#233;` → latin1 `E9` in the same decoded string. There is no
intentional-behavior steelman; output is mixed-encoding either way.
- The same commit silently changed `code_point_to_utf8_bytes( 0 )`
from `U+FFFD` to `"\0"` (the old guard was `$code_point <= 0`).
Callers are unaffected (`&#0;` is intercepted earlier) and the new
behavior matches the docblock, but it belongs in the same report.

One-line fix: `mb_chr( $code_point, 'UTF-8' )`. The script pins the
current buggy behavior as a labeled KNOWN ISSUE check so the stance
cannot silently go stale; update or remove the pin when fixed.
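
The encoding dependence can be reproduced outside WordPress with a few lines. This sketch assumes ext-mbstring is loaded; `probe_mb_chr_encoding_sketch()` is a hypothetical name, and switching `mb_internal_encoding()` by hand stands in for what `wp_set_internal_encoding()` does on a non-UTF-8 site.

```php
<?php
// Returns the hex bytes mb_chr() produces for U+00E9 with and without an
// explicit encoding while the internal encoding is ISO-8859-1.
function probe_mb_chr_encoding_sketch(): array {
	$previous = mb_internal_encoding();
	mb_internal_encoding( 'ISO-8859-1' );
	try {
		return array(
			'implicit' => bin2hex( mb_chr( 0xE9 ) ),          // inherits the internal encoding
			'explicit' => bin2hex( mb_chr( 0xE9, 'UTF-8' ) ), // pinned to UTF-8
		);
	} finally {
		// Always restore the caller's internal encoding.
		mb_internal_encoding( $previous );
	}
}

print_r( probe_mb_chr_encoding_sketch() );
```

Consistent with the finding above, the implicit call should report `e9` and the explicit call `c3a9`.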

## Verification / definition of done

All verified 2026-06-10 on PHP 8.4.21:

- `php tools/encoding-fuzz/tests/harness-smoke.php` passes, including
broken-variant detections for every added check (seventeen mutation
classes plus the exhaustive script's surrogate fault).
- Fault-injection variants per new target
(`ENCODING_FUZZ_FAULT=encode-cp1252|decode-per-byte|nonchars-miss-fdd0|nonchars-overeager`)
exercised worker → replay → minimize end to end; artifacts now record
the fault name and `pcre_u` in environment metadata. The script-local
`codepoint-surrogate-qmark` fault is proven via the smoke test's
subprocess run (the exhaustive script never enters the worker
pipeline).
- `php tools/encoding-fuzz/runner.php --lanes 4 --duration-seconds 60`
ran clean (32,000 cases, 0 failures, 0 stalled, final tree). Findings
that were triaged and documented rather than silenced: the legacy
`utf8_decode()` divergence (§1), the `wp_has_noncharacters()`
ill-formed-input divergence (§2), the `code_point_to_utf8_bytes()`
internal-encoding regression (§3), and the #63863 test bug (§1).
- README.md oracle/check tables updated (Encode/Decode/Nonchars).

## Gotchas inherited from the existing harness

- All scrub/validity oracles passed a hand-computed battery; new
oracles must too (`Oracles::battery()` pattern). iconv is excluded
for accepting code points above U+10FFFF — don't re-add it.
- Workers run checks in-process; an infinite loop in a new target will
trip the runner's 120s stall watchdog and record the seed. Keep that
property: no per-case subprocesses.
- Everything must stay derivable from `(seed, case index)` — no
`random_int()`, no time-dependent generation. Per-case chunking-type
randomness derives from `sha256(input)` (see
`Checks::check_chunked_scan()`).
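
As an illustration of input-derived randomness, the sketch below splits an input into chunks whose sizes come only from `sha256(input)`. It is an assumption about the general shape, not a copy of `Checks::check_chunked_scan()`; `derive_chunks_sketch()` and the `$max_chunk` parameter are hypothetical.

```php
<?php
// Splits $input into chunks whose sizes are derived from sha256( $input ),
// so the same input always produces the same chunking.
function derive_chunks_sketch( string $input, int $max_chunk = 8 ): array {
	$digest = hash( 'sha256', $input, true ); // 32 raw bytes
	$chunks = array();
	$offset = 0;
	$length = strlen( $input );
	for ( $i = 0; $offset < $length; $i++ ) {
		// Cycle through the digest bytes; each one picks a size in 1..$max_chunk.
		$size     = 1 + ( ord( $digest[ $i % 32 ] ) % $max_chunk );
		$chunks[] = substr( $input, $offset, $size );
		$offset  += $size;
	}
	return $chunks;
}
```

Because nothing depends on `random_int()` or the clock, replaying a recorded input reproduces the exact chunk boundaries that triggered a failure.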
130 changes: 130 additions & 0 deletions handoffs/html-decoder-fuzzer.md
@@ -0,0 +1,130 @@
# Handoff: independent fuzzer for WP_HTML_Decoder

## Status

Not started. This is a NEW fuzzer, separate from `tools/encoding-fuzz/`
(UTF-8 functions) and from the `html-api-fuzz` branch (whole-tree
parser comparison). Reuse the architecture of `tools/encoding-fuzz/` —
deterministic `(seed, case)` generation, oracle startup battery,
worker/runner/replay/minimize CLIs, mutation-tested harness smoke test —
but as its own tool directory (suggested: `tools/html-decoder-fuzz/`).

## Target

`WP_HTML_Decoder` in `src/wp-includes/html-api/class-wp-html-decoder.php`:

- `decode_text_node( $text )`
- `decode_attribute( $text )`
- `read_character_reference( $context, $text, $at, &$match_byte_length )`
- `attribute_starts_with( $haystack, $search, $case_sensitivity )`

This is security-relevant code: decoded attribute values feed
`javascript:` URL detection via `attribute_starts_with`. Existing unit
tests are thin (`tests/phpunit/tests/html-api/wpHtmlDecoder.php`, 4 test
methods) — fuzzing has real headroom here.

Dependency note: the named-reference path uses `WP_Token_Map` and the
`$html5_named_character_reference` map
(`src/wp-includes/html-api/html5-named-character-references.php`).
A decoder fuzzer transitively exercises both.

## Oracle

`Dom\HTMLDocument` (lexbor, PHP 8.4+) — the same oracle the
`html-api-fuzz` branch uses for tree comparison:

- Text context: parse `<!DOCTYPE html><body><div>PAYLOAD</div>`, read
the div's `textContent`; compare with `decode_text_node( PAYLOAD )`.
- Attribute context: parse `<div title="PAYLOAD">`, read
`getAttribute('title')`; compare with `decode_attribute( PAYLOAD )`.
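
The two oracle contexts could be wrapped roughly as follows. This is a sketch assuming PHP 8.4+ with ext-dom; `oracle_decode_sketch()` and the `'text'` / `'attribute'` context labels are hypothetical, and payloads are assumed to be pre-filtered so they contain no raw `<`, `"`, CR, or NUL.

```php
<?php
// Decodes $payload by letting Dom\HTMLDocument parse it in the given context.
function oracle_decode_sketch( string $context, string $payload ): string {
	$html = 'attribute' === $context
		? '<!DOCTYPE html><body><div title="' . $payload . '"></div>'
		: '<!DOCTYPE html><body><div>' . $payload . '</div>';

	// LIBXML_NOERROR silences parse-error warnings; the encoding is pinned
	// to UTF-8 so no sniffing takes place.
	$doc = Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, 'UTF-8' );
	$div = $doc->querySelector( 'div' );

	return 'attribute' === $context
		? (string) $div->getAttribute( 'title' )
		: (string) $div->textContent;
}
```

A comparison would then be `oracle_decode_sketch( 'text', $payload ) === WP_HTML_Decoder::decode_text_node( $payload )`, and likewise for the attribute context.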

Do NOT use `html_entity_decode( ENT_HTML5 )` as the primary oracle: it
does not implement the WHATWG attribute-context rules (named reference
without semicolon followed by `=` or alphanumeric must NOT decode in
attributes) and will drown the run in false divergences. It MAY serve
as a third opinion on the text context only, gated by a known-answer
battery like `Oracles::battery()` in the encoding fuzzer — verify
empirically before trusting it, including C1-control numeric reference
remapping (`&#x80;` → U+20AC etc.).

## Confounders the harness must neutralize

The oracle is a full HTML parser; the target is a pure decoder. The
generator must avoid payload bytes the parser treats specially, or the
comparison measures parser behavior instead of decoding:

- Raw `<` (and `>` or `&` followed by structure-breaking content): do
not try to escape `<` as text; restrict generated payloads so they
never contain a raw `<`. `&` is the whole point and is fine in both
contexts.
- Quote characters in the attribute payload — generate with `"`
excluded (or swap quote style per case), since it terminates the
attribute in the oracle document but not in `decode_attribute()`.
- CR / CRLF: the HTML parser normalizes `\r` and `\r\n` to `\n` before
tokenization; the decoder does not. Either exclude `\r` from payloads
or pre-normalize before comparison — decide once, document it.
- NUL bytes: the parser replaces U+0000 with U+FFFD in some contexts
and drops it in others; the decoder has its own documented NUL handling
(see existing test `test_character_reference_with_null_byte...`).
Probably exclude raw NUL from oracle-compared cases and cover NUL
via fixed regression vectors instead.
- Invalid UTF-8 payload bytes: lexbor may scrub them before the
tokenizer sees them. Start with valid-UTF-8 payloads only; invalid
bytes inside character references (`&am\xC0p;`) are a later, careful
extension.
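
One way to enforce these exclusions is a small payload gate applied before any oracle comparison. The sketch below takes the "exclude `\r`" and "exclude raw NUL" options; `is_oracle_safe_payload_sketch()` and the context labels are hypothetical names.

```php
<?php
// True when $payload can be compared against the parser oracle without
// measuring parser behavior instead of decoding.
function is_oracle_safe_payload_sketch( string $context, string $payload ): bool {
	// Valid UTF-8 only; preg_match() with /u fails on ill-formed subjects.
	if ( 1 !== preg_match( '//u', $payload ) ) {
		return false;
	}
	// Raw "<" changes tokenization, CR is normalized by the parser, and
	// NUL is replaced or dropped depending on context.
	if ( 1 === preg_match( '/[<\r\x00]/', $payload ) ) {
		return false;
	}
	// A double quote terminates the attribute in the oracle document.
	if ( 'attribute' === $context && str_contains( $payload, '"' ) ) {
		return false;
	}
	return true;
}
```

Rejected payloads need not be thrown away: they can still run through the oracle-free checks or be covered by fixed regression vectors.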

## Generator: entity grammar, not byte noise

Weighted mix targeting the reference-matching state machine:

- Named references from the real token map: exact (`&amp;`), without
semicolon (`&amp`), longest-match ambiguity (`&not` vs `&notin;` —
the map is greedy-longest), case variants (`&AMP` vs `&amp`),
truncations (`&am`), nonexistent lookalikes (`&ampx;`).
- The attribute-context discriminator: no-semicolon named reference
followed by `=`, by alphanumerics, by `;` later in the string —
decode in text, not in attribute.
- Numeric: decimal and hex, mixed case `x`/`X`, leading zeros (many),
value classes: ASCII, C1 controls 0x80–0x9F (windows-1252 remap
table), surrogates, noncharacters, > 0x10FFFF, huge (overflow
arithmetic), zero, missing digits (`&#;`, `&#x;`).
- Adjacency and boundaries: references back to back, reference at
string start/end, `&` at end of input, references split by the
string boundary at every prefix length (truncation sweep).
- Plain text with multibyte UTF-8 around references (offset arithmetic).

Each case is `(context, payload)`; derive both from the PRNG.
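
A deterministic derivation of `(context, payload)` might be sketched like this. Everything here is illustrative: the function names, the small reference pools, and the SHA-256 counter-mode stream are assumptions, and a real generator would draw names from the full token map with proper weights.

```php
<?php
// Deterministic integer stream for one case: sha256 over "seed:case:counter".
function case_stream_sketch( int $seed, int $case ): Closure {
	$counter = 0;
	$buffer  = '';
	return static function ( int $bound ) use ( $seed, $case, &$counter, &$buffer ): int {
		if ( strlen( $buffer ) < 4 ) {
			$buffer .= hash( 'sha256', "{$seed}:{$case}:" . $counter++, true );
		}
		$value  = unpack( 'N', $buffer )[1]; // first four bytes, big-endian
		$buffer = substr( $buffer, 4 );
		return $value % $bound; // slight modulo bias is acceptable in a sketch
	};
}

// Builds one (context, payload) pair from the entity grammar.
function generate_case_sketch( int $seed, int $case ): array {
	$next     = case_stream_sketch( $seed, $case );
	$context  = array( 'text', 'attribute' )[ $next( 2 ) ];
	$names    = array( 'amp', 'not', 'notin', 'AMP', 'am', 'ampx' );
	$numerics = array( '#0', '#x80', '#X9F', '#xD800', '#x110000', '#0000065', '#', '#x' );
	$tails    = array( ';', '', '=', 'a', '1' );
	$payload  = '';
	$parts    = 1 + $next( 4 );
	for ( $i = 0; $i < $parts; $i++ ) {
		$pool     = 0 === $next( 2 ) ? $names : $numerics;
		$payload .= '&' . $pool[ $next( count( $pool ) ) ] . $tails[ $next( count( $tails ) ) ];
		if ( 0 === $next( 3 ) ) {
			$payload .= "\u{00E9}"; // multibyte filler to stress offset arithmetic
		}
	}
	return array( $context, $payload );
}
```

The tail pool is where the attribute-context discriminator comes from: an empty tail followed by the next reference, `=`, or an alphanumeric exercises exactly the no-semicolon rules.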

## Checks

1. Differential vs oracle in both contexts (primary).
2. `read_character_reference()` consistency: decoding the whole string
by repeated `read_character_reference` + literal spans must equal
`decode()` output, and `$match_byte_length` must always advance.
3. `attribute_starts_with( $haystack, $search )` agrees with
`str_starts_with( decode_attribute( $haystack ), $search )` for
ASCII search strings, both case sensitivities.
4. Output is valid UTF-8 (reuse `mb_check_encoding`).
5. Idempotence does NOT hold for decoding (`&amp;amp;` decodes to
`&amp;`) — do not add it; add instead: decoding text with no `&`
is identity.
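
Checks 3 and 5 can be written against callables so that broken variants are easy to swap in. This is a sketch: the function names and failure labels are hypothetical, and the mode strings `'case-sensitive'` and `'ascii-case-insensitive'` are assumed values for `$case_sensitivity` that should be confirmed against the class.

```php
<?php
// Check 3: attribute_starts_with() must agree with a prefix test on the
// decoded attribute, for ASCII search strings in both modes.
function check_starts_with_sketch( callable $decode_attribute, callable $starts_with, string $haystack, string $search ): ?string {
	$decoded  = $decode_attribute( $haystack );
	$length   = strlen( $search );
	$expected = array(
		'case-sensitive'         => 0 === strncmp( $decoded, $search, $length ),
		'ascii-case-insensitive' => 0 === strncasecmp( $decoded, $search, $length ),
	);
	foreach ( $expected as $mode => $want ) {
		if ( $starts_with( $haystack, $search, $mode ) !== $want ) {
			return "starts-with-mismatch:{$mode}";
		}
	}
	return null;
}

// Check 5: text without "&" must decode to itself.
function check_no_ampersand_identity_sketch( callable $decode, string $text ): ?string {
	if ( str_contains( $text, '&' ) ) {
		return null; // property does not apply
	}
	return $decode( $text ) === $text ? null : 'no-ampersand-identity-violation';
}
```

In the real harness the callables would presumably be the `WP_HTML_Decoder` static methods; in the smoke test they would be the deliberately broken variants.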

## Harness requirements (carry over from encoding fuzzer)

- Known-answer startup battery for the oracle path (hand-computed
WHATWG expectations, including the C1 remap and no-semicolon
attribute rules) — if the local `Dom\HTMLDocument` fails it, abort
loudly.
- Mutation-tested smoke test: broken decoder variants (skip C1 remap,
decode no-semicolon refs in attributes, off-by-one match length)
must be caught before the fuzzer is trusted.
- Failure artifacts self-contained (base64 input + context + expected/
got), replay + signature-preserving minimizer.
- Note `html-api-fuzz` branch precedent: its `attributes-entities`
generator profile and oracle handling are prior art worth reading
(`tools/html-api-fuzz/lib/Generator.php` on that branch).

## Definition of done

Smoke test green (including broken-variant detection), a 5-minute
multi-lane run either clean or with triaged findings, README with the
oracle-confounder decisions documented.