sirreal · Jun 10, 2026
diff --git a/handoffs/README.md b/handoffs/README.md
@@ -0,0 +1,19 @@
+# Fuzzing / testing work lanes
+
+Self-contained handoff documents, one per independent lane of work. Each
+can be picked up by a separate agent or contributor with no shared
+context beyond the document itself.
+
+| Lane | Doc | Shape of work |
+|------|-----|---------------|
+| Extend the UTF-8 encoding fuzzer | [extend-encoding-fuzzer.md](extend-encoding-fuzzer.md) | Add targets to an existing, working fuzzer |
+| WP_HTML_Decoder fuzzer | [html-decoder-fuzzer.md](html-decoder-fuzzer.md) | New independent fuzzer, `Dom\HTMLDocument` oracle |
+| WP_Token_Map property tests | [token-map-properties.md](token-map-properties.md) | PHPUnit property tests against a naive reference |
+| Legacy UTF-8 helper divergence survey | [legacy-utf8-divergence-survey.md](legacy-utf8-divergence-survey.md) | One-shot documented survey, no continuous fuzzing |
+
+Background: `tools/encoding-fuzz/` (this branch, commit `3cc3e64765`)
+is a working differential fuzzer for `wp_is_valid_utf8()` /
+`wp_scrub_utf8()` and their pure-PHP fallbacks. ~570k cases have run
+clean against five independent oracles. Its architecture (deterministic
+`(seed, case)` generation, oracle battery, worker/runner/replay/minimize,
+mutation-tested harness) is the reference pattern for the other lanes.
diff --git a/handoffs/extend-encoding-fuzzer.md b/handoffs/extend-encoding-fuzzer.md
@@ -0,0 +1,175 @@
+# Handoff: extend the UTF-8 encoding fuzzer with three new targets
+
+## Status
+
+All three sections DONE. The host fuzzer (`tools/encoding-fuzz/`) is
+complete and working on branch `fuzz-encoder`; read its `README.md`
+first. ~570k cases had run clean against the original targets before
+this work started.
+
+## Goal
+
+Round out coverage of `src/wp-includes/compat-utf8.php` (plus one
+html-api encoder) by adding:
+
+1. `_wp_utf8_encode_fallback()` / `_wp_utf8_decode_fallback()`
+   differentials against the native `utf8_encode()` / `utf8_decode()`.
+2. `wp_has_noncharacters()` / `_wp_has_noncharacters_fallback()`
+   differential — **after resolving the semantic question below**.
+3. A one-shot exhaustive test of
+   `WP_HTML_Decoder::code_point_to_utf8_bytes()` (not fuzzing).
+
+## 1. utf8_encode / utf8_decode fallbacks — DONE, premise corrected
+
+**Implemented**, but a premise of this section was falsified during
+implementation and the oracle design was adapted (2026-06-10, PHP 8.4.21):
+
+- The original claim "No known divergence going in" was wrong: the
+  earlier spot-probes missed it. Native `utf8_decode()` groups a
+  well-formed lead byte with its expected continuation length and emits
+  a single `?` for surrogates (`ED A0 80` → `?`), beyond-U+10FFFF
+  sequences (`F4 90 80 80` → `?`), 3-/4-byte overlongs, and a
+  well-formed lead before an invalid continuation (`C2 C0` → `?`),
+  where the fallback emits one `?` per maximal subpart (`???` etc.).
+- That divergence is **intentional** in WordPress: the PHP 9 polyfill
+  in `compat.php` prefers `mb_convert_encoding()` (which uses maximal
+  subparts) over the fallback, and the #63863 PHPUnit tests assert
+  mb-equivalence. So "the native functions are the only ground truth"
+  was also wrong — WP's chosen ground truth is `mb_convert_encoding()`.
+- Oracle design as built: `mb` (`mb_convert_encoding()`) is the primary
+  encode/decode oracle on arbitrary input; `native` is an encode oracle
+  on arbitrary input and a decode oracle on **valid input only**
+  (native ≡ mb on every valid code point, verified exhaustively). On
+  PHP 9+ `native` reports `oracle-unavailable` and is skipped. The
+  legacy divergence is pinned by hand-computed battery vectors.
+- Round-trip `decode(encode(s)) === s`, encode-output-validity, the
+  smoke-test mutation variants (cp1252-confused encoder, identity
+  encoder, per-byte decoder, valid-input mangler, round-trip violator,
+  null-returning targets), and the `ENCODING_FUZZ_FAULT=encode-cp1252`
+  / `decode-per-byte` end-to-end fault variants are all in place.
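+
+The round-trip and encode-output-validity properties above can be
+sketched as follows. This is a minimal illustration, not the harness
+code: `sketch_encode()` and `sketch_decode()` are hypothetical
+stand-ins built on `mb_convert_encoding()` (assuming mbstring is
+loaded), where the real checks call the `_wp_utf8_*_fallback()`
+targets.
+
+```php
+// Hypothetical stand-ins for the encode/decode targets.
+// ISO-8859-1 is the source encoding utf8_encode() assumes.
+function sketch_encode( string $latin1 ): string {
+    return mb_convert_encoding( $latin1, 'UTF-8', 'ISO-8859-1' );
+}
+
+function sketch_decode( string $utf8 ): string {
+    return mb_convert_encoding( $utf8, 'ISO-8859-1', 'UTF-8' );
+}
+
+// Returns null on success, or a description of the first violation.
+function check_round_trip( string $input ): ?string {
+    $encoded = sketch_encode( $input );
+    if ( ! mb_check_encoding( $encoded, 'UTF-8' ) ) {
+        return 'encode output is not valid UTF-8';
+    }
+    if ( sketch_decode( $encoded ) !== $input ) {
+        return 'decode(encode(s)) !== s';
+    }
+    return null;
+}
+
+// Every byte value is a valid ISO-8859-1 character, so any byte string
+// is a legal input for the round trip.
+$all_bytes = implode( '', array_map( 'chr', range( 0, 255 ) ) );
+$result    = check_round_trip( $all_bytes );
+```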
+
+**Upstream finding, not fixed here:** the cited core test
+`tests/phpunit/tests/formatting/deprecatedUtfEncodeDecode.php` has
+vacuous invalid-input coverage — its surrogate branch interpolates
+integers instead of `chr()` bytes (`"{$byte1}{$byte2}{$byte3}"`
+produces ASCII digits), its single-quoted `'\x95'` data is literal
+backslash text, and the `$i < 0xD800 || $i > 0xE000` boundary routes
+valid U+E000 through the broken branch. It only ever asserts
+mb-equivalence on valid input. Worth a follow-up patch on #63863.
+
+## 2. wp_has_noncharacters — DONE via option (a); core decision still open
+
+**Known divergence, confirmed empirically (2026-06-10):**
+
+```php
+$probe = "\xC0\xEF\xBF\xBE"; // invalid byte, then U+FFFE
+wp_has_noncharacters( $probe );             // false — PCRE path: preg_match fails on ill-formed UTF-8
+_wp_has_noncharacters_fallback( $probe );   // true  — scan skips invalid spans, finds U+FFFE
+```
+
+**Implemented as option (a):** the fuzzer treats behavior as undefined
+unless `wp_is_valid_utf8()` returns true, and runs a three-way
+differential on **valid inputs only**: `wp_has_noncharacters()` (PCRE
+branch) vs `_wp_has_noncharacters_fallback()` vs a trivial
+`mb_str_split()` / `mb_ord()` reference. The reference is
+battery-verified at block boundaries, in the block interior, and at the
+final two code points of every plane with their neighbors; the PCRE
+class enumerates each plane by hand, so per-plane vectors are the
+point. The probe above is pinned as a fixed regression vector in the
+smoke test, so any semantic change to either branch surfaces
+immediately. `BOUNDARY_CODE_POINTS`
+in `lib/Generator.php` gained adjacent NON-noncharacters, a block
+interior point, and mid-plane finals. Mutation variants: blind
+detector, U+FDD0-block miss, over-eager detector; fault injection:
+`ENCODING_FUZZ_FAULT=nonchars-miss-fdd0|nonchars-overeager` (one per
+target).
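+
+A reference of the kind described above might look like the sketch
+below. The function name is hypothetical and this is not the harness's
+actual oracle; it only illustrates the `mb_str_split()` / `mb_ord()`
+approach and the throw-on-ill-formed stance. The noncharacter set
+(U+FDD0 to U+FDEF plus the last two code points of each plane) is the
+Unicode definition.
+
+```php
+// Hypothetical reference detector; assumes mbstring is loaded.
+function sketch_reference_has_noncharacters( string $text ): bool {
+    // Behavior is undefined on ill-formed input, so refuse it outright.
+    if ( ! mb_check_encoding( $text, 'UTF-8' ) ) {
+        throw new InvalidArgumentException( 'ill-formed UTF-8' );
+    }
+
+    foreach ( mb_str_split( $text, 1, 'UTF-8' ) as $char ) {
+        $cp = mb_ord( $char, 'UTF-8' );
+
+        // The contiguous block U+FDD0..U+FDEF.
+        if ( $cp >= 0xFDD0 && $cp <= 0xFDEF ) {
+            return true;
+        }
+
+        // U+nFFFE and U+nFFFF for every plane n (0x00..0x10).
+        if ( ( $cp & 0xFFFE ) === 0xFFFE ) {
+            return true;
+        }
+    }
+
+    return false;
+}
+```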
+
+**Still open upstream (option b path):** whether core should align the
+implementations or document the undefined-on-invalid stance in the
+`wp_has_noncharacters()` docblock. That needs a decision from the
+function author (Trac discussion). Note for whoever picks that up: if
+core aligns on PCRE semantics (false on any ill-formed input), the mb
+reference oracle and its battery must be extended for ill-formed input
+too — removing the valid-only gate alone is NOT sufficient, since the
+reference throws on ill-formed input by design.
+
+## 3. code_point_to_utf8_bytes — DONE; upstream finding documented
+
+Implemented as `tools/encoding-fuzz/tests/code-point-to-utf8-exhaustive.php`
+(standalone, not wired into `Bootstrap.php` — the class is required
+only by this script, which parses cleanly with no other dependencies;
+loading html-api code into every fuzz worker would buy nothing).
+Every code point 0x0–0x10FFFF plus out-of-range probes, compared
+against the pure-arithmetic `Generator::encode_code_point()` (the
+independent oracle) with an additional `mb_chr( $cp, 'UTF-8' )`
+consistency cross-check (the implementation is itself mb_chr-backed;
+the cross-check would expose a bug shared between implementation and
+arithmetic encoder). Surrogates and out-of-range values yield U+FFFD
+as documented. Runs in ~0.4s, passes on PHP 8.4.21. The harness smoke
+test executes it and proves its detection fires via the
+`ENCODING_FUZZ_FAULT=codepoint-surrogate-qmark` broken variant.
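+
+For orientation, a pure-arithmetic encoder of the kind used as the
+independent oracle can be sketched as below. This is an illustration
+written for this document, not the source of
+`Generator::encode_code_point()`; the function name is hypothetical.
+
+```php
+// Hypothetical arithmetic UTF-8 encoder; no mbstring involved.
+function sketch_encode_code_point( int $cp ): string {
+    // Surrogates and out-of-range values are not Unicode scalar
+    // values; yield U+FFFD for them.
+    if ( $cp < 0 || $cp > 0x10FFFF || ( $cp >= 0xD800 && $cp <= 0xDFFF ) ) {
+        return "\xEF\xBF\xBD";
+    }
+    if ( $cp < 0x80 ) {
+        return chr( $cp );
+    }
+    if ( $cp < 0x800 ) {
+        return chr( 0xC0 | ( $cp >> 6 ) )
+            . chr( 0x80 | ( $cp & 0x3F ) );
+    }
+    if ( $cp < 0x10000 ) {
+        return chr( 0xE0 | ( $cp >> 12 ) )
+            . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
+            . chr( 0x80 | ( $cp & 0x3F ) );
+    }
+    return chr( 0xF0 | ( $cp >> 18 ) )
+        . chr( 0x80 | ( ( $cp >> 12 ) & 0x3F ) )
+        . chr( 0x80 | ( ( $cp >> 6 ) & 0x3F ) )
+        . chr( 0x80 | ( $cp & 0x3F ) );
+}
+```
+
+Because it shares no code with mbstring, a disagreement between this
+kind of encoder and an `mb_chr()`-backed implementation points at one
+of the two rather than at a common dependency.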
+
+**Upstream finding (real bug — an unreleased trunk REGRESSION, not
+fixed here):** the implementation is `mb_chr( $code_point )` with NO
+explicit encoding, so it inherits `mb_internal_encoding()` — which
+WordPress sets from `blog_charset` (`wp_set_internal_encoding()`,
+`src/wp-includes/load.php`). On a non-UTF-8 site it returns raw legacy
+bytes for mappable code points (e.g. `"\xE9"` for U+00E9 under
+ISO-8859-1) while still returning UTF-8 U+FFFD for invalid ones,
+contradicting its docblock. Aggravating facts for the upstream report:
+
+- Introduced by [62424] (#65342, `@since 7.1.0`, unreleased): the
+  6.6.0 original was a pure-arithmetic encoder that always emitted
+  UTF-8 regardless of mbstring state. Fix-before-release territory.
+- WP's own `_mb_chr()` polyfill in `compat.php` documents
+  `@param "UTF-8"|null $encoding Must be 'UTF-8' or null` and treats
+  null as UTF-8 — so mbstring-less hosts always emit UTF-8 while
+  mbstring hosts follow `blog_charset`. Same WordPress, divergent
+  output by extension presence.
+- Named character references decode through the UTF-8 token map
+  regardless: on a latin1 site `&eacute;` → UTF-8 `C3 A9` but
+  `&#233;` → latin1 `E9` in the same decoded string. There is no
+  intentional-behavior steelman; output is mixed-encoding either way.
+- The same commit silently changed `code_point_to_utf8_bytes( 0 )`
+  from `U+FFFD` to `"\0"` (the old guard was `$code_point <= 0`).
+  Callers are unaffected (`&#0;` is intercepted earlier) and the new
+  behavior matches the docblock, but it belongs in the same report.
+
+One-line fix: `mb_chr( $code_point, 'UTF-8' )`. The script pins the
+current buggy behavior as a labeled KNOWN ISSUE check so the stance
+cannot silently go stale; update or remove the pin when fixed.
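+
+The dependence on `mb_internal_encoding()` can be reproduced outside
+WordPress with a few lines. This is a standalone sketch assuming
+mbstring is loaded; it switches the internal encoding by hand instead
+of going through `wp_set_internal_encoding()`.
+
+```php
+$previous = mb_internal_encoding();
+
+// Simulate a site whose blog_charset is ISO-8859-1.
+mb_internal_encoding( 'ISO-8859-1' );
+
+$implicit = mb_chr( 0xE9 );          // follows the internal encoding
+$explicit = mb_chr( 0xE9, 'UTF-8' ); // pinned to UTF-8
+
+// Restore the original setting so later code is unaffected.
+mb_internal_encoding( $previous );
+
+// Per the finding above, the two byte sequences are expected to differ.
+echo bin2hex( $implicit ), ' vs ', bin2hex( $explicit ), "\n";
+```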
+
+## Verification / definition of done
+
+All verified 2026-06-10 on PHP 8.4.21:
+
+- `php tools/encoding-fuzz/tests/harness-smoke.php` passes, including
+  broken-variant detections for every added check (seventeen mutation
+  classes plus the exhaustive script's surrogate fault).
+- Fault-injection variants per new target
+  (`ENCODING_FUZZ_FAULT=encode-cp1252|decode-per-byte|nonchars-miss-fdd0|nonchars-overeager`)
+  exercised worker → replay → minimize end to end; artifacts now record
+  the fault name and `pcre_u` in environment metadata. The script-local
+  `codepoint-surrogate-qmark` fault is proven via the smoke test's
+  subprocess run (the exhaustive script never enters the worker
+  pipeline).
+- `php tools/encoding-fuzz/runner.php --lanes 4 --duration-seconds 60`
+  ran clean (32,000 cases, 0 failures, 0 stalled, final tree). Findings
+  that were triaged and documented rather than silenced: the legacy
+  `utf8_decode()` divergence and the #63863 test bug (§1), the
+  `wp_has_noncharacters()` ill-formed-input divergence (§2), and the
+  `code_point_to_utf8_bytes()` internal-encoding regression (§3).
+- README.md oracle/check tables updated (Encode/Decode/Nonchars).
+
+## Gotchas inherited from the existing harness
+
+- All scrub/validity oracles passed a hand-computed battery; new
+  oracles must too (`Oracles::battery()` pattern). iconv is excluded
+  for accepting code points above U+10FFFF — don't re-add it.
+- Workers run checks in-process; an infinite loop in a new target will
+  trip the runner's 120s stall watchdog and record the seed. Keep that
+  property: no per-case subprocesses.
+- Everything must stay derivable from `(seed, case index)` — no
+  `random_int()`, no time-dependent generation. Per-case chunking-type
+  randomness derives from `sha256(input)` (see
+  `Checks::check_chunked_scan()`).
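+
+One way to keep generation derivable from `(seed, case index)` is to
+stretch a hash of the pair into as many bytes as a case needs. The
+sketch below illustrates that idea only; the function name and the key
+layout are assumptions, not the harness's actual scheme.
+
+```php
+// Hypothetical derivation: same (seed, case index) always yields the
+// same bytes, with no random_int() and no clock involved.
+function sketch_derive_case_bytes( int $seed, int $case_index, int $length ): string {
+    $out     = '';
+    $counter = 0;
+    while ( strlen( $out ) < $length ) {
+        // Each block hashes the pair plus a block counter.
+        $out .= hash( 'sha256', "{$seed}:{$case_index}:{$counter}", true );
+        ++$counter;
+    }
+    return substr( $out, 0, $length );
+}
+
+// Replaying a failure needs only the two integers recorded with it.
+$first  = sketch_derive_case_bytes( 42, 1337, 100 );
+$replay = sketch_derive_case_bytes( 42, 1337, 100 );
+```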
diff --git a/handoffs/html-decoder-fuzzer.md b/handoffs/html-decoder-fuzzer.md
@@ -0,0 +1,130 @@
+# Handoff: independent fuzzer for WP_HTML_Decoder
+
+## Status
+
+Not started. This is a NEW fuzzer, separate from `tools/encoding-fuzz/`
+(UTF-8 functions) and from the `html-api-fuzz` branch (whole-tree
+parser comparison). Reuse the architecture of `tools/encoding-fuzz/` —
+deterministic `(seed, case)` generation, oracle startup battery,
+worker/runner/replay/minimize CLIs, mutation-tested harness smoke test —
+but as its own tool directory (suggested: `tools/html-decoder-fuzz/`).
+
+## Target
+
+`WP_HTML_Decoder` in `src/wp-includes/html-api/class-wp-html-decoder.php`:
+
+- `decode_text_node( $text )`
+- `decode_attribute( $text )`
+- `read_character_reference( $context, $text, $at, &$match_byte_length )`
+- `attribute_starts_with( $haystack, $search, $case_sensitivity )`
+
+This is security-relevant code: decoded attribute values feed
+`javascript:` URL detection via `attribute_starts_with`. Existing unit
+tests are thin (`tests/phpunit/tests/html-api/wpHtmlDecoder.php`, 4 test
+methods) — fuzzing has real headroom here.
+
+Dependency note: the named-reference path uses `WP_Token_Map` and the
+`$html5_named_character_reference` map
+(`src/wp-includes/html-api/html5-named-character-references.php`).
+A decoder fuzzer transitively exercises both.
+
+## Oracle
+
+`Dom\HTMLDocument` (lexbor, PHP 8.4+) — the same oracle the
+`html-api-fuzz` branch uses for tree comparison:
+
+- Text context: parse `<!DOCTYPE html><body><div>PAYLOAD</div>`, read
+  the div's `textContent`; compare with `decode_text_node( PAYLOAD )`.
+- Attribute context: parse `<div title="PAYLOAD">`, read
+  `getAttribute('title')`; compare with `decode_attribute( PAYLOAD )`.
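+
+The two oracle paths above could be wrapped roughly as follows. This
+is a sketch under assumptions: the function name is hypothetical, the
+payload is assumed to already be free of the confounders listed in the
+next section, and UTF-8 is passed as the override encoding so that
+encoding sniffing cannot interfere.
+
+```php
+// Hypothetical oracle wrapper. Returns null when Dom\HTMLDocument is
+// unavailable (PHP < 8.4) or the wrapper element is not found.
+function sketch_oracle_decode( string $context, string $payload ): ?string {
+    if ( ! class_exists( 'Dom\HTMLDocument' ) ) {
+        return null;
+    }
+
+    $html = 'attribute' === $context
+        ? '<div title="' . $payload . '">'
+        : '<!DOCTYPE html><body><div>' . $payload . '</div>';
+
+    // LIBXML_NOERROR silences parse-error warnings, which are expected
+    // for many generated payloads.
+    $doc = \Dom\HTMLDocument::createFromString( $html, LIBXML_NOERROR, 'UTF-8' );
+    $div = $doc->querySelector( 'div' );
+    if ( null === $div ) {
+        return null;
+    }
+
+    return 'attribute' === $context
+        ? $div->getAttribute( 'title' )
+        : $div->textContent;
+}
+```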
+
+Do NOT use `html_entity_decode( ENT_HTML5 )` as the primary oracle: it
+does not implement the WHATWG attribute-context rules (named reference
+without semicolon followed by `=` or alphanumeric must NOT decode in
+attributes) and will drown the run in false divergences. It MAY serve
+as a third opinion on the text context only, gated by a known-answer
+battery like `Oracles::battery()` in the encoding fuzzer — verify
+empirically before trusting it, including C1-control numeric reference
+remapping (`&#x80;` → U+20AC etc.).
+
+## Confounders the harness must neutralize
+
+The oracle is a full HTML parser; the target is a pure decoder. The
+generator must avoid payload bytes the parser treats specially, or the
+comparison measures parser behavior instead of decoding:
+
+- Markup delimiters (`<`, `>`, `&` followed by structure-breaking
+  content): restrict generated payloads to never contain raw `<`,
+  rather than escaping it as text. `&` is the whole point and is fine
+  in both contexts.
+- Quote characters in the attribute payload — generate with `"` 
+  excluded (or swap quote style per case), since it terminates the
+  attribute in the oracle document but not in `decode_attribute()`.
+- CR / CRLF: the HTML parser normalizes `\r` and `\r\n` to `\n` before
+  tokenization; the decoder does not. Either exclude `\r` from payloads
+  or pre-normalize before comparison — decide once, document it.
+- NUL bytes: the parser replaces U+0000 with U+FFFD in some contexts
+  and drops it in others; the decoder has its own documented NUL
+  handling (see existing test
+  `test_character_reference_with_null_byte...`). Probably exclude raw
+  NUL from oracle-compared cases and cover NUL via fixed regression
+  vectors instead.
+- Invalid UTF-8 payload bytes: lexbor may scrub them before the
+  tokenizer sees them. Start with valid-UTF-8 payloads only; invalid
+  bytes inside character references (`&am\xC0p;`) are a later, careful
+  extension.
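+
+Taken together, the exclusions above amount to a payload gate in front
+of the oracle comparison. A minimal sketch, assuming the "exclude"
+option is chosen for every confounder (the function name is
+hypothetical and `str_contains()` needs PHP 8.0+):
+
+```php
+// Hypothetical gate: true when the payload can be compared against the
+// parser-based oracle without measuring parser behavior instead.
+function sketch_is_oracle_safe( string $context, string $payload ): bool {
+    // Start with valid UTF-8 payloads only.
+    if ( ! mb_check_encoding( $payload, 'UTF-8' ) ) {
+        return false;
+    }
+
+    // Raw "<" opens markup, CR is normalized by the parser, and NUL
+    // is handled differently by parser and decoder.
+    foreach ( array( '<', "\r", "\0" ) as $forbidden ) {
+        if ( str_contains( $payload, $forbidden ) ) {
+            return false;
+        }
+    }
+
+    // A double quote terminates the attribute in the oracle document.
+    if ( 'attribute' === $context && str_contains( $payload, '"' ) ) {
+        return false;
+    }
+
+    return true;
+}
+```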
+
+## Generator: entity grammar, not byte noise
+
+Weighted mix targeting the reference-matching state machine:
+
+- Named references from the real token map: exact (`&amp;`), without
+  semicolon (`&amp`), longest-match ambiguity (`&not` vs `&notin;` —
+  the map is greedy-longest), case variants (`&AMP` vs `&amp`),
+  truncations (`&am`), nonexistent lookalikes (`&ampx;`).
+- The attribute-context discriminator: no-semicolon named reference
+  followed by `=`, by alphanumerics, by `;` later in the string —
+  decode in text, not in attribute.
+- Numeric references: decimal and hex, mixed-case `x`/`X`, many leading
+  zeros. Value classes: ASCII, C1 controls 0x80–0x9F (windows-1252
+  remap table), surrogates, noncharacters, > 0x10FFFF, huge (overflow
+  arithmetic), zero, missing digits (`&#;`, `&#x;`).
+- Adjacency and boundaries: references back to back, reference at
+  string start/end, `&` at end of input, references split by the
+  string boundary at every prefix length (truncation sweep).
+- Plain text with multibyte UTF-8 around references (offset arithmetic).
+
+Each case is `(context, payload)`; derive both from the PRNG.
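+
+The truncation sweep mentioned in the list above is small enough to
+sketch directly. The function name is hypothetical. Note that it cuts
+at byte offsets, so a payload containing multibyte characters yields
+some prefixes that are not valid UTF-8; a real generator would need to
+filter those or cut at character boundaries.
+
+```php
+// Hypothetical helper: every byte-length prefix of a payload, from the
+// empty string up to the full payload.
+function sketch_truncation_sweep( string $payload ): array {
+    $prefixes = array();
+    for ( $length = 0; $length <= strlen( $payload ); $length++ ) {
+        $prefixes[] = substr( $payload, 0, $length );
+    }
+    return $prefixes;
+}
+
+// For "&amp;" this yields "", "&", "&a", "&am", "&amp", "&amp;".
+$sweep = sketch_truncation_sweep( '&amp;' );
+```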
+
+## Checks
+
+1. Differential vs oracle in both contexts (primary).
+2. `read_character_reference()` consistency: decoding the whole string
+   by repeated `read_character_reference` + literal spans must equal
+   `decode()` output, and `$match_byte_length` must always advance.
+3. `attribute_starts_with( $haystack, $search )` agrees with
+   `str_starts_with( decode_attribute( $haystack ), $search )` for
+   ASCII search strings, both case sensitivities.
+4. Output is valid UTF-8 (reuse `mb_check_encoding`).
+5. Idempotence does NOT hold for decoding (`&amp;amp;` decodes to
+   `&amp;`) — do not add it; add instead: decoding text with no `&`
+   is identity.
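+
+Checks 4 and 5 are decoder-agnostic, so they can be written against a
+callable and exercised with a deliberately broken variant. In the
+sketch below the function name is hypothetical, and
+`html_entity_decode()` appears only as a stand-in target so the
+snippet runs without WordPress; it is not being used as an oracle.
+
+```php
+// Returns null on success, or a description of the violated property.
+function sketch_check_decoder_output( callable $decode, string $text ): ?string {
+    $decoded = $decode( $text );
+
+    // Check 4: output must be valid UTF-8.
+    if ( ! mb_check_encoding( $decoded, 'UTF-8' ) ) {
+        return 'output is not valid UTF-8';
+    }
+
+    // Check 5: with no "&" in the input there is nothing to decode.
+    if ( ! str_contains( $text, '&' ) && $decoded !== $text ) {
+        return 'text without "&" was modified';
+    }
+
+    return null;
+}
+
+// Stand-in target (assumption: any string-to-string decoder fits).
+$stand_in = static function ( string $text ): string {
+    return html_entity_decode( $text, ENT_QUOTES | ENT_HTML5, 'UTF-8' );
+};
+
+// A broken variant that the check is expected to catch.
+$broken = 'strtoupper';
+```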
+
+## Harness requirements (carry over from encoding fuzzer)
+
+- Known-answer startup battery for the oracle path (hand-computed
+  WHATWG expectations, including the C1 remap and no-semicolon
+  attribute rules) — if the local `Dom\HTMLDocument` fails it, abort
+  loudly.
+- Mutation-tested smoke test: broken decoder variants (skip C1 remap,
+  decode no-semicolon refs in attributes, off-by-one match length)
+  must be caught before the fuzzer is trusted.
+- Failure artifacts self-contained (base64 input + context + expected/
+  got), replay + signature-preserving minimizer.
+- Note `html-api-fuzz` branch precedent: its `attributes-entities`
+  generator profile and oracle handling are prior art worth reading
+  (`tools/html-api-fuzz/lib/Generator.php` on that branch).
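+
+A self-contained failure artifact could be as simple as the sketch
+below. The field names and function names are assumptions for
+illustration; base64 is applied to expected/got as well as the input
+so that arbitrary bytes survive JSON encoding.
+
+```php
+// Hypothetical artifact writer: everything needed to replay one case.
+function sketch_build_artifact( string $context, string $payload, string $expected, string $got ): string {
+    return json_encode(
+        array(
+            'context'  => $context,
+            'input'    => base64_encode( $payload ),
+            'expected' => base64_encode( $expected ),
+            'got'      => base64_encode( $got ),
+        ),
+        JSON_THROW_ON_ERROR
+    );
+}
+
+// Hypothetical artifact reader used by a replay tool.
+function sketch_load_artifact( string $json ): array {
+    $data = json_decode( $json, true, 512, JSON_THROW_ON_ERROR );
+    foreach ( array( 'input', 'expected', 'got' ) as $key ) {
+        // Strict mode rejects characters outside the base64 alphabet.
+        $data[ $key ] = base64_decode( $data[ $key ], true );
+    }
+    return $data;
+}
+
+$json     = sketch_build_artifact( 'attribute', "&not\xC3\xA9", "\xC2\xAC\xC3\xA9", "&not\xC3\xA9" );
+$artifact = sketch_load_artifact( $json );
+```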
+
+## Definition of done
+
+Smoke test green (including broken-variant detection), a 5-minute
+multi-lane run either clean or with triaged findings, README with the
+oracle-confounder decisions documented.