Fuzz encoder by sirreal · Pull Request #52 · sirreal/wordpress-develop

sirreal · 2026-06-11T16:02:44Z

Trac ticket:

Use of AI Tools

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Fuzzes wp_is_valid_utf8(), wp_scrub_utf8(), and their pure-PHP fallbacks against five independent known-good oracles: mbstring, PCRE2, ICU (intl), CPython, and the WHATWG TextDecoder (Node), the last two as persistent subprocesses. All oracles must pass a hand-computed known-answer battery before use; iconv is excluded because libiconv accepts code points above U+10FFFF. Beyond the differentials, internal invariants are checked: validity iff scrub identity, scrub output validity, scrub idempotence, code point counts against the scrubbed length, and chunked _wp_scan_utf8() reconstruction with deterministic resumable-scan budgets. Inputs mix nine deterministic strategies, including random bytes, boundary-heavy valid UTF-8, mutations, invalid-atom splices, latin1, UTF-16, ASCII fast-path stress, and repeated motifs; every case is reproducible from (seed, case index) alone. Includes a multi-lane runner with stall detection, replay and signature-preserving minimization tools, and a harness self-test that mutation-tests detection against seven classes of deliberately broken implementations.
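As a rough illustration of the invariants and the (seed, case index) reproducibility described above, here is a minimal Python sketch. It is not the PHP harness: CPython's own decoder stands in for wp_is_valid_utf8() and wp_scrub_utf8(), and the names make_case, check_invariants, and run are hypothetical. The SHA-256 derivation is only one possible way to make each case depend on nothing but the seed and the index.

```python
import hashlib
import random


def is_valid_utf8(data: bytes) -> bool:
    """Stand-in validator: CPython's strict UTF-8 decoder (an assumption)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def scrub_utf8(data: bytes) -> bytes:
    """Stand-in scrubber: replace ill-formed spans with U+FFFD."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def make_case(seed: int, index: int, max_len: int = 32) -> bytes:
    """Derive one input from (seed, index) alone, with no shared RNG state."""
    digest = hashlib.sha256(f"{seed}:{index}".encode("ascii")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))
    length = rng.randrange(max_len + 1)
    return bytes(rng.randrange(256) for _ in range(length))


def check_invariants(data: bytes) -> None:
    scrubbed = scrub_utf8(data)
    # Validity iff scrubbing leaves the input unchanged.
    assert is_valid_utf8(data) == (scrubbed == data)
    # Scrub output is always valid.
    assert is_valid_utf8(scrubbed)
    # Scrubbing is idempotent.
    assert scrub_utf8(scrubbed) == scrubbed


def run(seed: int, cases: int) -> None:
    for index in range(cases):
        check_invariants(make_case(seed, index))


if __name__ == "__main__":
    run(seed=1234, cases=2000)
```

Because make_case() takes only the seed and the index, a failing case can be regenerated in isolation without replaying the cases before it.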

Adds four self-contained work-lane documents: extending the encoding fuzzer (utf8_encode/decode fallback differentials before PHP 9 removes the native oracles, the confirmed wp_has_noncharacters PCRE-vs-fallback divergence on ill-formed input, and an exhaustive code_point_to_utf8_bytes check); an independent WP_HTML_Decoder fuzzer against the Dom\HTMLDocument oracle; WP_Token_Map property tests against a naive reference (building on the existing wpTokenMap.php tests); and a one-shot divergence survey of seems_utf8 and wp_check_invalid_utf8.

Extend the encoding fuzzer with targets for _wp_utf8_encode_fallback() and _wp_utf8_decode_fallback(), fuzzing them against mb_convert_encoding (primary) and the deprecated native utf8_encode()/utf8_decode() pair while it still exists, plus round-trip and output-validity invariants. The handoff's premise that native and fallback share semantics on invalid input was falsified during implementation: legacy utf8_decode() groups a well-formed lead byte with its expected continuation length into a single '?' (surrogates, beyond-U+10FFFF, 3/4-byte overlongs, C2 C0), while WordPress deliberately follows mb_convert_encoding's maximal-subpart semantics (the PHP 9 polyfill in compat.php prefers mb; ticket #63863). The native decode oracle is therefore trusted on valid input only — where it provably agrees with mb on every code point — and the divergence is pinned by hand-computed battery vectors instead of fuzzed. Detection is mutation-tested: seven new broken-implementation classes in the smoke test (cp1252-confused encoder, identity encoder, per-byte decoder, valid-input mangler, round-trip violator, null-returning encoder and decoder — the fallbacks are untyped, so non-string returns are reported as target-bad-return rather than silently skipped), and ENCODING_FUZZ_FAULT=encode-cp1252|decode-per-byte exercise the worker → replay → minimize pipeline end to end (minimal counterexamples: '80' and 'E7 B8'). Also records an upstream finding in the handoff: the #63863 PHPUnit test's invalid-input coverage is vacuous (integer interpolation instead of chr(), single-quoted escape sequences, U+E000 boundary off-by-one).
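To make the maximal-subpart side of that divergence concrete, here is a small Python sketch of a UTF-8 to ISO-8859-1 conversion built on CPython's decoder, which replaces each maximal ill-formed subpart with one U+FFFD. The function names are hypothetical, and this models neither mb_convert_encoding nor the WordPress fallback byte for byte; it only shows what "one replacement per maximal subpart" looks like on the vectors named above. Per the description, legacy utf8_decode() would instead emit a single '?' for inputs such as C2 C0.

```python
def utf8_to_latin1_maximal_subpart(data: bytes) -> bytes:
    """Decode UTF-8 to ISO-8859-1, emitting '?' for anything unrepresentable.

    Ill-formed input becomes U+FFFD (one per maximal subpart in CPython),
    and every code point above U+00FF, including U+FFFD, becomes '?'.
    """
    text = data.decode("utf-8", errors="replace")
    return bytes(ord(ch) if ord(ch) <= 0xFF else 0x3F for ch in text)


def latin1_to_utf8(data: bytes) -> bytes:
    """Encode ISO-8859-1 bytes as UTF-8; every byte maps to U+0000..U+00FF."""
    return data.decode("latin-1").encode("utf-8")


def round_trips(data: bytes) -> bool:
    """Round-trip invariant: encode then decode is the identity on any bytes."""
    return utf8_to_latin1_maximal_subpart(latin1_to_utf8(data)) == data


if __name__ == "__main__":
    vectors = [
        b"\xc2\xc0",          # lead byte followed by a non-continuation byte
        b"\xe7\xb8",          # truncated three-byte sequence
        b"\xed\xa0\x80",      # encoded surrogate
        b"\xf4\x90\x80\x80",  # beyond U+10FFFF
    ]
    for vector in vectors:
        print(vector.hex(" "), "->", utf8_to_latin1_maximal_subpart(vector))
    print(all(round_trips(bytes([b])) for b in range(256)))
```

The round-trip check is cheap to run exhaustively over single bytes and remains a useful invariant for longer random inputs.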

Fuzz wp_has_noncharacters() (PCRE branch) and _wp_has_noncharacters_fallback() against a trivial mb_str_split/mb_ord reference oracle, on valid input only. On ill-formed input the public function's answer depends on which environment branch of utf8.php loaded — the PCRE branch returns false whenever preg_match fails while the fallback skips invalid spans and reports noncharacters around them ("\xC0\xEF\xBF\xBE": PCRE false, fallback true). Per the handoff's option (a), the fuzzer treats behavior as undefined unless wp_is_valid_utf8() and pins the divergence with a fixed regression vector in the smoke test; whether core aligns the implementations or documents the stance remains an open question for the function author. The reference oracle's battery covers the boundaries and interior of the U+FDD0–U+FDEF block and the final two code points of EVERY plane with their lower neighbors — the PCRE implementation enumerates each plane as a separate hand-typed escape, so a single-plane typo is the realistic bug class and now has deterministic coverage. The oracle throws on ill-formed input rather than silently coercing mb_ord(false). BOUNDARY_CODE_POINTS gains block-interior, adjacent-negative, and mid-plane code points (seed re-derivation of older findings is invalidated; documented in the README — artifact replays are unaffected). Mutation variants: blind detector, U+FDD0-block miss, over-eager detector (shared between the smoke test and the new ENCODING_FUZZ_FAULT=nonchars-miss-fdd0|nonchars-overeager fault modes, one per target; both verified through worker, replay, and minimize). Worker environment metadata now records pcre_u (which utf8.php branch loaded) and the active fault name so injected artifacts can never be mistaken for real findings.
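The reference oracle and its per-plane battery can be sketched in a few lines of Python. This is an illustration in a different language than the harness, with hypothetical names; it mirrors the two properties described above: the oracle refuses ill-formed input instead of guessing, and the battery covers the last two code points of every plane together with their lower neighbors.

```python
def is_noncharacter(cp: int) -> bool:
    """Unicode noncharacters: U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def has_noncharacters(data: bytes) -> bool:
    """Trivial reference oracle; raises UnicodeDecodeError on ill-formed input."""
    text = data.decode("utf-8")
    return any(is_noncharacter(ord(ch)) for ch in text)


def battery() -> list[tuple[int, bool]]:
    """Known-answer vectors as (code point, expected result) pairs."""
    vectors = [
        (0xFDCF, False),  # just below the block
        (0xFDD0, True),   # first code point of the block
        (0xFDE0, True),   # block interior
        (0xFDEF, True),   # last code point of the block
        (0xFDF0, False),  # just above the block
    ]
    for plane in range(17):
        base = plane << 16
        vectors += [
            (base | 0xFFFC, False),
            (base | 0xFFFD, False),
            (base | 0xFFFE, True),
            (base | 0xFFFF, True),
        ]
    return vectors


if __name__ == "__main__":
    for cp, expected in battery():
        assert has_noncharacters(chr(cp).encode("utf-8")) is expected, hex(cp)
    print("battery passed")
```

Generating the plane vectors in a loop, rather than typing them by hand, is what lets the battery catch a typo confined to a single plane.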

The domain of code_point_to_utf8_bytes (~1.1M code points) is small enough to test completely instead of fuzzing. The new standalone script checks every code point from 0x0 to 0x10FFFF, plus out-of-range probes, against the fuzzer's pure-arithmetic encoder (the independent oracle), with an explicit mb_chr( $cp, 'UTF-8' ) consistency cross-check; surrogates and out-of-range values must yield U+FFFD. Runs in ~0.4s. The harness smoke test executes it and proves its detection fires via the script-local ENCODING_FUZZ_FAULT=codepoint-surrogate-qmark variant. Documents an upstream regression (pinned as a labeled KNOWN ISSUE check so the stance cannot silently go stale): since [62424] (#65342, unreleased) the implementation calls mb_chr() without an explicit encoding, inheriting mb_internal_encoding(), which WordPress sets from blog_charset, so non-UTF-8 sites get raw legacy bytes for mappable code points while invalid ones still yield UTF-8 U+FFFD, contradicting the docblock and mixing encodings with the named character reference path. The 6.6.0 original was pure arithmetic and always emitted UTF-8; the same commit ([62424]) changed code point 0 from U+FFFD to NUL. One-line upstream fix: mb_chr( $code_point, 'UTF-8' ). Closes out the extend-encoding-fuzzer handoff: all three sections done, definition of done verified and recorded in the handoff doc.
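For illustration only, the exhaustive approach looks roughly like this in Python: a pure-arithmetic encoder is compared against an independent reference (here CPython's chr().encode(), standing in for the mb_chr cross-check) over the whole code point range. The function names are hypothetical, and the handling of code point 0 as a NUL byte follows the post-[62424] behavior described above rather than the 6.6.0 original.

```python
REPLACEMENT = b"\xef\xbf\xbd"  # U+FFFD encoded as UTF-8


def encode_code_point(cp: int) -> bytes:
    """Pure-arithmetic UTF-8 encoder; invalid inputs encode as U+FFFD."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        cp = 0xFFFD
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def exhaustive_mismatches() -> list[int]:
    """Check every code point in range plus a few out-of-range probes."""
    mismatches = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            expected = REPLACEMENT  # surrogates must yield U+FFFD
        else:
            expected = chr(cp).encode("utf-8")
        if encode_code_point(cp) != expected:
            mismatches.append(cp)
    for probe in (-1, 0x110000, 0x7FFFFFFF):
        if encode_code_point(probe) != REPLACEMENT:
            mismatches.append(probe)
    return mismatches


if __name__ == "__main__":
    print(len(exhaustive_mismatches()), "mismatches")
```

With a domain this small, the exhaustive loop replaces random sampling entirely, so a passing run is a proof over the input space rather than a statistical statement.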

sirreal added 15 commits June 10, 2026 14:54

Expand encoding fuzzer for mb chr ord

9d15731

Add UTF-8 codepoint span fuzzing

6ea247f

Add mb substr fuzzing

1f875a1

Add bounded codepoint count fuzzing

a0f6820

Add bounded UTF-8 scan fuzzing

1c208ac

Add deterministic UTF-8 boundary corpus

4005f40

Add encoding fuzzer environment matrix

a6d67b1

Document invalid-input noncharacter policy

56dc15b

Merge branch 'trunk' into fuzz-encoder

cea963e

Adapt encoding fuzzer to trunk noncharacter behavior

98303df

sirreal wants to merge 15 commits into trunk from fuzz-encoder
