Fuzz encoder #52

Draft
sirreal wants to merge 15 commits into trunk from fuzz-encoder

sirreal (Owner) commented Jun 11, 2026

Trac ticket:

Use of AI Tools


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 15 commits June 10, 2026 14:54
Fuzzes wp_is_valid_utf8(), wp_scrub_utf8(), and their pure-PHP
fallbacks against five independent known-good oracles: mbstring,
PCRE2, ICU (intl), CPython, and the WHATWG TextDecoder (Node), the
last two as persistent subprocesses. All oracles must pass a
hand-computed known-answer battery before use; iconv is excluded
because libiconv accepts code points above U+10FFFF.

Beyond the differentials, internal invariants are checked: validity
iff scrub identity, scrub output validity, scrub idempotence,
code point counts against the scrubbed length, and chunked
_wp_scan_utf8() reconstruction with deterministic resumable-scan
budgets.
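
The invariants listed above can be sketched outside PHP. The following is a minimal illustration only, assuming CPython's strict decoder and its `errors="replace"` handler as stand-ins for `wp_is_valid_utf8()` and `wp_scrub_utf8()`; the helper names are hypothetical, the code point count is one reading of "code point counts against the scrubbed length", and the chunked `_wp_scan_utf8()` reconstruction is not modeled.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Strict validity check; CPython's decoder stands in for the target."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def scrub_utf8(data: bytes) -> bytes:
    """Replace ill-formed sequences with U+FFFD via the 'replace' handler."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


def count_code_points(valid: bytes) -> int:
    """Arithmetic count: every non-continuation byte starts a code point."""
    return sum(1 for byte in valid if (byte & 0xC0) != 0x80)


def check_invariants(data: bytes) -> None:
    """Raise AssertionError if any internal invariant fails for this input."""
    scrubbed = scrub_utf8(data)
    assert is_valid_utf8(data) == (scrubbed == data), "validity iff scrub identity"
    assert is_valid_utf8(scrubbed), "scrub output must be valid"
    assert scrub_utf8(scrubbed) == scrubbed, "scrub must be idempotent"
    assert count_code_points(scrubbed) == len(scrubbed.decode("utf-8")), "count"


for sample in (b"", b"abc", b"caf\xc3\xa9", b"\xe7\xb8", b"\xc0\xef\xbf\xbe"):
    check_invariants(sample)
```

None of these checks needs an external oracle at fuzz time: each one compares the target against itself, so they still catch bugs in an environment where an oracle is unavailable.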

Inputs mix nine deterministic strategies (including random bytes,
boundary-heavy valid UTF-8, mutations, invalid-atom splices, latin1, UTF-16,
ASCII fast-path stress, repeated motifs); every case is reproducible
from (seed, case index) alone. Includes a multi-lane runner with
stall detection, replay and signature-preserving minimization tools,
and a harness self-test that mutation-tests detection against seven
classes of deliberately broken implementations.
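
Reproducibility from (seed, case index) alone is commonly achieved by deriving a fresh generator per case instead of sharing one generator across a run. A rough sketch under that assumption follows; the derivation scheme, strategy functions, and names are hypothetical and are not taken from the harness.

```python
import hashlib
import random

# Hypothetical invalid atoms: lone continuation, overlong, surrogate, > U+10FFFF.
INVALID_ATOMS = (b"\x80", b"\xc0\xaf", b"\xed\xa0\x80", b"\xf4\x90\x80\x80")


def case_rng(seed: int, case_index: int) -> random.Random:
    """Derive an independent generator from (seed, case index) only."""
    digest = hashlib.sha256(f"{seed}:{case_index}".encode("ascii")).digest()
    return random.Random(int.from_bytes(digest, "big"))


def random_bytes(rng: random.Random, max_len: int = 32) -> bytes:
    """Strategy: uniformly random bytes of random length."""
    return bytes(rng.randrange(256) for _ in range(rng.randrange(max_len + 1)))


def ascii_with_invalid_splice(rng: random.Random, max_len: int = 32) -> bytes:
    """Strategy: printable ASCII with one invalid atom spliced in."""
    length = rng.randrange(max_len + 1)
    text = bytes(rng.randrange(0x20, 0x7F) for _ in range(length))
    cut = rng.randrange(len(text) + 1)
    return text[:cut] + rng.choice(INVALID_ATOMS) + text[cut:]


STRATEGIES = (random_bytes, ascii_with_invalid_splice)


def generate_case(seed: int, case_index: int) -> bytes:
    """Pick the strategy by index and generate the case deterministically."""
    rng = case_rng(seed, case_index)
    return STRATEGIES[case_index % len(STRATEGIES)](rng)
```

Because no state carries over between cases, `generate_case(seed, i)` yields the same bytes whether it runs in a worker lane, a replay, or a minimizer, and regardless of which cases ran before it.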

Four self-contained work-lane documents: extending the encoding
fuzzer (utf8_encode/decode fallback differentials before PHP 9
removes the native oracles, the confirmed wp_has_noncharacters
PCRE-vs-fallback divergence on ill-formed input, exhaustive
code_point_to_utf8_bytes), an independent WP_HTML_Decoder fuzzer
against the Dom\HTMLDocument oracle, WP_Token_Map property tests
against a naive reference (building on the existing wpTokenMap.php
tests), and a one-shot divergence survey of seems_utf8 and
wp_check_invalid_utf8.

Extend the encoding fuzzer with targets for _wp_utf8_encode_fallback()
and _wp_utf8_decode_fallback(), fuzzing them against mb_convert_encoding
(primary) and the deprecated native utf8_encode()/utf8_decode() pair
while it still exists, plus round-trip and output-validity invariants.

The handoff's premise that native and fallback share semantics on
invalid input was falsified during implementation: legacy utf8_decode()
groups a well-formed lead byte with its expected continuation length
into a single '?' (surrogates, beyond-U+10FFFF, 3/4-byte overlongs,
C2 C0), while WordPress deliberately follows mb_convert_encoding's
maximal-subpart semantics (the PHP 9 polyfill in compat.php prefers mb;
ticket #63863). The native decode oracle is therefore trusted on valid
input only — where it provably agrees with mb on every code point — and
the divergence is pinned by hand-computed battery vectors instead of
fuzzed.
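
To make the maximal-subpart side of this divergence concrete, here is a small sketch that uses CPython's decoder as a stand-in for `mb_convert_encoding`. Treating the two as equivalent is an assumption for illustration, and the function name is hypothetical. It emits one '?' per ill-formed subpart, whereas the legacy behavior described above would collapse each grouped sequence into a single '?'.

```python
def utf8_to_latin1_maximal_subpart(data: bytes) -> bytes:
    """Decode UTF-8 to Latin-1 with one '?' per maximal ill-formed subpart.

    CPython substitutes U+FFFD per maximal subpart; re-encoding to Latin-1
    with errors="replace" then turns each U+FFFD into '?'. Code points above
    U+00FF also become '?', since Latin-1 cannot represent them.
    """
    text = data.decode("utf-8", errors="replace")
    return text.encode("latin-1", errors="replace")


# Vectors from the classes named in the commit message.
VECTORS = {
    "valid e-acute (C3 A9)": bytes.fromhex("C3A9"),
    "surrogate (ED A0 80)": bytes.fromhex("EDA080"),
    "beyond U+10FFFF (F4 90 80 80)": bytes.fromhex("F4908080"),
    "3-byte overlong (E0 80 80)": bytes.fromhex("E08080"),
    "bad continuation (C2 C0)": bytes.fromhex("C2C0"),
    "truncated (E7 B8)": bytes.fromhex("E7B8"),
}

for label, raw in VECTORS.items():
    print(f"{label}: {utf8_to_latin1_maximal_subpart(raw)!r}")
```

The encoded surrogate yields three '?' under this model: `ED` cannot be followed by `A0`, so the lead byte is a subpart on its own and each remaining byte is rejected separately. The truncated `E7 B8` is a valid prefix and yields one.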

Detection is mutation-tested: seven new broken-implementation classes
in the smoke test (cp1252-confused encoder, identity encoder, per-byte
decoder, valid-input mangler, round-trip violator, null-returning
encoder and decoder — the fallbacks are untyped, so non-string returns
are reported as target-bad-return rather than silently skipped), and
ENCODING_FUZZ_FAULT=encode-cp1252|decode-per-byte exercise the
worker → replay → minimize pipeline end to end (minimal counterexamples:
'80' and 'E7 B8').

Also records an upstream finding in the handoff: the #63863 PHPUnit
test's invalid-input coverage is vacuous (integer interpolation instead
of chr(), single-quoted escape sequences, U+E000 boundary off-by-one).

Fuzz wp_has_noncharacters() (PCRE branch) and
_wp_has_noncharacters_fallback() against a trivial mb_str_split/mb_ord
reference oracle, on valid input only. On ill-formed input the public
function's answer depends on which environment branch of utf8.php
loaded — the PCRE branch returns false whenever preg_match fails while
the fallback skips invalid spans and reports noncharacters around them
("\xC0\xEF\xBF\xBE": PCRE false, fallback true). Per the handoff's
option (a), the fuzzer treats behavior as undefined unless
wp_is_valid_utf8() accepts the input, and pins the divergence with a fixed regression
vector in the smoke test; whether core aligns the implementations or
documents the stance remains an open question for the function author.

The reference oracle's battery covers the boundaries and interior of
the U+FDD0–U+FDEF block and the final two code points of EVERY plane
with their lower neighbors — the PCRE implementation enumerates each
plane as a separate hand-typed escape, so a single-plane typo is the
realistic bug class and now has deterministic coverage. The oracle
throws on ill-formed input rather than silently coercing mb_ord(false).
BOUNDARY_CODE_POINTS gains block-interior, adjacent-negative, and
mid-plane code points (seed re-derivation of older findings is
invalidated; documented in the README — artifact replays are
unaffected).
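
For orientation, a reference oracle of this shape fits in a few lines. The sketch below is an illustrative Python analogue, not the harness's PHP oracle; the names and the exact battery contents are assumptions. It follows the Unicode definition of noncharacters (U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes) and raises on ill-formed input instead of guessing.

```python
def is_noncharacter(code_point: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def has_noncharacters(data: bytes) -> bool:
    """Reference answer on valid UTF-8; raises UnicodeDecodeError otherwise."""
    text = data.decode("utf-8")  # strict: ill-formed input is out of domain
    return any(is_noncharacter(ord(char)) for char in text)


def battery():
    """Yield (code point, expected) pairs around every boundary."""
    yield from ((0xFDCF, False), (0xFDD0, True), (0xFDE0, True),
                (0xFDEF, True), (0xFDF0, False))
    for plane in range(0x11):
        base = plane << 16
        yield base | 0xFFFC, False  # lower neighbors
        yield base | 0xFFFD, False
        yield base | 0xFFFE, True   # final two code points of the plane
        yield base | 0xFFFF, True


for code_point, expected in battery():
    encoded = chr(code_point).encode("utf-8")
    assert has_noncharacters(encoded) is expected, hex(code_point)
```

Deriving the per-plane vectors from a loop rather than typing them out means the battery cannot share a single-plane typo with an implementation that enumerates the planes by hand.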

Mutation variants: blind detector, U+FDD0-block miss, over-eager
detector (shared between the smoke test and the new
ENCODING_FUZZ_FAULT=nonchars-miss-fdd0|nonchars-overeager fault modes,
one per target; both verified through worker, replay, and minimize).
Worker environment metadata now records pcre_u (which utf8.php branch
loaded) and the active fault name so injected artifacts can never be
mistaken for real findings.

The domain of code_point_to_utf8_bytes() (~1.1M code points) is small enough to test
completely instead of fuzzing. The new standalone script checks every
code point 0x0-0x10FFFF plus out-of-range probes against the fuzzer's
pure-arithmetic encoder (the independent oracle), with an explicit
mb_chr( $cp, 'UTF-8' ) consistency cross-check; surrogates and
out-of-range values must yield U+FFFD. Runs in ~0.4s. The harness
smoke test executes it and proves its detection fires via the
script-local ENCODING_FUZZ_FAULT=codepoint-surrogate-qmark variant.
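
The exhaustive approach is easy to picture with a small analogue. The sketch below is an assumption-laden illustration and not the standalone script itself: a pure-arithmetic encoder is compared against CPython's built-in encoder for every code point, with surrogates and out-of-range probes expected to yield U+FFFD. Code point 0 is treated as NUL here, matching the current behavior described in the following paragraph.

```python
REPLACEMENT = b"\xef\xbf\xbd"  # U+FFFD encoded as UTF-8


def encode_code_point(cp: int) -> bytes:
    """Pure-arithmetic UTF-8 encoder; invalid scalar values become U+FFFD."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def exhaustive_mismatches() -> list:
    """Compare against the built-in encoder over the whole domain."""
    mismatches = []
    for cp in range(0x110000):
        is_surrogate = 0xD800 <= cp <= 0xDFFF
        expected = REPLACEMENT if is_surrogate else chr(cp).encode("utf-8")
        if encode_code_point(cp) != expected:
            mismatches.append(cp)
    for probe in (-1, 0x110000, 0x7FFFFFFF):  # out-of-range probes
        if encode_code_point(probe) != REPLACEMENT:
            mismatches.append(probe)
    return mismatches


print(len(exhaustive_mismatches()), "mismatches")
```

With a domain this small, exhaustive enumeration gives a stronger guarantee than any amount of random sampling and removes the need for seeds or minimization for this target.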

Documents an upstream regression (pinned as a labeled KNOWN ISSUE
check so the stance cannot silently go stale): since [62424] (#65342,
unreleased) the implementation calls mb_chr() without an explicit
encoding, inheriting mb_internal_encoding() — which WordPress sets
from blog_charset — so non-UTF-8 sites get raw legacy bytes for
mappable code points while invalid ones still yield UTF-8 U+FFFD,
contradicting the docblock and mixing encodings with the named
character reference path. The 6.6.0 original was pure arithmetic and
always emitted UTF-8; the same commit changed code point 0 from
U+FFFD to NUL. One-line upstream fix: mb_chr( $code_point, 'UTF-8' ).
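
The encoding mix-up can be modeled in a few lines. This is a toy Python model of the reported behavior, not the WordPress code; the function name and the `internal_encoding` parameter are hypothetical stand-ins for a character-from-code-point call that inherits a site-wide default.

```python
REPLACEMENT_UTF8 = b"\xef\xbf\xbd"  # the invalid path always emits UTF-8 U+FFFD


def code_point_to_bytes(cp: int, internal_encoding: str) -> bytes:
    """Toy model: valid code points are encoded in the inherited encoding."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        return REPLACEMENT_UTF8
    return chr(cp).encode(internal_encoding)


# On a UTF-8 site both paths agree on the encoding.
utf8_site = code_point_to_bytes(0xE9, "utf-8") + code_point_to_bytes(0xD800, "utf-8")

# On an ISO-8859-1 site the mappable code point comes back as a raw legacy
# byte, while the invalid one is still UTF-8, so the result is in no single
# encoding.
legacy_site = (code_point_to_bytes(0xE9, "iso-8859-1")
               + code_point_to_bytes(0xD800, "iso-8859-1"))

print(utf8_site)    # b'\xc3\xa9\xef\xbf\xbd'
print(legacy_site)  # b'\xe9\xef\xbf\xbd'
```

Passing the encoding explicitly, as the one-line fix proposes, removes the dependency on ambient configuration.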

Closes out the extend-encoding-fuzzer handoff: all three sections
done, definition of done verified and recorded in the handoff doc.