Skip to content

Html css fuzz css token processor#63

Draft
sirreal wants to merge 213 commits into
trunkfrom
html-css-fuzz-css-token-processor
Draft

Html css fuzz css token processor#63
sirreal wants to merge 213 commits into
trunkfrom
html-css-fuzz-css-token-processor

Conversation

@sirreal

@sirreal sirreal commented Jun 12, 2026

Copy link
Copy Markdown
Owner

Trac ticket:

Use of AI Tools


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 30 commits June 12, 2026 21:17
Track the core EOF-escape fix (candidate finding 4, now confirmed
against lexbor and fixed):

- The invalid bucket's lone '\' entry is now a valid selector (type
  U+FFFD); replace it with "\\n" (backslash before a newline is not
  a valid escape and stays invalid).
- New 'eof-escape' kind in the edge-escape bucket generates '.name\',
  '#name\', and 'name\' (including empty-name variants) with expected
  ASTs ending in U+FFFD, exercising both the EOF escape decode and the
  normalize-input trailing-whitespace handling.

NEXT-STEPS.md: mark candidate finding 4 fixed (including the
trailing-trim wrong-match-set bug its review surfaced) and record the
session decisions: EOF-truncated attribute selectors will be made
spec-conformant (auto-close), the HTML case-insensitive attribute
value list will be implemented, grammar-level truncations stay
invalid, no Trac tickets.

Self-check OK; 5000-seed runs clean at seeds 1-5000 and 7000000+
(reviewer-chosen fresh range) with lexbor comparisons active.
Track the core EOF auto-close change:

- New 'eof-truncated' kind in the edge-escape bucket: render an
  attribute compound, strip the trailing ']', sometimes also drop a
  closing string quote (EOF terminates the string, then the block) and
  sometimes append a trailing backslash to the unterminated string
  (the 'do nothing' escape arm, keeping that branch exercised after
  the '[a="x\' invalid template became valid).
- Invalid bucket reshuffled along the new validity boundary: entries
  that EOF auto-close makes valid ('[a', '[ a', '[a=b', '[a="x\',
  '[a="b]', "[a='b]", '[a=b i') are replaced with still-invalid
  grammar-level truncations ('[a=', '[a= ', '[a~', '[a^', '[a=b x',
  '[a=b ix', '[a=b i x', '[5=b', '[a="b<LF>c').

COVERAGE.md regenerated: 384/408 lines (94.1%), effective 396/408
(97.1%). The parse_string() and ident-start EOF guards flipped from
'defensive, unreachable' to genuinely covered ('[a=' now reaches the
value parsers at EOF); next_two_are_valid_escape()'s EOF guard is the
remaining defensive line.

NEXT-STEPS.md: record candidate finding 5 (escaped attribute-selector
modifier idents like '[a=b \69]' are rejected by the byte-wise
modifier switch; Chromium itself is inconsistent — accepts \69/i,
rejects \73/s; fail-safe refusal, not fixed).

Review panel: two approvals; 20k-case generator oracle loop and a
fresh 9M-seed-range fuzz run by reviewers found no oracle mismatches.
Gates: self-check OK, full suite 1631 green, 5000-seed run clean.
Track the core change in the oracle stack:

- ReferenceMatcher: independent copy of the 46-name list (the oracle
  must not share a possible misreading with the implementation under
  test); attr matching folds when no modifier + html-namespace row +
  listed name. A new $html_attr_ci flag threads through the matching
  entry points so the lexbor comparison can model an engine without
  the rule.
- TreeCapture: html-processor rows now carry the element's namespace
  (svg/math subtrees and foreignObject integration points get correct
  per-element folding; rows without the field default to html, which
  matches both the model generator's html-only output and the
  namespace-blind standalone Tag Processor).
- Worker: lexbor does not implement the rule ([rel=NOFOLLOW] fails to
  match rel="nofollow"); its expectation is now always recomputed
  with the list disabled, composed with the existing issue-368 quirks
  fold. Candidate upstream report recorded in NEXT-STEPS.md.
- SelectorGenerator: case-flip twists in gen_attr_selector and, more
  importantly, path_attr_feature — the path-directed bucket pairs
  attribute name and value from the same real element, so a flipped
  operand makes the folding rule load-bearing for the mustMatchFid
  invariant. Mutation-tested: with the core folding branch disabled,
  a 3000-seed run fires 11 match-mismatch failures (review found the
  earlier pool-based flip alone was load-bearing in ~1/14k cases).
- util: ascii_strtoupper and str_shuffle_case (ASCII-only, multibyte
  bytes pass through untouched).

Review panel: two approvals (spec reviewer machine-diffed both list
constants against the live spec; oracle reviewer verified row
namespaces against match-time get_namespace() including integration
points, the lexbor compensation composition, and util determinism).
Gates: self-check OK, suite 1640 green, 5000-seed run clean, fresh
11M-seed-range reviewer run clean.
Issues 4 and 5, both surfaced while verifying the WP conformance fixes
against lexbor and both re-verified directly against the harness:

4. EOF does not auto-close an open attribute selector block:
   '[att=val' is a parse error where CSS Syntax §5.4.8 returns the
   block ('[att=val]') and Chrome accepts it. Controls confirm
   grammar-level truncation ('[att=', '[') is correctly rejected and
   escape-at-EOF ('.foo\' -> foo U+FFFD) already works — the gap is
   specifically the simple-block auto-close.

5. HTML's case-insensitive attribute value list is not implemented:
   '[rel=nofollow]' does not match rel="NOFOLLOW" where the HTML
   spec and Chrome fold the 46 listed attributes' values. Controls
   confirm explicit i/s modifiers and unlisted attributes behave.
   Includes the namespace-scoping caveat (spec scopes to HTML
   elements; Chrome folds SVG too).

Both are compensated for in this fuzzer's differential (issue 4 never
reaches lexbor because the differential compares canonical re-renders;
issue 5 is compensated like WordPress#368 by comparing lexbor against the
reference run with the list disabled). Same filing-agent protocol as
issues 1-3: re-verify at master, dedupe, one self-contained C repro
per issue.
- Add the missing IMPLEMENTED entry for EOF auto-close (the session
  decision block still said 'will be made spec-conformant').
- Record the invalid-UTF-8 escape-decode wart both review panels
  flagged ('\' + invalid byte decodes to mb_substitute_character,
  '?', instead of U+FFFD) and tie it to the open handoff item 5
  contract decision.
- Point the lexbor gaps at their now-drafted UPSTREAM-ISSUES.md
  entries (issues 4 and 5) instead of 'candidate upstream report'.
- Note the mutation-test result for the path-directed case-flip and
  the two minor review leftovers (namespace-defaulting dead helpers,
  s-modifier differential coverage).
- Fix the stale repo-state paragraph (the tooling has been committed
  on this branch since 2026-06-10) and list what remains open.
Move the perf follow-up out of the still-open list and record the
outcome: _wp_scan_utf8-based in-place sizing, byte-identical behavior
(74M differential cases), linear scaling, and the deliberately-kept
quadratic mb_substr fallback for escaped invalid bytes pending the
invalid-UTF-8 policy decision.
The invalid-UTF-8 escape-decode note now points at the
mb_substitute_character canary fixture and pins landed in
wpCssSelectorParserMatcher.php, which serve as the ready-made red suite
for the item-5 U+FFFD fix.
The core scrub change makes from_selectors() report _doing_it_wrong()
once per parse of an invalid-UTF-8 selector. The worker's notice
invariants assumed zero notices for parseable selectors and exactly two
for unparseable ones; the chaos/mutated buckets organically produce
invalid-UTF-8 selectors, so a 5000-seed run failed 108 cases against
the new core behavior (94 doing-it-wrong-unexpected, 12
doing-it-wrong-missing, 2 case-determinism).

Worker changes:
- flush_select_parse_caches(): both select() implementations memoize
  the most recently parsed selector string in a function static, so
  whether a call re-parses — and therefore whether the parse-time
  scrub notice fires — depended on worker history, breaking the
  case-determinism re-run. Parsing a sentinel (#-fuzz-cache-flush-)
  through both processors before each notice-assertion window makes
  exactly one parse happen inside it. The flush works even for
  unparseable sentinels (the cache assigns before the null check) and
  precedes reset_doing_it_wrong(), so it cannot pollute recordings.
- check_select_matches() expects exactly one scrub notice — named
  WP_CSS_Compound_Selector_List::from_selectors for the tag target,
  WP_CSS_Complex_Selector_List::from_selectors for html — iff
  wp_is_valid_utf8() rejects the selector string, and nothing else.
  Review verified the predicate is exactly equivalent to "the scrub
  changed the input" (exhaustive 1-2-byte strings plus 2M random).
- check_select_rejection() expects the two per-call select() notices
  (those fire on cache hits too) plus one leading scrub notice for
  invalid-UTF-8 selectors, order- and name-exact via notices_match().

Stale comments updated now that parsed ASTs are valid UTF-8 by
construction: Metamorph's variants() guard and the lexbor
differential's skipped-utf8 state are kept as defense in depth (a
nonzero skipped-utf8 tally now indicates a normalization bypass), and
the invariant glossary describes the expected-set semantics.

NEXT-STEPS.md: the invalid-UTF-8 policy item is resolved as scrub
(decision history, the linked value-getter pin obligation, the
optional parse()-visibility follow-up); the O(1) decode entry notes
its mb_substr() fallback was since removed; the still-open list points
at the deferred coverage work (dedicated invalid-UTF-8 generator
bucket, raw-byte mutation class, explicit lexbor probe — handoff
drafted) and records that the chaos/mutated buckets already exercise
the scrub organically, with lexbor agreeing across clean 5000- and
10000-seed runs.

Gates: self-check OK, 5000 seeds 0 failures, plus independent reviewer
runs (2x2000 determinism-checked, 8000 additional seeds, all clean).
The core scrub (598ed6f) decodes selector input before parsing: each
maximal subpart of an ill-formed UTF-8 sequence becomes one U+FFFD (CSS
Syntax 3 §3.2 via the WHATWG decoder). Until now only the chaos/mutated
buckets exercised that path, organically and without AST expectations.

The new bucket (weight 5 in both maps) injects one raw ill-formed
sequence into a class/ID/attribute-name ident or quoted attribute string
operand — lead/mid/trail/whole position, optionally behind a span type —
and carries the post-scrub AST as its expectation. The per-class U+FFFD
counts are pinned in INVALID_UTF8_CLASSES independently of
wp_scrub_utf8(), so the AST round-trip is a real differential against
the core scrub: lone continuation, truncated 2/3/4-byte leads, and
invalid leads F5/FF decode to 1; overlong C0 80 / C1 BF to 2; surrogate
half ED A0 80 to 3; beyond-max F4 90 80 80 to 4. An injected sequence is
always followed by ASCII or end of input, so a continuation byte can
never complete a truncated lead and shift the subpart boundaries.

self-check gains a forced-bucket section (150 seeds): the selector must
be invalid UTF-8, parse in both grammars, and parse to exactly the
pinned AST; variety assertions require all subpart counts {1,2,3,4}, all
four injection sites, and all ten byte classes. The class names and byte
values are duplicated in the test deliberately — tallying from the
generator's own table would shrink the assertion with a deleted entry
and self-validate a drifted byte value (both demonstrated live in
review).

Adversarial review: three hostile reviewers. The spec reviewer verified
the count table against an independently written WHATWG decoder (960
table contexts x 3 oracles; all 2880 site/position/class render
combinations decode to the assumed post-scrub string; key-order-exact
ASTs in both grammars; 3000-seed sweep clean). The test reviewer ran
nine mutations — count drift, raw-byte expectations, suffix-guarantee
removal, core scrub no-op, per-byte core scrub (killed exclusively by
the two truncated classes that discriminate maximal-subpart from
per-byte replacement), class deletion and de-selection, byte drift —
all killed after two hardening rounds; two disclosed low-severity
survivors remain (lone-continuation substring ambiguity; a class added
to the table alone gets no variety pin). The integration reviewer
confirmed the scrub-notice contract cannot flip (5200 constructed cases
all invalid UTF-8), the lexbor differential stays live via the
canonical re-render (zero skipped-utf8), digest determinism on every
in-bucket seed in 1-400, and replay/minimizer behavior on raw-byte
selectors.

Gates: self-check OK; 5000 seeds, 0 failures (241 invalid-utf8 cases).
The mutated bucket's operations drew from a pure-ASCII alphabet, so the
only ill-formed UTF-8 it produced came from delete/duplicate corrupting
the pools' few multibyte characters (leads C3/CE/E2/F0 only). A new
mutation kind (weight 12) splices one raw sequence from
INVALID_UTF8_CLASSES at an arbitrary byte offset — possibly splitting an
existing multibyte character or landing where a following continuation
byte re-validates the string. These cases carry no AST expectation; they
exercise crash, scrub-notice, and differential paths, and they make the
worker's invalid-UTF-8 rejection branch hot (an unparseable
invalid-UTF-8 selector expects scrub + two select() notices), which no
bucket reached before: the invalid-utf8 bucket always parses and the
chaos alphabets are valid UTF-8.

self-check asserts the operation fires: at least 10 of 200 forced
mutated seeds must contain a marker byte C0/C1/ED/F4/F5/FF (currently
28). The marker set is exactly the sound subset: those bytes cannot
occur in any clean render (C0/C1/F5/FF never appear in valid UTF-8;
ED/F4 only for U+D000-D7FF / above U+FFFFF, which no pool emits), while
the four marker-free sequences (80, C3, E2 8C, F0 9F 82) reuse bytes
that legitimate pool characters contain.

Adversarial review: the same three hostile reviewers, all approved.
Spec: splice arithmetic verified at every boundary (empty selector,
at=0/length, cross-round corruption), a 20000-seed crash sweep with
warnings escalated to exceptions came back clean, marker exclusivity
confirmed against the pre-change generator (0 hits in 20000 seeds; the
red loop reproduced exactly). Test adequacy: dead arm, dead weighted
entry, and empty payload all collapse to 0/200 against the 28/200
baseline (threshold ~4 sigma below the mean under PRNG reshuffles); the
one survivor (dropping only marker-free payloads) is probe diversity,
not verification, and the bucket commit pins all ten classes.
Integration: 5000 seeds 0 failures with byte-identical bucket
distribution to the pre-change baseline, the notice contract verified
self-keyed on the final byte string under 11 adversarial splice shapes
including validity-restoring ones, determinism on every mutated seed in
1-400.

Gates: self-check OK; 5000 seeds, 0 failures.
The handoff's open question — what does lexbor do with raw ill-formed
UTF-8 in selectors — is resolved empirically: lexbor v3.0.0 accepts the
bytes (no parse error) and replaces them with U+FFFD, but not per the
WHATWG maximal-subpart rule CSS Syntax 3 §3.2 invokes. Truncated
multi-byte sequences decode to one U+FFFD per byte (E2 8C to 2, spec 1;
F0 9F 82 to 3, spec 1) and UTF-8-encoded surrogate halves decode
permissively as a single unit (ED A0 80 to 1, spec 3); agreement on the
other byte classes is coincidental overlap of the two algorithms.
Drafted as UPSTREAM-ISSUES.md issue 6 with the probe table. On the
document side lexbor keeps raw invalid bytes unchanged in the DOM (the
same stance as the Tag Processor), so raw doc bytes match nothing in
either engine. The differential needs no n/a gating for the
invalid-utf8 bucket: the worker hands lexbor a canonical re-render of
the post-scrub AST (pure ASCII), the same mechanism that sidesteps
lexbor's other byte-level parsing bugs, so the bucket compares normally.

NEXT-STEPS.md: the deferred scrub-coverage item resolves as implemented
(bucket, splice, probe); the handoff's optional metamorphic relation
parse(s) === parse(scrub(s)) is recorded as deliberately skipped (no
public path bypasses from_selectors(), so it is near-tautological). New
small open item from review: gen_chaos()'s whole-codepoint unicode
branch is dead code (string-vs-key comparison), and its byte-sliced
fallback is what makes chaos emit invalid UTF-8 organically (~15% of
chaos cases) — making the branch live is a behavior decision now that
deliberate ill-formed coverage exists.

COVERAGE.md regenerated at the 3000-seed window: 396/424 = 93.4% raw,
408/424 = 96.2% effective. All 28 unreached lines accounted for: 12
phpdbg case-label artifacts, 12 defensive guards, the 2-line
escape-decoder invalid-byte arm the scrub made unreachable through
from_selectors() (pinned by PHPUnit), and 2 reachable lines this window
misses (witnesses verified under phpdbg: '[' reaches the attribute
length guard, '[a="b' reaches the string-to-EOF break). Stale
93.8%/96.8% references in NEXT-STEPS.md and FINDINGS.md updated to
point at COVERAGE.md as the source of truth.

Adversarial review: the same three hostile reviewers, all approved
after two correction rounds. The issue-6 table was reproduced 10/10
rows against the pinned harness by two reviewers independently, the
WHATWG column confirmed against an independent spec-transcribed
decoder, and the ED-restricts-its-first-continuation-to-9F subpart
reasoning checked against the Encoding Standard's ranges. The coverage
table and uncovered-line list reproduced exactly; the case-label
artifact demonstrated mechanistically (executable-but-unloggable label
lines with executing bodies); the un-normalized-only claim verified in
both directions (direct parse_ident hits the arm, from_selectors never
does). Corrections from review: the issue-6 legend mislabeled U+FFFD
counts as byte counts; stale coverage numbers contradicted the
regenerated report; chaos's organic invalid UTF-8 was misattributed to
pool corruption (it byte-slices its unicode alphabet); an
'all production moves' claim ignored mutated's residual organic
corruption (~2% of pre-splice mutated cases).

Gates: self-check OK; docs-only change — code identical to 8333b93,
whose 5000-seed run was clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants