Html css fuzz css token processor by sirreal · Pull Request #63 · sirreal/wordpress-develop

sirreal · 2026-06-12T20:46:47Z

Trac ticket:

Use of AI Tools

This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

Track the core EOF-escape fix (candidate finding 4, now confirmed against lexbor and fixed):

- The invalid bucket's lone '\' entry is now a valid selector (type U+FFFD); replace it with "\\n" (a backslash before a newline is not a valid escape and stays invalid).
- New 'eof-escape' kind in the edge-escape bucket generates '.name\', '#name\', and 'name\' (including empty-name variants) with expected ASTs ending in U+FFFD, exercising both the EOF escape decode and the normalize-input trailing-whitespace handling.

NEXT-STEPS.md: mark candidate finding 4 fixed (including the trailing-trim wrong-match-set bug its review surfaced) and record the session decisions: EOF-truncated attribute selectors will be made spec-conformant (auto-close), the HTML case-insensitive attribute value list will be implemented, grammar-level truncations stay invalid, no Trac tickets.

Self-check OK; 5000-seed runs clean at seeds 1-5000 and 7000000+ (reviewer-chosen fresh range) with lexbor comparisons active.
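For readers unfamiliar with the escape rules this commit relies on, here is a minimal Python sketch of the CSS Syntax "valid escape" check and escape decode. It is an illustration only, not the PHP code under test; the function names are hypothetical, and it assumes input where CR and FF are treated as newlines without the spec's preprocessing step.

```python
REPLACEMENT = "\ufffd"
HEX_DIGITS = "0123456789abcdefABCDEF"
NEWLINES = ("\n", "\r", "\f")  # assumption: no preprocessing, so treat all three as newline


def is_valid_escape(text: str, pos: int) -> bool:
    """True if text[pos] is a backslash that is not followed by a newline.

    A backslash at end of input counts as a valid escape.
    """
    if pos >= len(text) or text[pos] != "\\":
        return False
    return text[pos + 1 : pos + 2] not in NEWLINES


def decode_escape(text: str, pos: int) -> tuple:
    """Decode the escape whose backslash sits at text[pos].

    Returns (decoded_character, index_after_escape).
    """
    pos += 1  # skip the backslash
    if pos >= len(text):
        return REPLACEMENT, pos  # escape at EOF decodes to U+FFFD
    end = pos
    while end < len(text) and end - pos < 6 and text[end] in HEX_DIGITS:
        end += 1
    if end == pos:
        return text[pos], pos + 1  # non-hex escape: the character itself
    value = int(text[pos:end], 16)
    if end < len(text) and text[end] in " \t\n\r\f":
        end += 1  # one whitespace character terminates a hex escape
    if value == 0 or 0xD800 <= value <= 0xDFFF or value > 0x10FFFF:
        return REPLACEMENT, end
    return chr(value), end
```

Under these assumptions `'.foo\'` ends in U+FFFD, while a backslash followed by a newline never starts an escape.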

Track the core EOF auto-close change:

- New 'eof-truncated' kind in the edge-escape bucket: render an attribute compound, strip the trailing ']', sometimes also drop a closing string quote (EOF terminates the string, then the block) and sometimes append a trailing backslash to the unterminated string (the 'do nothing' escape arm, keeping that branch exercised after the '[a="x\' invalid template became valid).
- Invalid bucket reshuffled along the new validity boundary: entries that EOF auto-close makes valid ('[a', '[ a', '[a=b', '[a="x\', '[a="b]', "[a='b]", '[a=b i') are replaced with still-invalid grammar-level truncations ('[a=', '[a= ', '[a~', '[a^', '[a=b x', '[a=b ix', '[a=b i x', '[5=b', '[a="b<LF>c').

COVERAGE.md regenerated: 384/408 lines (94.1%), effective 396/408 (97.1%). The parse_string() and ident-start EOF guards flipped from 'defensive, unreachable' to genuinely covered ('[a=' now reaches the value parsers at EOF); next_two_are_valid_escape()'s EOF guard is the remaining defensive line.

NEXT-STEPS.md: record candidate finding 5 (escaped attribute-selector modifier idents like '[a=b \69]' are rejected by the byte-wise modifier switch; Chromium itself is inconsistent, accepting \69/i but rejecting \73/s; fail-safe refusal, not fixed).

Review panel: two approvals; 20k-case generator oracle loop and a fresh 9M-seed-range fuzz run by reviewers found no oracle mismatches.

Gates: self-check OK, full suite 1631 green, 5000-seed run clean.
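The validity boundary described above can be approximated with a small Python sketch. It is a deliberately restricted model (ASCII identifiers, no escapes inside identifiers, hypothetical names) and not the parser under test; it only shows why EOF may stand in for a closing quote or a closing ']' but never for a missing operator half or value.

```python
import re

WS = r"[ \t\n\r\f]*"
IDENT = r"-?[A-Za-z_][A-Za-z0-9_-]*"  # simplified: ASCII only, no escapes
OP = r"[~|^$*]?="


def _string(quote: str) -> str:
    # Body: anything but the quote, a backslash, or a raw newline, or any
    # backslash pair. End: the closing quote, or EOF (optionally preceded
    # by a lone trailing backslash, the 'do nothing' escape arm).
    return rf"{quote}(?:[^{quote}\\\n\r\f]|\\.)*(?:{quote}|\\?\Z)"


STRING = "(?:" + _string('"') + "|" + _string("'") + ")"

ATTR = re.compile(
    rf"\[{WS}{IDENT}{WS}"
    rf"(?:{OP}{WS}(?:{IDENT}|{STRING}){WS}(?:[iIsS]{WS})?)?"
    rf"(?:\]|\Z)",  # EOF auto-closes the block
    re.DOTALL,
)


def parses_with_eof_autoclose(selector: str) -> bool:
    """True if the attribute selector is valid once EOF closes open constructs."""
    return ATTR.fullmatch(selector) is not None
```

Run against the two lists in this commit, the sketch accepts every entry that auto-close makes valid and rejects every grammar-level truncation.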

Track the core change in the oracle stack:

- ReferenceMatcher: independent copy of the 46-name list (the oracle must not share a possible misreading with the implementation under test); attribute matching folds when there is no modifier, the row is in the html namespace, and the name is listed. A new $html_attr_ci flag threads through the matching entry points so the lexbor comparison can model an engine without the rule.
- TreeCapture: html-processor rows now carry the element's namespace (svg/math subtrees and foreignObject integration points get correct per-element folding; rows without the field default to html, which matches both the model generator's html-only output and the namespace-blind standalone Tag Processor).
- Worker: lexbor does not implement the rule ([rel=NOFOLLOW] fails to match rel="nofollow"); its expectation is now always recomputed with the list disabled, composed with the existing issue-368 quirks fold. Candidate upstream report recorded in NEXT-STEPS.md.
- SelectorGenerator: case-flip twists in gen_attr_selector and, more importantly, path_attr_feature: the path-directed bucket pairs attribute name and value from the same real element, so a flipped operand makes the folding rule load-bearing for the mustMatchFid invariant. Mutation-tested: with the core folding branch disabled, a 3000-seed run fires 11 match-mismatch failures (review found the earlier pool-based flip alone was load-bearing in ~1/14k cases).
- util: ascii_strtoupper and str_shuffle_case (ASCII-only, multibyte bytes pass through untouched).

Review panel: two approvals (spec reviewer machine-diffed both list constants against the live spec; oracle reviewer verified row namespaces against match-time get_namespace() including integration points, the lexbor compensation composition, and util determinism).

Gates: self-check OK, suite 1640 green, 5000-seed run clean, fresh 11M-seed-range reviewer run clean.
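As a rough illustration of the folding rule the oracle models, the following Python sketch decides when an attribute value comparison is ASCII case-insensitive. The names are hypothetical, and LISTED_SUBSET holds only a few entries of the HTML spec's 46-name list; the real oracle keeps its own full copy.

```python
# Assumption: a small subset of the HTML spec's list, for illustration only.
LISTED_SUBSET = frozenset({"rel", "type", "dir", "media"})


def ascii_lower(text: str) -> str:
    """Lowercase A-Z only; every other character passes through untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def attr_value_equals(name, selector_value, element_value,
                      modifier=None, namespace="html", html_attr_ci=True):
    """Model of [name=value] matching with the HTML case-insensitivity list.

    modifier is None, "i", or "s"; html_attr_ci=False models an engine
    that does not implement the list.
    """
    if modifier == "i":
        fold = True
    elif modifier == "s":
        fold = False
    else:
        fold = (
            html_attr_ci
            and namespace == "html"
            and ascii_lower(name) in LISTED_SUBSET
        )
    if fold:
        return ascii_lower(selector_value) == ascii_lower(element_value)
    return selector_value == element_value
```

Calling it with `html_attr_ci=False` reproduces the lexbor behaviour described above, where `[rel=NOFOLLOW]` does not match `rel="nofollow"`.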

Issues 4 and 5, both surfaced while verifying the WP conformance fixes against lexbor and both re-verified directly against the harness:

4. EOF does not auto-close an open attribute selector block: '[att=val' is a parse error where CSS Syntax §5.4.8 returns the block ('[att=val]') and Chrome accepts it. Controls confirm grammar-level truncation ('[att=', '[') is correctly rejected and escape-at-EOF ('.foo\' -> foo U+FFFD) already works; the gap is specifically the simple-block auto-close.
5. HTML's case-insensitive attribute value list is not implemented: '[rel=nofollow]' does not match rel="NOFOLLOW" where the HTML spec and Chrome fold the 46 listed attributes' values. Controls confirm explicit i/s modifiers and unlisted attributes behave. Includes the namespace-scoping caveat (spec scopes to HTML elements; Chrome folds SVG too).

Both are compensated for in this fuzzer's differential (issue 4 never reaches lexbor because the differential compares canonical re-renders; issue 5 is compensated like WordPress#368 by comparing lexbor against the reference run with the list disabled).

Same filing-agent protocol as issues 1-3: re-verify at master, dedupe, one self-contained C repro per issue.

- Add the missing IMPLEMENTED entry for EOF auto-close (the session decision block still said 'will be made spec-conformant').
- Record the invalid-UTF-8 escape-decode wart both review panels flagged ('\' + invalid byte decodes to mb_substitute_character, '?', instead of U+FFFD) and tie it to the open handoff item 5 contract decision.
- Point the lexbor gaps at their now-drafted UPSTREAM-ISSUES.md entries (issues 4 and 5) instead of 'candidate upstream report'.
- Note the mutation-test result for the path-directed case-flip and the two minor review leftovers (namespace-defaulting dead helpers, s-modifier differential coverage).
- Fix the stale repo-state paragraph (the tooling has been committed on this branch since 2026-06-10) and list what remains open.

Move the perf follow-up out of the still-open list and record the outcome: _wp_scan_utf8-based in-place sizing, byte-identical behavior (74M differential cases), linear scaling, and the deliberately-kept quadratic mb_substr fallback for escaped invalid bytes pending the invalid-UTF-8 policy decision.

The invalid-UTF-8 escape-decode note now points at the mb_substitute_character canary fixture and pins landed in wpCssSelectorParserMatcher.php, which serve as the ready-made red suite for the item-5 U+FFFD fix.

The core scrub change makes from_selectors() report _doing_it_wrong() once per parse of an invalid-UTF-8 selector. The worker's notice invariants assumed zero notices for parseable selectors and exactly two for unparseable ones; the chaos/mutated buckets organically produce invalid-UTF-8 selectors, so a 5000-seed run failed 108 cases against the new core behavior (94 doing-it-wrong-unexpected, 12 doing-it-wrong-missing, 2 case-determinism).

Worker changes:

- flush_select_parse_caches(): both select() implementations memoize the most recently parsed selector string in a function static, so whether a call re-parses (and therefore whether the parse-time scrub notice fires) depended on worker history, breaking the case-determinism re-run. Parsing a sentinel (#-fuzz-cache-flush-) through both processors before each notice-assertion window makes exactly one parse happen inside it. The flush works even for unparseable sentinels (the cache assigns before the null check) and precedes reset_doing_it_wrong(), so it cannot pollute recordings.
- check_select_matches() expects exactly one scrub notice (named WP_CSS_Compound_Selector_List::from_selectors for the tag target, WP_CSS_Complex_Selector_List::from_selectors for html) iff wp_is_valid_utf8() rejects the selector string, and nothing else. Review verified the predicate is exactly equivalent to "the scrub changed the input" (exhaustive 1-2-byte strings plus 2M random).
- check_select_rejection() expects the two per-call select() notices (those fire on cache hits too) plus one leading scrub notice for invalid-UTF-8 selectors, order- and name-exact via notices_match().

Stale comments updated now that parsed ASTs are valid UTF-8 by construction: Metamorph's variants() guard and the lexbor differential's skipped-utf8 state are kept as defense in depth (a nonzero skipped-utf8 tally now indicates a normalization bypass), and the invariant glossary describes the expected-set semantics.
NEXT-STEPS.md: the invalid-UTF-8 policy item is resolved as scrub (decision history, the linked value-getter pin obligation, the optional parse()-visibility follow-up); the O(1) decode entry notes its mb_substr() fallback was since removed; the still-open list points at the deferred coverage work (dedicated invalid-UTF-8 generator bucket, raw-byte mutation class, explicit lexbor probe; handoff drafted) and records that the chaos/mutated buckets already exercise the scrub organically, with lexbor agreeing across clean 5000- and 10000-seed runs.

Gates: self-check OK, 5000 seeds 0 failures, plus independent reviewer runs (2x2000 determinism-checked, 8000 additional seeds, all clean).
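The notice contract described in this commit can be summarised as a small expected-set function. The Python sketch below is a model under stated assumptions, not the worker's PHP: the function names are hypothetical, Python's strict UTF-8 decoder stands in for wp_is_valid_utf8(), and the two per-call select() notice names are passed in by the caller because the message does not spell them out.

```python
# Scrub notice names as given in the commit message, keyed by target.
SCRUB_NOTICE = {
    "tag": "WP_CSS_Compound_Selector_List::from_selectors",
    "html": "WP_CSS_Complex_Selector_List::from_selectors",
}


def is_valid_utf8(data: bytes) -> bool:
    """Stand-in for wp_is_valid_utf8(): strict decode succeeds or not."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def expected_notices(selector: bytes, target: str, parses: bool,
                     select_notices=()) -> list:
    """Ordered notices one select() call inside a flushed window should record.

    select_notices: the two per-call notice names for a rejected selector
    (hypothetical placeholder values supplied by the caller).
    """
    expected = []
    if not is_valid_utf8(selector):
        expected.append(SCRUB_NOTICE[target])  # leading scrub notice
    if not parses:
        expected.extend(select_notices)  # fire on cache hits too
    return expected
```

The ordering matters: the scrub notice comes first because it fires at parse time, before select() reports the rejection.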

The core scrub (598ed6f) decodes selector input before parsing: each maximal subpart of an ill-formed UTF-8 sequence becomes one U+FFFD (CSS Syntax 3 §3.2 via the WHATWG decoder). Until now only the chaos/mutated buckets exercised that path, organically and without AST expectations. The new bucket (weight 5 in both maps) injects one raw ill-formed sequence into a class/ID/attribute-name ident or quoted attribute string operand (lead/mid/trail/whole position, optionally behind a span type) and carries the post-scrub AST as its expectation.

The per-class U+FFFD counts are pinned in INVALID_UTF8_CLASSES independently of wp_scrub_utf8(), so the AST round-trip is a real differential against the core scrub: lone continuation, truncated 2/3/4-byte leads, and invalid leads F5/FF decode to 1; overlong C0 80 / C1 BF to 2; surrogate half ED A0 80 to 3; beyond-max F4 90 80 80 to 4. An injected sequence is always followed by ASCII or end of input, so a continuation byte can never complete a truncated lead and shift the subpart boundaries.

self-check gains a forced-bucket section (150 seeds): the selector must be invalid UTF-8, parse in both grammars, and parse to exactly the pinned AST; variety assertions require all subpart counts {1,2,3,4}, all four injection sites, and all ten byte classes. The class names and byte values are duplicated in the test deliberately: tallying from the generator's own table would shrink the assertion with a deleted entry and self-validate a drifted byte value (both demonstrated live in review).

Adversarial review: three hostile reviewers. The spec reviewer verified the count table against an independently written WHATWG decoder (960 table contexts x 3 oracles; all 2880 site/position/class render combinations decode to the assumed post-scrub string; key-order-exact ASTs in both grammars; 3000-seed sweep clean).
The test reviewer ran nine mutations (count drift, raw-byte expectations, suffix-guarantee removal, core scrub no-op, per-byte core scrub (killed exclusively by the two truncated classes that discriminate maximal-subpart from per-byte replacement), class deletion and de-selection, byte drift), all killed after two hardening rounds; two disclosed low-severity survivors remain (lone-continuation substring ambiguity; a class added to the table alone gets no variety pin).

The integration reviewer confirmed the scrub-notice contract cannot flip (5200 constructed cases all invalid UTF-8), the lexbor differential stays live via the canonical re-render (zero skipped-utf8), digest determinism on every in-bucket seed in 1-400, and replay/minimizer behavior on raw-byte selectors.

Gates: self-check OK; 5000 seeds, 0 failures (241 invalid-utf8 cases).
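The pinned per-class counts can be cross-checked against any decoder that replaces maximal subparts. The Python sketch below uses CPython's built-in UTF-8 decoder with `errors="replace"` as that second opinion, on the assumption that it follows the maximal-subpart practice; the class labels are hypothetical, while the byte values and counts are the ones quoted in these commit messages.

```python
# (bytes, expected number of U+FFFD after decoding); labels are illustrative.
PINNED_CLASSES = {
    "lone-continuation": (b"\x80", 1),
    "truncated-2-byte": (b"\xc3", 1),
    "truncated-3-byte": (b"\xe2\x8c", 1),
    "truncated-4-byte": (b"\xf0\x9f\x82", 1),
    "invalid-lead-f5": (b"\xf5", 1),
    "invalid-lead-ff": (b"\xff", 1),
    "overlong-c0-80": (b"\xc0\x80", 2),
    "overlong-c1-bf": (b"\xc1\xbf", 2),
    "surrogate-half": (b"\xed\xa0\x80", 3),
    "beyond-max": (b"\xf4\x90\x80\x80", 4),
}


def replacement_count(data: bytes) -> int:
    """Count the U+FFFD characters produced by a replacing UTF-8 decode."""
    return data.decode("utf-8", errors="replace").count("\ufffd")


def check_pins() -> dict:
    """Decode each class mid-ident and at end of input; return observed counts."""
    observed = {}
    for label, (raw, pinned) in PINNED_CLASSES.items():
        mid = replacement_count(b".a" + raw + b"z")  # followed by ASCII
        tail = replacement_count(b".a" + raw)        # followed by end of input
        assert mid == tail == pinned, (label, mid, tail, pinned)
        observed[label] = mid
    return observed
```

The ASCII-or-EOF suffix mirrors the generator's guarantee: without it, a following continuation byte could complete a truncated lead and change the count.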

The mutated bucket's operations drew from a pure-ASCII alphabet, so the only ill-formed UTF-8 it produced came from delete/duplicate corrupting the pools' few multibyte characters (leads C3/CE/E2/F0 only). A new mutation kind (weight 12) splices one raw sequence from INVALID_UTF8_CLASSES at an arbitrary byte offset, possibly splitting an existing multibyte character or landing where a following continuation byte re-validates the string. These cases carry no AST expectation; they exercise crash, scrub-notice, and differential paths, and they make the worker's invalid-UTF-8 rejection branch hot (an unparseable invalid-UTF-8 selector expects scrub + two select() notices), which no bucket reached before: the invalid-utf8 bucket always parses and the chaos alphabets are valid UTF-8.

self-check asserts the operation fires: at least 10 of 200 forced mutated seeds must contain a marker byte C0/C1/ED/F4/F5/FF (currently 28). The marker set is exactly the sound subset: those bytes cannot occur in any clean render (C0/C1/F5/FF never appear in valid UTF-8; ED/F4 only for U+D000-D7FF / above U+FFFFF, which no pool emits), while the four marker-free sequences (80, C3, E2 8C, F0 9F 82) reuse bytes that legitimate pool characters contain.

Adversarial review: the same three hostile reviewers, all approved.

Spec: splice arithmetic verified at every boundary (empty selector, at=0/length, cross-round corruption), a 20000-seed crash sweep with warnings escalated to exceptions came back clean, marker exclusivity confirmed against the pre-change generator (0 hits in 20000 seeds; the red loop reproduced exactly).

Test adequacy: dead arm, dead weighted entry, and empty payload all collapse to 0/200 against the 28/200 baseline (threshold ~4 sigma below the mean under PRNG reshuffles); the one survivor (dropping only marker-free payloads) is probe diversity, not verification, and the bucket commit pins all ten classes.
Integration: 5000 seeds 0 failures with byte-identical bucket distribution to the pre-change baseline, the notice contract verified self-keyed on the final byte string under 11 adversarial splice shapes including validity-restoring ones, determinism on every mutated seed in 1-400.

Gates: self-check OK; 5000 seeds, 0 failures.
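A minimal Python sketch of the splice operation and the marker-byte check follows. It is not the generator's code; the function names are hypothetical and the seeded `random.Random` merely stands in for the fuzzer's PRNG.

```python
import random

# Bytes that, per the commit message, no clean render can contain.
MARKER_BYTES = frozenset(b"\xc0\xc1\xed\xf4\xf5\xff")


def splice(selector: bytes, payload: bytes, rng: random.Random) -> bytes:
    """Insert payload at a random byte offset in [0, len(selector)].

    The offset is byte-based, so it may land inside a multibyte character.
    """
    at = rng.randint(0, len(selector))  # inclusive at both ends
    return selector[:at] + payload + selector[at:]


def has_marker(data: bytes) -> bool:
    """True if any byte of data is in the marker set."""
    return any(byte in MARKER_BYTES for byte in data)


if __name__ == "__main__":
    rng = random.Random(1)
    mutated = splice(".caf\u00e9".encode("utf-8"), b"\xed\xa0\x80", rng)
    print(mutated, has_marker(mutated))
```

A marker-free payload such as `E2 8C` goes undetected by `has_marker`, which is why the self-check treats the marker count as a lower bound on how often the operation fired.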

The handoff's open question (what does lexbor do with raw ill-formed UTF-8 in selectors) is resolved empirically: lexbor v3.0.0 accepts the bytes (no parse error) and replaces them with U+FFFD, but not per the WHATWG maximal-subpart rule CSS Syntax 3 §3.2 invokes. Truncated multi-byte sequences decode to one U+FFFD per byte (E2 8C to 2, spec 1; F0 9F 82 to 3, spec 1) and UTF-8-encoded surrogate halves decode permissively as a single unit (ED A0 80 to 1, spec 3); agreement on the other byte classes is coincidental overlap of the two algorithms. Drafted as UPSTREAM-ISSUES.md issue 6 with the probe table.

On the document side lexbor keeps raw invalid bytes unchanged in the DOM (the same stance as the Tag Processor), so raw doc bytes match nothing in either engine. The differential needs no n/a gating for the invalid-utf8 bucket: the worker hands lexbor a canonical re-render of the post-scrub AST (pure ASCII), the same mechanism that sidesteps lexbor's other byte-level parsing bugs, so the bucket compares normally.

NEXT-STEPS.md: the deferred scrub-coverage item resolves as implemented (bucket, splice, probe); the handoff's optional metamorphic relation parse(s) === parse(scrub(s)) is recorded as deliberately skipped (no public path bypasses from_selectors(), so it is near-tautological). New small open item from review: gen_chaos()'s whole-codepoint unicode branch is dead code (string-vs-key comparison), and its byte-sliced fallback is what makes chaos emit invalid UTF-8 organically (~15% of chaos cases); making the branch live is a behavior decision now that deliberate ill-formed coverage exists.

COVERAGE.md regenerated at the 3000-seed window: 396/424 = 93.4% raw, 408/424 = 96.2% effective.
All 28 unreached lines accounted for: 12 phpdbg case-label artifacts, 12 defensive guards, the 2-line escape-decoder invalid-byte arm the scrub made unreachable through from_selectors() (pinned by PHPUnit), and 2 reachable lines this window misses (witnesses verified under phpdbg: '[' reaches the attribute length guard, '[a="b' reaches the string-to-EOF break). Stale 93.8%/96.8% references in NEXT-STEPS.md and FINDINGS.md updated to point at COVERAGE.md as the source of truth.

Adversarial review: the same three hostile reviewers, all approved after two correction rounds. The issue-6 table was reproduced 10/10 rows against the pinned harness by two reviewers independently, the WHATWG column confirmed against an independent spec-transcribed decoder, and the ED-restricts-its-first-continuation-to-9F subpart reasoning checked against the Encoding Standard's ranges. The coverage table and uncovered-line list reproduced exactly; the case-label artifact demonstrated mechanistically (executable-but-unloggable label lines with executing bodies); the un-normalized-only claim verified in both directions (direct parse_ident hits the arm, from_selectors never does).

Corrections from review: the issue-6 legend mislabeled U+FFFD counts as byte counts; stale coverage numbers contradicted the regenerated report; chaos's organic invalid UTF-8 was misattributed to pool corruption (it byte-slices its unicode alphabet); an 'all production moves' claim ignored mutated's residual organic corruption (~2% of pre-splice mutated cases).

Gates: self-check OK; docs-only change, code identical to 8333b93, whose 5000-seed run was clean.
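The coverage figures in this message can be re-derived with a few lines of arithmetic. The Python sketch below assumes that "effective" coverage counts the 12 phpdbg case-label artifacts as covered, which is an inference from the numbers (408 - 396 = 12) rather than something the message states outright.

```python
TOTAL_LINES = 424
RAW_COVERED = 396

# Breakdown of unreached lines as listed in the commit message.
UNREACHED = {
    "phpdbg case-label artifacts": 12,
    "defensive guards": 12,
    "escape-decoder invalid-byte arm": 2,
    "reachable lines missed in this window": 2,
}

# Every unreached line should be accounted for by the breakdown.
assert sum(UNREACHED.values()) == TOTAL_LINES - RAW_COVERED

# Assumption: effective coverage adds back the case-label artifacts.
effective_covered = RAW_COVERED + UNREACHED["phpdbg case-label artifacts"]

raw_pct = round(100 * RAW_COVERED / TOTAL_LINES, 1)
effective_pct = round(100 * effective_covered / TOTAL_LINES, 1)

print(f"raw {RAW_COVERED}/{TOTAL_LINES} = {raw_pct}%")
print(f"effective {effective_covered}/{TOTAL_LINES} = {effective_pct}%")
```

Both percentages agree with the regenerated COVERAGE.md figures quoted above.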

sirreal added 30 commits December 5, 2024 22:52

WIP class skeleton

0e8c4fb

Document class

2d3d283

Do not support namespaced selectors

40222d3

Flesh out stuff

6092642

Starting to actually parse

3e3b2b2

Add ident tests

967557f

Fix ident non-ascii bug

2ec1db3

Use class after defined

ee2c7ce

Fix some char stuff

0f708ba

Improve tests

3cb455d

Housekeeping

5609e50

Require new file in WP

4f25bc2

Fix offset type

943293f

Add more tests and invalid tests

24c9744

Fix wrong offset var usage

a7c10b9

comment tweak

dd718b7

Implement codepoint escape with strspn

5884aca

Test with UPPER HEX

a9a077f

Add ID tests

5f53e0a

Improve tests

effbbbe

Add class selector tests

62ec5bb

Add class selector

153f009

Simplify id selector parse

fcc6401

Improve ident tests

21c67e5

Add type selector tests

728d798

Add docs and remove unreachable line

e1e8e09

Add type selector class

13ac3c1

Add attribute selector tests

a3c25e8

improve attr tests

ad5c600

Fix expectation argument order

6758704

sirreal added 30 commits June 12, 2026 21:17

Merge branch 'html-api/add-css-selector-parser' into html-css-fuzz

881f7b6

Merge branch 'html-api/add-css-selector-parser' into html-css-fuzz

2ff392a

Merge branch 'html-api/add-css-selector-parser' into html-css-fuzz

158e307

Merge branch 'trunk' into html-css-fuzz

3a28bf8

Merge branch 'html-api/add-css-selector-parser' into html-css-fuzz

8c82955

CSS selector fuzz: note the canary red suite for the UTF-8 policy item

9c2cb64

Merge branch 'html-api/add-css-selector-parser' into html-css-fuzz

0a5b157

Build lexbor harness from master

42c267a

CSS selector fuzz: harden fuzzer follow-ups

9d1129c

CSS selector fuzz: Update moved selector-fix references

202905a

Merge branch 'trunk' into html-api/add-css-selector-parser

9b194d3

Merge branch 'html-api/add-css-selector-parser' into html-css-fuzz

4b1c95d

CSS Tokenizer: Add CSS token processor

f9e6771

CSS Tokenizer: [token processor bug] Add string backslash EOF regression

e3d9a9e

CSS Tokenizer: [token processor bug] Add escaped newline string regression

3b8dcb4

CSS Tokenizer: [token processor bug] Decode string escapes at EOF

3988b49

CSS Tokenizer: Parse selectors from CSS token stream

4555e60

CSS Tokenizer: [token processor bug] Port string escape decoding fix

2c69506

CSS Tokenizer: [token processor bug] Return null for bad string values

574d650

CSS Tokenizer: [token processor bug] Scrub decoded UTF-8 ranges

1be64d1

sirreal wants to merge 213 commits into trunk from html-css-fuzz-css-token-processor
