Html css fuzz#57
Draft
sirreal wants to merge 205 commits into
Draft
Conversation
Reviewers flagged overclaims in the prose. Corrected with measured numbers, no behavior change: - 'most match assertions non-vacuous' -> true only within the path-directed bucket (~67% non-vacuous); aggregate across all buckets is ~62% vacuous [] == [] by design (negative-oriented and parse-focused buckets are intentionally empty-set). README, FINDINGS, NEXT-STEPS now state both numbers and the actual point (the combinator walker is exercised with real depth). - 'three oracles agree on every compared case' -> kept, but disclose ~half of compared cases are themselves vacuous. - Quirks-mode class/ID matching has NO independent third oracle (lexbor WordPress#368 excludes quirks docs), and ReferenceMatcher shares WP's ASCII-fold reading by construction -> documented as the weakest-covered behavior rather than implied-covered. - The determinism digest covers the WP-under-test surface but not the lexbor oracle's own output -> noted.
Reviewer-1 #10 (class-value decode boundary): ReferenceMatcher::class_matches now folds NUL -> U+FFFD per class token, matching WP_HTML_Tag_Processor:: class_list(). Previously the reference left raw NUL, so it would have diverged from select() on a class value containing NUL (a reference gap, not a WP bug). Pinned by five deterministic decode-boundary cases in self-check.php (NUL->FFFD, trailing NUL, FF separator) that put the two independent class tokenizers against each other and against select(). Randomized document-side NUL injection is deliberately NOT added: a hot-path PRNG draw perturbs self-check.php's fixed seed space enough to surface the known Bug 3, which would first require decoupling self-check from the unfixed core bugs. Documented as a scoped future improvement in README 'Known oracle limitations', which also distinguishes the independent class-value path from the shared get_attribute() attribute path per the reviewer's sharpening. Reviewer-2 nitpick (fid sanitization asymmetric on match path): collect_matches now routes select() fids through TreeCapture::sanitize_fid, identical to the tree-row and lexbor sides, so a control-char fid can never produce a false lexbor-divergence (unreachable today; fids are integers). Reviewer-3 nitpick: minimize.php header docstring no longer implies --seed works universally. Validated: self-check OK (incl. new cases), unpatched detects the 3 bugs, patched 2000-seed run clean (927 lexbor comparisons), all 3 bugs still reproduce.
Reviewer-1 forward-looking note (not a blocker): self-check.php's parse-expectation loop runs over a fixed seed window that dodges the three known core bugs only by seed luck, so any future generator change can collide with them (as the deferred class-NUL injection already does). Record decoupling self-check from the unfixed core bugs as a standalone hardening item in NEXT-STEPS — worth doing independently and the prerequisite for randomized class-NUL injection.
All three FINDINGS.md bugs are now fixed on this branch by the
'CSS selector:' commits, each with PHPUnit regression tests, and a
post-fix 5000-seed run is clean. Update FINDINGS.md and the NEXT-STEPS
regression anchors accordingly: the minimal repros must now NOT trigger.
Corrections and additions surfaced by the fix-review panels:
- Bug 2's claimed substr_compare negative-length edge does not exist
(-strlen('') is 0, and PHP clamps out-of-range negative offsets); the
landed fix also excludes ~= from the guard since a whitespace-delimited
list never yields an empty item.
- README: note lexbor's [x~=""] whitespace-value divergence (matches
where Selectors-4/Chrome/WP say no match) as a latent noise source and
candidate upstream report.
- NEXT-STEPS: candidate finding 4 — \ at EOF is a valid escape per CSS
Syntax §4.3.8 and should decode to U+FFFD in ident context, but WP
rejects it ('.foo\' parses to null). Unverified, low severity.
- NEXT-STEPS: self-check/known-bug decoupling reframed now that this
branch's core is fixed; hazard remains for unfixed-core runs.
Three spec-conformance bugs in liblexbor v3.0.0, found via the differential oracle and re-verified directly against the harness: 1. [x~=""] matches whitespace-only attribute values (empty operand must never match, Selectors-4 §6.1). 2. Uppercase I/S attribute-selector modifiers are parse errors (Selectors-4 §6.3 defines them case-insensitively). 3. Non-ASCII ident code points below U+00F8 (U+00B7, U+00C0-U+00F6) are rejected; the table appears to start at U+00F8 (CSS Syntax 3 §4.2). Each comes with verified repro tables, control cases bracketing the boundary, spec citations, a self-contained C repro skeleton, and instructions for an agent to re-verify at lexbor master, dedupe against existing issues, and file upstream. lexbor WordPress#368 is already filed and excluded.
Track the core EOF-escape fix (candidate finding 4, now confirmed against lexbor and fixed): - The invalid bucket's lone '\' entry is now a valid selector (type U+FFFD); replace it with "\\n" (backslash before a newline is not a valid escape and stays invalid). - New 'eof-escape' kind in the edge-escape bucket generates '.name\', '#name\', and 'name\' (including empty-name variants) with expected ASTs ending in U+FFFD, exercising both the EOF escape decode and the normalize-input trailing-whitespace handling. NEXT-STEPS.md: mark candidate finding 4 fixed (including the trailing-trim wrong-match-set bug its review surfaced) and record the session decisions: EOF-truncated attribute selectors will be made spec-conformant (auto-close), the HTML case-insensitive attribute value list will be implemented, grammar-level truncations stay invalid, no Trac tickets. Self-check OK; 5000-seed runs clean at seeds 1-5000 and 7000000+ (reviewer-chosen fresh range) with lexbor comparisons active.
Track the core EOF auto-close change:
- New 'eof-truncated' kind in the edge-escape bucket: render an
attribute compound, strip the trailing ']', sometimes also drop a
closing string quote (EOF terminates the string, then the block) and
sometimes append a trailing backslash to the unterminated string
(the 'do nothing' escape arm, keeping that branch exercised after
the '[a="x\' invalid template became valid).
- Invalid bucket reshuffled along the new validity boundary: entries
that EOF auto-close makes valid ('[a', '[ a', '[a=b', '[a="x\',
'[a="b]', "[a='b]", '[a=b i') are replaced with still-invalid
grammar-level truncations ('[a=', '[a= ', '[a~', '[a^', '[a=b x',
'[a=b ix', '[a=b i x', '[5=b', '[a="b<LF>c').
COVERAGE.md regenerated: 384/408 lines (94.1%), effective 396/408
(97.1%). The parse_string() and ident-start EOF guards flipped from
'defensive, unreachable' to genuinely covered ('[a=' now reaches the
value parsers at EOF); next_two_are_valid_escape()'s EOF guard is the
remaining defensive line.
NEXT-STEPS.md: record candidate finding 5 (escaped attribute-selector
modifier idents like '[a=b \69]' are rejected by the byte-wise
modifier switch; Chromium itself is inconsistent — accepts \69/i,
rejects \73/s; fail-safe refusal, not fixed).
Review panel: two approvals; 20k-case generator oracle loop and a
fresh 9M-seed-range fuzz run by reviewers found no oracle mismatches.
Gates: self-check OK, full suite 1631 green, 5000-seed run clean.
Track the core change in the oracle stack: - ReferenceMatcher: independent copy of the 46-name list (the oracle must not share a possible misreading with the implementation under test); attr matching folds when no modifier + html-namespace row + listed name. A new $html_attr_ci flag threads through the matching entry points so the lexbor comparison can model an engine without the rule. - TreeCapture: html-processor rows now carry the element's namespace (svg/math subtrees and foreignObject integration points get correct per-element folding; rows without the field default to html, which matches both the model generator's html-only output and the namespace-blind standalone Tag Processor). - Worker: lexbor does not implement the rule ([rel=NOFOLLOW] fails to match rel="nofollow"); its expectation is now always recomputed with the list disabled, composed with the existing issue-368 quirks fold. Candidate upstream report recorded in NEXT-STEPS.md. - SelectorGenerator: case-flip twists in gen_attr_selector and, more importantly, path_attr_feature — the path-directed bucket pairs attribute name and value from the same real element, so a flipped operand makes the folding rule load-bearing for the mustMatchFid invariant. Mutation-tested: with the core folding branch disabled, a 3000-seed run fires 11 match-mismatch failures (review found the earlier pool-based flip alone was load-bearing in ~1/14k cases). - util: ascii_strtoupper and str_shuffle_case (ASCII-only, multibyte bytes pass through untouched). Review panel: two approvals (spec reviewer machine-diffed both list constants against the live spec; oracle reviewer verified row namespaces against match-time get_namespace() including integration points, the lexbor compensation composition, and util determinism). Gates: self-check OK, suite 1640 green, 5000-seed run clean, fresh 11M-seed-range reviewer run clean.
Issues 4 and 5, both surfaced while verifying the WP conformance fixes
against lexbor and both re-verified directly against the harness:
4. EOF does not auto-close an open attribute selector block:
'[att=val' is a parse error where CSS Syntax §5.4.8 returns the
block ('[att=val]') and Chrome accepts it. Controls confirm
grammar-level truncation ('[att=', '[') is correctly rejected and
escape-at-EOF ('.foo\' -> foo U+FFFD) already works — the gap is
specifically the simple-block auto-close.
5. HTML's case-insensitive attribute value list is not implemented:
'[rel=nofollow]' does not match rel="NOFOLLOW" where the HTML
spec and Chrome fold the 46 listed attributes' values. Controls
confirm explicit i/s modifiers and unlisted attributes behave.
Includes the namespace-scoping caveat (spec scopes to HTML
elements; Chrome folds SVG too).
Both are compensated for in this fuzzer's differential (issue 4 never
reaches lexbor because the differential compares canonical re-renders;
issue 5 is compensated like WordPress#368 by comparing lexbor against the
reference run with the list disabled). Same filing-agent protocol as
issues 1-3: re-verify at master, dedupe, one self-contained C repro
per issue.
- Add the missing IMPLEMENTED entry for EOF auto-close (the session
decision block still said 'will be made spec-conformant').
- Record the invalid-UTF-8 escape-decode wart both review panels
flagged ('\' + invalid byte decodes to mb_substitute_character,
'?', instead of U+FFFD) and tie it to the open handoff item 5
contract decision.
- Point the lexbor gaps at their now-drafted UPSTREAM-ISSUES.md
entries (issues 4 and 5) instead of 'candidate upstream report'.
- Note the mutation-test result for the path-directed case-flip and
the two minor review leftovers (namespace-defaulting dead helpers,
s-modifier differential coverage).
- Fix the stale repo-state paragraph (the tooling has been committed
on this branch since 2026-06-10) and list what remains open.
Move the perf follow-up out of the still-open list and record the outcome: _wp_scan_utf8-based in-place sizing, byte-identical behavior (74M differential cases), linear scaling, and the deliberately-kept quadratic mb_substr fallback for escaped invalid bytes pending the invalid-UTF-8 policy decision.
The invalid-UTF-8 escape-decode note now points at the mb_substitute_character canary fixture and pins landed in wpCssSelectorParserMatcher.php, which serve as the ready-made red suite for the item-5 U+FFFD fix.
The core scrub change makes from_selectors() report _doing_it_wrong() once per parse of an invalid-UTF-8 selector. The worker's notice invariants assumed zero notices for parseable selectors and exactly two for unparseable ones; the chaos/mutated buckets organically produce invalid-UTF-8 selectors, so a 5000-seed run failed 108 cases against the new core behavior (94 doing-it-wrong-unexpected, 12 doing-it-wrong-missing, 2 case-determinism). Worker changes: - flush_select_parse_caches(): both select() implementations memoize the most recently parsed selector string in a function static, so whether a call re-parses — and therefore whether the parse-time scrub notice fires — depended on worker history, breaking the case-determinism re-run. Parsing a sentinel (#-fuzz-cache-flush-) through both processors before each notice-assertion window makes exactly one parse happen inside it. The flush works even for unparseable sentinels (the cache assigns before the null check) and precedes reset_doing_it_wrong(), so it cannot pollute recordings. - check_select_matches() expects exactly one scrub notice — named WP_CSS_Compound_Selector_List::from_selectors for the tag target, WP_CSS_Complex_Selector_List::from_selectors for html — iff wp_is_valid_utf8() rejects the selector string, and nothing else. Review verified the predicate is exactly equivalent to "the scrub changed the input" (exhaustive 1-2-byte strings plus 2M random). - check_select_rejection() expects the two per-call select() notices (those fire on cache hits too) plus one leading scrub notice for invalid-UTF-8 selectors, order- and name-exact via notices_match(). Stale comments updated now that parsed ASTs are valid UTF-8 by construction: Metamorph's variants() guard and the lexbor differential's skipped-utf8 state are kept as defense in depth (a nonzero skipped-utf8 tally now indicates a normalization bypass), and the invariant glossary describes the expected-set semantics. NEXT-STEPS.md: the invalid-UTF-8 policy item is resolved as scrub (decision history, the linked value-getter pin obligation, the optional parse()-visibility follow-up); the O(1) decode entry notes its mb_substr() fallback was since removed; the still-open list points at the deferred coverage work (dedicated invalid-UTF-8 generator bucket, raw-byte mutation class, explicit lexbor probe — handoff drafted) and records that the chaos/mutated buckets already exercise the scrub organically, with lexbor agreeing across clean 5000- and 10000-seed runs. Gates: self-check OK, 5000 seeds 0 failures, plus independent reviewer runs (2x2000 determinism-checked, 8000 additional seeds, all clean).
The core scrub (598ed6f) decodes selector input before parsing: each maximal subpart of an ill-formed UTF-8 sequence becomes one U+FFFD (CSS Syntax 3 §3.2 via the WHATWG decoder). Until now only the chaos/mutated buckets exercised that path, organically and without AST expectations. The new bucket (weight 5 in both maps) injects one raw ill-formed sequence into a class/ID/attribute-name ident or quoted attribute string operand — lead/mid/trail/whole position, optionally behind a span type — and carries the post-scrub AST as its expectation. The per-class U+FFFD counts are pinned in INVALID_UTF8_CLASSES independently of wp_scrub_utf8(), so the AST round-trip is a real differential against the core scrub: lone continuation, truncated 2/3/4-byte leads, and invalid leads F5/FF decode to 1; overlong C0 80 / C1 BF to 2; surrogate half ED A0 80 to 3; beyond-max F4 90 80 80 to 4. An injected sequence is always followed by ASCII or end of input, so a continuation byte can never complete a truncated lead and shift the subpart boundaries. self-check gains a forced-bucket section (150 seeds): the selector must be invalid UTF-8, parse in both grammars, and parse to exactly the pinned AST; variety assertions require all subpart counts {1,2,3,4}, all four injection sites, and all ten byte classes. The class names and byte values are duplicated in the test deliberately — tallying from the generator's own table would shrink the assertion with a deleted entry and self-validate a drifted byte value (both demonstrated live in review). Adversarial review: three hostile reviewers. The spec reviewer verified the count table against an independently written WHATWG decoder (960 table contexts x 3 oracles; all 2880 site/position/class render combinations decode to the assumed post-scrub string; key-order-exact ASTs in both grammars; 3000-seed sweep clean). The test reviewer ran nine mutations — count drift, raw-byte expectations, suffix-guarantee removal, core scrub no-op, per-byte core scrub (killed exclusively by the two truncated classes that discriminate maximal-subpart from per-byte replacement), class deletion and de-selection, byte drift — all killed after two hardening rounds; two disclosed low-severity survivors remain (lone-continuation substring ambiguity; a class added to the table alone gets no variety pin). The integration reviewer confirmed the scrub-notice contract cannot flip (5200 constructed cases all invalid UTF-8), the lexbor differential stays live via the canonical re-render (zero skipped-utf8), digest determinism on every in-bucket seed in 1-400, and replay/minimizer behavior on raw-byte selectors. Gates: self-check OK; 5000 seeds, 0 failures (241 invalid-utf8 cases).
The mutated bucket's operations drew from a pure-ASCII alphabet, so the only ill-formed UTF-8 it produced came from delete/duplicate corrupting the pools' few multibyte characters (leads C3/CE/E2/F0 only). A new mutation kind (weight 12) splices one raw sequence from INVALID_UTF8_CLASSES at an arbitrary byte offset — possibly splitting an existing multibyte character or landing where a following continuation byte re-validates the string. These cases carry no AST expectation; they exercise crash, scrub-notice, and differential paths, and they make the worker's invalid-UTF-8 rejection branch hot (an unparseable invalid-UTF-8 selector expects scrub + two select() notices), which no bucket reached before: the invalid-utf8 bucket always parses and the chaos alphabets are valid UTF-8. self-check asserts the operation fires: at least 10 of 200 forced mutated seeds must contain a marker byte C0/C1/ED/F4/F5/FF (currently 28). The marker set is exactly the sound subset: those bytes cannot occur in any clean render (C0/C1/F5/FF never appear in valid UTF-8; ED/F4 only for U+D000-D7FF / above U+FFFFF, which no pool emits), while the four marker-free sequences (80, C3, E2 8C, F0 9F 82) reuse bytes that legitimate pool characters contain. Adversarial review: the same three hostile reviewers, all approved. Spec: splice arithmetic verified at every boundary (empty selector, at=0/length, cross-round corruption), a 20000-seed crash sweep with warnings escalated to exceptions came back clean, marker exclusivity confirmed against the pre-change generator (0 hits in 20000 seeds; the red loop reproduced exactly). Test adequacy: dead arm, dead weighted entry, and empty payload all collapse to 0/200 against the 28/200 baseline (threshold ~4 sigma below the mean under PRNG reshuffles); the one survivor (dropping only marker-free payloads) is probe diversity, not verification, and the bucket commit pins all ten classes. Integration: 5000 seeds 0 failures with byte-identical bucket distribution to the pre-change baseline, the notice contract verified self-keyed on the final byte string under 11 adversarial splice shapes including validity-restoring ones, determinism on every mutated seed in 1-400. Gates: self-check OK; 5000 seeds, 0 failures.
The handoff's open question — what does lexbor do with raw ill-formed UTF-8 in selectors — is resolved empirically: lexbor v3.0.0 accepts the bytes (no parse error) and replaces them with U+FFFD, but not per the WHATWG maximal-subpart rule CSS Syntax 3 §3.2 invokes. Truncated multi-byte sequences decode to one U+FFFD per byte (E2 8C to 2, spec 1; F0 9F 82 to 3, spec 1) and UTF-8-encoded surrogate halves decode permissively as a single unit (ED A0 80 to 1, spec 3); agreement on the other byte classes is coincidental overlap of the two algorithms. Drafted as UPSTREAM-ISSUES.md issue 6 with the probe table. On the document side lexbor keeps raw invalid bytes unchanged in the DOM (the same stance as the Tag Processor), so raw doc bytes match nothing in either engine. The differential needs no n/a gating for the invalid-utf8 bucket: the worker hands lexbor a canonical re-render of the post-scrub AST (pure ASCII), the same mechanism that sidesteps lexbor's other byte-level parsing bugs, so the bucket compares normally. NEXT-STEPS.md: the deferred scrub-coverage item resolves as implemented (bucket, splice, probe); the handoff's optional metamorphic relation parse(s) === parse(scrub(s)) is recorded as deliberately skipped (no public path bypasses from_selectors(), so it is near-tautological). New small open item from review: gen_chaos()'s whole-codepoint unicode branch is dead code (string-vs-key comparison), and its byte-sliced fallback is what makes chaos emit invalid UTF-8 organically (~15% of chaos cases) — making the branch live is a behavior decision now that deliberate ill-formed coverage exists. COVERAGE.md regenerated at the 3000-seed window: 396/424 = 93.4% raw, 408/424 = 96.2% effective. All 28 unreached lines accounted for: 12 phpdbg case-label artifacts, 12 defensive guards, the 2-line escape-decoder invalid-byte arm the scrub made unreachable through from_selectors() (pinned by PHPUnit), and 2 reachable lines this window misses (witnesses verified under phpdbg: '[' reaches the attribute length guard, '[a="b' reaches the string-to-EOF break). Stale 93.8%/96.8% references in NEXT-STEPS.md and FINDINGS.md updated to point at COVERAGE.md as the source of truth. Adversarial review: the same three hostile reviewers, all approved after two correction rounds. The issue-6 table was reproduced 10/10 rows against the pinned harness by two reviewers independently, the WHATWG column confirmed against an independent spec-transcribed decoder, and the ED-restricts-its-first-continuation-to-9F subpart reasoning checked against the Encoding Standard's ranges. The coverage table and uncovered-line list reproduced exactly; the case-label artifact demonstrated mechanistically (executable-but-unloggable label lines with executing bodies); the un-normalized-only claim verified in both directions (direct parse_ident hits the arm, from_selectors never does). Corrections from review: the issue-6 legend mislabeled U+FFFD counts as byte counts; stale coverage numbers contradicted the regenerated report; chaos's organic invalid UTF-8 was misattributed to pool corruption (it byte-slices its unicode alphabet); an 'all production moves' claim ignored mutated's residual organic corruption (~2% of pre-splice mutated cases). Gates: self-check OK; docs-only change — code identical to 8333b93, whose 5000-seed run was clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Trac ticket:
Use of AI Tools
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.