Skip to content

HTML API: Fix nested anchors in MathML text integration points#41

Open
sirreal wants to merge 5 commits into
trunkfrom
fix/html-api-mathml-nested-anchor
Open

HTML API: Fix nested anchors in MathML text integration points#41
sirreal wants to merge 5 commits into
trunkfrom
fix/html-api-mathml-nested-anchor

Conversation

@sirreal

@sirreal sirreal commented Jun 9, 2026

Copy link
Copy Markdown
Owner

Summary

  • Add focused HTML API breadcrumb regressions for nested anchors inside MathML text integration points.
  • Stop treating non-current open-elements removals as normal pop events.
  • Queue the removed anchor's virtual closer once the public breadcrumb stack unwinds to its original depth, including fragment and full-parser EOF.

Testing

  • composer install
  • WP_TESTS_SKIP_INSTALL=1 vendor/bin/phpunit --group html-api --filter 'test_reports_nested_anchor_breadcrumbs_inside_mathml_text_integration_point|test_removes_outer_anchor_breadcrumb_after_mathml_text_integration_point_closes|test_visits_outer_anchor_virtual_closer_after_mathml_text_integration_point_closes|test_visits_outer_anchor_virtual_closer_at_end_of_fragment|test_visits_outer_anchor_virtual_closer_before_full_parser_eof_closers'
  • WP_TESTS_SKIP_INSTALL=1 vendor/bin/phpunit --group html-api tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php
  • WP_TESTS_SKIP_INSTALL=1 vendor/bin/phpunit --group html-api
  • WP_TESTS_SKIP_INSTALL=1 vendor/bin/phpunit --group html-api-html5lib-tests
  • vendor/bin/phpcs --standard=phpcs.xml.dist src/wp-includes/html-api/class-wp-html-open-elements.php src/wp-includes/html-api/class-wp-html-processor.php tests/phpunit/tests/html-api/wpHtmlProcessorBreadcrumbs.php
  • git diff --check trunk..HEAD

Notes

The original fuzzer replay artifact referenced in the handoff was removed during worktree cleanup, so verification is covered by direct unit regressions for the minimized input and the related fragment/full-parser EOF paths.

@sirreal sirreal marked this pull request as ready for review June 9, 2026 21:41
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@sirreal

sirreal commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

Code review

Found 1 issue:

  1. The breadcrumb-depth scan matches by node name and latches onto same-named foreign elements, regressing trailing-content breadcrumbs. When the removed anchor's depth is recorded, the scan walks down from the top of the breadcrumbs to the first crumb named A. A foreign a element (MathML or SVG a — which never participates in the active formatting elements and so is not the removed node) sitting between the removed HTML anchor and the integration point matches first, storing the wrong depth. The virtual closer then never fires (the injection guards correctly suppress it at the foreign element's depth), and the removed anchor's breadcrumb persists for the remainder of the document.

    Reproduction (fragment parser): <a><math><a><mtext>x<a>y</a></mtext></a></math>z<span>t — the third <a> is in a MathML text integration point (mtext), triggering the same non-LIFO removal of the outer anchor as this PR's <mi> test cases. The stored breadcrumb_depth is 5 (the foreign a) instead of 3 (the removed outer anchor). With this branch, the trailing text and span report [HTML, BODY, A, #text] and [HTML, BODY, A, SPAN], and matches_breadcrumbs( array( 'A', 'SPAN' ) ) falsely matches — but per spec (verified against a lexbor-backed HTML5 parser) the outer a was removed from the stack of open elements, so z and the span are siblings of it under BODY. Trunk gets these trailing breadcrumbs right ([HTML, BODY, #text], [HTML, BODY, SPAN]), because its spurious immediate pop happens to rebalance the count — so this is a behavioral regression for everything after the foreign subtree, not just a missed fix. The SVG variant <a><svg><a><foreignObject>x<a>y</a></foreignObject></a></svg>z<span>t fails identically. Consider recording the removed node's actual breadcrumb index (derivable from its stack position at removal time) instead of scanning by name.

$breadcrumb_depth = count( $this->breadcrumbs );
while ( 0 < $breadcrumb_depth && $this->breadcrumbs[ $breadcrumb_depth - 1 ] !== $item->node_name ) {
--$breadcrumb_depth;
}
$this->non_lifo_breadcrumb_removals[] = array(
'token' => $item,
'breadcrumb_depth' => $breadcrumb_depth,
);

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Scanning breadcrumbs by node name latched onto same-named foreign
elements (MathML or SVG A) between the removed HTML anchor and the
integration point, storing the wrong depth. The virtual closer then
never fired and the stale anchor breadcrumb persisted for the rest of
the document. Record the removed node's position in the stack of open
elements instead, accounting for the fragment parser's context crumb.
@sirreal

sirreal commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

Following up on the automated review — one actionable item, nothing blocking.

Add a test for a new same-name opener immediately after the subtree closes. The new tests cover the removed outer <a>'s virtual closer when followed by text, at fragment EOF, and before full-parser EOF closers — but not when the next token is a fresh <a> opener:

<a><math><mi>x<a>y</a></mi></math><a>z

I ran this; it behaves correctly (the outer <a> virtual closer fires before the new <a> opens). But it's the single input that exercises the get_adjusted_current_node() guard and the same-name $next_event lookahead together, and it's exactly the case that looks broken on a first read. Closest existing test to extend:

public function test_visits_outer_anchor_virtual_closer_after_mathml_text_integration_point_closes() {
$processor = WP_HTML_Processor::create_fragment( '<a><math><mi>x<a>y</a></mi></math>z' );

Worth pinning so a future refactor of those guards can't silently regress it.

Minor: the matching in queue_virtual_closer_after_non_lifo_removal() is keyed on node_name and reads as identity-unsafe at a glance — the $next_event->token !== $removed_token && ...->node_name === ...->node_name lookahead is what actually recovers identity. A one-line comment stating the invariant (only one pending non-LIFO removal matches at this depth; identity is disambiguated via the queued POP) would save the next reader the same double-take that snagged the review.

// At EOF, normal stack pops may be queued and processed after the stack is empty.
$adjusted_current_node = $this->get_adjusted_current_node();
if ( isset( $adjusted_current_node ) && end( $this->breadcrumbs ) === $adjusted_current_node->node_name ) {
return false;
}
$next_event = reset( $this->element_queue );
if (
false !== $next_event &&
WP_HTML_Stack_Event::POP === $next_event->operation &&
$next_event->token !== $removed_token &&
$next_event->token->node_name === $removed_token->node_name
) {
return false;

@sirreal

sirreal commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

Note: a nearby integration-point behavior that looks like a bug but is correct

While fuzzing this branch, the differential oracle flagged this input as a tree mismatch:

<h2><math><mi>a</h2><x-0>b</x-0>

It is not a bug. The HTML API output matches the spec and Chromium exactly — </h2> is ignored and <x-0> is placed inside the MathML mi:

HTML
└─ BODY
   └─ H2
      └─ math
         └─ mi
            ├─ "a"
            └─ x-0
               └─ "b"

Why: </h2> arrives with mi (a MathML text integration point) as the current node. End tags in an integration point are dispatched through the foreign-content rules, which walk up to the HTML-namespace h2 and hand off to the "in body" heading end-tag steps. Those require the heading to be in scope, but a MathML mi is a scope boundary, so h2 is not in scope and the token is ignored. <x-0> (a start tag in a text integration point → HTML rules) is then inserted into mi.

Live DOM viewer (matches this output): https://software.hixie.ch/utilities/js/live-dom-viewer/?%3Ch2%3E%3Cmath%3E%3Cmi%3Ea%3C%2Fh2%3E%3Cx-0%3Eb%3C%2Fx-0%3E

The disagreement came from PHP's Dom\HTMLDocument (the fuzzer's oracle), which reparents <x-0> out of mi as a sibling of h2 — a known divergence of that parser from the spec/browsers, not an HTML API defect.

Added a regression test pinning this so it can't silently change: test_heading_end_tag_in_mathml_text_integration_point_is_ignored in wpHtmlProcessorBreadcrumbs.php.

A heading end tag (</h2>) inside a MathML text integration point (MI) is
ignored because MI is a scope boundary, so following content stays inside
the integration point rather than becoming a sibling of the heading. This
matches the spec and browsers; add a regression test so it can't silently
change.
sirreal added a commit that referenced this pull request Jun 10, 2026
sirreal added a commit that referenced this pull request Jun 10, 2026
A new A opener immediately after the foreign subtree closes is the one
input where the adjusted-current-node guard and the same-name next-event
lookahead must cooperate: the removed outer A's virtual closer must fire
before the new same-name element opens as a sibling. Add a regression
test walking every A visit, and document the identity invariant that the
next-event lookahead recovers in the queueing guard.
@sirreal

sirreal commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

Both items addressed in 10b3976:

  • Same-name opener test: test_visits_outer_anchor_virtual_closer_before_same_name_opener walks every A visit for <a><math><mi>x<a>y</a></mi></math><a>z and asserts the full sequence — inner open/close inside MI, the removed outer A's virtual closer at [HTML, BODY], then the new opener at [HTML, BODY, A] as a sibling, and its EOF closer.
  • Identity invariant comment: added above the $next_event lookahead in queue_virtual_closer_after_non_lifo_removal(), stating that the depth/name checks can't distinguish same-named elements at the same depth and that the queued POP recovers identity.

sirreal added a commit that referenced this pull request Jun 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant