Parse XML-style <?target data?> processing instructions#12118

Open
foolip wants to merge 24 commits into main from foolip/pi-parsing

@foolip

foolip commented Jan 31, 2026

Member

Notable behaviors and their rationale:

  • While <? opens a PI, a > always closes it. This is to match the bogus
    comment tokenizer behavior, so that exactly the same characters are
    consumed as bogus comments in existing parsers. Any deviation would
    make this a much riskier parser change.
  • The target must start with ASCII alpha, but after that almost anything
    goes, matching tag names. This doesn't have to be this way, but
    there's no obvious other precedent to follow.
  • The target is ASCII-lowercased by the tokenizer, just like tag and
    attribute names. This is for consistency, and there's precedent from
    the SVG-in-HTML parser behavior.
  • If the target is a case-insensitive match for "xml" or "xml-stylesheet",
    it's instead treated as a bogus comment. <?xml?> is not a valid PI
    in XML, and <?xml-stylesheet href="style.css"?> would otherwise
    start loading stylesheets in HTML documents, which we don't want.
  • A ? that's not followed by a > becomes part of the data (never
    the target). This is to match XML for a <?t ???> where data is "??".
  • For <?t?d?> (invalid XML) the target is "t" and data is "?d". This
    behavior results from handling ? the same way wherever it's
    encountered. The example would serialize back as <?t ?d?>.
  • <?> and <? followed by EOF are treated as bogus comments, because
    this is the most conservative choice, and also avoids an empty target.
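
To make the rules above concrete, here is a minimal Python sketch of how a single `<?target data?>` construct might be classified. It is not the spec algorithm. The function name `classify_question_mark_markup` and the tuple return shape are hypothetical, the sketch only looks at input up to the first `>`, and treating every input without a `>` as a bogus comment is an assumption that goes beyond the `<?` plus EOF case listed above.

```python
ASCII_ALPHA = set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
HTML_WHITESPACE = "\t\n\f\r "
BLOCKED_TARGETS = {"xml", "xml-stylesheet"}


def ascii_lower(text):
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def classify_question_mark_markup(markup):
    """Return ("pi", target, data) or ("comment", data) for input starting with "<?"."""
    if not markup.startswith("<?"):
        raise ValueError("expected input starting with '<?'")
    end = markup.find(">")
    # Everything between "<" and ">" (or EOF): the span a bogus comment consumes.
    body = markup[1:end] if end != -1 else markup[1:]
    inner = body[1:]  # drop the leading "?"
    # Assumption: no ">" at all falls back to a bogus comment.
    if end == -1 or not inner or inner[0] not in ASCII_ALPHA:
        return ("comment", body)
    # The target runs until whitespace or "?"; a "?" is never part of the target.
    i = 0
    while i < len(inner) and inner[i] != "?" and inner[i] not in HTML_WHITESPACE:
        i += 1
    target = ascii_lower(inner[:i])
    if target in BLOCKED_TARGETS:
        return ("comment", body)
    data = inner[i:].lstrip(HTML_WHITESPACE)
    # A "?" directly before ">" closes the PI and is not part of the data.
    if data.endswith("?"):
        data = data[:-1]
    return ("pi", target, data)
```

With this sketch, `classify_question_mark_markup("<?t ???>")` returns `("pi", "t", "??")`, `classify_question_mark_markup("<?t?d?>")` returns `("pi", "t", "?d")`, and `classify_question_mark_markup("<?>")` returns `("comment", "?")`.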

(See WHATWG Working Mode: Changes for more details.)


/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/syntax.html ( diff )

Notable behaviors and their rationale:

- While `<?` opens a PI, a `>` always closes it. This is to match the bogus
  comment tokenizer behavior, so that exactly the same characters are
  consumed as bogus comments in existing parsers. Any deviation would
  make this a much riskier parser change.
- The target must start with ASCII alpha, but after that almost anything
  goes, matching tag names. This doesn't have to be this way, but
  there's no obvious precedent to follow.
- The target is ASCII-lowercased by the tokenizer, just like tag and
  attribute names. This is for consistency, and there's precedent from
  the SVG-in-HTML parser behavior.
- A `?` that isn't followed by `>` is treated like whitespace, similar
  to `/` in opening tags. This is also the simplest, with a single
  tokenizer state used whenever `?` is encountered.
- `<?>` and `<?` followed by EOF are treated as bogus comments, because
  this is the most conservative choice, and also avoids empty target.
@foolip foolip force-pushed the foolip/pi-parsing branch from f806ae3 to a26fe99 Compare January 31, 2026 08:09
@foolip foolip marked this pull request as ready for review January 31, 2026 21:15
@foolip

foolip commented Jan 31, 2026

Member Author

I've polished some more, this is close to ready now. Needs impl interest and tests, of course.

@noamr

noamr commented Jan 31, 2026

Contributor

> I've polished some more, this is close to ready now. Needs impl interest and tests, of course.

Nice! I can have tests ready within a few days.

@hsivonen hsivonen left a comment

Member

LGTM. Thanks.

@zcorpan

zcorpan commented Feb 2, 2026

Member

> For <?t?d?> (invalid XML) the target is "t" and data is "?d". This
> behavior results from handling ? the same way wherever it's
> encountered. The example would serialize back as <?t ?d?>.

Nit: As specified in https://html.spec.whatwg.org/#serialising-html-fragments it would serialize as <?t ?d> (without trailing ?)

@foolip

foolip commented Feb 3, 2026

Member Author

@zcorpan you're absolutely right, I was testing in Chrome and didn't realize the spec says otherwise.

@foolip

foolip commented Feb 3, 2026

Member Author

Test for serialization: web-platform-tests/wpt#57486

@zcorpan

zcorpan commented Feb 3, 2026

Member

Ah, I did not know Chromium and WebKit serialize with ?>... After discussing with @hsivonen a bit, we think it makes sense to change Gecko to serialize with ?> also.

This opens up for making the ? required for document conformance.

Pros to making it required:

  • It's a clear deliberate ending of the PI, so that an accidental > can be highlighted as a syntax error.
  • Matches XML (except HTML PI can't contain >).

Cons:

  • It's an extra character that's not strictly necessary.
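
The two serializations discussed above differ only in the closing delimiter. A minimal Python sketch follows; `serialize_pi` and its `question_mark_close` flag are hypothetical names used for illustration, not part of any spec or browser API.

```python
def serialize_pi(target, data, question_mark_close=False):
    """Serialize a processing instruction as "<?target data>" or "<?target data?>".

    question_mark_close=False follows the HTML fragment serialization text
    quoted in this thread; True mirrors the "?>" ending that Chromium and
    WebKit are reported to produce.
    """
    closer = "?>" if question_mark_close else ">"
    return f"<?{target} {data}{closer}"
```

For the example in this thread (target "t", data "?d"), `serialize_pi("t", "?d")` gives `<?t ?d>` and `serialize_pi("t", "?d", question_mark_close=True)` gives `<?t ?d?>`.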

@zcorpan

zcorpan commented Feb 3, 2026

Member

Also, maybe we can just support xml-stylesheet in HTML (but do nothing for XSLT)? It's already supported in the DOM: https://software.hixie.ch/utilities/js/live-dom-viewer/saved/14486

I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty.

@noamr

noamr commented Feb 3, 2026

Contributor

> Also, maybe we can just support xml-stylesheet in HTML (but do nothing for XSLT)? It's already supported in the DOM: https://software.hixie.ch/utilities/js/live-dom-viewer/saved/14486
>
> I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty.

Yea we could probably also parse that PI for <?xml> without that doing too much damage as nobody is looking at it.

@foolip

foolip commented Feb 4, 2026

Member Author

But since we have to constrain the syntax for compat, I think the safest bet is what @noamr suggested, starts with alpha and continues with alphanumeric or hyphen.

Starts with ASCII alpha and continues with ASCII alphanumeric or hyphen seems OK.

I've made this change in 64f133a.
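
As an illustration of the constraint just described (ASCII alpha first, then ASCII alphanumerics or hyphens), here is a small Python check. The helper name `is_allowed_pi_target` is hypothetical; `re.fullmatch` is used so that a trailing newline cannot slip through.

```python
import re

# First character: ASCII alpha. Remaining characters: ASCII alphanumeric or hyphen.
_PI_TARGET = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def is_allowed_pi_target(candidate):
    """Return True if candidate fits the constrained target syntax."""
    return _PI_TARGET.fullmatch(candidate) is not None
```

Under this check, `foo` and `x-1` are accepted, while `1a`, `-a`, `a_b`, and the empty string are rejected.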

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this pull request Feb 11, 2026
The parser now recognizes <?target data> as a ProcessingInstruction and
adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment.
  We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to
/^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of
the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118

I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
ajperel pushed a commit to chromium/chromium that referenced this pull request Feb 11, 2026
The parser now recognizes <?target data> as a ProcessingInstruction and
adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment.
  We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to
/^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of
the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118

I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}
@foolip

foolip commented Feb 13, 2026

Member Author

We need to decide what to do when we encounter EOF between <? and >. That will determine whether and how much we need to use the temporary buffer, and in my opinion these are the most reasonable options:

  • Drop PIs on EOF: This is what we do with tags, suggested by @zcorpan in #12118 (comment). Combined with making the target case-preserving alphanumeric+hyphens, this means we will know at the first non-alphanumeric-or-hyphen character whether this will be a PI or a comment. The characters up to that point will become either the PI target or the start of a bogus comment.

  • Emit a comment on EOF: This preserves existing behavior in every case that doesn't produce a PI. This will require the temporary buffer for at least the whitespace between target and data, but would be easiest to explain as buffering everything up to >. If we do this, we could easily lowercase the target as well, but it would still need to be alphanumeric+hyphens so that <?lit$$123456789$> creates a comment, not a PI.

I'll be OOO next week, but it would be great to make a decision. My preference is to drop PIs on EOF, and the others with opinions are probably @annevk, @hsivonen, and @zcorpan.
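
The two EOF options above can be compared with a small Python sketch. It is only an approximation under stated assumptions: the function name `tokenize_with_eof_policy`, the `on_eof` parameter, and the tuple return shape are hypothetical; the target is kept case-preserving; the xml/xml-stylesheet blocklist is left out for brevity; and reaching EOF while still inside the target is handled the same way as EOF inside the data.

```python
import re

_TARGET = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
_WHITESPACE = "\t\n\f\r "


def tokenize_with_eof_policy(markup, on_eof):
    """Classify input starting with "<?".

    Returns ("pi", target, data), ("comment", data), or None when the token
    is dropped. on_eof is "drop" or "comment".
    """
    if not markup.startswith("<?"):
        raise ValueError("expected input starting with '<?'")
    if on_eof not in ("drop", "comment"):
        raise ValueError("on_eof must be 'drop' or 'comment'")
    end = markup.find(">")
    # Bogus comment data: everything between "<" and ">" (or EOF).
    comment = ("comment", markup[1:end] if end != -1 else markup[1:])
    rest = markup[2:]
    match = _TARGET.match(rest)
    if match is None:
        return comment
    # The first character after the target-like prefix decides PI vs comment.
    following = rest[match.end():match.end() + 1]  # "" means EOF
    if following and following not in _WHITESPACE + "?>":
        return comment
    if end == -1:
        # EOF between "<?" and ">": the two policies diverge only here.
        return None if on_eof == "drop" else comment
    data = rest[match.end():end - 2].lstrip(_WHITESPACE)
    if data.endswith("?"):
        data = data[:-1]
    return ("pi", match.group(0), data)
```

For `<?foo bar` with no closing `>`, the "drop" policy returns `None` and the "comment" policy returns `("comment", "?foo bar")`. Under either policy, `<?lit$$123456789$>` is a comment, because `$` follows the target-like prefix.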

@zcorpan

zcorpan commented Feb 13, 2026

Member

For start tags, it was changed in be72d87

Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html

Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply.

I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat.

@noamr

noamr commented Feb 13, 2026

Contributor

> For start tags, it was changed in be72d87
>
> Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html
>
> Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply.
>
> I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat.

Yea creating a comment here makes the arguments from the original email not too relevant because the PI itself is never created. Creating a comment here is slightly more compatible with current behavior and makes debugging slightly easier. The buffering complexity is manageable IMO.

(Though it's also not a huge deal to drop it for simplicity)

lando-worker Bot pushed a commit to mozilla-firefox/firefox that referenced this pull request Feb 16, 2026
…TML parser, a=testonly

Automatic update from web-platform-tests
Support processing instructions in the HTML parser

The parser now recognizes <?target data> as a ProcessingInstruction and
adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment.
  We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to
/^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of
the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118

I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}

--

wpt-commits: 7f5ee752ac61d861ff3481791e1db9da25378061
wpt-pr: 57716
@foolip

foolip commented Mar 4, 2026

Member Author

I've sent whatwg/dom#1454 with the API changes for ProcessingInstruction to treat the PI data as attributes. The data string and an attribute map are kept in sync whenever one of them changes.
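
To illustrate the idea of keeping a data string and an attribute map in sync, here is a purely hypothetical Python sketch. It is not taken from whatwg/dom#1454: the class name `SyncedPI`, the method `set_attribute`, and the pseudo-attribute grammar (quoted `name="value"` pairs, with values that contain no double quotes) are all assumptions made for illustration.

```python
import re

# Assumed pseudo-attribute syntax: name="value" or name='value'.
_PSEUDO_ATTR = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


class SyncedPI:
    """Toy processing instruction whose data and attribute map stay in sync."""

    def __init__(self, target, data=""):
        self.target = target
        self.data = data  # goes through the setter, which also fills the map

    @property
    def data(self):
        return self._data

    @data.setter
    def data(self, value):
        # Changing the data string re-derives the attribute map.
        self._data = value
        self._attrs = {
            m.group(1): m.group(2) if m.group(2) is not None else m.group(3)
            for m in _PSEUDO_ATTR.finditer(value)
        }

    @property
    def attributes(self):
        return dict(self._attrs)  # copy, so callers cannot desync the map

    def set_attribute(self, name, value):
        # Changing the map re-serializes the data string
        # (assumes values contain no double quotes).
        self._attrs[name] = value
        self._data = " ".join(f'{k}="{v}"' for k, v in self._attrs.items())
```

Either direction of change goes through one code path, so the two views cannot drift apart. Note that in this sketch, re-serializing normalizes quoting and spacing in the data string.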

@justinfagnani

Meta: Is there an issue open to discuss features and design, or is this PR the best place?

Forging ahead anyway: A lot of use cases for PIs involve using them as a virtual container to mark ranges of nodes within the same parent. Sort of like an element that doesn't affect styling, layout, or the element tree at all.

Would there be a place in the parser for a generalized start and end notation, somewhat akin to elements, so that it would be easy and fast to retrieve these pairs of PIs?

They would follow the rules of well-formed element start and end tags: pairs have to have the same parent and must be balanced. Pairs would have references to each other. Any unclosed, unopened, or otherwise unbalanced PIs would omit the reference.

@noamr

noamr commented Mar 4, 2026

Contributor

> Meta: Is there an issue open to discuss features and design, or is this PR the best place?

If this is for the use of PIs for marking ranges for streaming, see #11542
For any kind of more brainstormy incubation, please open an issue in https://github.com/wicG/declarative-partial-updates/.

> Forging ahead anyway: A lot of use cases for PIs involve using them as a virtual container to mark ranges of nodes within the same parent. Sort of like an element that doesn't affect styling, layout, or the element tree at all.

Right

> Would there be a place in the parser for a generalized start and end notation, somewhat akin to elements, so that it would be easy and fast to retrieve these pairs of PIs?

Probably not in the parser, but #11542 discusses using those ranges of <?start> and <?end> PIs and also some sort of JS API to retrieve them.

> They would follow the rules of well-formed element start and end tags: pairs have to have the same parent and must be balanced. Pairs would have references to each other. Any unclosed, unopened, or otherwise unbalanced PIs would omit the reference.

Currently you can have a <?start> without an <?end>; it then continues until the end of the element.
#11542 is the place to discuss this, as this PR is one level below - defining how PI parsing happens without describing how they form ranges.

@zcorpan zcorpan left a comment

Member

6 participants