Parse XML-style <?target data?> processing instructions by foolip · Pull Request #12118 · whatwg/html

foolip · 2026-01-31T00:22:11Z

Notable behaviors and their rationale:

- While <? opens a PI, a > always closes it. This is to match the bogus
  comment tokenizer behavior, so that exactly the same characters are
  consumed as bogus comments in existing parsers. Any deviation would
  make this a much riskier parser change.
- The target must start with ASCII alpha, but after that almost anything
  goes, matching tag names. This doesn't have to be this way, but
  there's no obvious other precedent to follow.
- The target is ASCII-lowercased by the tokenizer, just like tag and
  attribute names. This is for consistency, and there's precedent from
  the SVG-in-HTML parser behavior.
- If the target is a case-insensitive match for "xml" or "xml-stylesheet",
  it's instead treated as a bogus comment. <?xml?> is not a valid PI
  in XML, and <?xml-stylesheet href="style.css"?> would otherwise
  start loading stylesheets in HTML documents, which we don't want.
- A ? that's not followed by a > becomes part of the data (never
  the target). This is to match XML for a <?t ???> where data is "??".
  For <?t?d?> (invalid XML) the target is "t" and data is "?d". This
  behavior results from handling ? the same way wherever it's
  encountered. The example would serialize back as <?t ?d?>.
- <?> and <? followed by EOF are treated as bogus comments, because
  this is the most conservative choice, and also avoids an empty target.
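
The rules above can be sketched as a small function. This is only an illustrative approximation of the description in this comment, not the spec's state machine: the names `tokenize_pi`, `HTML_WHITESPACE`, and `BLOCKED_TARGETS` are made up here, whitespace between target and data is assumed to be skipped, and any input without a closing > is assumed to stay a bogus comment.

```python
from typing import Tuple

HTML_WHITESPACE = " \t\n\f\r"
BLOCKED_TARGETS = {"xml", "xml-stylesheet"}  # hypothetical name for the blocklist


def ascii_lower(text: str) -> str:
    """Lowercase only A-Z, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def tokenize_pi(markup: str) -> Tuple[str, ...]:
    """Tokenize markup starting with '<?'.

    Returns ("pi", target, data) or ("comment", data).
    """
    assert markup.startswith("<?")
    end = markup.find(">")
    # A bogus comment holds everything after "<" up to the first ">",
    # including the leading "?".
    raw = markup[1:] if end == -1 else markup[1:end]
    comment = ("comment", raw)
    if end == -1:
        return comment  # assumption: unterminated input stays a comment
    body = raw[1:]  # drop the leading "?"
    if body.endswith("?"):
        body = body[:-1]  # a "?" directly before ">" is not part of the data
    # The target runs up to the first whitespace or "?".
    i = 0
    while i < len(body) and body[i] not in HTML_WHITESPACE + "?":
        i += 1
    target = body[:i]
    if not target or not (target[0].isascii() and target[0].isalpha()):
        return comment  # covers <?> and targets not starting with ASCII alpha
    target = ascii_lower(target)
    if target in BLOCKED_TARGETS:
        return comment
    data = body[i:].lstrip(HTML_WHITESPACE)
    return ("pi", target, data)
```

For example, `tokenize_pi("<?t?d?>")` gives target "t" and data "?d", while `tokenize_pi("<?xml version='1.0'?>")` keeps the same characters a bogus comment would consume today.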

At least two implementers are interested (and none opposed):
- Chromium (@foolip + @noamr)
- …
Tests are written and can be reviewed and commented upon at:
- https://github.com/web-platform-tests/wpt/blob/master/html/syntax/parsing/parse-processing-instruction.tentative.html
- TODO: PR to make them non-tentative
Implementation bugs are filed:
- Chromium: …
- Gecko: …
- WebKit: …
- Deno (only for timers, structured clone, base64 utils, channel messaging, module resolution, web workers, and web storage): …
- Node.js (only for timers, structured clone, base64 utils, channel messaging, and module resolution): …
Corresponding HTML AAM & ARIA in HTML issues & PRs:
MDN issue is filed: …
The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)

/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/syntax.html ( diff )

Notable behaviors and their rationale:

- While `<?` opens a PI, a `>` always closes it. This is to match the bogus comment tokenizer behavior, so that exactly the same characters are consumed as bogus comments in existing parsers. Any deviation would make this a much riskier parser change.
- The target must start with ASCII alpha, but after that almost anything goes, matching tag names. This doesn't have to be this way, but there's no obvious precedent to follow.
- The target is ASCII-lowercased by the tokenizer, just like tag and attribute names. This is for consistency, and there's precedent from the SVG-in-HTML parser behavior.
- A `?` that isn't followed by `>` is treated like whitespace, similar to `/` in opening tags. This is also the simplest, with a single tokenizer state used whenever `?` is encountered.
- `<?>` and `<?` followed by EOF are treated as bogus comments, because this is the most conservative choice, and also avoids an empty target.

foolip · 2026-01-31T21:16:20Z

I've polished some more, this is close to ready now. Needs impl interest and tests, of course.

noamr · 2026-01-31T21:18:20Z

> I've polished some more, this is close to ready now. Needs impl interest and tests, of course.

Nice! I can have tests ready within a few days.

hsivonen

LGTM. Thanks.

zcorpan · 2026-02-02T15:33:12Z

> For <?t?d?> (invalid XML) the target is "t" and data is "?d". This
> behavior results from handling ? the same way wherever it's
> encountered. The example would serialize back as <?t ?d?>.

Nit: As specified in https://html.spec.whatwg.org/#serialising-html-fragments it would serialize as <?t ?d> (without trailing ?)

foolip · 2026-02-03T08:40:16Z

@zcorpan you're absolutely right, I was testing in Chrome and didn't realize the spec says otherwise.

foolip · 2026-02-03T08:42:16Z

Test for serialization: web-platform-tests/wpt#57486

zcorpan · 2026-02-03T10:55:11Z

Ah, I did not know Chromium and WebKit serialize with ?>... After discussing with @hsivonen a bit, we think it makes sense to change Gecko to serialize with ?> also.

This opens up the possibility of making the ? required for document conformance.

Pros to making it required:

- It's a clear deliberate ending of the PI, so that an accidental > can be highlighted as a syntax error.
- Matches XML (except HTML PI can't contain >).

Cons:

- It's an extra character that's not strictly necessary.
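
To make the two serializations discussed in this exchange concrete, here is a minimal sketch. The function name `serialize_pi` and its `trailing_question_mark` flag are invented for illustration; the default follows the form quoted from the spec (`<?t ?d>`), and the flag switches to the `?>` ending that Chromium and WebKit are said to produce.

```python
def serialize_pi(target: str, data: str, trailing_question_mark: bool = False) -> str:
    """Serialize a processing instruction as "<?" target " " data closer."""
    closer = "?>" if trailing_question_mark else ">"
    return f"<?{target} {data}{closer}"


# The <?t?d?> example from the PR description parses to target "t", data "?d".
spec_form = serialize_pi("t", "?d")
browser_form = serialize_pi("t", "?d", trailing_question_mark=True)
```

Here `spec_form` is `<?t ?d>` and `browser_form` is `<?t ?d?>`; both parse back to the same target and data under the rule that a ? directly before > is dropped.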

zcorpan · 2026-02-03T11:14:11Z

Also, maybe we can just support xml-stylesheet in HTML (but do nothing for XSLT)? It's already supported in the DOM: https://software.hixie.ch/utilities/js/live-dom-viewer/saved/14486

I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty.

noamr · 2026-02-03T12:33:09Z

> Also, maybe we can just support xml-stylesheet in HTML (but do nothing for XSLT)? It's already supported in the DOM: https://software.hixie.ch/utilities/js/live-dom-viewer/saved/14486
>
> I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty.

Yea we could probably also parse that PI for <?xml> without that doing too much damage as nobody is looking at it.

foolip · 2026-02-04T22:36:46Z

But since we have to constrain the syntax for compat, I think the safest bet is what @noamr suggested, starts with alpha and continues with alphanumeric or hyphen.

Starts with ASCII alpha and continues with ASCII alphanumeric or hyphen seems OK.

I've made this change in 64f133a.
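
As a rough illustration of the constraint adopted at this point in the thread (ASCII alpha first, then ASCII alphanumerics or hyphens), the check could look like the sketch below. The names `PI_TARGET`, `BLOCKED`, and `classify_target` are hypothetical, and the case-insensitive blocklist is taken from the PR description rather than from any implementation. Later commits in this PR adjust the rule (for example allowing underscore), which this sketch does not reflect.

```python
import re

# Target syntax as of commit 64f133a: alpha, then alphanumerics or hyphens.
PI_TARGET = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
BLOCKED = {"xml", "xml-stylesheet"}  # assumed to be matched case-insensitively


def classify_target(candidate: str) -> str:
    """Return "pi" if the candidate may become a PI target, else "comment"."""
    if PI_TARGET.fullmatch(candidate) is None:
        return "comment"
    if candidate.lower() in BLOCKED:
        return "comment"
    return "pi"
```

With this rule, a candidate such as `lit$$123456789$` falls back to a comment because `$` is outside the allowed set.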

The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118
I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com

Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}

foolip · 2026-02-13T08:32:47Z

We need to decide what to do when we encounter EOF between <? and >. That will determine whether and how much we need to use the temporary buffer, and in my opinion these are the most reasonable options:

- Drop PIs on EOF: This is what we do with tags, suggested by @zcorpan in #12118 (comment). This combined with making the target case-preserving alphanumeric+hyphens means we will know at the first non-alphanumeric-or-hyphen if this will be a PI or comment. The characters up to that point will become either the PI target or the start of a bogus comment.

- Emit a comment on EOF: This preserves existing behavior in every case that doesn't produce a PI. This will require the temporary buffer for at least the whitespace between target and data, but would be easiest to explain as buffering everything up to >. If we do this, we could easily lowercase the target as well, but targets would still need to be alphanumeric+hyphens so that <?lit$$123456789$> creates a comment, not a PI.

I'll be OOO next week, but it would be great to make a decision. My preference is to drop PIs on EOF, and the others with opinions are probably @annevk, @hsivonen, and @zcorpan.
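
The difference between the two options can be sketched for input that reaches EOF before any >. This is a simplified model, not the proposed spec text: `tokenize_at_eof` and its `policy` argument are hypothetical, the xml/xml-stylesheet blocklist is ignored, and EOF right after a valid target is assumed to count as a PI in progress.

```python
import re
from typing import Optional, Tuple

TARGET = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
HTML_WHITESPACE = " \t\n\f\r"


def tokenize_at_eof(text: str, policy: str) -> Optional[Tuple[str, str]]:
    """Handle text that starts with '<?' and hits EOF before any '>'.

    policy is "drop" (emit nothing for an unfinished PI) or
    "comment" (emit the buffered characters as a comment).
    """
    assert text.startswith("<?") and ">" not in text
    if policy not in ("drop", "comment"):
        raise ValueError(f"unknown policy: {policy}")
    body = text[2:]
    match = TARGET.match(body)
    rest = body[match.end():] if match else body
    # The token type is decided at the first character that cannot be
    # part of a target: whitespace or "?" means PI, anything else comment.
    is_pi = match is not None and (rest == "" or rest[0] in HTML_WHITESPACE + "?")
    if not is_pi:
        return ("comment", "?" + body)  # existing bogus comment behavior
    if policy == "drop":
        return None  # like an unfinished start tag, nothing is emitted
    return ("comment", "?" + body)  # requires buffering everything seen so far
```

Under "drop", `<?target data` at EOF produces no token, while under "comment" it yields the same comment existing parsers create. Input like `<?lit$$123` is a comment either way.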

zcorpan · 2026-02-13T09:09:35Z

For start tags, it was changed in be72d87

Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html

Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply.

I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat.

noamr · 2026-02-13T13:17:56Z

> For start tags, it was changed in be72d87
>
> Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html
>
> Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply.
>
> I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat.

Yea creating a comment here makes the arguments from the original email not too relevant because the PI itself is never created. Creating a comment here is slightly more compatible with current behavior and makes debugging slightly easier. The buffering complexity is manageable IMO.

(Though it's also not a huge deal to drop it for simplicity)

…TML parser, a=testonly

Automatic update from web-platform-tests

Support processing instructions in the HTML parser

The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118
I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com

Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}
--
wpt-commits: 7f5ee752ac61d861ff3481791e1db9da25378061
wpt-pr: 57716

zcorpan · 2026-02-19T14:39:04Z

Need to allow PIs here

https://html.spec.whatwg.org/#:~:text=srcdoc%2E-,The,whitespace%2E,-The%20above

and

https://html.spec.whatwg.org/#writing:~:text=Documents%20must%20consist,whitespace%2E,-The%20various
https://html.spec.whatwg.org/#writing:~:text=may%20consist%20of%20any%20text%2C%20character%20references%2C%20elements%2C%20and%20comments
https://html.spec.whatwg.org/#writing:~:text=can%20have%20text%2C%20character%20references%2C%20CDATA%20sections%2C%20other%20elements%2C%20and%20comments
https://html.spec.whatwg.org/#writing:~:text=can%20have%20text%2C%20character%20references%2C%20other%20elements%2C%20and%20comments

Need to list PIs alongside comments in https://html.spec.whatwg.org/#optional-tags

foolip · 2026-03-04T15:32:55Z

I've sent whatwg/dom#1454 with the API changes for ProcessingInstruction to treat the PI data as attributes. The data string and an attribute map are kept in sync whenever one of them changes.

justinfagnani · 2026-03-04T17:02:22Z

Meta: Is there an issue open to discuss features and design, or is this PR the best place?

Forging ahead anyway: A lot of use cases for PIs involve using them as a virtual container to mark ranges of nodes within the same parent. Sort of like an element that doesn't affect styling, layout, or the element tree at all.

Would there be a place in the parser for a generalized start and end notation, somewhat akin to elements, so that it would be easy and fast to retrieve these pairs of PIs?

They would follow the rules of well-formed element start and end tags: pairs have to have the same parent and must be balanced. Pairs would have references to each other. Any unclosed, unopened, or otherwise unbalanced PIs would omit the reference.
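
One way to picture the proposed pairing is a stack over the PI targets found among a single parent's children. This is purely a sketch of the idea in this comment, not anything specified: the function `pair_markers` is hypothetical, the targets "start" and "end" are placeholders, and leaving unmatched markers without a partner is one possible reading of "otherwise unbalanced".

```python
from typing import Dict, List


def pair_markers(sibling_targets: List[str]) -> Dict[int, int]:
    """Map each paired marker's index to its partner's index.

    sibling_targets lists PI targets among one parent's children, in order.
    Markers that cannot be balanced are simply absent from the result.
    """
    open_starts: List[int] = []
    pairs: Dict[int, int] = {}
    for index, target in enumerate(sibling_targets):
        if target == "start":
            open_starts.append(index)
        elif target == "end" and open_starts:
            opener = open_starts.pop()  # nearest unmatched start
            pairs[opener] = index
            pairs[index] = opener
    return pairs
```

Nested ranges pair inside out, and a stray end or a start that is never closed gets no reference.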

noamr · 2026-03-04T17:06:27Z

> Meta: Is there an issue open to discuss features and design, or is this PR the best place?

If this is for the use of PIs for marking ranges for streaming, see #11542
For any kind of more brainstormy incubation, please open an issue in https://github.com/wicG/declarative-partial-updates/.

> Forging ahead anyway: A lot of use cases for PIs involve using them as a virtual container to mark ranges of nodes within the same parent. Sort of like an element that doesn't affect styling, layout, or the element tree at all.

Right

> Would there be a place in the parser for a generalized start and end notation, somewhat akin to elements, so that it would be easy and fast to retrieve these pairs of PIs?

Probably not in the parser, but #11542 discusses using those ranges of <?start> and <?end> PIs and also some sort of JS API to retrieve them.

> They would follow the rules of well-formed element start and end tags: pairs have to have the same parent and must be balanced. Pairs would have references to each other. Any unclosed, unopened, or otherwise unbalanced PIs would omit the reference.

Currently you can have a <?start> without an <?end>, it continues until the end of the element.
#11542 is the place to discuss this, as this PR is one level below - defining how PI parsing happens without describing how they form ranges.

zcorpan

See #12118 (comment)

annevk reviewed Jan 31, 2026

foolip force-pushed the foolip/pi-parsing branch from f806ae3 to a26fe99 on January 31, 2026 08:09

Emit end-of-file tokens

d785561

annevk reviewed Jan 31, 2026

noamr reviewed Jan 31, 2026

foolip added 6 commits January 31, 2026 12:53

Use temporary buffer to disallow <?xml and <?xml-stylesheet

f9315d9

Update the syntax section

3e24e4e

Shorten code points in prose

06dfd68

Add an easter egg

f87fd45

Drop the unexpected-question-mark-in-pi error; append the ? like in XML

6b9f624

Expand on syntax; make xml and xml-stylesheet targets an error

6b4c1e8

foolip marked this pull request as ready for review January 31, 2026 21:15

Expand error names (still not the longest)

205c1d7

foolip mentioned this pull request Feb 2, 2026

Out of order HTML streaming ("patching") #11542

Open

hsivonen approved these changes Feb 2, 2026

s/Othwerwise/Otherwise/

b8d6d03

zcorpan reviewed Feb 2, 2026

s/comment/processing instruction/

e1f8182

zcorpan mentioned this pull request Feb 3, 2026

Test serialization of processing instructions in HTML web-platform-tests/wpt#57486

Open

noamr mentioned this pull request Feb 3, 2026

Addressable comments (a very small DOM parts subset) WICG/webcomponents#1116

Open

otherdaniel mentioned this pull request Feb 3, 2026

Processing Instructions WICG/sanitizer-api#370

Closed

Require PI target to be alphanumeric+hyphens or it becomes a comment

64f133a

foolip added 2 commits February 6, 2026 08:51

Editorial: use the temporary buffer until token type is decided (never discard)

5aebf7e

Preserve PI target case and replace with comment on error

e7be184

annevk reviewed Feb 8, 2026

foolip added 3 commits February 10, 2026 21:17

Make target case-sensitive in syntax and error desc

4deed12

leave out "but has no effect"

85030c6

lowercase after assert

65216c4

chromium-wpt-export-bot mentioned this pull request Feb 11, 2026

Support processing instructions in the HTML parser web-platform-tests/wpt#57716

Merged

foolip mentioned this pull request Mar 9, 2026

Add <template for> for declarative out-of-order streaming #11818

Open

6 tasks

Lowercase PI target in tokenizer (needs temporary buffer)

214190b

zcorpan requested changes Mar 18, 2026

zcorpan reviewed Mar 18, 2026

zcorpan mentioned this pull request Mar 18, 2026

New ATTRIBUTES block w3c/webvtt#523

Closed

kkoyung mentioned this pull request May 28, 2026

Implement Sanitizer API servo/servo#43948

Open

24 tasks

foolip added 3 commits June 15, 2026 11:34

Allow underscore in PI target

4e67263

Update note about HTML/XML differences

f5b4b18

Don't lowercase PI target

070f538
