Parse XML-style <?target data?> processing instructions #12118
Notable behaviors and their rationale:
- While `<?` opens a PI, a `>` always closes it. This is to match the bogus comment tokenizer behavior, so that exactly the same characters are consumed as bogus comments in existing parsers. Any deviation would make this a much riskier parser change.
- The target must start with an ASCII alpha, but after that almost anything goes, matching tag names. It doesn't have to be this way, but there's no obvious precedent to follow.
- The target is ASCII-lowercased by the tokenizer, just like tag and attribute names. This is for consistency, and there's precedent from the SVG-in-HTML parser behavior.
- A `?` that isn't followed by `>` is treated like whitespace, similar to `/` in opening tags. This is also the simplest option, with a single tokenizer state used whenever `?` is encountered.
- `<?>` and `<?` followed by EOF are treated as bogus comments, because this is the most conservative choice, and it also avoids an empty target.
|
I've polished some more, this is close to ready now. Needs impl interest and tests, of course. |
Nice! I can have tests ready within a few days. |
Nit: As specified in https://html.spec.whatwg.org/#serialising-html-fragments it would serialize as |
|
@zcorpan you're absolutely right, I was testing in Chrome and didn't realize the spec says otherwise. |
|
Test for serialization: web-platform-tests/wpt#57486 |
|
Ah, I did not know Chromium and WebKit serialize with This opens up for making the Pros to making it required:
Cons:
|
|
Also, maybe we can just support I checked these 6 pages https://docs.google.com/spreadsheets/d/1o04eP_BwH1u7X8CyyLUvxOsntZNmfrDalfhU7ldVqlU/edit?gid=233477192#gid=233477192&range=C12 , only one has a PI pointing to external CSS but that file is empty. |
Yea we could probably also parse that PI for |
The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118
I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}
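To make the constraints in the CL description concrete, here is a minimal Python sketch that classifies one complete `<?...>` construct as a PI or a bogus comment. Only the target pattern, the `xml`/`xml-stylesheet` blocklist, and the "`?>` at the end ignores the `?`" rule come from the description. The function name `classify`, the tuple shapes, and the choice to end the target at the first ASCII whitespace are assumptions made for illustration. This is not the Chromium implementation.

```python
import re

# Target pattern quoted in the CL description.
TARGET_RE = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
# Targets that stay bogus comments, per the CL description.
# Exact match only: case handling is not stated in the description.
BLOCKLIST = {"xml", "xml-stylesheet"}
ASCII_WHITESPACE = " \t\n\f\r"


def classify(markup):
    """Classify one complete '<?...>' construct (hypothetical helper)."""
    if not markup.startswith("<?") or markup.find(">") != len(markup) - 1:
        raise ValueError("expected exactly one complete '<?...>' construct")
    body = markup[2:-1]                 # everything between '<?' and '>'
    bogus = ("comment", "?" + body)     # bogus comment data keeps the '?'

    # A '?' directly before the closing '>' is ignored.
    content = body[:-1] if body.endswith("?") else body

    # Assumption: the target runs up to the first ASCII whitespace.
    cut = next(
        (i for i, ch in enumerate(content) if ch in ASCII_WHITESPACE),
        len(content),
    )
    target = content[:cut]
    data = content[cut:].lstrip(ASCII_WHITESPACE)

    if TARGET_RE.fullmatch(target) is None or target in BLOCKLIST:
        return bogus
    return ("pi", target, data)
```

For example, `classify("<?target data?>")` yields a PI with target `target` and data `data`, while `classify('<?xml version="1.0"?>')` stays a comment whose data is `?xml version="1.0"?`.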
|
We need to decide what to do when we encounter EOF between

Drop PIs on EOF: This is what we do with tags, suggested by @zcorpan in #12118 (comment). This combined with making the target case-preserving alphanumeric+hyphens means we will know at the first non-alphanumeric-or-hyphen if this will be a PI or comment. The characters up to that point will become either the PI target or the start of a bogus comment.

Emit a comment on EOF: This preserves existing behavior in every case that doesn't produce a PI. This will require the temporary buffer for at least the whitespace between target and data, but would be easiest to explain as buffering everything up to

I'll be OOO next week, but it would be great to make a decision. My preference is to drop PIs on EOF, and the others with opinions are probably @annevk, @hsivonen, and @zcorpan. |
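A rough Python sketch of how the two options would differ for input that ends before any `>`. The helper name `tokens_at_eof` and the token tuples are hypothetical. The sketch assumes an alphanumeric+hyphen target that must be followed by ASCII whitespace or `?` to count as a PI, and it assumes that EOF in the middle of a target is treated like EOF after it. Neither assumption is settled in this thread.

```python
ASCII_WHITESPACE = " \t\n\f\r"


def tokens_at_eof(markup, policy):
    """Tokens emitted for an unterminated '<?...' under each policy (hypothetical)."""
    if policy not in ("drop", "comment"):
        raise ValueError("policy must be 'drop' or 'comment'")
    if not markup.startswith("<?") or ">" in markup:
        raise ValueError("expected an unterminated '<?...' construct")
    body = markup[2:]
    # Today's behavior: a bogus comment whose data starts with the '?'.
    as_comment = [("comment", "?" + body)]

    # Scan the candidate target: ASCII alphanumerics and hyphens.
    i = 0
    while i < len(body) and (
        (body[i].isascii() and body[i].isalnum()) or body[i] == "-"
    ):
        i += 1
    starts_with_alpha = i > 0 and body[0].isalpha()
    # Assumption: EOF, whitespace, or '?' right after the target means "PI".
    target_ended_cleanly = i == len(body) or body[i] in ASCII_WHITESPACE + "?"

    if not (starts_with_alpha and target_ended_cleanly):
        return as_comment          # never a PI, so unchanged under both policies
    return [] if policy == "drop" else as_comment
```

Under this sketch, `<?foo bar` followed by EOF produces nothing with `"drop"` and a comment with data `?foo bar` with `"comment"`, while `<?1abc` produces the same bogus comment either way.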
|
For start tags, it was changed in be72d87.

Email: https://lists.w3.org/Archives/Public/public-html/2009Apr/0233.html in reply to https://lists.w3.org/Archives/Public/public-html/2009Mar/0260.html

Since PIs have pseudo-attributes and are part of UA processing, it seems to me similar arguments apply. I guess creating a comment should be safe, but it requires more buffering and adds complexity. I think it shouldn't be needed for web compat. |
Yea creating a comment here makes the arguments from the original email not too relevant because the PI itself is never created. Creating a comment here is slightly more compatible with current behavior and makes debugging slightly easier. The buffering complexity is manageable IMO. (Though it's also not a huge deal to drop it for simplicity) |
…TML parser, a=testonly

Automatic update from web-platform-tests
Support processing instructions in the HTML parser

The parser now recognizes <?target data> as a ProcessingInstruction and adds it to the DOM instead of a bogus comment.

As per spec PR:
- xml/xml-stylesheet are blocklisted, and stay a bogus comment. We can add more of these if there are compat issues.
- A PI can appear wherever a comment appears.
- ?> at the end ignores the ?

Currently in this CL, PI targets are constrained to /^[A-Za-z][A-Za-z0-9-]*$/.

Added a VTS that keeps current behavior, so that we don't lose some of the existing html5lib tests while this is in development.

See spec PR: whatwg/html#12118
I2P: https://groups.google.com/a/chromium.org/d/msgid/blink-dev/6981ee47.050a0220.baa59.0100.GAE%40google.com
Bug: 481087638
Change-Id: I1dd22c09f0b2961d07e8d73a1de1c10c91655be0
Reviewed-on: https://chromium-review.googlesource.com/c/chromium/src/+/7532085
Commit-Queue: Noam Rosenthal <nrosenthal@google.com>
Reviewed-by: Philip Jägenstedt <foolip@chromium.org>
Reviewed-by: Dominic Farolino <dom@chromium.org>
Cr-Commit-Position: refs/heads/main@{#1583351}
--

wpt-commits: 7f5ee752ac61d861ff3481791e1db9da25378061
wpt-pr: 57716
|
I've sent whatwg/dom#1454 with the API changes for |
|
Meta: Is there an issue open to discuss features and design, or is this PR the best place? Forging ahead anyway: A lot of use cases for PIs involve using them as a virtual container to mark ranges of nodes within the same parent. Sort of like an element that doesn't affect styling, layout, or the element tree at all. Would there be a place in the parser for a generalized start and end notation, somewhat akin to elements, so that it would be easy and fast to retrieve these pairs of PIs? They would follow the rules of well-formed element start and end tags: pairs have to have the same parent and must be balanced. Pairs would have references to each other. Any unclosed, unopened, or otherwise unbalanced PIs would omit the reference. |
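To illustrate the pairing rules described in the comment above, here is a small Python sketch that links balanced start/end PIs within one parent's child list using a stack. Everything here is hypothetical: the `PI` class, its `name`, `is_end`, and `partner` fields, and the `pair_siblings` helper are not part of any proposal, and no start/end notation is assumed. A mismatched end PI is simply left without a reference, which is one possible reading of "otherwise unbalanced".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(eq=False)  # identity comparison; partners reference each other
class PI:
    name: str                       # hypothetical: shared by a start/end pair
    is_end: bool = False            # hypothetical: however the notation spells it
    partner: Optional["PI"] = None  # stays None when unbalanced


def pair_siblings(children: List[object]) -> None:
    """Link balanced start/end PIs among the children of a single parent."""
    open_stack: List[PI] = []
    for node in children:
        if not isinstance(node, PI):
            continue                # elements, text, comments are ignored
        if not node.is_end:
            open_stack.append(node)
        elif open_stack and open_stack[-1].name == node.name:
            start = open_stack.pop()
            start.partner, node.partner = node, start
        # Otherwise the end PI is unopened or crosses another range:
        # it keeps partner=None and the stack is left untouched.
```

Because the function only ever sees one child list, pairs necessarily share a parent. Start PIs still on the stack at the end are unclosed and keep `partner` as `None`.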
If this is for the use of PIs for marking ranges for streaming, see #11542
Right
Probably not in the parser, but #11542 discusses using those ranges of
Currently you can have a |
Notable behaviors and their rationale:
- While `<?` opens a PI, a `>` always closes it. This is to match the bogus comment tokenizer behavior, so that exactly the same characters are consumed as bogus comments in existing parsers. Any deviation would make this a much riskier parser change.
- The target must start with an ASCII alpha, but after that almost anything goes, matching tag names. It doesn't have to be this way, but there's no obvious other precedent to follow.
- The target is ASCII-lowercased by the tokenizer, just like tag and attribute names. This is for consistency, and there's precedent from the SVG-in-HTML parser behavior.
- If the target is `xml` or `xml-stylesheet`, it's instead treated as a bogus comment. This is because `<?xml?>` is not a valid PI in XML, and `<?xml-stylesheet href="style.css"?>` would otherwise start loading stylesheets in HTML documents, which we don't want.
- A `?` that's not followed by a `>` becomes part of the data (never the target). This is to match XML for a `<?t ???>` where data is "??". For `<?t?d?>` (invalid XML) the target is "t" and data is "?d". This behavior results from handling `?` the same way wherever it's encountered. The example would serialize back as `<?t ?d?>`.
- `<?>` and `<?` followed by EOF are treated as bogus comments, because this is the most conservative choice, and it also avoids an empty target.
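The rules in this list can be sketched as a small Python function. This is an illustration under stated assumptions, not the spec algorithm: `Token`, `tokenize_pi`, and `serialize` are hypothetical names, the target is assumed to end at ASCII whitespace or `?`, a single `?` directly before `>` is assumed not to be part of the data, and EOF after a target has started is left out because the list does not cover it.

```python
import string
from dataclasses import dataclass

ASCII_WHITESPACE = " \t\n\f\r"
BLOCKED_TARGETS = {"xml", "xml-stylesheet"}
ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


@dataclass
class Token:
    kind: str          # "pi" or "comment"
    target: str = ""
    data: str = ""


def tokenize_pi(markup: str) -> Token:
    """Tokenize the construct starting at '<?' (hypothetical helper)."""
    if not markup.startswith("<?"):
        raise ValueError("input must start with '<?'")
    end = markup.find(">", 2)               # the first '>' always closes
    body = markup[2:] if end == -1 else markup[2:end]
    bogus = Token("comment", data="?" + body)

    # '<?>', '<?' + EOF, and non-alpha starts stay bogus comments.
    if not body or not (body[0].isascii() and body[0].isalpha()):
        return bogus
    if end == -1:
        raise ValueError("EOF after a target has started is not covered here")

    # Assumption: one '?' directly before '>' is not part of the data.
    content = body[:-1] if body.endswith("?") else body

    # The target ends at whitespace or '?', so '?' is never part of it.
    cut = 0
    while cut < len(content) and content[cut] not in ASCII_WHITESPACE + "?":
        cut += 1
    target = content[:cut].translate(ASCII_LOWER)   # ASCII-lowercase only
    if target in BLOCKED_TARGETS:
        return bogus
    data = content[cut:].lstrip(ASCII_WHITESPACE)
    return Token("pi", target, data)


def serialize(token: Token) -> str:
    """Serialize in the '?>' form used by the example in the list above."""
    return f"<?{token.target} {token.data}?>"
```

With this sketch, `tokenize_pi("<?t ???>")` gives target `t` and data `??`, and `serialize(tokenize_pi("<?t?d?>"))` gives `<?t ?d?>`, matching the examples in the list.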
(See WHATWG Working Mode: Changes for more details.)
/index.html ( diff )
/infrastructure.html ( diff )
/parsing.html ( diff )
/syntax.html ( diff )