I’m reading it at commit 8636551.
86: must have the scheme present
TWUS: Describes in the 4.2 URL parsing section how a parser should
accept URLs without a scheme.
IIRC the TWUS parser only accepts input without a scheme when there’s a base URL. In these cases, the input is relative.
86 has this grammar, which seems equivalent?
URI-reference = URI / relative-ref
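To make the relative-ref case concrete, here is a minimal sketch using Python's standard `urllib.parse`, which roughly follows the RFC-style grammar. It is not the TWUS parser and differs from it in many edge cases; the URLs are made-up example values.

```python
from urllib.parse import urljoin, urlsplit

# Hypothetical example values, chosen only for illustration.
base = "https://example.com/docs/index.html"
relative = "guide/intro.html"

# On its own, the relative reference has no scheme and no authority.
parts = urlsplit(relative)
print(repr(parts.scheme), repr(parts.netloc))  # '' ''

# It only becomes a full URL once it is resolved against a base.
resolved = urljoin(base, relative)
print(resolved)  # https://example.com/docs/guide/intro.html
```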
It also there divides parsers into "Non-web-browser implementations"
without specifying how to make that distinction.
In this specific instance, I think "Non-web-browser" means anything that doesn’t also implement https://w3c.github.io/FileAPI/ since the difference between "basic URL parser" and "URL parser" is all about blob: URLs.
TWUS: says a parser must accept one to an infinite amount of slashes
I think this is really not a big deal. It could just as well be 5 max, but 5 is arbitrary and less theoretically pleasing than http://www.catb.org/jargon/html/Z/Zero-One-Infinity-Rule.html
Real world: 32 bit numbers occur, and are automagically supported if
typical OS level name resolver functions are used
When I looked into it, it seemed hard to choose not to support it in such functions. (The most a program could do is recognize such "exotic" IPv4 syntax and reject it with a parse error, if it doesn’t want to resolve the IP address.)
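As a rough sketch of that "recognize and reject" option, assuming the program wants to allow only plain dotted-quad decimal addresses: the function name `classify_host` and the exact rules below are my own illustration, not taken from either spec or from any resolver.

```python
import re

# A part that a lenient resolver might read as a number (decimal, octal or hex).
_NUMERIC_PART = re.compile(r"0[xX][0-9a-fA-F]*|[0-9]+")
# A plain decimal octet without leading zeros.
_PLAIN_OCTET = re.compile(r"0|[1-9][0-9]{0,2}")

def classify_host(host: str) -> str:
    """Return "name" or "ipv4"; raise ValueError for exotic numeric forms."""
    parts = host.split(".")
    if not all(_NUMERIC_PART.fullmatch(p) for p in parts):
        return "name"  # not purely numeric, treat it as a host name
    if len(parts) == 4 and all(
        _PLAIN_OCTET.fullmatch(p) and int(p) <= 255 for p in parts
    ):
        return "ipv4"
    # Single 32 bit numbers, hex, leading-zero octal, too few parts, and so on.
    raise ValueError(f"exotic IPv4 syntax rejected: {host!r}")

print(classify_host("192.168.1.1"))  # ipv4
print(classify_host("example.com"))  # name
```

With this, an input such as `3232235777` or `0xC0.0xA8.1.1` fails up front instead of being handed to the resolver.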
TWUS: Doesn't specify IDNA 2003 nor 2008, but somehow that's still clear
It specifies Unicode TR46, which fully defines algorithms independently of IDNA 2003 or 2008. (Though it is based on the Punycode RFC.)
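For the Punycode part only, a tiny sketch with Python's built-in `punycode` codec. This deliberately skips the mapping, normalization and validation steps that TR46 adds on top, so it is not a TR46 implementation; the label is just an example value.

```python
# Example label, chosen only for illustration.
label = "bücher"

# Punycode-encode the label and add the ACE prefix by hand.
ace = "xn--" + label.encode("punycode").decode("ascii")
print(ace)  # xn--bcher-kva

# Decoding the part after the prefix gives the original label back.
decoded = ace[4:].encode("ascii").decode("punycode")
print(decoded == label)  # True
```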
Real world: at least curl and wget2 ignore "rubbish" entered after the number all the way to the next component divider
Personal opinion: it sounds problematic to silently ignore part of the input?
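To show what I mean, a small sketch contrasting the two behaviours. `parse_port_lenient` is only a loose model of the behaviour described in the quoted line, not code from curl or wget2, and both function names are hypothetical.

```python
import re
from typing import Optional

def parse_port_lenient(text: str) -> Optional[int]:
    """Take the leading digits and silently drop whatever follows."""
    match = re.match(r"[0-9]+", text)
    return int(match.group()) if match else None

def parse_port_strict(text: str) -> Optional[int]:
    """Accept only ASCII digits in range; anything else is a parse error."""
    if text == "":
        return None  # no port given
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"invalid port: {text!r}")
    port = int(text)
    if port > 65535:
        raise ValueError(f"port out of range: {text!r}")
    return port

print(parse_port_lenient("8080abc"))  # 8080, the "abc" is lost without notice
print(parse_port_strict("8080"))      # 8080
```

The strict variant raises `ValueError` for `"8080abc"`, so the caller at least learns that part of the input was not understood.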
A TWUS URL thus needs other magic to know where a URL ends.
For example, in <a href="…"> the HTML syntax defines exactly where the value of the href attribute ends, so there is no need for magic.
If URLs need to be found in the middle of a free-form paragraph of text without any markup, there’s a lot more magic (and heuristics) required than splitting on spaces. I think defining this does not belong in a URL spec.
TWUS has a test suite (that only runs in JavaScript-enabled browsers).
Part (arguably the most important part) of this test suite has its test cases in a JSON file that can be used without JavaScript (and is in rust-url).
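A sketch of how such a file could drive a non-JavaScript parser. The entry shape here (`input`, `base`, and either `href` or `failure`, with bare strings as comments) is my assumption about the format, the sample data is made up, and `toy_parse` is a stand-in for a real parser.

```python
import json

# Made-up sample in the assumed format; not copied from the real file.
SAMPLE = """
[
  "A bare string entry is a comment and is skipped",
  {"input": "http://example.com/a", "base": null, "href": "http://example.com/a"},
  {"input": "http://exa mple.com/", "base": null, "failure": true}
]
"""

def run_cases(raw_json, parse):
    """Run parse(input, base) on each case; return inputs that mismatch."""
    mismatches = []
    for case in json.loads(raw_json):
        if isinstance(case, str):
            continue  # comment entry
        try:
            href = parse(case["input"], case.get("base"))
        except ValueError:
            href = None  # the parser reported a failure
        expected = None if case.get("failure") else case["href"]
        if href != expected:
            mismatches.append(case["input"])
    return mismatches

def toy_parse(text, base):
    """Stand-in parser: rejects spaces, otherwise echoes the input."""
    if " " in text:
        raise ValueError("space in input")
    return text

print(run_cases(SAMPLE, toy_parse))  # []
```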