Add Unicode ToASCII fallback for ASCII domains#914
Conversation
Unfortunately implementing Unicode ToASCII ends up rejecting websites such as - http://xn--72czcrhaj7cpt0ed1dxb4mb1s1.blogspot.com/2018/11/blog-post_60.html - https://xn--8i7caa.famitei.net/sekou/all - https://xn--board-ngr.palungjit.org/members/ which is not really acceptable. So when ToASCII fails, domain is all ASCII, and there are no forbidden domain code points, we just return the lowercased domain. If an implementation does not surface validation errors this is indistinguishable from having an ASCII fast path. Tests: web-platform-tests/wpt#60476
https://bugs.webkit.org/show_bug.cgi?id=316554 rdar://177686282 Reviewed by Alex Christensen. Unfortunately web compatibility is forcing our hand here to favor pure ASCII inputs, even when ToUnicode cannot do anything meaningful with them. Specification change: whatwg/url#914 Tests: web-platform-tests/wpt#60476 Canonical link: https://commits.webkit.org/314820@main
domenic
left a comment
There was a problem hiding this comment.
It might be worth considering making the fast-path normative, since that's what implementations are going to want anyway. Right now it appears like the spec code is trying to avoid calling Unicode ToASCII more than once, but that's not really a useful "optimization" since the actual implementation code will be quite different.
1. If beStrict is false and domain is an ASCII string:
a. If Unicode ToASCII(domain, flags(false)) is failure, domain-to-ASCII validation error // obvious validation-error-only step
b. result = ASCII-lowercase(domain)
2. Otherwise:
a. result = Unicode ToASCII(domain, flags(beStrict))
b. If result is failure: domain-to-ASCII validation error; return failure
3. If beStrict is false:
a. If result is "": domain-to-ASCII validation error; return failure
b. If result contains a forbidden domain code point: domain-invalid-code-point validation error; return failure
4. Return result
domenic
left a comment
There was a problem hiding this comment.
This looks good to me, modulo some wording suggestions.
Two more options for your consideration, both based around trying to make the "beStrict vs. not-beStrict" divergence more obvious (since non-validating parsers only ever implement beStrict = false):
- Let result be null.
- If beStrict is true:
- Set result to the result of running Unicode ToASCII with ...
- If result is a failure value, domain-to-ASCII validation error, return failure.
- Otherwise:
- If domain is an ASCII string:
- If running Unicode ToASCII with ..., domain-to-ASCII validation error.
- Set result to domain, lowercased.
- Otherwise:
- Set result to the result of running Unicode ToASCII with ...
- If result is a failure value, domain-to-ASCII validation error, return failure.
- If result is the empty string, domain-to-ASCII validation error, return failure.
- If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
- If domain is an ASCII string:
- Assert: result is not the empty string and does not contain a forbidden domain code point.
- Return result.
- If beStrict is true:
- Let result be the result of running Unicode ToASCII with ...
- If result is a failure value, domain-to-ASCII validation error, return failure.
- Return result.
- Let result be null.
- If domain is an ASCII string:
- If running Unicode ToASCII with ..., domain-to-ASCII validation error.
- Set result to domain, lowercased.
- Otherwise:
- Set result to the result of running Unicode ToASCII with ...
- If result is a failure value, domain-to-ASCII validation error, return failure.
- If result is the empty string, domain-to-ASCII validation error, return failure.
- If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
- Return result.
(drops the assert in the beStrict case)
|
This seems to make domains that violate the Bidi rule like 2.xn--mgbb work. Is that intended? I get it sidesteps the phishing issue, but that seems weird. |
|
Yes, unfortunately we learned that there are enough domain labels out there starting with |
|
My condolences. |
|
After further thought: I really like this. I'm making a Rust crate for URLs with an emphasis on manipulating them according to the semantics they actually have. Relevantly there's a DomainHost type with methods like It's been a real pain and I was just starting on actually hooking it up into the setters when I found this, and now I just... don't have to do that. All the code I wrote this past week is still useful for implementing ToUnicode but now the parts I've been dreading, the parts I was going to start working on tonight, can just be skipped. All I have to do is revert a few minor changes and tweak my domain encoder. Edit: Okay there was a bit of serious plumbing I had to rip out so it's more like 80% of my work wasn't wasted. My point and opinion stand. I maintain my condolences because this has got to be a really frustrating thing to have to further mangle the spec over but I want you to know that there's at least one person who genuinely benefits from this. You deserve to know your work is appreciated. I assume this is basically guaranteed to be merged mostly as-is and I can just immediately pivot my code to using it? I'm not concerned about being ahead of the spec since I have no users and am still in beta, so when this is merged isn't important to me. |
|
Just checking, and sorry if I'm being/have been annoying/weird, but does and should this provide an equivalent fallback for Domain to Unicode? Technically having it return failure still sidesteps the phishing issue but it'd be really awkward to have DomainHost::decode be fallible. |
I think so, we need a fallback. If we don't do this, we may end up in a situation where
To solve this problem, I would suggest changing the second step of
A few examples:
|
|
Thanks, well spotted! I decided to always return the domain as-is when ToUnicode records an error and also stop reporting redundant validation errors. This matches Safari, seems on the safe side, and since it's not exactly normative and implementations are allowed to deviate it seems reasonable. |
https://bugs.webkit.org/show_bug.cgi?id=317539 Reviewed by NOBODY (OOPS!). A potential roundtripping issue surfaced in whatwg/url#914 that we did not have test coverage for. In particular because after 314820@main to ASCII succeeds more often, there's a risk that ToUnicode returns something that does not roundtrip. We do not have this issue because when to Unicode records an error, we return the input as-is. Document this with a test so we won't accidentally regress it. This is an API test because to Unicode is not exposed to the web at the moment.
https://bugs.webkit.org/show_bug.cgi?id=317539 Reviewed by Alex Christensen. A potential roundtripping issue surfaced in whatwg/url#914 that we did not have test coverage for. In particular because after 314820@main to ASCII succeeds more often, there's a risk that ToUnicode returns something that does not roundtrip. We do not have this issue because when to Unicode records an error, we return the input as-is. Document this with a test so we won't accidentally regress it. This is an API test because to Unicode is not exposed to the web at the moment. Canonical link: https://commits.webkit.org/315690@main
|
I can see why this is being done, but I don't like this, since this introduces a different whole-domain mode, which means that implementations that with the previous spec text could produce output as they did a single pass over the domain now need a different API shape. It looks to me like domain to ASCII and domain to Unicode still reject different domains. I think both operations should accept or reject consistently. (Of course, in actual browsers, the domain to Unicode operation does not exist and the operation is really domain to display, but we should keep the implementation-defined stuff to whether domain to display chooses to output Punycode for a label that domain to Unicode outputs Unicode for. I think we shouldn't have implementation-dependent behavior other than that.) It looks to me like the new spec text bypasses forbidden domain code point enforcement when the input is all-ASCII. That should be fixed. |
When beStrict is false and we have to handle forbidden domain code points, they are handled consistently in step 6 of https://whatpr.org/url/914.html#concept-domain-to-ascii for both ASCII and non-ASCII inputs alike. |
|
Sorry, somehow I thought I saw an early return at step 3, but it was just setting result for later processing. |
I don't really understand this feedback as these are quite different operations from the URL standard perspective. Domain to ASCII takes a string and turns it into failure or a domain, whereas domain to Unicode takes a domain and potentially transforms it into a Unicode form or keeps it as-is (and notably never rejects). |
Is that kind of definition of 'domain to Unicode' good, though? Before this PR, the URL Standard just said how to set the UTS 46 flags and provided an ASCII deny list that's not representable as UTS 46 flags. Arguably, it's a spec bug that in the spec currently, "domain to Unicode" with beStrict set to false doesn't enforce the ASCII deny list. (In the This means that in the spec currently if you fix the spec bug of "domain to Unicode" with beStrict set to false doesn't enforce the ASCII deny list and there's a DoS-avoidance length limit on labels that also has the effect of the Punycode encode step in UTS 46 ToASCII never failing, "domain to ASCII" and "domain to Unicode" accept/reject the same set of inputs. (AFAICT, in UTS 46, VerifyDnsLength and overflows in Punycode encode are the possible sources of differences between "ToASCII" and "ToUnicode", and when beStrict is false, as it normally is, the URL Standard passes false for VerifyDnsLength, and the Punycode encode failure issue doesn't arise with reasonable label lengths.) So it's currently the case that both the processing mode that goes towards DNS resolution in a browser and the processing that goes towards the UI in a browser agree on what's rejected and what's accepted. It seems to me that changing that opens the opportunity for bugs. I don't have an end-to-end scenario: Breaking the identity just looks on its face like and opportunity for stuff to go wrong. Even if the new thing that's easy to define in the "domain to ASCII" case is harder to define in an equivalent way in the "domain to Unicode" case, I think it's worthwhile to make the effort define "domain to Unicode" in a way that makes both accept and reject the same set of inputs when beStrict is false and the labels have reasonable length (what exactly reasonable length is is a bit fuzzy). |
|
I think it's good that domain to Unicode always operates on the non-failure output of domain to ASCII, yes. That means we have a single entry point to get a domain. And then a single entry point for potentially displaying that domain differently. Instead of two points to parse something into a domain and displaying it somehow. This was already the case before this PR. (As we concluded on Chat domain to Unicode shouldn't need beStrict. We can just always use false. I'll fix that in a follow-up PR along with somewhat clearer names that don't suggest symmetry.) |
https://bugs.webkit.org/show_bug.cgi?id=317539 Reviewed by Alex Christensen. A potential roundtripping issue surfaced in whatwg/url#914 that we did not have test coverage for. In particular because after 314820@main to ASCII succeeds more often, there's a risk that ToUnicode returns something that does not roundtrip. We do not have this issue because when to Unicode records an error, we return the input as-is. Document this with a test so we won't accidentally regress it. This is an API test because to Unicode is not exposed to the web at the moment. Canonical link: https://commits.webkit.org/315690@main
Unfortunately implementing Unicode ToASCII ends up rejecting websites such as
which is not really acceptable. So when ToASCII fails, domain is all ASCII, and there are no forbidden domain code points, we just return the lowercased domain. If an implementation does not surface validation errors this is indistinguishable from having an ASCII fast path.
Tests: web-platform-tests/wpt#60476
(See WHATWG Working Mode: Changes for more details.)
Preview | Diff