Skip to content

Add Unicode ToASCII fallback for ASCII domains#914

Merged
annevk merged 6 commits into
mainfrom
annevk/idna-ascii
Jun 25, 2026
Merged

Add Unicode ToASCII fallback for ASCII domains#914
annevk merged 6 commits into
mainfrom
annevk/idna-ascii

Conversation

@annevk

@annevk annevk commented Jun 8, 2026

Copy link
Copy Markdown
Member

Unfortunately implementing Unicode ToASCII ends up rejecting websites such as

which is not really acceptable. So when ToASCII fails, domain is all ASCII, and there are no forbidden domain code points, we just return the lowercased domain. If an implementation does not surface validation errors this is indistinguishable from having an ASCII fast path.

Tests: web-platform-tests/wpt#60476

(See WHATWG Working Mode: Changes for more details.)


Preview | Diff

Unfortunately implementing Unicode ToASCII ends up rejecting websites such as

- http://xn--72czcrhaj7cpt0ed1dxb4mb1s1.blogspot.com/2018/11/blog-post_60.html
- https://xn--8i7caa.famitei.net/sekou/all
- https://xn--board-ngr.palungjit.org/members/

which is not really acceptable. So when ToASCII fails, domain is all ASCII, and there are no forbidden domain code points, we just return the lowercased domain. If an implementation does not surface validation errors this is indistinguishable from having an ASCII fast path.

Tests: web-platform-tests/wpt#60476
webkit-commit-queue pushed a commit to annevk/WebKit that referenced this pull request Jun 9, 2026
https://bugs.webkit.org/show_bug.cgi?id=316554
rdar://177686282

Reviewed by Alex Christensen.

Unfortunately web compatibility is forcing our hand here to favor pure
ASCII inputs, even when ToUnicode cannot do anything meaningful with
them.

Specification change: whatwg/url#914

Tests: web-platform-tests/wpt#60476
Canonical link: https://commits.webkit.org/314820@main

@domenic domenic left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth considering making the fast-path normative, since that's what implementations are going to want anyway. Right now it appears like the spec code is trying to avoid calling Unicode ToASCII more than once, but that's not really a useful "optimization" since the actual implementation code will be quite different.

1. If beStrict is false and domain is an ASCII string:
   a. If Unicode ToASCII(domain, flags(false)) is failure, domain-to-ASCII validation error // obvious validation-error-only step
   b. result = ASCII-lowercase(domain)
2. Otherwise:
   a. result = Unicode ToASCII(domain, flags(beStrict))
   b. If result is failure: domain-to-ASCII validation error; return failure
3. If beStrict is false:
   a. If result is "": domain-to-ASCII validation error; return failure
   b. If result contains a forbidden domain code point: domain-invalid-code-point validation error; return failure
4. Return result

Comment thread url.bs Outdated

@domenic domenic left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, modulo some wording suggestions.

Two more options for your consideration, both based around trying to make the "beStrict vs. not-beStrict" divergence more obvious (since non-validating parsers only ever implement beStrict = false):


  1. Let result be null.
  2. If beStrict is true:
    1. Set result to the result of running Unicode ToASCII with ...
    2. If result is a failure value, domain-to-ASCII validation error, return failure.
  3. Otherwise:
    1. If domain is an ASCII string:
      1. If running Unicode ToASCII with ..., domain-to-ASCII validation error.
      2. Set result to domain, lowercased.
    2. Otherwise:
      1. Set result to the result of running Unicode ToASCII with ...
      2. If result is a failure value, domain-to-ASCII validation error, return failure.
    3. If result is the empty string, domain-to-ASCII validation error, return failure.
    4. If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
  4. Assert: result is not the empty string and does not contain a forbidden domain code point.
  5. Return result.

  1. If beStrict is true:
    1. Let result be the result of running Unicode ToASCII with ...
    2. If result is a failure value, domain-to-ASCII validation error, return failure.
    3. Return result.
  2. Let result be null.
  3. If domain is an ASCII string:
    1. If running Unicode ToASCII with ..., domain-to-ASCII validation error.
    2. Set result to domain, lowercased.
  4. Otherwise:
    1. Set result to the result of running Unicode ToASCII with ...
    2. If result is a failure value, domain-to-ASCII validation error, return failure.
  5. If result is the empty string, domain-to-ASCII validation error, return failure.
  6. If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
  7. Return result.

(drops the assert in the beStrict case)

Comment thread url.bs Outdated
@Scripter17

Copy link
Copy Markdown

This seems to make domains that violate the Bidi rule like 2.xn--mgbb work. Is that intended? I get it sidesteps the phishing issue, but that seems weird.

@annevk

annevk commented Jun 15, 2026

Copy link
Copy Markdown
Member Author

Yes, unfortunately we learned that there are enough domain labels out there starting with xn-- that do not actually follow the Punycode rules.

@Scripter17

Copy link
Copy Markdown

My condolences.

@Scripter17

Scripter17 commented Jun 16, 2026

Copy link
Copy Markdown

After further thought: I really like this.

I'm making a Rust crate for URLs with an emphasis on manipulating them according to the semantics they actually have. Relevantly there's a DomainHost type with methods like set_(prefix|middle|suffix)(_segment|_segment_range)? and a major pain point I've been working on the past week is keeping track of each label's Bidi compliance and blocking things like setting the middle of xn--mgbb to 2 (2.xn--mgbb).

It's been a real pain and I was just starting on actually hooking it up into the setters when I found this, and now I just... don't have to do that. All the code I wrote this past week is still useful for implementing ToUnicode but now the parts I've been dreading, the parts I was going to start working on tonight, can just be skipped. All I have to do is revert a few minor changes and tweak my domain encoder.

Edit: Okay there was a bit of serious plumbing I had to rip out so it's more like 80% of my work wasn't wasted. My point and opinion stand.

I maintain my condolences because this has got to be a really frustrating thing to have to further mangle the spec over but I want you to know that there's at least one person who genuinely benefits from this. You deserve to know your work is appreciated.

I assume this is basically guaranteed to be merged mostly as-is and I can just immediately pivot my code to using it?

I'm not concerned about being ahead of the spec since I have no users and am still in beta, so when this is merged isn't important to me.

@Scripter17

Copy link
Copy Markdown

Just checking, and sorry if I'm being/have been annoying/weird, but does and should this provide an equivalent fallback for Domain to Unicode?

Technically having it return failure still sidesteps the phishing issue but it'd be really awkward to have DomainHost::decode be fallible.

@rmisev

rmisev commented Jun 20, 2026

Copy link
Copy Markdown
Member

Just checking, and sorry if I'm being/have been annoying/weird, but does and should this provide an equivalent fallback for Domain to Unicode?

I think so, we need a fallback. If we don't do this, we may end up in a situation where domain to ASCII succeeds, but domain to Unicode returns an invalid or altered domain. For example:

  1. Changes domain:
    domain_to_ASCII("xn--8i7caa.famitei.net", false) // xn--8i7caa.famitei.net
    domain_to_ASCII(domain_to_Unicode("xn--8i7caa.famitei.net", false), false) // www.famitei.net
  2. Returns an invalid domain:
    domain_to_ASCII("xn--board-ngr.org", false) // xn--board-ngr.org
    domain_to_ASCII(domain_to_Unicode("xn--board-ngr.org", false), false) // Failure

To solve this problem, I would suggest changing the second step of domain to Unicode as follows:

  1. Signify domain-to-Unicode validation errors for any returned errors. If an error is returned, beStrict is false, and domain is an ASCII string, return the lowercased domain. Otherwise, return result.

A few examples:

domain domain to Unicode (current) domain to Unicode (proposed)
xn--8i7caa.famitei.net www.famitei.net xn--8i7caa.famitei.net
xn--board-ngr.org ิboard.org xn--board-ngr.org
XN--0.XN--NXA xn--0.β xn--0.xn--nxa
XN--0.ABC xn--0.abc xn--0.abc
XN--ABC- abc xn--abc-

@annevk

annevk commented Jun 21, 2026

Copy link
Copy Markdown
Member Author

Thanks, well spotted! I decided to always return the domain as-is when ToUnicode records an error and also stop reporting redundant validation errors. This matches Safari, seems on the safe side, and since it's not exactly normative and implementations are allowed to deviate it seems reasonable.

annevk added a commit to annevk/WebKit that referenced this pull request Jun 23, 2026
https://bugs.webkit.org/show_bug.cgi?id=317539

Reviewed by NOBODY (OOPS!).

A potential roundtripping issue surfaced in
whatwg/url#914 that we did not have test
coverage for. In particular because after 314820@main to ASCII succeeds
more often, there's a risk that ToUnicode returns something that does
not roundtrip.

We do not have this issue because when to Unicode records an error, we
return the input as-is. Document this with a test so we won't
accidentally regress it.

This is an API test because to Unicode is not exposed to the web at
the moment.
webkit-commit-queue pushed a commit to annevk/WebKit that referenced this pull request Jun 23, 2026
https://bugs.webkit.org/show_bug.cgi?id=317539

Reviewed by Alex Christensen.

A potential roundtripping issue surfaced in
whatwg/url#914 that we did not have test
coverage for. In particular because after 314820@main to ASCII succeeds
more often, there's a risk that ToUnicode returns something that does
not roundtrip.

We do not have this issue because when to Unicode records an error, we
return the input as-is. Document this with a test so we won't
accidentally regress it.

This is an API test because to Unicode is not exposed to the web at
the moment.

Canonical link: https://commits.webkit.org/315690@main
@hsivonen

Copy link
Copy Markdown
Member

I can see why this is being done, but I don't like this, since this introduces a different whole-domain mode, which means that implementations that with the previous spec text could produce output as they did a single pass over the domain now need a different API shape.

It looks to me like domain to ASCII and domain to Unicode still reject different domains. I think both operations should accept or reject consistently. (Of course, in actual browsers, the domain to Unicode operation does not exist and the operation is really domain to display, but we should keep the implementation-defined stuff to whether domain to display chooses to output Punycode for a label that domain to Unicode outputs Unicode for. I think we shouldn't have implementation-dependent behavior other than that.)

It looks to me like the new spec text bypasses forbidden domain code point enforcement when the input is all-ASCII. That should be fixed.

@annevk

annevk commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

It looks to me like the new spec text bypasses forbidden domain code point enforcement when the input is all-ASCII.

When beStrict is false and we have to handle forbidden domain code points, they are handled consistently in step 6 of https://whatpr.org/url/914.html#concept-domain-to-ascii for both ASCII and non-ASCII inputs alike.

@hsivonen

Copy link
Copy Markdown
Member

Sorry, somehow I thought I saw an early return at step 3, but it was just setting result for later processing.

@annevk

annevk commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

It looks to me like domain to ASCII and domain to Unicode still reject different domains.

I don't really understand this feedback as these are quite different operations from the URL standard perspective. Domain to ASCII takes a string and turns it into failure or a domain, whereas domain to Unicode takes a domain and potentially transforms it into a Unicode form or keeps it as-is (and notably never rejects).

@hsivonen

Copy link
Copy Markdown
Member

It looks to me like domain to ASCII and domain to Unicode still reject different domains.

I don't really understand this feedback as these are quite different operations from the URL standard perspective. Domain to ASCII takes a string and turns it into failure or a domain, whereas domain to Unicode takes a domain and potentially transforms it into a Unicode form or keeps it as-is (and notably never rejects).

Is that kind of definition of 'domain to Unicode' good, though?

Before this PR, the URL Standard just said how to set the UTS 46 flags and provided an ASCII deny list that's not representable as UTS 46 flags. Arguably, it's a spec bug that in the spec currently, "domain to Unicode" with beStrict set to false doesn't enforce the ASCII deny list. (In the idna crate, it does.)

This means that in the spec currently if you fix the spec bug of "domain to Unicode" with beStrict set to false doesn't enforce the ASCII deny list and there's a DoS-avoidance length limit on labels that also has the effect of the Punycode encode step in UTS 46 ToASCII never failing, "domain to ASCII" and "domain to Unicode" accept/reject the same set of inputs. (AFAICT, in UTS 46, VerifyDnsLength and overflows in Punycode encode are the possible sources of differences between "ToASCII" and "ToUnicode", and when beStrict is false, as it normally is, the URL Standard passes false for VerifyDnsLength, and the Punycode encode failure issue doesn't arise with reasonable label lengths.)

So it's currently the case that both the processing mode that goes towards DNS resolution in a browser and the processing that goes towards the UI in a browser agree on what's rejected and what's accepted.

It seems to me that changing that opens the opportunity for bugs. I don't have an end-to-end scenario: Breaking the identity just looks on its face like and opportunity for stuff to go wrong.

Even if the new thing that's easy to define in the "domain to ASCII" case is harder to define in an equivalent way in the "domain to Unicode" case, I think it's worthwhile to make the effort define "domain to Unicode" in a way that makes both accept and reject the same set of inputs when beStrict is false and the labels have reasonable length (what exactly reasonable length is is a bit fuzzy).

@annevk

annevk commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

I think it's good that domain to Unicode always operates on the non-failure output of domain to ASCII, yes. That means we have a single entry point to get a domain. And then a single entry point for potentially displaying that domain differently. Instead of two points to parse something into a domain and displaying it somehow.

This was already the case before this PR.

(As we concluded on Chat domain to Unicode shouldn't need beStrict. We can just always use false. I'll fix that in a follow-up PR along with somewhat clearer names that don't suggest symmetry.)

@annevk annevk merged commit a8d5ca3 into main Jun 25, 2026
2 checks passed
@annevk annevk deleted the annevk/idna-ascii branch June 25, 2026 13:04
codedbycupidity pushed a commit to codedbycupidity/WebKit that referenced this pull request Jun 26, 2026
https://bugs.webkit.org/show_bug.cgi?id=317539

Reviewed by Alex Christensen.

A potential roundtripping issue surfaced in
whatwg/url#914 that we did not have test
coverage for. In particular because after 314820@main to ASCII succeeds
more often, there's a risk that ToUnicode returns something that does
not roundtrip.

We do not have this issue because when to Unicode records an error, we
return the input as-is. Document this with a test so we won't
accidentally regress it.

This is an API test because to Unicode is not exposed to the web at
the moment.

Canonical link: https://commits.webkit.org/315690@main
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

5 participants