Add Unicode ToASCII fallback for ASCII domains by annevk · Pull Request #914 · whatwg/url

annevk · 2026-06-08T19:02:15Z

Unfortunately implementing Unicode ToASCII ends up rejecting websites such as

which is not really acceptable. So when ToASCII fails, domain is all ASCII, and there are no forbidden domain code points, we just return the lowercased domain. If an implementation does not surface validation errors this is indistinguishable from having an ASCII fast path.

Tests: web-platform-tests/wpt#60476

At least two implementers are interested (and none opposed):
- Chromium presumably
- WebKit
Tests are written and can be reviewed and commented upon at:
- See commit message.
Implementation bugs are filed:
- Chromium: N/A
- Gecko: https://bugzilla.mozilla.org/show_bug.cgi?id=2049177
- WebKit: https://bugs.webkit.org/show_bug.cgi?id=316554
- Deno: Add Unicode ToASCII fallback for ASCII domains denoland/deno#35411
- Node.js: Add Unicode ToASCII fallback for ASCII domains nodejs/node#64045
MDN issue is filed: N/A
The top of this comment includes a clear commit message to use.

(See WHATWG Working Mode: Changes for more details.)

Preview | Diff

Unfortunately implementing Unicode ToASCII ends up rejecting websites such as - http://xn--72czcrhaj7cpt0ed1dxb4mb1s1.blogspot.com/2018/11/blog-post_60.html - https://xn--8i7caa.famitei.net/sekou/all - https://xn--board-ngr.palungjit.org/members/ which is not really acceptable. So when ToASCII fails, domain is all ASCII, and there are no forbidden domain code points, we just return the lowercased domain. If an implementation does not surface validation errors this is indistinguishable from having an ASCII fast path. Tests: web-platform-tests/wpt#60476

https://bugs.webkit.org/show_bug.cgi?id=316554 rdar://177686282 Reviewed by Alex Christensen. Unfortunately web compatibility is forcing our hand here to favor pure ASCII inputs, even when ToUnicode cannot do anything meaningful with them. Specification change: whatwg/url#914 Tests: web-platform-tests/wpt#60476 Canonical link: https://commits.webkit.org/314820@main

domenic

It might be worth considering making the fast-path normative, since that's what implementations are going to want anyway. Right now it appears like the spec code is trying to avoid calling Unicode ToASCII more than once, but that's not really a useful "optimization" since the actual implementation code will be quite different.

1. If beStrict is false and domain is an ASCII string:
   a. If Unicode ToASCII(domain, flags(false)) is failure, domain-to-ASCII validation error // obvious validation-error-only step
   b. result = ASCII-lowercase(domain)
2. Otherwise:
   a. result = Unicode ToASCII(domain, flags(beStrict))
   b. If result is failure: domain-to-ASCII validation error; return failure
3. If beStrict is false:
   a. If result is "": domain-to-ASCII validation error; return failure
   b. If result contains a forbidden domain code point: domain-invalid-code-point validation error; return failure
4. Return result

domenic

This looks good to me, modulo some wording suggestions.

Two more options for your consideration, both based around trying to make the "beStrict vs. not-beStrict" divergence more obvious (since non-validating parsers only ever implement beStrict = false):

Let result be null.
If beStrict is true:
1. Set result to the result of running Unicode ToASCII with ...
2. If result is a failure value, domain-to-ASCII validation error, return failure.
Otherwise:
1. If domain is an ASCII string:
  1. If running Unicode ToASCII with ..., domain-to-ASCII validation error.
  2. Set result to domain, lowercased.
2. Otherwise:
  1. Set result to the result of running Unicode ToASCII with ...
  2. If result is a failure value, domain-to-ASCII validation error, return failure.
3. If result is the empty string, domain-to-ASCII validation error, return failure.
4. If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
Assert: result is not the empty string and does not contain a forbidden domain code point.
Return result.

If beStrict is true:
1. Let result be the result of running Unicode ToASCII with ...
2. If result is a failure value, domain-to-ASCII validation error, return failure.
3. Return result.
Let result be null.
If domain is an ASCII string:
1. If running Unicode ToASCII with ..., domain-to-ASCII validation error.
2. Set result to domain, lowercased.
Otherwise:
1. Set result to the result of running Unicode ToASCII with ...
2. If result is a failure value, domain-to-ASCII validation error, return failure.
If result is the empty string, domain-to-ASCII validation error, return failure.
If result contains a forbidden domain code point, domain-invalid-code-point validation error, return failure.
Return result.

(drops the assert in the beStrict case)

Scripter17 · 2026-06-15T12:41:45Z

This seems to make domains that violate the Bidi rule like 2.xn--mgbb work. Is that intended? I get it sidesteps the phishing issue, but that seems weird.

annevk · 2026-06-15T13:42:28Z

Yes, unfortunately we learned that there are enough domain labels out there starting with xn-- that do not actually follow the Punycode rules.

Scripter17 · 2026-06-15T13:54:30Z

My condolences.

Scripter17 · 2026-06-16T01:03:41Z

After further thought: I really like this.

I'm making a Rust crate for URLs with an emphasis on manipulating them according to the semantics they actually have. Relevantly there's a DomainHost type with methods like set_(prefix|middle|suffix)(_segment|_segment_range)? and a major pain point I've been working on the past week is keeping track of each label's Bidi compliance and blocking things like setting the middle of xn--mgbb to 2 (2.xn--mgbb).

It's been a real pain and I was just starting on actually hooking it up into the setters when I found this, and now I just... don't have to do that. All the code I wrote this past week is still useful for implementing ToUnicode but now the parts I've been dreading, the parts I was going to start working on tonight, can just be skipped. All I have to do is revert a few minor changes and tweak my domain encoder.

Edit: Okay there was a bit of serious plumbing I had to rip out so it's more like 80% of my work wasn't wasted. My point and opinion stand.

I maintain my condolences because this has got to be a really frustrating thing to have to further mangle the spec over but I want you to know that there's at least one person who genuinely benefits from this. You deserve to know your work is appreciated.

I assume this is basically guaranteed to be merged mostly as-is and I can just immediately pivot my code to using it?

I'm not concerned about being ahead of the spec since I have no users and am still in beta, so when this is merged isn't important to me.

Scripter17 · 2026-06-20T01:31:03Z

Just checking, and sorry if I'm being/have been annoying/weird, but does and should this provide an equivalent fallback for Domain to Unicode?

Technically having it return failure still sidesteps the phishing issue but it'd be really awkward to have DomainHost::decode be fallible.

rmisev · 2026-06-20T18:39:00Z

Just checking, and sorry if I'm being/have been annoying/weird, but does and should this provide an equivalent fallback for Domain to Unicode?

I think so, we need a fallback. If we don't do this, we may end up in a situation where domain to ASCII succeeds, but domain to Unicode returns an invalid or altered domain. For example:

Changes domain:

domain_to_ASCII("xn--8i7caa.famitei.net", false) // xn--8i7caa.famitei.net
domain_to_ASCII(domain_to_Unicode("xn--8i7caa.famitei.net", false), false) // www.famitei.net

Returns an invalid domain:

domain_to_ASCII("xn--board-ngr.org", false) // xn--board-ngr.org
domain_to_ASCII(domain_to_Unicode("xn--board-ngr.org", false), false) // Failure

To solve this problem, I would suggest changing the second step of domain to Unicode as follows:

Signify domain-to-Unicode validation errors for any returned errors. If an error is returned, beStrict is false, and domain is an ASCII string, return the lowercased domain. Otherwise, return result.

A few examples:

domain	`domain to Unicode` (current)	`domain to Unicode` (proposed)
`xn--8i7caa.famitei.net`	`ｗｗｗ.famitei.net`	`xn--8i7caa.famitei.net`
`xn--board-ngr.org`	`ิboard.org`	`xn--board-ngr.org`
`XN--0.XN--NXA`	`xn--0.β`	`xn--0.xn--nxa`
`XN--0.ABC`	`xn--0.abc`	`xn--0.abc`
`XN--ABC-`	`abc`	`xn--abc-`

annevk · 2026-06-21T12:47:03Z

Thanks, well spotted! I decided to always return the domain as-is when ToUnicode records an error and also stop reporting redundant validation errors. This matches Safari, seems on the safe side, and since it's not exactly normative and implementations are allowed to deviate it seems reasonable.

https://bugs.webkit.org/show_bug.cgi?id=317539 Reviewed by NOBODY (OOPS!). A potential roundtripping issue surfaced in whatwg/url#914 that we did not have test coverage for. In particular because after 314820@main to ASCII succeeds more often, there's a risk that ToUnicode returns something that does not roundtrip. We do not have this issue because when to Unicode records an error, we return the input as-is. Document this with a test so we won't accidentally regress it. This is an API test because to Unicode is not exposed to the web at the moment.

https://bugs.webkit.org/show_bug.cgi?id=317539 Reviewed by Alex Christensen. A potential roundtripping issue surfaced in whatwg/url#914 that we did not have test coverage for. In particular because after 314820@main to ASCII succeeds more often, there's a risk that ToUnicode returns something that does not roundtrip. We do not have this issue because when to Unicode records an error, we return the input as-is. Document this with a test so we won't accidentally regress it. This is an API test because to Unicode is not exposed to the web at the moment. Canonical link: https://commits.webkit.org/315690@main

hsivonen · 2026-06-24T15:38:11Z

I can see why this is being done, but I don't like this, since this introduces a different whole-domain mode, which means that implementations that with the previous spec text could produce output as they did a single pass over the domain now need a different API shape.

It looks to me like domain to ASCII and domain to Unicode still reject different domains. I think both operations should accept or reject consistently. (Of course, in actual browsers, the domain to Unicode operation does not exist and the operation is really domain to display, but we should keep the implementation-defined stuff to whether domain to display chooses to output Punycode for a label that domain to Unicode outputs Unicode for. I think we shouldn't have implementation-dependent behavior other than that.)

It looks to me like the new spec text bypasses forbidden domain code point enforcement when the input is all-ASCII. That should be fixed.

annevk · 2026-06-24T15:45:35Z

It looks to me like the new spec text bypasses forbidden domain code point enforcement when the input is all-ASCII.

When beStrict is false and we have to handle forbidden domain code points, they are handled consistently in step 6 of https://whatpr.org/url/914.html#concept-domain-to-ascii for both ASCII and non-ASCII inputs alike.

hsivonen · 2026-06-24T15:58:17Z

Sorry, somehow I thought I saw an early return at step 3, but it was just setting result for later processing.

annevk · 2026-06-24T15:59:31Z

It looks to me like domain to ASCII and domain to Unicode still reject different domains.

I don't really understand this feedback as these are quite different operations from the URL standard perspective. Domain to ASCII takes a string and turns it into failure or a domain, whereas domain to Unicode takes a domain and potentially transforms it into a Unicode form or keeps it as-is (and notably never rejects).

hsivonen · 2026-06-25T09:32:47Z

It looks to me like domain to ASCII and domain to Unicode still reject different domains.

I don't really understand this feedback as these are quite different operations from the URL standard perspective. Domain to ASCII takes a string and turns it into failure or a domain, whereas domain to Unicode takes a domain and potentially transforms it into a Unicode form or keeps it as-is (and notably never rejects).

Is that kind of definition of 'domain to Unicode' good, though?

Before this PR, the URL Standard just said how to set the UTS 46 flags and provided an ASCII deny list that's not representable as UTS 46 flags. Arguably, it's a spec bug that in the spec currently, "domain to Unicode" with beStrict set to false doesn't enforce the ASCII deny list. (In the idna crate, it does.)

This means that in the spec currently if you fix the spec bug of "domain to Unicode" with beStrict set to false doesn't enforce the ASCII deny list and there's a DoS-avoidance length limit on labels that also has the effect of the Punycode encode step in UTS 46 ToASCII never failing, "domain to ASCII" and "domain to Unicode" accept/reject the same set of inputs. (AFAICT, in UTS 46, VerifyDnsLength and overflows in Punycode encode are the possible sources of differences between "ToASCII" and "ToUnicode", and when beStrict is false, as it normally is, the URL Standard passes false for VerifyDnsLength, and the Punycode encode failure issue doesn't arise with reasonable label lengths.)

So it's currently the case that both the processing mode that goes towards DNS resolution in a browser and the processing that goes towards the UI in a browser agree on what's rejected and what's accepted.

It seems to me that changing that opens the opportunity for bugs. I don't have an end-to-end scenario: Breaking the identity just looks on its face like and opportunity for stuff to go wrong.

Even if the new thing that's easy to define in the "domain to ASCII" case is harder to define in an equivalent way in the "domain to Unicode" case, I think it's worthwhile to make the effort define "domain to Unicode" in a way that makes both accept and reject the same set of inputs when beStrict is false and the labels have reasonable length (what exactly reasonable length is is a bit fuzzy).

annevk · 2026-06-25T10:41:59Z

I think it's good that domain to Unicode always operates on the non-failure output of domain to ASCII, yes. That means we have a single entry point to get a domain. And then a single entry point for potentially displaying that domain differently. Instead of two points to parse something into a domain and displaying it somehow.

This was already the case before this PR.

(As we concluded on Chat domain to Unicode shouldn't need beStrict. We can just always use false. I'll fix that in a follow-up PR along with somewhat clearer names that don't suggest symmetry.)

https://bugs.webkit.org/show_bug.cgi?id=317539 Reviewed by Alex Christensen. A potential roundtripping issue surfaced in whatwg/url#914 that we did not have test coverage for. In particular because after 314820@main to ASCII succeeds more often, there's a risk that ToUnicode returns something that does not roundtrip. We do not have this issue because when to Unicode records an error, we return the input as-is. Document this with a test so we won't accidentally regress it. This is an API test because to Unicode is not exposed to the web at the moment. Canonical link: https://commits.webkit.org/315690@main

annevk mentioned this pull request Jun 8, 2026

IDNA cannot fail ASCII domains, even if they start with xn-- WebKit/WebKit#66694

Merged

domenic reviewed Jun 10, 2026

View reviewed changes

Comment thread url.bs Outdated

domenic mentioned this pull request Jun 10, 2026

IDNA cannot fail ASCII domains, even if they start with xn-- web-platform-tests/wpt#60476

Merged

go with Domenic's model which is nicer

d906311

annevk requested a review from domenic June 10, 2026 13:04

shannonbooth mentioned this pull request Jun 10, 2026

LibURL: Avoid IDNA processing for non-strict ASCII domains LadybirdBrowser/ladybird#10039

Merged

domenic approved these changes Jun 12, 2026

View reviewed changes

Comment thread url.bs Outdated

handle beStrict=true first

969eecf

address a comment from Henri

9b998f7

fix domain to Unicode as well

8556a69

nit

e1a97e5

This was referenced Jun 21, 2026

Add Unicode ToASCII fallback for ASCII domains nodejs/node#64045

Open

Add Unicode ToASCII fallback for ASCII domains denoland/deno#35411

Open

Add test coverage for valid Punycode but invalid UTS46 WebKit/WebKit#67551

Merged

hsivonen mentioned this pull request Jun 24, 2026

Implement URL Standard change to pass through bogus xn-- labels if the whole domain is ASCII servo/rust-url#1135

Open

annevk merged commit a8d5ca3 into main Jun 25, 2026
2 checks passed

annevk deleted the annevk/idna-ascii branch June 25, 2026 13:04

codedbycupidity mentioned this pull request Jun 26, 2026

BorderShape: Implement corner-shape path generation and painting WebKit/WebKit#67949

Merged

Uh oh!

Conversation

annevk commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

domenic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

domenic left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Scripter17 commented Jun 15, 2026

Uh oh!

annevk commented Jun 15, 2026

Uh oh!

Scripter17 commented Jun 15, 2026

Uh oh!

Scripter17 commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Scripter17 commented Jun 20, 2026

Uh oh!

rmisev commented Jun 20, 2026

Uh oh!

annevk commented Jun 21, 2026

Uh oh!

hsivonen commented Jun 24, 2026

Uh oh!

annevk commented Jun 24, 2026

Uh oh!

hsivonen commented Jun 24, 2026

Uh oh!

annevk commented Jun 24, 2026

Uh oh!

hsivonen commented Jun 25, 2026

Uh oh!

annevk commented Jun 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

5 participants

annevk commented Jun 8, 2026 •

edited

Loading

Scripter17 commented Jun 16, 2026 •

edited

Loading