From 535b83487965c9ef54eea2bcb43e246051919947 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Tue, 14 Apr 2026 17:09:39 +0200
Subject: [PATCH 1/4] Close gap between URL writing and parser sections

1. Domain-percent-encoded: validation error for percent-encoding in domains (e.g., exam%70le.com).
2. Validation error when strict Unicode ToASCII would fail (e.g., _dmarc.example.com).
3. Invalid-URL-unit: now also emitted for spaces in opaque paths (e.g., foo:bar baz).
4. Writing grammar: non-special absolute URLs now allow opaque paths directly, fixing foo:a:b being incorrectly invalid.

Fixes #796. Fixes #797. Helps with #704.
---
 url.bs | 91 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 52 insertions(+), 39 deletions(-)

diff --git a/url.bs b/url.bs
index 5058807..4dec54a 100644
--- a/url.bs
+++ b/url.bs
@@ -107,8 +107,8 @@ valid input. User agents, especially conformance checkers, are encouraged to rep
  domain-to-ASCII

- Unicode ToASCII records an error or returns the empty string.
- [[UTS46]]
+ Unicode ToASCII records an error when CheckHyphens,
+ UseSTD3ASCIIRules, and VerifyDnsLength are all set to true. [[UTS46]]

  If details about Unicode ToASCII errors are recorded, user agents are encouraged to pass those along.
  Yes (unless domain is an ASCII string)
@@ -126,6 +126,14 @@ valid input. User agents, especially conformance checkers, are encouraged to rep
  Host parsing
+
+
+ domain-percent-encoded
+
+ The input's host to be processed as a domain contains a
+ percent-encoded byte.
+
+ "https://exam%70le.org"
+ ·
  host-invalid-code-point
@@ -907,61 +915,45 @@ concepts.
  steps. They return failure or a domain.

+ Let strictResult be the result of running domain parser ToASCII with
+ domain and true.
+
+ If strictResult is a failure value, domain-to-ASCII
+ validation error. This step does not return.
+
  If beStrict is true:

-   Let result be the result of running
-   Unicode ToASCII with domain_name set to domain,
-   CheckHyphens set to true, CheckBidi set to true, CheckJoiners set to true,
-   UseSTD3ASCIIRules set to true, Transitional_Processing set to false,
-   VerifyDnsLength set to true, and IgnoreInvalidPunycode set to false. [[!UTS46]]
-
-   If result is a failure value, domain-to-ASCII validation error,
-   return failure.
+   If strictResult is a failure value, then return failure.

-   Return result.
+   Return strictResult.

  Let result be null.

- If domain is an ASCII string:
-
-   If running Unicode ToASCII with domain_name set to
-   domain, CheckHyphens set to false, CheckBidi set to true,
-   CheckJoiners set to true, UseSTD3ASCIIRules set to false,
-   Transitional_Processing set to false, VerifyDnsLength set to false, and
-   IgnoreInvalidPunycode set to false is a failure value, domain-to-ASCII
-   validation error. [[!UTS46]]
-
-   Set result to domain, lowercased.
-
+ If domain is an ASCII string, then set result to
+ domain, lowercased.

  When beStrict is false and domain is an ASCII string,
- Unicode ToASCII failures only result in validation errors
- (instead of failing the whole algorithm) due to web compatibility. IgnoreInvalidPunycode
- is not sufficient on its own, as Punycode can decode successfully yet still fail validity
- criteria. E.g., xn--8i7caa decodes to www, whose code points have
- status "mapped". [[UTS46]]
+ the algorithm returns domain lowercased regardless of
+ Unicode ToASCII's outcome, due to web compatibility.
+ IgnoreInvalidPunycode is not sufficient on its own, as Punycode can decode successfully
+ yet still fail validity criteria. E.g., xn--8i7caa decodes to www,
+ whose code points have status "mapped". [[UTS46]]

  Otherwise:

-   Set result to the result of running
-   Unicode ToASCII with domain_name set to domain,
-   CheckHyphens set to false, CheckBidi set to true, CheckJoiners set to true,
-   UseSTD3ASCIIRules set to false, Transitional_Processing set to false,
-   VerifyDnsLength set to false, and IgnoreInvalidPunycode set to false. [[!UTS46]]
+   Set result to the result of running domain parser ToASCII with
+   domain and false.

-   If result is a failure value, domain-to-ASCII validation error,
-   return failure.
+   If result is a failure value, then return failure.

- If result is the empty string, domain-to-ASCII validation error,
- return failure.
+ If result is the empty string, then return failure.

  If result contains a forbidden domain code point,
@@ -979,6 +971,16 @@ steps. They return failure or a domain.
  ☕.example becomes xn--53h.example and not failure. [[UTS46]] [[RFC5890]]
+
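For readers following the control flow rather than the markup, here is a minimal Python sketch of the reworked domain to ASCII steps from the hunk above. It is an illustration, not the spec's algorithm text. The names are assumptions: `to_ascii` is a hypothetical callable standing in for domain parser ToASCII (returning `None` for failure), `forbidden` is a caller-supplied set standing in for the forbidden domain code points, and `errors` is an illustrative list that collects validation errors. The forbidden-code-point step uses the PATCH 4/4 wording (plain failure), and the final `return result` is assumed, since that step lies outside the hunks shown.

```python
from typing import Callable, List, Optional, Set

FAILURE = None  # stand-in for the spec's "failure" value (an assumption of this sketch)


def domain_to_ascii(
    domain: str,
    be_strict: bool,
    to_ascii: Callable[[str, bool], Optional[str]],  # hypothetical "domain parser ToASCII"
    forbidden: Set[str],                             # hypothetical forbidden code point set
    errors: List[str],                               # collects validation error names
) -> Optional[str]:
    # The strict variant always runs, so a validation error can be
    # reported even when the relaxed path below ends up succeeding.
    strict_result = to_ascii(domain, True)
    if strict_result is FAILURE:
        errors.append("domain-to-ASCII")  # validation error only; this step does not return

    if be_strict:
        # Either failure or the strict result, unchanged.
        return strict_result

    if domain.isascii():
        # ASCII input is lowercased without consulting ToASCII (web compatibility).
        result = domain.lower()
    else:
        result = to_ascii(domain, False)
        if result is FAILURE:
            return FAILURE

    if result == "":
        return FAILURE
    if any(ch in forbidden for ch in result):
        return FAILURE
    return result  # assumed final step; not visible in the hunks


def toy_to_ascii(domain: str, be_strict: bool) -> Optional[str]:
    # Toy stand-in, NOT UTS #46: strict mode rejects underscores, nothing else.
    if be_strict and "_" in domain:
        return FAILURE
    return domain.lower()
```

With the toy stand-in, `_dmarc.Example.com` records a domain-to-ASCII validation error yet still returns `_dmarc.example.com` when `be_strict` is false, which matches the behaviour the commit message describes for `_dmarc.example.com`.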

+ The domain parser ToASCII algorithm, given a scalar value string
+ domain and a boolean beStrict, returns the result of running
+ Unicode ToASCII with domain_name set to domain,
+ CheckHyphens set to beStrict, CheckBidi set to true, CheckJoiners
+ set to true, UseSTD3ASCIIRules set to beStrict, Transitional_Processing
+ set to false, VerifyDnsLength set to beStrict, and IgnoreInvalidPunycode
+ set to false. [[!UTS46]]
+
+
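The parameter mapping in the added definition is compact enough to state as data. The Python sketch below only records which UTS #46 flags follow beStrict and which are fixed. It does not run Unicode ToASCII, and both the `ToAsciiOptions` type and the `domain_parser_to_ascii_options` name are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToAsciiOptions:
    """Hypothetical container for the Unicode ToASCII flags named in the patch."""
    check_hyphens: bool
    check_bidi: bool
    check_joiners: bool
    use_std3_ascii_rules: bool
    transitional_processing: bool
    verify_dns_length: bool
    ignore_invalid_punycode: bool


def domain_parser_to_ascii_options(be_strict: bool) -> ToAsciiOptions:
    # Three flags track beStrict; the remaining four are constant.
    return ToAsciiOptions(
        check_hyphens=be_strict,
        check_bidi=True,
        check_joiners=True,
        use_std3_ascii_rules=be_strict,
        transitional_processing=False,
        verify_dns_length=be_strict,
        ignore_invalid_punycode=False,
    )
```

Seen this way, the strict and relaxed calls differ only in CheckHyphens, UseSTD3ASCIIRules, and VerifyDnsLength, which are the three flags the updated domain-to-ASCII table row names.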

  The domain to Unicode algorithm, given a domain domain, runs these steps:
@@ -1072,6 +1074,9 @@ false), and then runs these steps. They return failure or a host.
  Assert: input is not the empty string.

+ If input contains a percent-encoded byte,
+ domain-percent-encoded validation error.
+
  Let domain be the result of running UTF-8 decode without BOM on the percent-decoding of input.
@@ -1677,10 +1682,15 @@ unified model would be, please file an issue.
  ✅
  file:///C:/
- file://loc%61lhost/
+ file://localhost/
  file:///
+
+ file://loc%61lhost/
+
+ ❌
+ file:///
  https://user:password@example.org/
@@ -2014,7 +2024,8 @@ an absolute-URL string, optionally followed by U+0023 (#) and a URL-fr
  special scheme and not an ASCII case-insensitive match for "file", followed by U+003A (:) and a scheme-relative-special-URL string

  a URL-scheme string that is not an ASCII case-insensitive match for a
- special scheme, followed by U+003A (:) and a relative-URL string
+ special scheme, followed by U+003A (:) and one of: a scheme-relative-URL string, a
+ path-absolute-URL string, or zero or more URL units

  a URL-scheme string that is an ASCII case-insensitive match for "file", followed by U+003A (:) and a scheme-relative-file-URL string
@@ -2970,6 +2981,8 @@ and then runs these steps:

  Otherwise, if c is U+0020 SPACE:

+   Invalid-URL-unit validation error.
+
    If remaining starts with U+003F (?) or U+0023 (#), then append "%20" to url's path.

From 9712877c129242d6716e58baf937d825d46ee190 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 26 Jun 2026 09:20:50 +0200
Subject: [PATCH 2/4] nit

---
 url.bs | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/url.bs b/url.bs
index 4dec54a..fabc3e0 100644
--- a/url.bs
+++ b/url.bs
@@ -111,7 +113,9 @@ valid input. User agents, especially conformance checkers, are encouraged to rep
  UseSTD3ASCIIRules, and VerifyDnsLength are all set to true. [[UTS46]]

  If details about Unicode ToASCII errors are recorded, user agents are encouraged to pass those along.
- Yes (unless domain is an ASCII string)
+ Yes (when beStrict is true, or domain is not an
+ ASCII string and Unicode ToASCII with relaxed parameters
+ also fails)
  domain-invalid-code-point

From 38517b0b759e812d7a8703bc3e5921ff59ca785f Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 26 Jun 2026 09:22:05 +0200
Subject: [PATCH 3/4] CI

---
 url.bs | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/url.bs b/url.bs
index fabc3e0..63cd7b4 100644
--- a/url.bs
+++ b/url.bs
@@ -111,9 +111,9 @@ valid input. User agents, especially conformance checkers, are encouraged to rep
  UseSTD3ASCIIRules, and VerifyDnsLength are all set to true. [[UTS46]]

  If details about Unicode ToASCII errors are recorded, user agents are encouraged to pass those along.
- Yes (when beStrict is true, or domain is not an
- ASCII string and Unicode ToASCII with relaxed parameters
- also fails)
+ Yes (when beStrict is true, or domain is
+ not an ASCII string and Unicode ToASCII with relaxed
+ parameters also fails)
  domain-invalid-code-point

From 1901c901a44abf8a416a02498efa505942f7fd71 Mon Sep 17 00:00:00 2001
From: Anne van Kesteren
Date: Fri, 26 Jun 2026 12:13:59 +0200
Subject: [PATCH 4/4] domain-invalid-code-point is redundant with beStrict=true

---
 url.bs | 14 ++++----------
 1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/url.bs b/url.bs
index 63cd7b4..e6f65bc 100644
--- a/url.bs
+++ b/url.bs
@@ -111,20 +111,15 @@ valid input. User agents, especially conformance checkers, are encouraged to rep
  UseSTD3ASCIIRules, and VerifyDnsLength are all set to true. [[UTS46]]

  If details about Unicode ToASCII errors are recorded, user agents are encouraged to pass those along.
- Yes (when beStrict is true, or domain is
- not an ASCII string and Unicode ToASCII with relaxed
- parameters also fails)
-
- domain-invalid-code-point
-
- The input's host contains a forbidden domain code point.

  Hosts are percent-decoded before being processed when the URL is special, which would result in the following host portion becoming "exa#mple.org" and thus triggering this error.

  "https://exa%23mple.org"

- Yes
+ Yes (when beStrict is true, or domain is
+ not an ASCII string and Unicode ToASCII with relaxed
+ parameters also fails)
  Host parsing
@@ -958,8 +953,7 @@ steps. They return failure or a domain.
  If result is the empty string, then return failure.

- If result contains a forbidden domain code point,
- domain-invalid-code-point validation error, return failure.
+ If result contains a forbidden domain code point, then return failure.

  Due to web compatibility and compatibility with non-DNS-based systems the forbidden domain code points are a subset of those disallowed when UseSTD3ASCIIRules