feat: reach full WHATWG URL conformance against the WPT corpus #14

Merged: OmarAlJarrah merged 2 commits into main from feat/whatwg-url-full-conformance on Jun 30, 2026

@OmarAlJarrah (Member)

Summary

The WHATWG Url profile now passes the entire WPT urltestdata.json corpus
(888/888 cases). The conformance harness was pinned to an older corpus snapshot
(886 cases) with a different pass/fail split; this PR refreshes it to the
upstream master snapshot and closes the one remaining behavioural gap so every
in-scope case parses exactly as the standard prescribes.

Background

The two ratcheting conformance suites already run every in-scope case through
UrlParser. Refreshing the corpus surfaced 12 divergences, all in the host
parser's IDNA (domain to ASCII) step:

  • An all-ASCII domain must be returned lowercased and otherwise verbatim.
    With beStrict false, Unicode ToASCII failures are only validation errors,
    so an invalid xn-- label such as xn--pokxncvks is kept rather than
    rejected. (Setting IgnoreInvalidPunycode alone is not sufficient, because a
    label can decode as Punycode yet still fail a later validity check.)
  • A domain that maps to the empty string (for example a lone soft hyphen) must
    fail.
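
The two rules above can be sketched in Kotlin as follows. This is a minimal illustration of the behaviour described in this PR, not the actual Idna.domainToAsciiForUrl implementation: the function name domainToAsciiForUrlSketch and the uts46ToAscii parameter (a stand-in for the pure UTS-46 step that returns null on failure) are assumptions made for the example.

```kotlin
/**
 * Sketch of a WHATWG-style "domain to ASCII" wrapper (non-strict mode).
 *
 * Hypothetical names: neither this function nor [uts46ToAscii] is the
 * library's real API. [uts46ToAscii] stands in for the pure UTS-46 ToASCII
 * step and is assumed to return null on failure.
 */
fun domainToAsciiForUrlSketch(
    domain: String,
    uts46ToAscii: (String) -> String?,
): String? {
    // ASCII fast path: an all-ASCII domain is lowercased and otherwise kept
    // verbatim, so a label like "xn--pokxncvks" survives unchanged.
    val isAscii = domain.all { it.code < 0x80 }
    val result = if (isAscii) domain.lowercase() else uts46ToAscii(domain)

    // Empty-result rule: a domain that maps to the empty string (for example
    // a lone soft hyphen, U+00AD) is a failure, as is a failed ToASCII run.
    return if (result.isNullOrEmpty()) null else result
}
```

Passing the UTS-46 step in as a parameter keeps the sketch self-contained; in the real code the wrapper presumably delegates to Idna.domainToAscii directly.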

Changes

  • Add Idna.domainToAsciiForUrl, the WHATWG "domain to ASCII" wrapper (ASCII
    fast-path + empty-result rule), and route the special-scheme host pipeline
    through it. The pure UTS-46 Idna.domainToAscii is unchanged, so it still
    backs the IdnaTestV2/toascii conformance suite.
  • Vendor the corpus at tools/url/urltestdata.json for reproducible
    regeneration and point the generator at it (it previously read an untracked
    reference checkout).
  • Fix the fixture generator's line wrapping so regenerated output is
    formatter-clean and byte-stable.
  • Empty the UrlConformanceTest known-failures baseline; the ratchet now pins
    the suite at full conformance.
  • Add Gradle codegen tasks that wrap the Python generators so fixtures
    regenerate via ./gradlew (developer tooling; not wired into the build).
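
For readers unfamiliar with the ratchet pattern, the idea behind a known-failures baseline can be sketched as below. This is a generic illustration, not the code in UrlConformanceTest: the names RatchetReport and ratchet are hypothetical, and treating stale baseline entries as errors is an assumption about how such a ratchet typically works.

```kotlin
/** Hypothetical result type for the sketch; not part of the library. */
data class RatchetReport(
    val regressions: Set<String>,   // cases that fail but are not in the baseline
    val staleBaseline: Set<String>, // baseline entries that now pass
) {
    val ok: Boolean get() = regressions.isEmpty() && staleBaseline.isEmpty()
}

/**
 * Compares the currently failing case IDs against the known-failures
 * baseline. New failures are regressions; baseline entries that no longer
 * fail must be removed, which is what lets the baseline only shrink.
 */
fun ratchet(failing: Set<String>, knownFailures: Set<String>): RatchetReport =
    RatchetReport(
        regressions = failing - knownFailures,
        staleBaseline = knownFailures - failing,
    )
```

With an empty baseline, as after this change, any failing case is reported as a regression, which is what pinning the suite at full conformance amounts to.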

Verification

./gradlew build passes, covering ktlint, detekt, explicit-API mode, binary
compatibility (apiCheck; public API unchanged), the Kover coverage floor, and
tests on JVM, JS (browser + node), Wasm (browser + node), and macOS.

  • Component getters: 888/888.
  • Href round-trip: 621/621 (100%).

Refresh the WPT urltestdata corpus to the upstream master snapshot and close
the remaining domain-to-ASCII gap so every in-scope case parses exactly as the
standard prescribes. The conformance harness now runs all 888 cases; it was
pinned to an older snapshot (886 cases) with a different pass/fail split.

The refreshed corpus surfaced 12 divergences, all rooted in the host parser's
IDNA step:

- An all-ASCII domain must be returned lowercased verbatim: with beStrict
  false, Unicode ToASCII failures are only validation errors, so an invalid
  xn-- label (e.g. xn--pokxncvks) is kept rather than rejected.
- A domain that maps to the empty string (e.g. a lone soft hyphen) must fail.

Add Idna.domainToAsciiForUrl implementing the WHATWG "domain to ASCII" wrapper
(ASCII fast-path plus the empty-result rule) and route the special-scheme host
pipeline through it. The pure UTS-46 Idna.domainToAscii is left unchanged, so
it still backs the IdnaTestV2/toascii conformance suite.

Vendor the corpus at tools/url/urltestdata.json so regeneration is reproducible,
and fix the fixture generator's line wrapping so generated output stays within
the formatter's rules.

Register codegen tasks (generateUrlTestData, generateFixtures, and the
per-table IDNA generators) that shell out to the offline Python tools under
tools/, so fixtures and lookup tables regenerate via ./gradlew rather than a
bare python3 invocation. They are developer tooling, grouped under "codegen"
and intentionally not wired into check/build.
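
A task of this kind might look roughly like the following Gradle Kotlin DSL fragment. It is a sketch under assumptions, not the registration added in this PR: the task name and the "codegen" group come from the description above, while the script path tools/url/generate_url_test_data.py is a hypothetical placeholder.

```kotlin
// build.gradle.kts (sketch)
tasks.register<Exec>("generateUrlTestData") {
    group = "codegen"
    description = "Regenerates URL conformance fixtures from the vendored corpus."
    workingDir = rootDir
    // Hypothetical script path; the actual generator lives somewhere under tools/.
    commandLine("python3", "tools/url/generate_url_test_data.py")
}
// Nothing depends on this task, so check/build never run it implicitly.
```

It would then be invoked on demand with ./gradlew generateUrlTestData.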
@OmarAlJarrah merged commit 1324301 into main on Jun 30, 2026
@OmarAlJarrah deleted the feat/whatwg-url-full-conformance branch on June 30, 2026 14:17