feat: reach full WHATWG URL conformance against the WPT corpus #14
Merged
Refresh the WPT urltestdata corpus to the upstream master snapshot and close the remaining domain-to-ASCII gap so every in-scope case parses exactly as the standard prescribes. The conformance harness now runs all 888 cases; it was pinned to an older snapshot (886 cases) with a different pass/fail split.

The refreshed corpus surfaced 12 divergences, all rooted in the host parser's IDNA step:

- An all-ASCII domain must be returned lowercased verbatim: with beStrict false, Unicode ToASCII failures are only validation errors, so an invalid xn-- label (e.g. xn--pokxncvks) is kept rather than rejected.
- A domain that maps to the empty string (e.g. a lone soft hyphen) must fail.

Add Idna.domainToAsciiForUrl implementing the WHATWG "domain to ASCII" wrapper (ASCII fast-path plus the empty-result rule) and route the special-scheme host pipeline through it. The pure UTS-46 Idna.domainToAscii is left unchanged, so it still backs the IdnaTestV2/toascii conformance suite.

Vendor the corpus at tools/url/urltestdata.json so regeneration is reproducible, and fix the fixture generator's line wrapping so generated output stays within the formatter's rules.

Register codegen tasks (generateUrlTestData, generateFixtures, and the per-table IDNA generators) that shell out to the offline Python tools under tools/, so fixtures and lookup tables regenerate via ./gradlew rather than a bare python3 invocation. They are developer tooling, grouped under "codegen" and intentionally not wired into check/build.
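
For illustration, one such task might be registered roughly as below in a Gradle Kotlin DSL build script. This is only a sketch: the task name generateUrlTestData, the "codegen" group, and the corpus path come from the description above, while the Python script name and its argument are hypothetical placeholders, not the repository's actual layout.

```kotlin
// build.gradle.kts (sketch, not the project's actual build script)
tasks.register<Exec>("generateUrlTestData") {
    // Grouped under "codegen" so it shows up in `./gradlew tasks`.
    group = "codegen"
    description = "Regenerates URL conformance fixtures from the vendored WPT corpus."

    // Hypothetical script name and argument; only the corpus path is from the PR.
    commandLine(
        "python3",
        "tools/url/generate_url_test_data.py",
        "tools/url/urltestdata.json",
    )
}
```

Because nothing depends on the task, it would run only when invoked explicitly (e.g. ./gradlew generateUrlTestData), which matches the intent of keeping codegen out of check/build.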
Summary
The WHATWG Url profile now passes the entire WPT urltestdata.json corpus (888/888 cases). The conformance harness was pinned to an older corpus snapshot (886 cases) with a different pass/fail split; this refreshes it to the upstream master snapshot and closes the one remaining behavioural gap so every in-scope case parses exactly as the standard prescribes.
Background
The two ratcheting conformance suites already run every in-scope case through UrlParser. Refreshing the corpus surfaced 12 divergences, all in the host parser's IDNA (domain to ASCII) step:

- An all-ASCII domain must be returned lowercased verbatim: with beStrict false, Unicode ToASCII failures are only validation errors, so an invalid xn-- label such as xn--pokxncvks is kept rather than rejected (IgnoreInvalidPunycode alone is not sufficient, since Punycode can decode yet still fail a later validity check).
- A domain that maps to the empty string (e.g. a lone soft hyphen) must fail.
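
The two rules above can be sketched in Kotlin as follows. This is a minimal illustration under stated assumptions, not the actual Idna.domainToAsciiForUrl: the function name domainToAsciiForUrlSketch, the UnicodeToAscii type, and the toy mapper in main are all hypothetical, and the toy mapper is a stand-in for a real UTS-46 ToASCII implementation.

```kotlin
/** Hypothetical stand-in for a UTS-46 ToASCII step; returns null on failure. */
typealias UnicodeToAscii = (String) -> String?

/**
 * Sketch of the wrapper behaviour described above (names are assumptions):
 * returns the ASCII form of [domain], or null to signal failure.
 */
fun domainToAsciiForUrlSketch(domain: String, unicodeToAscii: UnicodeToAscii): String? {
    val result = if (domain.all { it.code < 0x80 }) {
        // ASCII fast path: lowercase and keep every label verbatim,
        // including xn-- labels that strict validation would reject.
        domain.lowercase()
    } else {
        // Non-ASCII input goes through the full mapping step.
        unicodeToAscii(domain) ?: return null
    }
    // Empty-result rule: a domain that maps to "" is a failure.
    return result.ifEmpty { null }
}

fun main() {
    // Toy mapper, NOT real UTS-46: drops soft hyphens (U+00AD) and
    // fails on any other non-ASCII character.
    val toyToAscii: UnicodeToAscii = { input ->
        val mapped = input.replace("\u00AD", "")
        if (mapped.all { it.code < 0x80 }) mapped.lowercase() else null
    }

    println(domainToAsciiForUrlSketch("XN--POKXNCVKS", toyToAscii)) // xn--pokxncvks
    println(domainToAsciiForUrlSketch("\u00AD", toyToAscii))        // null
}
```

Passing the mapping step in as a parameter keeps the sketch self-contained; in the real code the wrapper would presumably delegate to the project's own UTS-46 implementation.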
Changes
- Add Idna.domainToAsciiForUrl, the WHATWG "domain to ASCII" wrapper (ASCII fast-path + empty-result rule), and route the special-scheme host pipeline through it. The pure UTS-46 Idna.domainToAscii is unchanged, so it still backs the IdnaTestV2/toascii conformance suite.
- Vendor the corpus at tools/url/urltestdata.json for reproducible regeneration and point the generator at it (it previously read an untracked reference checkout).
- Fix the fixture generator's line wrapping so generated output is formatter-clean and byte-stable.
- Update the UrlConformanceTest known-failures baseline; the ratchet now pins the suite at full conformance.
- Register codegen tasks that wrap the Python generators so fixtures regenerate via ./gradlew (developer tooling; not wired into the build).
Verification
./gradlew build passes: ktlint, detekt, explicit-API, binary compatibility (apiCheck; public API unchanged), the Kover coverage floor, and tests on JVM, JS (browser + node), Wasm (browser + node), and macOS.