fix(ext/web): handle Windows file paths in URL parsing#33097
fix(ext/web): handle Windows file paths in URL parsing#33097renezander030 wants to merge 3 commits into
Conversation
Implement WHATWG URL spec change (url#874) to detect Windows drive letter patterns (e.g., C:\path\file.txt) in the URL parser's scheme start state and automatically convert them to file:/// URLs (file:///C:/path/file.txt). The spec adds a check: when parsing encounters a single ASCII alpha letter as the scheme buffer, followed by ':' and '\', it recognizes this as a Windows drive path rather than a URL scheme. The parser then sets the scheme to "file", the host to empty string, and transitions to path state with backslashes normalized to forward slashes. This is implemented as a preprocessing step in the Rust parse_url function, before the input reaches the rust-url crate parser.
|
Please check this comment: |
|
Thanks for the pointer. I'm aware of that comment. The difference here is that this implements the WHATWG URL spec change (url#874), which is being tracked by Chromium, Gecko, and WebKit as well. The upstream Rust The implementation is intentionally minimal and isolated (one function). Easy to remove once the upstream crate picks it up. Happy to hear from the maintainers on whether they'd prefer to wait for upstream or ship early. |
fibibot
left a comment
There was a problem hiding this comment.
Substance covers the most common Windows-path shape (<letter>:\<rest>) and the basic conversion logic looks right, but a few gaps worth resolving before merge.
What's right
- Drive-letter detection (
bytes[0].is_ascii_alphabetic() && bytes[1] == b':' && bytes[2] == b'\\') correctly requires a single ASCII alpha —Cabc:\pathdoesn't match becausebytes[1]is'a'not':'. ✓ - Drive-letter case preservation:
let drive_letter = bytes[0] as charkeepsc:\pathasfile:///c:/path(lowercase). Worth confirming this matches the WHATWG spec — some path normalizations uppercase the drive letter, others preserve. The PR's test asserts the preservation behavior, so at least it's pinned. - Base-URL precedence: the comment correctly notes that Windows drive letters indicate absolute paths and override any provided base. Spec-conformant.
parse_urlintegration: the conversion happens once, before eitherop_url_parseorop_url_parse_with_basereachrust-url. No path-by-path duplication.
Gaps to address
1. Forward-slash separator not handled
The pattern check is bytes[2] == b'\\' — backslash only. A common JS-code shape is new URL("C:/path/file.txt") (forward slash), which today produces file: URL with host="path"-style breakage. Per WHATWG url#874's "normalize the windows-drive-letter quirk," the conversion should fire for both separators.
One-liner fix:
if bytes.len() >= 3
&& bytes[0].is_ascii_alphabetic()
&& bytes[1] == b':'
&& (bytes[2] == b'\\' || bytes[2] == b'/')Worth adding a test:
assertEquals(new URL("C:/path/file.txt").href, "file:///C:/path/file.txt");Reverting just the bytes[2] == b'/' clause should fail this test.
2. No URL encoding of path content
let path = rest.replace('\\', "/") substitutes separators but doesn't percent-encode special characters. So new URL("C:\\path with space\\file.txt") produces file:///C:/path with space/file.txt — which then goes through Url::options().parse(...) which presumably handles encoding. But it's worth confirming with a test:
assertEquals(
new URL("C:\\path with space\\file.txt").href,
"file:///C:/path%20with%20space/file.txt",
);If rust-url normalizes the spaces, this test passes. If not, the output href would have a literal space which Node would reject. Either way, pinning the behavior is useful.
3. WHATWG spec citation needs verification
The PR body cites whatwg/url#874 but I haven't confirmed:
- Whether the spec change has actually been merged into the WHATWG URL Living Standard
- The exact normalization behavior (uppercase vs preserve drive letter, separator handling)
- Whether other URL implementations (Chrome, Firefox, Node) have shipped this
If the spec change is still in flight or differs from this implementation, Deno could end up with non-standard behavior. Worth either citing the merged spec text or noting in the code that this matches the proposed change rather than current spec.
CI not yet run
Only license/cla (signed) + lint title (pass) reported. Substance shards (test unit for tests/unit/url_test.ts) haven't kicked off — looks like first-time-contributor action_required gating. PR is also stale (created 2026-03-31, ~33 days old). A maintainer-authorize-CI cycle plus the forward-slash case fix would clear the substance gate.
Substance is APPROVE-shape-with-fixes — the basic conversion is right, just needs the forward-slash branch + URL-encoding test + spec citation verification.
fibibot
left a comment
There was a problem hiding this comment.
Right idea (implementing the WHATWG URL spec change url#874 for Windows-drive paths), but the heuristic over-matches and breaks existing valid URLs of the form letter:/... that should be parsed as scheme+authority. WPT surfaces it directly.
Specific failure mode
wpt release linux-x86_64 reports four URL-parsing regressions on web-platform.test/url/url-setters.any.js:
assert_equals: href expected "c:/foo" but got "file:///c:/foo"
assert_equals: href expected "h://." but got "file:///h://"
assert_equals: href expected "w://x:0" but got "file:///w://x:0"
assert_equals: expected "a://example.net" but got "file:///a://example.net"
All four are caused by maybe_convert_windows_path_to_file_url matching letter:/... strings that are actually scheme+authority URLs. The heuristic:
bytes[0].is_ascii_alphabetic()
&& bytes[1] == b':'
&& (bytes[2] == b'\\' || bytes[2] == b'/')catches the legitimate Windows-path cases (C:\path, C:/path) but also catches a://example.net because the : and the first / after it match - even though the second / of :// makes it a scheme+authority URL, not a drive path.
Fix direction
Distinguish C:/path from a://example.net by checking the character at position 3:
if bytes.len() >= 3
&& bytes[0].is_ascii_alphabetic()
&& bytes[1] == b':'
&& (bytes[2] == b'\\'
|| (bytes[2] == b'/' && bytes.get(3).copied() != Some(b'/')))This rejects : followed by // (scheme-relative form) while still accepting the single-/ Windows-drive case. Two of the WPT regressions (h://. and w://x:0) are pure :// forms and would pass with this change.
For the remaining two (c:/foo and a://example.net):
c:/foo- singlec:/is genuinely ambiguous between Windows-drive andc:scheme. Per WHATWG url#874, the spec resolves the ambiguity in favor of Windows-drive treatment but ONLY in path-style contexts, not when the URL is being parsed as a scheme. The simplest implementation that matches the spec is to check whetherc:is a known scheme - if yes, treat as scheme; if no, treat as Windows drive. Or, conservatively: only trigger the conversion for\\(backslash) and require explicitfile://for forward-slash drive paths. The latter is what Node does today.a://example.net-://form, fixed by the position-3 check above.
Substance walk
- Spec reference (
whatwg/url#874) is correctly cited in the comment. The implementation diverges from the spec in the://non-distinction. Worth re-reading the PR's spec text to confirm which forms it actually converts. - Drive-letter detection correctly accepts both upper and lower case (
bytes[0].is_ascii_alphabetic()). - Path normalization (
rest.replace('\\', "/")) correctly converts backslashes in the path portion. - Test coverage - the 8 added test cases all exercise legitimate Windows paths (
C:\path,c:\folder,C:/path, mixed separators, with-base, percent-encoding). None exercise theletter://ambiguity that's regressing. Adding negative test cases (e.g.,assertEquals(new URL("a://example.net").href, "a://example.net")) before fixing the heuristic would lock the fix in.
CI
97 success, 1 skipped, 2 fail. The 2 failures are the WPT regressions above + the ci status rollup. No node_compat or other surface affected - this is purely the URL-parsing regression.
Not blocking the spec direction - just need the heuristic narrowed before this can land cleanly.
Address review feedback (fibibot CHANGES_REQUESTED): the previous forward-slash extension over-matched scheme URLs (`a://example.net`, `c:/foo`, `h://.`, `w://x:0`) and broke 4 WPT url-setters tests. Revert the heuristic to backslash-only (matches Node behavior) and add a regression test pinning the four formerly-regressing scheme URLs to their unmodified parse output. Forward-slash drive paths now require explicit `file:///C:/path`; supporting bare `C:/path` cleanly needs a scheme-aware check that I'm leaving for a follow-up.
02a5b48 to
1257494
Compare
|
Thanks for the catch. The forward-slash extension regressed scheme URLs ( Reverted to backslash-only in 1257494 and added a regression test pinning all four formerly-regressing URLs to their unmodified parse output. Forward-slash drive paths ( Going scheme-aware (so bare On the spec citation point from your earlier review: whatwg/url#874 is still an open PR, not merged into the Living Standard. I've left the PR body reference in place but happy to soften it to "tracks proposed change in whatwg/url#874" if that reads better. |
|
hey, apologies on the late reply. |
fibibot
left a comment
There was a problem hiding this comment.
LGTM — clears my prior REQUEST_CHANGES at 02a5b48c. The four WPT regressions I identified (c:/foo, h://., w://x:0, a://example.net) are fixed by the conservative path I called out as fix direction (b) — narrow the heuristic to backslash-only and rely on explicit file:/// for forward-slash drive paths. CI fully green at this HEAD: 134 success, 2 skipping, 0 fail.
What I verified
The heuristic change (ext/web/url.rs:74-77):
if bytes.len() >= 3
&& bytes[0].is_ascii_alphabetic()
&& bytes[1] == b':'
&& bytes[2] == b'\\' // was: (bytes[2] == b'\\' || bytes[2] == b'/')Dropping the || bytes[2] == b'/' clause closes the entire ambiguity surface — c:/foo, c:something, and letter://... all now flow through to standard URL parsing as scheme URLs, exactly like Node treats them.
The test churn matches the heuristic change:
urlWindowsFilePathForwardSlash— removed (was assertingC:/path/file.txt→file:///, which is no longer the contract).- The forward-slash assertion in
urlWindowsFilePathPercentEncodesSpaces— removed (kept only the backslash one). urlSchemeAuthorityNotMisinterpretedAsDrive— added, asserts the four exact WPT shapes that were regressing (a://example.net,h://.,w://x:0,c:/foo) now round-trip unchanged.
Doc comment (lines 60-65) explicitly calls out the intentional non-conversion of forward-slash drives and points users at file:///C:/path as the explicit form. ✓
Behavior parity check: Node's pathToFileURL only converts backslash-style Windows paths; forward-slash drive paths in raw new URL(...) produce c: scheme URLs there too. So the trimmed scope here matches the upstream contract the PR description claims.
CI
134 success, 2 skipping, 0 fail. WPT release linux-x86_64 (the shard I previously called out as failing four URL-setters assertions) is now green at this HEAD.
No concerns inline.
|
Makes sense. I'll take this to rust-url and revisit here once it lands upstream and a Deno version bump pulls it in. Closing this PR in the meantime. |
Summary
Implements the WHATWG URL spec change (whatwg/url#874) to handle Windows-style file paths in URL parsing.
When
new URL("C:\\path\\file.txt")is called, the parser now detects the Windows drive letter pattern (single ASCII alpha +:+\) and converts it to afile:///URL with normalized forward slashes:file:///C:/path/file.txt.Changes
ext/web/url.rs: Addedmaybe_convert_windows_path_to_file_url()preprocessing function that detects Windows drive letter patterns before the input reachesrust-url's parser. Called fromparse_url()for bothop_url_parseandop_url_parse_with_basecode paths.tests/unit/url_test.ts: Added tests covering basic paths, different drive letters, lowercase drives, mixed separators, paths with base URL, andURL.parse().Spec details
The WHATWG URL spec change adds a check in the "scheme start state": when the parser encounters a single ASCII alpha character as a potential scheme, followed by
:and\, it recognizes this as a Windows drive path rather than a URL scheme. It then sets scheme tofile, host to empty string, and transitions to path state.Test plan
tests/unit/url_test.tsFixes #30363