How to reproduce the behaviour
The --url flag added in #13848 to python -m spacy download cannot succeed under any input. Either the user-supplied URL is silently replaced with the default GitHub URL, or it is rejected by the post-construction guard.
import sys
import importlib
from unittest.mock import patch
import spacy
from spacy import about
# `from spacy.cli import download` resolves to the *function*, not the module,
# because of an alias in cli/__init__.py. Force-import the module:
importlib.import_module("spacy.cli.download")
dl = sys.modules["spacy.cli.download"]
captured = {}
def fake_run(cmd):
    captured["url"] = next(arg for arg in cmd if arg.startswith("http"))
# Case 1: --url WITHOUT trailing slash
captured.clear()
with patch.object(dl, "run_command", fake_run), \
     patch.object(dl, "_get_pip_install_cmd", lambda: ["pip", "install"]):
    dl.download_model("foo-1.0.tar.gz", custom_url="https://my-mirror.example.com/models")
print(captured["url"])
# → https://github.com/explosion/spacy-models/releases/download/foo-1.0.tar.gz
# (the user's mirror was silently discarded)
# Case 2: --url WITH trailing slash
with patch.object(dl, "run_command", fake_run), \
     patch.object(dl, "_get_pip_install_cmd", lambda: ["pip", "install"]):
    dl.download_model("foo-1.0.tar.gz", custom_url="https://my-mirror.example.com/models/")
# → ValueError: Download from foo-1.0.tar.gz rejected. Was it a relative path?
Root cause
spacy/cli/download.py:180-186:
base_url = custom_url if custom_url else about.__download_url__
# urljoin requires that the path ends with /, or the last path part will be dropped
if not base_url.endswith("/"):
    base_url = about.__download_url__ + "/"  # ← clobbers custom_url
download_url = urljoin(base_url, filename)
if not download_url.startswith(about.__download_url__):
    raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
Two interlocking defects:
- Line 183 replaces base_url with the default URL whenever the input lacks a trailing slash, even when base_url came from custom_url, discarding the user's choice.
- The line-185 startswith(about.__download_url__) guard rejects any custom URL that is preserved, because a custom mirror by definition does not start with the GitHub URL.
Result: the --url flag cannot reach a non-default URL under any input.
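One possible direction for a fix, as a minimal standalone sketch rather than the actual patch: append the trailing slash to whichever base URL is in effect, and check the joined URL against that same base. The helper name build_download_url and the constant DEFAULT_DOWNLOAD_URL are hypothetical; the constant stands in for about.__download_url__, with its value inferred from the Case 1 output above.

```python
from typing import Optional
from urllib.parse import urljoin

# Stand-in for about.__download_url__ (value inferred from the Case 1 output).
DEFAULT_DOWNLOAD_URL = "https://github.com/explosion/spacy-models/releases/download"


def build_download_url(filename: str, custom_url: Optional[str] = None) -> str:
    """Hypothetical helper: join filename onto whichever base URL is in effect."""
    base_url = custom_url if custom_url else DEFAULT_DOWNLOAD_URL
    # urljoin drops the last path segment unless the base ends with "/",
    # so append the slash to the chosen base instead of swapping in the default.
    if not base_url.endswith("/"):
        base_url += "/"
    download_url = urljoin(base_url, filename)
    # Guard against the chosen base, not the default: a mirror URL passes,
    # while absolute or parent-relative filenames are still rejected.
    if not download_url.startswith(base_url):
        raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
    return download_url


print(build_download_url("foo-1.0.tar.gz", "https://my-mirror.example.com/models"))
# → https://my-mirror.example.com/models/foo-1.0.tar.gz
```

Under these assumptions the slash and no-slash forms of the mirror URL resolve to the same download URL, and a filename such as ../evil.tar.gz still raises ValueError because the joined URL escapes the chosen base.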
Impact
Users in air-gapped or mirrored environments (the exact users the feature was added for, per #13848) believe their downloads come from a local mirror. When the mirror URL has no trailing slash, the request is instead silently sent to github.com, so an egress block at the network layer is the only remaining safeguard.
Your Environment
- spaCy version: 3.8.13 (also reproduces on master HEAD, prep for 3.8.14)
- Python version: 3.12
- Platform: macOS / Linux
PR: shaun0927/spaCy@fix/download-custom-url (incoming).