spacy download --url is silently ignored or rejected — the custom URL flag never works #13963

@shaun0927

How to reproduce the behaviour

The --url flag added in #13848 to python -m spacy download never takes effect, whatever value is passed. If the user-supplied URL lacks a trailing slash, it is silently replaced with the default GitHub URL; if it has one, it is rejected by the post-construction guard.

import sys
import importlib
from unittest.mock import patch

import spacy
from spacy import about

# `from spacy.cli import download` resolves to the *function*, not the module,
# because of an alias in cli/__init__.py. Force-import the module:
importlib.import_module("spacy.cli.download")
dl = sys.modules["spacy.cli.download"]

captured = {}
def fake_run(cmd):
    captured["url"] = next(arg for arg in cmd if arg.startswith("http"))

# Case 1: --url WITHOUT trailing slash
captured.clear()
with patch.object(dl, "run_command", fake_run), \
     patch.object(dl, "_get_pip_install_cmd", lambda: ["pip", "install"]):
    dl.download_model("foo-1.0.tar.gz", custom_url="https://my-mirror.example.com/models")
print(captured["url"])
# → https://github.com/explosion/spacy-models/releases/download/foo-1.0.tar.gz
#   (the user's mirror was silently discarded)

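# Case 2: --url WITH trailing slash
captured.clear()
try:
    with patch.object(dl, "run_command", fake_run), \
         patch.object(dl, "_get_pip_install_cmd", lambda: ["pip", "install"]):
        dl.download_model("foo-1.0.tar.gz", custom_url="https://my-mirror.example.com/models/")
except ValueError as err:
    # Caught so the script runs to completion; uncaught, this is the traceback users see.
    print(f"ValueError: {err}")
# → ValueError: Download from foo-1.0.tar.gz rejected. Was it a relative path?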

Root cause

spacy/cli/download.py:180-186:

base_url = custom_url if custom_url else about.__download_url__
# urljoin requires that the path ends with /, or the last path part will be dropped
if not base_url.endswith("/"):
    base_url = about.__download_url__ + "/"      # ← clobbers custom_url
download_url = urljoin(base_url, filename)
if not download_url.startswith(about.__download_url__):
    raise ValueError(f"Download from {filename} rejected. Was it a relative path?")

Two interlocking defects:

  1. Line 183 replaces base_url with the default URL whenever the input lacks a trailing slash, instead of appending the slash, and so discards the user's choice.
  2. The line-185 startswith(about.__download_url__) guard rejects any custom URL that survives line 183, because a custom mirror by definition does not start with the GitHub URL.

Result: the --url flag cannot reach a non-default URL under any input.
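The two defects can be checked without spaCy installed. The sketch below is a stand-alone model, not the code from the incoming PR: DEFAULT_URL is hard-coded on the assumption that it matches about.__download_url__ (inferred from the Case 1 output above), current_logic copies the snippet quoted under Root cause, and build_download_url is a hypothetical name for one possible fix that appends the missing slash and checks the result against whichever base URL is actually in use.

```python
from typing import Optional
from urllib.parse import urljoin

# Assumption: mirrors spacy.about.__download_url__; hard-coded so this runs without spaCy.
DEFAULT_URL = "https://github.com/explosion/spacy-models/releases/download"


def current_logic(filename: str, custom_url: Optional[str] = None) -> str:
    """Stand-alone copy of the logic quoted from spacy/cli/download.py."""
    base_url = custom_url if custom_url else DEFAULT_URL
    if not base_url.endswith("/"):
        base_url = DEFAULT_URL + "/"  # discards custom_url
    download_url = urljoin(base_url, filename)
    if not download_url.startswith(DEFAULT_URL):
        raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
    return download_url


def build_download_url(filename: str, custom_url: Optional[str] = None) -> str:
    """Hypothetical fix: keep the chosen base and guard against that same base."""
    base_url = custom_url if custom_url else DEFAULT_URL
    if not base_url.endswith("/"):
        base_url += "/"  # append the slash instead of swapping in the default
    download_url = urljoin(base_url, filename)
    # Still rejects "../x" or absolute URLs that escape the chosen base.
    if not download_url.startswith(base_url):
        raise ValueError(f"Download from {filename} rejected. Was it a relative path?")
    return download_url


if __name__ == "__main__":
    mirror = "https://my-mirror.example.com/models"
    print(current_logic("foo-1.0.tar.gz", mirror))
    # → https://github.com/explosion/spacy-models/releases/download/foo-1.0.tar.gz
    print(build_download_url("foo-1.0.tar.gz", mirror))
    # → https://my-mirror.example.com/models/foo-1.0.tar.gz
```

Comparing against base_url rather than the default keeps the relative-path protection the original guard was presumably meant to provide, while letting a mirror through with or without a trailing slash.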

Impact

Users in air-gapped or mirrored environments (the exact users the feature was added for, per #13848) may believe their downloads come from their own mirror, while a --url without a trailing slash silently sends the request to github.com. Network-layer egress blocking, where operators have it in place, is the only remaining safeguard.

Your Environment

  • spaCy version: 3.8.13 (also reproduces on master HEAD, prep for 3.8.14)
  • Python version: 3.12
  • Platform: macOS / Linux

PR: shaun0927/spaCy@fix/download-custom-url (incoming).
