Make test_download.py hermetic; remove dead wget path; harden zip/gz decompress (closes #42)#56
Merged
Merged
Conversation
…#42) test_fetch_decompress hit a live Ensembl FTP server and looped over use_wget_if_available/timeout while the file was only downloaded once — so the result depended on iteration order (the flakiness in #42), and CI was noisy. - Rewrite the tests to serve a local .gz over file:// with the cache redirected to a tmp dir (the pattern test_streaming_download.py already uses): no network, no cross-run cached state. Adds deterministic cache-reuse + force coverage. - Remove the wget download path: subprocess.call's non-zero exit was ignored (silent partial downloads), and the modernization's streaming Python downloader (_stream_to_file, #49) already handles http(s) and ftp. Drops the subprocess/errno imports and the wget threading through the private helpers. - Keep use_wget_if_available on the public fetch_file / Cache.fetch as a deprecated, ignored no-op (warns when explicitly passed) so callers don't break. No version bump — leaving the release/version decision to the maintainer.
Coverage Report for CI Build 27786313738Coverage increased (+5.6%) to 79.043%Details
Uncovered ChangesNo uncovered changes found. Coverage Regressions1 previously-covered line in 1 file lost coverage.
Coverage Stats
💛 - Coveralls |
Addressing review of #42: - Stream the chosen zip member straight to the destination via z.open() + copyfileobj, replacing ZipFile.extract() which wrote into the current working directory and recreated the member's stored path (cwd-pollution + a "../" path-traversal footgun). Member selection (named match, else biggest) is unchanged. - Stream the gunzip to disk instead of reading the whole decompressed file into memory — consistent with the streaming download goal (#49). - Add a .zip decompression test (previously only .gz was covered) that also asserts nothing leaks into the cwd, and pin the deprecation warning message with pytest.warns(match=...).
… no partial Self-review of the prior hardening commit: streaming the zip member / gunzip straight into full_path regressed atomicity. A corrupt or truncated archive (e.g. a gzip CRC failure, detected only at end-of-stream) would leave a partial file at full_path — and fetch_file treats "path exists" as a cache hit, so the next call would silently serve the truncated data. Add _decompress_to_file(): stream the decompressed source into a sibling temp, then move it into place (matching the atomic `move` the non-decompress path already uses); drop the partial on failure. Restores the atomicity the original zip (extract+move) and gz (read-all-then-write) paths had, while keeping the streaming (bounded-memory) behaviour. Adds test_corrupt_gz_leaves_no_partial_cache locking it in.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #42.
The flaky test
test_fetch_decompresshit a live Ensembl FTP server and looped:The file is only downloaded on the first iteration; every later call finds it
cached and skips the download. So whether a broken code path is exercised
depends on iteration order (as #42 notes,
[True, False]vs[False, True]flip the outcome), and the test needs the cached files manually deleted between
runs. Plus the live FTP dependency made CI noisy/slow.
Changes
.gz/.zipover afile://URL and redirect the cache to a tmp dir (same approach as
test_streaming_download.py). No network, no cross-run cached state. Addsexplicit cache-reuse +
forcecoverage in place of the order-dependent loop,plus a
.zipcase (only.gzwas covered before).wgetpath.subprocess.call(...)'s non-zero exit wasignored, so a failed
wgetsilently left a partial/empty file; and the 1.6.0modernization's streaming Python downloader (
_stream_to_file, _download buffers entire response in memory; no streaming/progress hook #49) alreadyhandles both http(s) and ftp (urllib uses passive FTP by default, matching
the old
--passive-ftp). Drops thesubprocess/errnoimports and theuse_wget_if_availablethreading through the private helpers.use_wget_if_availablestays on the publicfetch_file/Cache.fetchas an ignored no-op that emits aDeprecationWarningwhen explicitly passed, so existing callers don't break.Zip members are now streamed to the destination via
z.open()+copyfileobjinstead ofZipFile.extract(), which wrote into the currentworking directory and recreated the member's stored path — a cwd-pollution
and
../path-traversal risk. Gunzip is likewise streamed to disk rather thanread fully into memory (consistent with _download buffers entire response in memory; no streaming/progress hook #49). Member selection (named match,
else biggest) is unchanged.
Behavior change to note (no CHANGELOG in repo)
Cache.fetch'suse_wget_if_availabledefault flippedTrue → None; downloadsthat previously preferred
wgetnow always use the streaming Python downloader.FTP parity is preserved (passive mode by default), and the removed path had the
silent-partial-download bug, so this is a net improvement — but it is a
behavior change for FTP-via-
Cacheworth a release note.Test plan
pytest tests/test_download.py tests/test_streaming_download.py— all green.24 passed../lint.shclean.download.pycoverage 69% → 80%.Open question for the maintainer
I kept
use_wget_if_availableas a deprecated no-op (non-breaking). If you'drather a clean break, I can drop it from the public API entirely — there are no
in-repo callers; the only exposure is downstream openvax packages.