Make test_download.py hermetic; remove dead wget path; harden zip/gz decompress (closes #42)#56

Merged: iskandr merged 4 commits into master from fix-flaky-download-test on Jun 18, 2026

@iskandr iskandr commented Jun 18, 2026

Closes #42.

The flaky test

test_fetch_decompress hit a live Ensembl FTP server and looped:

for use_wget_if_available in [True, False]:
    for timeout in [None, 10**6]:
        path1 = fetch_file(URL, decompress=True, use_wget_if_available=..., timeout=timeout)
    assert ...

The file is only downloaded on the first iteration; every later call finds it
cached and skips the download. So whether a broken code path is exercised
depends on iteration order (as #42 notes, [True, False] vs [False, True]
flip the outcome), and the cached files have to be deleted manually between
runs. The live FTP dependency also made CI noisy and slow.
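
To make the order dependence concrete, here is a minimal sketch of the failure mode, not the datacache code itself. The names fetch, good_downloader, and broken_downloader are hypothetical stand-ins; the only behavior carried over from the description above is that a cached file short-circuits the download.

```python
# Toy model of a cache-backed fetch. All names here are hypothetical
# illustrations, not datacache functions.
import os
import tempfile


def good_downloader(url, path):
    """Stand-in for a working download path."""
    with open(path, "w") as f:
        f.write("payload")


def broken_downloader(url, path):
    """Stand-in for a broken download path."""
    raise RuntimeError("download failed")


def fetch(url, cache_dir, downloader):
    """Download only when the cached file is missing."""
    path = os.path.join(cache_dir, "cached.txt")
    if not os.path.exists(path):
        downloader(url, path)
    return path


def run_loop(downloaders):
    """Run the loop against a fresh cache and report whether it raised."""
    with tempfile.TemporaryDirectory() as cache_dir:
        try:
            for downloader in downloaders:
                fetch("ftp://example.invalid/file", cache_dir, downloader)
        except RuntimeError:
            return "failed"
        return "passed"


# Same two code paths, opposite outcomes depending on which runs first.
good_first = run_loop([good_downloader, broken_downloader])
broken_first = run_loop([broken_downloader, good_downloader])
print(good_first, broken_first)
```

With the working path first, the broken path is never reached because the cache hit skips it, so the loop passes without testing it.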

Changes

  • Hermetic, deterministic tests. Serve a local .gz/.zip over a file://
    URL and redirect the cache to a tmp dir (same approach as
    test_streaming_download.py). No network, no cross-run cached state. Adds
    explicit cache-reuse + force coverage in place of the order-dependent loop,
    plus a .zip case (only .gz was covered before).
  • Remove the dead wget path. subprocess.call(...)'s non-zero exit was
    ignored, so a failed wget silently left a partial/empty file; and the 1.6.0
    modernization's streaming Python downloader (_stream_to_file, #49) already
    handles both http(s) and ftp (urllib uses passive FTP by default, matching
    the old --passive-ftp). Drops the subprocess/errno imports and the
    use_wget_if_available threading through the private helpers.
  • Non-breaking deprecation. use_wget_if_available stays on the public
    fetch_file / Cache.fetch as an ignored no-op that emits a
    DeprecationWarning when explicitly passed, so existing callers don't break.
  • Hardened decompression (footgun fix surfaced while testing the zip path).
    Zip members are now streamed to the destination via z.open() +
    copyfileobj instead of ZipFile.extract(), which wrote into the current
    working directory and recreated the member's stored path: a cwd-pollution
    and ../ path-traversal risk. Gunzip is likewise streamed to disk rather than
    read fully into memory (consistent with #49). Member selection (named match,
    else biggest) is unchanged.
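
As a rough illustration of the hermetic setup in the first bullet, the sketch below builds a .gz fixture in a temp dir and retrieves it over a file:// URL using only the standard library. make_gz_fixture and download are hypothetical helpers; the real tests go through fetch_file with the cache redirected as in test_streaming_download.py, which is not reproduced here.

```python
import gzip
import shutil
import tempfile
import urllib.request
from pathlib import Path


def make_gz_fixture(directory, name="data.txt.gz", payload=b"hello\n"):
    """Write a small gzip file to serve as the 'remote' resource (hypothetical helper)."""
    path = Path(directory) / name
    with gzip.open(path, "wb") as f:
        f.write(payload)
    return path


def download(url, dest):
    """Stream a URL to disk; urllib's default opener handles file:// URLs."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source_dir = tmp / "source"  # plays the role of the remote server
    cache_dir = tmp / "cache"    # plays the role of the redirected cache
    source_dir.mkdir()
    cache_dir.mkdir()

    fixture = make_gz_fixture(source_dir)
    url = fixture.as_uri()  # absolute path rendered as a file:// URL
    downloaded = download(url, cache_dir / fixture.name)

    # Decompress the downloaded copy to confirm the round trip.
    with gzip.open(downloaded, "rb") as f:
        roundtrip = f.read()
```

Because both directories live under one temp dir, nothing survives between runs and no network access is involved.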

Behavior change to note (no CHANGELOG in repo)

Cache.fetch's use_wget_if_available default flipped True → None; downloads
that previously preferred wget now always use the streaming Python downloader.
FTP parity is preserved (passive mode by default), and the removed path had the
silent-partial-download bug, so this is a net improvement — but it is a
behavior change for FTP-via-Cache worth a release note.
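
For readers unfamiliar with the pattern, a None default is what lets a function distinguish "not passed" from "explicitly passed". The sketch below shows one way such a deprecated no-op can look; this fetch_file is a hypothetical stand-in with a made-up body and warning text, not the datacache implementation.

```python
import warnings


def fetch_file(url, use_wget_if_available=None):
    """Hypothetical stand-in: the keyword is accepted but ignored."""
    if use_wget_if_available is not None:
        # Warn only when the caller passed the argument explicitly.
        # The message text is an assumption made for this sketch.
        warnings.warn(
            "use_wget_if_available is deprecated and ignored",
            DeprecationWarning,
            stacklevel=2,
        )
    return "downloaded:" + url


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")

    fetch_file("file:///tmp/example.gz")
    silent_count = len(caught)  # default call: no warning recorded

    fetch_file("file:///tmp/example.gz", use_wget_if_available=False)
    explicit_count = len(caught)  # explicit argument: one warning recorded
    category = caught[-1].category
```

Passing False still warns, since any explicit value means the caller relies on a parameter that no longer does anything.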

Test plan

  • pytest tests/test_download.py tests/test_streaming_download.py — all green.
  • Full suite: 24 passed. ./lint.sh clean. download.py coverage 69% → 80%.

Open question for the maintainer

I kept use_wget_if_available as a deprecated no-op (non-breaking). If you'd
rather a clean break, I can drop it from the public API entirely — there are no
in-repo callers; the only exposure is downstream openvax packages.

…#42)

test_fetch_decompress hit a live Ensembl FTP server and looped over
use_wget_if_available/timeout while the file was only downloaded once — so the
result depended on iteration order (the flakiness in #42), and CI was noisy.

- Rewrite the tests to serve a local .gz over file:// with the cache redirected
  to a tmp dir (the pattern test_streaming_download.py already uses): no network,
  no cross-run cached state. Adds deterministic cache-reuse + force coverage.
- Remove the wget download path: subprocess.call's non-zero exit was ignored
  (silent partial downloads), and the modernization's streaming Python
  downloader (_stream_to_file, #49) already handles http(s) and ftp. Drops the
  subprocess/errno imports and the wget threading through the private helpers.
- Keep use_wget_if_available on the public fetch_file / Cache.fetch as a
  deprecated, ignored no-op (warns when explicitly passed) so callers don't break.
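
The silent-failure mode mentioned above comes from how subprocess.call reports errors: it returns the exit status instead of raising. The sketch below uses a child Python process that exits with status 1 as a stand-in for a failed wget. It only illustrates the failure mode; this PR removes the wget path rather than switching it to check_call.

```python
import subprocess
import sys

# A command that always fails, standing in for a failed wget invocation.
failing_command = [sys.executable, "-c", "import sys; sys.exit(1)"]

# subprocess.call returns the exit status and never raises on a non-zero
# exit, so a caller that discards the return value cannot see the failure.
ignored_status = subprocess.call(failing_command)

# subprocess.check_call raises CalledProcessError for the same command.
try:
    subprocess.check_call(failing_command)
    raised = False
except subprocess.CalledProcessError as error:
    raised = True
    returncode = error.returncode
```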

No version bump — leaving the release/version decision to the maintainer.

coveralls commented Jun 18, 2026

Coverage Report for CI Build 27786313738

Coverage increased (+5.6%) to 79.043%

Details

  • Coverage increased (+5.6%) from the base build.
  • Patch coverage: 23 of 23 lines across 2 files are fully covered (100%).
  • 1 coverage regression across 1 file.

Uncovered Changes

No uncovered changes found.

Coverage Regressions

1 previously-covered line in 1 file lost coverage.

File                   Lines Losing Coverage  Coverage
datacache/download.py  1                      82.05%

Coverage Stats

Relevant Lines: 439
Covered Lines: 347
Line Coverage: 79.04%
Coverage Strength: 2.37 hits per line

Addressing review of #42:
- Stream the chosen zip member straight to the destination via z.open() +
  copyfileobj, replacing ZipFile.extract() which wrote into the current working
  directory and recreated the member's stored path (cwd-pollution + a "../"
  path-traversal footgun). Member selection (named match, else biggest) is
  unchanged.
- Stream the gunzip to disk instead of reading the whole decompressed file into
  memory — consistent with the streaming download goal (#49).
- Add a .zip decompression test (previously only .gz was covered) that also
  asserts nothing leaks into the cwd, and pin the deprecation warning message
  with pytest.warns(match=...).
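
A minimal sketch of the z.open() + copyfileobj approach follows, assuming the selection rule stated above (named match, else biggest). pick_member and extract_member_to are hypothetical names for illustration, and the basename comparison is a guess at what "named match" means; the merged code may differ.

```python
import os
import shutil
import tempfile
import zipfile


def pick_member(archive, wanted_name=None):
    """Return the member whose basename matches, else the largest one."""
    infos = [info for info in archive.infolist() if not info.is_dir()]
    for info in infos:
        if wanted_name and os.path.basename(info.filename) == wanted_name:
            return info
    return max(infos, key=lambda info: info.file_size)


def extract_member_to(zip_path, dest_path, wanted_name=None):
    """Stream one member's bytes to dest_path, ignoring its stored path."""
    with zipfile.ZipFile(zip_path) as archive:
        member = pick_member(archive, wanted_name)
        with archive.open(member) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return dest_path


with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "fixture.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("small.txt", b"x")
        archive.writestr("archive_subdir/big.txt", b"y" * 100)

    # No name given: the biggest member is chosen.
    biggest = extract_member_to(zip_path, os.path.join(tmp, "out.txt"))
    with open(biggest, "rb") as f:
        biggest_bytes = f.read()

    # Name given: the matching member wins even though it is smaller.
    named = extract_member_to(zip_path, os.path.join(tmp, "named.txt"), "small.txt")
    with open(named, "rb") as f:
        named_bytes = f.read()

    # The stored directory is not recreated next to the output or in the cwd.
    leaked_in_tmp = os.path.exists(os.path.join(tmp, "archive_subdir"))
    leaked_in_cwd = os.path.exists(os.path.join(os.getcwd(), "archive_subdir"))
```

Since the destination path is chosen by the caller and only the member's bytes are copied, the name stored in the archive never influences where data lands.
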
@iskandr iskandr changed the title from "Make test_download.py hermetic; remove dead wget path (closes #42)" to "Make test_download.py hermetic; remove dead wget path; harden zip/gz decompress (closes #42)" on Jun 18, 2026
iskandr added 2 commits June 18, 2026 15:53
… no partial

Self-review of the prior hardening commit: streaming the zip member / gunzip
straight into full_path regressed atomicity. A corrupt or truncated archive
(e.g. a gzip CRC failure, detected only at end-of-stream) would leave a partial
file at full_path — and fetch_file treats "path exists" as a cache hit, so the
next call would silently serve the truncated data.

Add _decompress_to_file(): stream the decompressed source into a sibling temp,
then move it into place (matching the atomic `move` the non-decompress path
already uses); drop the partial on failure. Restores the atomicity the original
zip (extract+move) and gz (read-all-then-write) paths had, while keeping the
streaming (bounded-memory) behaviour.

Adds test_corrupt_gz_leaves_no_partial_cache locking it in.
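
The temp-then-move pattern described above can be sketched as follows for the gzip case. decompress_gz_atomically is a hypothetical name and a simplified stand-in for _decompress_to_file, whose actual signature and body are not shown in this conversation; the ".partial" suffix is likewise an assumption.

```python
import gzip
import os
import shutil
import tempfile


def decompress_gz_atomically(src_path, dest_path):
    """Gunzip into a sibling temp file, then move it into place."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".partial")
    try:
        with os.fdopen(fd, "wb") as out, gzip.open(src_path, "rb") as src:
            shutil.copyfileobj(src, out)  # bounded memory: copies in chunks
        shutil.move(tmp_path, dest_path)
    except BaseException:
        # Drop the partial output so a later "path exists" check cannot hit it.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return dest_path


with tempfile.TemporaryDirectory() as tmp:
    payload = b"line\n" * 10000
    blob = gzip.compress(payload)

    good = os.path.join(tmp, "good.gz")
    with open(good, "wb") as f:
        f.write(blob)

    # Flip one byte of the CRC trailer; the error only surfaces at end-of-stream.
    bad = os.path.join(tmp, "bad.gz")
    with open(bad, "wb") as f:
        f.write(blob[:-8] + bytes([blob[-8] ^ 0xFF]) + blob[-7:])

    good_dest = decompress_gz_atomically(good, os.path.join(tmp, "good.txt"))
    with open(good_dest, "rb") as f:
        good_ok = f.read() == payload

    bad_dest = os.path.join(tmp, "bad.txt")
    try:
        decompress_gz_atomically(bad, bad_dest)
        failed = False
    except (OSError, EOFError):
        failed = True

    bad_dest_exists = os.path.exists(bad_dest)
    leftovers = [name for name in os.listdir(tmp) if name.endswith(".partial")]
```

Creating the temp file in the destination's own directory keeps the final move on one filesystem, where it can be a rename rather than a copy.
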
@iskandr iskandr merged commit b8cb225 into master Jun 18, 2026
6 checks passed
@iskandr iskandr deleted the fix-flaky-download-test branch June 18, 2026 20:10