Fix #49: stream downloads to disk + optional progress_callback #53
Merged
Previously _download read the entire HTTP/FTP response into memory (response.content / response.read()) before writing it to the temp file, so downloading a large reference file (e.g. a ~200 MB HPA archive) spiked RAM by the full file size and gave callers no way to show progress.

Replace _download with _stream_to_file, which streams the response to an open file handle one chunk at a time (requests stream=True + iter_content for http, chunked response.read for ftp/file), capping memory at one chunk.

Add an optional progress_callback(bytes_downloaded, total_bytes) hook and chunk_size, threaded through _download_to_temp_file, _download_and_decompress_if_necessary and fetch_file, so consumers can drive a tqdm bar without datacache depending on tqdm.

Adds network-free regression tests using local file:// URLs.

Claude-Session: https://claude.ai/code/session_011bzfZPTzWnhAMVD7msyMg1
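The chunked-read idea from the commit message can be sketched with the standard library alone. This is a minimal illustration, not datacache's actual _stream_to_file: the name stream_to_file_sketch, its signature, and the demo helper are assumptions made for this example, and only the ftp/file-style response.read(chunk_size) loop is shown (the requests-based http path is omitted).

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlopen


def stream_to_file_sketch(url, out, chunk_size=1024 * 1024, progress_callback=None):
    """Copy the response at `url` into the open binary file `out`, one chunk at a time.

    Hypothetical stand-in for illustration; not datacache's _stream_to_file.
    """
    bytes_downloaded = 0
    with urlopen(url) as response:
        # Content-Length may be missing; report None in that case.
        length = response.headers.get("Content-Length")
        total_bytes = int(length) if length is not None else None
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break  # end of stream
            out.write(chunk)  # only one chunk is held in memory at a time
            bytes_downloaded += len(chunk)
            if progress_callback is not None:
                progress_callback(bytes_downloaded, total_bytes)
    return bytes_downloaded


def demo():
    """Stream a local file:// URL and collect the progress reports."""
    payload = os.urandom(10_000)
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "source.bin"
        dst = Path(tmp) / "copy.bin"
        src.write_bytes(payload)
        reports = []
        with open(dst, "wb") as out:
            written = stream_to_file_sketch(
                src.as_uri(),
                out,
                chunk_size=4096,
                progress_callback=lambda done, total: reports.append((done, total)),
            )
        return payload, dst.read_bytes(), reports, written
```

Using a local file:// URL keeps the sketch network-free, in the same spirit as the regression tests mentioned in the commit message.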
Coverage Report for CI Build 27777985749 (Coveralls): coverage decreased (-0.5%) to 74.157%. No coverage regressions found.
Problem
Closes #49.
_download read the entire HTTP/FTP response into memory (response.content / response.read()) and _download_to_temp_file then wrote that one in-memory blob to disk. For a ~200 MB reference file (e.g. an HPA rna_tissue_consensus.tsv.zip) the whole payload sat in RAM, and there was no hook for callers to render a progress bar, so the call just appeared to hang.

Reported by the tsarina / hitlist folks at pirl-unc, who were routing large transfers around datacache specifically because of this.
Fix
Replace _download with _stream_to_file, which streams the response straight to an open file handle one chunk at a time:

- http: requests.get(..., stream=True) + iter_content(chunk_size=...)
- ftp/file: response.read(chunk_size) loop

Memory is now capped at one chunk_size (default 1 MB) instead of the full file.

Add an optional progress_callback(bytes_downloaded, total_bytes) hook (and chunk_size), threaded through _download_to_temp_file → _download_and_decompress_if_necessary → fetch_file. total_bytes comes from the server's Content-Length (or None if unknown). This lets consumers drive a tqdm bar without datacache taking a hard tqdm dependency. It applies to the Python downloader; the optional wget path is unchanged (wget renders its own progress).

The wget fallback and the decompression logic are otherwise untouched.

Tests
Added tests/test_streaming_download.py with network-free regression tests (local file:// URLs) covering: streaming writes the exact bytes, the progress callback fires per-chunk with monotonically non-decreasing counts and a final report equal to the file size, and the callback threads through _download_and_decompress_if_necessary.

https://claude.ai/code/session_011bzfZPTzWnhAMVD7msyMg1
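On the consumer side, the hook reports cumulative bytes, while a tqdm-style bar is advanced by increments through update(n). A small adapter bridges the two. The sketch below is an assumption about how a caller might wire this up, not code from this PR: make_bar_callback and FakeBar are hypothetical names, and FakeBar only mimics the two tqdm members used here (total and update) so the example runs without tqdm installed.

```python
def make_bar_callback(bar):
    """Return a progress_callback(bytes_downloaded, total_bytes) that drives `bar`.

    `bar` is any tqdm-like object with a writable `total` attribute and an
    `update(n)` method that advances the bar by n. Hypothetical helper.
    """
    last = 0

    def callback(bytes_downloaded, total_bytes):
        nonlocal last
        # total_bytes is None when the server sent no Content-Length.
        if total_bytes is not None and bar.total != total_bytes:
            bar.total = total_bytes
        # Convert the cumulative count into the increment since the last report.
        bar.update(bytes_downloaded - last)
        last = bytes_downloaded

    return callback


class FakeBar:
    """Minimal stand-in for a tqdm bar, for demonstration only."""

    def __init__(self):
        self.total = None
        self.n = 0

    def update(self, n):
        self.n += n


def demo():
    """Feed the adapter three cumulative reports, as a 10,000-byte download might."""
    bar = FakeBar()
    callback = make_bar_callback(bar)
    for done in (4096, 8192, 10000):
        callback(done, 10000)
    return bar
```

With tqdm installed, a caller could presumably pass a real bar instead, for example tqdm(unit="B", unit_scale=True), and hand the resulting callback to fetch_file via its progress_callback argument.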