Fix #49: stream downloads to disk + optional progress_callback #53

Merged
iskandr merged 1 commit into master from fix-49-streaming-download
Jun 18, 2026

@iskandr iskandr commented Jun 18, 2026

Contributor

Problem

Closes #49.

_download read the entire HTTP/FTP response into memory (response.content / response.read()) and _download_to_temp_file then wrote that one in-memory blob to disk. For a ~200 MB reference file (e.g. an HPA rna_tissue_consensus.tsv.zip) the whole payload sat in RAM, and there was no hook for callers to render a progress bar — the call just appeared to hang.

Reported by the tsarina / hitlist folks at pirl-unc, who were routing large transfers around datacache specifically because of this.

Fix

  • Replace _download with _stream_to_file, which streams the response straight to an open file handle one chunk at a time:

    • http(s): requests.get(..., stream=True) + iter_content(chunk_size=...)
    • ftp/file: chunked response.read(chunk_size) loop

    Memory is now capped at one chunk_size (default 1 MB) instead of the full file.

  • Add an optional progress_callback(bytes_downloaded, total_bytes) hook (and chunk_size), threaded through _download_to_temp_file → _download_and_decompress_if_necessary → fetch_file. total_bytes comes from the server's Content-Length (or None if unknown). This lets consumers drive a tqdm bar without datacache taking a hard tqdm dependency. It applies to the Python downloader; the optional wget path is unchanged (wget renders its own progress).
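
For readers who want to see the shape of the change, here is a minimal sketch of the chunked copy described in the two bullets above. It is not the merged code: the name stream_to_file, its signature, and its return value are assumptions made for illustration, and error handling is left out. Only requests.get(..., stream=True), iter_content(chunk_size=...), and the chunked response.read(chunk_size) loop come from the description.

```python
import urllib.request
from urllib.parse import urlparse


def stream_to_file(url, fh, chunk_size=1024 * 1024, progress_callback=None):
    """Copy the body at `url` into the open binary handle `fh`, chunk by chunk.

    Hypothetical stand-in for _stream_to_file. Returns the bytes written.
    """
    downloaded = 0

    def write_chunk(chunk, total):
        nonlocal downloaded
        fh.write(chunk)
        downloaded += len(chunk)
        if progress_callback is not None:
            # Cumulative count, plus the total (None when the size is unknown).
            progress_callback(downloaded, total)

    def parse_length(headers):
        length = headers.get("Content-Length")
        return int(length) if length is not None else None

    if urlparse(url).scheme in ("http", "https"):
        import requests  # imported lazily so the ftp/file branch works without it

        with requests.get(url, stream=True) as response:
            response.raise_for_status()
            total = parse_length(response.headers)
            for chunk in response.iter_content(chunk_size=chunk_size):
                if chunk:  # skip empty keep-alive chunks
                    write_chunk(chunk, total)
    else:
        # ftp:// and file:// go through urllib; read() is called in a loop
        # so at most one chunk is held in memory at a time.
        with urllib.request.urlopen(url) as response:
            total = parse_length(response.headers)
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                write_chunk(chunk, total)
    return downloaded
```

Peak memory in this sketch is bounded by chunk_size rather than by the size of the response, which is the property the fix is after.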

The wget fallback and the decompression logic are otherwise untouched.
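
One practical wrinkle for consumers: the hook reports a cumulative byte count, while progress bars such as tqdm are advanced by increments via update(n). A small adapter along these lines could bridge the two. The class names IncrementalProgress and CountingBar are made up for this sketch, and CountingBar is only a stand-in so the example runs without tqdm installed.

```python
class IncrementalProgress:
    """Adapt progress_callback(bytes_downloaded, total_bytes) to a bar with update(n).

    Hypothetical helper, not part of datacache.
    """

    def __init__(self, bar):
        self.bar = bar
        self._seen = 0  # cumulative bytes already reported to the bar

    def __call__(self, bytes_downloaded, total_bytes):
        # Fill in the total once it is known (Content-Length may be absent).
        if total_bytes is not None and getattr(self.bar, "total", None) is None:
            self.bar.total = total_bytes
        # Convert the cumulative count into the increment since the last call.
        self.bar.update(bytes_downloaded - self._seen)
        self._seen = bytes_downloaded


class CountingBar:
    """Tiny stand-in for a progress bar so the sketch runs without tqdm."""

    def __init__(self):
        self.total = None
        self.n = 0

    def update(self, n):
        self.n += n


bar = CountingBar()
callback = IncrementalProgress(bar)
for downloaded in (4, 8, 10):  # simulated per-chunk reports for a 10-byte file
    callback(downloaded, 10)
```

With tqdm installed, something like IncrementalProgress(tqdm(unit="B", unit_scale=True)) should behave the same way, since tqdm exposes update(n) and a total attribute. The resulting object would then be passed as progress_callback to fetch_file.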

Tests

Added tests/test_streaming_download.py with network-free regression tests (local file:// URLs) covering: streaming writes the exact bytes, the progress callback fires per-chunk with monotonic non-decreasing counts and a final report equal to the file size, and the callback threads through _download_and_decompress_if_necessary.
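
As an illustration of the callback properties listed above, the checks can be written as a small helper over the recorded (bytes_downloaded, total_bytes) pairs. This is a sketch, not the contents of tests/test_streaming_download.py. The name check_progress_calls is hypothetical, and the one-call-per-chunk count assumes every read returns a full chunk except the last, which is reasonable for local file:// reads but not guaranteed for sockets.

```python
import math


def check_progress_calls(calls, file_size, chunk_size):
    """Raise AssertionError unless `calls` matches the behaviour described above.

    `calls` is the list of (bytes_downloaded, total_bytes) pairs recorded
    by a progress callback during one download.
    """
    counts = [downloaded for downloaded, _total in calls]

    # Fires once per chunk (assumes full chunks except possibly the last).
    assert len(counts) == math.ceil(file_size / chunk_size)

    # Cumulative counts never go backwards.
    assert all(a <= b for a, b in zip(counts, counts[1:]))

    # The final report equals the file size.
    if counts:
        assert counts[-1] == file_size

    # The total is either unknown (None) or the real size.
    assert all(total in (None, file_size) for _downloaded, total in calls)


# A 10-byte file read in 4-byte chunks yields three reports.
check_progress_calls([(4, 10), (8, 10), (10, 10)], file_size=10, chunk_size=4)
```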

https://claude.ai/code/session_011bzfZPTzWnhAMVD7msyMg1

Previously _download read the entire HTTP/FTP response into memory
(response.content / response.read()) before writing it to the temp file,
so downloading a large reference file (e.g. a ~200 MB HPA archive) spiked
RAM by the full file size and gave callers no way to show progress.

Replace _download with _stream_to_file, which streams the response to an
open file handle one chunk at a time (requests stream=True + iter_content
for http, chunked response.read for ftp/file), capping memory at one
chunk. Add an optional progress_callback(bytes_downloaded, total_bytes)
hook and chunk_size, threaded through _download_to_temp_file,
_download_and_decompress_if_necessary and fetch_file, so consumers can
drive a tqdm bar without datacache depending on tqdm.

Adds network-free regression tests using local file:// URLs.

Claude-Session: https://claude.ai/code/session_011bzfZPTzWnhAMVD7msyMg1

coveralls commented Jun 18, 2026


Coverage Report for CI Build 27777985749

Coverage decreased (-0.5%) to 74.157%

Details

  • Coverage decreased (-0.5%) from the base build.
  • Patch coverage: 11 uncovered changes across 1 file (20 of 31 lines covered, 64.52%).
  • No coverage regressions found.

Uncovered Changes

File                   Changed  Covered  %
datacache/download.py  31       20       64.52%

Coverage Regressions

No coverage regressions found.


Coverage Stats

Relevant Lines: 445
Covered Lines: 330
Line Coverage: 74.16%
Coverage Strength: 2.22 hits per line

💛 - Coveralls

@iskandr iskandr merged commit 4bb8898 into master Jun 18, 2026
5 of 6 checks passed
@iskandr iskandr deleted the fix-49-streaming-download branch June 18, 2026 17:41
Development

Successfully merging this pull request may close these issues.

_download buffers entire response in memory; no streaming/progress hook