Fix #49: stream downloads to disk + optional progress_callback by iskandr · Pull Request #53 · openvax/datacache

iskandr · 2026-06-18T17:38:01Z

Problem

Closes #49.

_download read the entire HTTP/FTP response into memory (response.content / response.read()), and _download_to_temp_file then wrote that single in-memory blob to disk. For a ~200 MB reference file (e.g. an HPA rna_tissue_consensus.tsv.zip) the whole payload sat in RAM, and callers had no hook for rendering a progress bar, so the call simply appeared to hang.

Reported by the tsarina / hitlist folks at pirl-unc, who were routing large transfers around datacache specifically because of this.

Fix

Replace _download with _stream_to_file, which streams the response straight to an open file handle one chunk at a time:
- http(s): requests.get(..., stream=True) + iter_content(chunk_size=...)
- ftp/file: chunked response.read(chunk_size) loop
Memory is now capped at one chunk_size (default 1 MB) instead of the full file.
Add an optional progress_callback(bytes_downloaded, total_bytes) hook (and chunk_size), threaded through _download_to_temp_file → _download_and_decompress_if_necessary → fetch_file. total_bytes comes from the server's Content-Length (or None if unknown). This lets consumers drive a tqdm bar without datacache taking a hard tqdm dependency. It applies to the Python downloader; the optional wget path is unchanged (wget renders its own progress).
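To make the chunked loop concrete, here is a minimal sketch of the idea, not the code merged in this PR. It assumes the standard library's urllib.request for every scheme (the PR itself uses requests with stream=True for http(s)), and the name stream_to_file_sketch is hypothetical.

```python
import urllib.request


def stream_to_file_sketch(url, out_file, chunk_size=1024 * 1024,
                          progress_callback=None):
    """Copy the body at `url` into the open binary file `out_file`.

    Hypothetical helper for illustration only; it is not datacache's
    _stream_to_file. At most one chunk is held in memory at a time.
    """
    bytes_downloaded = 0
    with urllib.request.urlopen(url) as response:
        # Content-Length may be absent, in which case the total is unknown.
        length = response.headers.get("Content-Length")
        total_bytes = int(length) if length is not None else None
        while True:
            chunk = response.read(chunk_size)
            if not chunk:  # empty bytes object signals end of stream
                break
            out_file.write(chunk)
            bytes_downloaded += len(chunk)
            if progress_callback is not None:
                # Report the cumulative count, not the size of this chunk.
                progress_callback(bytes_downloaded, total_bytes)
    return bytes_downloaded
```

Peak memory in this sketch is bounded by chunk_size rather than by the size of the remote file, which is the property the fix is after.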

The wget fallback and the decompression logic are otherwise untouched.
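On the consumer side, one detail is worth spelling out: the hook reports a cumulative byte count, while tqdm-style bars expect increments passed to update(). The adapter below is a sketch under that assumption. make_bar_callback is a hypothetical name, and the bar is duck-typed (anything with a total attribute and an update(n) method), so the snippet itself does not import tqdm.

```python
def make_bar_callback(bar):
    """Adapt a tqdm-like bar to a progress_callback(bytes_downloaded, total_bytes).

    `bar` only needs a writable `total` attribute and an `update(n)` method
    that takes an increment. Hypothetical helper, not part of datacache.
    """
    last = 0

    def callback(bytes_downloaded, total_bytes):
        nonlocal last
        # Fill in the total once the server has reported it.
        if total_bytes is not None and bar.total != total_bytes:
            bar.total = total_bytes
        # Convert the cumulative count into the delta since the last call.
        bar.update(bytes_downloaded - last)
        last = bytes_downloaded

    return callback


# Possible usage, assuming tqdm is installed and fetch_file accepts the new
# keyword described in this PR:
#
#   from tqdm import tqdm
#   from datacache import fetch_file
#   with tqdm(unit="B", unit_scale=True) as bar:
#       fetch_file(url, progress_callback=make_bar_callback(bar))
```

Keeping the adapter in the caller's code is what lets datacache avoid a hard tqdm dependency.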

Tests

Added tests/test_streaming_download.py with network-free regression tests (local file:// URLs) covering: streaming writes the exact bytes, the progress callback fires per-chunk with monotonic non-decreasing counts and a final report equal to the file size, and the callback threads through _download_and_decompress_if_necessary.
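The callback invariants listed above can be expressed as a small checker. This is an illustrative sketch rather than the contents of tests/test_streaming_download.py; assert_progress_invariants is a hypothetical name, and reports is assumed to be the list of (bytes_downloaded, total_bytes) tuples recorded by a test callback.

```python
def assert_progress_invariants(reports, expected_size):
    """Check progress reports recorded in call order.

    `reports` is a list of (bytes_downloaded, total_bytes) tuples.
    Hypothetical helper for illustration only.
    """
    assert reports, "progress callback never fired"
    counts = [done for done, _total in reports]
    # Counts must never go backwards between consecutive calls.
    assert all(a <= b for a, b in zip(counts, counts[1:])), counts
    # The final report must account for every byte of the file.
    assert counts[-1] == expected_size, (counts[-1], expected_size)
```

A test would typically pass progress_callback=lambda done, total: reports.append((done, total)) to the downloader and then hand reports and the source file's size to this checker.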

https://claude.ai/code/session_011bzfZPTzWnhAMVD7msyMg1

Previously _download read the entire HTTP/FTP response into memory (response.content / response.read()) before writing it to the temp file, so downloading a large reference file (e.g. a ~200 MB HPA archive) spiked RAM by the full file size and gave callers no way to show progress.

Replace _download with _stream_to_file, which streams the response to an open file handle one chunk at a time (requests stream=True + iter_content for http, chunked response.read for ftp/file), capping memory at one chunk.

Add an optional progress_callback(bytes_downloaded, total_bytes) hook and chunk_size, threaded through _download_to_temp_file, _download_and_decompress_if_necessary and fetch_file, so consumers can drive a tqdm bar without datacache depending on tqdm. Adds network-free regression tests using local file:// URLs.

Claude-Session: https://claude.ai/code/session_011bzfZPTzWnhAMVD7msyMg1

coveralls · 2026-06-18T17:39:30Z

Coverage Report for CI Build 27777985749

Coverage decreased (-0.5%) to 74.157%

Details

Coverage decreased (-0.5%) from the base build.
Patch coverage: 11 uncovered changes across 1 file (20 of 31 lines covered, 64.52%).
No coverage regressions found.

Uncovered Changes

File	Changed lines	Covered lines	Patch coverage
datacache/download.py	31	20	64.52%

Coverage Regressions

No coverage regressions found.

Coverage Stats


Relevant Lines:	445
Covered Lines:	330
Line Coverage:	74.16%
Coverage Strength:	2.22 hits per line

💛 - Coveralls

iskandr merged commit 4bb8898 into master Jun 18, 2026
5 of 6 checks passed

iskandr deleted the fix-49-streaming-download branch June 18, 2026 17:41
