
Extract ZIP archives while downloading #279

Merged
hagenw merged 35 commits into main from stream-extract
Feb 2, 2026

@hagenw
Member

@hagenw hagenw commented Jan 8, 2026

Summary

  • Speed up get_archive() by extracting a ZIP file while downloading it.
  • Remove num_workers from get_archive() (introduced in Use workers for file download #271) for required sequential processing of the file chunks. When num_workers > 1 we do not use streaming extraction, but first download the file with multiple workers to a temp file and extract it afterwards using a single worker.

Extraction of archives cannot be sped up by using multiple workers (compare audeering/audeer#186), hence we speed it up indirectly here. Extracting ZIP files in a streaming fashion works for all backends.

Details

TAR.GZ archives are not affected by the changes and are still first downloaded and then extracted.
The biggest downside is that we need additional external dependencies for the implementation, as we use stream-unzip, which depends on the two self-contained packages pycryptodome and stream-inflate.
stream-unzip does not yet support Python 3.14; there we fall back to the old behavior.
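
The fallback described above can be sketched as an optional import. This is a minimal sketch, not the code from the pull request: the flag name `STREAM_UNZIP_AVAILABLE` appears in the reviewer's guide, while `use_streaming()` is a hypothetical helper, and its `num_workers == 1` condition assumes the behavior described in the summary.

```python
# Optional import: stream-unzip may be missing, e.g. on Python 3.14.
try:
    from stream_unzip import stream_unzip  # noqa: F401

    STREAM_UNZIP_AVAILABLE = True
except ImportError:
    STREAM_UNZIP_AVAILABLE = False


def use_streaming(src_path: str, num_workers: int = 1) -> bool:
    """Hypothetical helper: decide whether streaming extraction applies."""
    return (
        STREAM_UNZIP_AVAILABLE
        and num_workers == 1
        and src_path.lower().endswith(".zip")
    )
```

With this check, TAR.GZ archives and multi-worker downloads always take the download-then-extract path, independent of whether stream-unzip is installed.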

Removing num_workers from get_archive() is not a breaking change as we yanked version 2.3.0 of audbackend.

The pull request also adds two new private methods that every backend has to implement:

  • audbackend.backend.Base._get_file_stream()
  • audbackend.backend.Base._size() (for progress bar during streaming ZIP extraction)
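
For a backend that stores files on local disk, the two methods might look roughly as follows. This is a sketch under assumptions: `LocalFileBackend` is a hypothetical stand-in and not the actual audbackend class, and the 64 KB chunk size is the value mentioned later in the review.

```python
import os
from collections.abc import Iterator

CHUNK_SIZE = 64 * 1024  # 64 KB, assumed chunk size


class LocalFileBackend:
    """Hypothetical stand-in for a backend storing files on disk."""

    def _get_file_stream(self, src_path: str) -> Iterator[bytes]:
        # Open in binary mode, so that bytes are yielded
        with open(src_path, "rb") as fp:
            while data := fp.read(CHUNK_SIZE):
                yield data

    def _size(self, path: str) -> int:
        # File size in bytes, used as total for the progress bar
        return os.path.getsize(path)
```

Yielding chunks from a generator keeps memory usage bounded by the chunk size, regardless of how large the archive is.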

Benchmarks

Benchmark results averaged over 10 runs using a single CPU thread, calling Minio.get_archive() on the file /alm/audeering-omni/stage1_2/torch/7289b57d.zip (4.2 GB).

| Before | After |
|---|---|
| 0:02:18.284 | 0:01:18.700 |

Benchmark with audmodel for 7289b57d-1.0.0 (compare audeering/audmodel#38)

| Before | After |
|---|---|
| 0:02:21.248 | 0:01:20.057 |

And using audb to load emodb version 2.0.0 (execution time as h:mm:ss):

| num_workers | Before | After |
|---|---|---|
| 1 | 0:03:43.810 | 0:01:30.780 |
| 6 | 0:00:40.120 | 0:00:24.970 |
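
The speed-up implied by these benchmarks can be computed directly from the reported times. The snippet below is only a convenience sketch; `to_seconds()` and `speedup()` are hypothetical helper names, and the timings are copied from the benchmark results above.

```python
def to_seconds(timestamp: str) -> float:
    """Convert a 'h:mm:ss.fff' string to seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def speedup(before: str, after: str) -> float:
    """Ratio of the execution time before and after the change."""
    return to_seconds(before) / to_seconds(after)


benchmarks = {
    "Minio.get_archive()": ("0:02:18.284", "0:01:18.700"),
    "audmodel": ("0:02:21.248", "0:01:20.057"),
    "audb, num_workers=1": ("0:03:43.810", "0:01:30.780"),
    "audb, num_workers=6": ("0:00:40.120", "0:00:24.970"),
}
for name, (before, after) in benchmarks.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```

Under these numbers, streaming extraction is roughly 1.6 to 2.5 times faster than downloading first and extracting afterwards.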

Discussion

  • The current approach would automatically speed up all applications using audbackend that use ZIP files (audb, audmodel)
  • audmodel could be further improved by not storing the model files as ZIP files, as they are already quite compressed. Then we could download the big model file with get_file() using several workers and the remaining model metadata as a ZIP file. But this would of course require updating how audmodel.publish() stores the files in the first place

@sourcery-ai
Contributor

sourcery-ai Bot commented Jan 8, 2026

Reviewer's Guide

Implement streaming ZIP extraction in backend.get_archive(), introducing backend streaming/file-size interfaces and cleanup semantics, while removing num_workers support for archives and wiring stream-unzip as an optional dependency.

Sequence diagram for streaming ZIP extraction in get_archive

sequenceDiagram
    participant Caller
    participant Backend as Backend
    participant StreamUnzip as stream_unzip
    participant FS as FileSystem

    Caller->>Backend: get_archive(src_path, dst_root, validate, verbose)
    Backend->>Backend: check_path(src_path)
    Backend->>Backend: detect .zip and STREAM_UNZIP_AVAILABLE
    alt ZIP with streaming
        Backend->>Backend: _get_archive_streaming(src_path, dst_root, validate, verbose)
        Backend->>Backend: _size(src_path) [if verbose]
        Backend->>Backend: create progress_bar
        Backend->>Backend: stream_with_hash()
        loop download chunks
            Backend->>Backend: _get_file_stream(src_path)
            Backend-->>Backend: chunk bytes
            Backend->>Backend: update md5_hash
            Backend->>Backend: pbar.update(len(chunk))
        end
        Backend->>StreamUnzip: stream_unzip(stream_with_hash())
        loop for each entry in archive
            StreamUnzip-->>Backend: file_name, file_size, unzipped_chunks
            Backend->>Backend: skip directories
            Backend->>FS: mkdir(parent(dst_path))
            loop write unzipped_chunks
                Backend->>FS: write chunk to dst_path
            end
            Backend->>Backend: record extracted_files
        end
        alt validate checksum
            Backend->>Backend: checksum(src_path)
            Backend->>Backend: compare expected vs actual
            opt mismatch
                Backend->>Backend: cleanup_on_failure()
                Backend-->>Caller: raise InterruptedError
            end
        end
        Backend-->>Caller: return extracted_files
    else other archives or ZIP without streaming
        Backend->>Backend: create TemporaryDirectory(tmp_root)
        Backend->>Backend: get_file(src_path, local_archive, validate, verbose)
        Backend->>Backend: audeer.extract_archive(local_archive, dst_root, validate, verbose)
        Backend-->>Caller: return extracted_files
    end
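
The `stream_with_hash()` step in the diagram can be illustrated with the standard library alone. This is a simplified sketch that assumes a plain list of chunks instead of a backend stream and leaves out the progress bar; in the pull request the consumer of the generator is `stream_unzip`, here it is a simple `join`.

```python
import hashlib
from collections.abc import Iterable, Iterator


def stream_with_hash(chunks: Iterable[bytes], md5_hash) -> Iterator[bytes]:
    """Yield chunks unchanged while updating the MD5 hash."""
    for chunk in chunks:
        md5_hash.update(chunk)
        yield chunk


md5_hash = hashlib.md5()
chunks = [b"first chunk, ", b"second chunk"]
# A consumer pulls the chunks one by one from the generator
received = b"".join(stream_with_hash(chunks, md5_hash))
# Once the stream is exhausted, the hash covers the whole file
assert md5_hash.hexdigest() == hashlib.md5(received).hexdigest()
```

Because the hash is updated as a side effect of iteration, the checksum is only complete after the consumer has read the stream to its end, which is why validation happens after extraction.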

Class diagram for backend streaming interfaces and implementations

classDiagram
    class Backend {
        +get_archive(src_path str, dst_root str, tmp_root str, validate bool, verbose bool) list~str~
        -_get_archive_streaming(src_path str, dst_root str, validate bool, verbose bool) list~str~
        -_get_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
        +get_file(src_path str, dst_path str, tmp_root str, num_workers int, validate bool, verbose bool)
    }

    class ArtifactoryBackend {
        -_get_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
    }

    class FilesystemBackend {
        -_get_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
    }

    class MinioBackend {
        -_download_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
    }

    Backend <|-- ArtifactoryBackend
    Backend <|-- FilesystemBackend
    Backend <|-- MinioBackend

    class stream_unzip {
        +stream_unzip(source Iterator~bytes~) Iterator
    }

    Backend ..> stream_unzip : uses for ZIP streaming

File-Level Changes

Change: Add streaming ZIP extraction path in BaseBackend.get_archive() using stream-unzip, with checksum validation and robust cleanup behavior.
Details:
  • Extend get_archive() to choose between streaming ZIP extraction and the existing download-then-extract path based on file extension and stream-unzip availability.
  • Introduce _get_archive_streaming() to stream ZIP data from backends, feed it into stream_unzip, write files incrementally, and track extracted paths.
  • Compute optional MD5 on the download stream for validate=True and compare with backend.checksum(), cleaning up extracted files and raising InterruptedError on mismatch.
  • Improve error handling by mapping ZIP/streaming errors to a RuntimeError("Broken archive: ...") and ensuring partial extracts are removed while preserving pre-existing directories.
  • Validate tmp_root via TemporaryDirectory to surface consistent errors and adjust get_archive() docstring to describe new behavior and num_workers removal.
Files:
  • audbackend/core/backend/base.py

Change: Introduce streaming and size primitives to backend interfaces and implement them for supported backends.
Details:
  • Add abstract _get_file_stream() and _size() methods to the base backend interface for chunked access and file-size queries.
  • Implement _get_file_stream() and _size() for FileSystem backend using local file I/O and os.path.getsize().
  • Implement _get_file_stream() and _size() for Artifactory backend using ArtifactoryPath streaming and stat().
  • Implement _get_file_stream() for Minio backend using MinioClient.get_object() with manual chunked reads and proper resource cleanup.
  • Extend the single-folder test backend with a _get_file_stream() implementation for use in tests.
Files:
  • audbackend/core/backend/base.py
  • audbackend/core/backend/filesystem.py
  • audbackend/core/backend/artifactory.py
  • audbackend/core/backend/minio.py
  • tests/singlefolder.py

Change: Add backend tests for _size() and streaming-extraction cleanup/behavior.
Details:
  • Add backend-specific tests verifying that _size() returns the correct size for uploaded files on FileSystem, Artifactory, and Minio backends.
  • Add tests exercising streaming extraction cleanup when extracting malformed ZIPs into existing directories, ensuring only newly extracted files are removed.
  • Add tests validating that failed checksum validation after streaming extraction removes only extracted files while preserving existing content and destination directories.
  • Add tests confirming that streaming ZIP extraction correctly skips directory entries while extracting nested files, and that returned paths match extracted content.
Files:
  • tests/test_backend_filesystem.py
  • tests/test_backend_artifactory.py
  • tests/test_backend_minio.py

Change: Declare stream-unzip as a conditional runtime dependency for supported Python versions.
Details:
  • Add stream-unzip >=0.0.99 as a dependency in pyproject.toml, gated to python_version < "3.14" to handle missing support on newer Python versions.
Files:
  • pyproject.toml

@codecov

codecov Bot commented Jan 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (47afb2b) to head (fd6224d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
| Files with missing lines | Coverage Δ |
|---|---|
| audbackend/core/backend/artifactory.py | 100.0% <100.0%> (ø) |
| audbackend/core/backend/base.py | 100.0% <100.0%> (ø) |
| audbackend/core/backend/filesystem.py | 100.0% <100.0%> (ø) |
| audbackend/core/backend/minio.py | 100.0% <100.0%> (ø) |
| audbackend/core/interface/unversioned.py | 100.0% <ø> (ø) |
| audbackend/core/interface/versioned.py | 100.0% <ø> (ø) |

@hagenw hagenw marked this pull request as ready for review January 14, 2026 16:35
Contributor

@sourcery-ai sourcery-ai Bot left a comment

Hey - I've found 4 issues, and left some high level feedback:

  • In _get_archive_streaming, cleanup_on_failure() is only invoked for zipfile.BadZipFile, TruncatedDataError, and UnfinishedIterationError; consider broadening the exception handling (or using a try/except Exception around the whole streaming loop) so that partial extractions are also cleaned up on network or unexpected errors.
  • The _get_file_stream implementations in the different backends all hard-code the same chunk_size = 64 * 1024; you might want to centralize this value (e.g., as a constant on the base class) to avoid magic numbers and keep behavior consistent if you ever want to tune it.
  • The three backend-specific test_size tests are nearly identical; consider refactoring them into a shared parametrized test helper to reduce duplication and make it easier to add future backends.
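
The first point, cleaning up partial extractions on any error, could be sketched as follows. This is a simplified illustration, not the implementation in the pull request: `extract_entries()` is a hypothetical function that takes `(file_name, chunks)` pairs instead of the output of `stream_unzip`, and it does not guard against unsafe file names.

```python
import os
from collections.abc import Iterable


def extract_entries(
    entries: Iterable[tuple[str, Iterable[bytes]]],
    dst_root: str,
) -> list[str]:
    """Write (file_name, chunks) entries below dst_root.

    On any error, remove the files written so far and re-raise.
    """
    extracted_files: list[str] = []
    try:
        for file_name, chunks in entries:
            dst_path = os.path.join(dst_root, file_name)
            os.makedirs(os.path.dirname(dst_path), exist_ok=True)
            # Register before writing, so a partial file is cleaned up too
            extracted_files.append(file_name)
            with open(dst_path, "wb") as fp:
                for chunk in chunks:
                    fp.write(chunk)
    except Exception:
        # Covers network errors raised while iterating the chunks as well
        for file_name in extracted_files:
            path = os.path.join(dst_root, file_name)
            if os.path.exists(path):
                os.remove(path)
        raise
    return extracted_files
```

Catching `Exception` and re-raising keeps the original error visible to the caller while still leaving files that existed before the call untouched.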

## Individual Comments

### Comment 1
<location> `audbackend/core/backend/artifactory.py:282` </location>
<code_context>
+        src_path = self.path(src_path)
+        chunk_size = 64 * 1024  # 64 KB
+
+        with src_path.open() as fp:
+            while data := fp.read(chunk_size):
+                yield data
</code_context>

<issue_to_address>
**issue (bug_risk):** Open the Artifactory file in binary mode to ensure bytes are yielded for hashing and streaming.

`src_path.open()` defaults to text mode, so `_get_file_stream` yields `str`. `_get_archive_streaming` passes these chunks to `hashlib.md5.update`, which requires `bytes`, leading to a type error or unintended encoding. Opening with `src_path.open("rb")` ensures raw bytes are yielded, matching the expectations of the hashing logic and other backends.
</issue_to_address>
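
The difference between text and binary mode can be reproduced with an ordinary local file. This is only an illustration using Python's built-in `open()` as a stand-in; it assumes nothing about how `ArtifactoryPath.open()` behaves beyond what the comment above states.

```python
import hashlib
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.zip")
    with open(path, "wb") as fp:
        fp.write(b"PK\x05\x06" + bytes(18))  # bytes of an empty ZIP archive

    # Text mode yields str, which hashlib rejects
    with open(path, encoding="latin-1") as fp:
        text_chunk = fp.read()
    try:
        hashlib.md5().update(text_chunk)
    except TypeError as ex:
        print(f"text mode fails: {ex}")

    # Binary mode yields bytes, which can be hashed directly
    with open(path, "rb") as fp:
        binary_chunk = fp.read()
    print(hashlib.md5(binary_chunk).hexdigest())
```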

### Comment 2
<location> `audbackend/core/backend/minio.py:408` </location>
<code_context>
         src_path = self.path(src_path)
         _download(src_path, dst_path, verbose=verbose)

+    def _get_file_stream(
+        self,
+        src_path: str,
</code_context>

<issue_to_address>
**issue (bug_risk):** Implement `_size` for the MinIO backend or avoid calling it to prevent `NotImplementedError` when using streaming with `verbose=True`.

In `BaseBackend.get_archive`, the streaming path calls `self._size(src_path)` when `verbose=True`, but this backend only adds `_get_file_stream` and still inherits the base `_size` that raises `NotImplementedError`. As a result, streaming from MinIO with `verbose=True` will crash. Please either implement `_size` for MinIO (as in the other backends) or avoid calling `_size` when the backend doesn’t support it and use an indeterminate progress bar instead.
</issue_to_address>

### Comment 3
<location> `tests/test_backend_filesystem.py:195-204` </location>
<code_context>
     assert interface.exists(dst_file, version)
+
+
+@pytest.mark.parametrize(
+    "interface",
+    [(audbackend.backend.Artifactory, audbackend.interface.Versioned)],
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test for the case where `dst_root` is an existing file to ensure `NotADirectoryError` is raised and no partial extraction occurs.

Currently, streaming tests only cover existing directories and cleanup behavior. Please also cover this error path by creating a regular file at `dst_root`, calling `get_archive(..., dst_root=that_file)` with a valid ZIP, asserting `NotADirectoryError`, and verifying that the file is unchanged and no additional files are created nearby.
</issue_to_address>

### Comment 4
<location> `audbackend/core/backend/base.py:528` </location>
<code_context>

         src_path = utils.check_path(src_path)

+        # Validate tmp_root if specified
+        # (use TemporaryDirectory to get consistent error format)
+        if tmp_root is not None:
</code_context>

<issue_to_address>
**issue (complexity):** Consider refactoring `get_archive()` and `_get_archive_streaming()` into smaller helper methods so the main code paths read as simple, linear dispatch and extraction logic.

You can keep all behavior and noticeably reduce complexity by:

---

### 1. Make `get_archive()` a thin dispatcher

Right now `get_archive()` mixes “decide strategy” and “do work”. You can pull the tempfile-based path into a helper so the main method just selects a strategy:

```python
def get_archive(
    self,
    src_path: str,
    dst_root: str,
    *,
    tmp_root: str = None,
    validate: bool = False,
    verbose: bool = False,
) -> list[str]:
    if not self.opened:
        raise RuntimeError(backend_not_opened_error)

    src_path = utils.check_path(src_path)

    # Validate tmp_root if specified (use TemporaryDirectory to get consistent error format)
    if tmp_root is not None:
        with tempfile.TemporaryDirectory(dir=tmp_root):
            pass

    if src_path.lower().endswith(".zip") and STREAM_UNZIP_AVAILABLE:
        return self._get_archive_streaming(
            src_path,
            dst_root,
            validate=validate,
            verbose=verbose,
        )

    return self._get_archive_via_tempfile(
        src_path,
        dst_root,
        tmp_root=tmp_root,
        validate=validate,
        verbose=verbose,
    )

def _get_archive_via_tempfile(
    self,
    src_path: str,
    dst_root: str,
    *,
    tmp_root: str | None,
    validate: bool,
    verbose: bool,
) -> list[str]:
    with tempfile.TemporaryDirectory(dir=tmp_root) as tmp:
        tmp_dir = audeer.path(tmp, os.path.basename(dst_root))
        local_archive = os.path.join(
            tmp_dir,
            os.path.basename(src_path),
        )
        self.get_file(
            src_path,
            local_archive,
            validate=validate,
            verbose=verbose,
        )
        return audeer.extract_archive(
            local_archive,
            dst_root,
            verbose=verbose,
        )
```

This moves all non-streaming logic out of `get_archive()`, making the top-level behavior easier to scan.

---

### 2. Extract checksum + progress handling from `_get_archive_streaming()`

The `stream_with_hash()` inner function mixes three concerns (read, hash, progress). You can move that into a reusable helper, which shrinks `_get_archive_streaming()` and makes the checksum logic easier to test in isolation:

```python
def _stream_with_md5_and_progress(
    self,
    src_path: str,
    *,
    validate: bool,
    verbose: bool,
) -> tuple[Iterator[bytes], hashlib._Hash | None]:
    md5_hash = hashlib.md5() if validate else None
    src_size = self._size(src_path) if verbose else None

    desc = audeer.format_display_message(
        f"Download {os.path.basename(src_path)}",
        pbar=verbose,
    )
    pbar = audeer.progress_bar(total=src_size, desc=desc, disable=not verbose)

    def iterator() -> Iterator[bytes]:
        with pbar:
            for chunk in self._get_file_stream(src_path):
                if md5_hash is not None:
                    md5_hash.update(chunk)
                pbar.update(len(chunk))
                yield chunk

    return iterator(), md5_hash
```

Then `_get_archive_streaming()` becomes more linear:

```python
def _get_archive_streaming(
    self,
    src_path: str,
    dst_root: str,
    *,
    validate: bool = False,
    verbose: bool = False,
) -> list[str]:
    if os.path.exists(dst_root) and not os.path.isdir(dst_root):
        raise NotADirectoryError(errno.ENOTDIR, os.strerror(errno.ENOTDIR), dst_root)

    dst_root_existed = os.path.exists(dst_root)
    audeer.mkdir(dst_root)

    extracted_files: list[str] = []

    stream, md5_hash = self._stream_with_md5_and_progress(
        src_path,
        validate=validate,
        verbose=verbose,
    )

    try:
        for file_name, file_size, unzipped_chunks in stream_unzip(stream):
            # ... existing extraction logic ...
            # (decode name, mkdir, write file, append to extracted_files)
            ...
        if validate:
            expected_checksum = self.checksum(src_path)
            actual_checksum = md5_hash.hexdigest()
            if actual_checksum != expected_checksum:
                self._cleanup_extracted(dst_root, dst_root_existed, extracted_files)
                raise InterruptedError(
                    f"Execution is interrupted because {src_path} has checksum "
                    f"'{actual_checksum}' when the expected checksum is "
                    f"'{expected_checksum}'. The extracted files have been removed."
                )
    except (zipfile.BadZipFile, TruncatedDataError, UnfinishedIterationError) as ex:
        self._cleanup_extracted(dst_root, dst_root_existed, extracted_files)
        raise RuntimeError(f"Broken archive: {src_path}") from ex

    return extracted_files
```

---

### 3. Replace the closure `cleanup_on_failure()` with a small helper

The current inner function captures `dst_root_existed` and `extracted_files`. Making it a method clarifies responsibilities and avoids the closure:

```python
def _cleanup_extracted(
    self,
    dst_root: str,
    dst_root_existed: bool,
    extracted_files: Sequence[str],
) -> None:
    if not dst_root_existed and os.path.exists(dst_root):
        shutil.rmtree(dst_root)
    else:
        for file_name in extracted_files:
            full_path = audeer.path(dst_root, file_name)
            if os.path.exists(full_path):
                os.remove(full_path)
```

Then just call:

```python
self._cleanup_extracted(dst_root, dst_root_existed, extracted_files)
```

where you currently call `cleanup_on_failure()`.

---

These small extractions keep all semantics (including checksum behavior, progress reporting, and cleanup) but significantly flatten `_get_archive_streaming()` and make `get_archive()` easier to reason about.
</issue_to_address>

@hagenw hagenw requested a review from frankenjoe January 15, 2026 10:00
@frankenjoe
Collaborator

frankenjoe commented Jan 15, 2026

  • could be further improved by not storing the model files as ZIP files, as they are already quite compressed.

I think we once discussed the idea of using ZIP without compression in that case.

@hagenw
Member Author

hagenw commented Jan 15, 2026

  • could be further improved by not storing the model files as ZIP files, as they are already quite compressed.

I think we once discussed the idea of using ZIP without compression in that case.

Good idea, that would eliminate the need for tracking the filename and we can just download ZIP files. The only downside is that for the old files it will still be faster using streaming ZIP extraction, but for the new ones it will be faster using get_file(num_workers=). So we would need to add a flag on the backend (e.g. in the header file) what kind of ZIP it is.

@frankenjoe
Collaborator

So we would need to add a flag on the backend (e.g. in the header file) what kind of ZIP it is.

Ok, let's say we somehow know if a file was zipped with or without compression. In case no compression was used, then we don't want to use streaming ZIP extraction and it can make sense to use multiple workers instead. So maybe it makes sense we keep the argument?

@hagenw
Member Author

hagenw commented Jan 15, 2026

So we would need to add a flag on the backend (e.g. in the header file) what kind of ZIP it is.

Ok, let's say we somehow know if a file was zipped with or without compression. In case no compression was used, then we don't want to use streaming ZIP extraction and it can make sense to use multiple workers instead. So maybe it makes sense we keep the argument?

There are two options:

  • We integrate the different handling for uncompressed and compressed ZIP files directly in audbackend. Then get_archive() would check whether the file is compressed and call get_file() with or without num_workers accordingly. The downside is that we would always need to first download the end part of the ZIP file, where its metadata is stored, in order to check whether it was compressed
  • We integrate the different handling in audmodel and store inside the header whether the file is compressed (the header is always downloaded first, as it handles decoding the UID to a path). Then we can call get_file(num_workers=) from inside audmodel and don't need a num_workers argument for get_archive()

At the moment, I'm more in favor of the second approach.
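
How a compression check could look for a ZIP file that is already fully available can be sketched with Python's standard `zipfile` module. This is an illustration only, not part of the pull request: `is_compressed()` and `make_zip()` are hypothetical helpers. `zipfile` reads the member metadata from the central directory at the end of the archive, which is why the end part of the file would have to be downloaded first.

```python
import io
import zipfile


def is_compressed(archive) -> bool:
    """Return True if any member of the ZIP archive is compressed."""
    with zipfile.ZipFile(archive) as zf:
        return any(
            info.compress_type != zipfile.ZIP_STORED for info in zf.infolist()
        )


def make_zip(compression: int) -> io.BytesIO:
    """Create a small in-memory ZIP archive with the given compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=compression) as zf:
        zf.writestr("model.txt", "weights " * 1000)
    buffer.seek(0)
    return buffer


print(is_compressed(make_zip(zipfile.ZIP_DEFLATED)))  # True
print(is_compressed(make_zip(zipfile.ZIP_STORED)))  # False
```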


But there is another reason to maybe stay with num_workers. If we do not use streaming ZIP extraction, e.g. if we have a TAR.GZ archive or Python 3.14, we still call get_file(), and in that case we can simply use num_workers there. For streaming ZIP files, we could then simply ignore num_workers.

@frankenjoe
Collaborator

I see a third option: if num_workers is set to 1 we try to stream, otherwise we keep the old behavior.

@hagenw
Member Author

hagenw commented Jan 16, 2026

The problem is that you will always be slower with num_workers.

When loading the 4.2GB model 7289b57d-1.0.0 with streaming ZIP we get

| Before | After |
|---|---|
| 0:02:21.248 | 0:01:20.057 |

when loading without streaming and using num_workers in get_file() we get (compare audeering/audmodel#35)

| num_workers | num_iter | elapsed(avg) | elapsed(std) |
|---|---|---|---|
| 1 | 10 | 0:02:23.903513 | 0:00:05.744711 |
| 2 | 10 | 0:02:14.147027 | 0:00:07.193578 |
| 3 | 10 | 0:02:09.759947 | 0:00:00.829281 |
| 4 | 10 | 0:02:10.224645 | 0:00:00.940978 |
| 5 | 10 | 0:02:12.284520 | 0:00:03.940332 |
| 10 | 10 | 0:02:11.610993 | 0:00:01.676725 |

Which means that in most cases I would not recommend using num_workers instead of streaming.

I would vote for one of the following solutions:

  • use num_workers only when streaming ZIP extraction is not possible
  • not add num_workers

@frankenjoe
Collaborator

frankenjoe commented Jan 16, 2026

The problem is that you will always be slower with num_workers.

Ok, maybe I misunderstood, but I thought for TAR.GZ or uncompressed ZIP files using multiple workers is faster.

If this is true, then num_workers makes it possible to control whether streaming ZIP extraction is desired. A package that has this information can then set it accordingly for best performance.

@hagenw
Member Author

hagenw commented Jan 16, 2026

package that has this information can then set it accordingly for best performance.

Yes, but the package can always call get_file(num_workers=) and handle the extraction itself (e.g. with audeer.extract_archive()) instead of calling get_archive(). We don't need to add num_workers to get_archive() for this case.

@frankenjoe
Collaborator

frankenjoe commented Jan 16, 2026

Yes, but the package can always call get_file(num_workers=) and handle the extraction itself (e.g. with audeer.extract_archive()) instead of calling get_archive(). We don't need to add num_workers to get_archive() for this case.

Yes, but that's more complicated than just doing:

get_archive(..., num_workers=1 if streamable else num_workers)

As long as there is a use-case for using num_workers (TAR.GZ or uncompressed ZIP files) I would not remove it.

@hagenw
Member Author

hagenw commented Jan 16, 2026

Yes, but the package can always call get_file(num_workers=) and handle the extraction itself (e.g. with audeer.extract_archive()) instead of calling get_archive(). We don't need to add num_workers to get_archive() for this case.

Yes, but that's more complicated than just doing:

get_archive(..., num_workers=1 if streamable else num_workers)

As long as there is a use-case for using num_workers (TAR.GZ or uncompressed ZIP files) I would not remove it.

I see the point, but there is one big caveat. As a normal user, I would expect that whenever num_workers is available in a method, using num_workers=6 will be faster than using num_workers=1. But this is not the case here for most of the archives. Hence, by not offering num_workers we discourage its use. On the other hand, it is only used by developers anyway ;)

@frankenjoe
Collaborator

But there is also the caveat that users continue using get_archive() for all archives instead of using the (not so obvious) combination of get_file() and audeer.extract_archive() when streaming is not supported :)

@hagenw
Member Author

hagenw commented Jan 20, 2026

I re-added num_workers.


I also started to benchmark downloading an uncompressed ZIP file with multiple workers, but encountered an error with it. So maybe, we still have an issue in get_file() when using multiple workers. I created #284 to have a look at it afterwards. With streaming ZIP extraction I had no error so far.

@hagenw
Member Author

hagenw commented Jan 21, 2026

I repeated my benchmarks of downloading a ~4.2 GB file as compressed vs. uncompressed ZIP (with the fix introduced in #285).

Execution time as h:mm:ss (average over 10 runs). Only for num_workers=1 we use streaming ZIP extraction.

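| num_workers | compressed | execution time (h:mm:ss) |
|---|---|---|
| 1 | yes | 0:01:26.081266 |
| 1 | no | 0:01:25.839748 |
| 2 | yes | 0:01:56.109490 |
| 2 | no | 0:01:05.871747 |
| 5 | yes | 0:01:49.716402 |
| 5 | no | 0:00:57.485189 |
| 10 | yes | 0:01:51.970323 |
| 10 | no | 0:00:58.532454 |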

Results are as expected:

  • For a compressed ZIP file, streaming download with num_workers=1 is the fastest
  • For an uncompressed ZIP we can be faster when using more workers (but there seems to be an upper limit)
  • Both methods are faster than not using streaming download or multiple workers (which was around 0:02:18.284, see description of this pull request)

@frankenjoe
Collaborator

Ok, so our recommendations would be:

  1. for files that can be compressed use compressed ZIP with streaming (i.e. single worker)
  2. for all other files use uncompressed ZIP with ~5 workers

I guess we also need to add an argument to audeer.create_archive() that allows creating uncompressed archives.
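
What creating an uncompressed archive means can be shown with Python's standard `zipfile` module. This sketch does not use audeer, and it does not anticipate the argument that audeer.create_archive() might get; `create_uncompressed_zip()` is a hypothetical helper.

```python
import os
import tempfile
import zipfile


def create_uncompressed_zip(archive: str, root: str, files: list[str]) -> None:
    """Store files (relative to root) in a ZIP archive without compression."""
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
        for file in files:
            zf.write(os.path.join(root, file), arcname=file)


with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "model.bin"), "wb") as fp:
        fp.write(os.urandom(1024))
    archive = os.path.join(tmp, "model.zip")
    create_uncompressed_zip(archive, tmp, ["model.bin"])
    with zipfile.ZipFile(archive) as zf:
        info = zf.getinfo("model.bin")
        # Stored members occupy exactly their original size in the archive
        stored_uncompressed = info.compress_size == info.file_size
```

`ZIP_STORED` skips compression entirely, so extraction reduces to copying bytes, which fits files that are already compressed.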

@hagenw
Member Author

hagenw commented Jan 21, 2026

Ok, so our recommendations would be:

  1. for files that can be compressed use compressed ZIP with streaming (i.e. single worker)
  2. for all other files use uncompressed ZIP with ~5 workers

Yes.

I extended the docstring here to reflect this:

[Image: screenshot of the extended docstring]

I guess we also need to add an argument to audeer.create_archive() that allows creating uncompressed archives.

I created audeering/audeer#188

@hagenw hagenw merged commit 6edfbee into main Feb 2, 2026
19 of 20 checks passed
@hagenw hagenw deleted the stream-extract branch February 2, 2026 15:45