Extract ZIP archives while downloading by hagenw · Pull Request #279 · audeering/audbackend

hagenw · 2026-01-08T12:22:35Z

Summary

Speed up get_archive() by extracting a ZIP file while downloading it.
~~Remove num_workers from get_archive() (introduced in Use workers for file download #271) for required sequential processing of the file chunks.~~ When num_workers > 1 we do not use streaming extraction, but first download the file with multiple workers to a temp file and extract it afterwards using a single worker.

Extraction of archives cannot be sped up by using multiple workers (compare audeering/audeer#186), hence we speed it up indirectly here. Extracting ZIP files in a streaming fashion works for all backends.

Details

- TAR.GZ archives are not affected by the changes and are still first downloaded and then extracted.
- The biggest downside is that we need additional external dependencies for the implementation, as we use stream-unzip, which depends on the two self-contained packages pycryptodome and stream-inflate.
- stream-unzip does not yet support Python 3.14; there we fall back to the old behavior.
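
The Python 3.14 fallback can be pictured as a guarded import plus a small dispatch check. The following is a minimal sketch, not the code from this pull request: `STREAM_UNZIP_AVAILABLE` is the name used in the reviewer's guide, `use_streaming()` is a hypothetical helper, and the `num_workers == 1` condition follows the summary above.

```python
try:
    # Optional third-party dependency, not installable on every Python version
    from stream_unzip import stream_unzip

    STREAM_UNZIP_AVAILABLE = True
except ImportError:
    stream_unzip = None
    STREAM_UNZIP_AVAILABLE = False


def use_streaming(src_path: str, num_workers: int = 1) -> bool:
    """Decide whether an archive should be extracted while downloading.

    Hypothetical helper: streaming is only chosen for ZIP files,
    with a single worker, and when stream-unzip could be imported.
    """
    return (
        STREAM_UNZIP_AVAILABLE
        and num_workers == 1
        and src_path.lower().endswith(".zip")
    )
```

Every other case (TAR.GZ, several workers, missing dependency) would take the download-then-extract path.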

~~Removing num_workers from get_archive() is not a breaking change as we yanked version 2.3.0 of audbackend.~~

The pull request also adds two new private methods that every backend has to implement:

audbackend.backend.Base._get_file_stream()
audbackend.backend.Base._size() (for progress bar during streaming ZIP extraction)
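
To illustrate what these two methods have to provide, here is a minimal sketch for a local file system, written as standalone functions. The function names and the module-level `CHUNK_SIZE` constant are assumptions for illustration; the 64 KB chunk size and the use of `os.path.getsize()` are mentioned in the review further down.

```python
import os
from collections.abc import Iterator

CHUNK_SIZE = 64 * 1024  # 64 KB, hypothetical shared constant


def get_file_stream(path: str, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield the content of a file as a sequence of byte chunks."""
    # Binary mode is required so that bytes (not str) are yielded
    with open(path, "rb") as fp:
        while data := fp.read(chunk_size):
            yield data


def size(path: str) -> int:
    """Return the file size in bytes, e.g. as total for a progress bar."""
    return os.path.getsize(path)
```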

Benchmarks

Benchmark results averaged over 10 runs using a single CPU thread, for Minio.get_archive() on the file /alm/audeering-omni/stage1_2/torch/7289b57d.zip (4.2 GB).

| Before | After |
| --- | --- |
| 0:02:18.284 | 0:01:18.700 |

Benchmark with audmodel for 7289b57d-1.0.0 (compare audeering/audmodel#38)

| Before | After |
| --- | --- |
| 0:02:21.248 | 0:01:20.057 |

And using audb to load emodb version 2.0.0 (execution time as h:mm:ss)

| num_workers | Before | After |
| --- | --- | --- |
| 1 | 0:03:43.810 | 0:01:30.780 |
| 6 | 0:00:40.120 | 0:00:24.970 |
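
To compare the benchmarks on one scale, the timings can be converted to seconds and expressed as speedup factors. This is a small worked calculation on the numbers reported above, assuming they are formatted as `h:mm:ss.fff`; the labels and the `to_seconds()` helper are chosen here for illustration.

```python
def to_seconds(timestamp: str) -> float:
    """Convert an ``h:mm:ss.fff`` string into seconds."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# (label, before, after), copied from the benchmark tables
benchmarks = [
    ("Minio.get_archive()", "0:02:18.284", "0:01:18.700"),
    ("audmodel 7289b57d-1.0.0", "0:02:21.248", "0:01:20.057"),
    ("audb emodb, num_workers=1", "0:03:43.810", "0:01:30.780"),
    ("audb emodb, num_workers=6", "0:00:40.120", "0:00:24.970"),
]

# Speedup factor = time before / time after
speedups = {
    label: to_seconds(before) / to_seconds(after)
    for label, before, after in benchmarks
}

for label, speedup in speedups.items():
    print(f"{label}: {speedup:.2f}x faster")
```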

Discussion

- The current approach automatically speeds up all applications that use audbackend with ZIP files (audb, audmodel).
- audmodel could be further improved by not storing the model files as ZIP files, as they are already quite compressed. Then we could download the big model file with get_file() using several workers and the remaining model metadata as a ZIP file. But this would of course require updating how audmodel.publish() stores the files in the first place.

sourcery-ai · 2026-01-08T12:22:40Z

Reviewer's Guide

Implement streaming ZIP extraction in backend.get_archive(), introducing backend streaming/file-size interfaces and cleanup semantics, while removing num_workers support for archives and wiring stream-unzip as an optional dependency.

Sequence diagram for streaming ZIP extraction in get_archive

sequenceDiagram
    participant Caller
    participant Backend as Backend
    participant StreamUnzip as stream_unzip
    participant FS as FileSystem

    Caller->>Backend: get_archive(src_path, dst_root, validate, verbose)
    Backend->>Backend: check_path(src_path)
    Backend->>Backend: detect .zip and STREAM_UNZIP_AVAILABLE
    alt ZIP with streaming
        Backend->>Backend: _get_archive_streaming(src_path, dst_root, validate, verbose)
        Backend->>Backend: _size(src_path) [if verbose]
        Backend->>Backend: create progress_bar
        Backend->>Backend: stream_with_hash()
        loop download chunks
            Backend->>Backend: _get_file_stream(src_path)
            Backend-->>Backend: chunk bytes
            Backend->>Backend: update md5_hash
            Backend->>Backend: pbar.update(len(chunk))
        end
        Backend->>StreamUnzip: stream_unzip(stream_with_hash())
        loop for each entry in archive
            StreamUnzip-->>Backend: file_name, file_size, unzipped_chunks
            Backend->>Backend: skip directories
            Backend->>FS: mkdir(parent(dst_path))
            loop write unzipped_chunks
                Backend->>FS: write chunk to dst_path
            end
            Backend->>Backend: record extracted_files
        end
        alt validate checksum
            Backend->>Backend: checksum(src_path)
            Backend->>Backend: compare expected vs actual
            opt mismatch
                Backend->>Backend: cleanup_on_failure()
                Backend-->>Caller: raise InterruptedError
            end
        end
        Backend-->>Caller: return extracted_files
    else other archives or ZIP without streaming
        Backend->>Backend: create TemporaryDirectory(tmp_root)
        Backend->>Backend: get_file(src_path, local_archive, validate, verbose)
        Backend->>Backend: audeer.extract_archive(local_archive, dst_root, validate, verbose)
        Backend-->>Caller: return extracted_files
    end
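
The inner extraction loop of the diagram can be sketched without any backend. The function below is an illustration, not the code of this pull request: it assumes entries arrive as `(file_name, file_size, unzipped_chunks)` tuples with `file_name` as bytes, as in the diagram, and `extract_entries()` is a hypothetical name. In the real flow those tuples would come from `stream_unzip()`; here any iterable works.

```python
import os
from collections.abc import Iterable


def extract_entries(entries: Iterable[tuple], dst_root: str) -> list[str]:
    """Write streamed archive entries below dst_root.

    Each entry is (file_name: bytes, file_size, chunks: Iterable[bytes]).
    Returns the names of the extracted files, relative to dst_root.
    """
    extracted_files = []
    for raw_name, _file_size, chunks in entries:
        name = raw_name.decode("utf-8")
        if name.endswith("/"):
            # Directory entry: consume its (empty) chunks and skip it
            for _ in chunks:
                pass
            continue
        dst_path = os.path.join(dst_root, *name.split("/"))
        os.makedirs(os.path.dirname(dst_path), exist_ok=True)
        with open(dst_path, "wb") as fp:
            # Chunks have to be consumed in order, one entry at a time
            for chunk in chunks:
                fp.write(chunk)
        extracted_files.append(name)
    return extracted_files
```

A production version would also need to reject entry names that escape `dst_root` (for example `../x`); this sketch does not.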

Class diagram for backend streaming interfaces and implementations

classDiagram
    class Backend {
        +get_archive(src_path str, dst_root str, tmp_root str, validate bool, verbose bool) list~str~
        -_get_archive_streaming(src_path str, dst_root str, validate bool, verbose bool) list~str~
        -_get_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
        +get_file(src_path str, dst_path str, tmp_root str, num_workers int, validate bool, verbose bool)
    }

    class ArtifactoryBackend {
        -_get_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
    }

    class FilesystemBackend {
        -_get_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
    }

    class MinioBackend {
        -_download_file(src_path str, dst_path str, verbose bool)
        -_get_file_stream(src_path str) Iterator~bytes~
        -_size(path str) int
    }

    Backend <|-- ArtifactoryBackend
    Backend <|-- FilesystemBackend
    Backend <|-- MinioBackend

    class stream_unzip {
        +stream_unzip(source Iterator~bytes~) Iterator
    }

    Backend ..> stream_unzip : uses for ZIP streaming

File-Level Changes

| Change | Details | Files |
| --- | --- | --- |
| Add streaming ZIP extraction path in BaseBackend.get_archive() using stream-unzip, with checksum validation and robust cleanup behavior. | Extend get_archive() to choose between streaming ZIP extraction and the existing download-then-extract path based on file extension and stream-unzip availability. Introduce _get_archive_streaming() to stream ZIP data from backends, feed it into stream_unzip, write files incrementally, and track extracted paths. Compute optional MD5 on the download stream for validate=True and compare with backend.checksum(), cleaning up extracted files and raising InterruptedError on mismatch. Improve error handling by mapping ZIP/streaming errors to a RuntimeError("Broken archive: ...") and ensuring partial extracts are removed while preserving pre-existing directories. Validate tmp_root via TemporaryDirectory to surface consistent errors and adjust get_archive() docstring to describe new behavior and num_workers removal. | `audbackend/core/backend/base.py` |
| Introduce streaming and size primitives to backend interfaces and implement them for supported backends. | Add abstract _get_file_stream() and _size() methods to the base backend interface for chunked access and file-size queries. Implement _get_file_stream() and _size() for FileSystem backend using local file I/O and os.path.getsize(). Implement _get_file_stream() and _size() for Artifactory backend using ArtifactoryPath streaming and stat(). Implement _get_file_stream() for Minio backend using MinioClient.get_object() with manual chunked reads and proper resource cleanup. Extend the single-folder test backend with a _get_file_stream() implementation for use in tests. | `audbackend/core/backend/base.py` `audbackend/core/backend/filesystem.py` `audbackend/core/backend/artifactory.py` `audbackend/core/backend/minio.py` `tests/singlefolder.py` |
| Add backend tests for _size() and streaming-extraction cleanup/behavior. | Add backend-specific tests verifying that _size() returns the correct size for uploaded files on FileSystem, Artifactory, and Minio backends. Add tests exercising streaming extraction cleanup when extracting malformed ZIPs into existing directories, ensuring only newly extracted files are removed. Add tests validating that failed checksum validation after streaming extraction removes only extracted files while preserving existing content and destination directories. Add tests confirming that streaming ZIP extraction correctly skips directory entries while extracting nested files, and that returned paths match extracted content. | `tests/test_backend_filesystem.py` `tests/test_backend_artifactory.py` `tests/test_backend_minio.py` |
| Declare stream-unzip as a conditional runtime dependency for supported Python versions. | Add stream-unzip >=0.0.99 as a dependency in pyproject.toml, gated to python_version < "3.14" to handle missing support on newer Python versions. | `pyproject.toml` |

codecov · 2026-01-14T14:40:32Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 100.0%. Comparing base (47afb2b) to head (fd6224d).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| audbackend/core/backend/artifactory.py | `100.0% <100.0%> (ø)` |
| audbackend/core/backend/base.py | `100.0% <100.0%> (ø)` |
| audbackend/core/backend/filesystem.py | `100.0% <100.0%> (ø)` |
| audbackend/core/backend/minio.py | `100.0% <100.0%> (ø)` |
| audbackend/core/interface/unversioned.py | `100.0% <ø> (ø)` |
| audbackend/core/interface/versioned.py | `100.0% <ø> (ø)` |

sourcery-ai

Hey - I've found 4 issues, and left some high level feedback:

- In `_get_archive_streaming`, `cleanup_on_failure()` is only invoked for `zipfile.BadZipFile`, `TruncatedDataError`, and `UnfinishedIterationError`; consider broadening the exception handling (or using a `try`/`except Exception` around the whole streaming loop) so that partial extractions are also cleaned up on network or unexpected errors.
- The `_get_file_stream` implementations in the different backends all hard-code the same `chunk_size = 64 * 1024`; you might want to centralize this value (e.g., as a constant on the base class) to avoid magic numbers and keep behavior consistent if you ever want to tune it.
- The three backend-specific `test_size` tests are nearly identical; consider refactoring them into a shared parametrized test helper to reduce duplication and make it easier to add future backends.
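
The first point (cleaning up on any error, not only on ZIP errors) can be sketched as follows. This is a simplified illustration under stated assumptions, not the pull request's implementation: `extract_all_or_nothing()` is a hypothetical name, and files are given as `(relative_name, chunks)` pairs instead of coming from `stream_unzip()`.

```python
import os
import shutil


def extract_all_or_nothing(files, dst_root: str) -> list[str]:
    """Write files below dst_root and undo everything on any error.

    files is an iterable of (relative_name, iterable_of_bytes).
    """
    dst_root_existed = os.path.isdir(dst_root)
    os.makedirs(dst_root, exist_ok=True)
    written = []
    try:
        for name, chunks in files:
            path = os.path.join(dst_root, name)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            # Record the path first, so a partially written file is removed too
            written.append(path)
            with open(path, "wb") as fp:
                for chunk in chunks:
                    fp.write(chunk)
    except Exception:
        # Broad on purpose: also covers network errors raised by the source
        if not dst_root_existed:
            shutil.rmtree(dst_root, ignore_errors=True)
        else:
            # Keep pre-existing content, remove only what this call wrote
            for path in written:
                if os.path.exists(path):
                    os.remove(path)
        raise
    return written
```

Note that a file which existed before and was overwritten by the extraction is still lost in this sketch; only untouched pre-existing content is preserved.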

## Individual Comments

### Comment 1
<location> `audbackend/core/backend/artifactory.py:282` </location>
<code_context>
+        src_path = self.path(src_path)
+        chunk_size = 64 * 1024  # 64 KB
+
+        with src_path.open() as fp:
+            while data := fp.read(chunk_size):
+                yield data
</code_context>

<issue_to_address>
**issue (bug_risk):** Open the Artifactory file in binary mode to ensure bytes are yielded for hashing and streaming.

`src_path.open()` defaults to text mode, so `_get_file_stream` yields `str`. `_get_archive_streaming` passes these chunks to `hashlib.md5.update`, which requires `bytes`, leading to a type error or unintended encoding. Opening with `src_path.open("rb")` ensures raw bytes are yielded, matching the expectations of the hashing logic and other backends.
</issue_to_address>
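
The type mismatch described in this comment can be reproduced with the built-in `open()`. This is only an illustration under that assumption; whether `ArtifactoryPath.open()` behaves identically is the reviewer's claim and is not verified here.

```python
import hashlib
import os
import tempfile

payload = b"PK\x03\x04 example payload"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "archive.bin")
    with open(path, "wb") as fp:
        fp.write(payload)

    # Binary mode yields bytes, which hashlib accepts
    with open(path, "rb") as fp:
        binary_chunk = fp.read()

    # Text mode yields str (latin-1 is used so that decoding cannot fail)
    with open(path, "r", encoding="latin-1") as fp:
        text_chunk = fp.read()

md5 = hashlib.md5()
md5.update(binary_chunk)

try:
    hashlib.md5().update(text_chunk)
    text_mode_error = None
except TypeError as ex:
    # str chunks are rejected, they would have to be encoded first
    text_mode_error = ex
```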

### Comment 2
<location> `audbackend/core/backend/minio.py:408` </location>
<code_context>
         src_path = self.path(src_path)
         _download(src_path, dst_path, verbose=verbose)

+    def _get_file_stream(
+        self,
+        src_path: str,
</code_context>

<issue_to_address>
**issue (bug_risk):** Implement `_size` for the MinIO backend or avoid calling it to prevent `NotImplementedError` when using streaming with `verbose=True`.

In `BaseBackend.get_archive`, the streaming path calls `self._size(src_path)` when `verbose=True`, but this backend only adds `_get_file_stream` and still inherits the base `_size` that raises `NotImplementedError`. As a result, streaming from MinIO with `verbose=True` will crash. Please either implement `_size` for MinIO (as in the other backends) or avoid calling `_size` when the backend doesn’t support it and use an indeterminate progress bar instead.
</issue_to_address>

### Comment 3
<location> `tests/test_backend_filesystem.py:195-204` </location>
<code_context>
     assert interface.exists(dst_file, version)
+
+
+@pytest.mark.parametrize(
+    "interface",
+    [(audbackend.backend.Artifactory, audbackend.interface.Versioned)],
</code_context>

<issue_to_address>
**suggestion (testing):** Add a test for the case where `dst_root` is an existing file to ensure `NotADirectoryError` is raised and no partial extraction occurs.

Currently, streaming tests only cover existing directories and cleanup behavior. Please also cover this error path by creating a regular file at `dst_root`, calling `get_archive(..., dst_root=that_file)` with a valid ZIP, asserting `NotADirectoryError`, and verifying that the file is unchanged and no additional files are created nearby.
</issue_to_address>

### Comment 4
<location> `audbackend/core/backend/base.py:528` </location>
<code_context>

         src_path = utils.check_path(src_path)

+        # Validate tmp_root if specified
+        # (use TemporaryDirectory to get consistent error format)
+        if tmp_root is not None:
</code_context>

<issue_to_address>
**issue (complexity):** Consider refactoring `get_archive()` and `_get_archive_streaming()` into smaller helper methods so the main code paths read as simple, linear dispatch and extraction logic.

You can keep all behavior and noticeably reduce complexity by:

---

### 1. Make `get_archive()` a thin dispatcher

Right now `get_archive()` mixes “decide strategy” and “do work”. You can pull the tempfile-based path into a helper so the main method just selects a strategy:

```python
def get_archive(
    self,
    src_path: str,
    dst_root: str,
    *,
    tmp_root: str = None,
    validate: bool = False,
    verbose: bool = False,
) -> list[str]:
    if not self.opened:
        raise RuntimeError(backend_not_opened_error)

    src_path = utils.check_path(src_path)

    # Validate tmp_root if specified (use TemporaryDirectory to get consistent error format)
    if tmp_root is not None:
        with tempfile.TemporaryDirectory(dir=tmp_root):
            pass

    if src_path.lower().endswith(".zip") and STREAM_UNZIP_AVAILABLE:
        return self._get_archive_streaming(
            src_path,
            dst_root,
            validate=validate,
            verbose=verbose,
        )

    return self._get_archive_via_tempfile(
        src_path,
        dst_root,
        tmp_root=tmp_root,
        validate=validate,
        verbose=verbose,
    )

def _get_archive_via_tempfile(
    self,
    src_path: str,
    dst_root: str,
    *,
    tmp_root: str | None,
    validate: bool,
    verbose: bool,
) -> list[str]:
    with tempfile.TemporaryDirectory(dir=tmp_root) as tmp:
        tmp_dir = audeer.path(tmp, os.path.basename(dst_root))
        local_archive = os.path.join(
            tmp_dir,
            os.path.basename(src_path),
        )
        self.get_file(
            src_path,
            local_archive,
            validate=validate,
            verbose=verbose,
        )
        return audeer.extract_archive(
            local_archive,
            dst_root,
            verbose=verbose,
        )
```

This moves all non-streaming logic out of `get_archive()`, making the top-level behavior easier to scan.

---

### 2. Extract checksum + progress handling from `_get_archive_streaming()`

The `stream_with_hash()` inner function mixes three concerns (read, hash, progress). You can move that into a reusable helper, which shrinks `_get_archive_streaming()` and makes the checksum logic easier to test in isolation:

```python
def _stream_with_md5_and_progress(
    self,
    src_path: str,
    *,
    validate: bool,
    verbose: bool,
) -> tuple[Iterator[bytes], "hashlib._Hash | None"]:
    md5_hash = hashlib.md5() if validate else None
    src_size = self._size(src_path) if verbose else None

    desc = audeer.format_display_message(
        f"Download {os.path.basename(src_path)}",
        pbar=verbose,
    )
    pbar = audeer.progress_bar(total=src_size, desc=desc, disable=not verbose)

    def iterator() -> Iterator[bytes]:
        with pbar:
            for chunk in self._get_file_stream(src_path):
                if md5_hash is not None:
                    md5_hash.update(chunk)
                pbar.update(len(chunk))
                yield chunk

    return iterator(), md5_hash
```

Then `_get_archive_streaming()` becomes more linear:

```python
def _get_archive_streaming(
    self,
    src_path: str,
    dst_root: str,
    *,
    validate: bool = False,
    verbose: bool = False,
) -> list[str]:
    if os.path.exists(dst_root) and not os.path.isdir(dst_root):
        raise NotADirectoryError(errno.ENOTDIR, os.strerror(errno.ENOTDIR), dst_root)

    dst_root_existed = os.path.exists(dst_root)
    audeer.mkdir(dst_root)

    extracted_files: list[str] = []

    stream, md5_hash = self._stream_with_md5_and_progress(
        src_path,
        validate=validate,
        verbose=verbose,
    )

    try:
        for file_name, file_size, unzipped_chunks in stream_unzip(stream):
            # ... existing extraction logic ...
            # (decode name, mkdir, write file, append to extracted_files)
            ...
        if validate:
            expected_checksum = self.checksum(src_path)
            actual_checksum = md5_hash.hexdigest()
            if actual_checksum != expected_checksum:
                self._cleanup_extracted(dst_root, dst_root_existed, extracted_files)
                raise InterruptedError(
                    f"Execution is interrupted because {src_path} has checksum "
                    f"'{actual_checksum}' when the expected checksum is "
                    f"'{expected_checksum}'. The extracted files have been removed."
                )
    except (zipfile.BadZipFile, TruncatedDataError, UnfinishedIterationError) as ex:
        self._cleanup_extracted(dst_root, dst_root_existed, extracted_files)
        raise RuntimeError(f"Broken archive: {src_path}") from ex

    return extracted_files
```

---

### 3. Replace the closure `cleanup_on_failure()` with a small helper

The current inner function captures `dst_root_existed` and `extracted_files`. Making it a method clarifies responsibilities and avoids the closure:

```python
def _cleanup_extracted(
    self,
    dst_root: str,
    dst_root_existed: bool,
    extracted_files: Sequence[str],
) -> None:
    if not dst_root_existed and os.path.exists(dst_root):
        shutil.rmtree(dst_root)
    else:
        for file_name in extracted_files:
            full_path = audeer.path(dst_root, file_name)
            if os.path.exists(full_path):
                os.remove(full_path)
```

Then just call:

```python
self._cleanup_extracted(dst_root, dst_root_existed, extracted_files)
```

where you currently call `cleanup_on_failure()`.

---

These small extractions keep all semantics (including checksum behavior, progress reporting, and cleanup) but significantly flatten `_get_archive_streaming()` and make `get_archive()` easier to reason about.
</issue_to_address>

frankenjoe · 2026-01-15T10:29:58Z

> could be further improved by not storing the model files as ZIP files, as they are already quite compressed.

I think we once discussed the idea of using ZIP without compression in that case.

hagenw · 2026-01-15T10:34:03Z

> > could be further improved by not storing the model files as ZIP files, as they are already quite compressed.
>
> I think we once discussed the idea of using ZIP without compression in that case.

Good idea, that would eliminate the need for tracking the filename and we can just download ZIP files. The only downside is that for the old files it will still be faster using streaming ZIP extraction, but for the new ones it will be faster using get_file(num_workers=). So we would need to add a flag on the backend (e.g. in the header file) what kind of ZIP it is.

frankenjoe · 2026-01-15T12:55:13Z

> So we would need to add a flag on the backend (e.g. in the header file) what kind of ZIP it is.

Ok, let's say we somehow know if a file was zipped with or without compression. In case no compression was used, then we don't want to use streaming ZIP extraction and it can make sense to use multiple workers instead. So maybe it makes sense we keep the argument?

hagenw · 2026-01-15T15:50:12Z

> > So we would need to add a flag on the backend (e.g. in the header file) what kind of ZIP it is.
>
> Ok, let's say we somehow know if a file was zipped with or without compression. In case no compression was used, then we don't want to use streaming ZIP extraction and it can make sense to use multiple workers instead. So maybe it makes sense we keep the argument?

There are two options:

1. We integrate the different handling for uncompressed and compressed ZIP files directly in audbackend. Then get_archive() would check whether the file is compressed and call get_file() with or without num_workers accordingly. The downside is that we would always need to first download the end of the ZIP file, where its metadata is stored, in order to check whether it is compressed.
2. We integrate the different handling in audmodel and store inside the header whether the file is compressed (the header is always downloaded first, as it is used to decode the UID to a path). Then we can call get_file(num_workers=) from inside audmodel and don't need a num_workers argument for get_archive().

At the moment, I'm more in favor of the second approach.
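
For reference, the compression check needed by the first option can be sketched with the standard library on a local file. `is_uncompressed()` is a hypothetical helper; for a remote file, the central directory at the end of the archive would have to be fetched first, which is the downside named above.

```python
import zipfile


def is_uncompressed(path: str) -> bool:
    """Return True if every entry of the ZIP file is stored uncompressed."""
    with zipfile.ZipFile(path) as zf:
        # The compression method of each entry is recorded in the
        # central directory, which sits at the end of the archive
        return all(
            info.compress_type == zipfile.ZIP_STORED
            for info in zf.infolist()
        )
```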

But there is another reason to maybe stay with num_workers. If we do not use streaming ZIP extraction, e.g. if we have a TAR.GZ archive or run Python 3.14, we still call get_file(), and in that case we can simply use num_workers there. For streaming ZIP files, we could then simply ignore num_workers.

frankenjoe · 2026-01-15T18:25:42Z

I see a third option: if num_workers is set to 1 we try to stream, otherwise we keep the old behavior.

hagenw · 2026-01-16T07:22:34Z

The problem is that you will be always slower with num_workers.

When loading the 4.2GB model 7289b57d-1.0.0 with streaming ZIP we get

| Before | After |
| --- | --- |
| 0:02:21.248 | 0:01:20.057 |

when loading without streaming and using num_workers in get_file() we get (compare audeering/audmodel#35)

| num_workers | num_iter | elapsed(avg) | elapsed(std) |
| --- | --- | --- | --- |
| 1 | 10 | 0:02:23.903513 | 0:00:05.744711 |
| 2 | 10 | 0:02:14.147027 | 0:00:07.193578 |
| 3 | 10 | 0:02:09.759947 | 0:00:00.829281 |
| 4 | 10 | 0:02:10.224645 | 0:00:00.940978 |
| 5 | 10 | 0:02:12.284520 | 0:00:03.940332 |
| 10 | 10 | 0:02:11.610993 | 0:00:01.676725 |

Which means that in most cases I would not recommend using num_workers instead of streaming.

I would vote for one of the following solutions:

- use num_workers only when streaming ZIP extraction is not possible
- do not add num_workers

frankenjoe · 2026-01-16T07:57:10Z

> The problem is that you will be always slower with num_workers.

Ok, maybe I misunderstood, but I thought for TAR.GZ or uncompressed ZIP files using multiple workers is faster.

If this is true, then num_workers allows controlling whether streaming ZIP extraction is desired or not. A package that has this information can then set it accordingly for best performance.

hagenw · 2026-01-16T08:03:32Z

> package that has this information can then set it accordingly for best performance.

Yes, but the package can always call get_file(num_workers=) and handle the extraction itself (e.g. with audeer.extract_archive()) instead of calling get_archive(). We don't need to add num_workers to get_archive() for this case.
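
The two-step alternative can be sketched as follows. This is an illustration only: `download_then_extract()` is a hypothetical helper, `backend` is any object offering `get_file(src_path, dst_path, num_workers=...)` (an assumption based on the signature discussed in this pull request), and the standard library's `shutil.unpack_archive()` stands in for `audeer.extract_archive()`.

```python
import os
import shutil
import tempfile


def download_then_extract(
    backend,
    src_path: str,
    dst_root: str,
    *,
    num_workers: int = 5,
) -> list[str]:
    """Download an archive with several workers, then extract it."""
    with tempfile.TemporaryDirectory() as tmp:
        local_archive = os.path.join(tmp, os.path.basename(src_path))
        # Step 1: parallel download to a temporary file
        backend.get_file(src_path, local_archive, num_workers=num_workers)
        # Step 2: sequential extraction (format is guessed from the extension)
        shutil.unpack_archive(local_archive, dst_root)
    # List all files now present below dst_root
    return sorted(
        os.path.relpath(os.path.join(root, name), dst_root)
        for root, _dirs, names in os.walk(dst_root)
        for name in names
    )
```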

frankenjoe · 2026-01-16T08:09:55Z

> Yes, but the package can always call get_file(num_workers=) + handling extraction (e.g. with audeer.extract_archive()) instead of calling get_archive(). We don't need to add num_workers to get_archive() for this case.

Yes, but that's more complicated than just doing:

`get_archive(..., num_workers=1 if streamable else num_workers)`

As long as there is a use-case for using num_workers (TAR.GZ or uncompressed ZIP files) I would not remove it.

hagenw · 2026-01-16T10:36:59Z

> > Yes, but the package can always call get_file(num_workers=) + handling extraction (e.g. with audeer.extract_archive()) instead of calling get_archive(). We don't need to add num_workers to get_archive() for this case.
>
> Yes, but that's more complicated than just doing:
>
> `get_archive(..., num_workers=1 if streamable else num_workers)`
>
> As long as there is a use-case for using num_workers (TAR.GZ or uncompressed ZIP files) I would not remove it.

I see the point, but there is one big caveat. As a normal user, I would expect that whenever num_workers is available in a method, using num_workers=6 will be faster than using num_workers=1. But this is not the case here for most of the archives. Hence, by not having num_workers we discourage its use. On the other hand, it is only used by developers anyway ;)

frankenjoe · 2026-01-16T11:50:34Z

But there is also the caveat that users continue using get_archive() for all archives instead of using the (not so obvious) combination of get_file() and audeer.extract_archive() when streaming is not supported :)

hagenw · 2026-01-20T10:50:05Z

I re-added num_workers.

I also started to benchmark downloading an uncompressed ZIP file with multiple workers, but encountered an error with it. So maybe, we still have an issue in get_file() when using multiple workers. I created #284 to have a look at it afterwards. With streaming ZIP extraction I had no error so far.

hagenw · 2026-01-21T12:15:02Z

I repeated my benchmarks of downloading a ~4.2 GB file as compressed vs. uncompressed ZIP (with the fix introduced in #285).

Execution time as h:mm:ss (average over 10 runs). Streaming ZIP extraction is used only for num_workers=1.

| num_workers | compressed | execution time (h:mm:ss) |
| --- | --- | --- |
| 1 | ✅ | 0:01:26.081266 |
| 1 | ❌ | 0:01:25.839748 |
| 2 | ✅ | 0:01:56.109490 |
| 2 | ❌ | 0:01:05.871747 |
| 5 | ✅ | 0:01:49.716402 |
| 5 | ❌ | 0:00:57.485189 |
| 10 | ✅ | 0:01:51.970323 |
| 10 | ❌ | 0:00:58.532454 |

Results are as expected:

- For a compressed ZIP file, streaming download with num_workers=1 is the fastest.
- For an uncompressed ZIP we can be faster when using more workers (but there seems to be an upper limit).
- Both methods are faster than not using streaming download or multiple workers (which was around 0:02:18.284, see description of this pull request).

frankenjoe · 2026-01-21T12:58:30Z

Ok, so our recommendations would be:

- for files that can be compressed use compressed ZIP with streaming (i.e. single worker)
- for all other files use uncompressed ZIP with ~5 workers

I guess we also need to add an argument to audeer.create_archive() that allows creating uncompressed archives.
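
Independent of audeer, the difference between the two archive types can be sketched with the standard library's `zipfile` module. `create_zip()` and its `compress` flag are hypothetical names for illustration, not the `audeer.create_archive()` API.

```python
import os
import zipfile


def create_zip(
    root: str,
    files: list[str],
    archive: str,
    *,
    compress: bool = True,
) -> None:
    """Create a ZIP archive from files given relative to root.

    With compress=False the entries are only stored, which suits
    content that is already compressed (e.g. large model files).
    """
    compression = zipfile.ZIP_DEFLATED if compress else zipfile.ZIP_STORED
    with zipfile.ZipFile(archive, "w", compression=compression) as zf:
        for file in files:
            zf.write(os.path.join(root, file), arcname=file)
```

Following the recommendation above, a stored archive would then be downloaded with several workers, a deflated one with streaming extraction.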

hagenw · 2026-01-21T13:45:23Z

> Ok, so our recommendations would be:
>
> - for files that can be compressed use compressed ZIP with streaming (i.e. single worker)
> - for all other files use uncompressed ZIP with ~5 workers

Yes.

I extended the docstring here to reflect this:

> I guess we also need to add an argument to audeer.create_archive() that allows creating uncompressed archives.

I created audeering/audeer#188

hagenw mentioned this pull request Jan 9, 2026

Multithreading archive extraction audeering/audeer#186

Closed

hagenw self-assigned this Jan 12, 2026

hagenw force-pushed the stream-extract branch from 85a8ecd to e8c2749 Compare January 12, 2026 13:40

hagenw mentioned this pull request Jan 13, 2026

Use streaming ZIP extraction audeering/audmodel#38

Draft

hagenw marked this pull request as ready for review January 14, 2026 16:35

sourcery-ai Bot reviewed Jan 14, 2026

hagenw requested a review from frankenjoe January 15, 2026 10:00

hagenw mentioned this pull request Jan 20, 2026

Extracting a ZIP file downloaded with multiple workers might fail #284

Closed

hagenw added 4 commits January 21, 2026 13:56

Extract ZIP archives while downloading

1b34f10

Don't use stream-zip for Python 3.14

d80e651

Fix dependencies

b168224

Ensure we show progress bar

d64b66f

hagenw added 13 commits January 21, 2026 13:56

Another try to fix Windows test

c24ad8a

Revert "Fix test expectations under Windows"

2dacc65

This reverts commit eafab4c.

Try to fix Windows

ed39ad6

Add _size() to base backend

e87c76f

Fix coverage

5c43477

Updates from sourcery feedback

3d9ddfc

Restore size tests

0b02bdb

Fix test_size for filesystem

44481d7

Add tests for _get_file_stream() for all backends

a0cd0d2

Extend comment

aa28b4d

Remove pragma

b968e9f

Readd num_workers

4ffd418

Add missing num_workers

1c5b418

hagenw force-pushed the stream-extract branch from 51dc22a to 1c5b418 Compare January 21, 2026 12:56

hagenw added 2 commits January 21, 2026 14:36

Recommend num_workers for uncompressed ZIP files

39dd915

Extend docstrings for interfaces as well

d7901b8

hagenw mentioned this pull request Jan 21, 2026

Add option to create uncompressed archive to audeer.create_archive() audeering/audeer#188

Open

This was referenced Jan 21, 2026

Add num_workers to extract_archives() audeering/audeer#187

Closed

Publish model files as uncompressed ZIP file audeering/audmodel#39

Open

frankenjoe reviewed Jan 26, 2026

hagenw and others added 5 commits January 27, 2026 14:08

Fix typing of return values

a65017e

Co-authored-by: Johannes Wagner <jwagner@audeering.com>

Remove verification statement

59f2c5d

Improve docstring for num_workers

15b4fff

Fix imports

0618602

Improve docstring

fd6224d

frankenjoe approved these changes Feb 2, 2026

hagenw merged commit 6edfbee into main Feb 2, 2026
19 of 20 checks passed

hagenw deleted the stream-extract branch February 2, 2026 15:45
