Skip to content

Reuse pre-signed URL for multi-threaded downloads#1407

Open
BryanFauble wants to merge 1 commit into
developfrom
synpy-presigned-url-reuse
Open

Reuse pre-signed URL for multi-threaded downloads#1407
BryanFauble wants to merge 1 commit into
developfrom
synpy-presigned-url-reuse

Conversation

@BryanFauble

Copy link
Copy Markdown
Member

Problem

When downloading a file larger than 8 MB (the multi-threaded download path), the client requests two pre-signed URLs from the /fileHandle/batch endpoint:

  1. In download_by_file_handle, to retrieve file handle metadata (concrete type, content size, MD5, storage location) needed to choose the download strategy. Its preSignedURL is discarded on the multi-threaded path.
  2. Again inside _MultithreadedDownloader (via PresignedUrlProvider._get_pre_signed_info), to actually download the chunks.

The first URL is thrown away, so every large-file download generates a redundant pre-signed URL — doubling the pre-signed URL count and polluting download metrics. (Small/single-threaded downloads and cache hits are unaffected — they already use the URL from step 1, or none at all.)

sequenceDiagram
    actor Client
    participant DBFH as download_by_file_handle
    participant API as /fileHandle/batch
    participant MTD as _MultithreadedDownloader
    participant Provider as PresignedUrlProvider

    Client->>DBFH: download (file > 8 MB)
    DBFH->>API: getFileHandleForDownload (#1)
    API-->>DBFH: metadata + preSignedURL
    Note over DBFH: preSignedURL DISCARDED<br/>(only metadata used)
    DBFH->>MTD: download_from_url_multi_threaded()
    MTD->>Provider: get_info()
    Provider->>API: getFileHandleForDownload (#2)
    API-->>Provider: preSignedURL
    Note over API: 2 pre-signed URLs<br/>requested per download
    loop each 8 MB chunk
        MTD->>Provider: get_info() (reuses cached URL)
    end
Loading

Solution

Forward the pre-signed URL already retrieved in step 1 into download_from_url_multi_threaded (the function already accepts a presigned_url parameter, used today by WikiPage). The downloader seeds PresignedUrlProvider._cached_info with it and skips the second batch call.

Crucially, the file handle identifiers (file_handle_id / object_id / object_type) are now also threaded into the DownloadRequest on the seeded-URL path, so the mid-download expiry/refetch flow is preserved: if the seeded URL nears expiry during a long transfer, PresignedUrlProvider.get_info() still refetches a fresh URL exactly as before. Wiki callers omit these identifiers, so their existing (no-refetch) behavior is unchanged.

sequenceDiagram
    actor Client
    participant DBFH as download_by_file_handle
    participant API as /fileHandle/batch
    participant MTD as _MultithreadedDownloader
    participant Provider as PresignedUrlProvider

    Client->>DBFH: download (file > 8 MB)
    DBFH->>API: getFileHandleForDownload (#1)
    API-->>DBFH: metadata + preSignedURL
    Note over DBFH: build PresignedUrlInfo<br/>from the SAME response
    DBFH->>MTD: download_from_url_multi_threaded(presigned_url=..., file_handle_id=...)
    MTD->>Provider: get_info()
    Note over Provider: returns seeded URL<br/>NO second API call
    loop each 8 MB chunk
        MTD->>Provider: get_info() (reuses cached URL)
        alt URL near expiry mid-download
            Provider->>API: getFileHandleForDownload (refetch)
            API-->>Provider: fresh preSignedURL
        end
    end
    Note over API: 1 pre-signed URL<br/>per download (refetch only if expired)
Loading

Testing

  • unit_test_download_functions.py — new tests spy on get_file_handle_for_download_async and download_from_url_multi_threaded to assert the batch endpoint is called exactly once and that the retrieved URL + file handle identifiers are forwarded; plus that the constructed DownloadRequest carries the identifiers (and that the identifier-less wiki path leaves them None).
  • unit_test_download_async.py — new TestSeededPresignedUrlProvider: a seeded unexpired URL is reused with zero refetch calls; a seeded near-expiry URL triggers exactly one refetch keyed on the correct identifiers (not None).
  • unit_test_download.py — updated test_multithread_true_s3_file_handle to reflect the new forwarding behavior.
  • Full regression run of the download, multithread_download, and wiki unit suites: 139 passed. ruff / black / isort clean on changed files.

Multi-threaded downloads (files > 8 MB) requested two pre-signed URLs:
one in download_by_file_handle to choose the download strategy, and a
second inside the multi-threaded downloader. The first URL's preSignedURL
field was discarded, doubling the pre-signed URL count (and download
metrics) for every large-file download.

Forward the already-retrieved URL into download_from_url_multi_threaded
so the downloader reuses it instead of fetching a second one. The file
handle identifiers (file_handle_id/object_id/object_type) are now passed
alongside the seeded URL so PresignedUrlProvider can still refetch a fresh
URL if it expires mid-download. Wiki callers omit these identifiers,
preserving their existing no-refetch behavior.
@BryanFauble BryanFauble requested a review from a team as a code owner June 12, 2026 17:00
Copilot AI review requested due to automatic review settings June 12, 2026 17:00

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR reduces redundant /fileHandle/batch calls on the multi-threaded download path by forwarding the pre-signed URL retrieved during initial file-handle metadata lookup into download_from_url_multi_threaded, while also forwarding file-handle identifiers so the downloader can still refetch a URL mid-download if needed.

Changes:

  • Seed the multi-threaded downloader with a PresignedUrlInfo built from the initial get_file_handle_for_download_async response (avoids a second pre-signed URL fetch).
  • Extend the seeded-URL multi-threaded path to include file_handle_id / object_id / object_type in the constructed DownloadRequest to preserve refetch behavior.
  • Add/adjust unit tests to assert single batch call behavior and correct forwarding/refetch semantics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
synapseclient/core/download/download_functions.py Forwards initial preSignedURL into multi-threaded downloads and threads identifiers into DownloadRequest.
tests/unit/synapseclient/core/unit_test_download.py Updates multi-threaded download test to assert the forwarded PresignedUrlInfo.
tests/unit/synapseclient/core/download/unit_test_download_functions.py Adds tests asserting the batch endpoint is hit once and identifiers + URL are forwarded into multi-threaded downloads.
tests/unit/synapseclient/core/download/unit_test_download_async.py Adds tests for a seeded PresignedUrlProvider reusing unexpired URLs and refetching expired ones using identifiers.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +754 to +758
# Pass through the file handle identifiers so the downloader can refetch a
# fresh pre-signed URL if the seeded one expires mid-download. Callers that
# cannot refetch (e.g. wiki attachments) omit these, leaving them as None.
request = DownloadRequest(
file_handle_id=int(file_handle_id) if file_handle_id is not None else None,

@thomasyu888 thomasyu888 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 Thanks for taking such a quick look at this. My biggest concern is whether this would work for large file downloads, and I think it will.

I'll defer to @linglp / @andrewelamb for final review

# Pass through the file handle identifiers so the downloader can refetch a
# fresh pre-signed URL if the seeded one expires mid-download. Callers that
# cannot refetch (e.g. wiki attachments) omit these, leaving them as None.
request = DownloadRequest(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify or add a comment about the asymmetry here? We are doing:

        request = DownloadRequest(
            file_handle_id=int(file_handle_id) if file_handle_id is not None else None,
            object_id=object_id,
            object_type=object_type,
            path=temp_destination,
            debug=client.debug,
            presigned_url=presigned_url,
        )

But in the else statement, I saw:

        request = DownloadRequest(
            file_handle_id=int(file_handle_id),
            object_id=object_id,
            object_type=object_type,
            path=temp_destination,
            debug=client.debug,
        )

I am curious why None guard is only needed in the if presigned_url is not None branch? Is it because file_handle_id is guaranteed non-None in else?

@linglp linglp left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just one question

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants