Reuse pre-signed URL for multi-threaded downloads#1407
Conversation
Multi-threaded downloads (files > 8 MB) requested two pre-signed URLs: one in download_by_file_handle to choose the download strategy, and a second inside the multi-threaded downloader. The first URL's preSignedURL field was discarded, doubling the pre-signed URL count (and download metrics) for every large-file download. Forward the already-retrieved URL into download_from_url_multi_threaded so the downloader reuses it instead of fetching a second one. The file handle identifiers (file_handle_id/object_id/object_type) are now passed alongside the seeded URL so PresignedUrlProvider can still refetch a fresh URL if it expires mid-download. Wiki callers omit these identifiers, preserving their existing no-refetch behavior.
There was a problem hiding this comment.
Pull request overview
This PR reduces redundant /fileHandle/batch calls on the multi-threaded download path by forwarding the pre-signed URL retrieved during initial file-handle metadata lookup into download_from_url_multi_threaded, while also forwarding file-handle identifiers so the downloader can still refetch a URL mid-download if needed.
Changes:
- Seed the multi-threaded downloader with a
PresignedUrlInfobuilt from the initialget_file_handle_for_download_asyncresponse (avoids a second pre-signed URL fetch). - Extend the seeded-URL multi-threaded path to include
file_handle_id/object_id/object_typein the constructedDownloadRequestto preserve refetch behavior. - Add/adjust unit tests to assert single batch call behavior and correct forwarding/refetch semantics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
synapseclient/core/download/download_functions.py |
Forwards initial preSignedURL into multi-threaded downloads and threads identifiers into DownloadRequest. |
tests/unit/synapseclient/core/unit_test_download.py |
Updates multi-threaded download test to assert the forwarded PresignedUrlInfo. |
tests/unit/synapseclient/core/download/unit_test_download_functions.py |
Adds tests asserting the batch endpoint is hit once and identifiers + URL are forwarded into multi-threaded downloads. |
tests/unit/synapseclient/core/download/unit_test_download_async.py |
Adds tests for a seeded PresignedUrlProvider reusing unexpired URLs and refetching expired ones using identifiers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Pass through the file handle identifiers so the downloader can refetch a | ||
| # fresh pre-signed URL if the seeded one expires mid-download. Callers that | ||
| # cannot refetch (e.g. wiki attachments) omit these, leaving them as None. | ||
| request = DownloadRequest( | ||
| file_handle_id=int(file_handle_id) if file_handle_id is not None else None, |
thomasyu888
left a comment
There was a problem hiding this comment.
🔥 Thanks for taking such a quick look at this. My biggest concern is whether this would work for large file downloads, and I think it will.
I'll defer to @linglp / @andrewelamb for final review
| # Pass through the file handle identifiers so the downloader can refetch a | ||
| # fresh pre-signed URL if the seeded one expires mid-download. Callers that | ||
| # cannot refetch (e.g. wiki attachments) omit these, leaving them as None. | ||
| request = DownloadRequest( |
There was a problem hiding this comment.
Could you clarify or add a comment about the asymmetry here? We are doing:
request = DownloadRequest(
file_handle_id=int(file_handle_id) if file_handle_id is not None else None,
object_id=object_id,
object_type=object_type,
path=temp_destination,
debug=client.debug,
presigned_url=presigned_url,
)
But in the else statement, I saw:
request = DownloadRequest(
file_handle_id=int(file_handle_id),
object_id=object_id,
object_type=object_type,
path=temp_destination,
debug=client.debug,
)
I am curious why None guard is only needed in the if presigned_url is not None branch? Is it because file_handle_id is guaranteed non-None in else?
Problem
When downloading a file larger than 8 MB (the multi-threaded download path), the client requests two pre-signed URLs from the
/fileHandle/batchendpoint:download_by_file_handle, to retrieve file handle metadata (concrete type, content size, MD5, storage location) needed to choose the download strategy. ItspreSignedURLis discarded on the multi-threaded path._MultithreadedDownloader(viaPresignedUrlProvider._get_pre_signed_info), to actually download the chunks.The first URL is thrown away, so every large-file download generates a redundant pre-signed URL — doubling the pre-signed URL count and polluting download metrics. (Small/single-threaded downloads and cache hits are unaffected — they already use the URL from step 1, or none at all.)
sequenceDiagram actor Client participant DBFH as download_by_file_handle participant API as /fileHandle/batch participant MTD as _MultithreadedDownloader participant Provider as PresignedUrlProvider Client->>DBFH: download (file > 8 MB) DBFH->>API: getFileHandleForDownload (#1) API-->>DBFH: metadata + preSignedURL Note over DBFH: preSignedURL DISCARDED<br/>(only metadata used) DBFH->>MTD: download_from_url_multi_threaded() MTD->>Provider: get_info() Provider->>API: getFileHandleForDownload (#2) API-->>Provider: preSignedURL Note over API: 2 pre-signed URLs<br/>requested per download loop each 8 MB chunk MTD->>Provider: get_info() (reuses cached URL) endSolution
Forward the pre-signed URL already retrieved in step 1 into
download_from_url_multi_threaded(the function already accepts apresigned_urlparameter, used today byWikiPage). The downloader seedsPresignedUrlProvider._cached_infowith it and skips the second batch call.Crucially, the file handle identifiers (
file_handle_id/object_id/object_type) are now also threaded into theDownloadRequeston the seeded-URL path, so the mid-download expiry/refetch flow is preserved: if the seeded URL nears expiry during a long transfer,PresignedUrlProvider.get_info()still refetches a fresh URL exactly as before. Wiki callers omit these identifiers, so their existing (no-refetch) behavior is unchanged.sequenceDiagram actor Client participant DBFH as download_by_file_handle participant API as /fileHandle/batch participant MTD as _MultithreadedDownloader participant Provider as PresignedUrlProvider Client->>DBFH: download (file > 8 MB) DBFH->>API: getFileHandleForDownload (#1) API-->>DBFH: metadata + preSignedURL Note over DBFH: build PresignedUrlInfo<br/>from the SAME response DBFH->>MTD: download_from_url_multi_threaded(presigned_url=..., file_handle_id=...) MTD->>Provider: get_info() Note over Provider: returns seeded URL<br/>NO second API call loop each 8 MB chunk MTD->>Provider: get_info() (reuses cached URL) alt URL near expiry mid-download Provider->>API: getFileHandleForDownload (refetch) API-->>Provider: fresh preSignedURL end end Note over API: 1 pre-signed URL<br/>per download (refetch only if expired)Testing
unit_test_download_functions.py— new tests spy onget_file_handle_for_download_asyncanddownload_from_url_multi_threadedto assert the batch endpoint is called exactly once and that the retrieved URL + file handle identifiers are forwarded; plus that the constructedDownloadRequestcarries the identifiers (and that the identifier-less wiki path leaves themNone).unit_test_download_async.py— newTestSeededPresignedUrlProvider: a seeded unexpired URL is reused with zero refetch calls; a seeded near-expiry URL triggers exactly one refetch keyed on the correct identifiers (notNone).unit_test_download.py— updatedtest_multithread_true_s3_file_handleto reflect the new forwarding behavior.