metadata-timeout-propagation-fix#47529
Open
dibahlfi wants to merge 2 commits into
Open
Conversation
- propagate per-call read_timeout, connection_timeout, and timeout (operation deadline) options across query setup metadata calls (container read, query plan, /pkranges) in sync and async paths - extend test coverage for timeout propagation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member
Author
|
@sdkReviewAgent-2 |
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes and validates propagation of per-call timeout settings (read_timeout, connection_timeout, and operation timeout/deadline) across the Cosmos query “setup” metadata requests (container read, query plan fetch, and /pkranges) for both sync and async execution paths.
Changes:
- Adds shared helpers in
_base.pyto carry/copy per-call timeout settings (and operation start time) into downstream request kwargs/options. - Wires these helpers into query execution (query plan), container read, and hybrid-search
/pkrangesmetadata flows (sync + async). - Adds unit + live/emulator tests to assert timeout propagation and deadline behavior; updates CHANGELOG entry.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/azure/cosmos/_base.py | Adds helper utilities/constants to propagate per-call timeouts and operation start time into options/kwargs and ensures pk-range options retain them. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py | Introduces Kwargs.CONNECTION_TIMEOUT constant and updates inline documentation/comments for kwargs constants. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py | Uses the new helper to copy per-call timeouts/start time from options into request kwargs for query feed calls. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py | Async equivalent: copies per-call timeouts/start time from options into request kwargs for query feed calls. |
| sdk/cosmos/azure-cosmos/azure/cosmos/container.py | Uses shared helper to forward per-call timeouts/start time into container read requests built from options. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py | Async equivalent: forwards per-call timeouts/start time into container read requests. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/execution_dispatcher.py | Updates query-plan dispatch to forward additional per-call timeout/deadline settings via kwargs. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/execution_dispatcher.py | Async equivalent: updates query-plan dispatch timeout/deadline forwarding. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/hybrid_search_aggregator.py | Ensures hybrid-search all-ranges /pkranges path carries per-call timeout/deadline options. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/hybrid_search_aggregator.py | Async equivalent: carries per-call timeout/deadline options for hybrid-search all-ranges /pkranges. |
| sdk/cosmos/azure-cosmos/tests/test_timeout_propagation_unit.py | Adds mock-light unit tests validating timeout/deadline propagation behavior through key helpers and dispatchers. |
| sdk/cosmos/azure-cosmos/tests/test_metadata_timeout_propagation.py | Adds end-to-end fault-injection tests to confirm timeouts reach metadata setup calls and operation deadlines halt setup. |
| sdk/cosmos/azure-cosmos/tests/test_container_rid_header_unit.py | Extends pk-range option sanitization tests for timeouts and improves mocks to seed required internal sidecars. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the timeout propagation bug fix under “Bugs Fixed”. |
Member
Author
|
@sdkReviewAgent-2 |
Member
Author
|
/azp run python - cosmos - tests |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Issue - when a customer passes timeouts to an operation like single query_items(...) call, the SDK was honoring them on data plane operations but it was not consistently enforcing them on behind-the-scenes metadata classs the SDK has to make - reading the container, listing partitions (/pkranges), and fetching the query plan. On those calls the SDK fell back to the client/policy defaults instead of the value client set.
After this change, a per-call timeout bounds the whole query — metadata calls included.
Concrete symptoms this removes
read_timeout ignored on /pkranges. A customer that sets a tight client default and raises read_timeout for one slow query would see that query die on a sub-second routing fetch, with an error naming the wrong number (the client default, not the value they set) — pointing them at the page fetch that never ran.
connection_timeout ignored on all setup calls. A service that sets a fail-fast socket-open budget got the full 5s default on exactly the calls a cold operation is made of.
timeout= deadline didn't cover setup. A hard end-to-end bound meant to keep a slow dependency from tying up a request thread didn't actually bound /pkranges or the query plan, so slow setup could run past it.
What changed:
The per-call timeouts (and the operation's start mark) are carried into the container read, /pkranges, and query-plan calls, so they reach the request layer the same way the page fetch always has.
The query-plan call now omits read_timeout when you didn't set one, instead of forwarding None (which previously disabled that call's read timeout). It now falls back to the policy default like every other call.
The async retry loop runs its after-the-call deadline check on the normal request path, restoring sync/async parity.
No new timers or defaults are added.
Failover stays fast. The forced-short failover probes (3s account probe, 6s recovery probe) are untouched, so nothing a caller passes through a query can slow a regional failover.
No public API change.
Testing:
Unit : value is picked up and handed along at every step; the /pkranges options formatter keeps the timers; an unset timer is never forced to None; sync and async deadline parity.
Fault-injection integration (sync + async): the values actually reach the wire on the container read, query plan, and /pkranges - on a cold client and after a partition-split refresh - the account probe keeps its short budget, and a tight timeout = stops the query during setup.