metadata-timeout-propagation-fix#47529

Open
dibahlfi wants to merge 2 commits into main from users/dibahl/metadata-timeout-propagation-fix

@dibahlfi dibahlfi commented Jun 16, 2026

Member

Issue: when a customer passes timeouts to an operation such as a single query_items(...) call, the SDK honored them on data plane operations but did not consistently enforce them on the behind-the-scenes metadata calls the SDK has to make: reading the container, listing partitions (/pkranges), and fetching the query plan. On those calls the SDK fell back to the client/policy defaults instead of the value the caller set.
After this change, a per-call timeout bounds the whole query, metadata calls included.

Concrete symptoms this removes
read_timeout ignored on /pkranges. A customer that sets a tight client default and raises read_timeout for one slow query would see that query die on a sub-second routing fetch, with an error naming the wrong number (the client default, not the value they set) — pointing them at the page fetch that never ran.

connection_timeout ignored on all setup calls. A service that sets a fail-fast socket-open budget got the full 5s default on exactly the calls a cold operation is made of.

timeout= deadline didn't cover setup. A hard end-to-end bound meant to keep a slow dependency from tying up a request thread didn't actually bound /pkranges or the query plan, so slow setup could run past it.
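To illustrate what "bounding setup" means, here is a minimal sketch of an end-to-end deadline that is checked around every step of an operation, setup included. It is not the SDK's code: `OperationTimeout`, `remaining_budget`, and `run_step` are hypothetical names, and the injectable `clock` exists only to make the sketch easy to test.

```python
import time


class OperationTimeout(Exception):
    """Hypothetical error raised when the end-to-end budget is spent."""


def remaining_budget(start, timeout, now):
    """Seconds left in the operation budget, or None when no deadline is set."""
    if timeout is None:
        return None
    return timeout - (now - start)


def run_step(name, step, start, timeout, clock=time.monotonic):
    """Run one step (container read, query plan, /pkranges, page fetch)
    under the shared deadline that began at `start`."""
    # Check before the call so a spent budget never starts a new request.
    left = remaining_budget(start, timeout, clock())
    if left is not None and left <= 0:
        raise OperationTimeout(f"deadline exceeded before {name}")
    result = step()
    # Check after the call so a slow step cannot silently overrun the budget.
    left = remaining_budget(start, timeout, clock())
    if left is not None and left <= 0:
        raise OperationTimeout(f"deadline exceeded during {name}")
    return result
```

The point of the sketch is that every step receives the same `start` and `timeout`; a step that is called without them is exactly the kind of unbounded setup call described above.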

What changed:
The per-call timeouts (and the operation's start mark) are carried into the container read, /pkranges, and query-plan calls, so they reach the request layer the same way the page fetch always has.
The query-plan call now omits read_timeout when you didn't set one, instead of forwarding None (which previously disabled that call's read timeout). It now falls back to the policy default like every other call.
The async retry loop runs its after-the-call deadline check on the normal request path, restoring sync/async parity.
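The first two changes can be sketched roughly as follows. This is an illustration under assumptions, not the helper added in `_base.py`: the function names, the `operation_start_time` key, and the dict-based options are all hypothetical. It shows why an unset timer should be omitted rather than forwarded as `None`.

```python
# Hypothetical key names; the real SDK constants may differ.
TIMEOUT_KEYS = ("read_timeout", "connection_timeout", "timeout")
START_KEY = "operation_start_time"


def copy_timeouts(source, target):
    """Copy only the per-call timeout settings that were actually set."""
    for key in TIMEOUT_KEYS + (START_KEY,):
        value = source.get(key)
        if value is not None:
            # Unset timers are skipped, never written as None.
            target[key] = value
    return target


def effective_read_timeout(request_kwargs, policy_default):
    """Assumed request-layer lookup: a missing key falls back to the
    policy default, while an explicit None would disable the timeout."""
    return request_kwargs.get("read_timeout", policy_default)
```

With this shape, `effective_read_timeout(copy_timeouts({"read_timeout": None}, {}), 65)` yields the policy default, whereas forwarding `{"read_timeout": None}` directly would yield `None`, which is the behavior the query-plan call previously had.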

No new timers or defaults are added.
Failover stays fast. The forced-short failover probes (3s account probe, 6s recovery probe) are untouched, so nothing a caller passes through a query can slow a regional failover.
No public API change.
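One way to picture why caller values cannot slow failover is a probe that always builds its own timers. The sketch below is an assumption about the shape of that logic, not the SDK's code: the function and constant names are hypothetical, and which timers the real probes set is not stated in this PR. Only the 3s and 6s budgets come from the description above.

```python
ACCOUNT_PROBE_TIMEOUT_S = 3   # account probe budget, from the description
RECOVERY_PROBE_TIMEOUT_S = 6  # recovery probe budget, from the description
CALLER_TIMEOUT_KEYS = ("read_timeout", "connection_timeout", "timeout")


def probe_request_kwargs(caller_kwargs, probe_timeout_s):
    """Build kwargs for a failover probe, ignoring caller-supplied timers."""
    # Drop anything the caller passed through the query.
    kwargs = {k: v for k, v in caller_kwargs.items()
              if k not in CALLER_TIMEOUT_KEYS}
    # Assumption: the probe forces both socket timers to its own short budget.
    kwargs["read_timeout"] = probe_timeout_s
    kwargs["connection_timeout"] = probe_timeout_s
    return kwargs
```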

Testing:
Unit: the value is picked up and handed along at every step; the /pkranges options formatter keeps the timers; an unset timer is never forced to None; sync and async deadline parity.
Fault-injection integration (sync + async): the values actually reach the wire on the container read, query plan, and /pkranges, both on a cold client and after a partition-split refresh; the account probe keeps its short budget; and a tight timeout= stops the query during setup.
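The "reaches the wire" style of check can be sketched with a recording transport. This is a simplified stand-in, not the tests in this PR: `RecordingTransport`, `run_query_setup`, and the request paths are hypothetical, and the real tests use fault injection against the SDK's own pipeline.

```python
class RecordingTransport:
    """Hypothetical transport that records the timers each request carries."""

    def __init__(self):
        self.calls = []

    def send(self, path, **kwargs):
        self.calls.append(
            (path, kwargs.get("read_timeout"), kwargs.get("connection_timeout"))
        )
        return {"path": path}


def run_query_setup(transport, **timeouts):
    """Hypothetical setup sequence: container read, query plan, /pkranges."""
    # Forward only the timers the caller actually set.
    per_call = {k: v for k, v in timeouts.items() if v is not None}
    for path in ("/colls/c", "/colls/c/queryplan", "/colls/c/pkranges"):
        transport.send(path, **per_call)


transport = RecordingTransport()
run_query_setup(transport, read_timeout=30, connection_timeout=1)
# Every setup call should carry the caller's values, not a default.
setup_timers = {(read, conn) for _, read, conn in transport.calls}
```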

- propagate per-call read_timeout, connection_timeout, and timeout (operation deadline) options across query setup metadata calls (container read, query plan, /pkranges) in sync and async paths
- extend test coverage for timeout propagation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dibahlfi dibahlfi requested a review from a team as a code owner June 16, 2026 17:50
Copilot AI review requested due to automatic review settings June 16, 2026 17:50
@dibahlfi

Member Author

@sdkReviewAgent-2

Copilot AI left a comment

Contributor

Pull request overview

This PR fixes and validates propagation of per-call timeout settings (read_timeout, connection_timeout, and operation timeout/deadline) across the Cosmos query “setup” metadata requests (container read, query plan fetch, and /pkranges) for both sync and async execution paths.

Changes:

  • Adds shared helpers in _base.py to carry/copy per-call timeout settings (and operation start time) into downstream request kwargs/options.
  • Wires these helpers into query execution (query plan), container read, and hybrid-search /pkranges metadata flows (sync + async).
  • Adds unit + live/emulator tests to assert timeout propagation and deadline behavior; updates CHANGELOG entry.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| sdk/cosmos/azure-cosmos/azure/cosmos/_base.py | Adds helper utilities/constants to propagate per-call timeouts and operation start time into options/kwargs and ensures pk-range options retain them. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_constants.py | Introduces Kwargs.CONNECTION_TIMEOUT constant and updates inline documentation/comments for kwargs constants. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_cosmos_client_connection.py | Uses the new helper to copy per-call timeouts/start time from options into request kwargs for query feed calls. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_cosmos_client_connection_async.py | Async equivalent: copies per-call timeouts/start time from options into request kwargs for query feed calls. |
| sdk/cosmos/azure-cosmos/azure/cosmos/container.py | Uses shared helper to forward per-call timeouts/start time into container read requests built from options. |
| sdk/cosmos/azure-cosmos/azure/cosmos/aio/_container.py | Async equivalent: forwards per-call timeouts/start time into container read requests. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/execution_dispatcher.py | Updates query-plan dispatch to forward additional per-call timeout/deadline settings via kwargs. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/execution_dispatcher.py | Async equivalent: updates query-plan dispatch timeout/deadline forwarding. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/hybrid_search_aggregator.py | Ensures hybrid-search all-ranges /pkranges path carries per-call timeout/deadline options. |
| sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/aio/hybrid_search_aggregator.py | Async equivalent: carries per-call timeout/deadline options for hybrid-search all-ranges /pkranges. |
| sdk/cosmos/azure-cosmos/tests/test_timeout_propagation_unit.py | Adds mock-light unit tests validating timeout/deadline propagation behavior through key helpers and dispatchers. |
| sdk/cosmos/azure-cosmos/tests/test_metadata_timeout_propagation.py | Adds end-to-end fault-injection tests to confirm timeouts reach metadata setup calls and operation deadlines halt setup. |
| sdk/cosmos/azure-cosmos/tests/test_container_rid_header_unit.py | Extends pk-range option sanitization tests for timeouts and improves mocks to seed required internal sidecars. |
| sdk/cosmos/azure-cosmos/CHANGELOG.md | Documents the timeout propagation bug fix under “Bugs Fixed”. |

Comment thread sdk/cosmos/azure-cosmos/tests/test_timeout_propagation_unit.py Outdated
Comment thread sdk/cosmos/azure-cosmos/azure/cosmos/_execution_context/execution_dispatcher.py Outdated
@dibahlfi

Member Author

@sdkReviewAgent-2

@dibahlfi

Member Author

/azp run python - cosmos - tests

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
