Dipowell/node readiness timing #1208
Pull request overview
This PR adds independent node-readiness timing to the Python AKS CRUD flow by extending the Kubernetes wait helper to optionally return a readiness timestamp, then running the ARM poller and the K8s readiness wait concurrently so both timings can be recorded for regression analysis.
Changes:
- Extend `KubernetesClient.wait_for_nodes_ready()` with `return_timestamp` to optionally return `(ready_nodes, ready_timestamp)`.
- Add concurrent ARM + K8s readiness execution in `AKSClient` and record `node_readiness_time` / `command_execution_time` metadata.
- Update AKS and Kubernetes client unit tests to cover the new return shape and timing metadata recording.
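
The optional return shape described in the first bullet can be illustrated with a minimal sketch. This is not the code from `kubernetes_client.py`: the `list_ready_nodes` callable, the timeout and polling parameters, and the use of `time.time()` for the timestamp are all assumptions made so the example runs without a cluster.

```python
import time
from typing import Callable, List, Tuple, Union


def wait_for_nodes_ready(
    list_ready_nodes: Callable[[], List[str]],  # hypothetical stand-in for the K8s API query
    node_count: int,
    timeout_seconds: float = 5.0,   # assumed default, not taken from the PR
    poll_interval: float = 0.01,    # assumed default, not taken from the PR
    return_timestamp: bool = False,
) -> Union[List[str], Tuple[List[str], float]]:
    """Poll until at least node_count nodes are Ready."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        ready_nodes = list_ready_nodes()
        if len(ready_nodes) >= node_count:
            ready_timestamp = time.time()  # wall-clock moment readiness was observed
            # Callers that do not ask for the timestamp keep the old return shape.
            return (ready_nodes, ready_timestamp) if return_timestamp else ready_nodes
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {len(ready_nodes)} of {node_count} nodes became Ready")
        time.sleep(poll_interval)


# Usage with a fake query that reports the second node on the third poll.
polls = {"count": 0}


def fake_query() -> List[str]:
    polls["count"] += 1
    return ["node-0", "node-1"] if polls["count"] >= 3 else ["node-0"]


nodes_only = wait_for_nodes_ready(lambda: ["node-0"], node_count=1)
nodes, ready_at = wait_for_nodes_ready(fake_query, node_count=2, return_timestamp=True)
print(nodes_only, nodes, ready_at)
```

Defaulting `return_timestamp` to `False` keeps existing callers unchanged, which is presumably why the PR makes the tuple return opt-in.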
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| modules/python/clients/kubernetes_client.py | Adds return_timestamp option so callers can capture when nodes became Ready. |
| modules/python/clients/aks_client.py | Runs ARM and readiness concurrently and stores separate timing metadata. |
| modules/python/tests/clients/test_kubernetes_client.py | Adds a unit test validating the timestamp-returning behavior. |
| modules/python/tests/clients/test_aks_client.py | Updates tests to expect timestamp return and to assert timing metadata is recorded. |

        return arm_response, ready_nodes, node_readiness_time, command_execution_time

    return asyncio.run(_run())

`asyncio.run()` cannot be called while another event loop is already running in the same thread. This implementation therefore embeds an implicit requirement that the caller must not already be inside an event loop. For example, if a client wants to invoke `create_node_pool` asynchronously at a higher level, the `asyncio.run()` call here will fail because it is called from within a running event loop. Ideally, if we want to use async, we would propagate the async chain all the way up, but that is a much bigger refactoring.
How about just spinning up a small `ThreadPoolExecutor`, something like this:
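
    from concurrent.futures import ThreadPoolExecutor, wait

    def _run_concurrent_arm_and_readiness(self, ...):
        # Run the ARM poller and the K8s readiness wait in two worker threads.
        with ThreadPoolExecutor(max_workers=2) as ex:
            arm_future = ex.submit(...)
            k8s_future = ex.submit(...)
            wait([arm_future, k8s_future])  # block until both have finished
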
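
To make the suggestion above concrete, here is a hedged, self-contained sketch of the thread-pool approach. The function name `run_concurrently_with_timing`, the `arm_call` and `readiness_call` callables, and the `slow` helper are hypothetical stand-ins for the real poller and readiness wait. The sketch also captures the start time inside the helper and calls it from within a running event loop, the situation in which a nested `asyncio.run()` would raise `RuntimeError`.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Any, Callable, Tuple


def run_concurrently_with_timing(
    arm_call: Callable[[], Any],        # hypothetical: e.g. a blocking poller.result()
    readiness_call: Callable[[], Any],  # hypothetical: e.g. a blocking readiness wait
) -> Tuple[Any, Any, float, float]:
    """Run both blocking calls in worker threads and time each from a shared start."""
    start_time = time.monotonic()  # captured here rather than passed in by the caller

    def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
        result = fn()
        return result, time.monotonic() - start_time

    with ThreadPoolExecutor(max_workers=2) as executor:
        arm_future = executor.submit(timed, arm_call)
        k8s_future = executor.submit(timed, readiness_call)
        wait([arm_future, k8s_future])  # block until both have finished

    # .result() re-raises any exception that occurred in the worker thread.
    arm_response, command_execution_time = arm_future.result()
    ready_nodes, node_readiness_time = k8s_future.result()
    return arm_response, ready_nodes, node_readiness_time, command_execution_time


def slow(value: Any, seconds: float) -> Any:
    """Simulate a blocking operation that takes the given number of seconds."""
    time.sleep(seconds)
    return value


async def caller_inside_event_loop() -> Tuple[Any, Any, float, float]:
    # A plain blocking call is legal inside a running loop (it blocks the loop
    # while it runs); a nested asyncio.run() here would raise RuntimeError.
    return run_concurrently_with_timing(
        lambda: slow("arm-done", 0.02),
        lambda: slow(["node-0"], 0.2),
    )


arm, nodes, readiness_s, command_s = asyncio.run(caller_inside_event_loop())
print(arm, nodes, round(readiness_s, 2), round(command_s, 2))
```

The trade-off is that the helper stays synchronous, so an async caller that wants to avoid blocking its loop would still need to offload the call itself.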

        return OperationContext

    def _run_concurrent_arm_and_readiness(

I think `_instrument_nodepool_provisioning` is a better name for the method.

    def _run_concurrent_arm_and_readiness(
        self,
        poller,

I see that only `begin_create_or_update` is passed in, so there is no need for a separate `poller` param.

        poller,
        node_count: int,
        label_selector: str,
        start_time: float

I think `start_time` can be captured within this function; there is no need to pass it in from outside.
Summary
Adds `node_readiness_time` as a separate metric in the open-source CRUD module to match internal repo behavior. The internal repo captures how long K8s nodes take to become Ready independently of when the ARM API completes. The open-source repo was missing this; it recorded only the combined duration.

The Azure API reports "done" when the control plane finishes, but nodes might not be schedulable yet. Capturing both timestamps separately enables regression analysis:
- `command_execution_time > node_readiness_time` → ARM layer is the bottleneck
- `node_readiness_time > command_execution_time` → K8s layer is the bottleneck

Changes
kubernetes_client.py
- Adds a `return_timestamp=False` parameter to `wait_for_nodes_ready()`
- When `True`, returns a `(ready_nodes, timestamp)` tuple instead of just `ready_nodes`

aks_client.py
- Adds a `_run_concurrent_arm_and_readiness()` helper using `asyncio.gather()`
- Updates `create_node_pool()`, `scale_node_pool()`, and `_progressive_scale()` to capture concurrent timing

Timing metadata stored via `op.add_metadata()`:
- `node_readiness_time`: seconds from start until K8s nodes were Ready
- `command_execution_time`: seconds from start until ARM operation completed

Implementation notes
- Uses `asyncio.to_thread()` to run ARM polling and K8s readiness checks concurrently
- `return_exceptions=True` ensures both tasks complete even if one fails, enabling partial diagnostics
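
The implementation notes can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's helper: `gather_with_timing`, `failing_arm`, `ready_nodes_call`, and `bottleneck` are hypothetical names, and the blocking calls are simulated with `time.sleep()`. It requires Python 3.9 or later for `asyncio.to_thread()`.

```python
import asyncio
import time
from typing import Any, Callable, List


async def gather_with_timing(
    arm_call: Callable[[], Any],        # hypothetical blocking ARM wait
    readiness_call: Callable[[], Any],  # hypothetical blocking K8s readiness wait
) -> List[Any]:
    """Run both blocking calls in threads; each outcome is (result, seconds) or an exception."""
    start_time = time.monotonic()

    async def timed(fn: Callable[[], Any]) -> Any:
        result = await asyncio.to_thread(fn)  # run the blocking call off the event loop
        return result, time.monotonic() - start_time

    # return_exceptions=True: a failure in one task is returned as a value, so
    # the other task still runs to completion and its timing is preserved.
    return await asyncio.gather(timed(arm_call), timed(readiness_call), return_exceptions=True)


def bottleneck(command_execution_time: float, node_readiness_time: float) -> str:
    """Apply the comparison from the summary to a pair of timings."""
    if command_execution_time > node_readiness_time:
        return "ARM"
    if node_readiness_time > command_execution_time:
        return "K8s"
    return "tie"


def failing_arm() -> str:
    time.sleep(0.02)
    raise RuntimeError("ARM operation failed")


def ready_nodes_call() -> List[str]:
    time.sleep(0.05)
    return ["node-0", "node-1"]


arm_outcome, k8s_outcome = asyncio.run(gather_with_timing(failing_arm, ready_nodes_call))
if isinstance(arm_outcome, Exception):
    print("ARM failed:", arm_outcome)
if not isinstance(k8s_outcome, Exception):
    nodes, readiness_seconds = k8s_outcome
    print("nodes ready:", nodes)
print(bottleneck(command_execution_time=3.0, node_readiness_time=1.0))
```

Because exceptions come back as ordinary values, the caller has to check each outcome with `isinstance` before unpacking it, which is the price of getting partial diagnostics.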