Dipowell/node readiness timing#1208

Open: diamondpowell wants to merge 14 commits into main from dipowell/node-readiness-timing

@diamondpowell diamondpowell commented Jun 2, 2026

Contributor

Summary

Adds node_readiness_time as a separate metric in the open-source CRUD module to match internal repo behavior. The internal repo captures how long K8s nodes take to become Ready independently of when the ARM API call completes. The open-source repo was missing this: it only recorded a combined duration.

Azure API says "done" when the control plane finishes, but nodes might not be schedulable yet. Capturing both timestamps separately enables regression analysis:

  • command_execution_time > node_readiness_time → ARM layer is the bottleneck
  • node_readiness_time > command_execution_time → K8s layer is the bottleneck
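
The comparison above can be sketched as a small helper. This is illustrative only: the function name classify_bottleneck and the returned labels are hypothetical and are not part of this PR.

```python
def classify_bottleneck(command_execution_time: float, node_readiness_time: float) -> str:
    """Name the slower layer for one operation (hypothetical helper, not in the PR).

    Both arguments are seconds measured from the same start time.
    """
    if command_execution_time > node_readiness_time:
        return "arm"   # the ARM operation finished after the nodes were Ready
    if node_readiness_time > command_execution_time:
        return "k8s"   # the nodes became Ready after the ARM operation finished
    return "tie"       # both durations are equal


if __name__ == "__main__":
    print(classify_bottleneck(120.0, 95.0))
    print(classify_bottleneck(80.0, 140.0))
```

Because both durations share one start time, comparing them directly is enough; no subtraction of the combined duration is needed.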

Changes

kubernetes_client.py

  • Added return_timestamp=False parameter to wait_for_nodes_ready()
  • When True, returns (ready_nodes, timestamp) tuple instead of just ready_nodes
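
A minimal sketch of the return shape described above, written as a standalone function rather than the real KubernetesClient method. The list_ready_nodes callable, the timeout and poll-interval parameters, and the use of time.time() for the timestamp are all assumptions made for illustration; the actual implementation may differ.

```python
import time
from typing import Callable, List, Tuple, Union


def wait_for_nodes_ready(
    list_ready_nodes: Callable[[], List[str]],  # hypothetical stand-in for the K8s API query
    node_count: int,
    timeout_seconds: float = 600.0,             # assumed default, not taken from the PR
    poll_interval_seconds: float = 1.0,         # assumed default, not taken from the PR
    return_timestamp: bool = False,
) -> Union[List[str], Tuple[List[str], float]]:
    """Poll until node_count nodes are Ready, optionally reporting when that happened."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        ready_nodes = list_ready_nodes()
        if len(ready_nodes) >= node_count:
            # Record the moment readiness was observed (wall clock is an assumption).
            ready_timestamp = time.time()
            if return_timestamp:
                return ready_nodes, ready_timestamp
            return ready_nodes
        if time.monotonic() >= deadline:
            raise TimeoutError(f"fewer than {node_count} nodes became Ready in time")
        time.sleep(poll_interval_seconds)
```

Keeping the default at False means existing callers that expect only ready_nodes are unaffected.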

aks_client.py

  • Added _run_concurrent_arm_and_readiness() helper using asyncio.gather()
  • Updated create_node_pool(), scale_node_pool(), and _progressive_scale() to capture concurrent timing

Timing metadata stored via op.add_metadata():

  • node_readiness_time: seconds from start until K8s nodes were Ready
  • command_execution_time: seconds from start until ARM operation completed

Implementation notes

  • Uses asyncio.to_thread() to run ARM polling and K8s readiness checks concurrently
  • return_exceptions=True ensures both tasks complete even if one fails, enabling partial diagnostics
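
The concurrency pattern in these notes can be sketched roughly as follows. This is not the PR's code: the function name, the two zero-argument callables (stand-ins for the ARM poller wait and the K8s readiness wait), and the inner timed helper are hypothetical. Only asyncio.run(), asyncio.gather(), asyncio.to_thread(), and return_exceptions=True are taken from the description.

```python
import asyncio
import time
from typing import Any, Callable


def run_concurrent_arm_and_readiness(
    wait_for_arm: Callable[[], Any],    # hypothetical: blocks until the ARM operation completes
    wait_for_nodes: Callable[[], Any],  # hypothetical: blocks until the K8s nodes are Ready
):
    """Run both blocking waits in worker threads and time each one separately."""

    async def _run():
        start = time.monotonic()

        def timed(fn: Callable[[], Any]):
            # Runs in a worker thread; returns the result plus seconds since start.
            result = fn()
            return result, time.monotonic() - start

        # With return_exceptions=True, a failure in one task is returned as a value
        # instead of propagating, so the other task's timing is still collected.
        return await asyncio.gather(
            asyncio.to_thread(timed, wait_for_arm),
            asyncio.to_thread(timed, wait_for_nodes),
            return_exceptions=True,
        )

    arm_outcome, k8s_outcome = asyncio.run(_run())
    return arm_outcome, k8s_outcome
```

Each outcome is either a (result, elapsed_seconds) tuple or the exception raised by that task, so a caller can record whichever timing succeeded before deciding how to report the failure.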

@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch from 50bdf6b to 7ff146e Compare June 2, 2026 20:54
@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch 8 times, most recently from 6f41146 to 0b1cf21 Compare June 9, 2026 23:53
@diamondpowell diamondpowell force-pushed the dipowell/node-readiness-timing branch from 0b1cf21 to 4603b7b Compare June 10, 2026 00:31
@diamondpowell diamondpowell marked this pull request as ready for review June 10, 2026 00:40
Copilot AI review requested due to automatic review settings June 10, 2026 00:40

Copilot AI left a comment

Contributor

Pull request overview

This PR adds independent node-readiness timing to the Python AKS CRUD flow by extending the Kubernetes wait helper to optionally return a readiness timestamp, then running the ARM poller and the K8s readiness wait concurrently so both timings can be recorded for regression analysis.

Changes:

  • Extend KubernetesClient.wait_for_nodes_ready() with return_timestamp to optionally return (ready_nodes, ready_timestamp).
  • Add concurrent ARM + K8s readiness execution in AKSClient and record node_readiness_time / command_execution_time metadata.
  • Update AKS and Kubernetes client unit tests to cover the new return shape and timing metadata recording.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| modules/python/clients/kubernetes_client.py | Adds return_timestamp option so callers can capture when nodes became Ready. |
| modules/python/clients/aks_client.py | Runs ARM and readiness concurrently and stores separate timing metadata. |
| modules/python/tests/clients/test_kubernetes_client.py | Adds a unit test validating the timestamp-returning behavior. |
| modules/python/tests/clients/test_aks_client.py | Updates tests to expect timestamp return and to assert timing metadata is recorded. |

Comment thread modules/python/clients/aks_client.py
@diamondpowell diamondpowell reopened this Jun 10, 2026

return arm_response, ready_nodes, node_readiness_time, command_execution_time

return asyncio.run(_run())

Collaborator

asyncio.run() needs to start its own event loop, so this implementation embeds an implicit requirement that the caller is not already running inside one. For example, if a client wants to invoke create_node_pool asynchronously at a higher level, the asyncio.run() call here will fail because it is invoked from within a running event loop. Ideally, if we want to use async, we would propagate the async chain all the way up, but that is a much bigger refactoring.

How about just spinning up a small ThreadPoolExecutor, something like this:

def _run_concurrent_arm_and_readiness(self, ...):
    with ThreadPoolExecutor(max_workers=2) as ex:
        arm_future = ex.submit(...)
        k8s_future = ex.submit(...)
        wait([arm_future, k8s_future])
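
A fuller, runnable sketch of this suggestion, under stated assumptions: the function name run_in_threads, the two zero-argument callables, and the timed helper are hypothetical and only illustrate the shape. It also captures the start time inside the function and shows that the thread-based version can be called from code already running in an event loop, whereas a nested asyncio.run() raises RuntimeError.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor, wait
from typing import Any, Callable


def run_in_threads(wait_for_arm: Callable[[], Any], wait_for_nodes: Callable[[], Any]):
    """Run both blocking waits on a two-worker pool; no event loop is involved."""
    start_time = time.monotonic()  # captured here rather than passed in by the caller

    def timed(fn: Callable[[], Any]):
        result = fn()
        return result, time.monotonic() - start_time

    with ThreadPoolExecutor(max_workers=2) as ex:
        arm_future = ex.submit(timed, wait_for_arm)
        k8s_future = ex.submit(timed, wait_for_nodes)
        # Block until both futures finish; a failure stays stored on its future.
        wait([arm_future, k8s_future])
    return arm_future, k8s_future


async def caller_already_in_event_loop():
    # The thread-based helper works here, although it blocks the loop while waiting.
    arm_future, k8s_future = run_in_threads(lambda: "arm-ok", lambda: ["node-1"])

    # A nested asyncio.run() is rejected because a loop is already running.
    coro = asyncio.sleep(0)
    try:
        asyncio.run(coro)
        nested_run_failed = False
    except RuntimeError:
        nested_run_failed = True
        coro.close()  # avoid a "coroutine was never awaited" warning

    return arm_future.result()[0], k8s_future.result()[0], nested_run_failed
```

Calling future.exception() on each future gives the same partial diagnostics that return_exceptions=True provides with asyncio.gather(): one task failing does not discard the other task's result or timing.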


return OperationContext

def _run_concurrent_arm_and_readiness(

Collaborator

I think _instrument_nodepool_provisioning is a better name for the method


def _run_concurrent_arm_and_readiness(
self,
poller,

Collaborator

I see only begin_create_or_update is passed in, so there is no need for a separate poller param.

poller,
node_count: int,
label_selector: str,
start_time: float

Collaborator

I think start_time can be captured within this function, so it does not need to be passed in from outside.

3 participants