Dipowell/node readiness timing #1216

Closed
diamondpowell wants to merge 10 commits into Azure:main from diamondpowell:dipowell/node-readiness-timing

Conversation

@diamondpowell

Contributor

Summary

Adds node_readiness_time as a separate metric in the open-source CRUD module, matching the internal repo's behavior. The internal repo records how long K8s nodes take to become Ready independently of when the ARM API call completes. The open-source repo recorded only a single combined duration.

The Azure API reports an operation as done when the control plane finishes, but the nodes may not be schedulable yet. Capturing both timestamps separately enables regression analysis:

  • command_execution_time > node_readiness_time → ARM layer is the bottleneck
  • node_readiness_time > command_execution_time → K8s layer is the bottleneck
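
The comparison above can be sketched as a small helper. This is only an illustration: the function name `classify_bottleneck` and its return labels are hypothetical and not part of this PR, and the equal case is an assumption since the description does not cover it.

```python
def classify_bottleneck(command_execution_time: float, node_readiness_time: float) -> str:
    """Label the slower layer from the two timings (both in seconds).

    Hypothetical helper; the return labels are illustrative only.
    """
    if command_execution_time > node_readiness_time:
        return "arm"  # the ARM operation finished after the nodes were Ready
    if node_readiness_time > command_execution_time:
        return "k8s"  # the nodes became Ready after the ARM operation finished
    return "tie"  # assumption: equal timings are not attributed to either layer
```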

Changes

kubernetes_client.py

  • Added return_timestamp=False parameter to wait_for_nodes_ready()
  • When True, returns (ready_nodes, timestamp) tuple instead of just ready_nodes
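
A minimal sketch of how the opt-in return value could behave. Only the `return_timestamp` parameter and the `(ready_nodes, timestamp)` tuple come from this PR; the polling loop, the `list_ready_nodes` callable, and the other parameters are assumptions made for illustration, and the real method in kubernetes_client.py may look different.

```python
import time
from typing import Callable, List, Tuple, Union


def wait_for_nodes_ready(
    list_ready_nodes: Callable[[], List[str]],  # hypothetical: returns names of Ready nodes
    node_count: int,
    timeout_seconds: float = 600.0,
    poll_interval: float = 5.0,
    return_timestamp: bool = False,
) -> Union[List[str], Tuple[List[str], float]]:
    """Poll until node_count nodes are Ready, optionally returning when that happened."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        ready_nodes = list_ready_nodes()
        if len(ready_nodes) >= node_count:
            ready_at = time.time()  # wall-clock moment the readiness condition was observed
            # Defaulting to the bare list keeps existing callers unchanged.
            return (ready_nodes, ready_at) if return_timestamp else ready_nodes
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {len(ready_nodes)}/{node_count} nodes Ready")
        time.sleep(poll_interval)
```

Because the default is False, callers that only need the node list are unaffected, and callers that want timing opt in explicitly.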

aks_client.py

  • Added _run_concurrent_arm_and_readiness() helper using asyncio.gather()
  • Updated create_node_pool(), scale_node_pool(), and _progressive_scale() to capture concurrent timing

Timing metadata is stored via op.add_metadata():

  • node_readiness_time: seconds from start until K8s nodes were Ready
  • command_execution_time: seconds from start until ARM operation completed
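
A rough sketch of the concurrent-timing idea, assuming both the ARM call and the readiness wait are blocking callables. The function below is a stand-in written for illustration, not the PR's `_run_concurrent_arm_and_readiness()`; its signature and return shape are assumptions. The returned `metadata` dict uses the two key names listed above and could then be passed to `op.add_metadata()`.

```python
import asyncio
import time
from typing import Any, Callable, Dict


async def run_concurrent_arm_and_readiness(
    arm_call: Callable[[], Any],        # hypothetical: blocking ARM operation
    readiness_call: Callable[[], Any],  # hypothetical: blocking wait for Ready nodes
) -> Dict[str, Any]:
    """Run both blocking calls concurrently and time each from a shared start."""
    start = time.monotonic()

    async def timed(fn: Callable[[], Any]):
        # asyncio.to_thread moves the blocking call off the event loop.
        result = await asyncio.to_thread(fn)
        return result, time.monotonic() - start

    # return_exceptions=True lets the surviving task finish and report its timing.
    arm_outcome, ready_outcome = await asyncio.gather(
        timed(arm_call), timed(readiness_call), return_exceptions=True
    )

    metadata: Dict[str, float] = {}
    errors: Dict[str, BaseException] = {}
    for key, label, outcome in (
        ("command_execution_time", "arm", arm_outcome),
        ("node_readiness_time", "readiness", ready_outcome),
    ):
        if isinstance(outcome, BaseException):
            errors[label] = outcome  # keep the failure for partial diagnostics
        else:
            metadata[key] = outcome[1]  # elapsed seconds since the shared start
    return {"metadata": metadata, "errors": errors}
```

Measuring both durations from the same start point is what makes the two values directly comparable.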

Implementation notes

  • Uses asyncio.to_thread() to run sync calls concurrently
  • return_exceptions=True ensures both tasks complete even if one fails, enabling partial diagnostics
  • Handles both sync and async calling contexts with ThreadPoolExecutor fallback
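
One way the sync/async handling could work, sketched under assumptions: the helper name `run_coroutine_from_any_context` is hypothetical, and the PR's actual fallback may be structured differently. When no event loop is running, `asyncio.run()` is enough; when one is already running in the current thread, `asyncio.run()` would raise, so the coroutine is run on a fresh loop inside a worker thread.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Awaitable, Callable


def run_coroutine_from_any_context(coro_factory: Callable[[], Awaitable[Any]]) -> Any:
    """Run a coroutine to completion whether or not an event loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Sync context: no loop in this thread, so asyncio.run() is safe.
        return asyncio.run(coro_factory())

    # Async context: start a separate loop in a worker thread instead.
    # Note: .result() blocks the calling loop until the coroutine finishes.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(coro_factory())).result()
```

Taking a factory rather than a coroutine object means the coroutine is created in the thread that actually runs it.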

@diamondpowell diamondpowell deleted the dipowell/node-readiness-timing branch June 10, 2026 21:27