Fail fast on terminal AKS machine failures #1212
Merged
karenychen merged 6 commits on Jun 12, 2026
Pull request overview
This PR improves AKS Machine-mode scale-out observability and failure behavior by failing readiness waits early when the AKS Machine resources reach terminal failure states, while also enforcing the 50-machine BatchPutMachine request limit and tightening recorded operation metadata.
Changes:
- Add Machine API ListMachines polling during node readiness waits to fail fast when expected machines are terminal and failed.
- Enforce BatchPutMachine sizing constraints (≤ 50 machines per request), including worker-based batch sizing validation.
- Record `successful_machines` as a count (not machine names) and add targeted unit tests for the new behaviors.
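The fail-fast behavior described in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `find_failed`, `wait_for_ready`, `TerminalMachineFailure`, the `provisioning_state` key, and the `"Failed"` state value are all hypothetical names chosen for the example, and the two callables stand in for the ListMachines call and the node readiness check.

```python
import time
from typing import Callable, Iterable, List, Mapping, Set

# Assumption: a single terminal failed state named "Failed"; the real
# Machine API may expose different or additional values.
TERMINAL_FAILED_STATES = frozenset({"Failed"})


class TerminalMachineFailure(RuntimeError):
    """Raised when an expected machine reaches a terminal failed state."""


def find_failed(machines: Iterable[Mapping[str, str]], expected: Set[str]) -> List[str]:
    """Return the sorted names of expected machines that are terminal and failed."""
    return sorted(
        m["name"]
        for m in machines
        if m["name"] in expected
        and m.get("provisioning_state") in TERMINAL_FAILED_STATES
    )


def wait_for_ready(
    expected: Set[str],
    list_machines: Callable[[], Iterable[Mapping[str, str]]],
    count_ready_nodes: Callable[[], int],
    timeout_s: float,
    poll_interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Poll until all expected nodes are ready, failing early on terminal failures."""
    deadline = clock() + timeout_s
    while True:
        # Check machine state first so a terminal failure ends the wait
        # immediately instead of running out the full timeout.
        failed = find_failed(list_machines(), expected)
        if failed:
            raise TerminalMachineFailure(f"machines failed terminally: {failed}")
        if count_ready_nodes() >= len(expected):
            return
        if clock() >= deadline:
            raise TimeoutError(f"nodes not ready after {timeout_s}s")
        sleep(poll_interval_s)
```

Injecting `sleep` and `clock` is a design choice for the sketch: it lets a unit test drive the loop without real delays.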
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| modules/python/clients/aks_machine_client.py | Adds ListMachines pagination + terminal failure detection hooked into readiness polling; enforces 50-machine BatchPutMachine sizing; stores successful machine count metadata. |
| modules/python/tests/test_aks_machine_client.py | Updates metadata assertions and adds focused unit coverage for timeout propagation, ListMachines pagination, terminal failure detection, and batch sizing limits. |
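To make the 50-machine BatchPutMachine constraint concrete, here is a small sketch of how requests might be split and validated. It assumes only the limit stated in the overview; `MAX_BATCH_SIZE`, `validate_batch_size`, and `chunk_machines` are hypothetical names, and the client's actual worker-based sizing logic may look different.

```python
from typing import List, Sequence

# The per-request limit stated in the PR overview.
MAX_BATCH_SIZE = 50


def validate_batch_size(batch_size: int) -> int:
    """Reject batch sizes outside the range 1..MAX_BATCH_SIZE."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(
            f"batch size must be between 1 and {MAX_BATCH_SIZE}, got {batch_size}"
        )
    return batch_size


def chunk_machines(names: Sequence[str], batch_size: int = MAX_BATCH_SIZE) -> List[List[str]]:
    """Split machine names into consecutive batches of at most batch_size."""
    validate_batch_size(batch_size)
    return [
        list(names[start:start + batch_size])
        for start in range(0, len(names), batch_size)
    ]
```

With 120 machine names and the default size, this yields batches of 50, 50, and 20, so no single request exceeds the limit.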
liyu-ma reviewed on Jun 9, 2026
liyu-ma reviewed on Jun 10, 2026
liyu-ma reviewed on Jun 12, 2026
liyu-ma approved these changes on Jun 12, 2026
PabloTriv pushed a commit that referenced this pull request on Jun 12, 2026
## Summary
- add ListMachines polling during machine readiness waits to fail early when expected machines terminally fail
- keep BatchPutMachine request sizing within the 50-machine service limit
- record successful_machines as a count instead of uploading machine names
- add focused unit coverage for timeout propagation, ListMachines pagination, terminal failure detection, and batch sizing

## Validation
- `python3 -m pytest modules/python/tests/test_aks_machine_client.py modules/python/tests/test_machine_crud.py modules/python/tests/test_crud_main.py -q`
- `python3 -m py_compile modules/python/clients/aks_machine_client.py modules/python/tests/test_aks_machine_client.py`
- `PYTHONPATH=modules/python python3 -m pylint modules/python/tests/test_aks_machine_client.py modules/python/clients/aks_machine_client.py modules/python/crud/azure/machine_crud.py modules/python/crud/main.py`
- `git diff --check`
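The ListMachines pagination named in the summary could look something like the sketch below. It assumes the response follows the common Azure Resource Manager list shape, with items under `value` and an optional `nextLink` to the next page; `iter_machines` and `fetch_page` are hypothetical names, and the real client may page differently.

```python
from typing import Any, Callable, Iterator, Mapping, Optional


def iter_machines(
    fetch_page: Callable[[Optional[str]], Mapping[str, Any]],
) -> Iterator[Mapping[str, Any]]:
    """Yield machines from every page.

    fetch_page(None) is assumed to return the first page; fetch_page(link)
    returns the page that a previous response's "nextLink" pointed to.
    """
    link: Optional[str] = None
    while True:
        page = fetch_page(link)
        yield from page.get("value", [])
        link = page.get("nextLink")
        if not link:
            # No further pages: stop instead of re-fetching the first page.
            return
```

A unit test can pass a `fetch_page` backed by a dictionary of canned pages, which keeps pagination coverage independent of any network call.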