add header to NAP runs to change provisioning mode#1215
Draft
PabloTriv wants to merge 18 commits into
Draft
Conversation
## Summary Adds the non-batch scale path to `AKSMachineClient`, building on the scaffolding from #1179. The batch path (`use_batch_api=True`) is intentionally still stubbed with `NotImplementedError` and lands in a follow-up PR so this diff stays focused and reviewable. ## What's in this PR **Fully implemented in this PR:** - `scale_machine` (non-batch branch): snapshots baseline ready node count, submits per-machine PUTs via `_scale_machine_individually`, waits once on the agentpool-level `provisioningState`, then waits for P50/P70/P90/P99/P100 node readiness. Raises `RuntimeError` on partial landing, agentpool timeout, or P100 readiness failure. - `_wait_for_machine_node_readiness`: polls nodes labeled `agentpool=<pool>` and computes percentile envelopes against `baseline_count + expected_count` so only NEW nodes count toward thresholds. Never raises; returns an empty-state envelope on exhaustion. - `_get_machine_name_prefix`: `scale<N>k` for multiples of 1000 ≥ 1000, `scale<N>` otherwise. Stable scheme for cross-repo Kusto dashboards. - `_scale_machine_individually`: `ThreadPoolExecutor` dispatch with `max(1, machine_workers)` clamp. - `_create_single_machine`: single Machine PUT, returns `True` on any 2xx, never raises. Per-machine ARM polling intentionally omitted (AKS RP reports `Succeeded` essentially as soon as the agentpool admits the request). **Stubbed in this PR (lands in a follow-up PR):** - `scale_machine` batch branch (`use_batch_api=True`) - `_scale_machine_batch`, `_create_batch_machines`, `_make_batch_request`, `_array_split`, and the `_BATCH_429_*` retry constants ## Other notes - `scale_machine` return type changed from `-> bool` to `-> None` for consistency with `create_machine_agentpool` (only exceptions signal failure; the return value was always dead). - Adds `_MACHINE_API_VERSION = '2025-06-02-preview'` constant. - Preserves every fix from #1179: `self._session` for connection pooling, `_PER_REQUEST_TIMEOUT_CAP=60` timeout split, immediate 4xx raise in `_wait_for_agentpool_provisioning`, immediate-Failed on missing `provisioningState`, `tempfile.TemporaryDirectory()` in test setUp/tearDown. ## Tests - 12 unit tests pass (up from 5 on main). - New tests cover: - Non-batch scale success (metadata enrichment, all required keys present) - Partial-landing raises `RuntimeError` - `baseline_count` snapshot propagation to `_wait_for_machine_node_readiness` - `_get_machine_name_prefix` value tests for small / multiple-of-1000 / non-multiple-of-1000 counts - `_create_single_machine` accepts 200/201/202 and returns `False` (no raise) on 500 - Regression test asserting the batch path still raises `NotImplementedError` - `pylint` rating: 10.00/10 on both files. ## Follow-up PR A subsequent PR will: - Replace the `NotImplementedError` in the `use_batch_api=True` branch with batch dispatch. - Add `_scale_machine_batch`, `_create_batch_machines`, `_make_batch_request`, `_array_split`, and the `_BATCH_429_*` constants. - Add a batch success-path test.
## Summary
Adds StatefulSet workload support to the CRUD benchmarking framework —
the second of three planned workload methods (`deployment`,
`statefulset`, `jobs`). Measures K8s StatefulSet create/verify latency
on AKS node pools.
**Branch cleanup note:** Rebased and squashed for reviewability. All
commits are logically grouped. Also includes a fix for `gpu_profile`
driver setting that caused node pool creation failure on non-GPU pools.
## Changes
`modules/python/crud/workload_templates/statefulset.yml`
- New K8s manifest template with headless Service (`clusterIP: None`)
for stable pod DNS
- Uses `STATEFULSET_REPLICAS` placeholder, configurable node affinity
via label_selector
`modules/python/crud/azure/node_pool_crud.py`
- Add `create_statefulset()` — same loop pattern as `create_deployment`
- Uses `ready` condition (not `available`) since StatefulSets don't
support the `available` condition type
`modules/python/crud/main.py`
- Add `statefulset` subparser with `--node-pool-name`,
`--number-of-statefulsets`, `--replicas`, `--manifest-dir`
- Add `elif command == "statefulset"` routing in
`handle_workload_operations`
`modules/python/clients/kubernetes_client.py`
- Add `_is_statefulset_ready` and `_check_statefulset_condition` for
readiness polling
- Extend `wait_for_condition` to support StatefulSet resource type
`steps/engine/crud/k8s/execute.yml`
- Add `statefulset` script block calling `python3 main.py statefulset`
- StatefulSets use the shared count parameter (same as deployments) to
specify number of StatefulSet instances.
`steps/topology/k8s-crud-gpu/execute-crud.yml`
- Wire `number_of_statefulsets` through to engine step
`modules/python/clients/aks_client.py`
- Fix: set `gpu_profile` driver to `"None"` for non-GPU node pools (was
incorrectly set to `"Install"`, causing creation failures)
### Changes since initial PR
`modules/python/crud/azure/node_pool_crud.py`
- Add `WORKLOAD_CONFIG` dict — centralizes workload-specific settings
(template path, resource kind, replicas placeholder, condition type) for
deployment/statefulset
- Extract `_apply_workload()` unified helper — single implementation
handles both deployment and statefulset (template loading, placeholder
replacement, apply, verify)
- Extract `_create_workloads()` unified helper — orchestration loop with
error handling, calls `_apply_workload()` for each instance
- `create_deployment()` and `create_statefulset()` are now thin wrappers
that call `_create_workloads()` with the workload type
- Reorder methods: public API first (`create_deployment`,
`create_statefulset`), then helpers (`_create_workloads`,
`_apply_workload`)
- Pod labels now include workload type to prevent selector collision:
`nginx-container-deployment-1` vs `nginx-container-statefulset-1`
`modules/python/crud/main.py`
- Docstring updated to list supported workload types: `(deployment,
statefulset)`
- Uses `workload_common_parser` for shared args (`--count`,
`--replicas`, `--manifest-dir`, `--label-selector`)
`steps/engine/crud/k8s/execute.yml`
- Workload steps gated to Azure-only: `${{ if eq(parameters.cloud,
'azure') }}:` — AWS `node_pool_crud.py` doesn't have workload methods
yet
- `--manifest-dir` passed conditionally: `${MANIFEST_DIR:+--manifest-dir
"$MANIFEST_DIR"}` — avoids passing empty string when not set
## Tests
**`test_azure_node_pool_crud.py`:**
- `test_create_statefulset_success` — happy path
- `test_create_statefulset_failure` — all fail to become ready
- `test_create_statefulset_no_client` — returns early when k8s client
unavailable
- `test_create_statefulset_partial_success` — continues on failures,
returns False
**`test_kubernetes_client.py`:**
- `test_wait_for_condition_statefulset_success`
- `test_wait_for_condition_statefulset_timeout`
- `test_wait_for_condition_statefulset_not_found`
---------
Co-authored-by: Diamond Powell <32712461+engineeredcurlz@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Praneel Seth <t-pseth@microsoft.com>
**Add cluster-autoscaler benchmark for `VirtualMachines` agent pool** Adds a VMs-agent-pool variant of the existing AKS cluster-autoscaler benchmark, mirroring the VMSS variant's small/medium/large scale stages. **Verification** Validated end-to-end on the New Pipeline Test in ADO — all three stages (small, medium, large) provision the cluster, scale up to target, and scale down successfully. https://dev.azure.com/akstelescope/telescope/_build/results?buildId=68723&view=results --------- Co-authored-by: reneeli@microsoft.com <reneeli>
## Summary - Add `@LeonardCareer` (Leonard Du) to `CODEOWNERS` — global `*` line and the four perf-eval path entries — mirroring existing team members' coverage. Leonard recently joined the team. - Add `LeonardCareer` to `.github/dependabot.yml` reviewers so Terraform dependency PRs are routed alongside the rest of the team. ## Test plan - [ ] CODEOWNERS syntax validated by GitHub (no parse warnings on PR page)
Add Job workload support to CRUD benchmarking framework ## Summary Adds Job workload support to the CRUD benchmarking framework - the third and final of three planned workload methods (`deployment`, `statefulset`, `job`). Measures K8s Job create/verify latency on AKS node pools. ## Changes `modules/python/crud/workload_templates/job.yml` - New K8s Job manifest template using nginx config validation for quick completion - Uses `JOB_COMPLETIONS` placeholder, configurable node affinity via label_selector `modules/python/crud/azure/node_pool_crud.py` - Add `job` entry to `WORKLOAD_CONFIG` with template path, completions key, `complete` condition, and `verify_pods_ready=False` (jobs terminate after completion) - Add `create_job()` using unified `_create_workloads` helper `modules/python/crud/main.py` - Add `job` subparser with `--completions` argument (distinct from `--replicas`) - Add `elif command == "job"` routing in `handle_workload_operations` `modules/python/clients/kubernetes_client.py` - Add `job`/`jobs` to `valid_conditions` with `['complete', 'failed']` - Add `_check_job_condition` and `_is_job_condition_met` for job status polling - Extend `wait_for_condition` to support Job resource type `steps/engine/crud/k8s/execute.yml` - Add `job` script block (Azure-only, consistent with deployment/statefulset) `pipelines/perf-eval/GPU Benchmark/k8s-gpu-cluster-crud.yml` - Add `completions: 1` to matrix entries ## Tests `test_azure_node_pool_crud.py`: - `test_create_job_success` - happy path - `test_create_job_failure` - all fail to complete - `test_create_job_no_client` - returns early when k8s client unavailable - `test_create_job_partial_success` - continues on failures, returns False `test_kubernetes_client.py`: - `test_wait_for_condition_job_complete_success` - `test_wait_for_condition_job_failed` - `test_wait_for_condition_job_timeout` - `test_wait_for_condition_job_not_found`
## Summary
- The `scale` command was missing `--enable-managed-gpu`, causing
`args.enable_managed_gpu` to default to `False` during scale-up
- This made the guard in `main.py` install the NVIDIA GPU device plugin
even for fully managed GPU node pools (where AKS manages it via systemd)
- Added `${ENABLE_MANAGED_GPU:+--enable-managed-gpu}` to the scale-up
invocation only (scale-down doesn't add nodes so the install there is
harmless/pointless regardless)
## Test plan
- [ ] Run fully managed GPU node pool CRUD scenario and confirm device
plugin install is skipped during scale-up
- [ ] Run non-managed GPU node pool CRUD scenario and confirm device
plugin install still runs
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary - add ListMachines polling during machine readiness waits to fail early when expected machines terminally fail - keep BatchPutMachine request sizing within the 50-machine service limit - record successful_machines as a count instead of uploading machine names - add focused unit coverage for timeout propagation, ListMachines pagination, terminal failure detection, and batch sizing ## Validation - python3 -m pytest modules/python/tests/test_aks_machine_client.py modules/python/tests/test_machine_crud.py modules/python/tests/test_crud_main.py -q - python3 -m py_compile modules/python/clients/aks_machine_client.py modules/python/tests/test_aks_machine_client.py - PYTHONPATH=modules/python python3 -m pylint modules/python/tests/test_aks_machine_client.py modules/python/clients/aks_machine_client.py modules/python/crud/azure/machine_crud.py modules/python/crud/main.py - git diff --check
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds header and new daily scheduled runs to use a header that controls the provisioning mode of NAP. This is useful for A/B testing.