Skip to content

add header to NAP runs to change provisioning mode#1215

Draft
PabloTriv wants to merge 18 commits into
mainfrom
pablotrivino/addKarpenterMachineAPIConfig
Draft

add header to NAP runs to change provisioning mode#1215
PabloTriv wants to merge 18 commits into
mainfrom
pablotrivino/addKarpenterMachineAPIConfig

Conversation

@PabloTriv

Copy link
Copy Markdown
Contributor

Adds header and new daily scheduled runs to use a header that controls the provisioning mode of NAP. This is useful for A/B testing.

Pablo Trivino and others added 18 commits May 18, 2026 16:33
## Summary

Adds the non-batch scale path to `AKSMachineClient`, building on the
scaffolding from #1179. The batch path (`use_batch_api=True`) is
intentionally still stubbed with `NotImplementedError` and lands in a
follow-up PR so this diff stays focused and reviewable.

## What's in this PR

**Fully implemented in this PR:**
- `scale_machine` (non-batch branch): snapshots baseline ready node
count, submits per-machine PUTs via `_scale_machine_individually`, waits
once on the agentpool-level `provisioningState`, then waits for
P50/P70/P90/P99/P100 node readiness. Raises `RuntimeError` on partial
landing, agentpool timeout, or P100 readiness failure.
- `_wait_for_machine_node_readiness`: polls nodes labeled
`agentpool=<pool>` and computes percentile envelopes against
`baseline_count + expected_count` so only NEW nodes count toward
thresholds. Never raises; returns an empty-state envelope on exhaustion.
- `_get_machine_name_prefix`: `scale<N>k` for multiples of 1000 ≥ 1000,
`scale<N>` otherwise. Stable scheme for cross-repo Kusto dashboards.
- `_scale_machine_individually`: `ThreadPoolExecutor` dispatch with
`max(1, machine_workers)` clamp.
- `_create_single_machine`: single Machine PUT, returns `True` on any
2xx, never raises. Per-machine ARM polling intentionally omitted (AKS RP
reports `Succeeded` essentially as soon as the agentpool admits the
request).

**Stubbed in this PR (lands in a follow-up PR):**
- `scale_machine` batch branch (`use_batch_api=True`)
- `_scale_machine_batch`, `_create_batch_machines`,
`_make_batch_request`, `_array_split`, and the `_BATCH_429_*` retry
constants

## Other notes

- `scale_machine` return type changed from `-> bool` to `-> None` for
consistency with `create_machine_agentpool` (only exceptions signal
failure; the return value was always dead).
- Adds `_MACHINE_API_VERSION = '2025-06-02-preview'` constant.
- Preserves every fix from #1179: `self._session` for connection
pooling, `_PER_REQUEST_TIMEOUT_CAP=60` timeout split, immediate 4xx
raise in `_wait_for_agentpool_provisioning`, immediate-Failed on missing
`provisioningState`, `tempfile.TemporaryDirectory()` in test
setUp/tearDown.

## Tests

- 12 unit tests pass (up from 5 on main).
- New tests cover:
- Non-batch scale success (metadata enrichment, all required keys
present)
  - Partial-landing raises `RuntimeError`
- `baseline_count` snapshot propagation to
`_wait_for_machine_node_readiness`
- `_get_machine_name_prefix` value tests for small / multiple-of-1000 /
non-multiple-of-1000 counts
- `_create_single_machine` accepts 200/201/202 and returns `False` (no
raise) on 500
- Regression test asserting the batch path still raises
`NotImplementedError`
- `pylint` rating: 10.00/10 on both files.

## Follow-up PR

A subsequent PR will:
- Replace the `NotImplementedError` in the `use_batch_api=True` branch
with batch dispatch.
- Add `_scale_machine_batch`, `_create_batch_machines`,
`_make_batch_request`, `_array_split`, and the `_BATCH_429_*` constants.
- Add a batch success-path test.
## Summary

Adds StatefulSet workload support to the CRUD benchmarking framework —
the second of three planned workload methods (`deployment`,
`statefulset`, `jobs`). Measures K8s StatefulSet create/verify latency
on AKS node pools.

**Branch cleanup note:** Rebased and squashed for reviewability. All
commits are logically grouped. Also includes a fix for `gpu_profile`
driver setting that caused node pool creation failure on non-GPU pools.

## Changes

`modules/python/crud/workload_templates/statefulset.yml`
- New K8s manifest template with headless Service (`clusterIP: None`)
for stable pod DNS
- Uses `STATEFULSET_REPLICAS` placeholder, configurable node affinity
via label_selector

`modules/python/crud/azure/node_pool_crud.py`
- Add `create_statefulset()` — same loop pattern as `create_deployment`
- Uses `ready` condition (not `available`) since StatefulSets don't
support the `available` condition type

`modules/python/crud/main.py`
- Add `statefulset` subparser with `--node-pool-name`,
`--number-of-statefulsets`, `--replicas`, `--manifest-dir`
- Add `elif command == "statefulset"` routing in
`handle_workload_operations`

`modules/python/clients/kubernetes_client.py`
- Add `_is_statefulset_ready` and `_check_statefulset_condition` for
readiness polling
- Extend `wait_for_condition` to support StatefulSet resource type

`steps/engine/crud/k8s/execute.yml`
- Add `statefulset` script block calling `python3 main.py statefulset`
- StatefulSets use the shared count parameter (same as deployments) to
specify number of StatefulSet instances.

`steps/topology/k8s-crud-gpu/execute-crud.yml`
- Wire `number_of_statefulsets` through to engine step

`modules/python/clients/aks_client.py`
- Fix: set `gpu_profile` driver to `"None"` for non-GPU node pools (was
incorrectly set to `"Install"`, causing creation failures)

### Changes since initial PR

`modules/python/crud/azure/node_pool_crud.py`
- Add `WORKLOAD_CONFIG` dict — centralizes workload-specific settings
(template path, resource kind, replicas placeholder, condition type) for
deployment/statefulset
- Extract `_apply_workload()` unified helper — single implementation
handles both deployment and statefulset (template loading, placeholder
replacement, apply, verify)
- Extract `_create_workloads()` unified helper — orchestration loop with
error handling, calls `_apply_workload()` for each instance
- `create_deployment()` and `create_statefulset()` are now thin wrappers
that call `_create_workloads()` with the workload type
- Reorder methods: public API first (`create_deployment`,
`create_statefulset`), then helpers (`_create_workloads`,
`_apply_workload`)
- Pod labels now include workload type to prevent selector collision:
`nginx-container-deployment-1` vs `nginx-container-statefulset-1`

`modules/python/crud/main.py`
- Docstring updated to list supported workload types: `(deployment,
statefulset)`
- Uses `workload_common_parser` for shared args (`--count`,
`--replicas`, `--manifest-dir`, `--label-selector`)

`steps/engine/crud/k8s/execute.yml`
- Workload steps gated to Azure-only: `${{ if eq(parameters.cloud,
'azure') }}:` — AWS `node_pool_crud.py` doesn't have workload methods
yet
- `--manifest-dir` passed conditionally: `${MANIFEST_DIR:+--manifest-dir
"$MANIFEST_DIR"}` — avoids passing empty string when not set

## Tests

**`test_azure_node_pool_crud.py`:**
- `test_create_statefulset_success` — happy path
- `test_create_statefulset_failure` — all fail to become ready
- `test_create_statefulset_no_client` — returns early when k8s client
unavailable
- `test_create_statefulset_partial_success` — continues on failures,
returns False

**`test_kubernetes_client.py`:**
- `test_wait_for_condition_statefulset_success`
- `test_wait_for_condition_statefulset_timeout`
- `test_wait_for_condition_statefulset_not_found`

---------

Co-authored-by: Diamond Powell <32712461+engineeredcurlz@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Praneel Seth <t-pseth@microsoft.com>
**Add cluster-autoscaler benchmark for `VirtualMachines` agent pool**
Adds a VMs-agent-pool variant of the existing AKS cluster-autoscaler
benchmark, mirroring the VMSS variant's small/medium/large scale stages.


**Verification**
Validated end-to-end on the New Pipeline Test in ADO — all three stages
(small, medium, large) provision the cluster, scale up to target, and
scale down successfully.
https://dev.azure.com/akstelescope/telescope/_build/results?buildId=68723&view=results

---------

Co-authored-by: reneeli@microsoft.com <reneeli>
## Summary
- Add `@LeonardCareer` (Leonard Du) to `CODEOWNERS` — global `*` line
and the four perf-eval path entries — mirroring existing team members'
coverage. Leonard recently joined the team.
- Add `LeonardCareer` to `.github/dependabot.yml` reviewers so Terraform
dependency PRs are routed alongside the rest of the team.

## Test plan
- [ ] CODEOWNERS syntax validated by GitHub (no parse warnings on PR
page)
Add Job workload support to CRUD benchmarking framework

## Summary

Adds Job workload support to the CRUD benchmarking framework - the third
and final of three planned workload methods (`deployment`,
`statefulset`, `job`). Measures K8s Job create/verify latency on AKS
node pools.

## Changes

`modules/python/crud/workload_templates/job.yml`
- New K8s Job manifest template using nginx config validation for quick
completion
- Uses `JOB_COMPLETIONS` placeholder, configurable node affinity via
label_selector

`modules/python/crud/azure/node_pool_crud.py`
- Add `job` entry to `WORKLOAD_CONFIG` with template path, completions
key, `complete` condition, and `verify_pods_ready=False` (jobs terminate
after completion)
- Add `create_job()` using unified `_create_workloads` helper

`modules/python/crud/main.py`
- Add `job` subparser with `--completions` argument (distinct from
`--replicas`)
- Add `elif command == "job"` routing in `handle_workload_operations`

`modules/python/clients/kubernetes_client.py`
- Add `job`/`jobs` to `valid_conditions` with `['complete', 'failed']`
- Add `_check_job_condition` and `_is_job_condition_met` for job status
polling
- Extend `wait_for_condition` to support Job resource type

`steps/engine/crud/k8s/execute.yml`
- Add `job` script block (Azure-only, consistent with
deployment/statefulset)

`pipelines/perf-eval/GPU Benchmark/k8s-gpu-cluster-crud.yml`
- Add `completions: 1` to matrix entries

## Tests

`test_azure_node_pool_crud.py`:
- `test_create_job_success` - happy path
- `test_create_job_failure` - all fail to complete
- `test_create_job_no_client` - returns early when k8s client
unavailable
- `test_create_job_partial_success` - continues on failures, returns
False

`test_kubernetes_client.py`:
- `test_wait_for_condition_job_complete_success`
- `test_wait_for_condition_job_failed`
- `test_wait_for_condition_job_timeout`
- `test_wait_for_condition_job_not_found`
## Summary
- The `scale` command was missing `--enable-managed-gpu`, causing
`args.enable_managed_gpu` to default to `False` during scale-up
- This made the guard in `main.py` install the NVIDIA GPU device plugin
even for fully managed GPU node pools (where AKS manages it via systemd)
- Added `${ENABLE_MANAGED_GPU:+--enable-managed-gpu}` to the scale-up
invocation only (scale-down doesn't add nodes so the install there is
harmless/pointless regardless)

## Test plan
- [ ] Run fully managed GPU node pool CRUD scenario and confirm device
plugin install is skipped during scale-up
- [ ] Run non-managed GPU node pool CRUD scenario and confirm device
plugin install still runs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
## Summary
- add ListMachines polling during machine readiness waits to fail early
when expected machines terminally fail
- keep BatchPutMachine request sizing within the 50-machine service
limit
- record successful_machines as a count instead of uploading machine
names
- add focused unit coverage for timeout propagation, ListMachines
pagination, terminal failure detection, and batch sizing

## Validation
- python3 -m pytest modules/python/tests/test_aks_machine_client.py
modules/python/tests/test_machine_crud.py
modules/python/tests/test_crud_main.py -q
- python3 -m py_compile modules/python/clients/aks_machine_client.py
modules/python/tests/test_aks_machine_client.py
- PYTHONPATH=modules/python python3 -m pylint
modules/python/tests/test_aks_machine_client.py
modules/python/clients/aks_machine_client.py
modules/python/crud/azure/machine_crud.py modules/python/crud/main.py
- git diff --check
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants