add header to NAP runs to change provisioning mode by PabloTriv · Pull Request #1215 · Azure/telescope

PabloTriv · 2026-06-08T21:29:14Z

Adds header and new daily scheduled runs to use a header that controls the provisioning mode of NAP. This is useful for A/B testing.

…liaeast

## Summary Adds the non-batch scale path to `AKSMachineClient`, building on the scaffolding from #1179. The batch path (`use_batch_api=True`) is intentionally still stubbed with `NotImplementedError` and lands in a follow-up PR so this diff stays focused and reviewable. ## What's in this PR **Fully implemented in this PR:** - `scale_machine` (non-batch branch): snapshots baseline ready node count, submits per-machine PUTs via `_scale_machine_individually`, waits once on the agentpool-level `provisioningState`, then waits for P50/P70/P90/P99/P100 node readiness. Raises `RuntimeError` on partial landing, agentpool timeout, or P100 readiness failure. - `_wait_for_machine_node_readiness`: polls nodes labeled `agentpool=<pool>` and computes percentile envelopes against `baseline_count + expected_count` so only NEW nodes count toward thresholds. Never raises; returns an empty-state envelope on exhaustion. - `_get_machine_name_prefix`: `scale<N>k` for multiples of 1000 ≥ 1000, `scale<N>` otherwise. Stable scheme for cross-repo Kusto dashboards. - `_scale_machine_individually`: `ThreadPoolExecutor` dispatch with `max(1, machine_workers)` clamp. - `_create_single_machine`: single Machine PUT, returns `True` on any 2xx, never raises. Per-machine ARM polling intentionally omitted (AKS RP reports `Succeeded` essentially as soon as the agentpool admits the request). **Stubbed in this PR (lands in a follow-up PR):** - `scale_machine` batch branch (`use_batch_api=True`) - `_scale_machine_batch`, `_create_batch_machines`, `_make_batch_request`, `_array_split`, and the `_BATCH_429_*` retry constants ## Other notes - `scale_machine` return type changed from `-> bool` to `-> None` for consistency with `create_machine_agentpool` (only exceptions signal failure; the return value was always dead). - Adds `_MACHINE_API_VERSION = '2025-06-02-preview'` constant. - Preserves every fix from #1179: `self._session` for connection pooling, `_PER_REQUEST_TIMEOUT_CAP=60` timeout split, immediate 4xx raise in `_wait_for_agentpool_provisioning`, immediate-Failed on missing `provisioningState`, `tempfile.TemporaryDirectory()` in test setUp/tearDown. ## Tests - 12 unit tests pass (up from 5 on main). - New tests cover: - Non-batch scale success (metadata enrichment, all required keys present) - Partial-landing raises `RuntimeError` - `baseline_count` snapshot propagation to `_wait_for_machine_node_readiness` - `_get_machine_name_prefix` value tests for small / multiple-of-1000 / non-multiple-of-1000 counts - `_create_single_machine` accepts 200/201/202 and returns `False` (no raise) on 500 - Regression test asserting the batch path still raises `NotImplementedError` - `pylint` rating: 10.00/10 on both files. ## Follow-up PR A subsequent PR will: - Replace the `NotImplementedError` in the `use_batch_api=True` branch with batch dispatch. - Add `_scale_machine_batch`, `_create_batch_machines`, `_make_batch_request`, `_array_split`, and the `_BATCH_429_*` constants. - Add a batch success-path test.

…1185)

## Summary Adds StatefulSet workload support to the CRUD benchmarking framework — the second of three planned workload methods (`deployment`, `statefulset`, `jobs`). Measures K8s StatefulSet create/verify latency on AKS node pools. **Branch cleanup note:** Rebased and squashed for reviewability. All commits are logically grouped. Also includes a fix for `gpu_profile` driver setting that caused node pool creation failure on non-GPU pools. ## Changes `modules/python/crud/workload_templates/statefulset.yml` - New K8s manifest template with headless Service (`clusterIP: None`) for stable pod DNS - Uses `STATEFULSET_REPLICAS` placeholder, configurable node affinity via label_selector `modules/python/crud/azure/node_pool_crud.py` - Add `create_statefulset()` — same loop pattern as `create_deployment` - Uses `ready` condition (not `available`) since StatefulSets don't support the `available` condition type `modules/python/crud/main.py` - Add `statefulset` subparser with `--node-pool-name`, `--number-of-statefulsets`, `--replicas`, `--manifest-dir` - Add `elif command == "statefulset"` routing in `handle_workload_operations` `modules/python/clients/kubernetes_client.py` - Add `_is_statefulset_ready` and `_check_statefulset_condition` for readiness polling - Extend `wait_for_condition` to support StatefulSet resource type `steps/engine/crud/k8s/execute.yml` - Add `statefulset` script block calling `python3 main.py statefulset` - StatefulSets use the shared count parameter (same as deployments) to specify number of StatefulSet instances. `steps/topology/k8s-crud-gpu/execute-crud.yml` - Wire `number_of_statefulsets` through to engine step `modules/python/clients/aks_client.py` - Fix: set `gpu_profile` driver to `"None"` for non-GPU node pools (was incorrectly set to `"Install"`, causing creation failures) ### Changes since initial PR `modules/python/crud/azure/node_pool_crud.py` - Add `WORKLOAD_CONFIG` dict — centralizes workload-specific settings (template path, resource kind, replicas placeholder, condition type) for deployment/statefulset - Extract `_apply_workload()` unified helper — single implementation handles both deployment and statefulset (template loading, placeholder replacement, apply, verify) - Extract `_create_workloads()` unified helper — orchestration loop with error handling, calls `_apply_workload()` for each instance - `create_deployment()` and `create_statefulset()` are now thin wrappers that call `_create_workloads()` with the workload type - Reorder methods: public API first (`create_deployment`, `create_statefulset`), then helpers (`_create_workloads`, `_apply_workload`) - Pod labels now include workload type to prevent selector collision: `nginx-container-deployment-1` vs `nginx-container-statefulset-1` `modules/python/crud/main.py` - Docstring updated to list supported workload types: `(deployment, statefulset)` - Uses `workload_common_parser` for shared args (`--count`, `--replicas`, `--manifest-dir`, `--label-selector`) `steps/engine/crud/k8s/execute.yml` - Workload steps gated to Azure-only: `${{ if eq(parameters.cloud, 'azure') }}:` — AWS `node_pool_crud.py` doesn't have workload methods yet - `--manifest-dir` passed conditionally: `${MANIFEST_DIR:+--manifest-dir "$MANIFEST_DIR"}` — avoids passing empty string when not set ## Tests **`test_azure_node_pool_crud.py`:** - `test_create_statefulset_success` — happy path - `test_create_statefulset_failure` — all fail to become ready - `test_create_statefulset_no_client` — returns early when k8s client unavailable - `test_create_statefulset_partial_success` — continues on failures, returns False **`test_kubernetes_client.py`:** - `test_wait_for_condition_statefulset_success` - `test_wait_for_condition_statefulset_timeout` - `test_wait_for_condition_statefulset_not_found` --------- Co-authored-by: Diamond Powell <32712461+engineeredcurlz@users.noreply.github.com> Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com> Co-authored-by: Praneel Seth <t-pseth@microsoft.com>

**Add cluster-autoscaler benchmark for `VirtualMachines` agent pool** Adds a VMs-agent-pool variant of the existing AKS cluster-autoscaler benchmark, mirroring the VMSS variant's small/medium/large scale stages. **Verification** Validated end-to-end on the New Pipeline Test in ADO — all three stages (small, medium, large) provision the cluster, scale up to target, and scale down successfully. https://dev.azure.com/akstelescope/telescope/_build/results?buildId=68723&view=results --------- Co-authored-by: reneeli@microsoft.com <reneeli>

## Summary - Add `@LeonardCareer` (Leonard Du) to `CODEOWNERS` — global `*` line and the four perf-eval path entries — mirroring existing team members' coverage. Leonard recently joined the team. - Add `LeonardCareer` to `.github/dependabot.yml` reviewers so Terraform dependency PRs are routed alongside the rest of the team. ## Test plan - [ ] CODEOWNERS syntax validated by GitHub (no parse warnings on PR page)

Add Job workload support to CRUD benchmarking framework ## Summary Adds Job workload support to the CRUD benchmarking framework - the third and final of three planned workload methods (`deployment`, `statefulset`, `job`). Measures K8s Job create/verify latency on AKS node pools. ## Changes `modules/python/crud/workload_templates/job.yml` - New K8s Job manifest template using nginx config validation for quick completion - Uses `JOB_COMPLETIONS` placeholder, configurable node affinity via label_selector `modules/python/crud/azure/node_pool_crud.py` - Add `job` entry to `WORKLOAD_CONFIG` with template path, completions key, `complete` condition, and `verify_pods_ready=False` (jobs terminate after completion) - Add `create_job()` using unified `_create_workloads` helper `modules/python/crud/main.py` - Add `job` subparser with `--completions` argument (distinct from `--replicas`) - Add `elif command == "job"` routing in `handle_workload_operations` `modules/python/clients/kubernetes_client.py` - Add `job`/`jobs` to `valid_conditions` with `['complete', 'failed']` - Add `_check_job_condition` and `_is_job_condition_met` for job status polling - Extend `wait_for_condition` to support Job resource type `steps/engine/crud/k8s/execute.yml` - Add `job` script block (Azure-only, consistent with deployment/statefulset) `pipelines/perf-eval/GPU Benchmark/k8s-gpu-cluster-crud.yml` - Add `completions: 1` to matrix entries ## Tests `test_azure_node_pool_crud.py`: - `test_create_job_success` - happy path - `test_create_job_failure` - all fail to complete - `test_create_job_no_client` - returns early when k8s client unavailable - `test_create_job_partial_success` - continues on failures, returns False `test_kubernetes_client.py`: - `test_wait_for_condition_job_complete_success` - `test_wait_for_condition_job_failed` - `test_wait_for_condition_job_timeout` - `test_wait_for_condition_job_not_found`

## Summary - The `scale` command was missing `--enable-managed-gpu`, causing `args.enable_managed_gpu` to default to `False` during scale-up - This made the guard in `main.py` install the NVIDIA GPU device plugin even for fully managed GPU node pools (where AKS manages it via systemd) - Added `${ENABLE_MANAGED_GPU:+--enable-managed-gpu}` to the scale-up invocation only (scale-down doesn't add nodes so the install there is harmless/pointless regardless) ## Test plan - [ ] Run fully managed GPU node pool CRUD scenario and confirm device plugin install is skipped during scale-up - [ ] Run non-managed GPU node pool CRUD scenario and confirm device plugin install still runs 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

## Summary - add ListMachines polling during machine readiness waits to fail early when expected machines terminally fail - keep BatchPutMachine request sizing within the 50-machine service limit - record successful_machines as a count instead of uploading machine names - add focused unit coverage for timeout propagation, ListMachines pagination, terminal failure detection, and batch sizing ## Validation - python3 -m pytest modules/python/tests/test_aks_machine_client.py modules/python/tests/test_machine_crud.py modules/python/tests/test_crud_main.py -q - python3 -m py_compile modules/python/clients/aks_machine_client.py modules/python/tests/test_aks_machine_client.py - PYTHONPATH=modules/python python3 -m pylint modules/python/tests/test_aks_machine_client.py modules/python/clients/aks_machine_client.py modules/python/crud/azure/machine_crud.py modules/python/crud/main.py - git diff --check

Pablo Trivino and others added 18 commits May 18, 2026 16:33

add args for bootstrapping method

88853a0

add test in staging with reference to new values

a16f775

explicit mappings

7fe8ed7

8 am time

fb955c1

scheduled runs with bootstrappignclient and machineapi in only austra…

81336c7

…liaeast

switch header to correct files

2070166

add 10 node run for testing

70afbfe

feat(crud): add AKSMachineClient batch scale path (BatchPutMachine) (#…

23ba6df

…1185)

feat(crud-machine): AKS Machine API client + CRUD layer + CLI (#1187)

5fe3bc9

Merge branch 'main' into pablotrivino/addKarpenterMachineAPIConfig

5739557

update sku

a33f572

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add header to NAP runs to change provisioning mode#1215

add header to NAP runs to change provisioning mode#1215
PabloTriv wants to merge 18 commits into
mainfrom
pablotrivino/addKarpenterMachineAPIConfig

PabloTriv commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

PabloTriv commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants