ClusterMesh scale: Phase 3 +4 — scale tiers + parallel CL2 fan-out + add all scenarios by skosuri1 · Pull Request #1168 · Azure/telescope

skosuri1 · 2026-05-06T21:00:08Z

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

20-node baseline cluster size (spec line 24). Current clusters are 3 nodes (2 default + 1 prompool) — sized for harness validation, not real scale measurement.
Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, azure-20.tfvars and corresponding pipeline matrix entries. Each tier: validate quota, validate peering count (N·(N-1) at separate-VNet mode — 380 at N=20), tune CL2 timeouts, document breaking points.
Parallel CL2 fan-out: replace sequential per-cluster CL2 with bounded concurrency (default 4). Requires async wrapping of utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 + Prometheus.
etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 = 560 watchers; verify Prom scrape budget holds.
Scaling-curve dashboards from cluster-attributed results (Kusto).
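The bounded-concurrency fan-out described above could be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `fan_out` and `fake_cl2` are hypothetical names, and it assumes the blocking `utils.run_cl2_command` call can be wrapped in a one-argument callable per cluster (its real signature is not shown here). It requires Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time
from typing import Callable, Dict, Iterable


async def fan_out(
    clusters: Iterable[str],
    run_one: Callable[[str], object],
    max_concurrency: int = 4,
) -> Dict[str, object]:
    """Run a blocking per-cluster callable with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(cluster: str) -> object:
        async with semaphore:
            # to_thread moves the blocking call off the event loop thread.
            return await asyncio.to_thread(run_one, cluster)

    names = list(clusters)
    # return_exceptions=True records a failing cluster's exception as its
    # result instead of propagating it out of gather.
    results = await asyncio.gather(
        *(guarded(name) for name in names), return_exceptions=True
    )
    return dict(zip(names, results))


def fake_cl2(cluster: str) -> int:
    """Hypothetical stand-in for a synchronous per-cluster CL2 run."""
    time.sleep(0.01)
    return 0


if __name__ == "__main__":
    outcome = asyncio.run(fan_out([f"mesh-{i}" for i in range(1, 6)], fake_cl2))
    for name, result in outcome.items():
        print(name, result)
```

Each cluster's result is either the callable's return value or the exception it raised, so the caller can decide per cluster whether to warn or abort. The semaphore only bounds how many runs start at once; whether the agent has CPU/RAM headroom for that many concurrent CL2 processes still has to be measured.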
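As a quick check of the tier arithmetic in the list above, the sketch below evaluates the two formulas quoted there: N·(N-1) peerings in separate-VNet mode (one per ordered pair of clusters) and 28 etcd watchers per cluster. The function names are hypothetical, and the 28-watcher figure is taken as given from the description.

```python
def peering_count(n_clusters: int) -> int:
    """Peerings at separate-VNet mode: one per ordered pair of clusters."""
    return n_clusters * (n_clusters - 1)


def total_watchers(n_clusters: int, per_cluster: int = 28) -> int:
    """Total etcd watchers if every cluster contributes per_cluster watchers."""
    return n_clusters * per_cluster


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(f"N={n}: peerings={peering_count(n)}, watchers={total_watchers(n)}")
    # The N=20 row matches the description: 380 peerings, 560 watchers.
```

Peerings grow quadratically with cluster count while watchers grow linearly, which is why the peering count is the figure to validate first at each tier.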

Out of Scope (deferred to later phases / pre-merge of #1157)

Pre-merge housekeeping for #1157, "Add Cilium ClusterMesh scale-test scenario" (DEBUG-DUMP block, dev-pipeline placeholder revert, comment-trim pass): stays on the base branch.
Remaining scale scenarios (#2 pod churn, #3 node churn, #4 API server failure, #5 isolation, #6 upper-bound, #7 HA replicas): Phase 4.

…trics (build 69395 evidence: Hubble queries emit "No data items" because CL2 prometheus does not scrape Hubble metrics port 9965; ACNS exposes them but our scrape config covers only standard cilium-agent port 9962)

…d): port-forward + curl + tar to capture in-cluster prometheus state for offline PromQL; adds PodMonitors for hubble:9965 + coredns:9153 + kvstoremesh-standalone:9964 (cilium-agent + cilium-operator already scraped by CL2 built-in flags); enabled on n=3 failover + n=3 restart-survival smoke stages

… own storage account, OAuth via SP, satisfies sub no-shared-key policy); knobs cl2_prom_snapshot_target=artifact|blob + storage_account + container; scales to N=100; n3 smoke stages use blob to validate end-to-end

…ows, CoreDNS latency+cache, kvstoremesh sync duration+readiness, operator identity GC+IPAM) + gap #3 service-backend membership probe (transient global Service per probe pod with propagation-probe-id selector, wait_peer_service_backend polls BPF lb map on peers, creates+deletes Service per iteration)

skosuri and others added 30 commits May 6, 2026 13:59

phase 3: bounded-parallel CL2 fan-out across clusters

b5fe281

phase 3: add 5-cluster tier (azure-5.tfvars + n5 stage on dev/prod pipelines)

506d195

aks-cli: wait for stable Succeeded before extra node pool create

56942b1

aks-cli: run wait-for-succeeded with bash interpreter (dash rejects pipefail)

5801228

fix per-type events rate: scope ip/v1 doesn't exist in kvstoremesh; add Total counts

1b02f57

probe: dump actual scope/action labels on kvstoremesh events metric

7ec0c43

aks-cli: retry nodepool add on OperationNotAllowed (race vs lazy AKS extension PUT)

4714d26

fix per-type events rate: range vector for increase, finer subquery step, restore Endpoints (ip/v1)

dbaf930

diag: add CurrentValue/SeriesCount per scope; add operations-count fallback metric per scope

a92b84e

per-scope events: report TotalCount (instant sum), drop broken increase/rate queries

81ea7c3

phase 3: add 10-cluster tier (azure-10.tfvars + n10 stage on dev/prod pipelines)

3a9af93

per-scope events: restore rate queries; add 90s pre-workload settle for prom baseline

380d34c

n10: lower terraform apply parallelism to 4 (AKS RP throttles at 10 concurrent creates)

90ef4e7

dev pipeline: disable n2 + n5 stages temporarily (RG quota pressure)

cac3392

cleanup phase 3: drop dead per-scope rate queries; drop 90s settle (didn't fix root cause); fix n5 condition syntax

4ca27f0

phase 3: add 20-cluster tier (final scale-test point); disable n2/n5/n10 in dev for n20 iteration

55c8a40

n20: parallelism=8 + 480min timeout; validate retry budget 30min for N=20 mesh convergence

5714f9c

20-node baseline (spec line 24): default pool 2->20 nodes, D4s_v5->D4s_v3 (DSv3 quota fits 1600 vCPU at n20)

2d717a7

aks-cli: add pod_subnet_name to variable schema (latent bug: main.tf referenced it but variables.tf didn't declare)

e24962f

aks-cli: pass --pod-subnet-id to nodepool add too (AKS requires all-or-none)

529aa91

pylint: clear R1732 (Popen disable), R1731 (max builtin), W0212 (rename _FakePopen attrs)

fd67123

pre-merge cleanup: strip DEBUG-DUMP/SMOKE-FAILURE-DEBUG-DUMP blocks; flip prod skip_publish to false

1bd56a6

dev pipeline: flip skip_publish to false (need Kusto data for dashboards)

5c45946

collect: stash subdirs around process_cl2_reports; per-cluster errors warn not abort

f44129b

validate: pre-gate on clustermesh-apiserver Deployment+LB readiness across all clusters

ca6895b

cl2 measurements: add per-pod apiserver CPU + per-peer mesh failure breakdown

e961e15

phase 4a: pod-churn-scale + pod-churn-kill CL2 configs, slope measurements, harness knobs

d80105a

phase 4a: wire pod-churn matrix entries + churn knobs in execute.yml/collect.yml

c144982

phase 4a: pod-churn-combined config + Method:Exec killer; n20 matrix entry

a021e02

phase 4a: enable n=2 stage with pod_churn_combined entry; disable n=20 for smoke first

8433840

skosuri1 and others added 30 commits June 3, 2026 05:59

policy canary: L4 toPorts rule (force policy regen, was optimized away with ingress:[{}])

b67a1b4

policy regen metric: keep query, document AKS-managed Cilium does not enable it (per upstream source: opt-in via --metrics flag)

da7511c

endpoint regen metric: increase(%v) + Mean/TotalSamples + comment clarifying it is the AKS-managed Cilium proxy for policy regen duration

eb2711f

cc migration: n=2 shared-vnet smoke (DSv4 SKU swap + canadacentral stage + matching test-inputs json)

020ffd0

cc migration N=20+N=100: tfvars DSv4 swap + pipeline cells (cross-region baseline + headline) + matching test-inputs jsons + fix pre-existing azure-{20,50}-shared.json gap

606fdec

parallel next-batch: pod-density 500+800 n2 stages (euap, orthogonal to cc) + metrics Phase 2 ProposalsCommitted + DropByReason

6b4c588

cluster-loss-recovery probe: mesh-detach-rejoin orchestrator + n=3 smoke (azure-3-shared.tfvars + execute.yml launcher + scale.py collector + pipeline stage)

aa90d3e

detach probe: bump prewait 120->300s + pre-state deadline 60s->300s (build 69300 evidence: n=3 first-convergence > 180s due to LB+CMP reconcile lag)

1838e67

detach probe: cilium-dbg status (not clustermesh status) via ds/cilium exec + deterministic leader=min-role (build 69318 evidence: leader=victim=mesh-3 collision + wrong cilium subcommand)

de451ac

final batch: long-soak 6h canary + repeatability-variance N=20 g100 x3 stages + finish observer_kc rename in probe.sh hold/post phases

215c00e

cc N=100 fallback: azure_eastus2_n100_pod_churn (eastus2 has 143 cluster cap + 40K Dv3 free; cc hit 99-cluster cap at build 69317)

f1a77fd

centraluseuap N=100 stage (highest-capacity region; 187 AKS free + 7424 DSv4 free; max N=154 ceiling for future N=150)

8c226b2

revert centraluseuap N=100 (DSv4 only 7424 free = ~1.5x N=100 need, no real headroom for cluster recreate or future scale; eastus2 is the right pick with 40K Dv3 free)

43cd066

pivot N=100 -> cc N=92 (eastus2 Dv3 SKU-policy-blocked; cc has 92 cluster cap free + 62K DSv4 vCPU = 7% util massive buffer)

493240e

drop N=100-in-alternate-region attempts; rely on euap N=100 baseline (build 67579) for headline; remove cc N=92 stage + azure-92-shared-cc tfvars/json

87f0041

soak canary: worker_timeout 7h to 8h plus stage timeout 10h to 11h (build 69332 evidence: SIGTERM at ~6h50m actual budget needed for 6h churn plus setup plus 10min terminate phase)

351e4f5

metrics Phase 3 + NetworkPolicy at scale scenario (10 new Hubble/CRI/throttle/OOM/endpoint-state metric IDs in cilium.yaml + new policy-scale.yaml scenario creates N CNPs per ns + scale.py CLI knobs + execute.yml env vars + n=2 pipeline stage)

35ced14

cross-cluster CNP propagation cost probe: host-side parallel-apply orchestrator + execute.yml launcher/wait + scale.py collector + n=3 pipeline stage (complements per-cluster policy-scale by measuring fleet-wide rollout latency)

32367f8

propagation probe: add REMOVE + FIRST_PACKET extensions (stale-state risk + user-perceived service-works latency); opt-in via env, default off; enabled on euap n=2 + cc n=2 smoke stages

dcb5afb

FIRST_PACKET probe fix: switch probe pod to nginx (HTTP server) when extension enabled, curl pod IP directly instead of global Service DNS (build 69395 evidence: all 10 samples emitted nulls because pause pod cannot respond to curl)

90124bd

soak canary: worker_timeout 8h to 9h plus stage timeout 11h to 12h (build 69392 evidence: SIGTERM at 7h1min CL2 wall instead of expected 6h50min; +1h in-CL2 overhead from inter-phase measurement gather)

ff23224

mesh-behavior gap probes: identity GC in REMOVE + single-cluster failover (scale-down/up; gap #4) + clustermesh-apiserver restart survival (in-pod curl loop; gap #8) + n=3 smoke stages for both

def3fd8

validation gate fixes: strip trailing whitespace in pipeline yaml + update test_configure_command_parsing kwargs to match scale.py CLI

be8994a

fix pre-existing pylint regressions (too-many-lines disable on scale.py + tests + hoist subprocess import + suppress not-callable false positive on tuple-unpacked transform); pylint now 10/10 exit 0

05e32b3

switch prom snapshot delivery from Telescope blob to AzDO pipeline artifact (artifact owned by our pipeline run, downloadable from Build page; eliminates Telescope-team storage dependency that defeated the purpose of having an independent backup)

34f41e3

fix prom snapshot blob upload: replace login.yml template + bash combo with single AzureCLI@2 task (AzDO does not allow runtime condition on step-template references; AzureCLI@2 supports condition directly and handles SP auth via azureSubscription input)

20bd804

skosuri1 wants to merge 166 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2