
ClusterMesh scale: Phase 3 + 4 — scale tiers + parallel CL2 fan-out + add all scenarios #1168

Draft
skosuri1 wants to merge 166 commits into skosuri/clustermesh-scale from skosuri/clustermesh-scale-2

Conversation

skosuri1 commented May 6, 2026

Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work — moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.

Phase 3 Deliverables

  • 20-node baseline cluster size (spec line 24). Current clusters are 3 nodes (2 default + 1 prompool) — sized for harness validation, not real scale measurement.
  • Cluster-count tiers: add azure-5.tfvars, azure-10.tfvars, azure-20.tfvars and corresponding pipeline matrix entries. For each tier: validate quota, validate peering count (N·(N-1) in separate-VNet mode, i.e. 380 at N=20), tune CL2 timeouts, document breaking points.
  • Parallel CL2 fan-out: replace sequential per-cluster CL2 with bounded concurrency (default 4). Requires async wrapping of utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 + Prometheus.
  • etcd PodMonitor capacity check at 20 clusters: 28 watchers per cluster × 20 clusters = 560 watchers; verify that the Prometheus scrape budget holds.
  • Scaling-curve dashboards from cluster-attributed results (Kusto).
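
A minimal sketch of the bounded-concurrency fan-out described in the list above, assuming the synchronous CL2 call can be run in a worker thread. The names `fan_out`, `run_one`, and `fake_run_cl2` are hypothetical and used only for illustration; the real `utils.run_cl2_command` signature is not shown in this PR, so a stand-in takes its place here.

```python
import asyncio
import time
from typing import Callable, Dict, Iterable, Union


def fake_run_cl2(cluster: str) -> str:
    """Hypothetical stand-in for the synchronous utils.run_cl2_command."""
    time.sleep(0.05)  # simulate a blocking CL2 run
    return f"{cluster}: ok"


async def fan_out(
    clusters: Iterable[str],
    run_one: Callable[[str], str],
    max_concurrency: int = 4,  # default of 4 taken from the PR description
) -> Dict[str, Union[str, BaseException]]:
    """Run run_one for every cluster with at most max_concurrency in flight."""
    names = list(clusters)
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(cluster: str) -> str:
        async with sem:
            # asyncio.to_thread (Python 3.9+) keeps the blocking call
            # off the event loop.
            return await asyncio.to_thread(run_one, cluster)

    # return_exceptions=True records a failure for one cluster without
    # cancelling the runs for the others.
    results = await asyncio.gather(
        *(guarded(c) for c in names), return_exceptions=True
    )
    return dict(zip(names, results))


if __name__ == "__main__":
    outcome = asyncio.run(fan_out([f"mesh-{i}" for i in range(1, 6)], fake_run_cl2))
    for name, result in outcome.items():
        print(name, "->", result)
```

A semaphore around `asyncio.to_thread` is one of several ways to cap concurrency; a `ThreadPoolExecutor(max_workers=4)` would work as well. Neither addresses the agent CPU/RAM headroom question, which still has to be measured on the AzDO agent.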

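The per-tier arithmetic quoted in the deliverables (peering count and etcd watcher total) can be tabulated with a short script. This is only a sketch: the function names are hypothetical, and the figure of 28 watchers per cluster is taken from the description above rather than measured here.

```python
def directional_peerings(n_clusters: int) -> int:
    """Peering count N*(N-1) for a full mesh in separate-VNet mode."""
    return n_clusters * (n_clusters - 1)


def total_etcd_watchers(n_clusters: int, watchers_per_cluster: int = 28) -> int:
    """Total watchers if every cluster contributes the same number."""
    return n_clusters * watchers_per_cluster


if __name__ == "__main__":
    # Tiers correspond to azure-5.tfvars, azure-10.tfvars, azure-20.tfvars.
    for n in (5, 10, 20):
        print(
            f"N={n}: peerings={directional_peerings(n)}, "
            f"etcd watchers={total_etcd_watchers(n)}"
        )
```
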
Out of Scope (deferred to later phases / pre-merge of #1157)

skosuri and others added 30 commits May 6, 2026 13:59
…idn't fix root cause); fix n5 condition syntax
… referenced it but variables.tf didn't declare)
skosuri1 and others added 30 commits June 3, 2026 05:59
… enable it (per upstream source: opt-in via --metrics flag)
…rifying it is the AKS-managed Cilium proxy for policy regen duration
…ion baseline + headline) + matching test-inputs jsons + fix pre-existing azure-{20,50}-shared.json gap
…to cc) + metrics Phase 2 ProposalsCommitted + DropByReason
…oke (azure-3-shared.tfvars + execute.yml launcher + scale.py collector + pipeline stage)
…build 69300 evidence: n=3 first-convergence > 180s due to LB+CMP reconcile lag)
…m exec + deterministic leader=min-role (build 69318 evidence: leader=victim=mesh-3 collision + wrong cilium subcommand)
…3 stages + finish observer_kc rename in probe.sh hold/post phases
…ter cap + 40K Dv3 free; cc hit 99-cluster cap at build 69317)
…24 DSv4 free; max N=154 ceiling for future N=150)
…o real headroom for cluster recreate or future scale; eastus2 is the right pick with 40K Dv3 free)
…ster cap free + 62K DSv4 vCPU = 7% util massive buffer)
…(build 67579) for headline; remove cc N=92 stage + azure-92-shared-cc tfvars/json
…uild 69332 evidence: SIGTERM at ~6h50m actual budget needed for 6h churn plus setup plus 10min terminate phase)
…throttle/OOM/endpoint-state metric IDs in cilium.yaml + new policy-scale.yaml scenario creates N CNPs per ns + scale.py CLI knobs + execute.yml env vars + n=2 pipeline stage)
…chestrator + execute.yml launcher/wait + scale.py collector + n=3 pipeline stage (complements per-cluster policy-scale by measuring fleet-wide rollout latency)
…risk + user-perceived service-works latency); opt-in via env, default off; enabled on euap n=2 + cc n=2 smoke stages
…extension enabled, curl pod IP directly instead of global Service DNS (build 69395 evidence: all 10 samples emitted nulls because pause pod cannot respond to curl)
…uild 69392 evidence: SIGTERM at 7h1min CL2 wall instead of expected 6h50min; +1h in-CL2 overhead from inter-phase measurement gather)
…trics (build 69395 evidence: Hubble queries emit "No data items" because CL2 prometheus does not scrape Hubble metrics port 9965; ACNS exposes them but our scrape config covers only standard cilium-agent port 9962)
…over (scale-down/up; gap #4) + clustermesh-apiserver restart survival (in-pod curl loop; gap #8) + n=3 smoke stages for both
…pdate test_configure_command_parsing kwargs to match scale.py CLI
…py + tests + hoist subprocess import + suppress not-callable false positive on tuple-unpacked transform); pylint now 10/10 exit 0
…d): port-forward + curl + tar to capture in-cluster prometheus state for offline PromQL; adds PodMonitors for hubble:9965 + coredns:9153 + kvstoremesh-standalone:9964 (cilium-agent + cilium-operator already scraped by CL2 built-in flags); enabled on n=3 failover + n=3 restart-survival smoke stages
…tifact (artifact owned by our pipeline run, downloadable from Build page; eliminates Telescope-team storage dependency that defeated the purpose of having an independent backup)
… own storage account, OAuth via SP, satisfies sub no-shared-key policy); knobs cl2_prom_snapshot_target=artifact|blob + storage_account + container; scales to N=100; n3 smoke stages use blob to validate end-to-end
…o with single AzureCLI@2 task (AzDO does not allow runtime condition on step-template references; AzureCLI@2 supports condition directly and handles SP auth via azureSubscription input)
…ows, CoreDNS latency+cache, kvstoremesh sync duration+readiness, operator identity GC+IPAM) + gap #3 service-backend membership probe (transient global Service per probe pod with propagation-probe-id selector, wait_peer_service_backend polls BPF lb map on peers, creates+deletes Service per iteration)