ClusterMesh scale: Phase 3 + 4 - scale tiers + parallel CL2 fan-out + add all scenarios #1168
Draft
skosuri1 wants to merge 166 commits into
…tep, restore Endpoints (ip/v1)
…llback metric per scope
…oncurrent creates)
…idn't fix root cause); fix n5 condition syntax
…n10 in dev for n20 iteration
…N=20 mesh convergence
…s_v3 (DSv3 quota fits 1600 vCPU at n20)
… referenced it but variables.tf didn't declare)
…me _FakePopen attrs)
…flip prod skip_publish to false
…cross all clusters
…ments, harness knobs
…0 for smoke first
…y with ingress:[{}])
… enable it (per upstream source: opt-in via --metrics flag)
…rifying it is the AKS-managed Cilium proxy for policy regen duration
…age + matching test-inputs json)
…ion baseline + headline) + matching test-inputs jsons + fix pre-existing azure-{20,50}-shared.json gap
…to cc) + metrics Phase 2 ProposalsCommitted + DropByReason
…oke (azure-3-shared.tfvars + execute.yml launcher + scale.py collector + pipeline stage)
…build 69300 evidence: n=3 first-convergence > 180s due to LB+CMP reconcile lag)
…m exec + deterministic leader=min-role (build 69318 evidence: leader=victim=mesh-3 collision + wrong cilium subcommand)
…3 stages + finish observer_kc rename in probe.sh hold/post phases
…ter cap + 40K Dv3 free; cc hit 99-cluster cap at build 69317)
…24 DSv4 free; max N=154 ceiling for future N=150)
…o real headroom for cluster recreate or future scale; eastus2 is the right pick with 40K Dv3 free)
…ster cap free + 62K DSv4 vCPU = 7% util massive buffer)
…(build 67579) for headline; remove cc N=92 stage + azure-92-shared-cc tfvars/json
…uild 69332 evidence: SIGTERM at ~6h50m actual budget needed for 6h churn plus setup plus 10min terminate phase)
…throttle/OOM/endpoint-state metric IDs in cilium.yaml + new policy-scale.yaml scenario creates N CNPs per ns + scale.py CLI knobs + execute.yml env vars + n=2 pipeline stage)
…chestrator + execute.yml launcher/wait + scale.py collector + n=3 pipeline stage (complements per-cluster policy-scale by measuring fleet-wide rollout latency)
…risk + user-perceived service-works latency); opt-in via env, default off; enabled on euap n=2 + cc n=2 smoke stages
…extension enabled, curl pod IP directly instead of global Service DNS (build 69395 evidence: all 10 samples emitted nulls because pause pod cannot respond to curl)
…uild 69392 evidence: SIGTERM at 7h1min CL2 wall instead of expected 6h50min; +1h in-CL2 overhead from inter-phase measurement gather)
…trics (build 69395 evidence: Hubble queries emit "No data items" because CL2 prometheus does not scrape Hubble metrics port 9965; ACNS exposes them but our scrape config covers only standard cilium-agent port 9962)
…pdate test_configure_command_parsing kwargs to match scale.py CLI
…py + tests + hoist subprocess import + suppress not-callable false positive on tuple-unpacked transform); pylint now 10/10 exit 0
…d): port-forward + curl + tar to capture in-cluster prometheus state for offline PromQL; adds PodMonitors for hubble:9965 + coredns:9153 + kvstoremesh-standalone:9964 (cilium-agent + cilium-operator already scraped by CL2 built-in flags); enabled on n=3 failover + n=3 restart-survival smoke stages
…tifact (artifact owned by our pipeline run, downloadable from Build page; eliminates Telescope-team storage dependency that defeated the purpose of having an independent backup)
… own storage account, OAuth via SP, satisfies sub no-shared-key policy); knobs cl2_prom_snapshot_target=artifact|blob + storage_account + container; scales to N=100; n3 smoke stages use blob to validate end-to-end
…o with single AzureCLI@2 task (AzDO does not allow runtime condition on step-template references; AzureCLI@2 supports condition directly and handles SP auth via azureSubscription input)
…ows, CoreDNS latency+cache, kvstoremesh sync duration+readiness, operator identity GC+IPAM) + gap #3 service-backend membership probe (transient global Service per probe pod with propagation-probe-id selector, wait_peer_service_backend polls BPF lb map on peers, creates+deletes Service per iteration)
Stacked on top of #1157 (skosuri/clustermesh-scale). Do not merge until #1157 is merged; review/merge order matters.

This PR continues the ClusterMesh scale-test scenario with Phase 3 work, moving from harness validation (2 small clusters) to real scale measurement across cluster-count tiers.
Phase 3 Deliverables
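The peering-count check for the scale tiers uses N·(N-1) in separate-VNet mode, which gives 380 at N=20. A minimal sketch of that arithmetic for the three tier sizes follows. The helper name `peering_count` and the `TIERS` tuple are hypothetical and are not part of the PR. The formula presumably counts each cluster pair twice because VNet peerings are created in both directions.

```python
def peering_count(n_clusters: int) -> int:
    """Return N*(N-1): peerings for a full mesh with one VNet per cluster.

    Assumption: every ordered pair of clusters gets its own peering
    resource, which is what the N*(N-1) figure in the PR implies.
    """
    if n_clusters < 0:
        raise ValueError("cluster count must be non-negative")
    return n_clusters * (n_clusters - 1)


# Tier sizes taken from the tfvars file names (azure-5, azure-10, azure-20).
TIERS = (5, 10, 20)

counts = {n: peering_count(n) for n in TIERS}

for n, count in counts.items():
    print(f"N={n}: {count} peerings")
```

The count grows quadratically, so doubling the tier from 10 to 20 clusters roughly quadruples the peerings to validate against quota.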
- azure-5.tfvars, azure-10.tfvars, azure-20.tfvars and corresponding pipeline matrix entries. Each tier: validate quota, validate peering count (N·(N-1) at separate-VNet mode; 380 at N=20), tune CL2 timeouts, document breaking points.
- … utils.run_cl2_command (currently synchronous, modules/python/clusterloader2/utils.py:66-72) and confirming the AzDO agent has CPU/RAM headroom for N concurrent CL2 + Prometheus.

Out of Scope (deferred to later phases / pre-merge of #1157)