Draft
46 commits
2000574 (Apr 28, 2026) Add Cilium ClusterMesh scale-test scenario (Phase 1 vertical slice)
b482bc5 (Apr 28, 2026) Point new-pipeline-test.yml at clustermesh-scale for dev pipeline runs
44d106d (Apr 28, 2026) Use cilium-dbg status for in-pod check; add cilium-cli for runner-sid…
54e581b (Apr 28, 2026) [debug] dump pods + cilium-cli + fleet member state every 3 retries d…
21f0835 (Apr 28, 2026) fix(cssc): use mcr.microsoft.com pause image to satisfy supply chain …
76c1ae5 (Apr 28, 2026) debug: surface fleet clustermeshprofile connection state (not just pr…
7cf9703 (Apr 28, 2026) fix(ci): drop __init__.py (script, not module) and rename test-input …
aa43ffb (Apr 29, 2026) test(clustermesh-scale): unit tests for scale.py configure/collect wi…
ea51dea (Apr 29, 2026) feat(clustermesh-scale): wire phase 2 measurement modules (cilium, co…
879a6e9 (Apr 29, 2026) feat(clustermesh-scale): plumb mesh_size end-to-end + log clustermesh…
84d98e2 (Apr 29, 2026) feat(clustermesh-scale): add scale scenario #1 cross-cluster event th…
562e57c (Apr 29, 2026) feat(clustermesh-scale): bump pod subnet to /22 to fit event-throughp…
3f2664c (May 1, 2026) fix: grant Network Contributor on VNet to AKS identity
d45a5ad (May 1, 2026) ci: skip results upload while iterating clustermesh-scale
c804872 (May 1, 2026) debug(clustermesh-scale): dump cilium svc + pod-IP probe on smoke fai…
3601254 (May 1, 2026) validate: bump mesh convergence retries 30->60 (~10 min budget)
08c6d98 (May 1, 2026) smoke: annotate cm-smoke namespace with clustermesh.cilium.io/global=…
a615507 (May 1, 2026) fix(cl2): export CL2_* from auto-exported matrix env vars ($(name) do…
0ea89f5 (May 1, 2026) debug(cl2): dump env + try lowercase matrix-var names + macro test
8f3ad72 (May 1, 2026) debug(cl2): dump full env (no grep filter) to find matrix vars
5018c5f (May 1, 2026) fix(cl2): drop ${{ }} from comment - AzDO template parser sees throug…
cca8a69 (May 1, 2026) fix: align dev pipeline matrix with production; remove env-dump diag
2d0d3dc (May 1, 2026) fleet: bump retry 30->60
8ba31c4 (May 1, 2026) fix(cl2): right-size prometheus stack + detect failure via junit.xml
96a7d78 (May 1, 2026) cl2: shrink prometheus to 0.1x defaults + dump pod state on failure
41a1a28 (May 2, 2026) cl2: keep prometheus alive on failure + dump operator logs/CR/events
4251e55 (May 2, 2026) cl2: explicit prometheus mem request=1Gi/limit=2Gi (drop broken FACTO…
ac28c20 (May 2, 2026) cl2: pass --prometheus-memory-request=1Gi (the overrides key isn't ho…
6358ac3 (May 2, 2026) scale-test.yaml: remove template refs from comment
a2eb5c3 (May 4, 2026) fix: 4 issues from last run
ef8759e (May 4, 2026) fix: scale-test start measurement + retry profile delete
ac963f5 (May 4, 2026) fix: cl2 start params + destroy relabel-then-apply
50f5e0e (May 4, 2026) dev pipeline: add n2_event_throughput matrix entry
39f22b8 (May 4, 2026) fix: event-throughput start params + relax api SLO violation gate
d49f2b0 (May 4, 2026) bump max-pods to 110 + drop n2 from dev pipeline
3dd2e92 (May 4, 2026) shrink pause pod limits to 5m/20Mi
4225852 (May 4, 2026) add prompool + pin prometheus to it
fabbee9 (May 4, 2026) drop FS latency queries + add prom metric-name probe
630dd3f (May 5, 2026) fleet destroy: poll list-members before profile delete
be79cce (May 5, 2026) fix clustermesh prom queries: use kvstoremesh prefix; restore fs latency
4c89a19 (May 5, 2026) drop fs write latency; sum() backlog subtraction to fix label mismatch
8aadb02 (May 5, 2026) add watch queue depth metric; etcd port discovery probe
fd34a5c (May 5, 2026) wire etcd metrics; drop discovery probe
cd5f794 (May 5, 2026) close phase 1/2 spec gaps: pod logs, network usage, per-type event rate
76c4665 (May 5, 2026) probe kvstoremesh metric labels for per-type event rate
bd8edf1 (May 5, 2026) fix per-type event rate: scope label not prefix
20 changes: 12 additions & 8 deletions jobs/competitive-test.yml
@@ -48,6 +48,9 @@ parameters:
 - name: ssh_key_enabled
   type: boolean
   default: true
+- name: skip_publish
+  type: boolean
+  default: false

 jobs:
 - job: ${{ parameters.cloud }}
@@ -89,14 +92,15 @@ jobs:
       engine: ${{ parameters.engine }}
       regions: ${{ parameters.regions }}
       engine_input: ${{ parameters.engine_input }}
-  - template: /steps/publish-results.yml
-    parameters:
-      cloud: ${{ parameters.cloud }}
-      topology: ${{ parameters.topology }}
-      engine: ${{ parameters.engine }}
-      regions: ${{ parameters.regions }}
-      engine_input: ${{ parameters.engine_input }}
-      credential_type: ${{ parameters.credential_type }}
+  - ${{ if not(parameters.skip_publish) }}:
+    - template: /steps/publish-results.yml
+      parameters:
+        cloud: ${{ parameters.cloud }}
+        topology: ${{ parameters.topology }}
+        engine: ${{ parameters.engine }}
+        regions: ${{ parameters.regions }}
+        engine_input: ${{ parameters.engine_input }}
+        credential_type: ${{ parameters.credential_type }}
   - template: /steps/cleanup-resources.yml
     parameters:
       cloud: ${{ parameters.cloud }}
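The effect of the new `skip_publish` parameter can be modeled in a few lines of Python. This is only an illustrative sketch, not code from the PR: `build_steps` and the `"<earlier steps>"` placeholder are hypothetical names. The point it shows is that `${{ if not(parameters.skip_publish) }}` is a template expression, so when the flag is true the publish step is left out of the expanded pipeline entirely, while cleanup is still included.

```python
def build_steps(skip_publish: bool = False) -> list[str]:
    """Model which step templates end up in the expanded job.

    Hypothetical helper: "<earlier steps>" stands in for the steps that
    precede publish-results.yml, which this diff does not show in full.
    """
    steps = ["<earlier steps>"]
    if not skip_publish:
        # Inserted only when the flag is false, mirroring the ${{ if }} block.
        steps.append("/steps/publish-results.yml")
    # Cleanup sits outside the conditional, so it is always present.
    steps.append("/steps/cleanup-resources.yml")
    return steps
```

With the default `skip_publish=False` the step list matches the pre-change behavior, which is presumably why the parameter defaults to `false`.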
105 changes: 105 additions & 0 deletions modules/python/clusterloader2/clustermesh-scale/config/config.yaml
@@ -0,0 +1,105 @@
name: clustermesh-scale-test

# Workload: deploy a small fixed number of pods on this cluster (no churn,
# no traffic). Measurement modules under modules/measurements/ run the actual
# scale-test instrumentation (cilium agent/operator CPU+memory, kube-apiserver
# health, mesh-specific PromQL) so each per-cluster JSONL row carries the data
# needed for cross-cluster comparison in Kusto. The workload is deliberately
# trivial: fan-out, attribution, and metric coverage are what we're testing
# in Phase 1; richer workloads land per scenario in Phase 2+.

{{$namespaces := DefaultParam .CL2_NAMESPACES 1}}
{{$deploymentsPerNamespace := DefaultParam .CL2_DEPLOYMENTS_PER_NAMESPACE 2}}
{{$replicasPerDeployment := DefaultParam .CL2_REPLICAS_PER_DEPLOYMENT 2}}
{{$operationTimeout := DefaultParam .CL2_OPERATION_TIMEOUT "15m"}}
{{$apiServerCallsPerSecond := DefaultParam .CL2_API_SERVER_CALLS_PER_SECOND 5}}

namespace:
  number: {{$namespaces}}
  prefix: clustermesh-scale
  deleteStaleNamespaces: true
  deleteAutomanagedNamespaces: true
  enableExistingNamespaces: false
  deleteNamespaceTimeout: 20m

tuningSets:
  - name: Sequence
    parallelismLimitedLoad:
      parallelismLimit: 1
  - name: DeploymentCreateQps
    qpsLoad:
      qps: {{$apiServerCallsPerSecond}}

steps:
  # ----- Start measurements -----
  # control-plane.yaml owns PodStartupLatency + APIResponsivenessPrometheus +
  # apiserver CPU/mem queries; cilium.yaml owns cilium-agent + cilium-operator
  # CPU/mem; clustermesh-metrics.yaml owns mesh-specific PromQL (remote-cluster
  # connectivity, kvstore event rate, identity count, etc.). All three are
  # gathered later (see "Gather measurements" below) so the steady-state window
  # is bounded by the workload create/delete pair.
  - module:
      path: /modules/measurements/control-plane.yaml
      params:
        action: start
        group: clustermesh-scale-test

  - module:
      path: /modules/measurements/cilium.yaml
      params:
        action: start

  - module:
      path: /modules/measurements/clustermesh-metrics.yaml
      params:
        action: start

  - module:
      path: /modules/clustermesh.yaml
      params:
        actionName: create
        tuningSet: DeploymentCreateQps

  - module:
      path: /modules/scale-test.yaml
      params:
        actionName: create
        namespaces: {{$namespaces}}
        deploymentsPerNamespace: {{$deploymentsPerNamespace}}
        replicasPerDeployment: {{$replicasPerDeployment}}
        tuningSet: DeploymentCreateQps
        operationTimeout: {{$operationTimeout}}

  # ----- Gather measurements -----
  # Mirror the start block above. Order matches network-scale convention.
  - module:
      path: /modules/measurements/control-plane.yaml
      params:
        action: gather
        group: clustermesh-scale-test

  - module:
      path: /modules/measurements/cilium.yaml
      params:
        action: gather

  - module:
      path: /modules/measurements/clustermesh-metrics.yaml
      params:
        action: gather

  - module:
      path: /modules/scale-test.yaml
      params:
        actionName: delete
        namespaces: {{$namespaces}}
        deploymentsPerNamespace: {{$deploymentsPerNamespace}}
        replicasPerDeployment: {{$replicasPerDeployment}}
        tuningSet: DeploymentCreateQps
        operationTimeout: {{$operationTimeout}}

  - module:
      path: /modules/clustermesh.yaml
      params:
        actionName: delete
        tuningSet: DeploymentCreateQps
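To make the `DefaultParam` lines in this config concrete, here is a small Python sketch of how the defaults resolve into a workload size. It is an approximation for illustration only: `default_param` and `workload_size` are hypothetical helpers, and the real resolution happens inside ClusterLoader2's Go template engine. The multiplication (namespaces x deployments per namespace x replicas per deployment) assumes scale-test.yaml creates exactly that many Deployments and pods, which this diff does not show.

```python
def default_param(overrides: dict, name: str, default):
    """Approximate CL2's DefaultParam: use the override if set, else the default."""
    value = overrides.get(name)
    return default if value is None else value


def workload_size(overrides: dict) -> dict:
    """Resolve the three sizing knobs and derive per-cluster object counts."""
    namespaces = int(default_param(overrides, "CL2_NAMESPACES", 1))
    deployments = int(default_param(overrides, "CL2_DEPLOYMENTS_PER_NAMESPACE", 2))
    replicas = int(default_param(overrides, "CL2_REPLICAS_PER_DEPLOYMENT", 2))
    return {
        "deployments": namespaces * deployments,
        "pods": namespaces * deployments * replicas,
    }
```

With no overrides this yields 2 Deployments and 4 pods per cluster, consistent with the "small fixed number of pods" the header comment describes. Passing `{"CL2_REPLICAS_PER_DEPLOYMENT": 10}` would raise the pod count to 20 without touching the file.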
@@ -0,0 +1,166 @@
name: clustermesh-event-throughput

# Scale scenario #1: Cross-Cluster Event Throughput.
#
# Goal (scale testing.txt line 42-54): determine max sustainable and burst
# event rates for endpoints, services, and identities propagating across
# the mesh; measure events/sec processed and time-to-convergence proxy.
#
# Sequence (every cluster runs this in parallel; CL2 fan-out lives in
# steps/engine/.../execute.yml):
#
#   1. Start measurements (control-plane, cilium, clustermesh-metrics +
#      scenario-specific clustermesh-throughput + etcd-metrics).
#   2. Deploy PodMonitor scraping clustermesh-apiserver.
#   3. Create N pods + N global Services per cluster at a controlled QPS.
#   4. Warmup sleep: let initial create-flurry settle into steady state.
#   5. Burst rolling-restart of every Deployment (closes the "burst"
#      coverage gap from scale testing.txt line 52).
#   6. Settle sleep: let kvstore queues drain and propagation latency
#      histograms accumulate steady-state samples.
#   7. Gather all measurements.
#   8. Tear down the workload + PodMonitor.

{{$namespaces := DefaultParam .CL2_NAMESPACES 5}}
{{$deploymentsPerNamespace := DefaultParam .CL2_DEPLOYMENTS_PER_NAMESPACE 4}}
{{$replicasPerDeployment := DefaultParam .CL2_REPLICAS_PER_DEPLOYMENT 10}}
{{$operationTimeout := DefaultParam .CL2_OPERATION_TIMEOUT "20m"}}
{{$apiServerCallsPerSecond := DefaultParam .CL2_API_SERVER_CALLS_PER_SECOND 20}}
{{$warmupDuration := DefaultParam .CL2_WARMUP_DURATION "30s"}}
{{$holdDuration := DefaultParam .CL2_HOLD_DURATION "2m"}}
{{$restartGeneration := DefaultParam .CL2_RESTART_GENERATION 1}}

namespace:
  number: {{$namespaces}}
  prefix: clustermesh-et
  deleteStaleNamespaces: true
  deleteAutomanagedNamespaces: true
  enableExistingNamespaces: false
  deleteNamespaceTimeout: 20m

tuningSets:
  - name: Sequence
    parallelismLimitedLoad:
      parallelismLimit: 1
  - name: DeploymentCreateQps
    qpsLoad:
      qps: {{$apiServerCallsPerSecond}}

steps:
  # ----- Start measurements -----
  - module:
      path: /modules/measurements/control-plane.yaml
      params:
        action: start
        group: clustermesh-event-throughput

  - module:
      path: /modules/measurements/cilium.yaml
      params:
        action: start

  - module:
      path: /modules/measurements/clustermesh-metrics.yaml
      params:
        action: start

  - module:
      path: /modules/measurements/clustermesh-throughput.yaml
      params:
        action: start

  - module:
      path: /modules/measurements/etcd-metrics.yaml
      params:
        action: start

  - module:
      path: /modules/clustermesh.yaml
      params:
        actionName: create
        tuningSet: DeploymentCreateQps

  # ----- Workload: create -----
  - module:
      path: /modules/event-throughput-workload.yaml
      params:
        actionName: create
        generation: 0
        namespaces: {{$namespaces}}
        deploymentsPerNamespace: {{$deploymentsPerNamespace}}
        replicasPerDeployment: {{$replicasPerDeployment}}
        tuningSet: DeploymentCreateQps
        operationTimeout: {{$operationTimeout}}

  # ----- Warmup: let the create-flurry settle into steady state -----
  - name: Warmup before burst
    measurements:
      - Identifier: WarmupSleep
        Method: Sleep
        Params:
          duration: {{$warmupDuration}}

  # ----- Burst: rolling-restart of every Deployment -----
  - module:
      path: /modules/event-throughput-workload.yaml
      params:
        actionName: restart
        generation: {{$restartGeneration}}
        namespaces: {{$namespaces}}
        deploymentsPerNamespace: {{$deploymentsPerNamespace}}
        replicasPerDeployment: {{$replicasPerDeployment}}
        tuningSet: DeploymentCreateQps
        operationTimeout: {{$operationTimeout}}

  # ----- Settle: let kvstore queues drain post-burst -----
  - name: Settle after burst
    measurements:
      - Identifier: SettleSleep
        Method: Sleep
        Params:
          duration: {{$holdDuration}}

  # ----- Gather measurements -----
  - module:
      path: /modules/measurements/control-plane.yaml
      params:
        action: gather
        group: clustermesh-event-throughput

  - module:
      path: /modules/measurements/cilium.yaml
      params:
        action: gather

  - module:
      path: /modules/measurements/clustermesh-metrics.yaml
      params:
        action: gather

  - module:
      path: /modules/measurements/clustermesh-throughput.yaml
      params:
        action: gather

  - module:
      path: /modules/measurements/etcd-metrics.yaml
      params:
        action: gather

  # ----- Workload: delete -----
  - module:
      path: /modules/event-throughput-workload.yaml
      params:
        actionName: delete
        generation: {{$restartGeneration}}
        namespaces: {{$namespaces}}
        deploymentsPerNamespace: {{$deploymentsPerNamespace}}
        replicasPerDeployment: {{$replicasPerDeployment}}
        tuningSet: DeploymentCreateQps
        operationTimeout: {{$operationTimeout}}

  - module:
      path: /modules/clustermesh.yaml
      params:
        actionName: delete
        tuningSet: DeploymentCreateQps
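For a rough sense of what the event-throughput defaults imply, the sketch below derives the per-cluster pod count and the fixed sleep time (warmup plus settle) from the values in this config. It is a back-of-the-envelope aid, not part of the PR: `parse_duration` and `scenario_budget` are hypothetical helpers, the parser handles only single-unit durations such as `30s` or `2m`, and the pod count assumes event-throughput-workload.yaml creates namespaces x deployments x replicas pods.

```python
import re

# Seconds per supported unit; compound durations like "1m30s" are out of scope.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse a single-unit duration such as "30s" or "2m" into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def scenario_budget(namespaces: int = 5,
                    deployments_per_namespace: int = 4,
                    replicas_per_deployment: int = 10,
                    warmup: str = "30s",
                    hold: str = "2m") -> dict:
    """Derive per-cluster pod count and the time spent in the two Sleep steps."""
    pods = namespaces * deployments_per_namespace * replicas_per_deployment
    sleep_seconds = parse_duration(warmup) + parse_duration(hold)
    return {"pods_per_cluster": pods, "fixed_sleep_seconds": sleep_seconds}
```

With the defaults this gives 200 pods per cluster and 150 seconds of fixed sleep. The sleep figure is only a floor on the measurement window, since create, restart, and delete each add time bounded by `operationTimeout`.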
@@ -0,0 +1,26 @@
## ClusterMesh module: deploys a PodMonitor for clustermesh-apiserver so the
## CL2-spawned Prometheus picks up at least one mesh-side metric per cluster.
## Phase 1 exit criteria require this; see plan.md Phase 1 line 318.

{{$tuningSet := DefaultParam .tuningSet "DeploymentCreateQps"}}
{{$interval := DefaultParam .interval "15s"}}
{{ $replicasPerNamespace := 1 }}

{{if eq .actionName "create"}}
  {{ $replicasPerNamespace = 1 }}
{{else}}
  {{ $replicasPerNamespace = 0 }}
{{end}}

steps:
  - name: {{.actionName}} ClusterMesh Pod Monitor
    phases:
      - namespaceList:
          - "monitoring"
        replicasPerNamespace: {{$replicasPerNamespace}}
        tuningSet: {{$tuningSet}}
        objectBundle:
          - objectTemplatePath: "modules/clustermesh/podmonitor.yaml"
            basename: clustermesh-apiserver
            templateFillMap:
              Interval: {{$interval}}
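The create/delete switch in this module relies on ClusterLoader2 phases being declarative: the phase states a desired replica count, and a count of 0 removes the object. A minimal Python model of the template's `if`/`else` follows; `replicas_for` is a hypothetical name used only for illustration.

```python
def replicas_for(action_name: str) -> int:
    """Mirror the template branch: "create" keeps one PodMonitor, anything else zero."""
    return 1 if action_name == "create" else 0
```

One consequence worth noting is that the `else` branch catches every value other than `"create"`, so a misspelled `actionName` would behave like `delete` rather than fail.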
@@ -0,0 +1,35 @@
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: clustermesh-apiserver
  namespace: monitoring
spec:
  # Cilium clustermesh-apiserver exposes metrics on port 9963 (apiserver) and
  # 9964 (kvstoremesh sidecar) when Prometheus integration is enabled. AKS
  # managed Cilium uses the same upstream defaults. If a future preview
  # changes these, override via __address__ relabel below.
  selector:
    matchLabels:
      k8s-app: clustermesh-apiserver
  namespaceSelector:
    matchNames:
      - kube-system
  podMetricsEndpoints:
    - interval: {{.Interval}}
      honorLabels: true
      path: /metrics
      relabelings:
        - sourceLabels: [__address__]
          action: replace
          targetLabel: __address__
          regex: (.+?)(\:\d+)?
          replacement: $1:9963
    - interval: {{.Interval}}
      honorLabels: true
      path: /metrics
      relabelings:
        - sourceLabels: [__address__]
          action: replace
          targetLabel: __address__
          regex: (.+?)(\:\d+)?
          replacement: $1:9964
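The two `__address__` relabelings are easiest to check by running the regex against sample targets. The sketch below is a Python approximation, not Prometheus itself: Prometheus uses RE2 with `$1` replacement syntax and anchors relabel regexes at both ends, which `re.fullmatch` imitates here. `rewrite_address` is a hypothetical helper and the sample IP addresses are made up.

```python
import re

# Same pattern as the PodMonitor; fullmatch() stands in for Prometheus's
# implicit anchoring of relabel regexes.
RELABEL_REGEX = re.compile(r"(.+?)(\:\d+)?")


def rewrite_address(address: str, port: int) -> str:
    """Apply the 'replace' relabeling: keep the host, force the given port."""
    match = RELABEL_REGEX.fullmatch(address)
    if match is None:
        # A 'replace' action leaves the label untouched when the regex misses.
        return address
    return f"{match.group(1)}:{port}"
```

For example, `rewrite_address("10.244.1.7:2379", 9963)` returns `"10.244.1.7:9963"`, and an address with no port, such as `"10.244.1.7"`, gets the port appended. The lazy `(.+?)` together with the optional port group is what lets one pattern cover both cases.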