Add 15K node cluster creation pipeline #1217
Pipeline provisions a private AKS cluster with Cilium CNI, Azure Firewall, custom VNet, jumpbox VM, and 14997 user nodes across 19 node pools planned via PlanNodePools. All key params are ADO pipeline variables.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
| ), | ||
| azure.CreateSubnetAndRoleAssignment( | ||
| const.DEFAULT_SERVICE_CONNECTION, | ||
| RESOURCE_GROUP, VNET_NAME, |
nit: one parameter per line please.
| JUMPBOX_NAME = "${CLUSTER_NAME}-jumpbox" | ||
|
|
||
| VNET_CIDR = "10.0.0.0/8" | ||
| NODES_SUBNET_CIDR = "10.1.0.0/18" |
The variable is only used once, remove it for better readability.
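Whichever way the CIDR values end up being written, the address layout itself is easy to sanity-check. The sketch below is not part of the PR; it is a minimal Python check (Python rather than KCL, so it can run standalone) using the four CIDRs quoted in this review. The helper names are hypothetical, and whether each node consumes exactly one subnet address depends on the CNI mode, which this sketch does not model.

```python
import ipaddress

# CIDR values quoted in this review.
VNET_CIDR = "10.0.0.0/8"
SUBNET_CIDRS = {
    "nodes": "10.1.0.0/18",
    "firewall": "10.2.0.0/16",
    "jumpbox": "10.3.0.0/16",
}
# Azure reserves 5 addresses in every subnet.
AZURE_RESERVED_PER_SUBNET = 5


def check_layout(vnet_cidr, subnet_cidrs):
    """Return a list of problems: subnets outside the VNet or overlapping each other."""
    vnet = ipaddress.ip_network(vnet_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnet_cidrs.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vnet):
            problems.append(f"{name} is outside {vnet}")
    names = sorted(nets)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if nets[first].overlaps(nets[second]):
                problems.append(f"{first} overlaps {second}")
    return problems


def usable_addresses(cidr):
    """Addresses left in a subnet after Azure's reserved ones are subtracted."""
    return ipaddress.ip_network(cidr).num_addresses - AZURE_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(check_layout(VNET_CIDR, SUBNET_CIDRS))
    print(usable_addresses(SUBNET_CIDRS["nodes"]))
```

With these values the layout has no conflicts, and the /18 nodes subnet leaves 16379 usable addresses, which is above the 15K node target if each node takes a single address.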
| const.DEFAULT_SERVICE_CONNECTION, | ||
| RESOURCE_GROUP, VNET_NAME, | ||
| "nodes", NODES_SUBNET_CIDR, | ||
| SUBSCRIPTION_ID, "$(IDENTITY_PRINCIPAL_ID)" |
Where is "IDENTITY_PRINCIPAL_ID" defined?
It will be exposed through identity.k.
| VNET_CIDR, | ||
| SUBSCRIPTION_ID | ||
| ), | ||
| azure.CreateSubnetAndRoleAssignment( |
Do you need to assign the role for every subnet? It seems you can set the scope to a resource group, then it grants the role permission to all subnets in the resource group.
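The scope suggestion relies on Azure RBAC inheritance: an assignment at a parent scope applies to every resource whose ID sits under it. A minimal Python sketch of that prefix relationship follows. The function names and the sample subscription, resource group, and VNet values are hypothetical, not taken from the PR; only the resource ID layout is the standard Azure one.

```python
def resource_group_scope(subscription_id, resource_group):
    """Build the ARM scope string for a resource group."""
    return f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"


def subnet_scope(subscription_id, resource_group, vnet, subnet):
    """Build the ARM resource ID of a subnet inside a VNet."""
    return (
        f"{resource_group_scope(subscription_id, resource_group)}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet}/subnets/{subnet}"
    )


def scope_covers(assignment_scope, resource_id):
    """True if a role assigned at assignment_scope is inherited by resource_id."""
    parent = assignment_scope.rstrip("/").lower()
    child = resource_id.rstrip("/").lower()
    return child == parent or child.startswith(parent + "/")


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    sub, rg, vnet = "00000000-0000-0000-0000-000000000000", "example-rg", "example-net"
    rg_scope = resource_group_scope(sub, rg)
    for name in ("nodes", "firewall", "jumpbox"):
        print(name, scope_covers(rg_scope, subnet_scope(sub, rg, vnet, name)))
```

One assignment at the resource group scope covers all three subnets, at the cost of also granting the role on every other resource in the group, so the per-subnet form is the tighter option if least privilege matters here.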
| FW_NAME, VNET_NAME, "nodes", | ||
| "$(Pipeline.Workspace)/s/kcl/ccp_team/15K_cluster/firewall-policy.json" | ||
| ), | ||
| azure.AzCli( |
Is it possible to extract this out as a reusable lib?
Do you mean exporting a jumpbox resource library or a regular VM resource library?
Jumpbox. I suspect this is not the only pipeline that will need one.
| } | ||
| EOF | ||
|
|
||
| az rest \\ |
I may have asked this before, but why use the REST API rather than az aks create?
I was just following the old pattern; I will use az aks create for the hyperscale cluster.
This is exactly why we need to scrutinise every line of code: someone, or an AI, may "just follow" the existing code and patterns.
| FIREWALL_SUBNET_CIDR = "10.2.0.0/16" | ||
| JUMPBOX_SUBNET_CIDR = "10.3.0.0/16" | ||
|
|
||
| # Small values for small-scale e2e testing; replace with the lines below for full 15K run |
Using 100 contradicts your PR title and folder name. Please be consistent.
This was only for testing. I have tested it and will change it now.
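For reference, the relationship between the total node count, the per-pool cap, and the resulting pool count can be sketched as below. This is not the PR's PlanNodePools, whose implementation is not shown in this review; it is a hypothetical Python stand-in that splits nodes into the fewest pools of near-equal size. The cap of 800 used for the full run is an assumption, chosen only because it yields the 19 pools mentioned in the PR description.

```python
import math


def plan_node_pools(total_nodes, max_nodes_per_pool):
    """Split total_nodes into the fewest pools, sized as evenly as possible."""
    if total_nodes <= 0 or max_nodes_per_pool <= 0:
        raise ValueError("total_nodes and max_nodes_per_pool must be positive")
    pool_count = math.ceil(total_nodes / max_nodes_per_pool)
    base, extra = divmod(total_nodes, pool_count)
    # The first `extra` pools take one additional node so the sizes sum to the total.
    return [base + 1 if i < extra else base for i in range(pool_count)]


if __name__ == "__main__":
    print(plan_node_pools(100, 20))             # test values from this PR: 5 pools of 20
    print(len(plan_node_pools(14997, 800)))     # assumed cap of 800: 19 pools
```

With the test values quoted in the diff (100 nodes, 20 per pool) this gives five pools of 20; with 14997 nodes and the assumed cap it gives 19 pools of 789 or 790 nodes each.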
| USER_TOTAL_NODES = 100 | ||
| MAX_NODES_PER_POOL = 20 | ||
|
|
||
| _userPools = [azure.NodePool { |
nit: no need for underscore-prefixed variables; nothing is exported.
|
|
||
| VNET_NAME = "${CLUSTER_NAME}-net" | ||
| IDENTITY_NAME = "${CLUSTER_NAME}-identity" | ||
| FW_NAME = "${CLUSTER_NAME}-firewall" |
Could you check all the variables and make sure each one is used only when its value occurs in multiple places? A variable makes the code less readable, so use one only when necessary.
| --kubernetes-version "${K8S_VERSION}" \\ | ||
| --dns-name-prefix "${CLUSTER_NAME}-dns" \\ | ||
| --tier standard \\ | ||
| --assign-identity "$(IDENTITY_ID)" \\ |
Where is IDENTITY_ID defined?
Copilot said: I exported it as an environment variable, but it looks less readable. I’ll update the related code to improve readability.
| --dns-name-prefix "${CLUSTER_NAME}-dns" \\ | ||
| --tier standard \\ | ||
| --assign-identity "$(IDENTITY_ID)" \\ | ||
| --tags SkipAKSCluster=true \\ |
I thought we couldn't use this anymore, no?
This is to skip Gatekeeper, not the az sec pack.
| "Azure Kubernetes Service Cluster Admin Role" \\ | ||
| "Azure Kubernetes Service Cluster User Role" \\ | ||
| "Azure Kubernetes Service Contributor Role" \\ |
Is the AKS Contributor role a superset of the Admin and User roles?
❯ az role definition list --name "Azure Kubernetes Service Contributor Role"
[
{
"assignableScopes": [
"/"
],
"createdBy": null,
"createdOn": "2020-02-27T19:27:15.073997+00:00",
"description": "Grants access to read and write Azure Kubernetes Service clusters",
"id": "/subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/providers/Microsoft.Authorization/roleDefinitions/ed7f3fbd-7b88-4dd4-9017-9adb7ce333f8",
"name": "ed7f3fbd-7b88-4dd4-9017-9adb7ce333f8",
"permissions": [
{
"actions": [
"Microsoft.Authorization/*/read",
"Microsoft.ContainerService/locations/*",
"Microsoft.ContainerService/managedClusters/*",
"Microsoft.ContainerService/managedclustersnapshots/*",
"Microsoft.ContainerService/snapshots/*",
"Microsoft.Insights/alertRules/*",
"Microsoft.Resources/deployments/*",
"Microsoft.Resources/subscriptions/resourceGroups/read",
"Microsoft.ContainerService/deploymentSafeguards/*"
],
"condition": null,
"conditionVersion": null,
"dataActions": [],
"notActions": [],
"notDataActions": []
}
],
"roleName": "Azure Kubernetes Service Contributor Role",
"roleType": "BuiltInRole",
"type": "Microsoft.Authorization/roleDefinitions",
"updatedBy": null,
"updatedOn": "2025-07-24T15:33:31.343706+00:00"
}
]
❯ az role definition list --name "Azure Kubernetes Service Cluster User Role"
[
{
"assignableScopes": [
"/"
],
"createdBy": null,
"createdOn": "2018-08-15T22:04:53.403724+00:00",
"description": "List cluster user credential action.",
"id": "/subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/providers/Microsoft.Authorization/roleDefinitions/4abbcc35-e782-43d8-92c5-2d3f1bd2253f",
"name": "4abbcc35-e782-43d8-92c5-2d3f1bd2253f",
"permissions": [
{
"actions": [
"Microsoft.ContainerService/managedClusters/listClusterUserCredential/action",
"Microsoft.ContainerService/managedClusters/read"
],
"condition": null,
"conditionVersion": null,
"dataActions": [],
"notActions": [],
"notDataActions": []
}
],
"roleName": "Azure Kubernetes Service Cluster User Role",
"roleType": "BuiltInRole",
"type": "Microsoft.Authorization/roleDefinitions",
"updatedBy": null,
"updatedOn": "2021-11-11T20:13:20.435197+00:00"
}
]
❯ az role definition list --name "Azure Kubernetes Service Cluster Admin Role"
[
{
"assignableScopes": [
"/"
],
"createdBy": null,
"createdOn": "2018-08-15T21:38:18.595385+00:00",
"description": "List cluster admin credential action.",
"id": "/subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/providers/Microsoft.Authorization/roleDefinitions/0ab0b1a8-8aac-4efd-b8c2-3ee1fb270be8",
"name": "0ab0b1a8-8aac-4efd-b8c2-3ee1fb270be8",
"permissions": [
{
"actions": [
"Microsoft.ContainerService/managedClusters/listClusterAdminCredential/action",
"Microsoft.ContainerService/managedClusters/accessProfiles/listCredential/action",
"Microsoft.ContainerService/managedClusters/read",
"Microsoft.ContainerService/managedClusters/runcommand/action"
],
"condition": null,
"conditionVersion": null,
"dataActions": [],
"notActions": [],
"notDataActions": []
}
],
"roleName": "Azure Kubernetes Service Cluster Admin Role",
"roleType": "BuiltInRole",
"type": "Microsoft.Authorization/roleDefinitions",
"updatedBy": null,
"updatedOn": "2022-05-17T01:51:12.039065+00:00"
}
]
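The three outputs above can be compared mechanically instead of by eye. The sketch below is a minimal Python check, not part of the PR: it copies the `actions` lists from the pasted output and tests whether every Admin and User action matches one of the Contributor patterns. It assumes that `*` in a role action matches any string and that matching is case-insensitive, and it ignores `notActions` and `dataActions`, which are empty in all three definitions here. The helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Action lists copied from the `az role definition list` output above.
CONTRIBUTOR_ACTIONS = [
    "Microsoft.Authorization/*/read",
    "Microsoft.ContainerService/locations/*",
    "Microsoft.ContainerService/managedClusters/*",
    "Microsoft.ContainerService/managedclustersnapshots/*",
    "Microsoft.ContainerService/snapshots/*",
    "Microsoft.Insights/alertRules/*",
    "Microsoft.Resources/deployments/*",
    "Microsoft.Resources/subscriptions/resourceGroups/read",
    "Microsoft.ContainerService/deploymentSafeguards/*",
]
ADMIN_ACTIONS = [
    "Microsoft.ContainerService/managedClusters/listClusterAdminCredential/action",
    "Microsoft.ContainerService/managedClusters/accessProfiles/listCredential/action",
    "Microsoft.ContainerService/managedClusters/read",
    "Microsoft.ContainerService/managedClusters/runcommand/action",
]
USER_ACTIONS = [
    "Microsoft.ContainerService/managedClusters/listClusterUserCredential/action",
    "Microsoft.ContainerService/managedClusters/read",
]


def is_covered(action, granted_patterns):
    """True if the action matches at least one granted pattern (wildcards allowed)."""
    action = action.lower()
    return any(fnmatchcase(action, pattern.lower()) for pattern in granted_patterns)


def uncovered(required_actions, granted_patterns):
    """Return the required actions that no granted pattern covers."""
    return [a for a in required_actions if not is_covered(a, granted_patterns)]


if __name__ == "__main__":
    print(uncovered(ADMIN_ACTIONS, CONTRIBUTOR_ACTIONS))
    print(uncovered(USER_ACTIONS, CONTRIBUTOR_ACTIONS))
```

Under those assumptions both lists come back empty, because every Admin and User action falls under `Microsoft.ContainerService/managedClusters/*`. On the basis of this output alone, the Contributor role looks sufficient and the other two assignments appear redundant.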
| @@ -0,0 +1,788 @@ | |||
| { | |||
| "location": "eastasia", | |||
In the pipeline, location is a pipeline parameter, but here it's a constant. Would that cause problems?
Right, will delete it.
| }, | ||
| { | ||
| "fqdnTags": [], | ||
| "name": "gcr.io", |
Is gcr.io a superset of k8s.gcr.io?
No, but *.gcr.io is a superset of k8s.gcr.io, so this can be deleted.
| }, | ||
| { | ||
| "fqdnTags": [], | ||
| "name": "*.gcr.io", |
These two FQDNs, *.gcr.io and gcr.io, are different things in the firewall.
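The distinction drawn in this thread can be made concrete with a small matcher. This is a Python sketch of the semantics as the reviewers describe them (a `*.` rule covers subdomains but not the apex, and a bare name covers only itself), not Azure Firewall's actual implementation; the function name is hypothetical.

```python
def fqdn_rule_matches(rule, host):
    """Match a host against a firewall-style FQDN rule.

    A rule starting with "*." matches any subdomain but not the apex domain;
    any other rule matches only the exact host name.
    """
    rule = rule.lower().rstrip(".")
    host = host.lower().rstrip(".")
    if rule.startswith("*."):
        suffix = rule[1:]  # "*.gcr.io" -> ".gcr.io"
        return host.endswith(suffix) and len(host) > len(suffix)
    return host == rule


if __name__ == "__main__":
    for rule in ("*.gcr.io", "gcr.io"):
        for host in ("gcr.io", "k8s.gcr.io"):
            print(rule, host, fqdn_rule_matches(rule, host))
```

Under these semantics `*.gcr.io` makes a separate `k8s.gcr.io` entry redundant, while `gcr.io` still needs its own entry. The same reasoning applies to `*.quay.io` and `quay.io`.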
| }, | ||
| { | ||
| "fqdnTags": [], | ||
| "name": "*.lz4.dev", |
I also forgot the exact purpose, but I remember this FQDN being blocked before and it impacted benchmarking, so I added it to the whitelist.
| }, | ||
| { | ||
| "fqdnTags": [], | ||
| "name": "*.quay.io", |
Is this a superset of quay.io?
*.quay.io and quay.io are different in the firewall FQDN settings.
| } | ||
| ] |
Do we actually need to access all these websites, or did you copy the list from somewhere else?
We need these. They come from three sources:
- IMDS endpoint from Azure
- endpoints from https://learn.microsoft.com/en-us/azure/aks/outbound-rules-control-egress#azure-global-required-fqdn--application-rules
- endpoints behind the CL2 and Prometheus pod images; these can vary from run to run depending on where the images are served from.
Adds a KCL pipeline for provisioning a large-scale private AKS cluster, intended for 15K-node performance benchmarking.
- Full cluster setup: VNet, subnets, firewall, managed identity, jumpbox, AKS cluster, and user node pools
- Jumpbox cloud-init installs Azure CLI, Go, and kubectl
- Conditional networkDataplane support: set via the network_dataplane parameter, omitted if empty (uses the AKS default)
- Runtime-configurable parameters (subscription_id, location, cluster_name, user_pool_vm_size, network_dataplane) shown in the ADO Run dialog
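The conditional networkDataplane behaviour described in the list above amounts to leaving the key out of the request body when the parameter is empty. A minimal Python sketch of that idea follows. The PR does this in KCL and its code is not shown here, so the function name and the `networkPlugin` base value are assumptions made for illustration.

```python
def build_network_profile(network_dataplane=""):
    """Build a networkProfile fragment, omitting networkDataplane when it is empty."""
    profile = {"networkPlugin": "azure"}  # assumed base value, not taken from the PR
    if network_dataplane:
        # Only set the key when a value was supplied; otherwise AKS applies its default.
        profile["networkDataplane"] = network_dataplane
    return profile


if __name__ == "__main__":
    print(build_network_profile())
    print(build_network_profile("cilium"))
```

Omitting the key, rather than sending an empty string, leaves the choice of default to the service.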