Add 15K node cluster creation pipeline #1217

Open
xinWeiWei24 wants to merge 17 commits into v2 from xinwei/15k-cluster-pipeline

Conversation

@xinWeiWei24

Collaborator

Adds a KCL pipeline for provisioning a large-scale private AKS cluster, intended for 15K-node performance benchmarking.

- Full cluster setup: VNet, subnets, firewall, managed identity, jumpbox, AKS cluster, and user node pools
- Jumpbox cloud-init installs Azure CLI, Go, and kubectl
- Conditional networkDataplane support: set via the network_dataplane parameter, omitted if empty (uses the AKS default)
- Runtime-configurable parameters (subscription_id, location, cluster_name, user_pool_vm_size, network_dataplane) shown in the ADO Run dialog
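
The "omitted if empty" behaviour for networkDataplane can be sketched as below. This is an illustration in Python, not the pipeline's KCL: the helper name `build_network_profile` is hypothetical, and the real pipeline may build its request differently.

```python
def build_network_profile(network_dataplane: str = "") -> dict:
    """Return a networkProfile fragment, adding networkDataplane only when set."""
    profile: dict = {}
    if network_dataplane:
        # Non-empty parameter: pass the value through explicitly.
        profile["networkDataplane"] = network_dataplane
    # Empty parameter: the key is left out so the AKS default applies.
    return profile


if __name__ == "__main__":
    print(build_network_profile(""))
    print(build_network_profile("cilium"))
```

Leaving the key out, rather than sending an empty string, means the request never carries an invalid value when the parameter is not supplied.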

xinWeiWei24 and others added 8 commits June 10, 2026 15:43
Pipeline provisions a private AKS cluster with Cilium CNI, Azure Firewall,
custom VNet, jumpbox VM, and 14997 user nodes across 19 node pools planned
via PlanNodePools. All key params are ADO pipeline variables.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
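
As a rough illustration of how 14997 nodes could end up in 19 pools, the sketch below splits a node count into the fewest pools that respect a per-pool cap. It is not the repository's PlanNodePools implementation, which is not shown in this conversation. The function name `plan_node_pools` and the cap of 800 are assumptions chosen only because they reproduce the 19-pool figure; the 100/20 case uses the test values quoted later in the review.

```python
def plan_node_pools(total_nodes: int, max_nodes_per_pool: int) -> list[int]:
    """Split total_nodes into the fewest pools that respect the per-pool cap.

    Nodes are spread as evenly as possible; pool sizes differ by at most one.
    """
    if total_nodes < 0 or max_nodes_per_pool <= 0:
        raise ValueError("total_nodes must be >= 0 and max_nodes_per_pool > 0")
    if total_nodes == 0:
        return []
    # Ceiling division without floating point.
    pool_count = -(-total_nodes // max_nodes_per_pool)
    base, extra = divmod(total_nodes, pool_count)
    # The first `extra` pools take one additional node each.
    return [base + 1 if i < extra else base for i in range(pool_count)]


if __name__ == "__main__":
    small = plan_node_pools(100, 20)    # test-scale values from the review
    full = plan_node_pools(14997, 800)  # 800 is an assumed cap, not from the PR
    print(len(small), small)
    print(len(full), sum(full), max(full))
```

With these inputs the small case yields 5 pools of 20 nodes, and the full case yields 19 pools totalling 14997 nodes with at most 790 nodes in any pool.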
Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
),
azure.CreateSubnetAndRoleAssignment(
const.DEFAULT_SERVICE_CONNECTION,
RESOURCE_GROUP, VNET_NAME,

Collaborator

nit: one parameter per line please.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
JUMPBOX_NAME = "${CLUSTER_NAME}-jumpbox"

VNET_CIDR = "10.0.0.0/8"
NODES_SUBNET_CIDR = "10.1.0.0/18"

Collaborator

This variable is only used once; remove it for better readability.
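
Separately from the readability point, the CIDR values quoted in this review can be sanity-checked with Python's standard `ipaddress` module. The sketch below assumes each node consumes one address from the nodes subnet (pods not drawing from it), which this conversation does not confirm. The dictionary keys are labels for illustration, and the subtraction of 5 reflects the addresses Azure reserves in every subnet.

```python
import ipaddress
from itertools import combinations

AZURE_RESERVED_IPS_PER_SUBNET = 5  # Azure reserves 5 addresses in each subnet

VNET = ipaddress.ip_network("10.0.0.0/8")
SUBNETS = {
    "nodes": ipaddress.ip_network("10.1.0.0/18"),
    "firewall": ipaddress.ip_network("10.2.0.0/16"),
    "jumpbox": ipaddress.ip_network("10.3.0.0/16"),
}


def usable_ips(subnet: ipaddress.IPv4Network) -> int:
    """Addresses left for resources after Azure's reserved ones."""
    return subnet.num_addresses - AZURE_RESERVED_IPS_PER_SUBNET


def validate(vnet: ipaddress.IPv4Network, subnets: dict) -> list[str]:
    """Return a list of layout problems; an empty list means none were found."""
    problems = []
    for name, net in subnets.items():
        if not net.subnet_of(vnet):
            problems.append(f"{name} ({net}) is outside {vnet}")
    for (a, net_a), (b, net_b) in combinations(subnets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems


if __name__ == "__main__":
    print(validate(VNET, SUBNETS))
    print(usable_ips(SUBNETS["nodes"]))
```

Under that one-address-per-node assumption, the /18 nodes subnet leaves 16379 usable addresses, which is above the 14997 user nodes mentioned in the commit message.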

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
const.DEFAULT_SERVICE_CONNECTION,
RESOURCE_GROUP, VNET_NAME,
"nodes", NODES_SUBNET_CIDR,
SUBSCRIPTION_ID, "$(IDENTITY_PRINCIPAL_ID)"

Collaborator

Where is "IDENTITY_PRINCIPAL_ID" defined?

@xinWeiWei24 xinWeiWei24 Jun 15, 2026

Collaborator Author

It will be exposed through identity.k.
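
For context, `$(IDENTITY_PRINCIPAL_ID)` is Azure DevOps macro syntax, which is replaced with a pipeline variable's value before the step runs. One common way for an earlier step to define such a variable is to write a `task.setvariable` logging command to stdout. Whether identity.k does this is not shown in this thread, so the sketch below only illustrates the mechanism; the helper name and the placeholder value are hypothetical.

```python
def set_pipeline_variable(name: str, value: str, is_output: bool = False) -> str:
    """Format an Azure DevOps logging command that sets a pipeline variable."""
    props = f"variable={name}"
    if is_output:
        # Output variables can be read by later jobs and stages.
        props += ";isOutput=true"
    return f"##vso[task.setvariable {props}]{value}"


if __name__ == "__main__":
    # Placeholder GUID, not a real principal ID.
    print(set_pipeline_variable(
        "IDENTITY_PRINCIPAL_ID",
        "00000000-0000-0000-0000-000000000000",
    ))
```

A step that prints this line makes the variable available to later steps in the same job as `$(IDENTITY_PRINCIPAL_ID)`.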

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
VNET_CIDR,
SUBSCRIPTION_ID
),
azure.CreateSubnetAndRoleAssignment(

Collaborator

Do you need to assign the role for every subnet? It seems you could set the scope to the resource group instead, which would grant the role on all subnets in that resource group.
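
The suggestion relies on Azure RBAC scope inheritance: an assignment made at a parent scope also applies to the resources beneath it. A minimal sketch of that rule, treating scopes as resource ID prefixes, is shown below. The function name, subscription ID, resource group, and VNet names are placeholders, not values from this PR.

```python
def scope_covers(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at assignment_scope is inherited by resource_id."""
    scope = assignment_scope.rstrip("/").lower()
    target = resource_id.rstrip("/").lower()
    # Match whole path segments only, so ".../rg" does not cover ".../rg2".
    return target == scope or target.startswith(scope + "/")


if __name__ == "__main__":
    sub = "/subscriptions/00000000-0000-0000-0000-000000000000"  # placeholder
    rg = f"{sub}/resourceGroups/example-rg"
    vnet = f"{rg}/providers/Microsoft.Network/virtualNetworks/example-net"
    for subnet in ("nodes", "jumpbox"):
        print(subnet, scope_covers(rg, f"{vnet}/subnets/{subnet}"))
```

The trade-off is breadth: a resource-group assignment covers every resource in the group, not only the subnets, so it is simpler but less tightly scoped than per-subnet assignments.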

FW_NAME, VNET_NAME, "nodes",
"$(Pipeline.Workspace)/s/kcl/ccp_team/15K_cluster/firewall-policy.json"
),
azure.AzCli(

Collaborator

Is it possible to extract this out as a reusable lib?

@xinWeiWei24 xinWeiWei24 Jun 15, 2026

Collaborator Author

Do you mean exporting a jumpbox resource library or a regular VM resource library?

Collaborator

A jumpbox library. I suspect this is not the only pipeline that will need a jumpbox.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
}
EOF

az rest \\

Collaborator

I may have asked this before, but why use the REST API over az aks create?

Collaborator Author

I was just following the old pattern; I will use az aks create for the hyperscale cluster.

Collaborator

This is exactly why we need to scrutinise every line of code: someone, or an AI, may "just follow" the code and pattern.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
FIREWALL_SUBNET_CIDR = "10.2.0.0/16"
JUMPBOX_SUBNET_CIDR = "10.3.0.0/16"

# Small values for small-scale e2e testing; replace with the lines below for full 15K run

Collaborator

Using 100 contradicts your PR title and folder name. Please be consistent.

Collaborator Author

This was just for testing. I have tested it and will change it now.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
USER_TOTAL_NODES = 100
MAX_NODES_PER_POOL = 20

_userPools = [azure.NodePool {

Collaborator

nit: there is no need for underscore-prefixed variables, since nothing is exported.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated

VNET_NAME = "${CLUSTER_NAME}-net"
IDENTITY_NAME = "${CLUSTER_NAME}-identity"
FW_NAME = "${CLUSTER_NAME}-firewall"

Collaborator

Could you check all the variables and make sure each one is used in multiple places? A variable that is used only once makes the code less readable, so introduce one only when necessary.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
--kubernetes-version "${K8S_VERSION}" \\
--dns-name-prefix "${CLUSTER_NAME}-dns" \\
--tier standard \\
--assign-identity "$(IDENTITY_ID)" \\

Collaborator

Where is IDENTITY_ID defined?

Collaborator Author

Copilot said: I exported it as an environment variable, but it looks less readable. I’ll update the related code to improve readability.

Comment thread kcl/ccp_team/15K_cluster/pipeline.k Outdated
--dns-name-prefix "${CLUSTER_NAME}-dns" \\
--tier standard \\
--assign-identity "$(IDENTITY_ID)" \\
--tags SkipAKSCluster=true \\

Collaborator

I thought we couldn't use this tag anymore, no?

Collaborator Author

This is to skip gatekeeper, not the az sec pack

Comment on lines +57 to +59
"Azure Kubernetes Service Cluster Admin Role" \\
"Azure Kubernetes Service Cluster User Role" \\
"Azure Kubernetes Service Contributor Role" \\

Collaborator

Is the AKS Contributor role a superset of the Cluster Admin and Cluster User roles?

❯ az role definition list --name "Azure Kubernetes Service Contributor Role"
[
  {
    "assignableScopes": [
      "/"
    ],
    "createdBy": null,
    "createdOn": "2020-02-27T19:27:15.073997+00:00",
    "description": "Grants access to read and write Azure Kubernetes Service clusters",
    "id": "/subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/providers/Microsoft.Authorization/roleDefinitions/ed7f3fbd-7b88-4dd4-9017-9adb7ce333f8",
    "name": "ed7f3fbd-7b88-4dd4-9017-9adb7ce333f8",
    "permissions": [
      {
        "actions": [
          "Microsoft.Authorization/*/read",
          "Microsoft.ContainerService/locations/*",
          "Microsoft.ContainerService/managedClusters/*",
          "Microsoft.ContainerService/managedclustersnapshots/*",
          "Microsoft.ContainerService/snapshots/*",
          "Microsoft.Insights/alertRules/*",
          "Microsoft.Resources/deployments/*",
          "Microsoft.Resources/subscriptions/resourceGroups/read",
          "Microsoft.ContainerService/deploymentSafeguards/*"
        ],
        "condition": null,
        "conditionVersion": null,
        "dataActions": [],
        "notActions": [],
        "notDataActions": []
      }
    ],
    "roleName": "Azure Kubernetes Service Contributor Role",
    "roleType": "BuiltInRole",
    "type": "Microsoft.Authorization/roleDefinitions",
    "updatedBy": null,
    "updatedOn": "2025-07-24T15:33:31.343706+00:00"
  }
]

~ via  v22.22.2
❯ az role definition list --name "Azure Kubernetes Service Cluster User Role"
[
  {
    "assignableScopes": [
      "/"
    ],
    "createdBy": null,
    "createdOn": "2018-08-15T22:04:53.403724+00:00",
    "description": "List cluster user credential action.",
    "id": "/subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/providers/Microsoft.Authorization/roleDefinitions/4abbcc35-e782-43d8-92c5-2d3f1bd2253f",
    "name": "4abbcc35-e782-43d8-92c5-2d3f1bd2253f",
    "permissions": [
      {
        "actions": [
          "Microsoft.ContainerService/managedClusters/listClusterUserCredential/action",
          "Microsoft.ContainerService/managedClusters/read"
        ],
        "condition": null,
        "conditionVersion": null,
        "dataActions": [],
        "notActions": [],
        "notDataActions": []
      }
    ],
    "roleName": "Azure Kubernetes Service Cluster User Role",
    "roleType": "BuiltInRole",
    "type": "Microsoft.Authorization/roleDefinitions",
    "updatedBy": null,
    "updatedOn": "2021-11-11T20:13:20.435197+00:00"
  }
]

~ via  v22.22.2
❯ az role definition list --name "Azure Kubernetes Service Cluster Admin Role"
[
  {
    "assignableScopes": [
      "/"
    ],
    "createdBy": null,
    "createdOn": "2018-08-15T21:38:18.595385+00:00",
    "description": "List cluster admin credential action.",
    "id": "/subscriptions/feb5b150-60fe-4441-be73-8c02a524f55a/providers/Microsoft.Authorization/roleDefinitions/0ab0b1a8-8aac-4efd-b8c2-3ee1fb270be8",
    "name": "0ab0b1a8-8aac-4efd-b8c2-3ee1fb270be8",
    "permissions": [
      {
        "actions": [
          "Microsoft.ContainerService/managedClusters/listClusterAdminCredential/action",
          "Microsoft.ContainerService/managedClusters/accessProfiles/listCredential/action",
          "Microsoft.ContainerService/managedClusters/read",
          "Microsoft.ContainerService/managedClusters/runcommand/action"
        ],
        "condition": null,
        "conditionVersion": null,
        "dataActions": [],
        "notActions": [],
        "notDataActions": []
      }
    ],
    "roleName": "Azure Kubernetes Service Cluster Admin Role",
    "roleType": "BuiltInRole",
    "type": "Microsoft.Authorization/roleDefinitions",
    "updatedBy": null,
    "updatedOn": "2022-05-17T01:51:12.039065+00:00"
  }
]
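
The three outputs above can be compared mechanically. The sketch below copies the `actions` lists from those outputs and checks whether every Cluster Admin and Cluster User action is matched by some Contributor action, assuming that `*` in an action string matches any sequence of characters and that matching is case-insensitive. Only `actions` are compared, since `notActions` and `dataActions` are empty in all three outputs. The function names are illustrative.

```python
import re

CONTRIBUTOR = [
    "Microsoft.Authorization/*/read",
    "Microsoft.ContainerService/locations/*",
    "Microsoft.ContainerService/managedClusters/*",
    "Microsoft.ContainerService/managedclustersnapshots/*",
    "Microsoft.ContainerService/snapshots/*",
    "Microsoft.Insights/alertRules/*",
    "Microsoft.Resources/deployments/*",
    "Microsoft.Resources/subscriptions/resourceGroups/read",
    "Microsoft.ContainerService/deploymentSafeguards/*",
]
CLUSTER_ADMIN = [
    "Microsoft.ContainerService/managedClusters/listClusterAdminCredential/action",
    "Microsoft.ContainerService/managedClusters/accessProfiles/listCredential/action",
    "Microsoft.ContainerService/managedClusters/read",
    "Microsoft.ContainerService/managedClusters/runcommand/action",
]
CLUSTER_USER = [
    "Microsoft.ContainerService/managedClusters/listClusterUserCredential/action",
    "Microsoft.ContainerService/managedClusters/read",
]


def action_matches(pattern: str, action: str) -> bool:
    """True if a (possibly wildcarded) action pattern matches a concrete action."""
    # Escape literal parts and let each '*' match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.fullmatch(regex, action, flags=re.IGNORECASE) is not None


def uncovered(granted: list[str], required: list[str]) -> list[str]:
    """Return the required actions that no granted pattern matches."""
    return [a for a in required
            if not any(action_matches(p, a) for p in granted)]


if __name__ == "__main__":
    print(uncovered(CONTRIBUTOR, CLUSTER_ADMIN))
    print(uncovered(CONTRIBUTOR, CLUSTER_USER))
```

Both calls return an empty list, because every Admin and User action falls under `Microsoft.ContainerService/managedClusters/*`. By this check the Contributor role's actions cover the other two.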

@@ -0,0 +1,788 @@
{
"location": "eastasia",

Collaborator

In the pipeline, location is a pipeline parameter, but here it's a constant. Would that cause problems?

Collaborator Author

Right, will delete it

},
{
"fqdnTags": [],
"name": "gcr.io",

Collaborator

Is gcr.io a superset of k8s.gcr.io?

Collaborator Author

No, but *.gcr.io is a superset of k8s.gcr.io, so this can be deleted.

},
{
"fqdnTags": [],
"name": "*.gcr.io",

Collaborator

Another gcr.io rule

Collaborator Author

These two FQDNs, *.gcr.io and gcr.io, are different things in the firewall.
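
The distinction drawn in these replies can be modelled as follows: a leading `*.` rule matches any subdomain but not the apex domain, while a plain rule matches only that exact name. This sketch encodes those semantics as described in the thread and does not query Azure Firewall. The function names are illustrative, and the sample rule list uses only names mentioned in this review.

```python
def fqdn_rule_matches(rule: str, host: str) -> bool:
    """True if an allowlist rule matches host under the assumed semantics."""
    rule = rule.lower().rstrip(".")
    host = host.lower().rstrip(".")
    if rule.startswith("*."):
        suffix = rule[1:]  # e.g. ".gcr.io"
        # Require at least one label before the suffix, so the apex is excluded.
        return host.endswith(suffix) and len(host) > len(suffix)
    return host == rule


def redundant_exact_rules(rules: list[str]) -> list[str]:
    """Exact-name rules that are already matched by a wildcard rule in the list."""
    wildcards = [r for r in rules if r.startswith("*.")]
    return [r for r in rules
            if not r.startswith("*.")
            and any(fqdn_rule_matches(w, r) for w in wildcards)]


if __name__ == "__main__":
    rules = ["gcr.io", "*.gcr.io", "k8s.gcr.io", "*.quay.io", "quay.io"]
    print(redundant_exact_rules(rules))
```

With this list, only `k8s.gcr.io` is reported as redundant; `gcr.io` and `quay.io` stay because their wildcard counterparts do not match the apex names.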

},
{
"fqdnTags": [],
"name": "*.lz4.dev",

Collaborator

What is this?

Collaborator Author

I have forgotten the exact purpose, but I remember this FQDN being blocked before, which impacted benchmarking, so I added it to the whitelist.

},
{
"fqdnTags": [],
"name": "*.quay.io",

Collaborator

Is this a superset of quay.io?

Collaborator Author

*.quay.io and quay.io are treated differently in the firewall's FQDN settings.

Comment on lines +782 to +783
}
]

Collaborator

Do we actually need to access all these websites, or did you copy the list from somewhere else?

Collaborator Author

We need these. They come from three sources:

  1. IMDS endpoint from Azure
  2. endpoints from https://learn.microsoft.com/en-us/azure/aks/outbound-rules-control-egress#azure-global-required-fqdn--application-rules
  3. endpoints behind the CL2 and Prometheus pod images — these can vary each time depending on where the images are served from.
