Align FlexNode config with RP bootstrap data#195
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates AKS FlexNode to accept the AKS Resource Provider listBootstrapData payload shape and normalize it into the agent’s canonical config, while adding explicit configuration for the target agent pool name and the Azure Resource Manager endpoint so ARM/Arc client behavior is consistent across clouds.
Changes:
- Accept and normalize RP bootstrap data (
components,networking,node) into the runtimepkg/config.Configon load. - Introduce
azure.targetAgentPoolName(defaulting toaksflexnodes) and use it for ARM Machine resource IDs and config generation. - Introduce
azure.resourceManagerEndpointand centralize ARM endpoint/audience/authority derivation for Azure SDK clients, including Arc flows.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/install.sh | Updates sample config JSON to include ARM endpoint and target agent pool name. |
| scripts/aks-flex-config | Adds --agent-pool-name and writes ARM endpoint + target pool into generated configs. |
| README.md | Updates quickstart to pass agent pool name and reflects new config fields. |
| pkg/config/config.go | Adds ARM endpoint + target pool defaults, adapts RP bootstrap data on load, and updates auth validation behavior. |
| pkg/config/config_test.go | Updates/extends tests for new defaults, auth rules, and RP bootstrap normalization. |
| pkg/config/bootstrap_data.go | Implements RP bootstrap payload normalization into canonical config fields. |
| pkg/azclient/resource_manager.go | Centralizes ARM endpoint/audience/authority derivation and Azure SDK client options. |
| pkg/azclient/resource_manager_test.go | Adds unit tests for ARM environment derivation and client options. |
| pkg/arc/arc_installer.go | Switches Arc credential/token/client setup to shared ARM environment handling. |
| pkg/aksmachine/types.go | Tightens validation to keep maxPods within int32 bounds for ARM payloads. |
| pkg/aksmachine/client_armapi.go | Uses targetAgentPoolName for Machine resource IDs and shared ARM client options/credentials. |
| pkg/aksmachine/client_armapi_test.go | Updates tests for configurable pool name, client options, and kubelet config behavior. |
| hack/e2e/run.sh | Documents new E2E env var for target agent pool name. |
| hack/e2e/README.md | Documents E2E_TARGET_AGENT_POOL_NAME. |
| hack/e2e/lib/node-join-token.sh | Passes --agent-pool-name when generating token-based node configs. |
| hack/e2e/lib/node-join-msi.sh | Includes ARM endpoint + target pool name in MSI node join config. |
| hack/e2e/lib/node-join-kubeadm.sh | Includes ARM endpoint + target pool name in kubeadm node join config. |
| hack/e2e/lib/common.sh | Sets/logs default E2E_TARGET_AGENT_POOL_NAME. |
| docs/usages/joining-nodes.md | Updates minimal config examples to include ARM endpoint and target pool name. |
| docs/usages/configuration.md | Documents new fields, updated auth rules, and RP bootstrap-to-runtime field mapping. |
| docs/usages/aks-flex-config.md | Documents --agent-pool-name and the generated config fields. |
| docs/labs/gpu-node-setup.md | Updates lab to pass agent pool name through config generation. |
| docs/labs/aks-public-cluster-unbounded-net-wireguard.md | Updates lab to pass agent pool name through config generation. |
| docs/labs/aks-private-cluster-unbounded-net.md | Updates lab to pass agent pool name through config generation. |
| docs/labs/aks-private-cluster-cilium.md | Updates lab to pass agent pool name through config generation. |
| .env.example | Adds E2E_TARGET_AGENT_POOL_NAME example configuration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if c.requiresTenantID() && c.TenantID == "" { | ||
| return fmt.Errorf("azure.tenantId is required") | ||
| } |
There was a problem hiding this comment.
this validation makes sense, can we add one as a dedicate method? @wenxuan0923
There was a problem hiding this comment.
Updated in a3b0a7f: endpoint validation now lives in AzureConfig.validateResourceManagerEndpointURL(), requires an absolute https URL, and rejects user info, explicit ports, path, query, and fragment. Added tests for missing scheme, http, path, port, and user info cases.
There was a problem hiding this comment.
Yes this is now in method validateResourceManagerEndpointURL
| // poolBootstrapData is the AKS RP-generated bootstrap config shape. The agent | ||
| // accepts it at load time and normalizes it into the runtime Config shape. | ||
| // this is based on v20260502preview AKS RP API version | ||
| type poolBootstrapData struct { |
| ) | ||
|
|
||
| // poolBootstrapData is the AKS RP-generated bootstrap config shape. The agent | ||
| // accepts it at load time and normalizes it into the runtime Config shape. |
There was a problem hiding this comment.
do we want to make the runtime config shape to follow AKS RP side settings, then add compatibility support for the current config in main? We can eventually remove the compatibility support in a few versions later.
There was a problem hiding this comment.
Sounds good, updated the code to make the RP side setting first class citizen now, with a backward adapter for legacy config
There was a problem hiding this comment.
but do we want to replace current Config with this poolBootstrapData now?
| func (c *AzureConfig) validateResourceManagerEndpointURL() error { | ||
| endpoint := strings.TrimSpace(c.ResourceManagerEndpointURL) | ||
| if endpoint == "" { | ||
| return fmt.Errorf("azure.resourceManagerEndpoint is required") | ||
| } | ||
| parsed, err := url.Parse(endpoint) | ||
| if err != nil || parsed.Scheme == "" || parsed.Host == "" { | ||
| return fmt.Errorf("azure.resourceManagerEndpoint must be an absolute https URL") | ||
| } | ||
| if parsed.Scheme != "https" { | ||
| return fmt.Errorf("azure.resourceManagerEndpoint must use https") | ||
| } | ||
| if parsed.Path != "" || parsed.RawQuery != "" || parsed.Fragment != "" { | ||
| return fmt.Errorf("azure.resourceManagerEndpoint must not include a path, query, or fragment") | ||
| } | ||
| return nil | ||
| } |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
| @@ -615,6 +653,9 @@ | |||
| cfg.Azure.TargetCluster.ResourceGroup = resourceGroupName | |||
| cfg.Azure.TargetCluster.SubscriptionID = subscriptionID | |||
| cfg.Azure.TargetCluster.NodeResourceGroup = mcResourceGroup | |||
There was a problem hiding this comment.
Updated in 2d96feb. We do not currently need TargetCluster.NodeResourceGroup in runtime code; only the target cluster name/resource group/subscription are consumed. I removed the MC_* derivation entirely so we no longer fabricate a misleading node resource group when location is omitted or when the AKS cluster uses a custom node resource group. Added coverage for both location-present and location-omitted cases.
There was a problem hiding this comment.
Follow-up in d5cfefd: removed the unused TargetCluster.NodeResourceGroup field and the related test assertion completely. The agent still derives the target cluster name/resource group/subscription from the resource ID, but no node resource group code remains in runtime/config.
There was a problem hiding this comment.
NodeResourceGroup property is not used at all, removed it.
| // poolBootstrapData is the AKS RP-generated bootstrap config shape. The agent | ||
| // accepts it at load time and normalizes it into the runtime Config shape. | ||
| // this is based on v20260502preview AKS RP API version | ||
| type poolBootstrapData struct { |
| "arc": { | ||
| "machineName": "YOUR_MACHINE_NAME", | ||
| "tags": { |
There was a problem hiding this comment.
Fixed in 1b2643e: the installer example now sets azure.arc.enabled to true so the sample actually selects Arc authentication.
| "node": { | ||
| "kubelet": { | ||
| "serverURL": "https://<aks-api-server>", // AKS API server endpoint. | ||
| "clusterFQDN": "<aks-api-server-fqdn>", // AKS API server FQDN. | ||
| "caCertData": "<base64-ca-data>" // Cluster CA bundle from kubeconfig. | ||
| } |
There was a problem hiding this comment.
Fixed in 1b2643e: scripts/aks-flex-config now renders the canonical node.kubelet.clusterFQDN field instead of the legacy node.kubelet.serverURL field. The README example remains aligned with the generated output.
| "agent": { | ||
| "logLevel": "info", // Agent log verbosity. | ||
| "logDir": "/var/log/aks-flex-node" // Host log directory. | ||
| }, | ||
| "kubernetes": { "version": "<aks-kubernetes-version>" } // Kubelet version to install. | ||
| "components": { "kubernetes": "<aks-kubernetes-version>" } // Kubelet version to install. | ||
| } |
There was a problem hiding this comment.
Fixed in 1b2643e: scripts/aks-flex-config now renders components.kubernetes instead of the legacy kubernetes.version shape.
| // When using bootstrap token, clusterFQDN and caCertData are required in kubelet config | ||
| // because there's no Azure authentication to fetch them | ||
| if c.Node.Kubelet.ServerURL == "" { | ||
| return fmt.Errorf("node.kubelet.serverURL is required when using bootstrap token authentication") | ||
| if c.APIServerURL() == "" { | ||
| return fmt.Errorf("node.kubelet.clusterFQDN is required when using bootstrap token authentication") | ||
| } |
There was a problem hiding this comment.
Fixed in 1b2643e: bootstrap-token validation now checks that node.kubelet.clusterFQDN resolves to an absolute HTTPS kube-apiserver URL and rejects malformed URLs, HTTP, user info, path, query, and fragment. Added table coverage for accepted host/HTTPS inputs and rejected invalid cases.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 33 out of 33 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
pkg/daemon/daemon_test.go:20
- This test sets node.kubelet.clusterFQDN to a full URL and then asserts restCfg.Host equals the raw ClusterFQDN. ClusterFQDN is documented/used as a FQDN/host input, and bootstrapCredentialRESTConfig intentionally uses cfg.APIServerURL() (derived URL) as restCfg.Host. The test should set ClusterFQDN to a host value and assert Host against cfg.APIServerURL() to validate the intended normalization logic (including defaulting :443).
cfg.Node.Kubelet.ClusterFQDN = "https://example.test"
cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca"))
cfg.Azure.BootstrapToken = &config.BootstrapTokenConfig{Token: "token.value"}
restCfg, err := bootstrapCredentialRESTConfig(cfg)
| "subscriptionId": metadata["subscription_id"], | ||
| "tenantId": metadata["tenant_id"], | ||
| "cloud": "AzurePublicCloud", | ||
| "resourceManagerEndpoint": RESOURCE_MANAGER_ENDPOINT, | ||
| "targetAgentPoolName": metadata["agent_pool_name"], | ||
| "targetCluster": { |
| ```jsonc | ||
| { | ||
| "azure": { | ||
| "subscriptionId": "<subscription-id>", // Azure subscription that owns the AKS cluster. | ||
| "tenantId": "<tenant-id>", // Microsoft Entra tenant for the subscription. | ||
| "cloud": "AzurePublicCloud", // Azure cloud environment. | ||
| "resourceManagerEndpoint": "https://management.azure.com", // Azure Resource Manager endpoint for ARM calls. | ||
| "targetAgentPoolName": "<agent-pool-name>", // AKS agent pool used for FlexNode machine registration. |
| func machineResourceIDFromConfig(cfg *config.Config) (*arm.ResourceID, error) { | ||
| if cfg.Azure.TargetCluster.ResourceID == "" || cfg.Agent.NodeName == "" || cfg.Kubernetes.Version == "" { | ||
| return nil, fmt.Errorf("incomplete AKS machine config: clusterResourceId=%q machineName=%q kubernetesVersion=%q", | ||
| cfg.Azure.TargetCluster.ResourceID, cfg.Agent.NodeName, cfg.Kubernetes.Version) | ||
| var clusterResourceID, agentPoolName, machineName, k8sVersion string | ||
| if cfg != nil { | ||
| if cfg.Azure.TargetCluster != nil { | ||
| clusterResourceID = cfg.Azure.TargetCluster.ResourceID | ||
| } | ||
| agentPoolName = strings.TrimSpace(cfg.Azure.TargetAgentPoolName) | ||
| machineName = cfg.Agent.NodeName | ||
| k8sVersion = cfg.Components.Kubernetes | ||
| } | ||
| if clusterResourceID == "" || agentPoolName == "" || machineName == "" || k8sVersion == "" { | ||
| return nil, fmt.Errorf("incomplete AKS machine config: clusterResourceId=%q targetAgentPoolName=%q machineName=%q kubernetesVersion=%q", | ||
| clusterResourceID, agentPoolName, machineName, k8sVersion) | ||
| } |
| cfg := &config.Config{} | ||
| cfg.Node.Kubelet.ServerURL = "https://example.test" | ||
| cfg.Node.Kubelet.ClusterFQDN = "https://example.test" | ||
| cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca")) | ||
| cfg.Azure.BootstrapToken = &config.BootstrapTokenConfig{Token: "token.value"} | ||
|
|
||
| restCfg, err := bootstrapCredentialRESTConfig(cfg) | ||
| if err != nil { | ||
| t.Fatalf("bootstrapCredentialRESTConfig: %v", err) | ||
| } | ||
| if restCfg.Host != cfg.Node.Kubelet.ServerURL || restCfg.BearerToken != cfg.Azure.BootstrapToken.Token { | ||
| if restCfg.Host != cfg.Node.Kubelet.ClusterFQDN || restCfg.BearerToken != cfg.Azure.BootstrapToken.Token { | ||
| t.Fatalf("rest config = %#v", restCfg) |
| cfg := &config.Config{} | ||
| cfg.Node.Kubelet.ServerURL = "https://example.test" | ||
| cfg.Node.Kubelet.ClusterFQDN = "https://example.test" | ||
| cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca")) | ||
| cfg.Azure.ServicePrincipal = &config.ServicePrincipalConfig{TenantID: "tenant", ClientID: "client", ClientSecret: "secret"} |
| cfg := &config.Config{} | ||
| cfg.Node.Kubelet.ServerURL = "https://example.test" | ||
| cfg.Node.Kubelet.ClusterFQDN = "https://example.test" | ||
| cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca")) |
| if parsed.User != nil { | ||
| if opts.allowPort { | ||
| return fmt.Errorf("%s must not include user info", opts.fieldName) | ||
| } | ||
| return fmt.Errorf("%s must not include user info or port", opts.fieldName) | ||
| } | ||
| if !opts.allowPort && parsed.Port() != "" { | ||
| return fmt.Errorf("%s must not include user info or port", opts.fieldName) | ||
| } |
|
@copilot resolve the merge conflicts in this pull request |
# Conflicts: # pkg/npd/start.go
Resolved in 2026ca3. The only conflict was in |
Summary
listBootstrapDatashape and normalize it into the local agent config.azure.targetAgentPoolNamesupport and use it for ARM Machine resource paths, with a compatibility default ofaksflexnodes.azure.resourceManagerEndpointsupport and share Resource Manager endpoint/audience/authority handling across Machine and Arc clients.scripts/aks-flex-config generate-node-configbackward compatible by defaulting--agent-pool-nametoaksflexnodes.Validation
go test ./...go test -tags local_e2e ./...python3 -m py_compile scripts/aks-flex-configscripts/aks-flex-config generate-node-config --helpbash -n hack/e2e/run.sh hack/e2e/lib/common.sh hack/e2e/lib/node-join-kubeadm.sh hack/e2e/lib/node-join-msi.sh hack/e2e/lib/node-join-token.sh scripts/install.shgit diff --checkStandalone validation
Validated against AKS standalone
wenxuanebld168272832using RPlistBootstrapDataandarmProxyURLOverrideForE2E:aksflexnodesFlexNodes pool.vm-flexnode-06161434with the dev AKSFlexNode binary.Succeeded.Ready.The local detailed report remains uncommitted per local validation workflow.