Align FlexNode config with RP bootstrap data #195

Open

wenxuan0923 wants to merge 14 commits into main from wenx/api-align

Conversation

@wenxuan0923

Collaborator

Summary

  • Accept AKS RP listBootstrapData shape and normalize it into the local agent config.
  • Add azure.targetAgentPoolName support and use it for ARM Machine resource paths, with a compatibility default of aksflexnodes.
  • Add azure.resourceManagerEndpoint support and share Resource Manager endpoint/audience/authority handling across Machine and Arc clients.
  • Keep scripts/aks-flex-config generate-node-config backward compatible by defaulting --agent-pool-name to aksflexnodes.
  • Update E2E config generation and docs to include the target agent pool name.
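
A minimal sketch of how the target agent pool name could feed into an ARM Machine resource path, with the `aksflexnodes` default applied when the field is omitted. The helper name `machineResourceID` and the choice to apply the default inside the helper are assumptions for illustration; the PR may apply the default at config load instead.

```go
package main

import (
	"fmt"
	"strings"
)

// defaultAgentPoolName mirrors the compatibility default named in the summary.
const defaultAgentPoolName = "aksflexnodes"

// machineResourceID is a hypothetical helper. It appends the agent pool and
// machine segments to an AKS cluster resource ID.
func machineResourceID(clusterResourceID, agentPoolName, machineName string) (string, error) {
	pool := strings.TrimSpace(agentPoolName)
	if pool == "" {
		// Fall back for configs written before the field existed.
		pool = defaultAgentPoolName
	}
	cluster := strings.TrimRight(strings.TrimSpace(clusterResourceID), "/")
	if cluster == "" || machineName == "" {
		return "", fmt.Errorf("incomplete machine config: clusterResourceId=%q machineName=%q", cluster, machineName)
	}
	return fmt.Sprintf("%s/agentPools/%s/machines/%s", cluster, pool, machineName), nil
}

func main() {
	id, err := machineResourceID(
		"/subscriptions/sub/resourceGroups/rg/providers/Microsoft.ContainerService/managedClusters/demo",
		"", // omitted, so the default pool name is used
		"vm-flexnode-01",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```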

Validation

  • go test ./...
  • go test -tags local_e2e ./...
  • python3 -m py_compile scripts/aks-flex-config
  • scripts/aks-flex-config generate-node-config --help
  • bash -n hack/e2e/run.sh hack/e2e/lib/common.sh hack/e2e/lib/node-join-kubeadm.sh hack/e2e/lib/node-join-msi.sh hack/e2e/lib/node-join-token.sh scripts/install.sh
  • git diff --check

Standalone validation

Validated against AKS standalone wenxuanebld168272832 using RP listBootstrapData and armProxyURLOverrideForE2E:

  • Created cluster and aksflexnodes FlexNodes pool.
  • Used RP bootstrap data as base agent config.
  • Bootstrapped VM vm-flexnode-06161434 with the dev AKSFlexNode binary.
  • Verified ARM Machine exists and is Succeeded.
  • Verified Kubernetes node joined and is Ready.
  • Verified smoke pod scheduled on the Flex node.

The local detailed report remains uncommitted per local validation workflow.

Copilot AI review requested due to automatic review settings June 16, 2026 22:37

Copilot AI left a comment

Contributor

Pull request overview

This PR updates AKS FlexNode to accept the AKS Resource Provider listBootstrapData payload shape and normalize it into the agent’s canonical config, while adding explicit configuration for the target agent pool name and the Azure Resource Manager endpoint so ARM/Arc client behavior is consistent across clouds.

Changes:

  • Accept and normalize RP bootstrap data (components, networking, node) into the runtime pkg/config.Config on load.
  • Introduce azure.targetAgentPoolName (defaulting to aksflexnodes) and use it for ARM Machine resource IDs and config generation.
  • Introduce azure.resourceManagerEndpoint and centralize ARM endpoint/audience/authority derivation for Azure SDK clients, including Arc flows.
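
As a rough illustration of what centralizing the endpoint, audience, and authority could mean, the sketch below derives all three from a cloud name and a configured Resource Manager endpoint. The type `armEnvironment`, the function `deriveARMEnvironment`, and the rule that the audience equals the endpoint are assumptions made for this example, not the contents of pkg/azclient/resource_manager.go.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// armEnvironment groups the values shared by the Machine and Arc clients.
// The type and its fields are illustrative only.
type armEnvironment struct {
	Endpoint  string // Resource Manager base URL
	Audience  string // token audience for ARM calls
	Authority string // Microsoft Entra authority host
}

// knownAuthorities maps cloud names to their public login hosts.
var knownAuthorities = map[string]string{
	"AzurePublicCloud":       "https://login.microsoftonline.com/",
	"AzureUSGovernmentCloud": "https://login.microsoftonline.us/",
	"AzureChinaCloud":        "https://login.chinacloudapi.cn/",
}

// deriveARMEnvironment assumes the audience is the endpoint itself.
func deriveARMEnvironment(cloudName, endpoint string) (armEnvironment, error) {
	authority, ok := knownAuthorities[cloudName]
	if !ok {
		return armEnvironment{}, fmt.Errorf("unsupported cloud %q", cloudName)
	}
	parsed, err := url.Parse(strings.TrimSpace(endpoint))
	if err != nil || parsed.Scheme != "https" || parsed.Host == "" {
		return armEnvironment{}, fmt.Errorf("resource manager endpoint %q must be an absolute https URL", endpoint)
	}
	base := "https://" + parsed.Host // drop any trailing slash
	return armEnvironment{Endpoint: base, Audience: base, Authority: authority}, nil
}

// tokenScope builds the OAuth scope requested for ARM tokens.
func (e armEnvironment) tokenScope() string {
	return e.Audience + "/.default"
}

func main() {
	env, err := deriveARMEnvironment("AzurePublicCloud", "https://management.azure.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(env.Endpoint, env.tokenScope(), env.Authority)
}
```

Deriving every value in one place means a sovereign or test cloud only needs its endpoint and cloud name configured once, rather than per client.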

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/install.sh | Updates sample config JSON to include ARM endpoint and target agent pool name. |
| scripts/aks-flex-config | Adds --agent-pool-name and writes ARM endpoint + target pool into generated configs. |
| README.md | Updates quickstart to pass agent pool name and reflects new config fields. |
| pkg/config/config.go | Adds ARM endpoint + target pool defaults, adapts RP bootstrap data on load, and updates auth validation behavior. |
| pkg/config/config_test.go | Updates/extends tests for new defaults, auth rules, and RP bootstrap normalization. |
| pkg/config/bootstrap_data.go | Implements RP bootstrap payload normalization into canonical config fields. |
| pkg/azclient/resource_manager.go | Centralizes ARM endpoint/audience/authority derivation and Azure SDK client options. |
| pkg/azclient/resource_manager_test.go | Adds unit tests for ARM environment derivation and client options. |
| pkg/arc/arc_installer.go | Switches Arc credential/token/client setup to shared ARM environment handling. |
| pkg/aksmachine/types.go | Tightens validation to keep maxPods within int32 bounds for ARM payloads. |
| pkg/aksmachine/client_armapi.go | Uses targetAgentPoolName for Machine resource IDs and shared ARM client options/credentials. |
| pkg/aksmachine/client_armapi_test.go | Updates tests for configurable pool name, client options, and kubelet config behavior. |
| hack/e2e/run.sh | Documents new E2E env var for target agent pool name. |
| hack/e2e/README.md | Documents E2E_TARGET_AGENT_POOL_NAME. |
| hack/e2e/lib/node-join-token.sh | Passes --agent-pool-name when generating token-based node configs. |
| hack/e2e/lib/node-join-msi.sh | Includes ARM endpoint + target pool name in MSI node join config. |
| hack/e2e/lib/node-join-kubeadm.sh | Includes ARM endpoint + target pool name in kubeadm node join config. |
| hack/e2e/lib/common.sh | Sets/logs default E2E_TARGET_AGENT_POOL_NAME. |
| docs/usages/joining-nodes.md | Updates minimal config examples to include ARM endpoint and target pool name. |
| docs/usages/configuration.md | Documents new fields, updated auth rules, and RP bootstrap-to-runtime field mapping. |
| docs/usages/aks-flex-config.md | Documents --agent-pool-name and the generated config fields. |
| docs/labs/gpu-node-setup.md | Updates lab to pass agent pool name through config generation. |
| docs/labs/aks-public-cluster-unbounded-net-wireguard.md | Updates lab to pass agent pool name through config generation. |
| docs/labs/aks-private-cluster-unbounded-net.md | Updates lab to pass agent pool name through config generation. |
| docs/labs/aks-private-cluster-cilium.md | Updates lab to pass agent pool name through config generation. |
| .env.example | Adds E2E_TARGET_AGENT_POOL_NAME example configuration. |

Comment thread pkg/config/config.go
Comment on lines +428 to 430
if c.requiresTenantID() && c.TenantID == "" {
	return fmt.Errorf("azure.tenantId is required")
}

Member

This validation makes sense. Can we put it in a dedicated method? @wenxuan0923

Collaborator Author

Updated in a3b0a7f: endpoint validation now lives in AzureConfig.validateResourceManagerEndpointURL(), requires an absolute https URL, and rejects user info, explicit ports, path, query, and fragment. Added tests for missing scheme, http, path, port, and user info cases.

Collaborator Author

Yes, this now lives in the validateResourceManagerEndpointURL method.
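
The rules listed in this thread (absolute https URL, no user info, no explicit port, no path, query, or fragment) can be sketched with the standard library as follows. The standalone function `validateEndpoint` and its `field` parameter are hypothetical names for illustration; the actual method is attached to AzureConfig and may differ in detail.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateEndpoint is an illustrative stand-in for the dedicated method
// described above. field is the config key used in error messages.
func validateEndpoint(field, raw string) error {
	endpoint := strings.TrimSpace(raw)
	if endpoint == "" {
		return fmt.Errorf("%s is required", field)
	}
	parsed, err := url.Parse(endpoint)
	if err != nil || parsed.Scheme == "" || parsed.Host == "" {
		return fmt.Errorf("%s must be an absolute https URL", field)
	}
	if parsed.Scheme != "https" {
		return fmt.Errorf("%s must use https", field)
	}
	if parsed.User != nil {
		return fmt.Errorf("%s must not include user info", field)
	}
	if parsed.Port() != "" {
		return fmt.Errorf("%s must not include an explicit port", field)
	}
	// A bare trailing slash counts as a path in this strict sketch.
	if parsed.Path != "" || parsed.RawQuery != "" || parsed.Fragment != "" {
		return fmt.Errorf("%s must not include a path, query, or fragment", field)
	}
	return nil
}

func main() {
	for _, candidate := range []string{
		"https://management.azure.com",
		"management.azure.com",
		"http://management.azure.com",
		"https://management.azure.com:443",
		"https://user@management.azure.com",
	} {
		fmt.Printf("%-40s -> %v\n", candidate, validateEndpoint("azure.resourceManagerEndpoint", candidate))
	}
}
```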

Comment thread pkg/config/bootstrap_data.go Outdated
Comment on lines +8 to +11
// poolBootstrapData is the AKS RP-generated bootstrap config shape. The agent
// accepts it at load time and normalizes it into the runtime Config shape.
// this is based on v20260502preview AKS RP API version
type poolBootstrapData struct {
Comment thread pkg/azclient/resource_manager.go Outdated
Comment thread pkg/config/config.go
Comment thread pkg/config/config.go Outdated
Copilot AI review requested due to automatic review settings June 16, 2026 23:42

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 1 comment.

Comment thread pkg/config/config.go
Comment thread pkg/aksmachine/client_armapi.go
Comment thread pkg/config/bootstrap_data.go Outdated
)

// poolBootstrapData is the AKS RP-generated bootstrap config shape. The agent
// accepts it at load time and normalizes it into the runtime Config shape.

Member

Do we want the runtime config shape to follow the AKS RP-side settings and then add compatibility support for the current config in main? We can remove the compatibility support a few versions later.

@wenxuan0923 wenxuan0923 Jun 17, 2026

Collaborator Author

Sounds good. I updated the code to make the RP-side settings first-class citizens, with a backward-compatibility adapter for the legacy config.
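
One way to picture the approach in this reply is a loader that treats the RP-shaped fields as canonical and only fills them from legacy keys when they are empty. The sketch below uses the two legacy mappings mentioned in this review (`kubernetes.version` to `components.kubernetes`, and `node.kubelet.serverURL` to `node.kubelet.clusterFQDN`). The struct layout and the `loadConfig` name are assumptions, not the PR's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// kubeletConfig carries the canonical field plus the legacy one it replaces.
type kubeletConfig struct {
	ClusterFQDN string `json:"clusterFQDN,omitempty"`
	ServerURL   string `json:"serverURL,omitempty"` // legacy key
}

// config is a trimmed, hypothetical view of the agent config.
type config struct {
	Components struct {
		Kubernetes string `json:"kubernetes,omitempty"`
	} `json:"components"`
	Node struct {
		Kubelet kubeletConfig `json:"kubelet"`
	} `json:"node"`
	// LegacyKubernetes captures the older top-level kubernetes.version key.
	LegacyKubernetes *struct {
		Version string `json:"version"`
	} `json:"kubernetes,omitempty"`
}

// loadConfig parses either shape and normalizes it to the canonical fields.
func loadConfig(data []byte) (*config, error) {
	var cfg config
	if err := json.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parse config: %w", err)
	}
	// Canonical values win; legacy values only fill gaps.
	if cfg.Components.Kubernetes == "" && cfg.LegacyKubernetes != nil {
		cfg.Components.Kubernetes = cfg.LegacyKubernetes.Version
	}
	if cfg.Node.Kubelet.ClusterFQDN == "" {
		cfg.Node.Kubelet.ClusterFQDN = cfg.Node.Kubelet.ServerURL
	}
	// Clear legacy fields so the rest of the agent sees one shape.
	cfg.LegacyKubernetes = nil
	cfg.Node.Kubelet.ServerURL = ""
	return &cfg, nil
}

func main() {
	legacy := []byte(`{"kubernetes":{"version":"1.30.0"},"node":{"kubelet":{"serverURL":"https://example.test"}}}`)
	cfg, err := loadConfig(legacy)
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.Components.Kubernetes, cfg.Node.Kubelet.ClusterFQDN)
}
```

Keeping the adapter in the loader means it can be deleted in one place once legacy configs are no longer supported.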

Member

But do we want to replace the current Config with this poolBootstrapData now?

Copilot AI review requested due to automatic review settings June 17, 2026 02:05

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.

Comment thread pkg/config/config.go
Comment on lines +481 to +497
func (c *AzureConfig) validateResourceManagerEndpointURL() error {
	endpoint := strings.TrimSpace(c.ResourceManagerEndpointURL)
	if endpoint == "" {
		return fmt.Errorf("azure.resourceManagerEndpoint is required")
	}
	parsed, err := url.Parse(endpoint)
	if err != nil || parsed.Scheme == "" || parsed.Host == "" {
		return fmt.Errorf("azure.resourceManagerEndpoint must be an absolute https URL")
	}
	if parsed.Scheme != "https" {
		return fmt.Errorf("azure.resourceManagerEndpoint must use https")
	}
	if parsed.Path != "" || parsed.RawQuery != "" || parsed.Fragment != "" {
		return fmt.Errorf("azure.resourceManagerEndpoint must not include a path, query, or fragment")
	}
	return nil
}

This comment was marked as resolved.

Comment thread pkg/config/config.go Outdated
Comment on lines 646 to 655
@@ -615,6 +653,9 @@
cfg.Azure.TargetCluster.ResourceGroup = resourceGroupName
cfg.Azure.TargetCluster.SubscriptionID = subscriptionID
cfg.Azure.TargetCluster.NodeResourceGroup = mcResourceGroup

Collaborator Author

Updated in 2d96feb. We do not currently need TargetCluster.NodeResourceGroup in runtime code; only the target cluster name/resource group/subscription are consumed. I removed the MC_* derivation entirely so we no longer fabricate a misleading node resource group when location is omitted or when the AKS cluster uses a custom node resource group. Added coverage for both location-present and location-omitted cases.

Collaborator Author

Follow-up in d5cfefd: removed the unused TargetCluster.NodeResourceGroup field and the related test assertion completely. The agent still derives the target cluster name/resource group/subscription from the resource ID, but no node resource group code remains in runtime/config.

Collaborator Author

The NodeResourceGroup property is not used at all, so I removed it.
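
For readers unfamiliar with the derivation mentioned in this thread, here is a standard-library sketch of pulling the subscription, resource group, and cluster name out of an AKS cluster resource ID. The `targetCluster` type and `parseClusterResourceID` function are hypothetical. The quoted code elsewhere in this review references `arm.ResourceID` from the Azure SDK, which the real implementation presumably relies on instead.

```go
package main

import (
	"fmt"
	"strings"
)

// targetCluster holds the three values the agent is said to consume.
type targetCluster struct {
	SubscriptionID string
	ResourceGroup  string
	Name           string
}

// parseClusterResourceID expects the form
// /subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ContainerService/managedClusters/{name}
func parseClusterResourceID(id string) (targetCluster, error) {
	parts := strings.Split(strings.Trim(id, "/"), "/")
	if len(parts) != 8 ||
		!strings.EqualFold(parts[0], "subscriptions") ||
		!strings.EqualFold(parts[2], "resourceGroups") ||
		!strings.EqualFold(parts[4], "providers") ||
		!strings.EqualFold(parts[5], "Microsoft.ContainerService") ||
		!strings.EqualFold(parts[6], "managedClusters") {
		return targetCluster{}, fmt.Errorf("invalid AKS cluster resource ID %q", id)
	}
	if parts[1] == "" || parts[3] == "" || parts[7] == "" {
		return targetCluster{}, fmt.Errorf("AKS cluster resource ID %q has an empty segment", id)
	}
	return targetCluster{SubscriptionID: parts[1], ResourceGroup: parts[3], Name: parts[7]}, nil
}

func main() {
	tc, err := parseClusterResourceID(
		"/subscriptions/sub-123/resourceGroups/rg-demo/providers/Microsoft.ContainerService/managedClusters/demo",
	)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", tc)
}
```

Nothing in the ID carries the node resource group, which is consistent with dropping the MC_* derivation rather than guessing it.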

Comment thread pkg/config/bootstrap_data.go Outdated
Comment on lines +8 to +11
// poolBootstrapData is the AKS RP-generated bootstrap config shape. The agent
// accepts it at load time and normalizes it into the runtime Config shape.
// this is based on v20260502preview AKS RP API version
type poolBootstrapData struct {
Copilot AI review requested due to automatic review settings June 17, 2026 02:56

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 33 out of 33 changed files in this pull request and generated 4 comments.

Comment thread scripts/install.sh
Comment on lines 325 to 327
"arc": {
"machineName": "YOUR_MACHINE_NAME",
"tags": {

Collaborator Author

Fixed in 1b2643e: the installer example now sets azure.arc.enabled to true so the sample actually selects Arc authentication.

Comment thread README.md
Comment on lines 81 to 85
   "node": {
     "kubelet": {
-      "serverURL": "https://<aks-api-server>", // AKS API server endpoint.
+      "clusterFQDN": "<aks-api-server-fqdn>", // AKS API server FQDN.
       "caCertData": "<base64-ca-data>" // Cluster CA bundle from kubeconfig.
     }

Collaborator Author

Fixed in 1b2643e: scripts/aks-flex-config now renders the canonical node.kubelet.clusterFQDN field instead of the legacy node.kubelet.serverURL field. The README example remains aligned with the generated output.

Comment thread README.md
Comment on lines 87 to 92
   "agent": {
     "logLevel": "info", // Agent log verbosity.
     "logDir": "/var/log/aks-flex-node" // Host log directory.
   },
-  "kubernetes": { "version": "<aks-kubernetes-version>" } // Kubelet version to install.
+  "components": { "kubernetes": "<aks-kubernetes-version>" } // Kubelet version to install.
 }

Collaborator Author

Fixed in 1b2643e: scripts/aks-flex-config now renders components.kubernetes instead of the legacy kubernetes.version shape.

Comment thread pkg/config/config.go
Comment on lines +419 to 423
+// When using bootstrap token, clusterFQDN and caCertData are required in kubelet config
 // because there's no Azure authentication to fetch them
-if c.Node.Kubelet.ServerURL == "" {
-	return fmt.Errorf("node.kubelet.serverURL is required when using bootstrap token authentication")
+if c.APIServerURL() == "" {
+	return fmt.Errorf("node.kubelet.clusterFQDN is required when using bootstrap token authentication")
 }

Collaborator Author

Fixed in 1b2643e: bootstrap-token validation now checks that node.kubelet.clusterFQDN resolves to an absolute HTTPS kube-apiserver URL and rejects malformed URLs, HTTP, user info, path, query, and fragment. Added table coverage for accepted host/HTTPS inputs and rejected invalid cases.
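
The normalization described here, where `node.kubelet.clusterFQDN` may be a bare host or an HTTPS URL and is resolved to an absolute kube-apiserver URL, could be sketched as below. The function name `apiServerURL` is hypothetical, and defaulting the port to 443 follows the behavior mentioned in the review comments rather than code shown in this thread.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"strings"
)

// apiServerURL turns a host, host:port, or https URL into an absolute
// https URL with an explicit port. It is an illustrative sketch only.
func apiServerURL(clusterFQDN string) (string, error) {
	raw := strings.TrimSpace(clusterFQDN)
	if raw == "" {
		return "", fmt.Errorf("node.kubelet.clusterFQDN is required")
	}
	if !strings.Contains(raw, "://") {
		raw = "https://" + raw // treat bare host or host:port as https
	}
	parsed, err := url.Parse(raw)
	if err != nil || parsed.Hostname() == "" {
		return "", fmt.Errorf("node.kubelet.clusterFQDN %q is not a valid host or URL", clusterFQDN)
	}
	if parsed.Scheme != "https" {
		return "", fmt.Errorf("node.kubelet.clusterFQDN must use https")
	}
	if parsed.User != nil || parsed.Path != "" || parsed.RawQuery != "" || parsed.Fragment != "" {
		return "", fmt.Errorf("node.kubelet.clusterFQDN must not include user info, path, query, or fragment")
	}
	port := parsed.Port()
	if port == "" {
		port = "443" // default kube-apiserver port for AKS endpoints
	}
	return "https://" + net.JoinHostPort(parsed.Hostname(), port), nil
}

func main() {
	for _, in := range []string{"demo.example.test", "https://demo.example.test", "http://demo.example.test"} {
		out, err := apiServerURL(in)
		fmt.Printf("%q -> %q, err=%v\n", in, out, err)
	}
}
```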

Copilot AI review requested due to automatic review settings June 17, 2026 03:07

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 33 out of 33 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

pkg/daemon/daemon_test.go:20

  • This test sets node.kubelet.clusterFQDN to a full URL and then asserts restCfg.Host equals the raw ClusterFQDN. ClusterFQDN is documented/used as a FQDN/host input, and bootstrapCredentialRESTConfig intentionally uses cfg.APIServerURL() (derived URL) as restCfg.Host. The test should set ClusterFQDN to a host value and assert Host against cfg.APIServerURL() to validate the intended normalization logic (including defaulting :443).
	cfg.Node.Kubelet.ClusterFQDN = "https://example.test"
	cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca"))
	cfg.Azure.BootstrapToken = &config.BootstrapTokenConfig{Token: "token.value"}

	restCfg, err := bootstrapCredentialRESTConfig(cfg)

Comment thread scripts/aks-flex-config
Comment on lines 152 to 156
"subscriptionId": metadata["subscription_id"],
"tenantId": metadata["tenant_id"],
"cloud": "AzurePublicCloud",
"resourceManagerEndpoint": RESOURCE_MANAGER_ENDPOINT,
"targetAgentPoolName": metadata["agent_pool_name"],
"targetCluster": {
Comment thread README.md
Comment on lines 65 to +71
```jsonc
{
"azure": {
"subscriptionId": "<subscription-id>", // Azure subscription that owns the AKS cluster.
"tenantId": "<tenant-id>", // Microsoft Entra tenant for the subscription.
"cloud": "AzurePublicCloud", // Azure cloud environment.
"resourceManagerEndpoint": "https://management.azure.com", // Azure Resource Manager endpoint for ARM calls.
"targetAgentPoolName": "<agent-pool-name>", // AKS agent pool used for FlexNode machine registration.
Comment on lines 118 to 131
 func machineResourceIDFromConfig(cfg *config.Config) (*arm.ResourceID, error) {
-	if cfg.Azure.TargetCluster.ResourceID == "" || cfg.Agent.NodeName == "" || cfg.Kubernetes.Version == "" {
-		return nil, fmt.Errorf("incomplete AKS machine config: clusterResourceId=%q machineName=%q kubernetesVersion=%q",
-			cfg.Azure.TargetCluster.ResourceID, cfg.Agent.NodeName, cfg.Kubernetes.Version)
+	var clusterResourceID, agentPoolName, machineName, k8sVersion string
+	if cfg != nil {
+		if cfg.Azure.TargetCluster != nil {
+			clusterResourceID = cfg.Azure.TargetCluster.ResourceID
+		}
+		agentPoolName = strings.TrimSpace(cfg.Azure.TargetAgentPoolName)
+		machineName = cfg.Agent.NodeName
+		k8sVersion = cfg.Components.Kubernetes
+	}
+	if clusterResourceID == "" || agentPoolName == "" || machineName == "" || k8sVersion == "" {
+		return nil, fmt.Errorf("incomplete AKS machine config: clusterResourceId=%q targetAgentPoolName=%q machineName=%q kubernetesVersion=%q",
+			clusterResourceID, agentPoolName, machineName, k8sVersion)
 	}
Copilot AI review requested due to automatic review settings June 17, 2026 03:30

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 33 out of 33 changed files in this pull request and generated 4 comments.

Comment thread pkg/daemon/daemon_test.go
Comment on lines 15 to 25
 	cfg := &config.Config{}
-	cfg.Node.Kubelet.ServerURL = "https://example.test"
+	cfg.Node.Kubelet.ClusterFQDN = "https://example.test"
 	cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca"))
 	cfg.Azure.BootstrapToken = &config.BootstrapTokenConfig{Token: "token.value"}

 	restCfg, err := bootstrapCredentialRESTConfig(cfg)
 	if err != nil {
 		t.Fatalf("bootstrapCredentialRESTConfig: %v", err)
 	}
-	if restCfg.Host != cfg.Node.Kubelet.ServerURL || restCfg.BearerToken != cfg.Azure.BootstrapToken.Token {
+	if restCfg.Host != cfg.Node.Kubelet.ClusterFQDN || restCfg.BearerToken != cfg.Azure.BootstrapToken.Token {
 		t.Fatalf("rest config = %#v", restCfg)
Comment thread pkg/daemon/daemon_test.go
Comment on lines 32 to 35
 	cfg := &config.Config{}
-	cfg.Node.Kubelet.ServerURL = "https://example.test"
+	cfg.Node.Kubelet.ClusterFQDN = "https://example.test"
 	cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca"))
 	cfg.Azure.ServicePrincipal = &config.ServicePrincipalConfig{TenantID: "tenant", ClientID: "client", ClientSecret: "secret"}
Comment thread pkg/daemon/daemon_test.go
Comment on lines 53 to 55
 	cfg := &config.Config{}
-	cfg.Node.Kubelet.ServerURL = "https://example.test"
+	cfg.Node.Kubelet.ClusterFQDN = "https://example.test"
 	cfg.Node.Kubelet.CACertData = base64.StdEncoding.EncodeToString([]byte("ca"))
Comment thread pkg/config/config.go
Comment on lines +509 to +517
if parsed.User != nil {
	if opts.allowPort {
		return fmt.Errorf("%s must not include user info", opts.fieldName)
	}
	return fmt.Errorf("%s must not include user info or port", opts.fieldName)
}
if !opts.allowPort && parsed.Port() != "" {
	return fmt.Errorf("%s must not include user info or port", opts.fieldName)
}
@wenxuan0923

Collaborator Author

@copilot resolve the merge conflicts in this pull request

Copilot AI review requested due to automatic review settings June 17, 2026 18:01
@wenxuan0923 wenxuan0923 removed the request for review from Copilot June 17, 2026 18:01

Copilot AI commented Jun 17, 2026

Contributor

@copilot resolve the merge conflicts in this pull request

Resolved in 2026ca3. The only conflict was in pkg/npd/start.go — the Start function signature — which was resolved by accepting the origin/main change (using goalstates.NodeStart and adding the nodeName field). Build verified after the merge.

Copilot AI deployed to e2e-testing June 17, 2026 18:05 Active