A Kubernetes operator designed to automate the lifecycle of TPU slices within self-managed clusters on Google Compute Engine (GCE).
When running self-managed Kubernetes on Google Compute Engine (GCE), provisioning TPU slices manually requires orchestrating multiple GCE APIs, managing complex metadata for hardware discovery, and bootstrapping nodes.
The TPUNodeGroup controller acts as a Kubernetes-native infrastructure operator that automates the provisioning of TPU slices. Unlike standard GCE VMs, a TPU slice consists of multiple VMs interconnected by a specialized high-speed network, forming a single, atomic unit for large-scale ML workloads.
From your perspective, you simply provide a TPUNodeGroup CR declaring your desired TPU shape. Behind the scenes, the controller will:
- Abstract GCE Complexity: To properly provision TPUs, the controller automatically creates three critical GCE resources on your behalf:
  - GCE Workload Policy: Responsible for specifying the physical properties of the underlying infrastructure. For multi-host slices, this policy specifies the acceleratorTopology (e.g., 4x4x4), which tells Google's data centers exactly how to physically wire the TPU chips together using high-speed Inter-Chip Interconnects (ICI).
  - Instance Template: Defines the VM configuration (machine type, image, disks, metadata, etc.) used by the Managed Instance Group.
  - Managed Instance Group (MIG): Responsible for actually provisioning and managing the lifecycle of the Virtual Machines. For multi-host slices, users must explicitly configure targetSizePolicyMode: "BULK" in the TPUNodeGroup CR. The controller then configures the MIG to use "bulk mode", which guarantees atomic provisioning. This ensures GCE will only create the VMs if it can successfully allocate the entire slice at once, preventing partial, unusable setups.
- Inject TPU Metadata: Automatically injects the TPU configuration metadata needed by the TPU Device Plugin (like kube-labels and accelerator_topology_id) into the VMs via the MIG.
- Secure "Pull-Model" Bootstrapping: To handle long GCE provisioning times (where standard short-lived kubeadm tokens might expire), the controller implements a Pull Model. Instead of injecting tokens at VM creation, the controller waits for the VM state to be RUNNING, then injects a dynamic, short-lived join token into Instance Metadata. A startup script on the VM polls for this token, ensuring a secure and reliable cluster join even after extended provisioning delays. (Note: This feature is primarily provided to facilitate rapid prototyping and testing. For production environments, it is expected that users will leverage their existing mechanisms for node bootstrapping, such as custom OS images or configuration management tools.)
- Deploy the TPU Device Plugin: The controller automatically deploys a streamlined, open-source TPU device plugin to your cluster as a DaemonSet. This plugin acts as the bridge between the raw TPU hardware and the Kubernetes scheduler. It discovers the attached TPU chips, registers them as schedulable resources (e.g., google.com/tpu), and handles local hardware health monitoring.
- Handle Graceful Teardown: When the TPUNodeGroup CR is deleted, the controller orchestrates a graceful teardown using a Kubernetes Finalizer. It cordons the nodes, deletes the GCE resources (MIG, Instance Template, and Workload Policy) in reverse order, and removes stale Node objects from the cluster.
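The VM-side half of the pull model described above can be illustrated with a small polling loop. This is a minimal sketch, not the project's actual startup script: the function names, the retry interval, and the attempt limit are assumptions chosen for illustration. The metadata server URL and the Metadata-Flavor header are the standard GCE conventions for reading custom instance attributes.

```python
import time
import urllib.error
import urllib.request

# Standard GCE metadata server endpoint for custom instance attributes.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/instance/attributes/{key}"
)


def fetch_attribute(key, timeout=5.0):
    """Return the attribute value, or None if it has not been set yet."""
    req = urllib.request.Request(
        METADATA_URL.format(key=key), headers={"Metadata-Flavor": "Google"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except urllib.error.HTTPError as err:
        if err.code == 404:  # attribute not injected yet
            return None
        raise
    except urllib.error.URLError:
        return None  # metadata server not reachable yet; try again later


def wait_for_attribute(key, fetch=fetch_attribute, interval=5.0,
                       max_attempts=720, sleep=time.sleep):
    """Poll until fetch(key) returns a value or the attempts run out.

    interval and max_attempts are illustrative defaults (about one hour).
    """
    for attempt in range(max_attempts):
        value = fetch(key)
        if value:
            return value
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError(f"metadata attribute {key!r} never appeared")
```

On a VM, a call such as wait_for_attribute("kubeadm-join-token") would block until the controller has injected the token. The fetch and sleep parameters exist only so the loop can be exercised without a metadata server.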
Note: The current release supports v6e and v7x TPU generations.
This project is intended as a Reference Implementation for experimental and testing purposes.
Core Assumption: We assume that production customers already have mature solutions for their node infrastructure and lifecycle. For those who need a rapid, "turn-key" way to experiment with TPUs on self-managed clusters, we provide a Reference Bootstrapping flow using standard Kubernetes mechanics.
This controller is designed using the Composite Pattern to maximize modularity and reusability. Instead of a single monolithic state machine, the project is structured as a parent controller that orchestrates three dedicated child Custom Resources:
- ResourcePolicy Controller: Manages GCE Resource Policies (Workload Policies) for ICI networking.
- InstanceTemplate Controller: Manages GCE Instance Templates.
- ManagedInstanceGroup Controller: Manages GCE MIGs, with specific support for bulk allocation.
All infrastructure resources are managed asynchronously via official Google Cloud Go Client Libraries. This architecture allows the TPUNodeGroup controller to focus solely on high-level workload orchestration and node lifecycle, leaving the low-level GCE "3C" (Compute) resource management to specialized, decoupled sub-controllers.
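The parent/child split can be pictured as a parent reconcile step that walks its children in a fixed order and waits on the first one that is not ready. The sketch below is purely illustrative: the class names, the reconcile function, and the creation order are assumptions (the order is inferred as the reverse of the teardown order this document describes), and it does not reflect the controller's real Go types.

```python
from dataclasses import dataclass, field


@dataclass
class Child:
    """Stand-in for one child custom resource and its observed readiness."""
    kind: str
    ready: bool = False


def default_children():
    # Assumed creation order: policy first, then template, then the MIG.
    return [
        Child("ResourcePolicy"),
        Child("InstanceTemplate"),
        Child("ManagedInstanceGroup"),
    ]


@dataclass
class ParentState:
    """Hypothetical view of what the parent controller tracks."""
    children: list = field(default_factory=default_children)


def reconcile(state):
    """Return the kind of the first child that is not ready, or None if all are."""
    for child in state.children:
        if not child.ready:
            return child.kind  # the parent would requeue and wait here
    return None
```

Because each child reports its own readiness, the parent never needs to know how a GCE resource is created; it only needs to know which child it is currently waiting on.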
When bootstrapKubernetes is enabled, the controller orchestrates a secure, asynchronous Pull-Model Bootstrapping flow. This design safely coordinates dynamic token generation and join operations over potentially extended GCE VM booting delays.
The configuration sequence proceeds as follows:
- GCE VM Initialization (VM Starts Booting): The Managed Instance Group (MIG) provisions the VM.
- GCE VM (Polling Begins): The VM's startup script immediately begins polling its own GCE Instance Metadata for the kubeadm-join-token.
- Controller (RUNNING Detection): In parallel, the controller detects that the GCE VM state has transitioned to RUNNING.
- Controller (Token Generation): The controller generates a unique, short-lived kubeadm join token (valid for 1 hour) and creates a corresponding Bootstrap Secret in the kube-system namespace.
- Controller (Metadata Injection): The controller fetches the cluster's CA Cert Hash and injects it along with the Join Token and Control Plane IP directly into the GCE Instance Metadata for that specific VM.
- GCE VM (Poll Succeeds): The VM's polling loop succeeds and retrieves the injected configuration values.
- GCE VM (Pull-Model Join): The VM executes kubeadm join using the retrieved credentials to securely attach itself to the cluster.
- Kubernetes (Node Registered): The K8s API Server registers the node and marks it as Ready.
- Controller (Node Matching & Labeling): Once the new worker node appears in the cluster, the controller matches it to the GCE instance using the node's ProviderID and automatically patches the Node with TPU labels (topology, accelerator type, chip count).
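The token generation step in the sequence above can be sketched as follows. This is a hedged illustration of the standard Kubernetes bootstrap token conventions (a 6-character ID and 16-character secret, stored in a bootstrap-token-<id> Secret in kube-system), not the controller's actual code. The function names are hypothetical, and the one-hour TTL is taken from the description above.

```python
import secrets
import string
from datetime import datetime, timedelta, timezone

ALPHABET = string.ascii_lowercase + string.digits


def new_bootstrap_token():
    """Return (token_id, token_secret) in the <6 chars>.<16 chars> token format."""
    token_id = "".join(secrets.choice(ALPHABET) for _ in range(6))
    token_secret = "".join(secrets.choice(ALPHABET) for _ in range(16))
    return token_id, token_secret


def bootstrap_secret(token_id, token_secret, ttl=timedelta(hours=1), now=None):
    """Build a Secret manifest (as a dict) that keeps the token valid for ttl."""
    now = now or datetime.now(timezone.utc)
    expiration = (now + ttl).strftime("%Y-%m-%dT%H:%M:%SZ")  # RFC 3339, UTC
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": f"bootstrap-token-{token_id}",
            "namespace": "kube-system",
        },
        "type": "bootstrap.kubernetes.io/token",
        "stringData": {
            "token-id": token_id,
            "token-secret": token_secret,
            "expiration": expiration,
            "usage-bootstrap-authentication": "true",
            "usage-bootstrap-signing": "true",
        },
    }
```

The value handed to the VM would then be f"{token_id}.{token_secret}". Once the expiration passes, Kubernetes removes the Secret, which is what keeps a leaked metadata value from being reusable indefinitely.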
The following sequence diagram visualizes this chronological pull-model orchestration:
The TPUNodeGroup controller requires permissions to create and manage GCE resources (Instance Templates, Workload Policies, and MIGs) on your behalf.
You can assign the roles/compute.admin role, or for least-privilege environments, ensure the service account has the following specific permissions:
- compute.instanceTemplates.*
- compute.resourcePolicies.*
- compute.instanceGroupManagers.*
- compute.instanceGroups.*
- compute.instances.*
- compute.subnetworks.get
If you are setting up a new self-managed cluster, you can initialize a control plane on a GCE VM using kubeadm.
- Create a GCE VM (if you don't have one):
export ZONE=us-central1-c
export SERVICE_ACCOUNT_EMAIL=your-service-account@project.iam.gserviceaccount.com
gcloud compute instances create k8s-control-plane \
--zone=$ZONE \
--machine-type=e2-standard-4 \
--image-family=ubuntu-2404-lts-amd64 \
--image-project=ubuntu-os-cloud \
--tags=k8s-control-plane \
--service-account=$SERVICE_ACCOUNT_EMAIL \
--scopes=https://www.googleapis.com/auth/cloud-platform
- Initialize the cluster: SSH into the VM and run the following commands. Ensure conntrack, containerd, kubeadm, and a CNI plugin are installed.
# Initialize the control plane with a specific pod CIDR (Flannel requires 10.244.0.0/16)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
Note: kubeadm is required if enabling bootstrapKubernetes in the TPUNodeGroup CR.
Ensure your subnet has internet access (e.g., via Cloud NAT) so nodes can download packages during bootstrapping.
Allow API Server access (TCP 6443) from the TPU nodes and allow internal traffic between the Control Plane and the TPU VMs. Adjust the source-ranges to match your VPC CIDR:
export VPC_NETWORK=default
export CONTROL_PLANE_CIDR=0.0.0.0/0 # Restrict this in production
export CLUSTER_CIDR=10.128.0.0/9 # Adjust to your VPC CIDR
# Allow API Server access (6443)
gcloud compute firewall-rules create allow-k8s-apiserver \
--allow tcp:6443 \
--network=$VPC_NETWORK \
--source-ranges=$CONTROL_PLANE_CIDR \
--target-tags=k8s-control-plane
# Allow all internal traffic between nodes (Control Plane <-> TPU VM)
gcloud compute firewall-rules create allow-k8s-internal \
--allow tcp,udp,icmp,ipip \
--network=$VPC_NETWORK \
--source-ranges=$CLUSTER_CIDR \
--target-tags=k8s-control-plane
If you are using a reservation-bound provisioning model, you must reserve the TPU capacity in your target zone.
export RESERVATION_NAME=my-tpu-reservation
export VM_FAMILY=tpu7x # or tpu6e
export ACCELERATOR_COUNT=4
export HOST_COUNT=2
export ZONE=us-central1-c
gcloud compute reservations create $RESERVATION_NAME \
--vm-family=$VM_FAMILY \
--accelerator-count=$ACCELERATOR_COUNT \
--host-count=$HOST_COUNT \
--zone=$ZONE
Before creating the TPUNodeGroup resource, you need to build the controller image, push it to your container registry, and deploy it using the provided manifests.
- Build and Push the Image:
export PROJECT_ID=your-project-id
docker build -t gcr.io/$PROJECT_ID/tpunodegroup-controller:latest .
docker push gcr.io/$PROJECT_ID/tpunodegroup-controller:latest
- Update the Deployment Manifest: Open deploy/kustomization.yaml and update the newName and newTag under the images section to match the image you just pushed.
- Install Components: Apply the kustomization to deploy all components (CRDs, Controller, Device Plugin, etc.).
kubectl apply -k deploy/
- Verify the Controller is Running:
kubectl get pods -n tpu-node-group
Note: If you are running a single-node cluster (control plane only), you may need to untaint the node to allow the controller to run:
kubectl taint nodes --all node-role.kubernetes.io/control-plane-
Once the controller is running, create a TPUNodeGroup resource in your cluster. This tells the controller what capacity and configuration to instantiate.
The controller manages the creation of the Instance Template dynamically based on the instanceConfig fields. The bootstrapKubernetes block instructs the controller to automatically bootstrap the Kubernetes nodes and join them to the control plane.
Set the required environment variables:
export PROJECT=my-gcp-project
export ZONE=us-central1-c
export RESERVATION_NAME=my-tpu-reservation
export IMAGE=projects/ubuntu-os-accelerator-images/global/images/ubuntu-accel-2404-amd64-tpu-tpu7x-v20260320
export TOPOLOGY=2x2x2
export NODE_COUNT=2
export CONTROL_PLANE_IP=10.128.0.2
export TPU_NODE_GROUP_NAME=my-tpu-node-group
export NAMESPACE=default
export ACCELERATOR=tpu7x
export MACHINE_TYPE=tpu7x-standard-4t
Example CR creation using these variables:
cat <<EOF | kubectl apply -f -
apiVersion: tpu.google.com/v1alpha1
kind: TPUNodeGroup
metadata:
  name: $TPU_NODE_GROUP_NAME
  namespace: $NAMESPACE
spec:
  project: $PROJECT
  nodeLocation: $ZONE
  nodeCount: $NODE_COUNT
  acceleratorConnectionMode: "STATIC"
  topology: $TOPOLOGY
  targetSizePolicyMode: "BULK" # Required in latest versions
  instanceConfig:
    machineType: $MACHINE_TYPE
    provisioningModel: "RESERVATION_BOUND"
    reservation: $RESERVATION_NAME
    image: $IMAGE
    subnetwork: projects/$PROJECT/regions/us-central1/subnetworks/default # Required if network is in custom mode
    bootDiskSizeGB: 100
  bootstrapKubernetes:
    version: "1.31"
    controlPlaneIP: $CONTROL_PLANE_IP
EOF
If you want to further customize the VM, you can bring your own pre-provisioned GCE Instance Template and bypass the controller's template creation process. Provide its URI in the instanceTemplateURI field. When this field is set, the controller skips generating an Instance Template and uses yours directly to create the Managed Instance Group. The controller still performs basic validation.
Note: Because the controller skips the bootstrapping phase when using a custom Instance Template, you must configure your own bootstrapping solution. Once the nodes join, the controller will still perform node labeling and deploy the TPU device plugin DaemonSet.
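When choosing values for topology and nodeCount, it can help to check that they agree. The sketch below shows the arithmetic under one explicit assumption: every host VM exposes 4 TPU chips, as in the example values used in this document (a 2x2x2 topology, 2 nodes, and 4 chips requested per pod). The function names are hypothetical and other machine types may use a different chips-per-host value.

```python
from math import prod


def chips_in_topology(topology):
    """Return the total chip count for a topology string such as '2x2x2'."""
    dims = [int(d) for d in topology.lower().split("x")]
    if any(d <= 0 for d in dims):
        raise ValueError(f"invalid topology: {topology!r}")
    return prod(dims)


def hosts_for_topology(topology, chips_per_host=4):
    """Return how many VMs a slice needs, assuming chips_per_host chips per VM."""
    chips = chips_in_topology(topology)
    if chips % chips_per_host:
        raise ValueError(
            f"{chips} chips do not divide evenly into hosts of {chips_per_host}"
        )
    return chips // chips_per_host
```

For example, hosts_for_topology("2x2x2") gives 2, matching a nodeCount of 2, and a 4x4x4 topology works out to 64 chips across 16 hosts under the same assumption.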
The controller provides real-time readiness feedback in the CRD status. You don't need to check the GCP console; simply inspect the resource in Kubernetes to see its exact lifecycle state.
kubectl get tpunodegroup $TPU_NODE_GROUP_NAME -o yaml
You can view the detailed conditions array to see the health and provisioning phase of your TPU capacity. The conditions dynamically adapt based on your configuration:
- Ready: Set to True only when 100% of the nodes in the slice are healthy, registered, and their ICI links are established.
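For scripting, the Ready condition can be read from the resource's JSON form. The helper below is a small sketch that assumes the status follows the usual Kubernetes layout, with status.conditions holding entries that carry type and status fields. The function names are hypothetical.

```python
def condition_status(resource, cond_type="Ready"):
    """Return the status string of the named condition, or None if absent."""
    status = resource.get("status") or {}
    for cond in status.get("conditions") or []:
        if cond.get("type") == cond_type:
            return cond.get("status")
    return None


def is_ready(resource):
    """True only when the Ready condition exists and its status is 'True'."""
    return condition_status(resource) == "True"
```

A caller could pass in the parsed output of kubectl get tpunodegroup with -o json and retry until is_ready returns True before submitting a workload.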
In your workload manifest, you can add Kubernetes node selectors to ensure that your TPU workload is scheduled on the correct TPU machine type and topology. The controller automatically injects these labels into the nodes.
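As a rough illustration of how these selectors line up with node labels, the sketch below builds the selector from the same values used elsewhere in this document and checks it against a node's labels. The label keys are copied from this document's example manifest; the helper names are hypothetical, and the matching rule is the standard nodeSelector semantics (every key must be present with an equal value).

```python
def tpu_node_selector(namespace, node_group, topology, accelerator):
    """Build the nodeSelector used to pin a workload to one TPU node group."""
    return {
        "cloud.google.com/tpu-node-group": f"{namespace}-{node_group}",
        "cloud.google.com/gke-tpu-topology": topology,
        "cloud.google.com/gke-tpu-accelerator": accelerator,
    }


def node_matches(selector, node_labels):
    """A node matches when it carries every selector key with the same value."""
    return all(node_labels.get(key) == value for key, value in selector.items())
```

A node labeled for a different topology, or for another node group, fails the match, which is how the scheduler keeps the workload off the wrong slice.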
Note: For JAX workloads, please ensure you use JAX 0.10.0 or newer to avoid the need for manual worker network endpoint metadata configuration.
Example deployment using node selectors:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: tpu-svc-v7x
  namespace: $NAMESPACE
spec:
  clusterIP: None
  publishNotReadyAddresses: true
  selector:
    app: tpu-worker-v7x
  ports:
  - name: coordinator
    port: 1234
  - name: runtime-1
    port: 8470
  - name: runtime-2
    port: 8471
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tpu-worker-v7x
  namespace: $NAMESPACE
spec:
  serviceName: "tpu-svc-v7x"
  replicas: 2
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: tpu-worker-v7x
  template:
    metadata:
      labels:
        app: tpu-worker-v7x
    spec:
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      nodeSelector:
        cloud.google.com/tpu-node-group: $NAMESPACE-$TPU_NODE_GROUP_NAME
        cloud.google.com/gke-tpu-topology: $TOPOLOGY
        cloud.google.com/gke-tpu-accelerator: $ACCELERATOR
      tolerations:
      - key: "google.com/tpu"
        operator: "Exists"
        effect: "NoSchedule"
      containers:
      - name: tpu-job
        image: us-docker.pkg.dev/cloud-tpu-images/jax-ai-image/tpu:latest
        securityContext:
          privileged: true
        command:
        - python3
        - -c
        - |
          import jax, os, time, socket
          def wait_for_dns(hostname):
              for _ in range(30):
                  try:
                      socket.gethostbyname(hostname)
                      return True
                  except socket.gaierror:
                      time.sleep(5)
              return False
          svc_name = "tpu-svc-v7x"
          pod_base_name = "tpu-worker-v7x"
          namespace = "default"
          dns0 = f"{pod_base_name}-0.{svc_name}.{namespace}.svc.cluster.local"
          dns1 = f"{pod_base_name}-1.{svc_name}.{namespace}.svc.cluster.local"
          if not wait_for_dns(dns0) or not wait_for_dns(dns1):
              print("Failed to resolve DNS for TPU workers")
              exit(1)
          os.environ.update({
              "JAX_COORDINATOR_ADDRESS": f"{dns0}:1234",
              "MEGASCALE_COORDINATOR_ADDRESS": f"{dns0}:8471",
              "TPU_PROCESS_ADDRESSES": f"{dns0}:8470,{dns1}:8470",
              "TPU_WORKER_HOSTNAMES": f"{dns0}:8471,{dns1}:8471"
          })
          p_id = int(os.environ["K8S_POD_NAME"].split("-")[-1])
          print(f"Initializing JAX distributed: coordinator={dns0}:1234, total_processes=2, process_id={p_id}")
          jax.distributed.initialize(coordinator_address=f"{dns0}:1234", num_processes=2, process_id=p_id)
          print("Devices:", jax.devices())
          assert len(jax.devices()) == 8, f"Expected 8 devices, got {len(jax.devices())}"
          print("TPU OK")
        env:
        - name: JAX_PLATFORMS
          value: "tpu"
        - name: JAX_PROCESS_COUNT
          value: "2"
        - name: TPU_TOPOLOGY
          value: "$TOPOLOGY"
        - name: TPU_ACCELERATOR_TYPE
          value: "tpu7x-4"
        - name: K8S_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        - name: TPU_CHIPS_PER_HOST_BOUNDS
          value: "2,2,1"
        - name: TPU_HOST_BOUNDS
          value: "1,1,2"
        - name: VBAR_CONTROL_SERVICE_URL
          value: "127.0.0.1:8353"
        - name: TPU_PROCESS_PORT
          value: "8470"
        - name: TPU_RUNTIME_METRICS_PORTS
          value: "8431,8432,8433,8434"
        resources:
          limits:
            google.com/tpu: "4"
          requests:
            google.com/tpu: "4"
EOF
After the job is done, you can clean up the resources using:
kubectl delete tpunodegroup $TPU_NODE_GROUP_NAME
The controller orchestrates a graceful teardown using a Kubernetes Finalizer. It executes the teardown in three strict phases:
- Cordon: Marks all nodes in the slice as Unschedulable to prevent the scheduler from placing new Pods on them.
- Infrastructure Removal: Deletes GCE resources (MIGs, Instance Templates and Workload Policies) in reverse creation order.
- Kubernetes Cleanup: Deletes any remaining stale Node objects from the cluster and finally removes the finalizer, allowing the CR to be fully removed from Kubernetes.
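The three phases above amount to a fixed ordering of steps, which the sketch below makes explicit. It only produces a plan as a list of labeled steps. The step labels, the function name, and the assumed creation order (Workload Policy, then Instance Template, then MIG) are illustrative assumptions, not the controller's internal representation.

```python
# Assumed creation order; teardown walks it in reverse.
CREATION_ORDER = ["WorkloadPolicy", "InstanceTemplate", "ManagedInstanceGroup"]


def teardown_plan(node_names):
    """Return the ordered teardown steps for a slice with the given nodes."""
    # Phase 1: cordon every node so no new Pods land on the slice.
    steps = [("cordon", name) for name in node_names]
    # Phase 2: delete GCE resources in reverse creation order.
    steps += [("delete-gce", kind) for kind in reversed(CREATION_ORDER)]
    # Phase 3: remove stale Node objects, then release the finalizer.
    steps += [("delete-node", name) for name in node_names]
    steps.append(("remove-finalizer", "TPUNodeGroup"))
    return steps
```

Removing the finalizer last is the important part of the design: as long as it is present, Kubernetes keeps the CR around, so a failed GCE deletion can be retried instead of leaving orphaned infrastructure.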
The TPUNodeGroup controller needs GCP permissions to create and manage GCE resources. You can provide these credentials in two ways:
If your Kubernetes control plane or operator is running on a GCE VM, the controller transparently uses the VM's attached service account via Application Default Credentials (ADC). This is the recommended approach for development within GCE, leveraging the existing node identity and predefined IAM scopes for GCE worker nodes.
- Ensure the service account attached to the GCE VM has the required IAM permissions (or roles/compute.admin).
- Ensure the VM has the https://www.googleapis.com/auth/cloud-platform scope enabled.
If you are running the controller outside of GCE or want to use a specific service account, you can provide a key file via a Kubernetes Secret.
- Create a Service Account in GCP and grant it the required permissions (see Prerequisites).
- Generate a JSON key for the service account.
- Create a Kubernetes Secret in the tpu-node-group namespace:
export KEY_PATH=/path/to/your/service-account-key.json
kubectl create secret generic tpu-node-group-credentials \
--from-file=key.json=$KEY_PATH \
--namespace tpu-node-group
- Update deploy/controller/deployment.yaml: Uncomment the env section for GOOGLE_APPLICATION_CREDENTIALS in the controller deployment:
env:
- name: GOOGLE_APPLICATION_CREDENTIALS
  value: /etc/gcp/key.json
This project is licensed under the Apache 2.0 License.
We welcome contributions! Please see docs/contributing.md for more information.
We follow Google's Open Source Community Guidelines.
This is not an officially supported Google product.
This project is not eligible for the Google Open Source Software Vulnerability Rewards Program.