aws-samples/sample-aws-eks-auto-mode

Setting up EKS Auto Mode using Terraform

License: MIT-0 EKS Terraform Kubernetes Docs

Automate Kubernetes the easy way. Deploy once, explore GPU, Spot, Graviton, cost optimization, ODCR, disruption budgets, observability, and more. Minimal add-on management. Auto Mode handles core compute, storage, and networking add-ons for you.

Overview

Amazon EKS Auto Mode simplifies Kubernetes cluster management by automating compute, storage, and networking decisions. Under the hood it runs Karpenter, the AWS Load Balancer Controller, and the EBS CSI driver as managed components. You get the benefits without installing or upgrading any of them.

This repository is an educational companion. Each example demonstrates a specific EKS Auto Mode pattern (Graviton, GPU, Spot, ODCR targeting, disruption budgets, etc.) with a self-contained README explaining the "why" alongside the "how." Deploy the base cluster once, then apply individual examples to explore.

Key capabilities covered:

  • Graviton (ARM64) and x86 workloads side by side
  • GPU and Inferentia2 (Neuron) ML inference
  • Spot and On-Demand mixed pools with overprovision headroom
  • On-Demand Capacity Reservation targeting
  • Static capacity pools and disruption budgets
  • HPA and KEDA-driven autoscaling
  • KMS encryption for ephemeral node storage
  • CloudWatch Container Insights observability
  • 5-layer resource tagging for cost allocation

Prerequisites

Required Tools:

Note: This project currently provides Linux-specific commands in the examples. Windows compatibility will be added in future updates.

Claude Code Skills (Plugin)

This repo ships as a Claude Code plugin with two AI-assisted skills for EKS Auto Mode:

| Skill | Audience | What it covers |
| --- | --- | --- |
| eks-automode-onboard | Newcomers | Concepts, deployment, example selection, troubleshooting |
| eks-automode-maintain | Repo maintainers | Rendering chain, 5-layer tagging, docs sync, PR checklist |

Install

/plugin marketplace add https://github.com/aws-samples/sample-aws-eks-auto-mode.git
/plugin install eks-automode@sample-aws-eks-auto-mode

Alternative: manual install

git clone https://github.com/aws-samples/sample-aws-eks-auto-mode.git
cp -r sample-aws-eks-auto-mode/skills/eks-automode-onboard ~/.claude/skills/
cp -r sample-aws-eks-auto-mode/skills/eks-automode-maintain ~/.claude/skills/

Quick Start

  1. Clone Repository:
git clone https://github.com/aws-samples/sample-aws-eks-auto-mode.git
cd sample-aws-eks-auto-mode
  2. Deploy Cluster:
cd terraform
terraform init
terraform apply -auto-approve

# Configure kubectl
$(terraform output -raw configure_kubectl)
  3. Apply an example (e.g., Graviton):
kubectl apply -f examples/graviton/

Examples

Each example has its own README with detailed explanations of the underlying mechanics.

Compute Patterns

| Example | Description |
| --- | --- |
| Graviton | ARM64 workloads on cost-effective Graviton instances. Deploys a 2048 game. |
| Spot | Fault-tolerant workloads on EC2 Spot with diverse instance families. Deploys a 2048 game. |
| GPU | GPU-accelerated ML inference (Qwen 3 on NVIDIA GPUs). |
| Neuron | ML inference on AWS Inferentia2 (DeepSeek-R1-Qwen3-8B served by vLLM). |

Cost Optimization

| Example | Description |
| --- | --- |
| Cost Optimization | OD/Spot mixed pools with weighted priorities and pause-pod overprovision headroom. |

Advanced Scheduling

| Example | Description |
| --- | --- |
| Capacity Reservation | Pin workloads to On-Demand Capacity Reservations (ODCRs) so reserved capacity is consumed. |
| Static Capacity | Maintain a fixed fleet of always-on nodes using spec.replicas, immune to consolidation. |
| Batch Jobs | Protect long-running jobs from eviction using do-not-disrupt annotations and dedicated NodePools. |
| Disruption Budgets | Limit simultaneous node drains during consolidation to prevent cascading failures. |
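
The last two rows can be sketched in a single manifest. This is only an illustration and is not taken from the example directories: the resource names, the 10% budget, the 5m consolidation delay, and the busybox image are assumptions, and it presumes a NodeClass named default exists (Auto Mode creates one when the built-in node pools are enabled).

```shell
# Write a sketch manifest: a NodePool with a disruption budget, plus a Job
# whose pods opt out of voluntary disruption. All names are hypothetical.
cat > disruption-sketch.yaml <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: sketch-budgeted
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
    budgets:
      - nodes: "10%"        # at most 10% of this pool's nodes disrupted at once
  template:
    spec:
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default       # assumed to exist
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
apiVersion: batch/v1
kind: Job
metadata:
  name: sketch-long-job
spec:
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"   # keep the node until the pod finishes
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox:1.36
          command: ["sh", "-c", "sleep 3600"]
EOF

# Review the file, then apply it against the cluster:
# kubectl apply -f disruption-sketch.yaml
```

The example READMEs remain the reference for the values this repo actually uses.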

Autoscaling

| Example | Description |
| --- | --- |
| Pod Autoscaling | HPA for CPU-based scaling plus KEDA for event-driven scaling (SQS queue depth). |

Observability

| Example | Description |
| --- | --- |
| Observability | CloudWatch Container Insights integration for metrics, pod logs, and Application Signals tracing. |

Cleanup

A standalone cleanup script handles the full teardown lifecycle. It drains Kubernetes-controller-managed AWS resources (ALBs, EBS volumes, EC2 instances) before terraform destroy, then sweeps for any orphans that survived.

# Recommended: interactive cleanup (prompts per resource)
./scripts/cleanup.sh

# Non-interactive: delete everything
./scripts/cleanup.sh --yes

# Preview what would be deleted
./scripts/cleanup.sh --dry-run

# Delete everything except storage (PVCs/EBS)
./scripts/cleanup.sh --yes --keep-storage

# Orphan sweep only (terraform already destroyed)
./scripts/cleanup.sh --skip-terraform --cluster-name <name> --region <region>

The script runs in three phases:

  1. Pre-drain deletes Ingresses, LoadBalancer Services, PVCs, Helm releases, NodePools/NodeClaims while the cluster API is alive so controllers can fire finalizers and release AWS resources.
  2. Terraform destroy runs terraform init + destroy for both the main and KEDA terraform roots.
  3. Orphan sweep scans for resources tagged with the cluster name (or matching known patterns for untaggable resources like Auto Mode internal volumes) and prompts for deletion.

Why not just terraform destroy? A bare terraform destroy doesn't drain Kubernetes-managed resources first. ALBs, EBS volumes, EC2 instances, and ENIs created by in-cluster controllers (ALB controller, EBS CSI, Karpenter) are not in Terraform state. They persist as orphans after the cluster is gone. The cleanup script handles these.

Manual alternative (not recommended)
cd terraform
terraform init
terraform destroy -auto-approve

This only destroys Terraform-managed resources. You will need to manually find and delete any orphaned load balancers, volumes, instances, security groups, IAM roles, OIDC providers, and CloudWatch log groups.

Configuration

All inputs are defined in terraform/variables.tf. Override them with -var flags or a terraform.tfvars file.

| Variable | Description | Default |
| --- | --- | --- |
| name | Name of the VPC and EKS cluster | automode-cluster |
| region | AWS region to deploy into | us-west-2 |
| eks_cluster_version | EKS Kubernetes version | 1.34 |
| vpc_cidr | VPC CIDR block (RFC 1918) | 10.0.0.0/16 |
| tags | Tags applied to every taggable resource (provider default_tags, EKS primary SG, NodeClass EC2/EBS/ENI, StorageClass EBS, ALB) | {"auto-delete" = "never"} |
| base_domain | Public Route53 hosted zone for HTTPS exposure. Leave empty for internal-only (safe-by-default). | "" |
| subdomain | Optional prefix under base_domain (e.g., automode gives automode.example.com). Ignored when base_domain is empty. | "" |
| ephemeral_storage_kms_key_id | KMS key ID for encrypting ephemeral node storage. Leave empty for default encryption. | "" |
| enable_observability | Enable CloudWatch Container Insights addon (metrics, logs, Application Signals). Incurs CloudWatch costs. | false |

Example: public exposure with observability:

terraform apply \
  -var='base_domain=example.com' \
  -var='subdomain=automode' \
  -var='enable_observability=true'
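
The same overrides can be kept in a terraform.tfvars file instead of -var flags. A minimal sketch, assuming it is run from the terraform/ directory and that no terraform.tfvars exists there yet (the command overwrites it); the values are the placeholders from the example above:

```shell
# Write the overrides into terraform.tfvars. Terraform loads a file with
# this exact name automatically from the working directory.
cat > terraform.tfvars <<'EOF'
base_domain          = "example.com"
subdomain            = "automode"
enable_observability = true
EOF

# Then review and apply without any -var flags:
# terraform plan
# terraform apply
```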

Components

How EKS Auto Mode Works

EKS Auto Mode takes over the operational overhead of running Kubernetes on AWS. Rather than requiring you to install, configure, and upgrade individual cluster add-ons, Auto Mode runs them as managed components inside the EKS control plane. Specifically, it:

  • Provisions, scales, and consolidates compute. Powered by Karpenter, it matches pending pods to optimal EC2 instances, bin-packs efficiently, and removes underutilized nodes automatically.
  • Manages pod networking. Handles VPC CNI configuration, IP address allocation, and security group enforcement without any DaemonSet you need to maintain.
  • Handles persistent storage. Provisions and attaches EBS volumes from PersistentVolumeClaims via the managed EBS CSI driver.
  • Automates load balancing. Creates ALBs and NLBs from Ingress and Service resources, including TLS termination with ACM certificates.
  • Runs CoreDNS. Cluster DNS is a managed component with no installation or tuning required.
  • Manages Pod Identity Agent. Enables fine-grained IAM roles for pods without manual IRSA configuration.
  • Monitors node health. Detects unhealthy nodes and automatically repairs or replaces them.
  • Handles AMI selection and patching. Picks the correct AMI for each instance type, applies security patches, and remediates drift.

All of these run in the control plane. You never install, configure, or upgrade them. You interact through standard Kubernetes APIs (NodePool, NodeClass, Ingress, StorageClass, etc.) and EKS handles the rest.

NodePool → NodeClass → EC2 Flow

The provisioning path:

Pod pending → Karpenter matches NodePool constraints (instance families, AZs, capacity type)
           → NodePool references a NodeClass (subnet selection, security groups, tags, storage)
           → Karpenter launches an EC2 instance matching the constraints
           → kubelet registers the node and the pod is scheduled

NodePools define what to launch (instance types, architectures, capacity type, taints/labels). NodeClasses define how to launch (subnets, SGs, ephemeral storage, tags pushed to EC2/EBS/ENI).
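
The split between "what" and "how" can be seen in a small NodePool manifest. This is a sketch rather than one of the repo's rendered examples: the name sketch-arm64 is hypothetical, and it assumes a NodeClass named default exists (Auto Mode creates one when the built-in node pools are enabled).

```shell
# Write a minimal NodePool that references a NodeClass.
cat > nodepool-sketch.yaml <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: sketch-arm64
spec:
  template:
    spec:
      nodeClassRef:             # the "how": subnets, SGs, storage, tags
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
      requirements:             # the "what": constraints on the instances
        - key: kubernetes.io/arch
          operator: In
          values: ["arm64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
EOF

# Review the file, then apply it against the cluster:
# kubectl apply -f nodepool-sketch.yaml
```

A pending pod whose nodeSelector or affinity is satisfied by these requirements would then trigger the flow above.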

Load Balancer Configuration

EKS Auto Mode automates ALB and NLB setup:

  • Application Load Balancer (ALB): IngressClass-based, supports shared ALB groups across namespaces. Docs
  • Network Load Balancer (NLB): Native Kubernetes Service type LoadBalancer. Docs

Subnet tagging requirement: If subnet IDs are not explicit in IngressClassParams, subnets need kubernetes.io/role/elb: "1" (public) or kubernetes.io/role/internal-elb: "1" (private). The Terraform code in this repo adds these tags automatically.
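
For illustration, an internal ALB class might be declared as below. This is a sketch and not the IngressClass that this repo's Terraform manages: the name sketch-internal-alb is hypothetical. Because no subnets are listed in the IngressClassParams, subnet selection falls back on the role tags described above.

```shell
# Write an IngressClassParams plus the IngressClass that points at it.
cat > ingressclass-sketch.yaml <<'EOF'
apiVersion: eks.amazonaws.com/v1
kind: IngressClassParams
metadata:
  name: sketch-internal-alb
spec:
  scheme: internal              # no subnets listed: discovered via subnet tags
---
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: sketch-internal-alb
spec:
  controller: eks.amazonaws.com/alb
  parameters:
    apiGroup: eks.amazonaws.com
    kind: IngressClassParams
    name: sketch-internal-alb
EOF

# Review the file, then apply it against the cluster:
# kubectl apply -f ingressclass-sketch.yaml
```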

Public exposure (opt-in)

This stack is safe-by-default: every example workload exposes an internal-scheme load balancer reachable only via kubectl port-forward. Nothing is published to the public internet without an explicit opt-in.

To expose the example workloads on a real domain over HTTPS, set var.base_domain (and optionally var.subdomain) to a public Route53 hosted zone you already own:

terraform apply -var='base_domain=example.com' -var='subdomain=automode'

When base_domain is set, Terraform will:

  • Look up the existing public hosted zone (it does not create one; the zone must already exist and be the authoritative DNS for that name).
  • Issue an ACM wildcard certificate *.<subdomain>.<base_domain> validated via DNS records added to the zone.
  • Install external-dns bound to a Pod Identity IAM role scoped to only that hosted zone (not Route53FullAccess).
  • Switch the cluster-wide IngressClass alb to internet-facing with a shared ALB group so all example Ingresses share one load balancer.
  • Render each example with a public hostname and the appropriate annotations.

Workload hostnames once enabled:

| Example | URL |
| --- | --- |
| examples/graviton | https://2048-graviton.<full_domain> |
| examples/spot | https://2048-spot.<full_domain> |
| examples/gpu | https://gpu.<full_domain> |
| examples/neuron | https://neuron.<full_domain> |
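
As a rough illustration of how these URLs are composed, the helper functions below join the host prefix, the optional subdomain, and base_domain. The function names are hypothetical, and the assumption that <full_domain> collapses to base_domain when subdomain is empty is inferred from the variable descriptions, not taken from the Terraform code.

```shell
# full_domain BASE_DOMAIN [SUBDOMAIN]
# Prints SUBDOMAIN.BASE_DOMAIN, or just BASE_DOMAIN when SUBDOMAIN is empty.
full_domain() {
  if [ -n "${2:-}" ]; then
    printf '%s.%s\n' "$2" "$1"
  else
    printf '%s\n' "$1"
  fi
}

# example_url HOST_PREFIX BASE_DOMAIN [SUBDOMAIN]
# Prints the HTTPS URL for one example workload.
example_url() {
  printf 'https://%s.%s\n' "$1" "$(full_domain "$2" "${3:-}")"
}

example_url 2048-graviton example.com automode  # https://2048-graviton.automode.example.com
example_url gpu example.com                     # https://gpu.example.com
```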

The ALB controller picks the right certificate via SNI from each Ingress's host: against the wildcard cert. No certificateArn is configured anywhere.

To revert to safe-by-default, unset var.base_domain and re-apply.

EBS CSI Driver

EKS Auto Mode includes the EBS CSI driver as a managed component. No installation required.

  • Only volumes provisioned from a StorageClass using ebs.csi.eks.amazonaws.com can mount on Auto Mode nodes.
  • Existing volumes need migration via volume snapshots.
  • Custom KMS encryption may require additional IAM permissions.
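
A StorageClass for the managed driver, with a claim that uses it, might look like the sketch below. The names sketch-auto-ebs and sketch-data, the gp3 volume type, and the 8Gi size are illustrative assumptions; the only value taken from the text above is the provisioner.

```shell
# Write a StorageClass for the Auto Mode EBS provisioner and a PVC using it.
cat > ebs-sketch.yaml <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: sketch-auto-ebs
provisioner: ebs.csi.eks.amazonaws.com
volumeBindingMode: WaitForFirstConsumer   # create the volume once a pod is scheduled
parameters:
  type: gp3
  encrypted: "true"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: sketch-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: sketch-auto-ebs
  resources:
    requests:
      storage: 8Gi
EOF

# Review the file, then apply it against the cluster:
# kubectl apply -f ebs-sketch.yaml
```

With WaitForFirstConsumer the claim stays Pending until a pod references it, which lets the volume be created in the same Availability Zone as the node.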

AWS Documentation

Learn More

Security Considerations

See SECURITY_CONSIDERATIONS.md for Checkov scan results and documented exceptions.

Contributing

Contributions welcome! Please read our Contributing Guidelines and Code of Conduct.

License and Disclaimer

License

This project is licensed under the MIT-0 License - see LICENSE file.

Disclaimer

This repository is intended for demonstration and learning purposes only. It is not intended for production use. The code provided here is for educational purposes and should not be used in a live environment without proper testing, validation, and modifications.

Use at your own risk. The authors are not responsible for any issues, damages, or losses that may result from using this code in production.

In these samples, there may be use of third-party models ("Third-Party Models") that AWS does not own, and that AWS does not exercise control over. By using any prototype or proof of concept from AWS you acknowledge that the Third-Party Models are "Third-Party Content" under your agreement for services with AWS. You should perform your own independent assessment of the Third-Party Models. You should also take measures to ensure that your use of the Third-Party Models complies with your own specific quality control practices and standards, and the local rules, laws, regulations, licenses and terms of use that apply to you, your content, and the Third-Party Models. AWS does not make any representations or warranties regarding the Third-Party Models, including that use of the Third-Party Models and the associated outputs will result in a particular outcome or result. You also acknowledge that outputs generated by the Third-Party Models are Your Content/Customer Content, as defined in the AWS Customer Agreement or the agreement between you and AWS for AWS Services. You are responsible for your use of outputs from the Third-Party Models.