GPU Inference Platform on AWS EKS

A production-grade GPU inference platform that serves LLMs on Kubernetes, with full GPU observability and ready-to-apply autoscaling manifests.

Built from scratch with real GPUs, real money, and every dollar tracked.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster                           │
│                                                              │
│  System Node (t3.medium)          GPU Node (g5.xlarge)       │
│  ┌──────────────────┐             ┌──────────────────────┐   │
│  │ Prometheus        │             │ vLLM (Mistral 7B)    │   │
│  │ Grafana           │   scrapes   │ NVIDIA Device Plugin  │   │
│  │ CoreDNS           │◄───────────│ DCGM Exporter         │   │
│  │ Cluster Autoscaler│             │ Container Toolkit     │   │
│  └──────────────────┘             │ GPU Feature Discovery │   │
│                                    └──────────────────────┘   │
│                                              │                │
│                                     NVIDIA GPU Operator       │
│                                     (auto-configures all)     │
│                                                              │
│  LoadBalancer ──► vLLM Service ──► vLLM Pod (GPU)            │
└─────────────────────────────────────────────────────────────┘

What This Proves

Deployed and Tested

Deploy and operate GPU workloads on production Kubernetes
NVIDIA GPU Operator for automated driver, toolkit, and monitoring setup
GPU-aware scheduling with taints, tolerations, and resource limits
Real-time GPU observability with DCGM Exporter + Prometheus + Grafana
Cost-optimized with spot GPU instances
LLM serving with vLLM behind a LoadBalancer
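
The taint, toleration, and resource-limit combination listed above can be sketched as a small smoke-test Pod. This is a minimal illustration, not the repository's actual manifest: the taint key `nvidia.com/gpu`, the pod name `gpu-smoke-test`, and the CUDA image tag are all assumptions.

```shell
#!/bin/sh
# Hypothetical sketch: print a smoke-test Pod that tolerates a GPU taint and
# requests one GPU. Taint key/effect, pod name, and image tag are assumptions.
render_gpu_smoke_test() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu      # must match the taint placed on the GPU node
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1    # only schedulable where a GPU is allocatable
EOF
}

render_gpu_smoke_test
```

Assuming the node is tainted with something like `kubectl taint nodes <GPU_NODE> nvidia.com/gpu=present:NoSchedule`, the output can be piped to `kubectl apply -f -`. Pods without the toleration stay off the GPU node, and the resource limit keeps this pod off nodes that have no GPU.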

Manifests Ready (Not Applied — Cost Optimization)

HPA for pod-level autoscaling based on utilization
Cluster Autoscaler for dynamic GPU node scaling

HPA and Cluster Autoscaler manifests are included and ready to apply, but they were not applied during this build in order to keep spend low on a personally funded project.
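
For orientation, a pod-level autoscaler of this kind might look like the sketch below. The README does not say which utilization metric the repository's `vllm-hpa.yaml` targets, so this example assumes CPU utilization, which works with metrics-server alone. The Deployment name `vllm`, the HPA name, the replica bounds, and the 70% target are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: print an autoscaling/v2 HPA for a vLLM Deployment.
# Names, replica bounds, and the CPU target are assumptions, not repo values.
render_vllm_hpa() {
  cat <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu              # percentage of the pod's CPU request
        target:
          type: Utilization
          averageUtilization: 70
EOF
}

render_vllm_hpa
```

Scaling on GPU utilization instead would need a custom metrics pipeline that exposes DCGM metrics to the HPA. Because each replica requests a whole GPU, any replica beyond the first stays Pending until another GPU node joins, which is the gap the Cluster Autoscaler is meant to fill.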

Tech Stack

Component	Tool	Purpose
Model Serving	vLLM	Serve Mistral 7B with bfloat16, Flash Attention v2, CUDA graphs
GPU Management	NVIDIA GPU Operator	Auto-install drivers, toolkit, device plugin, DCGM
Orchestration	Amazon EKS	Managed Kubernetes with GPU spot node groups
GPU Monitoring	DCGM Exporter	GPU utilization, memory, temperature, power metrics
Metrics	Prometheus	Scrape and store GPU + inference metrics
Dashboards	Grafana	Real-time GPU monitoring dashboards
Autoscaling	HPA + Cluster Autoscaler	Scale pods and GPU nodes based on demand

GPU Instance Details

Spec	Value
Instance	g5.xlarge (Spot)
GPU	NVIDIA A10G
VRAM	24 GB (about 23 GB usable)
Driver	580.126.09
CUDA	13.0
Architecture	Ampere
Compute Capability	8.6

Quick Start

Prerequisites

AWS CLI configured
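eksctl, kubectl, and Helm installed
Docker with Compose (for the standalone monitoring stack)
Spot quota for G instances ("All G and VT Spot Instance Requests") of at least 4 vCPUs in your region, enough for one g5.xlarge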

1. Create EKS Cluster

eksctl create cluster -f kubernetes/cluster.yaml

2. Install NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set devicePlugin.enabled=true

3. Verify GPU is Registered

kubectl describe node <GPU_NODE> | grep nvidia.com/gpu
# Should show: nvidia.com/gpu: 1
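
The GPU Operator can take several minutes to install drivers and register the device plugin, so the check above may come back empty at first. A small polling helper, sketched here under the assumption that `kubectl` already points at the new cluster, avoids re-running it by hand. The function name and defaults are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: poll until the cluster advertises at least one
# allocatable nvidia.com/gpu, or give up after a number of attempts.
wait_for_gpu() {
  attempts=${1:-30}   # how many times to poll
  delay=${2:-10}      # seconds between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Sum allocatable GPUs across all nodes; nodes without GPUs add nothing.
    total=$(kubectl get nodes \
      -o jsonpath='{range .items[*]}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' \
      | awk '{ sum += $1 } END { print sum + 0 }')
    if [ "$total" -ge 1 ]; then
      echo "allocatable GPUs: $total"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "no allocatable GPU after $attempts attempts" >&2
  return 1
}
```

Calling `wait_for_gpu 30 10` waits up to roughly five minutes before failing, which makes it usable as a gate in a setup script.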

4. Deploy vLLM

kubectl create namespace inference
kubectl apply -f kubernetes/vllm-deployment.yaml
kubectl apply -f kubernetes/vllm-service.yaml

5. Test Inference

# The LoadBalancer hostname can take a few minutes to be assigned and to resolve
ENDPOINT=$(kubectl get svc -n inference vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
# Send one chat completion request to the OpenAI-compatible API
curl "http://${ENDPOINT}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.3","messages":[{"role":"user","content":"What is GPU infrastructure?"}],"max_tokens":100}'
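
A request sent right after deployment can fail because the model is still loading or the load balancer is not yet reachable. The sketch below retries vLLM's `/health` endpoint until it answers with HTTP 200. The function name, retry counts, and timeouts are hypothetical choices.

```shell
#!/bin/sh
# Hypothetical helper: retry the vLLM health endpoint until it answers 200.
wait_for_vllm() {
  base_url=$1          # e.g. http://$ENDPOINT:8000
  attempts=${2:-30}    # how many times to try
  delay=${3:-10}       # seconds between tries
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -s hides progress, -o discards the body, -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "$base_url/health")
    if [ "$code" = "200" ]; then
      echo "vLLM ready at $base_url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "vLLM not ready after $attempts attempts" >&2
  return 1
}
```

For example, `wait_for_vllm "http://$ENDPOINT:8000" && curl ...` only sends the chat request once the server reports healthy.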

6. Deploy Monitoring (Standalone)

cd standalone
docker-compose up -d
# Grafana: http://localhost:3000 (admin/gpu-project)
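
Before opening Grafana, it can help to confirm that DCGM Exporter is emitting data. The helper below is a rough sketch that reads Prometheus text-format metrics on standard input and prints GPU utilization per device. The function name is hypothetical, and it assumes the exporter's default metric name `DCGM_FI_DEV_GPU_UTIL`.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text-format metrics on stdin and print
# "gpu=<index> util=<percent>" for every DCGM_FI_DEV_GPU_UTIL sample.
gpu_util_from_metrics() {
  awk '
    /^DCGM_FI_DEV_GPU_UTIL\{/ {
      gpu = "?"
      # Pull the value of the gpu="N" label out of the label set.
      if (match($0, /gpu="[0-9]+"/)) {
        gpu = substr($0, RSTART + 5, RLENGTH - 6)
      }
      # The sample value is always the last whitespace-separated field.
      print "gpu=" gpu " util=" $NF
    }
  '
}
```

Assuming the compose file publishes the exporter on its default port 9400, `curl -s http://localhost:9400/metrics | gpu_util_from_metrics` prints one line per GPU. An empty result suggests the exporter is not running or cannot see the GPU.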

7. Tear Down

eksctl delete cluster --name gpu-inference-cluster --region us-east-1

Standalone Setup (Without Kubernetes)

For quick testing without EKS overhead:

# Launch vLLM on any GPU instance
docker run -d --gpus all --name vllm -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len 4096

# Launch DCGM + Prometheus + Grafana
cd standalone
docker-compose up -d

Cost Breakdown

Component	Rate	Usage	Cost
EKS Control Plane	$0.10/hr	~4 hrs	$0.40
t3.medium (system)	$0.04/hr	~4 hrs	$0.16
g5.xlarge spot (GPU)	$0.44/hr	~4 hrs	$1.76
NAT Gateway	$0.045/hr	~4 hrs	$0.18
Standalone EC2 (Day 1-2)	$1.00/hr	~3 hrs	$3.00
EBS Storage	$0.08/GB/mo	100GB, 5 days	$1.30
Total			~$7
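
The total can be re-derived from the rates and usage in the table. The sketch below does the arithmetic with `awk`. Prorating EBS over a 30-day month is an assumption, which is why the EBS line comes out slightly above the table's $1.30.

```shell
#!/bin/sh
# Recompute the cost table from its own rates and usage figures.
# The 30-day month used to prorate EBS is an assumption.
cost_total() {
  awk 'BEGIN {
    hourly  = 0.10 * 4      # EKS control plane
    hourly += 0.04 * 4      # t3.medium system node
    hourly += 0.44 * 4      # g5.xlarge spot GPU node
    hourly += 0.045 * 4     # NAT gateway
    hourly += 1.00 * 3      # standalone EC2, days 1-2
    ebs = 0.08 * 100 * 5 / 30   # $/GB-month * GB * days / days per month
    printf "hourly=%.2f ebs=%.2f total=%.2f\n", hourly, ebs, hourly + ebs
  }'
}

cost_total
# prints: hourly=5.50 ebs=1.33 total=6.83
```

The result of about $6.83 is consistent with the rounded ~$7 in the table.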

Project Structure

gpu-inference-eks/
├── README.md
├── kubernetes/
│   ├── cluster.yaml              # EKS cluster config
│   ├── vllm-deployment.yaml      # vLLM K8s deployment
│   ├── vllm-service.yaml         # LoadBalancer service
│   ├── vllm-hpa.yaml             # Horizontal Pod Autoscaler
│   └── cluster-autoscaler.yaml   # Node-level autoscaling
├── standalone/
│   ├── setup.sh                  # One-click GPU instance setup
│   ├── docker-compose.yml        # DCGM + Prometheus + Grafana
│   └── prometheus.yml            # Prometheus scrape config
├── grafana/
│   └── gpu-dashboard.json        # GPU monitoring dashboard
├── .github/workflows/
│   └── deploy.yml                # CI/CD for model deployment
├── screenshots/
│   ├── nvidia-smi.png
│   ├── gpu-grafana-dashboard.png
│   ├── eks-gpu-pods.png
│   ├── vllm-response.png
│   └── dcgm-metrics.png
├── benchmarks/
│   └── results.md
└── cost-breakdown.md

Key Learnings

GPU Operator eliminates manual setup: on standalone EC2, installing drivers, the container toolkit, and DCGM by hand took an hour. On EKS with GPU Operator, it happens automatically on every GPU node.
Taints protect expensive GPU nodes: without them, Kubernetes is free to schedule unrelated system pods on the $0.44/hr GPU node. A taint ensures that only pods with a matching toleration run there.
Spot interruptions are real: the first attempt was terminated mid-setup. Production GPU workloads need interruption handling and checkpointing.
vLLM is production-ready: bfloat16, Flash Attention v2, CUDA graphs, and chunked prefill are all enabled by default, so basic serving needs no tuning.
DCGM + Prometheus + Grafana is the standard GPU monitoring stack: it follows the same patterns as CPU monitoring, with GPU-specific metrics (utilization, VRAM, temperature, power, ECC errors).
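
As one illustration of the interruption handling mentioned above, a spot instance can ask the EC2 instance metadata service whether an interruption notice has been issued. The sketch below uses IMDSv2 and is meant to run on the node itself. The function name and timeouts are hypothetical, and this project did not deploy it.

```shell
#!/bin/sh
# Hypothetical sketch: ask the EC2 instance metadata service (IMDSv2) whether
# a spot interruption notice has been issued for this instance.
IMDS=http://169.254.169.254/latest

spot_interruption_pending() {
  # IMDSv2 requires a short-lived session token on every request.
  token=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
  # The endpoint returns 404 until AWS schedules a stop or terminate action.
  code=$(curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/spot/instance-action")
  [ "$code" = "200" ]
}
```

A loop such as `while ! spot_interruption_pending; do sleep 5; done` could then trigger a node drain once the notice arrives, about two minutes before the instance is reclaimed. On Kubernetes, the AWS Node Termination Handler covers the same ground as a maintained component and is likely the better choice beyond a sketch.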

Certifications

NVIDIA NCA-AIIO (AI Infrastructure & Operations)
AWS Solutions Architect Associate
AWS Machine Learning Specialty
AWS Data Engineer Associate

Author

Meet Bhanushali — DevOps Engineer building AI infrastructure skills.

LinkedIn | GitHub
