A production-grade GPU inference platform serving LLMs on Kubernetes with full GPU observability and autoscaling.
Built from scratch with real GPUs, real money, and every dollar tracked.
┌─────────────────────────────────────────────────────────────┐
│                       AWS EKS Cluster                       │
│                                                             │
│  System Node (t3.medium)          GPU Node (g5.xlarge)      │
│  ┌────────────────────┐           ┌───────────────────────┐ │
│  │ Prometheus         │           │ vLLM (Mistral 7B)     │ │
│  │ Grafana            │  scrapes  │ NVIDIA Device Plugin  │ │
│  │ CoreDNS            │◄──────────│ DCGM Exporter         │ │
│  │ Cluster Autoscaler │           │ Container Toolkit     │ │
│  └────────────────────┘           │ GPU Feature Discovery │ │
│                                   └───────────────────────┘ │
│                                               │             │
│                                      NVIDIA GPU Operator    │
│                                     (auto-configures all)   │
│                                                             │
│  LoadBalancer ──► vLLM Service ──► vLLM Pod (GPU)           │
└─────────────────────────────────────────────────────────────┘
- Deploy and operate GPU workloads on production Kubernetes
- NVIDIA GPU Operator for automated driver, toolkit, and monitoring setup
- GPU-aware scheduling with taints, tolerations, and resource limits
- Real-time GPU observability with DCGM Exporter + Prometheus + Grafana
- Cost-optimized with spot GPU instances
- LLM serving with vLLM behind a LoadBalancer
- HPA for pod-level autoscaling based on utilization
- Cluster Autoscaler for dynamic GPU node scaling
HPA and Cluster Autoscaler manifests are included and production-ready but were not applied during this build to minimize spend on a personal budget project.
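To make the taints, tolerations, and resource limits item above concrete, here is a minimal sketch of the pod spec fragment a GPU workload would need. The taint key `nvidia.com/gpu`, the value `present`, and the container name are assumptions for illustration; they are not taken from `kubernetes/cluster.yaml` or `kubernetes/vllm-deployment.yaml`, which may differ.

```shell
# Assumed taint on the GPU node (hypothetical key/value), applied with:
#   kubectl taint nodes <GPU_NODE> nvidia.com/gpu=present:NoSchedule

# Print a pod spec fragment that tolerates that taint and requests one GPU.
gpu_pod_fragment() {
  cat <<'EOF'
spec:
  tolerations:
    - key: nvidia.com/gpu      # must match the taint key on the node
      operator: Exists
      effect: NoSchedule
  containers:
    - name: vllm               # hypothetical container name
      image: vllm/vllm-openai:latest
      resources:
        limits:
          nvidia.com/gpu: 1    # schedules the pod only where a GPU is free
EOF
}

gpu_pod_fragment
```

The toleration lets the pod onto the tainted node, while the `nvidia.com/gpu: 1` limit is what actually reserves the GPU; pods with neither are kept off the GPU node.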
| Component | Tool | Purpose |
|---|---|---|
| Model Serving | vLLM | Serve Mistral 7B with bfloat16, Flash Attention v2, CUDA graphs |
| GPU Management | NVIDIA GPU Operator | Auto-install drivers, toolkit, device plugin, DCGM |
| Orchestration | Amazon EKS | Managed Kubernetes with GPU spot node groups |
| GPU Monitoring | DCGM Exporter | GPU utilization, memory, temperature, power metrics |
| Metrics | Prometheus | Scrape and store GPU + inference metrics |
| Dashboards | Grafana | Real-time GPU monitoring dashboards |
| Autoscaling | HPA + Cluster Autoscaler | Scale pods and GPU nodes based on demand |
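For the autoscaling row, it can help to see how the HPA turns a metric into a replica count. The sketch below implements the documented core formula, desired = ceil(current replicas × current metric / target metric). The metric values are made up for illustration and are not taken from `kubernetes/vllm-hpa.yaml`; the real controller also applies a tolerance band and min/max replica bounds, which are omitted here.

```shell
# Compute the HPA's desired replica count.
# Arguments: <current_replicas> <current_metric> <target_metric>
hpa_desired_replicas() {
  awk -v c="$1" -v m="$2" -v t="$3" 'BEGIN {
    r = c * m / t        # raw scaling ratio applied to current replicas
    d = int(r)           # truncate toward zero
    if (d < r) d++       # round up to get the ceiling
    print d
  }'
}

hpa_desired_replicas 1 85 60   # one pod at 85% against a 60% target
hpa_desired_replicas 2 90 60   # two pods at 90% against a 60% target
```

Each extra replica needs its own GPU, so on a single-GPU node a scale-up leaves the new pod pending until the Cluster Autoscaler adds another GPU node.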
| Spec | Value |
|---|---|
| Instance | g5.xlarge (Spot) |
| GPU | NVIDIA A10G |
| VRAM | 24 GB (about 23 GB usable) |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| Architecture | Ampere |
| Compute Capability | 8.6 |
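The values in this table can be checked on the GPU host with `nvidia-smi`. The helper below is a small sketch; the function name is hypothetical, and the `compute_cap` query field is assumed to be available, which should hold for recent drivers such as the one listed above.

```shell
# Print GPU name, total memory, driver version, and compute capability as CSV.
# Returns 1 with a message on stderr when nvidia-smi is not installed.
show_gpu_specs() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found on PATH" >&2
    return 1
  fi
  nvidia-smi --query-gpu=name,memory.total,driver_version,compute_cap --format=csv
}
```

Run `show_gpu_specs` directly on the instance, or run the same `nvidia-smi` command inside a GPU pod with `kubectl exec`.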
- AWS CLI configured
- eksctl and kubectl installed
- GPU spot quota of at least 4 vCPUs in your region
eksctl create cluster -f kubernetes/cluster.yaml

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set dcgmExporter.enabled=true \
--set devicePlugin.enabled=true

kubectl describe node <GPU_NODE> | grep nvidia.com/gpu
# Should show: nvidia.com/gpu: 1

kubectl create namespace inference
kubectl apply -f kubernetes/vllm-deployment.yaml
kubectl apply -f kubernetes/vllm-service.yaml

ENDPOINT=$(kubectl get svc -n inference vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
curl http://$ENDPOINT:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"mistralai/Mistral-7B-Instruct-v0.3","messages":[{"role":"user","content":"What is GPU infrastructure?"}],"max_tokens":100}'

cd standalone
docker-compose up -d
# Grafana: http://localhost:3000 (admin/gpu-project)

eksctl delete cluster --name gpu-inference-cluster --region us-east-1

For quick testing without EKS overhead:
# Launch vLLM on any GPU instance
docker run -d --gpus all --name vllm -p 8000:8000 \
vllm/vllm-openai:latest \
--model mistralai/Mistral-7B-Instruct-v0.3 \
--max-model-len 4096
# Launch DCGM + Prometheus + Grafana
cd standalone
docker-compose up -d

| Component | Rate | Usage | Cost |
|---|---|---|---|
| EKS Control Plane | $0.10/hr | ~4 hrs | $0.40 |
| t3.medium (system) | $0.04/hr | ~4 hrs | $0.16 |
| g5.xlarge spot (GPU) | $0.44/hr | ~4 hrs | $1.76 |
| NAT Gateway | $0.045/hr | ~4 hrs | $0.18 |
| Standalone EC2 (Day 1-2) | $1.00/hr | ~3 hrs | $3.00 |
| EBS Storage | $0.08/GB/mo | 100GB, 5 days | $1.30 |
| Total | | | ~$7 |
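The total can be reproduced from the rates and hours in the table. The sketch below sums the hourly items with `awk`; the function name is hypothetical, and the EBS line assumes the monthly rate is prorated over a 30-day month.

```shell
# Sum rate * hours over input lines of the form "<rate $/hr> <hours> [label...]".
# Anything after the second field is ignored, so labels are safe to include.
hourly_total() {
  awk '{ sum += $1 * $2 } END { printf "%.2f\n", sum }'
}

hourly_total <<'EOF'
0.10  4 EKS control plane
0.04  4 t3.medium system node
0.44  4 g5.xlarge spot GPU node
0.045 4 NAT gateway
1.00  3 standalone EC2
EOF
# prints 5.50

# EBS: $0.08 per GB-month, 100 GB, 5 of 30 days (assumed proration).
awk 'BEGIN { printf "%.2f\n", 0.08 * 100 * 5 / 30 }'
# prints 1.33
```

Together that is about $6.83, which matches the rounded ~$7 total; the table lists EBS as $1.30, a slightly coarser rounding of the same figure.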
gpu-inference-eks/
├── README.md
├── kubernetes/
│ ├── cluster.yaml # EKS cluster config
│ ├── vllm-deployment.yaml # vLLM K8s deployment
│ ├── vllm-service.yaml # LoadBalancer service
│ ├── vllm-hpa.yaml # Horizontal Pod Autoscaler
│ └── cluster-autoscaler.yaml # Node-level autoscaling
├── standalone/
│ ├── setup.sh # One-click GPU instance setup
│ ├── docker-compose.yml # DCGM + Prometheus + Grafana
│ └── prometheus.yml # Prometheus scrape config
├── grafana/
│ └── gpu-dashboard.json # GPU monitoring dashboard
├── .github/workflows/
│ └── deploy.yml # CI/CD for model deployment
├── screenshots/
│ ├── nvidia-smi.png
│ ├── gpu-grafana-dashboard.png
│ ├── eks-gpu-pods.png
│ ├── vllm-response.png
│ └── dcgm-metrics.png
├── benchmarks/
│ └── results.md
└── cost-breakdown.md
- GPU Operator eliminates manual setup: on standalone EC2, installing drivers, toolkit, and DCGM took an hour. On EKS with GPU Operator, it is automatic on every GPU node.
- Taints protect expensive GPU nodes: without taints, Kubernetes can schedule unrelated system pods onto the $0.44/hr GPU node. Taints ensure that only pods which tolerate them (the GPU workloads) run there.
- Spot interruptions are real: the first attempt was terminated mid-setup. Production GPU workloads need interruption handling and checkpointing.
- vLLM is production-ready: bfloat16, Flash Attention v2, CUDA graphs, and chunked prefill were all enabled by default. No tuning was needed for basic serving.
- DCGM + Prometheus + Grafana is the standard GPU monitoring stack: the same patterns as CPU monitoring, but with GPU-specific metrics (utilization, VRAM, temperature, power, ECC errors).
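As one possible shape for the interruption handling mentioned in the spot lesson above, the sketch below polls the EC2 instance metadata service (IMDSv2) for a spot interruption notice, which AWS issues about two minutes before reclaiming the instance. It is not part of this repository: the function names are hypothetical, and `drain_and_checkpoint` is a placeholder you would supply. On EKS, a managed component such as the AWS Node Termination Handler is the more common way to react to these notices.

```shell
IMDS="http://169.254.169.254/latest"

# Decide from the HTTP status of the instance-action lookup:
# 200 means a notice has been issued, 404 means none is pending.
interruption_pending() {
  [ "$1" = "200" ]
}

# Fetch an IMDSv2 session token, then print the HTTP status of
# the spot/instance-action endpoint. Only meaningful on an EC2 instance.
spot_action_status() {
  token=$(curl -s --max-time 2 -X PUT "$IMDS/api/token" \
    -H "X-aws-ec2-metadata-token-ttl-seconds: 60") || return 1
  curl -s --max-time 2 -o /dev/null -w '%{http_code}' \
    -H "X-aws-ec2-metadata-token: $token" \
    "$IMDS/meta-data/spot/instance-action"
}

# Poll every 5 seconds and hand off to a user-supplied hook when a notice appears.
watch_for_interruption() {
  while true; do
    if interruption_pending "$(spot_action_status)"; then
      drain_and_checkpoint   # hypothetical hook: stop traffic, save state
      return 0
    fi
    sleep 5
  done
}
```

Nothing runs until `watch_for_interruption` is called, so the functions can be sourced into an existing startup script and started in the background.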
- NVIDIA NCA-AIIO (AI Infrastructure & Operations)
- AWS Solutions Architect Associate
- AWS Machine Learning Specialty
- AWS Data Engineer Associate
Meet Bhanushali — DevOps Engineer building AI infrastructure skills.