meet302001/gpu-inference-eks

GPU Inference Platform on AWS EKS

A production-grade GPU inference platform serving LLMs on Kubernetes, with full GPU observability and autoscaling manifests (the autoscaling manifests are included but were not applied; see below).

Built from scratch with real GPUs, real money, and every dollar tracked.

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    AWS EKS Cluster                           │
│                                                              │
│  System Node (t3.medium)          GPU Node (g5.xlarge)       │
│  ┌──────────────────┐             ┌──────────────────────┐   │
│  │ Prometheus        │             │ vLLM (Mistral 7B)    │   │
│  │ Grafana           │   scrapes   │ NVIDIA Device Plugin  │   │
│  │ CoreDNS           │◄───────────│ DCGM Exporter         │   │
│  │ Cluster Autoscaler│             │ Container Toolkit     │   │
│  └──────────────────┘             │ GPU Feature Discovery │   │
│                                    └──────────────────────┘   │
│                                              │                │
│                                     NVIDIA GPU Operator       │
│                                     (auto-configures all)     │
│                                                              │
│  LoadBalancer ──► vLLM Service ──► vLLM Pod (GPU)            │
└─────────────────────────────────────────────────────────────┘

What This Proves

Deployed and Tested

  • Deploy and operate GPU workloads on production Kubernetes
  • NVIDIA GPU Operator for automated driver, toolkit, and monitoring setup
  • GPU-aware scheduling with taints, tolerations, and resource limits
  • Real-time GPU observability with DCGM Exporter + Prometheus + Grafana
  • Cost-optimized with spot GPU instances
  • LLM serving with vLLM behind a LoadBalancer
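
The taint-and-toleration scheduling rule listed above can be illustrated with a small Python sketch. It is a simplified model of how Kubernetes decides whether a pod tolerates a node's taints, not the scheduler's actual code. The taint key `nvidia.com/gpu` and its value are assumptions chosen for illustration; they are not read from this repository's manifests.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint (simplified rules)."""
    # A toleration with no effect matches taints of every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # Exists without a key tolerates every taint; otherwise the keys must match.
        return "key" not in toleration or toleration["key"] == taint["key"]
    # Equal: both key and value must match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def schedulable(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits only if every NoSchedule/NoExecute taint is tolerated."""
    blocking = [t for t in node_taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in pod_tolerations)
               for taint in blocking)


# Hypothetical taint on the GPU node and toleration on the vLLM pod.
gpu_taint = {"key": "nvidia.com/gpu", "value": "true", "effect": "NoSchedule"}
vllm_tolerations = [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}]

print(schedulable(vllm_tolerations, [gpu_taint]))  # GPU workload with a toleration
print(schedulable([], [gpu_taint]))                # system pod without one
```

The first call reports that the GPU workload fits on the tainted node; the second shows that a pod with no toleration is kept off it.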

Manifests Ready (Not Applied — Cost Optimization)

  • HPA for pod-level autoscaling based on utilization
  • Cluster Autoscaler for dynamic GPU node scaling

HPA and Cluster Autoscaler manifests are included and written for production use, but were not applied during this build to keep spend low on a personal-budget project.
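
For context on what the HPA manifest would do once applied, here is a minimal Python sketch of the replica calculation that Kubernetes documents for the Horizontal Pod Autoscaler: `ceil(currentReplicas * currentMetric / targetMetric)`, skipped when the ratio is within a tolerance (0.1 by default). The metric values below are illustrative assumptions; the targets in this repository's `vllm-hpa.yaml` are not shown in this README.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, tolerance: float = 0.1) -> int:
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("replicas and target must be positive")
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Illustrative numbers: 2 pods at 90% utilization against a 60% target.
print(desired_replicas(2, 90.0, 60.0))
```

Each extra vLLM replica needs its own GPU, so a scale-up like this only succeeds if the Cluster Autoscaler can add another GPU node.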

Tech Stack

| Component | Tool | Purpose |
|---|---|---|
| Model Serving | vLLM | Serve Mistral 7B with bfloat16, Flash Attention v2, CUDA graphs |
| GPU Management | NVIDIA GPU Operator | Auto-install drivers, toolkit, device plugin, DCGM |
| Orchestration | Amazon EKS | Managed Kubernetes with GPU spot node groups |
| GPU Monitoring | DCGM Exporter | GPU utilization, memory, temperature, power metrics |
| Metrics | Prometheus | Scrape and store GPU + inference metrics |
| Dashboards | Grafana | Real-time GPU monitoring dashboards |
| Autoscaling | HPA + Cluster Autoscaler | Scale pods and GPU nodes based on demand |

GPU Instance Details

| Spec | Value |
|---|---|
| Instance | g5.xlarge (Spot) |
| GPU | NVIDIA A10G |
| VRAM | 23 GB |
| Driver | 580.126.09 |
| CUDA | 13.0 |
| Architecture | Ampere |
| Compute Capability | 8.6 |

Quick Start

Prerequisites

  • AWS CLI configured
  • eksctl and kubectl installed
  • GPU spot quota of at least 4 vCPUs in your region

1. Create EKS Cluster

eksctl create cluster -f kubernetes/cluster.yaml

2. Install NVIDIA GPU Operator

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set dcgmExporter.enabled=true \
  --set devicePlugin.enabled=true

3. Verify GPU is Registered

kubectl describe node <GPU_NODE> | grep nvidia.com/gpu
# Should show: nvidia.com/gpu: 1
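
The same check can be scripted. The sketch below assumes you have saved the output of `kubectl get nodes -o json` and want the allocatable `nvidia.com/gpu` count per node. The node names in the sample are hypothetical, and the sample is trimmed to the fields the function reads.

```python
import json


def gpu_allocatable(node_list: dict) -> dict:
    """Map node name to allocatable nvidia.com/gpu count (0 if the resource is absent)."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports resource quantities as strings, e.g. "1".
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


# Hypothetical, trimmed sample of `kubectl get nodes -o json` output.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "system-node"},
     "status": {"allocatable": {"cpu": "2"}}},
    {"metadata": {"name": "gpu-node"},
     "status": {"allocatable": {"cpu": "4", "nvidia.com/gpu": "1"}}}
  ]
}
""")

for name, gpus in gpu_allocatable(sample).items():
    print(f"{name}: {gpus} GPU(s)")
```

If the GPU node reports 0, the device plugin pod from the GPU Operator is probably not ready yet.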

4. Deploy vLLM

kubectl create namespace inference
kubectl apply -f kubernetes/vllm-deployment.yaml
kubectl apply -f kubernetes/vllm-service.yaml

5. Test Inference

ENDPOINT=$(kubectl get svc -n inference vllm-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
curl "http://${ENDPOINT}:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{"model":"mistralai/Mistral-7B-Instruct-v0.3","messages":[{"role":"user","content":"What is GPU infrastructure?"}],"max_tokens":100}'
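
The same request can be sent from Python using only the standard library. This is a sketch, assuming the service listens on port 8000 as in the curl command above and that the response follows the OpenAI-compatible chat-completions shape that vLLM serves. The helper names `build_chat_request` and `ask` are made up for this example.

```python
import json
import urllib.request

MODEL = "mistralai/Mistral-7B-Instruct-v0.3"


def build_chat_request(endpoint: str, prompt: str,
                       max_tokens: int = 100) -> urllib.request.Request:
    """Build the POST request without sending it."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"http://{endpoint}:8000/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(endpoint: str, prompt: str, timeout_s: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(endpoint, prompt)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        payload = json.load(response)
    return payload["choices"][0]["message"]["content"]
```

Call `ask(endpoint, "What is GPU infrastructure?")` with the hostname stored in `ENDPOINT` above. The load balancer hostname can take a few minutes to resolve after the service is created.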

6. Deploy Monitoring (Standalone)

cd standalone
docker-compose up -d
# Grafana: http://localhost:3000 (admin/gpu-project)
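
To see what Prometheus scrapes from DCGM Exporter, the sketch below parses the text exposition format and pulls out a few GPU gauges. The metric names are standard DCGM field names. The sample values are invented for illustration, and the exporter address (commonly port 9400) is an assumption, since the compose file is not reproduced here.

```python
import re

# name{labels} value, where the label block is optional
LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text: str, wanted: set) -> list:
    """Return (metric, gpu index, value) tuples for the wanted metric names."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match and match.group("name") in wanted:
            labels = dict(LABEL.findall(match.group("labels") or ""))
            samples.append((match.group("name"), labels.get("gpu", "?"),
                            float(match.group("value"))))
    return samples


# Invented sample in the shape DCGM Exporter emits.
sample = """
# HELP DCGM_FI_DEV_GPU_UTIL GPU utilization (in %).
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",modelName="NVIDIA A10G"} 37
DCGM_FI_DEV_FB_USED{gpu="0",modelName="NVIDIA A10G"} 20480
DCGM_FI_DEV_GPU_TEMP{gpu="0",modelName="NVIDIA A10G"} 54
"""

for name, gpu, value in parse_metrics(sample, {"DCGM_FI_DEV_GPU_UTIL", "DCGM_FI_DEV_GPU_TEMP"}):
    print(f"gpu {gpu} {name} = {value}")
```

To run this against a live exporter, pass it the text fetched from the exporter's `/metrics` endpoint.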

7. Tear Down

eksctl delete cluster --name gpu-inference-cluster --region us-east-1

Standalone Setup (Without Kubernetes)

For quick testing without EKS overhead:

# Launch vLLM on any GPU instance
docker run -d --gpus all --name vllm -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --max-model-len 4096

# Launch DCGM + Prometheus + Grafana
cd standalone
docker-compose up -d
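
The vLLM container needs time to download and load the model before it answers requests, so a readiness wait is useful before testing. A minimal sketch, assuming the server is reachable on `localhost:8000` and exposes vLLM's `/health` endpoint. The helper names and the timeout values are arbitrary choices for this example.

```python
import time
import urllib.error
import urllib.request


def vllm_is_healthy(base_url: str = "http://localhost:8000") -> bool:
    """Return True when GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or still starting up


def wait_until_ready(probe, timeout_s: float = 600.0, interval_s: float = 5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Usage: `wait_until_ready(vllm_is_healthy)` returns `True` once the server is up, or `False` after ten minutes. The `probe`, `sleep`, and `clock` parameters exist so the loop can be exercised without a running server.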

Cost Breakdown

| Component | Rate | Usage | Cost |
|---|---|---|---|
| EKS Control Plane | $0.10/hr | ~4 hrs | $0.40 |
| t3.medium (system) | $0.04/hr | ~4 hrs | $0.16 |
| g5.xlarge spot (GPU) | $0.44/hr | ~4 hrs | $1.76 |
| NAT Gateway | $0.045/hr | ~4 hrs | $0.18 |
| Standalone EC2 (Day 1-2) | $1.00/hr | ~3 hrs | $3.00 |
| EBS Storage | $0.08/GB/mo | 100GB, 5 days | $1.30 |
| Total | | | ~$7 |
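
The total can be re-derived from the rates and usage in the cost breakdown above. This sketch assumes a 30-day month for prorating EBS, which gives about $1.33 rather than the rounded $1.30 listed; either way the total rounds to roughly $7.

```python
# (component, hourly rate in USD, hours used), taken from the cost breakdown
HOURLY = [
    ("EKS control plane", 0.10, 4),
    ("t3.medium (system)", 0.04, 4),
    ("g5.xlarge spot (GPU)", 0.44, 4),
    ("NAT Gateway", 0.045, 4),
    ("Standalone EC2", 1.00, 3),
]

EBS_RATE_PER_GB_MONTH = 0.08
EBS_GB = 100
EBS_DAYS = 5
DAYS_PER_MONTH = 30  # assumption used for proration


def total_cost() -> float:
    """Sum hourly charges plus prorated EBS storage."""
    hourly = sum(rate * hours for _, rate, hours in HOURLY)
    ebs = EBS_RATE_PER_GB_MONTH * EBS_GB * EBS_DAYS / DAYS_PER_MONTH
    return hourly + ebs


print(f"${total_cost():.2f}")
```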

Project Structure

gpu-inference-eks/
├── README.md
├── kubernetes/
│   ├── cluster.yaml              # EKS cluster config
│   ├── vllm-deployment.yaml      # vLLM K8s deployment
│   ├── vllm-service.yaml         # LoadBalancer service
│   ├── vllm-hpa.yaml             # Horizontal Pod Autoscaler
│   └── cluster-autoscaler.yaml   # Node-level autoscaling
├── standalone/
│   ├── setup.sh                  # One-click GPU instance setup
│   ├── docker-compose.yml        # DCGM + Prometheus + Grafana
│   └── prometheus.yml            # Prometheus scrape config
├── grafana/
│   └── gpu-dashboard.json        # GPU monitoring dashboard
├── .github/workflows/
│   └── deploy.yml                # CI/CD for model deployment
├── screenshots/
│   ├── nvidia-smi.png
│   ├── gpu-grafana-dashboard.png
│   ├── eks-gpu-pods.png
│   ├── vllm-response.png
│   └── dcgm-metrics.png
├── benchmarks/
│   └── results.md
└── cost-breakdown.md

Key Learnings

  1. GPU Operator eliminates manual setup: On standalone EC2, installing the drivers, container toolkit, and DCGM by hand took an hour. On EKS with GPU Operator, it happens automatically on every GPU node.

  2. Taints protect expensive GPU nodes: Without taints, Kubernetes can schedule unrelated system pods onto the $0.44/hr GPU node. Taints ensure that only workloads with a matching toleration (here, the GPU workloads) run there.

  3. Spot interruptions are real: The first attempt was terminated mid-setup. Production GPU workloads need interruption handling and checkpointing.

  4. vLLM is production-ready: bfloat16, Flash Attention v2, CUDA graphs, and chunked prefill are all enabled by default. No tuning was needed for basic serving.

  5. DCGM + Prometheus + Grafana is the standard GPU monitoring stack: It follows the same patterns as CPU monitoring, but with GPU-specific metrics (utilization, VRAM, temperature, power, ECC errors).
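
As a starting point for the interruption handling mentioned in learning 3, the sketch below reads the EC2 Spot interruption notice from the instance metadata service (IMDSv2) and computes how many seconds remain. It only works on an EC2 instance. The function names are hypothetical, and in a cluster this job is usually delegated to a tool such as AWS Node Termination Handler rather than hand-written.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout_s: float = 2.0):
    """Return the raw interruption notice, or None if none is scheduled."""
    token_request = urllib.request.Request(
        f"{IMDS}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    with urllib.request.urlopen(token_request, timeout=timeout_s) as response:
        token = response.read().decode()
    request = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return response.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # 404 means no interruption is pending
        raise


def seconds_until_interruption(instance_action_body: str, now: datetime) -> float:
    """Parse the notice, e.g. {"action": "terminate", "time": "...Z"}."""
    action = json.loads(instance_action_body)
    when = datetime.strptime(action["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()
```

A polling loop would call `fetch_instance_action()` every few seconds and, once a notice appears, stop accepting requests and drain the vLLM pod within the remaining time.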

Certifications

  • NVIDIA NCA-AIIO (AI Infrastructure & Operations)
  • AWS Solutions Architect Associate
  • AWS Machine Learning Specialty
  • AWS Data Engineer Associate

Author

Meet Bhanushali — DevOps Engineer building AI infrastructure skills.

LinkedIn | GitHub
