Deploy AI Models on Kubernetes
Complete guide to deploying and scaling AI models on Kubernetes
Deploy Open-Source AI Models on Kubernetes
Kubernetes provides powerful orchestration for deploying, scaling, and managing AI models in production.
Prerequisites
- Kubernetes cluster (1.24+)
- kubectl configured
- NVIDIA GPU Operator installed (for GPU support)
- Helm 3.x installed
- Basic Kubernetes knowledge
Setup GPU Support
Install NVIDIA GPU Operator
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install GPU Operator
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
Verify GPU Nodes
kubectl get nodes -o json | jq '.items[].status.capacity'
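GPU nodes should advertise an nvidia.com/gpu entry in their capacity. It also helps to confirm the operator's own pods came up; a quick check, using the gpu-operator namespace from the install command above:
# Operator, device plugin, and DCGM exporter pods should all be Running
kubectl get pods -n gpu-operator
# Show only the GPU capacity advertised by each node
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'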
Basic Deployment
Deployment Manifest
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      containers:
      - name: llama
        image: vllm/vllm-openai:latest
        args:
        - --model
        - meta-llama/Llama-3.1-8B-Instruct
        - --tensor-parallel-size
        - "1"
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            nvidia.com/gpu: "1"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
  namespace: ai-models
  labels:
    app: llama
spec:
  selector:
    app: llama
  ports:
  - name: http
    port: 80
    targetPort: 8000
  type: LoadBalancer
Apply Configuration
kubectl create namespace ai-models
kubectl apply -f deployment.yaml
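Once the pods are Running, smoke-test the OpenAI-compatible API that vLLM exposes. A minimal check, port-forwarding the Service locally (the model name must match the --model argument in the deployment):
# Wait for the rollout to complete
kubectl rollout status deployment/llama-deployment -n ai-models
# Forward the Service to localhost and list the served models
kubectl port-forward svc/llama-service 8080:80 -n ai-models &
curl http://localhost:8080/v1/models
# Send a test chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'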
Persistent Storage
PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: ai-models
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
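Note that a ReadWriteOnce volume attaches to a single node, so with replicas: 2 both pods must schedule onto the same node unless your storage class supports ReadWriteMany or each pod gets its own cache. After applying, confirm the claim binds (fast-ssd is a placeholder for whatever class your cluster provides):
kubectl get pvc model-cache-pvc -n ai-models
kubectl get storageclass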
Secrets Management
Create Secret
kubectl create secret generic hf-token --from-literal=token=your-huggingface-token -n ai-models
Auto-Scaling
Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-hpa
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
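CPU and memory utilization are weak proxies for GPU-bound inference load. If vLLM's Prometheus metrics are exposed through an adapter such as prometheus-adapter, the HPA can scale on queue depth instead. A hedged sketch of an extra entry for the metrics list above, assuming the adapter publishes vLLM's num_requests_waiting under the pod-metric name shown here:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting    # assumed name; depends on your adapter's naming rules
      target:
        type: AverageValue
        averageValue: "10"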
Cluster Autoscaler
The Cluster Autoscaler is configured through command-line flags on its own deployment (or through your cloud provider's node group settings) rather than a ConfigMap. Typical flags for a GPU node pool that scales between 1 and 10 nodes (the node group name is illustrative):
--nodes=1:10:gpu-node-group
--scale-down-delay-after-add=10m
--balance-similar-node-groups=true
Ingress Configuration
NGINX Ingress
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-ingress
  namespace: ai-models
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.yourdomain.com
    secretName: llama-tls
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llama-service
            port:
              number: 80
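The cert-manager.io/cluster-issuer annotation assumes a ClusterIssuer named letsencrypt-prod already exists in the cluster. A minimal sketch of such an issuer, assuming cert-manager is installed and using a placeholder email:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    # Let's Encrypt production endpoint
    server: https://acme-v02.api.letsencrypt.org/directory
    email: you@yourdomain.com
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          class: nginx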
Monitoring
Prometheus ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llama-metrics
  namespace: ai-models
spec:
  selector:
    matchLabels:
      app: llama
  endpoints:
  - port: http
    path: /metrics
    interval: 30s
Grafana Dashboard
Import dashboard ID: 15759 for GPU monitoring
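If you run kube-prometheus-stack, you can also alert on GPU saturation using the DCGM exporter metrics shipped by the GPU Operator. A minimal sketch (threshold and labels are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llama-alerts
  namespace: ai-models
spec:
  groups:
  - name: llama
    rules:
    - alert: GPUSaturated
      # DCGM_FI_DEV_GPU_UTIL comes from the dcgm-exporter deployed by the GPU Operator
      expr: avg(DCGM_FI_DEV_GPU_UTIL) > 90
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "Average GPU utilization above 90% for 10 minutes"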
Helm Chart
Chart Structure
llama-chart/
├── Chart.yaml
├── values.yaml
└── templates/
├── deployment.yaml
├── service.yaml
├── ingress.yaml
└── hpa.yaml
values.yaml
replicaCount: 2

image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent

model:
  name: meta-llama/Llama-3.1-8B-Instruct
  tensorParallelSize: 1

resources:
  requests:
    memory: 32Gi
    cpu: 8
    nvidia.com/gpu: 1
  limits:
    memory: 32Gi
    nvidia.com/gpu: 1

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
  - host: api.yourdomain.com
    paths:
    - path: /
      pathType: Prefix
Install Chart
helm install llama-model ./llama-chart -n ai-models
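Upgrades and rollbacks then go through Helm as well; for example, rolling out an updated values file and reverting if needed (the values file name is illustrative):
# Apply updated values
helm upgrade llama-model ./llama-chart -n ai-models -f values-prod.yaml
# Inspect release history and roll back to a previous revision
helm history llama-model -n ai-models
helm rollback llama-model 1 -n ai-models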
CI/CD Integration
GitOps with ArgoCD
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama-model
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/ai-models
    targetRevision: HEAD
    path: kubernetes/llama
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-models
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
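Once the Application exists, Argo CD reconciles the repository path on every commit. Sync status can be checked from the web UI or the argocd CLI, for example:
# Show application health and sync state
argocd app get llama-model
# Trigger an immediate sync instead of waiting for the next poll
argocd app sync llama-model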
Troubleshooting
Pod Not Scheduling
# Check pod status
kubectl describe pod <pod-name> -n ai-models
# Check node resources
kubectl describe nodes
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.allocatable'
OOMKilled Errors
- Increase memory limits
- Reduce batch size
- Enable model quantization (see the vLLM flags sketch after this list)
- Use larger nodes
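For the quantization option, vLLM can serve pre-quantized checkpoints and cap how much GPU memory it claims. A hedged sketch of replacement container args (the AWQ checkpoint name is illustrative and must point at an actually quantized model):
        args:
        - --model
        - hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4   # illustrative quantized checkpoint
        - --quantization
        - awq
        - --gpu-memory-utilization
        - "0.90"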
Slow Startup
- Use init containers to pre-download models
- Implement readiness and startup probes with a longer initial delay (see the probe sketch after this list)
- Use faster storage class
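For the probe suggestion, vLLM serves a /health endpoint on its HTTP port, so the container can be given time to download and load the model before receiving traffic. A minimal sketch to add to the llama container spec (delays are illustrative):
        startupProbe:
          httpGet:
            path: /health
            port: 8000
          failureThreshold: 60
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10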
Production Checklist
- [ ] Set up GPU support
- [ ] Configure persistent storage
- [ ] Implement secrets management
- [ ] Set up auto-scaling (HPA and CA)
- [ ] Configure ingress with TLS
- [ ] Set up monitoring and alerting
- [ ] Implement logging aggregation
- [ ] Configure resource quotas (see the sketch after this checklist)
- [ ] Set up network policies
- [ ] Implement backup strategy
- [ ] Document deployment process
- [ ] Set up CI/CD pipeline
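For the resource quota item, a namespace-level quota keeps a runaway autoscaler from consuming every GPU in the cluster. A minimal sketch (limits are illustrative):
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ai-models-quota
  namespace: ai-models
spec:
  hard:
    requests.cpu: "64"
    requests.memory: 512Gi
    requests.nvidia.com/gpu: "8"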
Related Guides
Deploy AI Models on AWS
Complete guide to deploying open-source AI models on Amazon Web Services
Deploy AI Models on Google Cloud Platform
Complete guide to deploying open-source AI models on GCP
Deploy AI Models on Microsoft Azure
Complete guide to deploying open-source AI models on Azure
Deploy AI Models with Docker
Complete guide to containerizing and deploying AI models with Docker