
Deploy AI Models on Kubernetes

Complete guide to deploying and scaling AI models on Kubernetes


Kubernetes provides powerful orchestration for deploying, scaling, and managing AI models in production.

Prerequisites

  • Kubernetes cluster (1.24+)
  • kubectl configured
  • NVIDIA GPU Operator installed (for GPU support)
  • Helm 3.x installed
  • Basic Kubernetes knowledge

Set Up GPU Support

Install NVIDIA GPU Operator

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
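
Before moving on, confirm the operator components came up cleanly:

# All gpu-operator pods should reach Running or Completed
kubectl get pods -n gpu-operator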

Verify GPU Nodes

# Each GPU node should report an nvidia.com/gpu entry in its capacity
kubectl get nodes -o json | jq '.items[].status.capacity'

Basic Deployment

Deployment Manifest

This example serves meta-llama/Llama-3.1-8B-Instruct, which fits on a single GPU. Larger variants such as Llama-3.1-70B-Instruct will not fit in one GPU's memory; they need --tensor-parallel-size set to the per-pod GPU count, a matching nvidia.com/gpu request, and larger memory requests.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      containers:
      - name: llama
        image: vllm/vllm-openai:latest
        args:
          - --model
          - meta-llama/Llama-3.1-8B-Instruct
          - --tensor-parallel-size
          - "1"
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            nvidia.com/gpu: "1"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
  namespace: ai-models
  labels:
    app: llama
spec:
  selector:
    app: llama
  ports:
  - name: http
    port: 80
    targetPort: 8000
  type: LoadBalancer

Apply Configuration

The deployment references the model-cache-pvc claim and the hf-token secret defined in the next two sections, so create those before applying it.

kubectl create namespace ai-models
kubectl apply -f deployment.yaml
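
Once applied, watch the rollout and smoke-test the endpoint. The /v1/models route is vLLM's OpenAI-compatible model listing; the external IP comes from the LoadBalancer Service above:

# Wait for the pods to become ready
kubectl rollout status deployment/llama-deployment -n ai-models

# Find the external IP of the LoadBalancer
kubectl get svc llama-service -n ai-models

# Smoke test the OpenAI-compatible API
curl http://<EXTERNAL-IP>/v1/models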

Persistent Storage

PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: ai-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd

Note: with replicas: 2, a ReadWriteOnce claim can only be mounted by pods on a single node, so the second replica may fail to schedule elsewhere. Use a ReadWriteMany storage class for a shared model cache, or give each replica its own volume.

Secrets Management

Create Secret

kubectl create secret generic hf-token \
  --from-literal=token=your-huggingface-token \
  -n ai-models

Auto-Scaling

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-hpa
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
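
CPU and memory utilization are weak proxies for load on a GPU inference server; request queue depth tracks saturation more directly. A sketch of an additional metrics: entry, assuming the Prometheus Adapter is installed and configured to expose vLLM's vllm:num_requests_waiting gauge under the name num_requests_waiting:

  # Hypothetical Pods metric via the Prometheus Adapter
  - type: Pods
    pods:
      metric:
        name: num_requests_waiting
      target:
        type: AverageValue
        averageValue: "10"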

Cluster Autoscaler

The Cluster Autoscaler is configured through command-line flags on its own deployment (or through your cloud provider's node-group settings), not a ConfigMap. The relevant flags look like this; the node-group name is an example:

# Flags on the cluster-autoscaler container
- --nodes=1:10:gpu-node-group
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m

Ingress Configuration

NGINX Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-ingress
  namespace: ai-models
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.yourdomain.com
    secretName: llama-tls
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llama-service
            port:
              number: 80
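
The cert-manager annotation above assumes a ClusterIssuer named letsencrypt-prod already exists in the cluster. A typical ACME issuer, with a placeholder email:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@yourdomain.com      # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx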

Monitoring

Prometheus ServiceMonitor

vLLM serves Prometheus metrics at /metrics on its HTTP port. A ServiceMonitor's selector matches Service labels (added to the Service above), and the endpoint references the Service's named http port:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llama-metrics
  namespace: ai-models
spec:
  selector:
    matchLabels:
      app: llama
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Grafana Dashboard

Import dashboard ID: 15759 for GPU monitoring

Helm Chart

Chart Structure

llama-chart/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    └── hpa.yaml

values.yaml

replicaCount: 2

image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent

model:
  name: meta-llama/Llama-3.1-8B-Instruct
  tensorParallelSize: 1

resources:
  requests:
    memory: 32Gi
    cpu: 8
    nvidia.com/gpu: 1
  limits:
    memory: 32Gi
    nvidia.com/gpu: 1

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: api.yourdomain.com
      paths:
        - path: /
          pathType: Prefix
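
To see how these values flow into the rendered manifest, an abbreviated templates/deployment.yaml might look like this (only the value-driven fields shown):

# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
      - name: llama
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        args:
          - --model
          - {{ .Values.model.name }}
          - --tensor-parallel-size
          - {{ .Values.model.tensorParallelSize | quote }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }}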

Install Chart

helm install llama-model ./llama-chart -n ai-models
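
For subsequent releases, helm upgrade --install is idempotent and works well from CI:

helm upgrade --install llama-model ./llama-chart \
  -n ai-models --create-namespace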

CI/CD Integration

GitOps with ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama-model
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/ai-models
    targetRevision: HEAD
    path: kubernetes/llama
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-models
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Troubleshooting

Pod Not Scheduling

# Check pod status
kubectl describe pod <pod-name> -n ai-models

# Check node resources
kubectl describe nodes

# Check GPU availability on each node
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'

OOMKilled Errors

  • Increase memory limits
  • Reduce batch size
  • Enable model quantization (see the sketch after this list)
  • Use larger nodes
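
For vLLM, several serving flags shrink the memory footprint. A sketch, assuming a quantized build of your model is available (the model id here is a placeholder):

args:
  - --model
  - <quantized-model-id>          # e.g. an AWQ build of your model
  - --quantization
  - awq
  - --max-model-len
  - "8192"                        # smaller context window -> smaller KV cache
  - --gpu-memory-utilization
  - "0.90"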

Slow Startup

  • Use init containers to pre-download models (see the sketch after this list)
  • Implement readiness probes with a longer initial delay
  • Use a faster storage class
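
A minimal sketch of the first two ideas, reusing the hf-token secret and model-cache PVC from earlier; the init-container image is an assumption (any image with huggingface_hub installed works):

spec:
  initContainers:
  - name: fetch-model
    image: python:3.11-slim       # assumption: installs huggingface_hub at runtime
    command: ["sh", "-c"]
    args:
      - pip install -q huggingface_hub &&
        huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token
          key: token
    volumeMounts:
    - name: cache
      mountPath: /root/.cache/huggingface
  containers:
  - name: llama
    # ... as in the deployment above, plus:
    readinessProbe:
      httpGet:
        path: /health             # vLLM health endpoint
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 30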

Production Checklist

  • [ ] Set up GPU support
  • [ ] Configure persistent storage
  • [ ] Implement secrets management
  • [ ] Set up auto-scaling (HPA and CA)
  • [ ] Configure ingress with TLS
  • [ ] Set up monitoring and alerting
  • [ ] Implement logging aggregation
  • [ ] Configure resource quotas
  • [ ] Set up network policies
  • [ ] Implement backup strategy
  • [ ] Document deployment process
  • [ ] Set up CI/CD pipeline