
Deploy AI Models on Kubernetes

Complete guide to deploying and scaling AI models on Kubernetes


Kubernetes provides powerful orchestration for deploying, scaling, and managing AI models in production.

Prerequisites

  • Kubernetes cluster (1.24+)
  • kubectl configured
  • NVIDIA GPU Operator installed (for GPU support)
  • Helm 3.x installed
  • Basic Kubernetes knowledge

Set Up GPU Support

Install NVIDIA GPU Operator

# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install --wait --generate-name \
  -n gpu-operator --create-namespace \
  nvidia/gpu-operator
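
Before moving on, confirm the operator components came up cleanly:

# All gpu-operator pods should reach Running or Completed
kubectl get pods -n gpu-operator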

Verify GPU Nodes

# Each GPU node should report an nvidia.com/gpu entry in its capacity
kubectl get nodes -o json | jq '.items[].status.capacity'

Basic Deployment

Deployment Manifest

This example serves meta-llama/Llama-3.1-8B-Instruct, which fits on a single GPU. Larger variants such as Llama-3.1-70B-Instruct will not fit in one GPU's memory; they need --tensor-parallel-size set to the per-pod GPU count, a matching nvidia.com/gpu request, and larger memory requests.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
  namespace: ai-models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      containers:
      - name: llama
        image: vllm/vllm-openai:latest
        args:
          - --model
          - meta-llama/Llama-3.1-8B-Instruct
          - --tensor-parallel-size
          - "1"
        ports:
        - containerPort: 8000
          name: http
        resources:
          requests:
            memory: "32Gi"
            cpu: "8"
            nvidia.com/gpu: "1"
          limits:
            memory: "32Gi"
            nvidia.com/gpu: "1"
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        volumeMounts:
        - name: cache
          mountPath: /root/.cache/huggingface
      volumes:
      - name: cache
        persistentVolumeClaim:
          claimName: model-cache-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: llama-service
  namespace: ai-models
  labels:
    app: llama
spec:
  selector:
    app: llama
  ports:
  - name: http
    port: 80
    targetPort: 8000
  type: LoadBalancer

Apply Configuration

The deployment references the model-cache-pvc claim and the hf-token secret defined in the next two sections, so create those before applying it.

kubectl create namespace ai-models
kubectl apply -f deployment.yaml
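
Once applied, watch the rollout and smoke-test the endpoint. The /v1/models route is vLLM's OpenAI-compatible model listing; the external IP comes from the LoadBalancer Service above:

# Wait for the pods to become ready
kubectl rollout status deployment/llama-deployment -n ai-models

# Find the external IP of the LoadBalancer
kubectl get svc llama-service -n ai-models

# Smoke test the OpenAI-compatible API
curl http://<EXTERNAL-IP>/v1/models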

Persistent Storage

PersistentVolumeClaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: ai-models
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd

Note: with replicas: 2, a ReadWriteOnce claim can only be mounted by pods on a single node, so the second replica may fail to schedule elsewhere. Use a ReadWriteMany storage class for a shared model cache, or give each replica its own volume.

Secrets Management

Create Secret

kubectl create secret generic hf-token \
  --from-literal=token=your-huggingface-token \
  -n ai-models

Auto-Scaling

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama-hpa
  namespace: ai-models
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
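
CPU and memory utilization are weak proxies for load on a GPU inference server; request queue depth tracks saturation more directly. A sketch of an additional metrics: entry, assuming the Prometheus Adapter is installed and configured to expose vLLM's vllm:num_requests_waiting gauge under the name num_requests_waiting:

  # Hypothetical Pods metric via the Prometheus Adapter
  - type: Pods
    pods:
      metric:
        name: num_requests_waiting
      target:
        type: AverageValue
        averageValue: "10"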

Cluster Autoscaler

The Cluster Autoscaler is configured through command-line flags on its own deployment (or through your cloud provider's node-group settings), not a ConfigMap. The relevant flags look like this; the node-group name is an example:

# Flags on the cluster-autoscaler container
- --nodes=1:10:gpu-node-group
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m

Ingress Configuration

NGINX Ingress

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: llama-ingress
  namespace: ai-models
  annotations:
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - api.yourdomain.com
    secretName: llama-tls
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: llama-service
            port:
              number: 80
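
The cert-manager annotation above assumes a ClusterIssuer named letsencrypt-prod already exists in the cluster. A typical ACME issuer, with a placeholder email:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@yourdomain.com      # placeholder
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
    - http01:
        ingress:
          ingressClassName: nginx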

Monitoring

Prometheus ServiceMonitor

vLLM serves Prometheus metrics at /metrics on its HTTP port. A ServiceMonitor's selector matches Service labels (added to the Service above), and the endpoint references the Service's named http port:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llama-metrics
  namespace: ai-models
spec:
  selector:
    matchLabels:
      app: llama
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Grafana Dashboard

Import dashboard ID: 15759 for GPU monitoring

Helm Chart

Chart Structure

llama-chart/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    ├── ingress.yaml
    └── hpa.yaml

values.yaml

replicaCount: 2

image:
  repository: vllm/vllm-openai
  tag: latest
  pullPolicy: IfNotPresent

model:
  name: meta-llama/Llama-3.1-8B-Instruct
  tensorParallelSize: 1

resources:
  requests:
    memory: 32Gi
    cpu: 8
    nvidia.com/gpu: 1
  limits:
    memory: 32Gi
    nvidia.com/gpu: 1

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: api.yourdomain.com
      paths:
        - path: /
          pathType: Prefix
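
To see how these values flow into the rendered manifest, an abbreviated templates/deployment.yaml might look like this (only the value-driven fields shown):

# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
      - name: llama
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        args:
          - --model
          - {{ .Values.model.name }}
          - --tensor-parallel-size
          - {{ .Values.model.tensorParallelSize | quote }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }}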

Install Chart

helm install llama-model ./llama-chart -n ai-models
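
For subsequent releases, helm upgrade --install is idempotent and works well from CI:

helm upgrade --install llama-model ./llama-chart \
  -n ai-models --create-namespace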

CI/CD Integration

GitOps with ArgoCD

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llama-model
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/your-org/ai-models
    targetRevision: HEAD
    path: kubernetes/llama
  destination:
    server: https://kubernetes.default.svc
    namespace: ai-models
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Troubleshooting

Pod Not Scheduling

# Check pod status
kubectl describe pod <pod-name> -n ai-models

# Check node resources
kubectl describe nodes

# Check GPU availability on each node
kubectl get nodes -o json | jq '.items[].status.allocatable["nvidia.com/gpu"]'

OOMKilled Errors

  • Increase memory limits
  • Reduce batch size
  • Enable model quantization (see the sketch after this list)
  • Use larger nodes
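
For vLLM, several serving flags shrink the memory footprint. A sketch, assuming a quantized build of your model is available (the model id here is a placeholder):

args:
  - --model
  - <quantized-model-id>          # e.g. an AWQ build of your model
  - --quantization
  - awq
  - --max-model-len
  - "8192"                        # smaller context window -> smaller KV cache
  - --gpu-memory-utilization
  - "0.90"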

Slow Startup

  • Use init containers to pre-download models (see the sketch after this list)
  • Implement readiness probes with a longer initial delay
  • Use a faster storage class
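
A minimal sketch of the first two ideas, reusing the hf-token secret and model-cache PVC from earlier; the init-container image is an assumption (any image with huggingface_hub installed works):

spec:
  initContainers:
  - name: fetch-model
    image: python:3.11-slim       # assumption: installs huggingface_hub at runtime
    command: ["sh", "-c"]
    args:
      - pip install -q huggingface_hub &&
        huggingface-cli download meta-llama/Llama-3.1-8B-Instruct
    env:
    - name: HF_TOKEN
      valueFrom:
        secretKeyRef:
          name: hf-token
          key: token
    volumeMounts:
    - name: cache
      mountPath: /root/.cache/huggingface
  containers:
  - name: llama
    # ... as in the deployment above, plus:
    readinessProbe:
      httpGet:
        path: /health             # vLLM health endpoint
        port: 8000
      initialDelaySeconds: 60
      periodSeconds: 10
      failureThreshold: 30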

Production Checklist

  • [ ] Set up GPU support
  • [ ] Configure persistent storage
  • [ ] Implement secrets management
  • [ ] Set up auto-scaling (HPA and CA)
  • [ ] Configure ingress with TLS
  • [ ] Set up monitoring and alerting
  • [ ] Implement logging aggregation
  • [ ] Configure resource quotas
  • [ ] Set up network policies
  • [ ] Implement backup strategy
  • [ ] Document deployment process
  • [ ] Set up CI/CD pipeline