Deploy AI Models on Google Cloud Platform

Complete guide to deploying open-source AI models on GCP

Google Cloud Platform offers powerful infrastructure and AI-specific services for deploying machine learning models at scale.

Prerequisites

  • GCP Account with billing enabled
  • gcloud CLI installed
  • Project created in GCP Console
  • Basic understanding of Kubernetes (for GKE)

Deployment Options

1. Compute Engine Deployment

Best for: Full control over infrastructure

Step 1: Create GPU Instance

gcloud compute instances create ai-model-instance \
  --zone=us-central1-a \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --image-family=ubuntu-2004-lts \
  --image-project=ubuntu-os-cloud \
  --boot-disk-size=100GB \
  --maintenance-policy=TERMINATE

Step 2: Install NVIDIA Drivers

# SSH into instance
gcloud compute ssh ai-model-instance --zone=us-central1-a

# Install drivers
curl https://raw.githubusercontent.com/GoogleCloudPlatform/compute-gpu-installation/main/linux/install_gpu_driver.py --output install_gpu_driver.py
sudo python3 install_gpu_driver.py
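
Once the installer finishes, it's worth a quick sanity check that the driver sees the GPU before serving anything. The commands below are a minimal sketch: the model name is illustrative, and vllm serve assumes a recent vLLM release with pip and Python 3 already on the instance.

# Confirm the driver sees the GPU
nvidia-smi

# Serve a model with vLLM (model name is an example)
pip install vllm
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000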

2. Google Kubernetes Engine (GKE)

Best for: Scalable, production workloads

Step 1: Create GKE Cluster

gcloud container clusters create ai-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --num-nodes=3 \
  --enable-autoscaling \
  --min-nodes=1 \
  --max-nodes=10
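
Once the cluster is up, fetch credentials so kubectl targets it (names match the command above):

gcloud container clusters get-credentials ai-cluster --zone=us-central1-a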

Step 2: Add GPU Node Pool

gcloud container node-pools create gpu-pool \
  --cluster=ai-cluster \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=1 \
  --enable-autoscaling \
  --min-nodes=0 \
  --max-nodes=5
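
On GKE Standard, NVIDIA drivers are not necessarily installed on GPU nodes automatically; one common approach is Google's driver-installer DaemonSet (this variant assumes Container-Optimized OS nodes):

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml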

Step 3: Deploy Model

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llama
  template:
    metadata:
      labels:
        app: llama
    spec:
      containers:
      - name: llama
        image: vllm/vllm-openai:latest
        # The vLLM image needs a model to serve; this name is illustrative
        args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
        resources:
          limits:
            nvidia.com/gpu: 1
        ports:
        - containerPort: 8000
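
Save the manifest (the filename llama-deployment.yaml below is arbitrary), apply it, and expose it. A LoadBalancer Service is one simple option; --target-port=8000 matches the containerPort above:

kubectl apply -f llama-deployment.yaml
kubectl expose deployment llama-deployment --type=LoadBalancer --port=80 --target-port=8000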

3. Vertex AI Deployment

Best for: Managed ML platform

from google.cloud import aiplatform

aiplatform.init(project='your-project', location='us-central1')

model = aiplatform.Model.upload(
    display_name='llama-model',
    artifact_uri='gs://your-bucket/model',
    # Vertex AI pulls serving images from Artifact Registry / Container
    # Registry, so mirror the public vLLM image into your own repo first
    serving_container_image_uri='us-central1-docker.pkg.dev/your-project/serving/vllm-openai:latest'
)

endpoint = model.deploy(
    machine_type='n1-standard-4',
    accelerator_type='NVIDIA_TESLA_T4',
    accelerator_count=1
)
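
After deployment finishes you can exercise the endpoint from the CLI. ENDPOINT_ID is the numeric ID from the deploy step (or from gcloud ai endpoints list), and request.json is a file holding your instances payload:

gcloud ai endpoints predict ENDPOINT_ID \
  --region=us-central1 \
  --json-request=request.json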

Cost Optimization

Preemptible VMs

Preemptible VMs can cut compute costs by up to 80%, but GCP may reclaim them at any time (Spot VMs are their newer successor):

gcloud compute instances create preemptible-instance \
  --preemptible \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1
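
Because preemptible instances can be reclaimed with roughly 30 seconds' notice, it helps to detect preemption from inside the VM, for example by polling the metadata server:

# Returns TRUE once the instance has been preempted
curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/preempted"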

Committed Use Discounts

  • 1-year commitment: up to ~25% discount
  • 3-year commitment: up to ~52% discount

Exact rates vary by machine type, resource, and region; check the current GCP pricing page.
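
Commitments can also be purchased from the CLI; a minimal sketch (the name and resource amounts are illustrative):

gcloud compute commitments create ai-commitment \
  --region=us-central1 \
  --plan=12-month \
  --resources=vcpu=8,memory=30GB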

Auto-scaling

gcloud compute instance-groups managed set-autoscaling ai-group \
  --max-num-replicas=10 \
  --min-num-replicas=1 \
  --target-cpu-utilization=0.6

Monitoring

Cloud Monitoring

import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"  # project_id set elsewhere

# Build a custom metric series for inference latency
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/inference_latency"
series.resource.type = "global"

# Attach one sample (latency in ms) and write it out
series.points = [monitoring_v3.Point({
    "interval": {"end_time": {"seconds": int(time.time())}},
    "value": {"double_value": 123.0}})]
client.create_time_series(name=project_name, time_series=[series])

Security

  • Use VPC Service Controls
  • Enable Binary Authorization
  • Implement Workload Identity
  • Use Secret Manager for credentials (see the example after this list)
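
As a sketch of the last point, a minimal Secret Manager workflow from the CLI (the secret name and SA_EMAIL are placeholders):

# Store a credential
echo -n "my-api-key" | gcloud secrets create model-api-key --data-file=-

# Let the serving service account read it
gcloud secrets add-iam-policy-binding model-api-key \
  --member="serviceAccount:SA_EMAIL" \
  --role="roles/secretmanager.secretAccessor"

# Fetch it at runtime
gcloud secrets versions access latest --secret=model-api-key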

Troubleshooting

GPU Quota Issues

Request a quota increase in the GCP Console under IAM & Admin > Quotas.
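
Before filing the request, you can check the current GPU quota for a region (the grep is just a convenience over the YAML output):

gcloud compute regions describe us-central1 | grep -B 1 -A 1 NVIDIA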

Pod Scheduling Failures

Check that the node pool has GPU capacity available and that the pod's tolerations match the GPU nodes' taints.
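
kubectl can usually pinpoint the cause (POD_NAME is a placeholder):

# Why is the pod stuck in Pending?
kubectl describe pod POD_NAME

# Do any nodes actually advertise GPU capacity?
kubectl describe nodes | grep -A 4 "nvidia.com/gpu"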

Production Checklist

  • [ ] Set up Cloud Monitoring
  • [ ] Configure Cloud Logging
  • [ ] Enable auto-scaling
  • [ ] Implement load balancing
  • [ ] Set up backup strategy
  • [ ] Configure VPC and firewall rules
  • [ ] Enable encryption
  • [ ] Set up CI/CD with Cloud Build