Deploy AI Models on AWS

Complete guide to deploying open-source AI models on Amazon Web Services

Amazon Web Services (AWS) provides robust infrastructure for deploying and scaling AI models. This comprehensive guide covers everything from basic EC2 deployments to advanced auto-scaling configurations.

Prerequisites

  • AWS Account with appropriate permissions
  • AWS CLI installed and configured
  • Basic understanding of cloud computing
  • SSH key pair for EC2 access
  • Docker knowledge (recommended)

Deployment Options

1. EC2 Instance Deployment

Best for: Development, testing, and small-scale production

Step 1: Choose the Right Instance Type

For AI model deployment, GPU instances are essential. The on-demand prices below are approximate us-east-1 rates and vary by region; a rough VRAM sizing rule of thumb follows the list:

  • g5.xlarge: Entry-level, 1x NVIDIA A10G GPU, 4 vCPUs, 16GB RAM (~$1.00/hr)
  • g5.2xlarge: Mid-tier, 1x NVIDIA A10G GPU, 8 vCPUs, 32GB RAM (~$1.21/hr)
  • p3.2xlarge: High-performance, 1x NVIDIA V100 GPU, 8 vCPUs, 61GB RAM (~$3.06/hr)
  • p4d.24xlarge: Enterprise, 8x NVIDIA A100 GPUs, 96 vCPUs, 1152GB RAM (~$32.77/hr)
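
As a rough rule of thumb (an estimate, not an AWS figure), FP16 weights take about 2 bytes per parameter plus roughly 20% overhead for the KV cache and activations: an 8B-parameter model needs about 8 × 2 × 1.2 ≈ 19 GB and fits the single 24 GB A10G in a g5 instance, while a 70B model needs around 168 GB and calls for multiple GPUs (e.g. p4d), aggressive quantization, or both.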

Step 2: Launch EC2 Instance

# Launch instance with the AWS CLI (use a current Ubuntu 22.04 AMI ID for your region)
aws ec2 run-instances \
  --image-id ami-0c55b159cbfafe1f0 \
  --instance-type g5.xlarge \
  --key-name your-key-pair \
  --security-group-ids sg-xxxxxxxxx \
  --subnet-id subnet-xxxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":100}}]'
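
Once the instance is running, look up its public IP with the CLI (the instance ID below is a placeholder for the one returned by run-instances):

# Wait for the instance to start, then fetch its public IP
aws ec2 wait instance-running --instance-ids i-xxxxxxxxxxxxxxxxx
aws ec2 describe-instances \
  --instance-ids i-xxxxxxxxxxxxxxxxx \
  --query 'Reservations[0].Instances[0].PublicIpAddress' \
  --output text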

Step 3: Install Dependencies

# SSH into instance
ssh -i your-key.pem ubuntu@your-instance-ip

# Update system
sudo apt update && sudo apt upgrade -y

# Install NVIDIA drivers
sudo apt install -y nvidia-driver-535

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-container-toolkit

# Configure the Docker runtime and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
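
Before deploying a model, confirm that containers can see the GPU; any CUDA base image works, the tag below is just one example:

# Verify GPU access from inside a container
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi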

Step 4: Deploy Model with Docker

# Pull and run a Llama model with vLLM (Llama weights are gated, so pass a Hugging Face token)
# An 8B model fits a single A10G; 70B models need multiple GPUs and a larger --tensor-parallel-size
docker run --gpus all -p 8000:8000 --ipc=host \
  -e HUGGING_FACE_HUB_TOKEN=your-hf-token \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1
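
The server exposes an OpenAI-compatible API on port 8000; a quick smoke test with curl (prompt and parameters are arbitrary):

# Query the OpenAI-compatible chat completions endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'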

2. SageMaker Deployment

Best for: Production workloads with managed infrastructure

Step 1: Prepare Model

import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Execution role with SageMaker permissions (get_execution_role() works inside
# SageMaker notebooks/Studio; pass a role ARN explicitly when running elsewhere)
role = sagemaker.get_execution_role()

# Create HuggingFace Model
huggingface_model = HuggingFaceModel(
    model_data="s3://your-bucket/model.tar.gz",
    role=role,
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

# Deploy model
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge"
)
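
After deployment, send a test request through the returned predictor; the payload format depends on how the model archive was packaged, and the example below assumes a standard Hugging Face text-generation handler.

# Invoke the endpoint
response = predictor.predict({"inputs": "Explain Amazon SageMaker in one sentence."})
print(response)

# Delete the endpoint when finished to stop incurring charges
predictor.delete_endpoint()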

3. ECS Deployment (EC2 Launch Type)

Best for: Containerized workloads with auto-scaling

Note: Fargate does not support GPUs, so GPU tasks must use the EC2 launch type with GPU-enabled container instances (for example, g5 instances running the ECS GPU-optimized AMI) registered to the cluster.

Step 1: Create ECS Cluster

aws ecs create-cluster --cluster-name ai-models-cluster

Step 2: Create Task Definition

{
  "family": "llama-model",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "4096",
  "memory": "16384",
  "containerDefinitions": [{
    "name": "llama-container",
    "image": "vllm/vllm-openai:latest",
    "portMappings": [{
      "containerPort": 8000,
      "protocol": "tcp"
    }],
    "resourceRequirements": [{
      "type": "GPU",
      "value": "1"
    }]
  }]
}
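
With the task definition saved to a file (task-def.json is an assumed name), register it and start a service on the cluster:

# Register the task definition and run it as a service on the EC2-backed cluster
aws ecs register-task-definition --cli-input-json file://task-def.json

aws ecs create-service \
  --cluster ai-models-cluster \
  --service-name llama-service \
  --task-definition llama-model \
  --desired-count 1 \
  --launch-type EC2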

Cost Optimization

1. Use Spot Instances

Save up to 90% on compute costs:

aws ec2 request-spot-instances \
  --spot-price "0.50" \
  --instance-count 1 \
  --type "one-time" \
  --launch-specification file://specification.json
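
A minimal specification.json could look like the following; every ID is a placeholder to replace with your own values:

{
  "ImageId": "ami-0c55b159cbfafe1f0",
  "InstanceType": "g5.xlarge",
  "KeyName": "your-key-pair",
  "SecurityGroupIds": ["sg-xxxxxxxxx"],
  "SubnetId": "subnet-xxxxxxxxx"
}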

2. Auto-Scaling

Configure auto-scaling based on demand:

aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name ai-model-asg \
  --launch-template LaunchTemplateName=ai-model-template \
  --min-size 1 \
  --max-size 10 \
  --desired-capacity 2 \
  --vpc-zone-identifier subnet-xxxxxxxxx
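
Creating the group alone does not scale anything; attach a scaling policy. The target-tracking example below scales on average CPU utilization, and the 70% target is an arbitrary starting point to tune:

# Keep average CPU utilization of the group near 70%
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name ai-model-asg \
  --policy-name cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {"PredefinedMetricType": "ASGAverageCPUUtilization"},
    "TargetValue": 70.0
  }'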

3. Reserved Instances

For predictable workloads, save up to 72%:

  • 1-year commitment: ~40% savings
  • 3-year commitment: ~60-72% savings

Security Best Practices

1. IAM Roles and Policies

{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "s3:GetObject",
      "s3:PutObject"
    ],
    "Resource": "arn:aws:s3:::your-model-bucket/*"
  }]
}
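
Attach a policy like this to the role that your instances or tasks assume rather than to individual users; for example, as an inline policy on an EC2 instance role (the role, policy, and file names below are placeholders):

# Attach the S3 policy to the instance role
aws iam put-role-policy \
  --role-name ai-model-instance-role \
  --policy-name model-bucket-access \
  --policy-document file://model-bucket-policy.json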

2. VPC Configuration

  • Deploy in private subnets
  • Use NAT Gateway for outbound traffic
  • Configure Security Groups to allow only necessary ports
  • Enable VPC Flow Logs for monitoring

3. Encryption

  • Enable EBS encryption for volumes
  • Use AWS KMS for key management
  • Enable encryption in transit with TLS/SSL

Monitoring and Logging

CloudWatch Metrics

# Create custom metric
aws cloudwatch put-metric-data \
  --namespace "AIModels" \
  --metric-name "InferenceLatency" \
  --value 150 \
  --unit Milliseconds
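
You can then alarm on the same metric; the example below fires when average latency stays above 500 ms for three consecutive minutes (the threshold and SNS topic ARN are placeholders):

# Alarm when average inference latency exceeds 500 ms for 3 consecutive periods
aws cloudwatch put-metric-alarm \
  --alarm-name ai-model-high-latency \
  --namespace "AIModels" \
  --metric-name "InferenceLatency" \
  --statistic Average \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 500 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:alerts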

CloudWatch Logs

# Create log group
aws logs create-log-group --log-group-name /aws/ai-models

# Stream logs
aws logs create-log-stream \
  --log-group-name /aws/ai-models \
  --log-stream-name model-inference

Troubleshooting

GPU Not Detected

# Check NVIDIA driver
nvidia-smi

# Reinstall if needed (quote the pattern so the shell does not expand it)
sudo apt purge -y 'nvidia-*'
sudo apt install -y nvidia-driver-535
sudo reboot

Out of Memory Errors

  • Reduce batch size
  • Enable model quantization (see the vLLM sketch below)
  • Use tensor parallelism across multiple GPUs
  • Upgrade to a larger instance type
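
With vLLM specifically, the first three mitigations map directly to server flags. A minimal sketch, assuming a pre-quantized AWQ build of the model is available (the model name below is illustrative) and a multi-GPU instance such as g5.12xlarge:

# Serve a quantized model across 4 GPUs with a capped context length and batch size
docker run --gpus all -p 8000:8000 --ipc=host \
  vllm/vllm-openai:latest \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 64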

High Latency

  • Enable caching
  • Use load balancer
  • Implement request batching
  • Consider model quantization

Production Checklist

  • [ ] Set up auto-scaling
  • [ ] Configure load balancer
  • [ ] Enable CloudWatch monitoring
  • [ ] Set up CloudWatch alarms
  • [ ] Implement backup strategy
  • [ ] Configure VPC and security groups
  • [ ] Enable encryption at rest and in transit
  • [ ] Set up CI/CD pipeline
  • [ ] Document deployment process
  • [ ] Create disaster recovery plan

Next Steps

  • Explore AWS Inferentia for cost-effective inference
  • Implement A/B testing with multiple model versions
  • Set up multi-region deployment for high availability
  • Integrate with AWS Lambda for serverless inference