
Deploy AI Models On-Premise

Complete guide to deploying AI models on your own infrastructure

Deploy AI models on your own hardware for maximum control, security, and cost efficiency at scale.

Prerequisites

  • Physical servers or VMs
  • NVIDIA GPUs (recommended)
  • Linux OS (Ubuntu 22.04 LTS recommended)
  • Network infrastructure
  • Storage system (NAS/SAN)

Hardware Requirements

Minimum Setup

  • CPU: 8+ cores
  • RAM: 32GB+
  • GPU: NVIDIA RTX 3090 or better
  • Storage: 500GB SSD
  • Network: 1Gbps

Production Setup

  • CPU: 32+ cores (AMD EPYC or Intel Xeon)
  • RAM: 128GB+ ECC
  • GPU: 4x NVIDIA A100 or H100
  • Storage: 2TB+ NVMe SSD
  • Network: 10Gbps+ with redundancy
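The GPU count above is driven by memory: VRAM must hold the model weights plus KV-cache and activation overhead. A rough sizing sketch (the 70B parameter count and 20% headroom factor are illustrative assumptions):

```shell
# Rough VRAM sizing: weights = parameters x bytes per parameter, plus ~20% headroom.
# The KV cache grows with context length and concurrency, so real deployments need more.
PARAMS_B=70          # billions of parameters (e.g. a 70B model)
BYTES_PER_PARAM=2    # fp16/bf16 weights
WEIGHTS_GB=$(( PARAMS_B * BYTES_PER_PARAM ))    # weights alone
TOTAL_GB=$(( WEIGHTS_GB + WEIGHTS_GB / 5 ))     # with ~20% headroom
echo "Weights: ${WEIGHTS_GB} GB; plan for at least ${TOTAL_GB} GB of total VRAM"
```

At roughly 168 GB for a 70B fp16 model, the weights cannot fit on any single GPU, which is why the production setup lists 4x A100/H100 and the deployment below shards the model with tensor parallelism.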

Infrastructure Setup

1. Operating System

# Update system
sudo apt update && sudo apt upgrade -y

# Install essential tools
sudo apt install -y build-essential git curl wget vim

2. NVIDIA Drivers

# Install drivers
sudo apt install -y nvidia-driver-535

# Verify installation
nvidia-smi

# Install CUDA Toolkit
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run
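The runfile installer places the toolkit under /usr/local/cuda by default, but `nvcc` will not be on your PATH until you add it. A minimal sketch, assuming the default install prefix (adjust if you chose another location):

```shell
# Make the CUDA toolkit visible to your shell and the dynamic linker
# (default runfile install prefix; adjust for a custom location)
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

Afterwards, `nvcc --version` should report the installed toolkit release.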

3. Docker Setup

# Install Docker
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh

# Install NVIDIA Container Toolkit
# (the old nvidia-docker repository and apt-key are deprecated; use the keyring-based repo)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt update && sudo apt install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart it
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

4. Kubernetes Setup (Optional)

# Install k3s (lightweight Kubernetes)
curl -sfL https://get.k3s.io | sh -

# Verify
sudo k3s kubectl get nodes

Model Deployment

Docker Compose Setup

version: '3.8'

services:
  llama-model:
    image: vllm/vllm-openai:latest
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 4
      --max-model-len 4096
    ports:
      - "8000:8000"
    volumes:
      - /data/models:/root/.cache/huggingface
      - /data/logs:/app/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 4
              capabilities: [gpu]
    environment:
      - HUGGING_FACE_HUB_TOKEN=your-token
      - CUDA_VISIBLE_DEVICES=0,1,2,3
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"

  nginx:
    image: nginx:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/nginx/ssl
    depends_on:
      - llama-model
    restart: unless-stopped

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # change this default before exposing Grafana
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
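Once the stack is up, a quick way to verify the model endpoint is a chat-completion request against vLLM's OpenAI-compatible API (the model name must match the `--model` flag; port 8000 is the vLLM default):

```shell
# Minimal smoke test of the vLLM OpenAI-compatible endpoint
PAYLOAD='{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}'
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "endpoint not reachable yet"
```

The first request can take a while as the server finishes loading weights; for liveness probes, vLLM also exposes a lightweight `GET /health` endpoint on the same port.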

Load Balancing

NGINX Configuration

upstream ai_backend {
    least_conn;
    server 192.168.1.10:8000 max_fails=3 fail_timeout=30s;
    server 192.168.1.11:8000 max_fails=3 fail_timeout=30s;
    server 192.168.1.12:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    # For TLS, also add: listen 443 ssl; with ssl_certificate/ssl_certificate_key
    # pointing at the certs mounted under /etc/nginx/ssl (see the SSL/TLS section)
    server_name api.yourdomain.com;
    
    location / {
        proxy_pass http://ai_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
}

Storage Configuration

NFS Setup

# On NFS server
sudo apt install -y nfs-kernel-server
sudo mkdir -p /data/models
sudo chown nobody:nogroup /data/models
echo "/data/models *(rw,sync,no_subtree_check)" | sudo tee -a /etc/exports
sudo exportfs -a
sudo systemctl restart nfs-kernel-server

# On clients
sudo apt install -y nfs-common
sudo mkdir -p /mnt/models
sudo mount 192.168.1.100:/data/models /mnt/models
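The mount above does not survive a reboot. To make it persistent, add an entry like the following to /etc/fstab on each client (the `_netdev` option delays mounting until the network is up):

```
# /etc/fstab
192.168.1.100:/data/models  /mnt/models  nfs  defaults,_netdev  0  0
```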

Monitoring

Prometheus Configuration

global:
  scrape_interval: 15s

scrape_configs:
  # vLLM exposes Prometheus metrics at /metrics on its serving port
  - job_name: 'ai-models'
    static_configs:
      - targets: ['localhost:8000']

  # Requires node_exporter on :9100 (e.g. sudo apt install -y prometheus-node-exporter)
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

  # Scraped from the DCGM exporter container started in the next section
  - job_name: 'nvidia-gpu'
    static_configs:
      - targets: ['localhost:9445']

GPU Monitoring

# Install NVIDIA DCGM Exporter
# (note: newer nvcr.io/nvidia/k8s/dcgm-exporter images default to port 9400)
docker run -d --gpus all \
  -p 9445:9445 \
  nvidia/dcgm-exporter:latest

Security

Firewall Configuration

# UFW setup
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp
sudo ufw allow 80/tcp
sudo ufw allow 443/tcp
sudo ufw enable

SSL/TLS

# Generate self-signed certificate
sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout /etc/nginx/ssl/private.key \
  -out /etc/nginx/ssl/certificate.crt
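For unattended setups, the same command can run non-interactively with `-subj`, and the result can be inspected to confirm generation succeeded (the CN is a placeholder; production traffic should use a CA-issued certificate, e.g. from Let's Encrypt):

```shell
# Non-interactive self-signed certificate plus a quick sanity check
mkdir -p ssl
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
  -keyout ssl/private.key -out ssl/certificate.crt \
  -subj "/CN=api.yourdomain.com"

# Print the subject and expiry to verify the certificate
openssl x509 -in ssl/certificate.crt -noout -subject -enddate
```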

Backup Strategy

#!/bin/bash
# backup.sh
set -euo pipefail

BACKUP_DIR="/backup/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"

# Backup models
rsync -av /data/models/ "$BACKUP_DIR/models/"

# Backup configs
tar -czf "$BACKUP_DIR/configs.tar.gz" /etc/nginx /etc/docker

# Backup database (if applicable)
docker exec postgres pg_dump -U user database > "$BACKUP_DIR/database.sql"

# Cleanup old backups (keep 30 days; -mindepth 1 protects /backup itself)
find /backup -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

High Availability

Keepalived Setup

# Install keepalived
sudo apt install -y keepalived

# Configure virtual IP
cat << EOF | sudo tee /etc/keepalived/keepalived.conf
vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100
    }
}
EOF

sudo systemctl enable keepalived
sudo systemctl start keepalived
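Failover requires at least one standby node running the same VRRP instance; its configuration mirrors the master's except for `state` and a lower `priority`:

```
# /etc/keepalived/keepalived.conf on the standby node
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass secret
    }
    virtual_ipaddress {
        192.168.1.100
    }
}
```

If the master fails, the standby wins the VRRP election and takes over the virtual IP, so clients keep using the same address.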

Disaster Recovery

Recovery Plan

  1. Data Backup: Daily automated backups to offsite location
  2. Configuration Management: Version control for all configs
  3. Documentation: Detailed runbooks for common scenarios
  4. Testing: Quarterly DR drills
  5. Monitoring: 24/7 alerting system

Cost Analysis

Initial Investment

  • Hardware: $50,000 - $200,000
  • Networking: $5,000 - $20,000
  • Storage: $10,000 - $50,000
  • Setup: $10,000 - $30,000

Ongoing Costs

  • Power: $500 - $2,000/month
  • Cooling: $200 - $800/month
  • Maintenance: $1,000 - $5,000/month
  • Staff: $10,000 - $50,000/month

Break-even Analysis

On-premise becomes cost-effective at:

  • 100+ GPU hours/day
  • 3+ year timeline
  • High security requirements
  • Predictable workloads
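The 100+ GPU-hours/day figure can be sanity-checked with simple arithmetic (all dollar amounts below are illustrative assumptions, not quotes):

```shell
# Compare amortized on-prem cost against on-demand cloud GPU pricing
CAPEX=120000            # hardware + networking + setup ($)
MONTHS=36               # 3-year amortization period
OPEX_MONTHLY=6000       # power, cooling, maintenance ($/month, staff excluded)
CLOUD_CENTS_PER_HR=300  # assumed ~$3.00/GPU-hour on-demand

MONTHLY_COST=$(( CAPEX / MONTHS + OPEX_MONTHLY ))
BREAKEVEN_HRS=$(( MONTHLY_COST * 100 / CLOUD_CENTS_PER_HR ))
echo "On-prem: \$${MONTHLY_COST}/month -> break-even at ${BREAKEVEN_HRS} GPU-hours/month (~$(( BREAKEVEN_HRS / 30 ))/day)"
```

Under these assumptions break-even lands near 100 GPU-hours/day; including staff costs pushes it higher, which is why long timelines and steady utilization matter so much.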

Production Checklist

  • [ ] Hardware procurement and setup
  • [ ] Network infrastructure configured
  • [ ] Storage system deployed
  • [ ] Monitoring and alerting active
  • [ ] Backup system operational
  • [ ] Security measures implemented
  • [ ] Load balancing configured
  • [ ] High availability setup
  • [ ] Disaster recovery plan tested
  • [ ] Documentation completed
  • [ ] Staff trained
  • [ ] Maintenance schedule established