Lutz Roeder, Dec 27, 2016

TensorFlow on Azure

Training neural networks (deep learning) is compute-intensive. Fast GPUs can make training sessions, which sometimes take hours or days, run orders of magnitude faster. However, laptops usually don't come with the fastest GPUs, and maintaining a desktop machine only to occasionally run deep learning tasks is extra hassle.

Cloud providers now offer virtual machines (VMs) with GPUs that run in data centers and can be rented by anybody on an hourly basis. Below is a quick tutorial that walks through setting up a VM in Microsoft Azure with the necessary drivers to train neural networks using TensorFlow.

First, if you haven't done so already, create an Azure account, install the Azure CLI, and log in by running az login.

Azure manages resources (virtual machines, storage, etc.) via resource groups. GPU virtual machine instances are available in the East US region. If you already have a group for that region, feel free to use it; otherwise, create a new resource group:

az group create -n ai -l EastUS
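Before creating the VM, it can be worth confirming that the NC-series sizes are actually offered in the chosen region. The following is a minimal sketch, assuming the Azure CLI is installed and logged in; list_gpu_sizes is a hypothetical helper name, not part of the CLI, and a listed size may still be restricted by subscription quota.

```shell
# Sketch: list the NC-series (GPU) VM sizes offered in a region.
# list_gpu_sizes is a hypothetical helper name, not an Azure CLI command.
list_gpu_sizes() {
  local region="${1:-eastus}"
  # The JMESPath filter keeps only sizes whose name starts with Standard_NC.
  az vm list-sizes -l "$region" \
    --query "[?starts_with(name, 'Standard_NC')].name" -o tsv
}
```

Running list_gpu_sizes eastus prints one size name per line.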

We will connect to the machine via SSH and need to create a key pair:

ssh-keygen -f ~/.ssh/az_ai_id_rsa -t rsa -b 2048 -C '' -N ''

Next, we create the actual virtual machine running Ubuntu with the cheapest and least powerful GPU size (NC6):

az vm create -g ai -n ai --image Canonical:UbuntuServer:18.04-LTS:latest --size Standard_NC6 --admin-username ai --ssh-key-value ~/.ssh/az_ai_id_rsa.pub

Once completed, the command will print the IP address for the newly created machine:

{
  "publicIpAddress": "127.0.0.1",
  "resourceGroup": "ai"
}
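If the JSON output is saved or piped, the address can be extracted rather than copied by hand. This is a rough sketch, assuming python3 is available on the local machine; extract_public_ip is a hypothetical helper name introduced here for illustration.

```shell
# Sketch: read the JSON printed by `az vm create` on stdin and print
# the value of its publicIpAddress field.
# extract_public_ip is a hypothetical helper; it relies on a local python3.
extract_public_ip() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["publicIpAddress"])'
}
```

For example, if the output was saved to a file, extract_public_ip < vm.json prints just the address (vm.json being whatever file name you chose).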

The VM is now running in a data center (and charging for cycles). The following commands can be used to deallocate and restart it at any time:

az vm deallocate -g ai -n ai
az vm start -g ai -n ai
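To check which state the VM is currently in, the power state can be queried. A minimal sketch, assuming the resource group and VM names used above; vm_power_state is a hypothetical helper name.

```shell
# Sketch: print the current power state of a VM.
# vm_power_state is a hypothetical helper; -d (--show-details) makes
# `az vm show` include the powerState field in its output.
vm_power_state() {
  local group="$1" name="$2"
  az vm show -d -g "$group" -n "$name" --query powerState -o tsv
}
```

Calling vm_power_state ai ai typically prints something like "VM running" or "VM deallocated". Note that a VM which was only shut down from inside the operating system is stopped but not deallocated, and still incurs compute charges.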

Connect to the machine via SSH (type 'yes' if asked to continue):

ssh ai@$(az vm show -d -g ai -n ai --query "publicIps" -o tsv) -i ~/.ssh/az_ai_id_rsa

Install CUDA 10

Next, download CUDA, make it known to apt-get and run install:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo apt-get update
sudo apt-get install -y cuda=10.0.130-1
rm cuda-repo-ubuntu1804_*_amd64.deb

Now we can check the status of the GPU(s) by running nvidia-smi.
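For a more compact check than the full nvidia-smi table, the query flags can be used. This is a sketch under the assumption that the driver installed above is loaded (a reboot may be needed first); gpu_summary is a hypothetical helper name.

```shell
# Sketch: print one CSV line per GPU with its name, driver version,
# and total memory. gpu_summary is a hypothetical helper name.
gpu_summary() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found; the driver may not be installed yet" >&2
    return 1
  fi
  nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
}
```

On the NC6 size this should report a single GPU.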

Install cuDNN 7.6.5

Next, download and install cuDNN...

wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo dpkg -i nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
sudo apt-get update
sudo apt-get install -y libcudnn7=7.6.5.32-1+cuda10.0
sudo apt-get install -y libcudnn7-dev=7.6.5.32-1+cuda10.0
rm nvidia-machine-learning-repo-ubuntu1804_*_amd64.deb
sudo ldconfig
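After ldconfig has run, we can confirm that the dynamic linker knows about the cuDNN library. A minimal sketch; has_shared_lib is a hypothetical helper name, and libcudnn.so.7 is the library name we would expect from the libcudnn7 package.

```shell
# Sketch: succeed if the linker cache lists a library whose entry
# contains the given name. has_shared_lib is a hypothetical helper.
has_shared_lib() {
  ldconfig -p | grep -q -- "$1"
}
```

For example, has_shared_lib libcudnn.so.7 && echo "cuDNN is registered" prints the message only when the library is found in the cache.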

Environment variables

...and add the following export to ~/.bashrc so that the CUDA libraries can be found at run time:

export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:${LD_LIBRARY_PATH}
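When scripting this step, it helps to avoid appending the same line to ~/.bashrc every time the script is re-run. The following is a small sketch of one way to do that; append_once is a hypothetical helper name.

```shell
# Sketch: append a line to a file only if that exact line is not
# already present. append_once is a hypothetical helper name.
append_once() {
  local line="$1" file="$2"
  touch "$file"
  # -x matches whole lines, -F treats the pattern as a fixed string,
  # so characters such as $ and { are not interpreted as a regex.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}
```

It would be called as append_once "$EXPORT_LINE" ~/.bashrc, where EXPORT_LINE is a placeholder variable holding the export line shown above; the change takes effect in new shells or after running source ~/.bashrc.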

Install TensorFlow

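The final step is to install pip and the GPU version of TensorFlow. The tf.Session API used below only exists in TensorFlow 1.x, and the 1.13 to 1.15 releases were built against CUDA 10.0, so we pin a matching 1.x release:

sudo apt-get install -y python3-dev python3-pip
sudo pip3 install tensorflow-gpu==1.14.0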

We can now start a Python console and create a TensorFlow session:

python3
>>> import tensorflow as tf
>>> session = tf.Session()

If everything went well, it will recognize the Tesla K80 GPU:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10804 MB memory)
-> physical GPU (device: 0, name: Tesla K80, pci bus id: 8a8f:00:00.0, compute capability: 3.7)
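The same check can be run non-interactively, which is handy in a setup script. This is a sketch, assuming TensorFlow is installed for python3 on the VM; list_tf_gpus is a hypothetical helper name, and device_lib is an internal TensorFlow module rather than a documented public API.

```shell
# Sketch: print the name of every GPU device TensorFlow can see.
# list_tf_gpus is a hypothetical helper; it feeds a short Python
# script to python3 on stdin.
list_tf_gpus() {
  python3 - <<'EOF'
from tensorflow.python.client import device_lib

# Keep only the devices TensorFlow reports as GPUs.
for device in device_lib.list_local_devices():
    if device.device_type == "GPU":
        print(device.name)
EOF
}
```

If list_tf_gpus prints nothing, TensorFlow fell back to the CPU, and the CUDA and cuDNN steps above are the first places to look.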

Remember to deallocate the VM when done to avoid being charged for compute time:

az vm deallocate -g ai -n ai

Once no longer needed, you can delete the virtual machine by running:

az vm delete -g ai -n ai
az group delete -n ai