Christopher M Luciano | Published March 22, 2018
Distributed deep machine learning requires careful setup and maintenance of a variety of tools. The Kubernetes Resource Management Working Group was incubated during the 2016 Kubernetes Developer Summit in Seattle, WA, with the goal of running complex high-performance computing (HPC) workloads on top of Kubernetes. To that end, the group set out to support hardware-accelerated devices, including graphics processing units (GPUs) and specialized network interface cards.
In a typical HPC environment, researchers and other data scientists would need to set up these vendor devices themselves and troubleshoot when they failed. With the Kubernetes Device Plugin API, Kubernetes operators can deploy plugins that automatically enable specialized hardware support. The newly discovered devices are then offered up as normal Kubernetes consumable resources like memory or CPUs.
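As a rough illustration of what consuming such a resource can look like, the sketch below writes a Pod manifest that requests one GPU under the nvidia.com/gpu resource name used later in this tutorial. The pod name, container name, image, and output file name are hypothetical placeholders and are not part of the tutorial's static files.

```shell
#!/bin/sh
# Sketch: write a Pod manifest that requests one GPU through the
# extended resource name advertised by the GPU device plugin.
# The pod name, image, and file name are hypothetical placeholders.
write_gpu_pod_manifest() {
  out_file="${1:-gpu-pod.yaml}"
  cat > "$out_file" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: example.invalid/cuda-workload:latest
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
}

write_gpu_pod_manifest gpu-pod.yaml
# Submit it to a cluster with: kubectl create -f gpu-pod.yaml
```

The GPU is requested under resources.limits in the same way a memory or CPU limit would be, which is the point of exposing devices as ordinary consumable resources.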
Use this tutorial as a reference for setting up GPU-enabled machines in an IBM Cloud environment. Learn how to use kubeadm to quickly bootstrap a Kubernetes master/node cluster and use a Kubernetes GPU device-plugin to install GPU drivers.
Copy the ./static directory to your master node:
scp -r ./static root@MYMACHINEIP:/root/.
Ensure that curl is installed on the master node:
sudo apt install curl -y
The script also sets the Docker storage driver to overlay2.
After the installation process is complete, run kubeadm_master.sh (https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_master.sh). The script sets up a fully functioning Kubernetes master node and configures CNI networking with Calico.
Look for the bootstrap command listed in the final output from the kubeadm_master.sh script and copy the IP, TOKEN, and SHA256 values.
Edit kubeadm_node.sh (https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_node.sh) and add the values for IP, TOKEN, and SHA256.
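The contents of kubeadm_node.sh are not shown here, but the three values correspond to the arguments of a standard kubeadm join invocation. The sketch below shows how they fit together. The function name is hypothetical, port 6443 is kubeadm's default API server port, and the sketch assumes SHA256 is the bare hash without the sha256: prefix.

```shell
#!/bin/sh
# Sketch: assemble a kubeadm join command from the three values printed
# by the master bootstrap. build_join_command is a hypothetical helper;
# 6443 is kubeadm's default API server port.
build_join_command() {
  ip="$1"
  token="$2"
  sha256="$3"
  if [ -z "$ip" ] || [ -z "$token" ] || [ -z "$sha256" ]; then
    echo "usage: build_join_command IP TOKEN SHA256" >&2
    return 1
  fi
  # Assumes SHA256 is the bare hash, so the sha256: prefix is added here.
  printf 'kubeadm join %s:6443 --token %s --discovery-token-ca-cert-hash sha256:%s\n' \
    "$ip" "$token" "$sha256"
}

# Example with placeholder values; substitute the ones you copied.
build_join_command "MYMASTERMACHINEIP" "TOKEN" "SHA256"
```

Printing the command rather than running it makes it easy to compare against the bootstrap command in the master script's output before anything is changed on the node.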
Copy the ./static directory to your node machine, as you did for the master node.
After the installation process is complete, run kubeadm_node.sh (https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_node.sh).
If you would like to run kubectl locally, copy the Kubernetes config file on the master machine to your local machine:
scp root@MYMASTERMACHINEIP:/root/.kube/config .
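On your local machine, make sure kubectl uses the copied config file (for example, by setting the KUBECONFIG environment variable to its path), then confirm that all machines are listed as Ready: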
kubectl get nodes
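If you would rather script this check than read the table by eye, something like the following sketch could work. It assumes the default column layout of kubectl get nodes --no-headers, where the second column is STATUS. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read `kubectl get nodes --no-headers` output on stdin, print the
# name of every node whose STATUS column is not exactly "Ready", and
# return non-zero if any such node was found.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1; bad = 1 } END { exit bad }'
}

# Usage against a live cluster:
#   kubectl get nodes --no-headers | not_ready_nodes
```

Note that a cordoned node reports a status such as Ready,SchedulingDisabled, which this sketch deliberately treats as not ready.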
Determine the version of CUDA that you want to run, and edit gpu-installer.yaml (https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/gpu-installer.yaml) accordingly.
Install the NVIDIA drivers on your GPU node with your edited copy of gpu-installer.yaml:
kubectl create -f ./static/gpu-installer.yaml
Confirm that the driver installation succeeded:
kubectl -n kube-system logs --follow nvidia-driver-installer
If the installation failed, log in to your GPU node and view the installer log.
Deploy the GPU device plugin with gpu-deviceplugin.yaml (https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/gpu-deviceplugin.yaml):
kubectl create -f ./static/gpu-deviceplugin.yaml
Confirm that the GPUs are detected on the machine as nvidia.com/gpu (kgpu2 is the name of the GPU node in this example):
kubectl describe node kgpu2 | grep Capacity -A 9
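The fixed nine-line window passed to grep depends on how many resources the node happens to report. As an alternative sketch, the hypothetical helper below reads kubectl describe node output on stdin and prints only the nvidia.com/gpu count from the Capacity section. It assumes the usual describe layout, in which section headers start in the first column and their entries are indented.

```shell
#!/bin/sh
# Sketch: print the nvidia.com/gpu value from the Capacity section of
# `kubectl describe node` output read on stdin. Prints nothing if the
# resource is not listed under Capacity.
gpu_capacity() {
  awk '
    /^Capacity:/ { in_cap = 1; next }   # entering the Capacity section
    /^[^ \t]/    { in_cap = 0 }         # any other top-level header ends it
    in_cap && $1 == "nvidia.com/gpu:" { print $2 }
  '
}

# Usage against a live cluster (kgpu2 is the node name used above):
#   kubectl describe node kgpu2 | gpu_capacity
```

An empty result suggests the device plugin has not registered any GPUs on that node yet.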
Launch a test container to confirm that the GPUs can be allocated and used by a CUDA project:
kubectl create -f ./static/cudademo.yaml
This tutorial demonstrated the setup and configuration steps to yield a Kubernetes cluster with GPU scheduling support. You learned how a cluster operator can consume vendor devices through device plugins without expert setup knowledge. These steps should be reproducible across various bare-metal environments, including IBM Cloud.