Distributed deep learning requires careful setup and maintenance of a variety of tools. The Kubernetes Resource Management Working Group was incubated during the 2016 Kubernetes Developer Summit in Seattle, WA, with the goal of running complex high-performance computing (HPC) workloads on top of Kubernetes, including support for hardware-accelerated devices such as graphics processing units (GPUs) and specialized network interface cards.
In a typical HPC environment, researchers and other data scientists would need to set up these vendor devices themselves and troubleshoot when they failed. With the Kubernetes Device Plugin API, Kubernetes operators can deploy plugins that automatically enable specialized hardware support. The newly discovered devices are then offered up as normal Kubernetes consumable resources like memory or CPUs.
Use this tutorial as a reference for setting up GPU-enabled machines in an IBM Cloud environment. Learn how to use kubeadm to quickly bootstrap a Kubernetes master/node cluster and use a Kubernetes GPU device-plugin to install GPU drivers.
- Set up two Ubuntu 16.04 LTS machines that have a flat network between them and the ability to reach out to public internet endpoints:
- One machine serves as your Kubernetes master and only requires modest system resources. Our demo machine was an IBM Cloud VM with 4 VCPUs and 32 GB of RAM. (16 GB should be sufficient.)
- One machine serves as the Kubernetes node and must have at least one GPU card. Our demo machine was an IBM Cloud bare metal machine with Dual Intel Xeon E5-2620 v3 (12 cores, 2.40 GHz), 64 GB RAM, and an NVIDIA Tesla K80 Graphics Card.
- Set up the developer machine:
- Prerequisites probably take a few hours because bare metal machines take longer to provision than VMs.
- Assuming the prerequisites are met, 1-2 hours is a conservative estimate for the machine setup.
Edit [kubeadm_master.sh](https://s3.us.cloud-object-storage.appdomain.cloud/developer/default/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_master.sh) and change the TOKEN line if you want to use a different token value. The TOKEN is used to bootstrap your cluster and connect to it from your node.
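The TOKEN must follow the kubeadm bootstrap-token format. A minimal sketch of the line to edit (the value below is a placeholder, not a real token, and the variable name assumes the script stores it this way):

```shell
# Placeholder bootstrap token; kubeadm tokens have the form [a-z0-9]{6}.[a-z0-9]{16}.
# If kubeadm is already installed somewhere, `kubeadm token generate` produces a valid one.
TOKEN="abcdef.0123456789abcdef"
```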
Copy the ./static directory to your master node:
scp -r ./static root@MYMACHINEIP:/root/.
Ensure that curl is installed on the master node:
sudo apt install curl -y
- This script adds the Kubernetes apt repository and installs all necessary prerequisites for kubeadm.
The script also sets the Docker storage driver to overlay2.
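Setting the storage driver is typically done through Docker's daemon.json file; a minimal sketch of the configuration the script would apply (the exact mechanism inside the script is an assumption):

```json
{
  "storage-driver": "overlay2"
}
```

On Ubuntu this file lives at /etc/docker/daemon.json, and Docker must be restarted for the change to take effect.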
After the installation process is complete, run [kubeadm_master.sh](https://s3.us.cloud-object-storage.appdomain.cloud/developer/default/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_master.sh). The script sets up a fully functioning Kubernetes master node and configures CNI networking with Calico:
Look for the bootstrap command listed in the final output from the kubeadm_master.sh script and copy the IP, TOKEN, and SHA256 values.
Edit [kubeadm_node.sh](https://s3.us.cloud-object-storage.appdomain.cloud/developer/default/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_node.sh) and add the values for IP, TOKEN, and SHA256.
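The three values plug into a kubeadm join invocation; a sketch with placeholder values (yours come from the kubeadm_master.sh output, and the assumption here is that kubeadm_node.sh ultimately runs a join command of this shape):

```shell
# Placeholder values; substitute the ones printed by kubeadm_master.sh.
IP="10.0.0.5:6443"                 # master API server address and port
TOKEN="abcdef.0123456789abcdef"    # bootstrap token
SHA256="sha256:0000000000000000000000000000000000000000000000000000000000000000"

# The node joins the cluster with a command of this form:
JOIN_CMD="kubeadm join ${IP} --token ${TOKEN} --discovery-token-ca-cert-hash ${SHA256}"
echo "${JOIN_CMD}"
```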
Copy the ./static directory to your node machine:
scp -r ./static root@MYMACHINEIP:/root/.
After the installation process is complete, run [kubeadm_node.sh](https://s3.us.cloud-object-storage.appdomain.cloud/developer/default/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_node.sh):
If you would like to run kubectl locally, copy the Kubernetes config file on the master machine to your local machine:
scp root@MYMASTERMACHINEIP:/root/.kube/config .
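kubectl can then be pointed at the copied file through the KUBECONFIG environment variable (assuming the file landed in your current directory):

```shell
# Tell kubectl to use the copied cluster config instead of ~/.kube/config.
export KUBECONFIG="$PWD/config"
# Subsequent kubectl commands (e.g., `kubectl get nodes`) now target the remote cluster.
```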
On your local machine, ensure that all machines are listed as Ready:
kubectl get nodes
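The output should resemble the following (the master node name, ages, and versions are illustrative, not captured from the demo cluster):

```
NAME     STATUS   ROLES    AGE   VERSION
master   Ready    master   10m   v1.10.0
kgpu2    Ready    <none>   5m    v1.10.0
```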
Determine the version of CUDA that you want to run, and edit the [gpu-installer.yaml](https://s3.us.cloud-object-storage.appdomain.cloud/developer/default/tutorials/k8s-kubeadm-gpu-setup/static/gpu-installer.yaml) file accordingly.
- Note the open issue with 4.4.0-116 kernels and certain NVIDIA drivers.
- This demo assumes that you have a kernel version suitable for the drivers listed. You can attempt to downgrade to 4.4.0-112 if you want.
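Before installing, you can check which kernel the GPU node is running (run this on the node itself):

```shell
# Print the running kernel release, e.g. 4.4.0-112-generic on Ubuntu 16.04.
uname -r
```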
Install the NVIDIA drivers on your machine with [gpu-installer.yaml](https://s3.us.cloud-object-storage.appdomain.cloud/developer/default/tutorials/k8s-kubeadm-gpu-setup/static/gpu-installer.yaml):
kubectl create -f ./static/gpu-installer.yaml
Confirm that the driver installation succeeded:
kubectl -n kube-system logs --follow nvidia-driver-installer
If the installation failed, log in to your GPU node and view the installer log.
Deploy the gpu-deviceplugin:
kubectl create -f ./static/gpu-deviceplugin.yaml
Confirm that the GPUs are detected on the machine:
kubectl describe node kgpu2 | grep Capacity -A 9
Capacity:
  cpu:             32
  memory:          65526436Ki
  nvidia.com/gpu:  2
  pods:            110
Allocatable:
  cpu:             32
  memory:          65424036Ki
  nvidia.com/gpu:  2
  pods:            110
Launch a test container to confirm that the GPUs can be allocated and used by a CUDA project:
kubectl create -f ./static/cudademo.yaml
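The contents of cudademo.yaml are not shown here, but any pod that consumes a GPU requests it through the nvidia.com/gpu extended resource exposed by the device plugin. A minimal hypothetical pod spec (names, image tag, and command are assumptions, not the tutorial's actual manifest):

```yaml
# Hypothetical pod that requests one GPU via the device-plugin resource.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-demo               # illustrative name
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:9.0-base # image tag is an assumption; match your CUDA version
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1       # schedules the pod onto a node with a free GPU
```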
This tutorial demonstrated the setup and configuration steps to yield a Kubernetes cluster with GPU scheduling support. You learned how a cluster operator can consume vendor devices through device plugins without expert setup knowledge. These steps should be reproducible across various bare metal environments, including IBM Cloud.