Distributed deep machine learning requires careful setup and maintenance of a variety of tools. The Kubernetes Resource Management Working Group was incubated during the 2016 Kubernetes Developer Summit in Seattle, WA, with the goal of running complex high-performance computing (HPC) workloads on top of Kubernetes. The group’s goal was to support hardware-accelerated devices including graphics processing units (GPUs) and specialized network interface cards.

In a typical HPC environment, researchers and other data scientists would need to set up these vendor devices themselves and troubleshoot when they failed. With the Kubernetes Device Plugin API, Kubernetes operators can deploy plugins that automatically enable specialized hardware support. The newly discovered devices are then offered up as normal Kubernetes consumable resources like memory or CPUs.
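As a rough illustration of what "consumable resource" means in practice, the sketch below writes a pod manifest that asks for one GPU through the nvidia.com/gpu resource name used later in this tutorial. The pod name, container name, image, and output file name are placeholders chosen for illustration; they are not files shipped with this tutorial.

```shell
#!/bin/sh
# Sketch: write a pod manifest that requests one GPU as an extended resource.
# Assumptions (not from the tutorial): the pod name "gpu-request-demo",
# the placeholder image "example.com/cuda-sample:latest", and the file name.
MANIFEST="${MANIFEST:-gpu-request-demo.yaml}"

cat > "$MANIFEST" <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-request-demo
spec:
  restartPolicy: Never
  containers:
    - name: cuda-sample
      image: example.com/cuda-sample:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1  # pod is only scheduled onto a node with a free GPU
EOF

echo "Wrote $MANIFEST"
```

Once a device plugin is running on the GPU node, a manifest of this shape could be submitted with kubectl create -f, in the same way as the YAML files in the steps below.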

Learning objectives

Use this tutorial as a reference for setting up GPU-enabled machines in an IBM Cloud environment. Learn how to use kubeadm to quickly bootstrap a Kubernetes master/node cluster and use a Kubernetes GPU device-plugin to install GPU drivers.

Prerequisites

  • Set up two Ubuntu 16.04 LTS machines that have a flat network between them and the ability to reach out to public internet endpoints:
    • One machine serves as your Kubernetes master and only requires modest system resources. Our demo machine was an IBM Cloud VM with 4 vCPUs and 32 GB of RAM. (16 GB should be sufficient.)
    • One machine serves as the Kubernetes node and must have at least one GPU card. Our demo machine was an IBM Cloud bare metal machine with Dual Intel Xeon E5-2620 v3 (12 cores, 2.40 GHz), 64 GB RAM, and an NVIDIA Tesla K80 Graphics Card.
  • Set up the developer machine:

Estimated time

  • Prerequisites probably take a few hours because bare metal machines take longer to provision than VMs.
  • Assuming the prerequisites are met, 1-2 hours is a conservative estimate for the machine setup.

Steps

  1. Edit https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_master.sh and change the TOKEN line if you want to change the hash. The TOKEN is used to bootstrap your cluster and connect to it from your node.
  2. Copy the ./static directory to your master node:

     scp -r ./static root@MYMACHINEIP:/root/.
    
  3. Ensure that curl is installed on the master node:

     sudo apt install curl -y
    
  4. Run https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/install_kube.sh:

    • This script adds the Kubernetes apt repository and installs all necessary prerequisites for kubeadm.
    • The script also sets the Docker storage driver to overlay2.

      sudo ./static/install_kube.sh
      
  5. After the installation process is complete, run https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_master.sh. The script sets up a fully-functioning Kubernetes master node and sets up CNI networking with Calico:

     sudo ./static/kubeadm_master.sh
    
  6. Look for the bootstrap command listed in the final output from the kubeadm_master.sh script and copy the IP, TOKEN, and SHA256 values.

  7. Edit https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_node.sh and add the values for IP, TOKEN, and SHA256.

  8. Copy the ./static directory to your node machine:

     scp -r ./static root@MYMACHINEIP:/root/.
    
  9. Run https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/install_kube.sh:

     sudo ./static/install_kube.sh
    
  10. After the installation process is complete, run https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/kubeadm_node.sh:

     sudo ./static/kubeadm_node.sh
    
  11. If you would like to run kubectl locally, copy the Kubernetes config file on the master machine to your local machine:

     scp root@MYMASTERMACHINEIP:/root/.kube/config .
    
  12. On your local machine, ensure that all machines are listed as Ready:

     kubectl get nodes
    
  13. Determine the version of CUDA that you want to run, and edit the https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/gpu-installer.yaml file accordingly.

    • Note that there is an open issue with 4.4.0-116 kernels and certain NVIDIA drivers.
    • This demo assumes that you have a kernel version suitable for the drivers listed. You can attempt to downgrade to 4.4.0-112 if you want.
  14. Install the nvidia-drivers on your machine with https://s3.us.cloud-object-storage.appdomain.cloud/developer/tutorials/k8s-kubeadm-gpu-setup/static/gpu-installer.yaml:

     kubectl create -f ./static/gpu-installer.yaml
    
  15. Confirm that the driver installation succeeded:

     kubectl -n kube-system logs --follow nvidia-driver-installer
    
  16. If the installation failed, log in to your GPU node and view the installer log:

     less /home/kubernetes/bin/nvidia/nvidia-installer.log
    
  17. Deploy the gpu-deviceplugin:

     kubectl create -f ./static/gpu-deviceplugin.yaml
    
  18. Confirm that the GPUs are detected on the node as nvidia.com/gpu (kgpu2 is the name of our demo GPU node; substitute your own):

     kubectl describe node kgpu2 | grep Capacity -A 9
    
     Capacity:
      cpu:             32
      memory:          65526436Ki
      nvidia.com/gpu:  2
      pods:            110
     Allocatable:
      cpu:             32
      memory:          65424036Ki
      nvidia.com/gpu:  2
      pods:            110
    
  19. Launch a test container to confirm that the GPUs can be allocated and used by a CUDA project:

     kubectl create -f ./static/cudademo.yaml
    
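The readiness check in step 12 can also be scripted. The following is a minimal sketch, assuming the usual column layout of kubectl get nodes --no-headers (NAME, STATUS, ROLES, AGE, VERSION); the function name, the kmaster node name, and the version strings in the sample rows are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the names of nodes whose STATUS column is not exactly "Ready".
# Reads rows in `kubectl get nodes --no-headers` layout on stdin.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# Sample rows (hypothetical names and versions) standing in for live output.
sample='kmaster   Ready      master   10m   v1.9.3
kgpu2     NotReady   <none>   2m    v1.9.3'

# Prints only the node that is not yet Ready.
printf '%s\n' "$sample" | not_ready_nodes

# Against a live cluster (requires kubectl and a kubeconfig):
#   kubectl get nodes --no-headers | not_ready_nodes
```

An empty result means every listed node reports Ready, which is the state to reach before installing the GPU drivers.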

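The GPU check in step 18 can be reduced to a single number in a similar way. This sketch assumes the kubectl describe node layout shown in that step, where section headers such as Capacity: and Allocatable: end with a colon and carry no value; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: read `kubectl describe node` output on stdin and print the
# nvidia.com/gpu value from the Allocatable section (prints 0 if absent).
allocatable_gpus() {
  awk '
    # A line ending in ":" starts a new section; track whether it is Allocatable.
    /:[ \t]*$/ { in_alloc = ($1 == "Allocatable:"); next }
    # Inside Allocatable, remember the GPU count.
    in_alloc && $1 == "nvidia.com/gpu:" { count = $2 }
    # Adding 0 turns an unset count into 0.
    END { print count + 0 }
  '
}

# Against a live cluster (requires kubectl and a kubeconfig):
#   kubectl describe node kgpu2 | allocatable_gpus
```

A result of 0 after the device plugin has been deployed suggests that the driver installation or the plugin itself needs another look.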
Summary

This tutorial demonstrated the setup and configuration steps that yield a Kubernetes cluster with GPU scheduling support. You learned how a cluster operator can consume vendor devices through device plugins without expert setup knowledge. These steps should be reproducible across various bare metal environments, including IBM Cloud.