
Enabling GPUs in OpenShift 3.11

In this tutorial, we’ll walk you through enabling GPUs in Red Hat® OpenShift. Because OpenShift is Kubernetes-based, it includes features that ease GPU integration. One of these features is the device plug-in, which makes specialized devices such as GPUs available to workloads.

This tutorial describes the process of setting up NVIDIA’s device plug-in. Before you begin, you’ll need to install OpenShift onto your cluster. Refer to https://docs.openshift.com/container-platform/3.11/install/index.html for prerequisites and instructions.

GPU configuration

After OpenShift is installed, we can configure the device plug-in. This process consists of a series of steps that should be completed in the order listed, and carefully verified before moving on to the next step.

Also note that there is a troubleshooting section at the end of this tutorial. If you run into issues along the way, skip ahead to see whether your problem is a common one.

Perform the following steps on each GPU node:

  1. Ensure the latest kernel is installed on your system:
    yum update kernel
  2. Clean up CUDA libraries from any prior installations:
    yum remove -y "cuda-*" "nvidia*" "libnvidia*" dkms.noarch epel-release nvidia-kmod* nvidia-container-runtime-hook
  3. Install CUDA packages:

    wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
    rpm -ivh cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
    rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    yum install -y dkms cuda-drivers
    

    Reboot your host.

  4. Start and verify the NVIDIA persistence daemon:

    systemctl enable nvidia-persistenced
    systemctl start nvidia-persistenced
    systemctl status nvidia-persistenced
    

    This should show that the daemon has started.

  5. Install and verify the nvidia-container-runtime-hook:

    • Install and configure:

      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
      

      Configure the repo’s GPG key by following the instructions at https://nvidia.github.io/nvidia-container-runtime/ (a sketch of one way to do this appears after this list).

      yum install -y nvidia-container-runtime-hook

      Note: At the time of this writing, there was a known issue with the GPG key, so it may be necessary to disable gpgcheck for this repo in the nvidia-container-runtime.repo file (also covered in the sketch after this list).

      sudo chcon -t container_file_t  /dev/nvidia*
      sudo pkill -SIGHUP dockerd
      
    • Verify:

      nvidia-container-cli list
      nvidia-container-cli -k -d /dev/tty list
      

      You should be able to see a list like this:

      # nvidia-container-cli list
      /dev/nvidiactl
      /dev/nvidia-uvm
      /dev/nvidia-uvm-tools
      /dev/nvidia-modeset
      /dev/nvidia0
      /dev/nvidia1
      ...
      
  6. Verify that GPU configuration is displayed by nvidia-smi:
    nvidia-smi
    This should provide you with a table of information about the GPUs on your system.

  7. Verify that the device log file exists:
    ls /var/lib/docker/volumes/metadata.db
    If this file is missing, your CUDA installation did not complete properly. Uninstall and reinstall CUDA by repeating steps 2 and 3.

  8. Enable a GPU-specific SELinux policy:

    wget https://github.com/clnperez/dgx-selinux/releases/download/ppc64le/nvidia-container-ppc64le.pp
    semodule -i nvidia-container-ppc64le.pp
    nvidia-container-cli -k list | restorecon -v -f -
    restorecon -Rv /dev
    restorecon -Rv /var/lib/kubelet
    
  9. Correct SELinux labels (CUDA 10 and later):
    Due to a change of location for library files, SELinux labels will not be set correctly for use inside containers. After you run the restorecon command from the previous step, nvidia-container-cli -k list | restorecon -v -f -, you will need to re-label the CUDA library files.

    Do so by executing the following command: chcon -t textrel_shlib_t /usr/lib64/libcuda.so*

    You can find detailed background information about the reason for this workaround in the following Bugzilla discussion: https://bugzilla.redhat.com/show_bug.cgi?id=1740643

  10. Verify that containers run correctly under the SELinux policy. The policy file that was used to create the module installed in step 8 can be found here: https://github.com/zvonkok/origin-ci-gpu/blob/34a7609523f733c490fb09eb42acc30d9c397912/selinux/nvidia-container.te

    Verify that the following runs successfully:

    docker run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le sleep 100

  11. Edit the udev rules (on IBM® POWER9™ only):
    Remove the “Memory hotadd request” section by referring to the instructions here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_setupRHEL.html (a rough sketch of this edit appears after this list).
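
The following is a minimal sketch of the repository key handling mentioned in step 5. The gpgkey URL is an assumption taken from the repo file that the curl command writes; confirm it in /etc/yum.repos.d/nvidia-container-runtime.repo, and only disable gpgcheck if you actually hit the known key issue.

    # Sketch only: import the repo's GPG key (confirm the URL against the
    # gpgkey entry in /etc/yum.repos.d/nvidia-container-runtime.repo).
    sudo rpm --import https://nvidia.github.io/nvidia-container-runtime/gpgkey

    # If the known key issue blocks installation, disable GPG checking for
    # this repo only, and re-enable it once the issue is resolved.
    sudo sed -i 's/gpgcheck=1/gpgcheck=0/g' /etc/yum.repos.d/nvidia-container-runtime.repo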
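
For the udev edit in step 11, follow the linked IBM instructions for the authoritative procedure. As a rough sketch (the file name is an assumption based on RHEL 7 defaults), the change usually looks like this:

    # Copy the stock rules file so the edit survives package updates, then
    # delete or comment out the "Memory hotadd request" block in the copy.
    cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
    vi /etc/udev/rules.d/40-redhat.rules    # remove the "Memory hotadd request" section

    # Reload the udev rules (or reboot) so the change takes effect.
    udevadm control --reload-rules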

Your system should now be configured to use GPUs, both directly on the host and inside Kubernetes pods. Next, we’ll perform the steps that are OpenShift specific. Again, make sure you follow these steps in the order shown and pay close attention to the verification; in some cases, completing a step doesn’t necessarily mean that everything is working.
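
Before moving on, a quick sanity check can confirm that containers on the host can reach the GPUs. A minimal sketch, reusing the nvidia/cuda-ppc64le image and the SELinux type from step 10:

    # The runtime hook should inject nvidia-smi and the device files into the
    # container; the output should match nvidia-smi on the host.
    docker run --rm --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le nvidia-smi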

Configure an OpenShift project using GPUs

Perform the following steps from an OpenShift master node:

  1. Login to the admin account:
    oc login -u system:admin
  2. To enable scheduling of the device plug-in on worker nodes with GPUs, label each GPU node as follows, making sure the node names match the output of oc get nodes:
    oc label node <node> openshift.com/gpu-accelerator=true
    oc label node <node> nvidia.com/gpu=true
  3. Create an nvidia project:
    oc new-project nvidia
  4. Create an OpenShift service account:
    oc create serviceaccount nvidia-deviceplugin
  5. Create and verify the nvidia-device-plugin DaemonSet:

    • Create a clone using the following command:
      git clone https://github.com/redhat-performance/openshift-psap
    • Edit openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml, replacing ‘nvidia/k8s-device-plugin:1.11’ with ‘nvidia/k8s-device-plugin-ppc64le:1.11’ as the image: value (see the sketch after this list for one way to make this edit).
    • Create the DaemonSet:
      oc create -f openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml
    • Verify the DaemonSet:

      oc get -n kube-system daemonset.apps/nvidia-device-plugin-daemonset
      
      NAME                             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      nvidia-device-plugin-daemonset   1         1         1         1            1                     4d
      

      Because a DaemonSet is a resource that will result in a pod on each node, verify that they are all running:

      oc get pods -n kube-system -o wide
      
      [root@dlw06 ~]# oc get pods -n kube-system -o wide | grep nvidia
      nvidia-device-plugin-daemonset-8lgqp           1/1       Running   3          4d        10.128.0.140   example_sys.com
      

      Inspect the logs to ensure that the plug-in is running inside the pod.

      1. First, find the name of the pods:

        # oc get pods -n kube-system | grep nvidia
        nvidia-device-plugin-daemonset-8lgqp           1/1       Running   0          47s
        
      2. Query the pods’ logs:
        oc logs -n kube-system nvidia-device-plugin-daemonset-8lgqp

        You should see something similar to the following:

        2019/11/01 21:57:07 Loading NVML
        2019/11/01 21:57:07 Fetching devices.
        2019/11/01 21:57:07 Starting FS watcher.
        2019/11/01 21:57:07 Starting OS watcher.
        2019/11/01 21:57:07 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
        2019/11/01 21:57:07 Registered device plugin with Kubelet
        
  6. Verify that the pod can request and access a GPU:

    • Download a sample pod yaml from here: https://github.com/NVIDIA/k8s-device-plugin/blob/master/pod1.yml
    • Edit the image name by changing nvidia/cuda to nvidia/cuda-ppc64le (a sketch of the resulting pod spec appears after this list)
    • Create the pod:
      oc create -f pod1.yml
    • You should see the pod with name pod1 in Running state:
      # oc get pods

      The output of the describe command should contain the following events:
      # oc describe pod pod1

      Events:
      Type    Reason     Age   From                                Message
      ----    ------     ----  ----                                -------
      Normal  Scheduled  2m    default-scheduler                   Successfully assigned nvidia/pod1 to example_sys.com
      Normal  Pulling    2m    kubelet, example_sys.com  pulling image "nvidia/cuda-ppc64le"
      Normal  Pulled     1m    kubelet, example_sys.com  Successfully pulled image "nvidia/cuda-ppc64le"
      Normal  Created    1m    kubelet, example_sys.com  Created container
      Normal  Started    1m    kubelet, example_sys.com  Started container
      
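For the image edit in step 5, any editor works; one possible one-liner, assuming the repository was cloned into the current directory, is:

    # Swap the x86_64 device plug-in image for the ppc64le one, then confirm.
    sed -i 's|nvidia/k8s-device-plugin:1.11|nvidia/k8s-device-plugin-ppc64le:1.11|' \
        openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml
    grep 'image:' openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml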
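
For step 6, the part of the sample pod that matters is the nvidia.com/gpu resource limit, which is what causes the device plug-in to allocate a GPU. The following sketch creates an equivalent pod directly; apart from the pod name, the image, and the GPU limit, the field values are illustrative rather than copied from the sample file.

    cat <<'EOF' | oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: pod1
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-container          # illustrative name
        image: nvidia/cuda-ppc64le
        command: ["sleep", "100"]     # keep the pod running long enough to inspect it
        resources:
          limits:
            nvidia.com/gpu: 1         # request a single GPU from the device plug-in
    EOF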

You can now schedule pods with GPUs in your OpenShift cluster. For finer-grained control of how GPUs are exposed and shared inside containers, you can configure a set of environment variables. These can be found here: https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec
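
For example, NVIDIA_VISIBLE_DEVICES selects which GPUs the runtime hook exposes to a container, and NVIDIA_DRIVER_CAPABILITIES selects which driver components are injected. A hedged sketch of setting them on a pod follows; everything except those two variable names is illustrative, and note that the device plug-in normally manages NVIDIA_VISIBLE_DEVICES for GPUs requested through nvidia.com/gpu.

    cat <<'EOF' | oc create -f -
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-env-example
    spec:
      restartPolicy: Never
      containers:
      - name: cuda-container
        image: nvidia/cuda-ppc64le
        command: ["nvidia-smi"]
        env:
        - name: NVIDIA_VISIBLE_DEVICES      # e.g. "0", "0,1", or "all"
          value: "0"
        - name: NVIDIA_DRIVER_CAPABILITIES  # e.g. "compute,utility"
          value: "compute,utility"
    EOF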

Troubleshooting

  • Memory race condition with cpusets
    Depending on the order in which Kubernetes starts, the GPU memory system slice under Kubernetes may not be accurate. If any action results in an error that your GPU is not available, you can use this script to reconcile the cpuset: https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh. You can also work around this by simply removing the kubepods slice directory:

    cd /sys/fs/cgroup/cpuset
    mv kubepods.slice kubepods.old
    

    It will be automatically recreated and populated with the correct information.

  • SELinux relabeling
    Any time you see an insufficient-permissions issue, check that the labels set in steps 8 and 9 of the GPU configuration are correct. The following table shows the correct labels for the files needed for a working GPU container (a short sketch for re-applying them follows this list).

    File                                  SELinux label
    /dev/nvidia*                          container_file_t
    /usr/bin/nvidia-*                     xserver_exec_t
    /var/lib/kubelet/*/*                  container_file_t
    

    Verify that the SELinux labels are correct by using the -Z flag with the ls command, for example:

    # ls -lahZ /dev/nvidia0
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia0
    
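If any of the labels in the table are wrong, they can usually be re-applied with the commands from steps 8 and 9 of the GPU configuration, for example:

    # Re-apply labels from the installed policy, then spot-check the results.
    chcon -t container_file_t /dev/nvidia*
    nvidia-container-cli -k list | restorecon -v -f -
    restorecon -Rv /dev /var/lib/kubelet
    ls -lahZ /dev/nvidia0 /usr/bin/nvidia-smi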

If you’re still having permissions issues on your system, you can also edit the device plug-in DaemonSet and change the privilege escalation of its containers by changing allowPrivilegeEscalation: false to allowPrivilegeEscalation: true. You can do this by deleting the DaemonSet using oc delete and re-creating it, or by editing the running DaemonSet with oc edit daemonset.apps/nvidia-device-plugin-daemonset. Note that this should be used only as a troubleshooting step, because allowing pods to escalate privileges is a security risk.
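
One hedged way to flip the flag without an interactive edit is a JSON patch; the container index 0 assumes the DaemonSet defines a single container, so check your spec first.

    # Troubleshooting only: allowing privilege escalation weakens container
    # isolation, so revert this once the underlying issue is fixed.
    oc -n kube-system patch daemonset nvidia-device-plugin-daemonset --type=json \
      -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/securityContext/allowPrivilegeEscalation", "value": true}]'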

Acknowledgment

Thanks to Zvonko Kaiser from Red Hat for the initial writeup and SELinux policy.