In previous blog posts, we’ve discussed how to enable GPUs with Docker alone. In this post, we’ll walk you through enabling GPUs in Red Hat OpenShift. The notable difference is that OpenShift is Kubernetes-based and it includes additional features that ease GPU integration. One of these features is the device plugin, which can be used for more specialized devices such as GPUs. You can read about device plugins in Red Hat’s documentation here: https://docs.openshift.com/container-platform/3.11/dev_guide/device_plugins.html.
This post will describe the process of setting up NVIDIA’s device plugin. Before you begin, you’ll need to install OpenShift onto your cluster. Refer to https://docs.openshift.com/container-platform/3.11/install/index.html for prerequisites and instructions.

GPU configuration

After OpenShift is installed, we can configure the device plugin. This process consists of a series of steps that should be completed in the order listed, and carefully verified before moving on to the next step.
Also note, there is a troubleshooting section at the end of this post. Should you run into issues along the way, skip ahead to see if your problem is a common one.

Perform the following steps on each GPU node

  1. Clean up CUDA libraries from any prior installations:
    yum remove -y "cuda-*" "nvidia*" "libnvidia*" dkms.noarch epel-release nvidia-kmod* nvidia-container-runtime-hook
  2. Install the CUDA packages:
    wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
    rpm -ivh cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
    rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    yum install -y dkms cuda-drivers
    

    Reboot your host

  3. Start and verify the NVIDIA persistence daemon:
    systemctl enable nvidia-persistenced
    systemctl start nvidia-persistenced
    systemctl status nvidia-persistenced 
    

    This should show that the daemon has started.

  4. Install and verify the nvidia-container-runtime-hook:
    • Install and configure:
      distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
      curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo |   sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
      

      Configure the repo’s key by following instructions at https://nvidia.github.io/nvidia-container-runtime/

      yum install -y nvidia-container-runtime-hook

      Note: At the time of this writing, there was a known issue with the GPG key, so it may be necessary to disable the repo gpgcheck in the nvidia-container-runtime.repo file (a sketch of this workaround appears after this list).

      sudo chcon -t container_file_t  /dev/nvidia* 
      sudo pkill -SIGHUP dockerd 
      
    • Verify:
      nvidia-container-cli list
      nvidia-container-cli -k -d /dev/tty list

      You should be able to see a list like this:

      # nvidia-container-cli list
      /dev/nvidiactl
      /dev/nvidia-uvm
      /dev/nvidia-uvm-tools
      /dev/nvidia-modeset
      /dev/nvidia0
      /dev/nvidia1
      ...
      
  5. Verify that GPU configuration is displayed by nvidia-smi:
    nvidia-smi

    This should provide you with a table of information about the GPU(s) on your system.
  6. Verify that the `nvidia-uvm` device file has been created:
    ls /dev/nvidia-uvm

    If the device file is not found, download the cudaInit utility for Power here: https://www.ibm.com/developerworks/community/files/app#/file/7dff6a33-b2b8-4623-8b29-3efddc6e95b7 and execute it:
    ./cudaInit_ppc64le
  7. Verify that the device log file exists:
    ls /var/lib/docker/volumes/metadata.db

    If this file is missing, your CUDA installation did not complete properly. Uninstall and reinstall by repeating steps 1 and 2. (The host-level checks from steps 3 through 7 are combined into a single script after this list.)
  8. Enable a GPU-specific SELinux policy:
    wget https://github.com/clnperez/dgx-selinux/releases/download/ppc64le/nvidia-container-ppc64le.pp
    semodule -i nvidia-container-ppc64le.pp
    nvidia-container-cli -k list | restorecon -v -f -
    restorecon -Rv /dev
    restorecon -Rv /var/lib/kubelet
    

    The policy file that was used to create the module above can be found here: https://github.com/zvonkok/origin-ci-gpu/blob/34a7609523f733c490fb09eb42acc30d9c397912/selinux/nvidia-container.te
    Verify that the following runs successfully:

    docker run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le sleep 100
    
  9. Edit the udev rules (IBM POWER9 only):
    Remove the “Memory hotadd request” section by referring to the instructions here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_setupRHEL.html
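
If you hit the GPG key issue mentioned in the note under step 4, one way to disable the check is shown below. This is a minimal sketch: it assumes the repo file was written to /etc/yum.repos.d/nvidia-container-runtime.repo by the tee command above and that it uses the gpgcheck and/or repo_gpgcheck keys; re-enable the check once the key issue is resolved.

    # Disable GPG checking for the nvidia-container-runtime repo only (sketch)
    sudo sed -i 's/^gpgcheck=1/gpgcheck=0/; s/^repo_gpgcheck=1/repo_gpgcheck=0/' \
        /etc/yum.repos.d/nvidia-container-runtime.repo

    # Confirm the change before re-running yum install
    grep -E '^(gpgcheck|repo_gpgcheck)' /etc/yum.repos.d/nvidia-container-runtime.repo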
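
Before moving on to the OpenShift-specific steps, you may want to re-run the host-level checks from steps 3 through 7 in one pass. The script below is a minimal sketch of those same checks; it changes no configuration and only reports what is missing.

    #!/bin/bash
    # Quick host-level GPU sanity check (sketch): repeats the verifications above.
    systemctl is-active --quiet nvidia-persistenced \
        && echo "OK: nvidia-persistenced is running" \
        || echo "FAIL: nvidia-persistenced is not running (step 3)"
    nvidia-container-cli list > /dev/null 2>&1 \
        && echo "OK: nvidia-container-cli can enumerate devices" \
        || echo "FAIL: nvidia-container-cli failed (step 4)"
    nvidia-smi > /dev/null 2>&1 \
        && echo "OK: nvidia-smi sees the GPU(s)" \
        || echo "FAIL: nvidia-smi failed (step 5)"
    [ -e /dev/nvidia-uvm ] \
        && echo "OK: /dev/nvidia-uvm is present" \
        || echo "FAIL: /dev/nvidia-uvm is missing (step 6)"
    [ -e /var/lib/docker/volumes/metadata.db ] \
        && echo "OK: /var/lib/docker/volumes/metadata.db is present" \
        || echo "FAIL: metadata.db is missing (step 7)"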

Your system should now be configured to use GPUs, both on the host and inside Kubernetes pods. Next, we’ll perform steps that are OpenShift-specific. Again, make sure you follow these steps in the order shown and pay close attention to the verification steps; in some cases, completing a step doesn’t necessarily mean that everything is working.

Configure an OpenShift project using GPUs

Execute the following steps from an OpenShift master node:

  1. Login to the admin account:
    oc login -u system:admin
  2. To enable scheduling the Device Plugin on worker nodes with GPUs, label each node as follows. Run this for each GPU node, making sure the node names match the output of `oc get nodes` (a loop for labeling several nodes at once is sketched after this list):
    oc label node <node-name> openshift.com/gpu-accelerator=true
    oc label node <node-name> nvidia.com/gpu=true
  3. Create an nvidia project:
    oc new-project nvidia
  4. Create an OpenShift Service Account:
    oc create serviceaccount nvidia-deviceplugin
  5. Create an OpenShift Security Context Constraint:
    git clone https://github.com/redhat-performance/openshift-psap
    oc create -f openshift-psap/playbooks/roles/nvidia-device-plugin/files/nvidia-device-plugin-scc.yaml
    

    Verify the Security Context Constraint:

    oc get scc | grep nvidia
    nvidia-deviceplugin   true      [*]       RunAsAny    RunAsAny           RunAsAny    RunAsAny    10         false            [*]
    
  6. Create and verify the nvidia-device-plugin DaemonSet:
    • From the git repo cloned earlier, edit `openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml` by replacing ‘nvidia/k8s-device-plugin:1.11’ with ‘nvidia/k8s-device-plugin-ppc64le:1.11’ as the `image:` value.
    • Create the DaemonSet:
      oc create -f openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml
    • Verify the DaemonSet:
      oc get -n kube-system daemonset.apps/nvidia-device-plugin-daemonset

      NAME                             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
      nvidia-device-plugin-daemonset   1         1         1         1            1                     4d

      Since a DaemonSet is a resource that will result in a pod on each node, verify that they are all running:
      oc get pods -n kube-system -o wide

      [root@dlw06 ~]# oc get pods -n kube-system -o wide | grep nvidia
      nvidia-device-plugin-daemonset-8lgqp           1/1       Running   3          4d        10.128.0.140   example_sys.com   

      Inspect the logs to ensure that the plugin is running inside the pod.

      1. First, find the name of the pod(s):
        # oc get pods -n kube-system | grep nvidia
        
        nvidia-device-plugin-daemonset-8lgqp           1/1       Running   0          47s
      2. Query the pod’s logs:
        oc logs -n kube-system nvidia-device-plugin-daemonset-8lgqp
        You should see something similar to the following:

        2019/11/01 21:57:07 Loading NVML
        2019/11/01 21:57:07 Fetching devices.
        2019/11/01 21:57:07 Starting FS watcher.
        2019/11/01 21:57:07 Starting OS watcher.
        2019/11/01 21:57:07 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
        2019/11/01 21:57:07 Registered device plugin with Kubelet
  7. Verify that the pod can request and access a GPU:
    • Download a sample pod yaml from here: https://github.com/NVIDIA/k8s-device-plugin/blob/master/pod1.yml
    • Edit the image name by changing nvidia/cuda to nvidia/cuda-ppc64le
    • Create the pod:
      oc create -f pod1.yml
    • You should see the pod with name pod1 in Running state:
      # oc get pods
      The output of the describe command should contain the following events:
      # oc describe pod pod1

      Events:
        Type    Reason     Age   From                                Message
        ----    ------     ----  ----                                -------
        Normal  Scheduled  2m    default-scheduler                   Successfully assigned nvidia/pod1 to example_sys.com
        Normal  Pulling    2m    kubelet, example_sys.com  pulling image "nvidia/cuda-ppc64le"
        Normal  Pulled     1m    kubelet, example_sys.com  Successfully pulled image "nvidia/cuda-ppc64le"
        Normal  Created    1m    kubelet, example_sys.com  Created container
        Normal  Started    1m    kubelet, example_sys.com  Started container
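
If you have several GPU nodes, the labels from step 2 can be applied in a loop instead of one node at a time. The following is a minimal sketch; it assumes a hypothetical gpu-nodes.txt file listing the GPU node names exactly as they appear in `oc get nodes`, one per line.

    # Label every GPU node listed in gpu-nodes.txt (hypothetical file, one name per line)
    while read -r node; do
        oc label node "$node" openshift.com/gpu-accelerator=true
        oc label node "$node" nvidia.com/gpu=true
    done < gpu-nodes.txt

    # Confirm that the labels were applied
    oc get nodes -L openshift.com/gpu-accelerator -L nvidia.com/gpu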

You can now schedule pods with GPUs in your OpenShift cluster. For finer-grained control of how GPUs are exposed and shared inside containers, you can configure a set of environment variables. These can be found here: https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec
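
As an illustration of how those variables fit into a pod definition, here is a minimal sketch of a pod similar to pod1 from step 7. The pod name, file name, and image tag are placeholders; the NVIDIA_DRIVER_CAPABILITIES variable comes from the nvidia-container-runtime documentation linked above, and nvidia.com/gpu is the resource advertised by the device plugin.

    # gpu-env-example.yaml (sketch): request one GPU and limit the driver capabilities exposed
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-env-example
    spec:
      restartPolicy: Never
      containers:
      - name: cuda
        image: nvidia/cuda-ppc64le
        command: ["sleep", "300"]
        env:
        - name: NVIDIA_DRIVER_CAPABILITIES   # driver features exposed to the container
          value: "compute,utility"
        # NVIDIA_VISIBLE_DEVICES is normally set for the allocated GPU by the
        # device plugin; overriding it manually would bypass that assignment.
        resources:
          limits:
            nvidia.com/gpu: 1                # request one GPU from the device plugin

    # Create the pod and check that it reaches Running state
    oc create -f gpu-env-example.yaml
    oc get pods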

Troubleshooting

  • Memory race condition with cpusets
    Depending on the order in which Kubernetes starts, the GPU memory system slice under Kubernetes may not be accurate. If any action results in an error that your GPU is not available, you can use the script at https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh to reconcile the cpuset. You can also work around this by simply removing the kubepods slice directory:
    cd /sys/fs/cgroup/cpuset
    mv kubepods.slice kubepods.old

    It will be automatically recreated and populated with the correct information.

  • SELinux relabeling
    Any time you see an insufficient-permissions issue, check that the labels set in step 8 of the GPU configuration above are correct. The following table shows the correct labels for the files needed to have a working GPU container.

    File                                     SELinux label                  
    /dev/nvidia*                             xserver_misc_device_t
    /usr/bin/nvidia-*                        xserver_exec_t
    /var/lib/kubelet/*/*                     container_file_t

    Verify that the SELinux labels are correct by using the `-Z` flag to the `ls` command, e.g.

    # ls -lahZ /dev/nvidia0
    crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia0

    If you’re still having permissions issues on your system, you can also edit the device plugin DaemonSet and change the privilege escalation of your containers by changing “allowPrivilegeEscalation: false” to “allowPrivilegeEscalation: true”. You can do this by deleting the DaemonSet using `oc delete` and re-creating it, or by editing the running DaemonSet with `oc edit daemonset.apps/nvidia-device-plugin-daemonset`. Note that this should be used only as a troubleshooting step, as allowing pods to escalate privileges is a security risk.
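
A quick way to spot-check the labels in the table above is to compare what is currently on disk with what the policy expects, without changing anything. The following is a minimal sketch; `restorecon -nv` and `matchpathcon` only report, so it is safe to run before deciding whether to relabel.

    # Show the current SELinux context on the GPU-related paths
    ls -lahZ /dev/nvidia* /usr/bin/nvidia-* 2>/dev/null

    # List, without modifying anything, the files restorecon would relabel
    restorecon -nvR /dev /var/lib/kubelet

    # Compare a single path against the context expected by the loaded policy
    matchpathcon /dev/nvidia0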

Thanks to Zvonko Kaiser from Red Hat for the initial writeup and selinux policy.
