In previous blog posts, we’ve discussed how to enable GPUs with Docker alone. In this post, we’ll walk you through enabling GPUs in Red Hat OpenShift. The notable difference is that OpenShift is Kubernetes-based and it includes additional features that ease GPU integration. One of these features is the device plugin, which can be used for more specialized devices such as GPUs. You can read about device plugins in Red Hat’s documentation here: https://docs.openshift.com/container-platform/3.11/dev_guide/device_plugins.html.
This post will describe the process of setting up NVIDIA’s device plugin. Before you begin, you’ll need to install OpenShift onto your cluster. Refer to https://docs.openshift.com/container-platform/3.11/install/index.html for prerequisites and instructions.
After OpenShift is installed, we can configure the device plugin. This process consists of a series of steps that should be completed in the order listed, and carefully verified before moving on to the next step.
Also note, there is a troubleshooting section at the end of this post. Should you run into issues along the way, skip ahead to see if your problem is a common one.
Perform the following steps on each GPU node
- Clean up CUDA libraries from any prior installations:
yum remove -y "cuda-*" "nvidia*" "libnvidia*" dkms.noarch epel-release "nvidia-kmod*" nvidia-container-runtime-hook
- Install CUDA packages:
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
rpm -ivh cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y dkms cuda-drivers
Reboot your host
- Start and verify the NVIDIA persistence daemon:
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
systemctl status nvidia-persistenced
This should show that the daemon has started.
- Install and verify the nvidia-container-runtime-hook:
- Install and configure:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
Configure the repo’s key by following instructions at https://nvidia.github.io/nvidia-container-runtime/
yum install -y nvidia-container-runtime-hook
Note: At the time of this writing, there was a known issue with the gpg key, so it may be necessary to disable the repo gpgcheck in the nvidia-container-runtime.repo file.
sudo chcon -t container_file_t /dev/nvidia*
sudo pkill -SIGHUP dockerd
nvidia-container-cli list
nvidia-container-cli -k -d /dev/tty list
You should be able to see a list like this:
# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidia1
...
- Verify that GPU configuration is displayed by nvidia-smi:
nvidia-smi
This should provide you with a table of information about the GPU(s) on your system.
- Verify that the `nvidia-uvm` device file has been created:
ls /dev/nvidia-uvm
If the device file is not found, download the cudaInit utility for Power here: https://www.ibm.com/developerworks/community/files/app#/file/7dff6a33-b2b8-4623-8b29-3efddc6e95b7 and execute it.
- Verify that the device log file exists:
ls /var/lib/docker/volumes/metadata.db
If this file is missing, your CUDA installation did not complete properly. Uninstall and reinstall by repeating the first two steps above (cleanup and CUDA package installation).
- Enable a GPU-specific SELinux policy:
wget https://github.com/clnperez/dgx-selinux/releases/download/ppc64le/nvidia-container-ppc64le.pp
semodule -i nvidia-container-ppc64le.pp
nvidia-container-cli -k list | restorecon -v -f -
restorecon -Rv /dev
restorecon -Rv /var/lib/kubelet
The policy file that was used to create the module above can be found here: https://github.com/zvonkok/origin-ci-gpu/blob/34a7609523f733c490fb09eb42acc30d9c397912/selinux/nvidia-container.te
Verify that the following runs successfully:
docker run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le sleep 100
- Edit the udev rules (IBM POWER9 only):
Remove the “Memory hotadd request” section by referring to the instructions here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_setupRHEL.html
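Because the hook verification has to be repeated on every GPU node, the device list check can be scripted. The following is a minimal sketch rather than part of the original procedure: `check_gpu_devices` is a hypothetical helper name, and the set of required device nodes is an assumption based on the sample `nvidia-container-cli list` output shown earlier.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA or OpenShift tooling).
# Reads `nvidia-container-cli list` output on stdin and succeeds only if
# the control device, the UVM device, and at least one numbered GPU
# device node are present.
check_gpu_devices() {
    _listing=$(cat)
    for _dev in /dev/nvidiactl /dev/nvidia-uvm; do
        # -x matches the whole line, so /dev/nvidia-uvm-tools does not
        # satisfy the /dev/nvidia-uvm requirement.
        if ! printf '%s\n' "$_listing" | grep -qxF -- "$_dev"; then
            echo "missing: $_dev" >&2
            return 1
        fi
    done
    if ! printf '%s\n' "$_listing" | grep -qE '^/dev/nvidia[0-9]+$'; then
        echo "missing: no /dev/nvidia<N> GPU device nodes listed" >&2
        return 1
    fi
    echo "GPU device nodes look complete"
}

# Usage on a GPU node:
#   nvidia-container-cli list | check_gpu_devices
```

A non-zero exit status here points back at the CUDA installation or the `nvidia-uvm` step rather than at OpenShift.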
Your system should now be configured to use GPUs, including from inside Kubernetes pods. Next, we’ll perform the OpenShift-specific steps. Again, follow these steps in the order shown and pay close attention to the verification: in some cases, a step completing without error doesn’t necessarily mean that everything is working.
Configure an OpenShift project using GPUs
Execute the following steps from an OpenShift master node:
- Log in to the admin account:
oc login -u system:admin
- To enable scheduling the Device Plugin on worker nodes with GPUs, label each node as follows. Run this for each GPU node and make sure the node names match the output of `oc get nodes`:
oc label node <node-name> openshift.com/gpu-accelerator=true
oc label node <node-name> nvidia.com/gpu=true
- Create an nvidia project:
oc new-project nvidia
- Create an OpenShift Service Account:
oc create serviceaccount nvidia-deviceplugin
- Create an OpenShift Security Context Constraint:
git clone https://github.com/redhat-performance/openshift-psap
oc create -f openshift-psap/playbooks/roles/nvidia-device-plugin/files/nvidia-device-plugin-scc.yaml
Verify the Security Context Constraint:
oc get scc | grep nvidia
nvidia-deviceplugin true [*] RunAsAny RunAsAny RunAsAny RunAsAny 10 false [*]
- Create and verify the nvidia-device-plugin DaemonSet:
- From the git repo cloned earlier, edit `openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml` by replacing ‘nvidia/k8s-device-plugin:1.11’ with ‘nvidia/k8s-device-plugin-ppc64le:1.11’ as the `image:` value.
- Create the DaemonSet:
oc create -f openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml
- Verify the DaemonSet:
oc get -n kube-system daemonset.apps/nvidia-device-plugin-daemonset
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-daemonset   1         1         1       1            1                           4d
Since a DaemonSet is a resource that will result in a pod on each node, verify that they are all running:
oc get pods -n kube-system -o wide
[root@dlw06 ~]# oc get pods -n kube-system -o wide | grep nvidia
nvidia-device-plugin-daemonset-8lgqp 1/1 Running 3 4d 10.128.0.140 example_sys.com
Inspect the logs to ensure that the plugin is running inside the pod.
- First, find the name of the pod(s):
# oc get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-8lgqp 1/1 Running 0 47s
- Query the pod’s logs:
oc logs -n kube-system nvidia-device-plugin-daemonset-8lgqp
You should see something similar to the following:
2019/11/01 21:57:07 Loading NVML
2019/11/01 21:57:07 Fetching devices.
2019/11/01 21:57:07 Starting FS watcher.
2019/11/01 21:57:07 Starting OS watcher.
2019/11/01 21:57:07 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/11/01 21:57:07 Registered device plugin with Kubelet
- Verify that the pod can request and access a GPU:
- Download a sample pod yaml from here: https://github.com/NVIDIA/k8s-device-plugin/blob/master/pod1.yml
- Edit the image name by changing nvidia/cuda to nvidia/cuda-ppc64le
- Create the pod:
oc create -f pod1.yml
- You should see the pod with name pod1 in Running state:
# oc get pods
The output of the `describe` command should contain the following events:
# oc describe pod pod1
Events:
  Type    Reason     Age   From                      Message
  ----    ------     ----  ----                      -------
  Normal  Scheduled  2m    default-scheduler         Successfully assigned nvidia/pod1 to example_sys.com
  Normal  Pulling    2m    kubelet, example_sys.com  pulling image "nvidia/cuda-ppc64le"
  Normal  Pulled     1m    kubelet, example_sys.com  Successfully pulled image "nvidia/cuda-ppc64le"
  Normal  Created    1m    kubelet, example_sys.com  Created container
  Normal  Started    1m    kubelet, example_sys.com  Started container
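Since a running pod does not by itself prove that the plugin registered with the kubelet, the log inspection from the DaemonSet step is worth automating when there are several GPU nodes. This is a minimal sketch under one assumption: that the registration message matches the sample log shown earlier. `plugin_registered` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads device plugin logs on stdin and succeeds only
# if the registration message from the sample log output is present.
plugin_registered() {
    if grep -q 'Registered device plugin with Kubelet'; then
        echo "device plugin registered"
    else
        echo "device plugin NOT registered" >&2
        return 1
    fi
}

# Usage from a master node, checking every device plugin pod:
#   for pod in $(oc get pods -n kube-system -o name | grep nvidia-device-plugin); do
#       oc logs -n kube-system "$pod" | plugin_registered
#   done
```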
You can now schedule pods with GPUs in your OpenShift cluster. For finer-grained control of how GPUs are exposed and shared inside containers, you can configure a set of environment variables. These can be found here: https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec
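Two of the variables documented on that page are `NVIDIA_VISIBLE_DEVICES` and `NVIDIA_DRIVER_CAPABILITIES`. As a rough sketch, the helper below sanity-checks a capabilities value before you pass it to a container. `valid_driver_caps` is a hypothetical name, and the list of accepted capability names is an assumption taken from the README as of this writing, so check it against the linked page.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if $1 is "all" or a comma-separated list
# made up only of capability names assumed from the
# nvidia-container-runtime README.
valid_driver_caps() {
    [ -n "$1" ] || return 1
    [ "$1" = "all" ] && return 0
    # Put one capability on each line; grep -v selects any line that is
    # NOT a known name, so a successful match means the value is invalid.
    if printf '%s\n' "$1" | tr ',' '\n' |
        grep -vqxE 'compute|compat32|graphics|utility|video|display'; then
        return 1
    fi
    return 0
}

# Usage (assumes GPU 0 exists and the runtime hook is installed):
#   caps=compute,utility
#   valid_driver_caps "$caps" && docker run --rm \
#       -e NVIDIA_VISIBLE_DEVICES=0 \
#       -e NVIDIA_DRIVER_CAPABILITIES="$caps" \
#       nvidia/cuda-ppc64le nvidia-smi
```

Inside a pod scheduled through the device plugin, the GPU assignment is handled for you, so this kind of manual setting is mainly useful when testing with Docker directly on a node.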
Troubleshooting
- Memory race condition with cpusets
Depending on the order in which Kubernetes starts, the GPU memory system slice under Kubernetes may not be accurate. If any action results in an error that your GPU is not available, you can use this script to reconcile the cpuset: https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh
You can also work around this by simply removing the kubepods slice directory:
cd /sys/fs/cgroup/cpuset
mv kubepods.slice kubepods.old
It will be automatically recreated and populated with the correct information.
- SELinux relabeling
Any time you see an insufficient permissions issue, check that the labels set in the SELinux policy step above are correct. The following table shows the correct labels of the files needed to have a working GPU container.
File                    SELinux label
/dev/nvidia*            xserver_misc_device_t
/usr/bin/nvidia-*       xserver_exec_t
/var/lib/kubelet/*/*    container_file_t
Verify that the SELinux labels are correct by passing the `-Z` flag to the `ls` command, e.g.:
# ls -lahZ /dev/nvidia0
crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia0
If you’re still having permissions issues on your system, you can also edit the device plugin DaemonSet and allow privilege escalation in your containers by changing “allowPrivilegeEscalation: false” to “allowPrivilegeEscalation: true”. You can do this by deleting the DaemonSet using `oc delete` and re-creating it, or you can edit the running DaemonSet by using `oc edit -n kube-system daemonset.apps/nvidia-device-plugin-daemonset`. Note that this should be used as a troubleshooting-only step, as allowing pods to escalate privileges is a security risk.
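Before resorting to privilege escalation, it can help to script the `ls -Z` label check described above so it covers every device file at once. This is a minimal sketch: `check_selinux_type` is a hypothetical helper name, and the expected type is passed in as an argument so that you can supply whichever label is correct for the file in question.

```shell
#!/bin/sh
# Hypothetical helper: reads `ls -Z` style lines on stdin and reports any
# line whose SELinux context does not contain the expected type ($1).
check_selinux_type() {
    _expected=$1
    _status=0
    while IFS= read -r _line; do
        case $_line in
            *":${_expected}:"*) ;;  # context contains the expected type
            *)
                echo "unexpected label: $_line" >&2
                _status=1
                ;;
        esac
    done
    return "$_status"
}

# Usage on a GPU node, with the type shown in the sample output above:
#   ls -Z /dev/nvidia* | check_selinux_type container_file_t
```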
Thanks to Zvonko Kaiser from Red Hat for the initial writeup and selinux policy.