In this tutorial, we’ll walk you through enabling GPUs in Red Hat® OpenShift. OpenShift is Kubernetes-based and includes features that ease GPU integration. One of these features is the device plug-in framework, which supports specialized devices such as GPUs.
This tutorial describes the process of setting up NVIDIA’s device plug-in. Before you begin, you’ll need to install OpenShift onto your cluster. Refer to https://docs.openshift.com/container-platform/3.11/install/index.html for prerequisites and instructions.
GPU configuration
After OpenShift is installed, we can configure the device plug-in. This process consists of a series of steps that should be completed in the order listed, and carefully verified before moving on to the next step.
Also note that there is a troubleshooting section at the end of this tutorial. If you run into issues along the way, skip ahead to see if your problem is a common one.
Perform the following steps on each GPU node:
- Ensure the latest kernel is installed on your system:
yum update kernel
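If you want to confirm that the kernel you are running matches the newest installed kernel package, a quick optional check is shown below; note that a newly installed kernel only takes effect after the reboot later in these steps.
# Compare the running kernel with the most recently installed kernel package
uname -r
rpm -q --last kernel | head -1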
- Clean up CUDA libraries from any prior installations:
yum remove -y "cuda-*" "nvidia*" "libnvidia*" dkms.noarch epel-release nvidia-kmod* nvidia-container-runtime-hook
- Install CUDA packages:
wget https://developer.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
rpm -ivh cuda-repo-rhel7-10-1-local-10.1.168-418.67-1.0-1.ppc64le.rpm
rpm -ivh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
yum install -y dkms cuda-drivers
- Reboot your host.
- Start and verify the NVIDIA persistence daemon:
systemctl enable nvidia-persistenced
systemctl start nvidia-persistenced
systemctl status nvidia-persistenced
This should show that the daemon has started.
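As a quick scripted check, you can also ask systemd directly whether the unit is active:
systemctl is-active nvidia-persistenced
This should print active.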
- Install and verify the nvidia-container-runtime-hook:
Install and configure:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.repo | sudo tee /etc/yum.repos.d/nvidia-container-runtime.repo
Configure the repo’s key by following the instructions at https://nvidia.github.io/nvidia-container-runtime/
yum install -y nvidia-container-runtime-hook
Note: At the time of this writing, there was a known issue with the gpg key, so it may be necessary to disable the repo gpgcheck in the nvidia-container-runtime.repo file.
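If you hit that issue, one way to disable the check is to turn off the gpgcheck settings in the repo file. This is only a sketch, assuming the file uses the usual gpgcheck= and repo_gpgcheck= options; re-enable the check once the key problem is resolved.
# Disable GPG checking for the NVIDIA container runtime repo
sudo sed -i 's/^gpgcheck=1/gpgcheck=0/; s/^repo_gpgcheck=1/repo_gpgcheck=0/' /etc/yum.repos.d/nvidia-container-runtime.repo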
sudo chcon -t container_file_t /dev/nvidia*
sudo pkill -SIGHUP dockerd
Verify:
nvidia-container-cli list
nvidia-container-cli -k -d /dev/tty list
You should be able to see a list like this:
# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/dev/nvidia1
...
- Verify that GPU configuration is displayed by nvidia-smi:
nvidia-smi
This should provide you with a table of information about the GPUs on your system.
- Verify that the device log file exists:
ls /var/lib/docker/volumes/metadata.db
If this file is missing, your CUDA installation did not complete properly. Uninstall and reinstall by repeating the CUDA cleanup and installation steps above.
- Enable a GPU-specific SELinux policy:
wget https://github.com/clnperez/dgx-selinux/releases/download/ppc64le/nvidia-container-ppc64le.pp
semodule -i nvidia-container-ppc64le.pp
nvidia-container-cli -k list | restorecon -v -f -
restorecon -Rv /dev
restorecon -Rv /var/lib/kubelet
- Correct SELinux labels (CUDA 10 and later):
Due to a change in the location of the library files, SELinux labels will not be set correctly for use inside containers. After you run the restorecon command from the previous step (nvidia-container-cli -k list | restorecon -v -f -), you will need to re-label the CUDA library files. Do so by executing the following command:
chcon -t textrel_shlib_t /usr/lib64/libcuda.so*
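To confirm that the new label took effect, you can list the files with their SELinux context (a quick optional check):
ls -Z /usr/lib64/libcuda.so*
Each file should show textrel_shlib_t in its context.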
You can find detailed background information about the reason for this workaround in the following Bugzilla discussion: https://bugzilla.redhat.com/show_bug.cgi?id=1740643
The policy file that was used to create the module above can be found here: https://github.com/zvonkok/origin-ci-gpu/blob/34a7609523f733c490fb09eb42acc30d9c397912/selinux/nvidia-container.te
- Verify that the following runs successfully:
docker run --user 1000:1000 --security-opt=no-new-privileges --cap-drop=ALL --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le sleep 100
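As an additional check, you can run nvidia-smi inside a container to confirm that the GPUs are visible from within it. This is a sketch that assumes the runtime hook is active and the nvidia/cuda-ppc64le image pulls successfully:
docker run --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le nvidia-smi
The output should match the GPU table you saw on the host.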
- Edit the udev rules (on IBM® POWER9™ only):
Remove the “Memory hotadd request” section by referring to the instructions here: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.6.0/navigation/pai_setupRHEL.html
Your system should now be configured to use GPUs, as well as use GPUs inside Kubernetes pods. Next, we’ll perform steps that are OpenShift specific. Again, make sure you follow these steps in the order shown and pay close attention to the verification. In some cases, the completion of a step doesn’t necessarily indicate that everything is working.
Configure an OpenShift project using GPUs
Perform the following steps from an OpenShift master node:
- Log in to the admin account:
oc login -u system:admin
- To enable scheduling of the device plug-in on worker nodes with GPUs, label each GPU node as follows, making sure the node names match the output of oc get nodes:
oc label node <node> openshift.com/gpu-accelerator=true
oc label node <node> nvidia.com/gpu=true
- Create an nvidia project:
oc new-project nvidia
- Create an OpenShift service account:
oc create serviceaccount nvidia-deviceplugin
Create and verify the nvidia-device-plugin DaemonSet:
- Clone the repository using the following command:
git clone https://github.com/redhat-performance/openshift-psap
- Edit openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml by replacing ‘nvidia/k8s-device-plugin:1.11’ with ‘nvidia/k8s-device-plugin-ppc64le:1.11’ as the image: value (a sed sketch for this edit follows the create step below).
- Create the DaemonSet:
oc create -f openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml
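As an alternative to editing the file by hand in the previous step, you can script the image substitution before running the oc create command above; this is a minimal sketch assuming GNU sed:
# Swap the x86 image tag for the ppc64le one in the device plug-in manifest
sed -i 's|nvidia/k8s-device-plugin:1.11|nvidia/k8s-device-plugin-ppc64le:1.11|' openshift-psap/blog/gpu/device-plugin/nvidia-device-plugin.yml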
Verify the DaemonSet:
oc get -n kube-system daemonset.apps/nvidia-device-plugin-daemonset
NAME                             DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-device-plugin-daemonset   1         1         1       1            1                           4d
Because a DaemonSet results in a pod on each matching node, verify that the pods are all running:
oc get pods -n kube-system -o wide
[root@dlw06 ~]# oc get pods -n kube-system -o wide | grep nvidia
nvidia-device-plugin-daemonset-8lgqp   1/1   Running   3   4d   10.128.0.140   example_sys.com
Inspect the logs to ensure that the plug-in is running inside the pod.
First, find the name of the pods:
# oc get pods -n kube-system | grep nvidia
nvidia-device-plugin-daemonset-8lgqp   1/1   Running   0   47s
Query the pods’ logs:
oc logs -n kube-system nvidia-device-plugin-daemonset-8lgqp
You should see something similar to the following:
2019/11/01 21:57:07 Loading NVML
2019/11/01 21:57:07 Fetching devices.
2019/11/01 21:57:07 Starting FS watcher.
2019/11/01 21:57:07 Starting OS watcher.
2019/11/01 21:57:07 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
2019/11/01 21:57:07 Registered device plugin with Kubelet
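Once the plug-in has registered with the kubelet, each GPU node should advertise nvidia.com/gpu capacity. A quick optional check (substitute one of your labeled node names) is:
# Capacity and Allocatable should both list nvidia.com/gpu entries
oc describe node <node> | grep 'nvidia.com/gpu'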
Verify that the pod can request and access a GPU:
- Download a sample pod yaml from here: https://github.com/NVIDIA/k8s-device-plugin/blob/master/pod1.yml
- Edit the image name by changing nvidia/cuda to nvidia/cuda-ppc64le
- Create the pod:
oc create -f pod1.yml
You should see the pod with name pod1 in Running state:
# oc get pods
The output of the describe command should contain the following events:
# oc describe pod pod1
Events:
  Type    Reason     Age  From                      Message
  ----    ------     ---- ----                      -------
  Normal  Scheduled  2m   default-scheduler         Successfully assigned nvidia/pod1 to example_sys.com
  Normal  Pulling    2m   kubelet, example_sys.com  pulling image "nvidia/cuda-ppc64le"
  Normal  Pulled     1m   kubelet, example_sys.com  Successfully pulled image "nvidia/cuda-ppc64le"
  Normal  Created    1m   kubelet, example_sys.com  Created container
  Normal  Started    1m   kubelet, example_sys.com  Started container
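For reference, the essential part of the pod spec is the nvidia.com/gpu resource limit that the device plug-in exposes. The following is a minimal sketch of an equivalent pod created inline with a heredoc instead of the downloaded file; the pod name and sleep command are illustrative:
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test            # illustrative name
  namespace: nvidia
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda-ppc64le
    command: ["sleep", "300"]
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU from the device plug-in
EOF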
You can now schedule pods with GPUs in your OpenShift cluster. For finer-grained control of how GPUs are exposed and shared inside containers, you can configure a set of environment variables. These can be found here: https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec
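For example, at the plain Docker level these are ordinary environment variables read by the runtime hook; a sketch that restricts a container to the first GPU and to the compute and utility driver capabilities (reusing the SELinux label from the earlier verification step) might look like this:
docker run -e NVIDIA_VISIBLE_DEVICES=0 -e NVIDIA_DRIVER_CAPABILITIES=compute,utility --security-opt label=type:nvidia_container_t nvidia/cuda-ppc64le nvidia-smi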
Troubleshooting
Memory race condition with cpusets
Depending on the order in which Kubernetes starts, the GPU memory system slice under Kubernetes may not be accurate. If any action results in an error that your GPU is not available, you can use this script to reconcile the cpuset: https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh. You can also work around this by simply removing the kubepods slice directory:
cd /sys/fs/cgroup/cpuset
mv kubepods.slice kubepods.old
It will be automatically recreated and populated with the correct information.
SELinux relabeling
Any time you see an insufficient permissions issue, check that the labels set in the SELinux labeling steps above are correct. The following table shows the correct labels for the files needed to have a working GPU container.

File                   SELinux label
/dev/nvidia*           container_file_t
/usr/bin/nvidia-*      xserver_exec_t
/var/lib/kubelet/*/*   container_file_t
Verify that the SELinux labels are correct by using the -Z flag of the ls command, for example:
# ls -lahZ /dev/nvidia0
crw-rw-rw-. root root system_u:object_r:container_file_t:s0 /dev/nvidia0
If you’re still having permissions issues on your system, you can also edit the device plug-in DaemonSet and change the privilege escalation of its containers by changing allowPrivilegeEscalation: false to allowPrivilegeEscalation: true. You can do this by deleting the DaemonSet using oc delete and re-creating it, or by editing the running DaemonSet using oc edit daemonset.apps/nvidia-device-plugin-daemonset. Note that this should be used only as a troubleshooting step, as allowing pods to escalate privileges is a security risk.
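If you prefer to make this change non-interactively, a JSON patch can flip the field in place. The following is a sketch that assumes the device plug-in container is the first (and only) container in the DaemonSet's pod template and that the securityContext block is present, as it is in the repository's yaml:
# Set allowPrivilegeEscalation to true on the device plug-in container
oc -n kube-system patch daemonset nvidia-device-plugin-daemonset --type=json -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/securityContext/allowPrivilegeEscalation", "value": true}]'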
Acknowledgment
Thanks to Zvonko Kaiser from Red Hat for the initial writeup and SELinux policy.