cudaSuccess (3 vs. 0) initialization error
If you’re on an AC922 server and are experiencing CUDA-related initialization or memory errors when running in a containerized platform (such as Docker, Kubernetes, or OpenShift), you may have a mismatch in your platform’s cpuset slice due to a race condition when onlining GPU memory.
Run https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh on the host to see if you’re affected. The script also provides a --correct parameter to fix any affected slices.
Over the past few years, computing has followed two dominant trends: containerization of workloads and accelerated machine learning with GPUs, specifically NVIDIA’s.
As these two technologies started to overlap, headaches were bound to arise. One is passing specific kernel modules (NVIDIA’s) into a container, along with kernel devices, in our case NVIDIA GPUs. To reconcile these issues, NVIDIA has provided a fantastic set of tools for both vanilla Docker (via nvidia-docker) and Kubernetes-like environments (via k8s-device-plugin).
As good as these tools are, there is still room for users to hit issues running GPU workloads in containers. One such error is the dreaded CudaInitializationError (cudaSuccess (3 vs. 0)). If, like me, you have spent a lot of time running containers with GPUs enabled, you have probably seen this error at least once, and it can be caused by any number of things. The message flatly states that CUDA failed to initialize. Why? Well, there are many possible reasons. Sometimes software makes improper CUDA calls, sometimes the devices aren’t set up properly, and sometimes the device driver isn’t passed into the container correctly. While most of these issues are documented, there’s one case that’s light on descriptions, and I’m going to tackle that here.
The issue I’m referring to is tied to Linux cpusets, the POWER9 AC922 (ppc64le architecture) server, and containerization.
AC922 Hardware configuration
Before we jump into how cpusets affect running NVIDIA GPUs in a container, we need to understand what IBM and NVIDIA did with their joint POWER9/NVLINK2.0 venture. The POWER9 servers, specifically AC922s, come with two physical POWER9 CPUs and up to four NVIDIA V100 GPUs. Section 2.1 of the IBM AC922 Redbook describes the hardware layout. In short, AC922s use NVLINK2.0 to connect the GPUs directly to the CPUs, instead of the traditional PCIe bus. This allows for higher bandwidth, lower latency, and the most important part of this whole discussion: coherent access to GPU memory.
This coherency is what makes the problem unique. To make GPU memory accessible to applications running on the CPU, the decision was made to online the GPUs as NUMA nodes.
A sample numactl --hardware command from an AC922 illustrates this setup:
numactl --hardware
available: 6 nodes (0,8,252-255)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 257742 MB
node 0 free: 48358 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 261735 MB
node 8 free: 186807 MB
node 252 cpus:
node 252 size: 16128 MB
node 252 free: 16115 MB
node 253 cpus:
node 253 size: 16128 MB
node 253 free: 16117 MB
node 254 cpus:
node 254 size: 16128 MB
node 254 free: 16117 MB
node 255 cpus:
node 255 size: 16128 MB
node 255 free: 16117 MB
Note: To avoid collisions between CPU node numbers and GPU node numbers, GPU numbering starts at 255 and counts down, while CPU numbering starts at 0. On this AC922, we have two CPU sockets (nodes 0 and 8) with 80 threads and 256GB of memory each, and four GPUs (nodes 252-255) with 16GB of memory each. (GPU threads aren’t listed here.)
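As a rough illustration of the layout above, the following Python sketch picks out the nodes that have memory but no CPUs from numactl --hardware output. The function name memory_only_nodes is hypothetical, and treating every CPU-less node as a GPU memory node is an assumption that matches the AC922 output shown here but may not hold on other systems.

```python
import re


def memory_only_nodes(numactl_output):
    """Return the NUMA node numbers whose "cpus:" line lists no CPUs.

    Assumption: on the AC922 layout described in this article, these
    CPU-less nodes are the GPU memory nodes (252-255).
    """
    nodes = []
    pattern = r"^node (\d+) cpus:(.*)$"
    for match in re.finditer(pattern, numactl_output, re.MULTILINE):
        cpu_list = match.group(2).split()
        if not cpu_list:
            nodes.append(int(match.group(1)))
    return nodes
```

Fed the output above, one line per entry as numactl prints it, this would pick out nodes 252 through 255.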
Now that you understand the hardware makeup of an AC922, let’s dive into a little bit of background on cpusets (https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt). Cpusets are a mechanism that allows CPU and memory nodes to be assigned to tasks, services, virtual machines, or containers. This lets the kernel limit which resources each of them can see. There are many aspects to cpusets, and you can spend hours reading about all of them. In our case, we’re mostly interested in the cpuset.mems file in the cgroup filesystem mounted under /sys/fs/cgroup. The cpuset.mems file lists which memory nodes are available at a given time. The default values are kept in /sys/fs/cgroup/cpuset/cpuset.mems, with various subdirectories keeping their own copy of cpuset.mems.
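To make the cpuset.mems format concrete, here is a minimal Python sketch that expands a list such as 0,8,252-255 into individual node numbers. The names parse_node_list and read_mems are hypothetical helpers for this illustration; the default path is the one given above.

```python
def parse_node_list(text):
    """Expand a cpuset list such as "0,8,252-255" into a sorted list of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty cpuset.mems file yields no nodes
        if "-" in part:
            start, end = part.split("-", 1)
            nodes.update(range(int(start), int(end) + 1))
        else:
            nodes.add(int(part))
    return sorted(nodes)


def read_mems(path="/sys/fs/cgroup/cpuset/cpuset.mems"):
    """Read a cpuset.mems file and return the memory nodes it lists."""
    with open(path) as handle:
        return parse_node_list(handle.read())
```

On an AC922 with all four GPUs online, read_mems() would be expected to return the two CPU nodes followed by the four GPU nodes.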
The GPU nodes, however, don’t come on by default. The systemd service nvidia-persistenced will online the GPU memory and the cpusets will get updated.
nvidia-persistenced service up:
systemctl start nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8,252-255
nvidia-persistenced service down:
systemctl stop nvidia-persistenced
cat /sys/fs/cgroup/cpuset/cpuset.mems
0,8
One final piece of background before we get to the crux of the issue is the concept of a slice unit. In systemd’s terms, “A slice unit is a concept for hierarchically managing resources of a group of processes.”
In this case, there are three slices that we need to be concerned about. With RHEL 7.6, using Red Hat’s version of Docker, or Podman, the slice in question is “system.slice”, normally located at /sys/fs/cgroup/cpuset/system.slice.
Kubernetes and OpenShift use “kubepods.slice”, which is located at /sys/fs/cgroup/cpuset/kubepods.slice.
Finally, later docker-ce versions appear to use the “docker” slice, which is at /sys/fs/cgroup/cpuset/docker. I’m not sure why they lost the “.slice” to the name, but that’s neither here nor there.
Within these slices, a subslice is created each time a container gets spun up, passing along the necessary cpuset information. Each slice and subslice contains various details, including the cpuset.mems file that contains our memory nodes.
So, what happened?
We talked about GPU memory being coherently attached to the CPUs on the AC922. For that to work, GPU memory needs to stay online at all times. Normally, when a device is no longer in use, the kernel will tear down the kernel module and devices in question. To keep the GPUs online, a systemd service was created, aptly named nvidia-persistenced. With this service, we can guarantee that the GPU memory will stay online regardless of whether the GPUs are in active use. The problem? This service comes up using systemd, same as Docker, and same as Kubernetes. Unless Docker or Kubernetes explicitly waits for the nvidia-persistenced service to start up and finish onlining GPU memory, which could take up to 5 minutes past startup, they will take whatever is available in the master cpuset at that moment and use it as the base system configuration.
When a process grabs the cgroup too early, its cpuset.mems will reflect an incomplete list of memory resources. For example, “0,8,253-255” tells us there are two CPU nodes and only three GPU nodes. If a system actually had just three GPUs, this would be a valid description, but odds are that the system has four GPUs and the value should have been “0,8,252-255” to signify that all four GPUs are present.
Once a containerization platform has an incomplete list of the GPU memory nodes, the problem stays masked until CUDA tries to initialize memory against the missing node. When a container starts, the NVIDIA driver and devices are passed through, depending on what rules you have set up, regardless of which memory nodes are specified in the cgroup. This means that, although your cpuset.mems says you have 253-255 (nvidia0-nvidia2) and 252 (nvidia3) is missing, the NVIDIA container plugins or hooks can still pass nvidia3 into a container, because by the time the container was started, all four GPUs were online. We now have a case where a GPU device is passed into a container whose cgroup does not list that GPU’s memory node.
Why doesn’t this fail all the time?
Once a machine is in this incorrect state, GPU devices and drivers can be added to a container, and even driver-based commands such as nvidia-smi will provide the correct output. This is because none of those commands try to allocate memory on the GPU. I’m sure someone will speak up and tell me I’m wrong, and that driver commands do in fact allocate “some” memory on a GPU. They’re probably right, but those commands aren’t using the cgroup values to do so, and odds are the request is being sent to the host and executed by the driver itself.
When code in a container tries to allocate CUDA memory against a device that doesn’t have a corresponding value in the cpuset.mems file, errors start to occur. Normally it’ll show up as cudaSuccess (3 vs. 0) initialization error, but other flavors can show up depending on how the code tries to allocate memory. A lot of code, such as CUDA’s deviceQuery from its sample code package, will try to touch all devices available to it. When CUDA tries to allocate memory against the device whose node is missing, things start to go wrong. If you know which device isn’t in the cpuset.mems file, you can use options like setting CUDA_VISIBLE_DEVICES to cordon off that device, and the rest of the code should work. However, this isn’t a viable long-term solution, as it effectively makes a GPU unusable in a containerized environment.
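As a sketch of the CUDA_VISIBLE_DEVICES stopgap, the Python function below builds a device list that skips any GPU whose memory node is absent from a slice’s cpuset.mems. The function name visible_devices is hypothetical. The node-to-device mapping (node 255 is nvidia0, counting down) follows the example in this article, and the sketch assumes the CUDA device index matches the nvidiaN number, which may not hold on every system.

```python
def visible_devices(slice_nodes, gpu_count=4, highest_gpu_node=255):
    """Return a CUDA_VISIBLE_DEVICES value listing only usable GPUs.

    slice_nodes: memory node numbers found in the slice's cpuset.mems.
    Assumption: GPU i maps to node (highest_gpu_node - i), as in the
    AC922 example where 253-255 are nvidia0-nvidia2 and 252 is nvidia3.
    """
    usable = [
        index
        for index in range(gpu_count)
        if highest_gpu_node - index in slice_nodes
    ]
    return ",".join(str(index) for index in usable)
```

With the incomplete “0,8,253-255” list from the earlier example, this produces a value that leaves out device 3, the GPU behind node 252.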
A bug has been created to track this problem: https://bugzilla.redhat.com/show_bug.cgi?id=1746415. While it’s being worked on, there are some workarounds, most of which involve correcting the problematic cpuset slices. I’ve written a script ( https://github.com/IBM/powerai/blob/master/support/cpuset_fix/cpuset_check.sh ) that will check the slices used by the common containerization platforms (Docker, Kubernetes, and OpenShift). If it detects a mismatch between a slice folder’s cpuset.mems and the master cpuset.mems, it will notify the user. If desired, the script will also correct the problem by removing the slice folders altogether. This needs to be done because the slice folders aren’t deleted when the respective services are shut down or restarted. Bouncing Kubernetes, for example, will keep the same kubepods.slice as before, and you’ll still have the problem.
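The detection step can be sketched in Python as follows. This is not the logic of cpuset_check.sh itself, only an illustration of the comparison it is described as making. The names SLICES, parse_node_list, read_mems, and find_stale_slices are hypothetical, the slice names are the three mentioned in this article, and the sketch only reports mismatches without removing anything.

```python
import os

# Slice directories named in this article; other platforms may differ.
SLICES = ("system.slice", "kubepods.slice", "docker")


def parse_node_list(text):
    """Expand a cpuset list such as "0,8,252-255" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if "-" in part:
            start, end = part.split("-", 1)
            nodes.update(range(int(start), int(end) + 1))
        elif part:
            nodes.add(int(part))
    return nodes


def read_mems(directory):
    """Return the memory nodes listed in <directory>/cpuset.mems."""
    with open(os.path.join(directory, "cpuset.mems")) as handle:
        return parse_node_list(handle.read())


def find_stale_slices(cpuset_root="/sys/fs/cgroup/cpuset"):
    """Map each known slice to the master nodes missing from it."""
    root_nodes = read_mems(cpuset_root)
    stale = {}
    for name in SLICES:
        path = os.path.join(cpuset_root, name)
        if not os.path.isfile(os.path.join(path, "cpuset.mems")):
            continue  # this slice does not exist on this host
        missing = root_nodes - read_mems(path)
        if missing:
            stale[name] = sorted(missing)
    return stale
```

An empty result from find_stale_slices() would mean that none of the three slices is missing a node that the master cpuset.mems lists.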
If we remove the slice folders altogether prior to (re)starting the respective service, the service will regenerate the cgroup slice from the master version, allowing the correct values to be picked up and applied.
I have dabbled a bit in trying to edit the cpuset.mems by hand for certain slice groups, and with the right permissions you should be able to do this. However, I don’t recommend it as you’ll end up with containers that may have differing copies of the cpuset.mems within a single orchestration, leading to some pretty unpredictable results. The best scenario I can think of at the moment is to bring down the service, run the script to remove the existing incorrect values, and let the service come up naturally.
One last caveat to mention: cgroups and cpusets all live in a virtual filesystem mounted under /sys/fs/cgroup. This means that they are regenerated after each reboot, so any time a system is restarted, there’s a risk that this issue could happen again. One workaround that has been explored is delaying the startup of Docker, Kubernetes, OpenShift, and so on until the NVIDIA GPUs have had time to come online. This may not be ideal, but it is still a better alternative than having to shut down the service mid-production to address this problem.
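One way to picture the delayed-startup workaround is a polling loop that blocks until the expected GPU memory nodes appear in the master cpuset.mems, run before the container service starts. The Python sketch below is only an illustration under stated assumptions: the article does not say how the delay was actually implemented, wait_for_nodes is a hypothetical name, the 300-second default mirrors the “up to 5 minutes” figure mentioned earlier, and the 5-second interval is arbitrary.

```python
import time


def wait_for_nodes(expected, read_nodes, timeout=300.0, interval=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every node in `expected` is reported by `read_nodes`.

    read_nodes: a callable returning the node numbers currently listed
    in the master cpuset.mems (how it reads them is up to the caller).
    Returns True once all expected nodes are present, or False if the
    timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        if set(expected) <= set(read_nodes()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

A wrapper could call wait_for_nodes([252, 253, 254, 255], reader) and only start the container service once it returns True, where reader is whatever function reads and parses /sys/fs/cgroup/cpuset/cpuset.mems.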
In summary, this is an issue that’s unique to a specific server, due to its ability to have coherently attached GPU memory. In creating this feature, an exposure in cgroup was discovered where node memory can be added after startup and not passed along to existing slices.
Thanks for your time and, as always, please feel free to reach out to me if you have any questions!