The IBM Power System AC922 can have many physical cores, and with the ability to specify a symmetric multithreading value of 4 (SMT4), it can present a very large number of logical processors. This enables a high degree of concurrent work across physical CPU cores. When the AC922's GPUs are used for TensorFlow jobs, the CPU threads handle some neural network operations and data I/O. By default, TensorFlow creates many threads on the CPUs, which, when combined with operating system process limits, can artificially limit the number of concurrent TensorFlow jobs on the system.
TensorFlow configures two thread pools for processing data and neural network operations on the CPU. At a high level, one of these pools runs independent operations in parallel; the other pool parallelizes the internal execution of a single operation. By default, TensorFlow initializes each of these thread pools with a number of threads equal to the number of logical processors on the system.
Let’s investigate how this thread pool initialization works on the AC922. A two-socket AC922 with 16 cores per socket and SMT4 has 128 logical processors, and one with 20 cores per socket and SMT4 has 160 logical processors. This means that each TensorFlow process on an AC922 with 20 cores per socket and SMT4 creates 320 threads to service these two thread pools.
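The arithmetic above can be sketched in a few lines; the socket count and core count are the illustrative values from the example, not a general AC922 configuration:

```python
# Thread-count arithmetic for the example two-socket AC922 with 20 cores
# per socket running in SMT4 mode.
sockets = 2
cores_per_socket = 20
smt = 4  # SMT4: four hardware threads per physical core

logical_processors = sockets * cores_per_socket * smt
print(logical_processors)  # 160

# TensorFlow sizes each of its two CPU thread pools to the logical
# processor count, so each process creates roughly two pools' worth.
pool_threads = 2 * logical_processors
print(pool_threads)  # 320
```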
The TensorFlow documentation describes Session configuration values that tune the number of threads in these two pools. In some cases, a high thread count in these pools can negatively impact performance. Job performance can be tuned either by lowering the system-wide SMT value to reduce the number of logical processors, or by setting the Session configuration values to tune each application individually.
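As a sketch, the two pool sizes can be capped through the TensorFlow 1.x Session configuration. The thread counts below are illustrative placeholders, not recommendations; appropriate values depend on the workload:

```python
import tensorflow as tf  # TensorFlow 1.x API, matching the examples in this article

# Cap both CPU thread pools instead of accepting the default of one
# thread per logical processor in each pool.
config = tf.ConfigProto(
    inter_op_parallelism_threads=4,  # pool that runs independent ops in parallel
    intra_op_parallelism_threads=8)  # pool that parallelizes a single op internally
session = tf.Session(config=config)
```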
The high number of threads possible on the AC922, combined with operating system or user limits on the number of processes, can limit the number of concurrent TensorFlow jobs a user can run. The native threads created for these TensorFlow worker pools count toward the ‘nproc’ limit, which can be set for individual users or for all users (see your operating system’s ulimit documentation).
To continue the example of the AC922 with 20 cores per socket and SMT4, each TensorFlow job creates about 340 threads: the 320 pool threads plus the process’s other threads. If the nproc limit for a user is set to 4096, the user is limited to 12 concurrent TensorFlow jobs. When the nproc limit is hit, the TensorFlow jobs fail with error messages like this on Red Hat Enterprise Linux:
```
terminate called after throwing an instance of 'std::system_error'
  what():  Resource temporarily unavailable
```
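As a rough sketch, the job ceiling can be estimated from the per-job thread count and the nproc limit. On Linux, Python’s `resource` module can read the current per-user process limit; the 4096 value below is the hypothetical limit from the example:

```python
import resource

# Approximate thread count per TensorFlow job from the discussion above.
threads_per_job = 340

# Hypothetical nproc limit from the example. The actual soft/hard limits
# for the current user can be read like this on Linux:
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)

nproc_limit = 4096
max_jobs = nproc_limit // threads_per_job
print(max_jobs)  # 12
```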
When using containers, the nproc limit is still in effect. The limits are set in the container’s host operating system and will limit the number of processes for a user across the system and thus across containers. With the nproc limit set in the host environment, there is no way to exceed or reset the limit from within a container.
Running multiple TensorFlow jobs per GPU is especially important for inference on GPUs with 32 GB of HBM2 memory, where each job has a small memory footprint (for example, 480 MB – 1 GB). In that case, GPU memory is not the bottleneck; the maximum number of jobs per GPU is. Tuning the number of threads in these pools or adjusting the nproc limit allows more TensorFlow jobs per user and can lead to higher GPU utilization for inference jobs.
GPU memory utilization limits for TensorFlow can be used to ensure that each TensorFlow Session gets a fair share of GPU memory and that we do not hit CUDA out-of-memory (OOM) errors. For example, we can request a certain fraction of GPU memory per Session like this:
```python
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.05
session = tf.Session(config=config)
```
In conclusion, when running multiple TensorFlow jobs on an IBM Power System AC922, take both the nproc limits and the sizes of the TensorFlow thread pools into consideration. If the TensorFlow jobs would surpass the nproc limit, or if runtime performance suffers, adjust the sizes of the thread pools.