Update history:
August 6, 2018:
Updated TFLMS overhead graph with a more readable version.
Updated Figure 1, architecture diagram with a corrected version for POWER9 CPU-memory bus speed.

1. Introduction

Deep learning commonly uses GPUs for training neural networks. While this works well, the amount of memory on the GPU limits the size of the data and the depth or complexity of the neural network that can be trained. While GPU memory is often 16 GB or 32 GB, the system memory of many GPU-enabled servers is many times that. The ability to make use of that system memory during training and inferencing of neural networks on the GPU could enable larger data and more complex models to be used.

2. TensorFlow Large Model Support Overview

TensorFlow Large Model Support (TFLMS) is a Python module that provides an approach to training large models and data that cannot normally fit into GPU memory. It takes a computational graph defined by the user and automatically adds swap-in and swap-out nodes for transferring tensors from the GPU to the host and vice versa. During training and inferencing, this makes graph execution behave like operating system memory paging: the system memory is effectively treated as a paging cache for the GPU memory, and tensors are swapped back and forth between GPU memory and CPU memory. For more information about the TFLMS algorithms, see [1].

2.1 Hardware overview

We used the IBM Power AC922 server for our model training and evaluation. The AC922 features POWER9 CPUs and high speed NVLink 2.0 connections to NVIDIA Tesla V100 GPUs. When this is combined with a high-speed memory bus between the CPU and system memory, it becomes feasible to use TFLMS to swap tensors between GPU and system memory. The NVLink 2.0 connections allow 150 GB/s communication in each direction between the CPU and each GPU. Compared to the 32 GB/s PCIe Gen3 communication of traditionally connected GPUs, it is easy to see how tensor swapping over a low-bandwidth connection would severely degrade model training performance.
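
As a rough illustration of what those link speeds mean for tensor swapping, the short sketch below computes the one-way transfer time for a hypothetical 1 GB tensor at each bandwidth; the 1 GB size is an arbitrary illustrative value, not a measurement from this study.

# Back-of-the-envelope transfer times for a hypothetical 1 GB tensor over the two links.
tensor_gb = 1.0
for link, bandwidth_gbs in [("NVLink 2.0 (AC922)", 150.0), ("PCIe Gen3 x16", 32.0)]:
    transfer_ms = tensor_gb / bandwidth_gbs * 1000
    print(f"{link}: {transfer_ms:.1f} ms per {tensor_gb:.0f} GB transfer")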

Figure 1: IBM Power AC922 architecture diagram (POWER9 CPU-GPU NVLink 2.0 connections and CPU-memory bus)

2.2 3DUnet Image Segmentation

Training 3DUnet models for image segmentation generally has high memory usage requirements which can limit the size of the 3D images that can be used for training. TFLMS can allow the use of larger models and images by allowing tensors to be swapped in and out of the GPU as needed. When this functionality is combined with the high bandwidth interconnects between the GPU, the CPU, and system memory of the AC922, the use of large models and data becomes feasible as the GPU is not starved for data.

To demonstrate TFLMS in a real-world use case, we use a Keras model [6] written by David G. Ellis. This model processes multimodal MRI scans following the model architecture described by Isensee et al. in the 2017 BraTS proceedings on page 100 [7], which received 3rd place in the challenge.

The Keras model was written to process the TCGA [4, 5] and MICCAI BraTS 2017 [2, 3] datasets, and it was written to expect a Theano backend for Keras. We modified the model code to work with the Keras APIs included in TensorFlow 1.8 and PowerAI 1.5.2 in the tf.keras module, as well as to use keras.json to set backend options.

The TFLMS Keras callback was then added to the list of callbacks used during Keras model fit to allow the model to be modified for tensor swapping.
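
A minimal sketch of that change is shown below. The callback class name LMSKerasCallback and its import path are assumptions here; check the PowerAI 1.5.2 TFLMS documentation for the exact module name and the tuning parameters the callback accepts.

# Sketch only: the import path and class name below are assumed, not verified.
from lms import LMSKerasCallback  # assumed PowerAI TFLMS module

callbacks = list()
callbacks.append(LMSKerasCallback())  # rewrites the graph with swap-in/swap-out nodes
# ... append the model's other callbacks (checkpointing, logging, early stopping) ...
# model.fit_generator(train_generator, epochs=n_epochs, callbacks=callbacks)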

3. Overview of case study

The model code in Ellis’ git repository [6] was set to train on full images of size 128^3 with a batch size of 1. The code also supports training with patches rather than full images. We chose to train with full images and, in practice, found that 144^3 with batch size 1 was the maximum image resolution that could be trained without TFLMS before running out of memory on the 16 GB GPUs.

When we enable TFLMS on the model, we can train at the 192^3 image resolution with batch size 1. In terms of voxel count, the 192^3 resolution enabled by TFLMS is ~2.4x larger than 144^3.

The model produces image segmentation labels for three sub-regions referred to as “whole tumor”, “tumor core”, and “enhancing tumor”. The evaluation code in the git repository calculates Dice coefficients for the segmentations produced by the model compared to the ground truth labels. These Dice coefficients are then plotted using box and whisker plots for easier visual comparison.
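
For reference, the Dice coefficient of a predicted sub-region against its ground truth mask follows the standard 2|A ∩ B| / (|A| + |B|) formula; the sketch below is a generic implementation of that formula, not the repository's exact evaluation code.

import numpy as np

def dice_coefficient(truth, prediction):
    # Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks.
    truth = np.asarray(truth, dtype=bool)
    prediction = np.asarray(prediction, dtype=bool)
    denominator = truth.sum() + prediction.sum()
    if denominator == 0:
        return 1.0  # both masks empty: count as perfect agreement
    return 2.0 * np.logical_and(truth, prediction).sum() / denominator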

4. 192^3 image resolution result comparison

4.1 Test methodology

The model was trained with image resolutions 144^3 and 192^3 using batch size 1, and validation batch size 2. The 144^3 training runs were done without TFLMS while the 192^3 training runs leveraged TFLMS. The models are then used to predict or inference on a set of validation data and the image segmentation predictions are scored on three different sub-regions using Dice coefficients.

The model and dataset will achieve different Dice coefficients during evaluation depending on which specific subjects are in the training versus validation split and the permutations applied to the images. To minimize these differences, the same training/validation subject split was used for all model training runs. Six models were trained at the 144^3 resolution and their segmentation predictions were evaluated. The training/validation subject split that scored the highest was used for all of the comparison model training runs.

Eight models were then trained, four at each resolution. The 192^3 predictions were evaluated against the 192^3 resolution truth segmentations. The 144^3 prediction segmentations were resized to 192^3 using the same resize function used to resize the original dataset to the target resolutions. The resized 144^3 predictions were then evaluated against the same 192^3 truth segmentations as the 192^3 model predictions.
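
As an illustration of that resizing step, the sketch below upsamples a label volume with nearest-neighbor interpolation so that discrete label values are preserved; it uses scipy.ndimage.zoom for illustration and is not the repository's exact resize function.

import numpy as np
from scipy import ndimage

def resize_labels(label_volume, target_shape=(192, 192, 192)):
    # Upsample a label volume with nearest-neighbor interpolation (order=0)
    # so that segmentation labels are not blended by the resampling.
    factors = [t / s for t, s in zip(target_shape, label_volume.shape)]
    return ndimage.zoom(label_volume, zoom=factors, order=0)

# e.g. bring a 144^3 prediction up to 192^3 before scoring against the 192^3 truth
prediction_144 = np.zeros((144, 144, 144), dtype=np.int16)
prediction_192 = resize_labels(prediction_144)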

The average Dice coefficients for each of the validation subjects at each sub-region were then averaged for each resolution.

4.2 Test results

The BraTS 2017 dataset’s dimensions are very close to 144^3 with the greatest difference being in the Y dimension:

Table 1. BraTS 2017 dataset image dimensions (voxels)

          X     Y     Z
Average   140   171   140
Median    140   171   140
Minimum   124   151   121
Maximum   159   191   150

Given that the 144^3 resolution covers most of the resolution of the dataset subjects, and information loss due to scaling was generally only happening in the Y axis, we expected that the 192^3 resolution would not show large gains in the Dice coefficients over the 144^3 size. When the Dice coefficients for the “whole tumor”, “tumor core”, and “enhancing tumor” sub-regions of the models trained at the 144^3 and 192^3 resolutions are graphed using box and whisker plots, we can see there is little difference:

Figure 2: Box and whisker plots of Dice coefficients for the models trained at 144^3 and 192^3

The 192^3 resolution has slightly higher values in the whole tumor case and is almost identical in the tumor core case. The enhancing tumor sub-region is also very similar between 144^3 and 192^3, with the 192^3 resolution having slightly better values in quartiles 3 and 4.

5. Effects of 2x image resolution on Dice coefficients

TFLMS enables the training of this model at the 192^3 image resolution, which is about 2.4x larger (by voxel count) than the 144^3 resolution that can be trained without TFLMS. Since the resolution of the dataset is close to 144^3, the Dice coefficients do not really show the benefit of using roughly 2x the image resolution. To better visualize the effect of roughly 2x the image resolution on the Dice coefficients, the same test methodology described above was applied to a comparison between 112^3 and 144^3. The 112^3 resolution was chosen because 144^3 is about 2.1x larger than 112^3, which is similar to the 2.4x difference between 144^3 and 192^3.
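
The resolution factors quoted in this section are ratios of voxel counts:

print(192**3 / 144**3)  # ≈ 2.37, the "~2.4x" gap between 192^3 and 144^3
print(144**3 / 112**3)  # ≈ 2.12, the "~2.1x" gap between 144^3 and 112^3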

The comparison of the Dice coefficients between the 112^3 predictions evaluated at the 144^3 resolution and the 144^3 coefficients shows a very large advantage to using higher resolution images:

Figure 3: Box and whisker plots of Dice coefficients for the models trained at 112^3 and 144^3

Given a higher resolution dataset we could expect a similar jump in coefficients with the 2.4x resolution difference between the 144^3 models and the 192^3 models trained with TFLMS.

6. TFLMS training overhead

To measure the overhead of TFLMS on model training, the resolutions between 144^3 and 192^3 were tuned and trained with their optimal TFLMS parameter settings on an AC922 with 16 GB NVIDIA Tesla V100 GPUs. The same model resolutions were then trained without TFLMS on an AC922 with 32 GB NVIDIA Tesla V100 GPUs. This allows us to measure the overhead of TFLMS separately from the overhead of dealing with larger data resolutions. The overhead percentages are then graphed against the resolution factor above 144^3.

Figure 4: TFLMS training overhead versus resolution factor above 144^3

7. TFLMS with PCIe Gen3 on an x86 server with 2x Intel CPUs and 8 GPUs connected via NVLink

To measure the impact of NVLink 2.0 on training large models and data with TFLMS, an x86 server with 2x Intel CPUs and 8 NVIDIA Tesla V100 GPUs with 16 GB of memory was used to measure epoch times. For the multi-GPU epoch times, individual models were trained concurrently on multiple GPUs; the model was not distributed for multi-GPU training. The impact of CPU-to-GPU communication speed would be equivalent when using multi-GPU model distribution mechanisms. Distributed model training is covered later in this document.

7.1 Scaling 1 to 4 GPUs

First, the epoch times for training at the 144^3 resolution without TFLMS were measured and found to be similar between the AC922 and the x86 server:

Table 2. Epoch times in seconds, 144^3 resolution without TFLMS

System                 Per-GPU epoch times (s)    Average
x86, 1 GPU, 144^3      216                        216
AC922, 1 GPU, 144^3    197                        197
x86, 4 GPU, 144^3      219, 219, 218, 218         219
AC922, 4 GPU, 144^3    195, 198, 198, 209         200

The x86 server has 8 GPUs that are connected to the system in 4 pairs, with each pair sharing a single PCI bus. To avoid PCI bus contention, the 4 concurrent model training runs were run on GPUs that do not share a bus; therefore, the set of GPU IDs used on the x86 server is not contiguous. Also note that the epoch times do not decrease between 1 and 4 GPUs since we are running individual model trainings concurrently and not distributing a single model training across multiple GPUs.
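
A simple way to pin each concurrent training process to one GPU is sketched below; the non-contiguous IDs follow the (0, 1), (2, 3), (4, 5), (6, 7) bus pairing described later in this document, and CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the GPU.

import os

# Launch four training processes, each pinned to a GPU on its own PCI bus,
# e.g. with gpu_id set to 0, 2, 4, and 6 respectively.
gpu_id = 0
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)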

Next, the epoch times for training the model at the 192^3 resolution with TFLMS were measured. The difference between the PCI-connected GPUs and the NVLink-connected GPUs is apparent:

Table 3: Epoch times in seconds, 192^3 resolution with TFLMS

System                 Per-GPU epoch times (s)    Average
x86, 1 GPU, 192^3      1467                       1467
AC922, 1 GPU, 192^3    590                        590
x86, 4 GPU, 192^3      1520, 1508, 1496, 1498     1506
AC922, 4 GPU, 192^3    594, 602, 598, 605         600

The epoch times on the PCI connected GPUs are 2.5x those of the NVLink 2.0 connected GPUs.

To investigate where this difference comes from, the NVIDIA profiler, nvprof, was used to profile the training from steps 20 to 50 during epoch 2 since epoch 1 acts as a “warm up”. These steps were also chosen to avoid the start and end of the epoch. Since we are training with batch size 1, each step is also one subject.

The nvprof data showed that the x86 server has a larger gap between batches, during which the GPU is idle and no memory copies are happening. This accounts for 44 seconds more overhead per epoch than on the AC922. When this inter-batch overhead is factored out, the NVLink 2.0 + Volta V100 combination is still 2.4x faster at this resolution than the PCIe Gen3 + Volta V100 combination.
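
Using the 4-GPU average epoch times from Table 3, the arithmetic works out as follows:

x86_epoch, ac922_epoch, inter_batch_overhead = 1506, 600, 44  # seconds
print(x86_epoch / ac922_epoch)                                # ≈ 2.51x with the overhead included
print((x86_epoch - inter_batch_overhead) / ac922_epoch)       # ≈ 2.44x with the overhead factored out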

The nvprof data shows that the memory copies between the CPU and GPU for tensor swapping take considerably longer on the slower PCI bus and lead to the GPU becoming idle. The following figure shows one batch on both the x86 server and the AC922:

Figure 5: nvprof timelines of one batch on the x86 server and on the AC922

The bottom time bar shows the GPU compute usage and the three time bars above it show the memory copies. The red lines show the relative location of equivalent tensors in both timelines.

The white space on the GPU usage timeline shows time during the image processing when the GPU is not being utilized as it waits for the memory copy to swap in/out the next tensors to run. The impact of the NVLink 2.0 transfer speeds can be seen in the AC922’s GPU usage line. The gaps in the GPU usage due to tensor swapping are drastically smaller.

The NVIDIA Visual Profiler also reports these statistics about memory copies and GPU utilizations over the 30 batches that were profiled:

Table 4:

System    GPU utilization    Host to GPU memory copy throughput    GPU to Host memory copy throughput
x86       29.1%              10.9 GB/s                             11.7 GB/s
AC922     73.5%              64.4 GB/s                             60.4 GB/s

Here we can see that the GPU utilization is much higher on the AC922 which corresponds to the higher memory copy throughput.

7.2 Scaling beyond 4 GPUs

As noted previously, the GPUs in the x86 server are connected to PCI buses in pairs. This allows use of 4 GPUs without having contention on the PCI bus. What happens when TFLMS is used with two GPUs on the same PCI bus? To measure this, we ran 6 and 8 concurrent model training runs. We also leveraged an AC922 with 6 GPUs for this test.

Table 5: Epoch times in seconds, 192^3 resolution with TFLMS, 6 and 8 concurrent GPUs

System                 Per-GPU epoch times (s)                           Average
x86, 6 GPU, 192^3      1514, 2067, 2081, 1991, 1991, 1506                1858
AC922, 6 GPU, 192^3    674, 674, 681, 675, 672, 682                      676
x86, 8 GPU, 192^3      2110, 2108, 2146, 2116, 2025, 2025, 2038, 2038    2076

On the x86 server the GPU ID pairs are (0, 1), (2, 3), (4, 5), (6, 7). We can see in the table above that when the GPUs share a PCI bus, the epoch times increase by about 38% (comparing the average epoch time at 4 GPUs to the average time at 8 GPUs). At 6 GPUs, the average epoch time on the x86 server is 2.75x that of the AC922. At 8 GPUs, the average epoch time is 3.5x the epoch time of the AC922 at 4 GPUs. One may notice that the AC922’s epoch time also increased from its 1 and 4 GPU times. This is because the 6 GPU AC922 has a 100 GB/s per-GPU NVLink 2.0 link between the CPU and GPU, whereas the 4 GPU AC922 has a 150 GB/s per-GPU link.
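
These ratios come straight from the average epoch times in Tables 3 and 5:

x86_4gpu, x86_6gpu, x86_8gpu = 1506, 1858, 2076  # seconds
ac922_4gpu, ac922_6gpu = 600, 676                # seconds
print(x86_8gpu / x86_4gpu)    # ≈ 1.38, the ~38% increase from PCI bus sharing
print(x86_6gpu / ac922_6gpu)  # ≈ 2.75
print(x86_8gpu / ac922_4gpu)  # ≈ 3.46, the ~3.5x figure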

Graphing the epoch times of the AC922 in its 4 and 6 GPU configurations versus the x86 server when the GPUs have exclusive use of their PCI bus and when they contend with each other shows the sharp increase.

Figure 6: Epoch times for the AC922 (4 and 6 GPU configurations) versus the x86 server with exclusive and shared PCI buses

To investigate the impact of two GPUs sharing a PCI bus when performing training with TFLMS, nvprof was again used to profile the training for 30 steps during epoch 2. Here is a comparison of one batch on the x86 server when the GPU has exclusive use of the PCI bus and when two GPUs are sharing the PCI bus. Corresponding tensors between the two timelines are connected with red lines.

Figure 7: nvprof timelines of one batch on the x86 server with exclusive versus shared use of the PCI bus

When two GPUs share the same PCI bus they compete for the bandwidth which elongates the memory copy time, reduces the GPU utilization, and increases training time.

The NVIDIA Visual Profiler reports these statistics about memory copies and GPU utilizations over the 30 batches that were profiled:

Table 6:

System                     GPU utilization    Host to GPU memory copy throughput    GPU to Host memory copy throughput
x86, 1 GPU per PCI bus     29.1%              10.9 GB/s                             11.7 GB/s
x86, 2 GPUs per PCI bus    22.6%              7.6 GB/s                              8.2 GB/s
AC922 6 GPU system         66.3%              45.8 GB/s                             43.5 GB/s

Figures 8 and 9:

When two GPUs that share the same PCI bus are used with TFLMS they see a reduction of 30% in memory copy throughput due to the contention. The AC922 does not have this contention issue since the NVLink 2.0 connections between the CPU and GPU are dedicated per GPU, and GPUs do not have to compete for the available bandwidth.

8. Distributed Deep Learning Results

8.1 DDL Overview

With the increase in image resolution, the training time went up significantly. This increase in training time led us to investigate the impact of using the IBM PowerAI Distributed Deep Learning (DDL) library to distribute the training of the 3DUnet model.

The DDL library provides a simple Python interface for adding software-hardware co-optimized distributed training to an existing model [8]. DDL uses MPI to create a copy of the model on each GPU, then syncs all of the models after each step using a topology-aware AllReduce operation. This allows for near linear speedups in most cases.

8.2 DDL Integration into Model

To enable the use of DDL, the primary changes required were splitting the data, scaling the learning rate, limiting certain steps to run only on a single node, and adding a callback to synchronize the Keras metrics.

8.2.1 Splitting the Data

The training data needs to be split across the nodes to actually distribute the training. We also split the validation data to distribute the on-the-fly validation. Standard Python slicing is used to split the training and validation data based on the rank of the node and the total number of GPUs.


# Shard the training and validation lists across all ranks with strided slicing.
# ddl.size() is the total number of GPUs; ddl.rank() is this process's rank.
training_lists = [training_list[i::ddl.size()] for i in range(ddl.size())]
training_list = training_lists[ddl.rank()]

validation_lists = [validation_list[i::ddl.size()] for i in range(ddl.size())]
validation_list = validation_lists[ddl.rank()]

8.2.2 Scaling the Learning Rate

Since the data was split between the GPUs, the learning rate had to be scaled by the total number of GPUs.


# Before:
# config["initial_learning_rate"] = 5e-4
# After:
config["initial_learning_rate"] = 5e-4 * ddl.size()

8.2.3 Rank 0 Restrictions

The two primary operations that need to be restricted to running only on rank 0 are model checkpointing and logging. This is accomplished by adding these callbacks only on rank 0.


# Only do these callbacks on the rank 0 node.
if ddl.rank() == 0:
  callbacks.append(ModelCheckpoint(model_file, save_best_only=True))
  callbacks.append(CSVLogger(logging_file, append=True))

8.2.4 Early Stop Callback

Finally, an extra callback is needed to keep all metrics in sync across all nodes. This ensures that early stopping and learning rate scheduling all remain in sync.


callbacks = list()
callbacks.append(ddl.DDLCallback())  # keeps metrics in sync across all ranks

8.3 DDL Test Results

To test the DDL integration we trained on 4 AC922s, as described above, with 100Gb/s Mellanox CX5 InfiniBand adapters connected via a 100Gb/s Mellanox SB7700 InfiniBand switch. The integration of DDL had no impact on accuracy.

The 192^3 resolution has a training time of 590 seconds per epoch on a single GPU. With DDL, this training time becomes 150 seconds per epoch on a single node with 4 GPUs, 76 seconds per epoch on 8 GPUs across 2 nodes, and 40 seconds per epoch on 16 GPUs across 4 nodes. When using DDL, the total number of epochs for the model to converge and for training to be stopped by the early stop Keras callback remains unchanged. The loss, validation loss, and Dice coefficients of the models trained with DDL are equivalent to models trained without DDL.
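
The speedup and efficiency values in Table 7 below are derived from those epoch times:

baseline = 590  # seconds per epoch, 1 GPU without DDL
for gpus, epoch_time in [(4, 150), (8, 76), (16, 40)]:
    speedup = baseline / epoch_time
    efficiency = speedup / gpus
    print(f"{gpus:>2} GPUs: speedup {speedup:.2f}, efficiency {efficiency:.2%}")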

Table 7:

Number of GPUs      Epoch time    Speedup    Efficiency
1 (without DDL)     590s          -          -
4                   150s          3.93       98.33%
8                   76s           7.76       97.04%
16                  40s           14.75      92.19%

Here the scaling is slightly less than ideal. Some of this is explained by some portion of the computation being serial. However, in this example the number of images has a larger impact on scaling. The total number of images used for training is 228, which is not evenly divisible by 16, giving some GPUs 14 images and others 15. Therefore, to remain in sync, every GPU must do 15 training steps, which has the effect of slightly increasing the total amount of work.
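
The uneven split can be seen by applying the same strided slicing used in section 8.2.1 to 228 images across 16 ranks:

import math

num_images, num_gpus = 228, 16
shard_sizes = [len(range(i, num_images, num_gpus)) for i in range(num_gpus)]
print(min(shard_sizes), max(shard_sizes))  # 14 15 -> some ranks get 14 images, others 15
print(math.ceil(num_images / num_gpus))    # 15 training steps per epoch for every rank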

Figure 10:

9. Conclusion

The TFLMS module allows training of large models with large, high resolution data that cannot fit into GPU memory by allowing tensor swapping between system memory and GPU memory. While swapping adds overhead to training, when the GPUs are connected to the CPU with high speed NVLink 2.0 connections, model training runs 2.5-3.5x faster than when the GPUs are connected over a PCI bus. Distributing the model training among multiple GPUs using PowerAI Distributed Deep Learning further reduces the epoch training time.

The combination of using TFLMS with AC922 servers and their NVLink 2.0 connected GPUs allows data scientists to quickly iterate while training with large models and data.

10. Acknowledgments

Thanks to Bryant Nelson for enabling the 3DUnet model with DDL, doing the DDL test runs, and authoring that section of the blog. Thanks to Tung D. Le for authoring the TFLMS module and working with me on its refactoring and performance analysis.

11. References / Citations

[1] https://arxiv.org/abs/1807.02037
[2] Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, Burren Y, Porz N, Slotboom J, Wiest R, Lanczi L, Gerstner E, Weber MA, Arbel T, Avants BB, Ayache N, Buendia P, Collins DL, Cordier N, Corso JJ, Criminisi A, Das T, Delingette H, Demiralp Γ‡, Durst CR, Dojat M, Doyle S, Festa J, Forbes F, Geremia E, Glocker B, Golland P, Guo X, Hamamci A, Iftekharuddin KM, Jena R, John NM, Konukoglu E, Lashkari D, Mariz JA, Meier R, Pereira S, Precup D, Price SJ, Raviv TR, Reza SM, Ryan M, Sarikaya D, Schwartz L, Shin HC, Shotton J, Silva CA, Sousa N, Subbanna NK, Szekely G, Taylor TJ, Thomas OM, Tustison NJ, Unal G, Vasseur F, Wintermark M, Ye DH, Zhao L, Zhao B, Zikic D, Prastawa M, Reyes M, Van Leemput K. “The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”, IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) DOI: 10.1109/TMI.2014.2377694
[3] Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby JS, Freymann JB, Farahani K, Davatzikos C. “Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features”, Nature Scientific Data, 4:170117 (2017) DOI: 10.1038/sdata.2017.117
[4] Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-GBM collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.KLXWJJ1Q
[5] Bakas S, Akbari H, Sotiras A, Bilello M, Rozycki M, Kirby J, Freymann J, Farahani K, Davatzikos C. “Segmentation Labels and Radiomic Features for the Pre-operative Scans of the TCGA-LGG collection”, The Cancer Imaging Archive, 2017. DOI: 10.7937/K9/TCIA.2017.GJQ7R0EF
[6] https://github.com/ellisdg/3DUnetCNN
[7] https://www.cbica.upenn.edu/sbia/Spyridon.Bakas/MICCAI_BraTS/MICCAI_BraTS_2017_proceedings_shortPapers.pdf
[8] Minsik Cho, Ulrich Finkler, Sameer Kumar, David Kung, Vaibhav Saxena, Dheeraj Sreedhar. “PowerAI DDL”, August 8, 2017

2 comments on "TensorFlow Large Model Support Case Study with 3D Image Segmentation"

  1. It’s a great case study! Very impressive.
    One question: what were the file sizes of the 3D images with 192^3 and 144^3, respectively?

    • Thanks.
      The 2017 BraTS data set is a zip containing 285 subjects. Each subject is comprised of 5 files in the NIfTI file format, gzipped: 4 tumor modality files and the segmentation “truth”. A zip of all the gzipped NIfTI files is 2.2GB. So that data is compressed, and the files are preprocessed before training and then put into .h5 files which are used by the models.

      The .h5 file for 144^3 is 6.9GB. The .h5 file for 192^3 is 16GB.
