Image data channel ordering is usually specified as “channels first” (NCHW) or “channels last” (NHWC). On GPUs, many operations run faster with data in “channels first” format, so TensorFlow includes a layout optimizer that inserts transposes to put the data in the layout it expects will compute fastest. These transposes produce additional tensors that consume GPU memory during model execution. That memory overhead can limit the achievable data resolution, batch size, or model size, even when TensorFlow Large Model Support (TFLMS) is used.
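For reference, a Keras model can be built channels first by setting the image data format before the layers are constructed. The following is a minimal sketch using the standard Keras backend API; layers can also take an explicit data_format argument:

import tensorflow as tf
from tensorflow.python.keras import backend as K

# Build subsequent Keras layers in "channels first" (NCHW) order;
# the Keras default is "channels_last" (NHWC).
K.set_image_data_format('channels_first')

# The layout can also be set per layer with the data_format argument.
conv = tf.keras.layers.Conv2D(64, (7, 7), strides=(2, 2),
                              data_format='channels_first')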

To investigate the effect of the layout optimizer on GPU memory usage, we can use the TFLMS Keras_ResNet50 example that ships with PowerAI 1.6.0. This example has command line options to build the model channels first or channels last and to list the tensors occupying GPU memory when an out-of-memory error occurs.

If we run the example with channels last and image size 6000×6000, we see the following tensors in GPU memory at the time of the out-of-memory error:

$ python Keras_ResNet50.py --image_size 6000 --channels_last --show_tensors_on_oom
...
Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  2.15GiB from pool1_pad/Pad
  2.15GiB from bn2a_branch1/cond/FusedBatchNorm
  2.15GiB from bn_conv1/cond/FusedBatchNorm
  2.15GiB from conv1/Conv2D
  2.15GiB from res2a_branch1/Conv2D
  2.15GiB from training/RMSprop/gradients/AddN_173-0-TransposeNHWCToNCHW-LayoutOptimizer
  2.15GiB from training/RMSprop/gradients/AddN_161-0-TransposeNHWCToNCHW-LayoutOptimizer
  2.15GiB from res2a_branch2c/Conv2D
  2.15GiB from training/RMSprop/gradients/AddN_160-0-TransposeNHWCToNCHW-LayoutOptimizer
  549.32MiB from bn2b_branch2b/cond/FusedBatchNorm
  549.32MiB from bn2a_branch2a/cond/FusedBatchNorm
  549.32MiB from bn2a_branch2b/cond/FusedBatchNorm
  549.32MiB from bn2b_branch2a/cond/FusedBatchNorm
  549.32MiB from max_pooling2d/MaxPool
  549.32MiB from res2a_branch2a/Conv2D
  549.32MiB from res2a_branch2b/Conv2D
  549.32MiB from training/RMSprop/gradients/AddN_169-0-TransposeNHWCToNCHW-LayoutOptimizer
  549.32MiB from training/RMSprop/gradients/AddN_166-0-TransposeNHWCToNCHW-LayoutOptimizer
  549.32MiB from res2b_branch2a/Conv2D
  549.32MiB from res2b_branch2b/Conv2D
  549.32MiB from training/RMSprop/gradients/AddN_156-0-TransposeNHWCToNCHW-LayoutOptimizer
  549.32MiB from training/RMSprop/gradients/zeros_276-0-1-TransposeNCHWToNHWC-LayoutOptimizer
  412.81MiB from conv1/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer
  Remaining 642 nodes with 90.18MiB

At the time of the out-of-memory error, the top memory-consuming tensors include eight LayoutOptimizer transpose tensors totaling roughly 9 GiB of GPU memory.

Conversely, if the same command is run with channels first, the list at the time of the out-of-memory error is free of LayoutOptimizer tensors:

$ python Keras_ResNet50.py --image_size 6000 --show_tensors_on_oom
...
Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  2.15GiB from pool1_pad/Pad
  2.15GiB from bn2a_branch1/cond/FusedBatchNorm
  2.15GiB from bn_conv1/cond/FusedBatchNorm
  2.15GiB from conv1/Conv2D
  2.15GiB from training/RMSprop/gradients/zeros_312
  2.15GiB from res2a_branch1/Conv2D
  2.15GiB from training/RMSprop/gradients/zeros_290
  2.15GiB from res2a_branch2c/Conv2D
  2.15GiB from training/RMSprop/gradients/zeros_288
  2.15GiB from res2b_branch2c/Conv2D
  549.32MiB from bn2a_branch2a/cond/FusedBatchNorm
  549.32MiB from bn2a_branch2b/cond/FusedBatchNorm
  549.32MiB from bn2b_branch2a/cond/FusedBatchNorm
  549.32MiB from bn2b_branch2b/cond/FusedBatchNorm
  549.32MiB from max_pooling2d/MaxPool
  549.32MiB from res2a_branch2a/Conv2D
  549.32MiB from training/RMSprop/gradients/zeros_306
  549.32MiB from res2a_branch2b/Conv2D
  549.32MiB from training/RMSprop/gradients/zeros_300
  549.32MiB from res2b_branch2a/Conv2D
  549.32MiB from training/RMSprop/gradients/zeros_282
  549.32MiB from res2b_branch2b/Conv2D
  549.32MiB from training/RMSprop/gradients/zeros_276
  412.81MiB from conv1_pad/Pad
  Remaining 642 nodes with 90.18MiB

In data throughput tests with this example at image size 6500×6500, using TFLMS to avoid out-of-memory errors, we measured the “channels first” model as 9.5% faster than the “channels last” model.

The layout optimizer can be disabled by adding code like this:

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2
from tensorflow.python.keras import backend as K

# Turn off the graph rewriter's layout optimizer and make Keras use a
# session created with this configuration.
config = tf.ConfigProto()
config.graph_options.rewrite_options.layout_optimizer = rewriter_config_pb2.RewriterConfig.OFF
K.set_session(tf.Session(config=config))
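This configuration should be put in place before the model is built and trained; with the layout optimizer off, the TransposeNHWCToNCHW/TransposeNCHWToNHWC LayoutOptimizer nodes seen in the channels last run above should no longer appear in the graph.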

Disabling the layout optimizer and running with channels last produced the slowest run with this example.

This is not always the case. The DeepLabV3+ model (at commit level 078575a) is written to be channels last by default. The layout optimizer runs when this model is trained on a GPU, and it consumes significant GPU memory at higher resolutions. When the layout optimizer is disabled on this model, we see faster data throughput during training with TFLMS, and removing the memory overhead of the optimizer makes higher image resolutions achievable.

In summary, models that will run on GPUs are best written channels first by default to avoid the memory and data throughput overhead of the layout optimizer. If that is not feasible, disabling the layout optimizer may yield better performance, but performance testing with the optimizer enabled and disabled is necessary to determine the best-performing combination.
