In WML CE 1.6.1, TensorRT was added as a technology preview. TensorRT is a platform for high-performance deep learning inference that can be used to optimize trained models. The optimization replaces each TensorRT-compatible subgraph with a single TRTEngineOp, which is used to build a TensorRT engine. These engines are networks of layers with well-defined input shapes, and they run inference using the TensorRT libraries (see Conversion Parameters for more details). Once a model is optimized with TensorRT, the traditional TensorFlow workflow is still used for inference, including TensorFlow Serving.

TensorRT-compatible subgraphs consist of TF-TRT supported ops (see Supported Ops for more details) and are directed acyclic graphs (DAGs). TensorFlow ops that are not compatible with TF-TRT, including custom ops, are run by TensorFlow.
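
To make the split between TensorRT and TensorFlow concrete, here is a toy sketch that groups consecutive supported ops in a linear chain into segments. It is a simplification under stated assumptions: the real TF-TRT segmenter works on a full DAG, the `SUPPORTED` set and the op chain are hypothetical, and `min_size` only mirrors the role of the minimum_segment_size parameter described under Conversion Parameters.

```python
# Toy model only: real TF-TRT segments a full DAG, not a linear chain.
SUPPORTED = {"Conv2D", "Relu", "MaxPool", "BiasAdd"}  # hypothetical subset


def segment_ops(ops, min_size=3):
    """Split a linear chain of op names into (backend, ops) groups.

    Runs of supported ops with at least `min_size` entries are assigned to
    "TRTEngineOp"; everything else stays with "TensorFlow".
    """
    groups = []
    run = []

    def append_group(backend, items):
        # Merge with the previous group when the backend is the same.
        if groups and groups[-1][0] == backend:
            groups[-1][1].extend(items)
        else:
            groups.append((backend, items))

    def flush_run():
        # Close the current run of supported ops, if any.
        if not run:
            return
        backend = "TRTEngineOp" if len(run) >= min_size else "TensorFlow"
        append_group(backend, list(run))
        run.clear()

    for op in ops:
        if op in SUPPORTED:
            run.append(op)
        else:
            flush_run()
            append_group("TensorFlow", [op])
    flush_run()
    return groups


chain = ["Conv2D", "BiasAdd", "Relu", "MyCustomOp", "Conv2D", "Relu"]
for backend, ops in segment_ops(chain):
    print(backend, ops)
```

In this chain the first three ops form a segment large enough to become one TRTEngineOp, while the custom op and the short run after it stay with TensorFlow.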

TensorRT can also calibrate for lower precision (FP16 and INT8) with a minimal loss of accuracy. Using a lower precision mode reduces memory bandwidth requirements and allows faster computation. It also enables the use of Tensor Cores, which multiply two 4×4 FP16 matrices and add the result to a 4×4 FP16 or FP32 matrix.
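
The Tensor Core operation can be written as D = A × B + C. As a rough illustration of the numerics only (not of the hardware or of any TensorRT API), the following pure-Python sketch rounds A and B to FP16 and accumulates the products with C in FP32. The helper names are hypothetical, and the FP32 accumulator is just one of the two options mentioned above.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def tensor_core_fma(a, b, c):
    """Emulate D = A x B + C for 4x4 matrices given as nested lists.

    A and B are rounded to FP16; the products are accumulated together
    with C in FP32.
    """
    n = 4
    a16 = [[to_fp16(v) for v in row] for row in a]
    b16 = [[to_fp16(v) for v in row] for row in b]
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The FP16 x FP16 product is exact in a Python float;
                # the running sum is rounded to FP32 after each step.
                acc = to_fp32(acc + a16[i][k] * b16[k][j])
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[0.5 * (i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
for row in tensor_core_fma(identity, b, c):
    print(row)
```

Values that FP16 cannot represent exactly (0.1, for example) are rounded on input, which is where the small accuracy loss of lower precision modes comes from.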

This blog post explains how to convert a model to a TensorRT optimized model, describes some of the parameters that can be used for the conversion, shows how to run an upstream example in the WML CE environment, and compares statistics between native and TensorRT optimized runs.

Note: TensorRT engines are optimized for the currently available GPUs, so conversions should take place on the machine that will be running inference.

Optimizing pre-trained models

In this section, we use the ResNet-50 v2 (FP32) model from the official TensorFlow models repository, saved in the /tmp/resnet directory.

# mkdir -p /tmp/resnet
# curl -s https://storage.googleapis.com/download.tensorflow.org/models/official/20181001_resnet/savedmodels/resnet_v2_fp32_savedmodel_NHWC_jpg.tar.gz | tar --strip-components=2 -C /tmp/resnet -xvz
  • Saved models can be optimized by using the saved_model_cli script included with the TensorFlow conda package:
    # saved_model_cli convert --dir /tmp/resnet/1538687457/ --output_dir /home/user/example/4/ --tag_set serve tensorrt --is_dynamic_op=True
  • Saved models and frozen graphs can also be optimized by using the TensorFlow Python TrtGraphConverter class.
    • For saved models, you need to pass in input_saved_model_dir=dir, where dir/saved_model.pb exists.
      from tensorflow.python.compiler.tensorrt import trt_convert as trt
      # Convert a saved model
      converter = trt.TrtGraphConverter(input_saved_model_dir='/tmp/resnet/1538687457/')
      graph_def = converter.convert()
      converter.save('/home/user/example/1/')
      
    • For frozen graphs, you need to pass in input_graph_def and nodes_blacklist parameters. nodes_blacklist is a list of output nodes.
      Since this example model is in the saved model format, we need to create a frozen graph:

      freeze_graph --input_saved_model_dir=/tmp/resnet/1538687457/ --output_graph=/tmp/resnet/frozen_graph.pb --saved_model_tags serve --output_node_names=softmax_tensor

      Next, we load the frozen graph into a TensorFlow GraphDef:

      import tensorflow as tf
      # Load and convert a frozen graph
      graph_def = tf.GraphDef()
      with tf.gfile.GFile("/tmp/resnet/frozen_graph.pb", 'rb') as f:
          graph_def.ParseFromString(f.read())
      

      Finally, we optimize the frozen graph using TensorRT:

      from tensorflow.python.compiler.tensorrt import trt_convert as trt
      converter = trt.TrtGraphConverter(input_graph_def=graph_def, nodes_blacklist=['softmax_tensor'])
      graph_def = converter.convert()
      converter.save('/home/user/example/2/')
      
  • When using INT8 precision mode, an additional calibration step is required to finish the optimization. The calibration data set should be representative of the problem data set. For information about INT8 calibration, see NVIDIA’s 8-bit Inference with TensorRT.
    # Get calibration data
    import requests
    IMAGE_URL = 'https://tensorflow.org/images/blogs/serving/cat.jpg'
    data = requests.get(IMAGE_URL, stream=True).content
       
    # Convert and calibrate model
    from tensorflow.python.compiler.tensorrt import trt_convert as trt
    import numpy as np
        
    converter = trt.TrtGraphConverter(input_saved_model_dir='/tmp/resnet/1538687457/', precision_mode='INT8')
    converted_graph_def = converter.convert()
    calibrated_graph_def = converter.calibrate(
        fetch_names=['softmax_tensor'],
        num_runs=1,
        feed_dict_fn=lambda: {'input_tensor:0': np.array([data])}
    )
    converter.save('/home/user/example/3/')
    

    The calibrate function accepts either feed_dict_fn or input_map_fn for mapping input tensors to data.
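
To see why the calibration data should be representative, consider a simplified symmetric INT8 quantizer. This is a minimal sketch, not TensorRT's algorithm: it derives the scale from the largest absolute value in the calibration samples, whereas TensorRT's calibrator selects thresholds with a more sophisticated method (described in the NVIDIA presentation referenced above). All function names and data values here are made up for illustration.

```python
def calibrate_scale(samples):
    """Pick a symmetric INT8 scale from representative activation values."""
    max_abs = max(abs(v) for v in samples)
    return max_abs / 127.0 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Map floats to integers in [-127, 127], clamping out-of-range values."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in codes]


calibration_data = [-2.0, -0.5, 0.0, 0.75, 1.5]  # made-up activations
scale = calibrate_scale(calibration_data)

inputs = [0.3, -1.2, 4.0]  # 4.0 lies outside the calibrated range
codes = quantize(inputs, scale)
restored = dequantize(codes, scale)
for x, q, r in zip(inputs, codes, restored):
    print(f"{x:+.3f} -> {q:+4d} -> {r:+.3f}")
```

Values inside the calibrated range survive the round trip with a small error, while the out-of-range input is clamped to the largest representable value. Calibrating on data that does not resemble the real inputs makes this kind of clamping (or wasted resolution) more likely.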

Conversion Parameters

Additional parameters can be passed to saved_model_cli and TrtGraphConverter:

  • precision_mode: The precision mode to use (FP32, FP16, or INT8)
  • minimum_segment_size: The minimum number of TensorFlow nodes required for a TensorRT subgraph to be valid.
  • is_dynamic_op: If True, TensorRT engines are converted and built at model run time instead of during the converter.convert() call. This is required if there are tensors with unknown or dynamic shapes.
  • use_calibration: Only used if precision_mode='INT8'. If True, a calibration graph will be created, and converter.calibrate() should be called. This is the recommended option. If False, all tensors that will not be fused must have quantization nodes. See NVIDIA’s INT8 Quantization for details.
  • max_batch_size: Used when is_dynamic_op=False. This is the maximum batch size for TensorRT engines. At run time, smaller batch sizes can be used, but a larger batch size will result in an error.
  • maximum_cached_engines: Used when is_dynamic_op=True. This limits the number of TensorRT engines that are cached, per TRTEngineOp.
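
The effect of maximum_cached_engines can be pictured with a small cache model. The sketch below is a toy under explicit assumptions: it keys engines by the exact input shape and evicts the least recently used entry, which may differ from how TRTEngineOp actually matches and evicts engines. The `EngineCache` class and the shapes are hypothetical.

```python
from collections import OrderedDict


class EngineCache:
    """Toy per-op cache of engines keyed by input shape (LRU eviction)."""

    def __init__(self, maximum_cached_engines=1):
        self.capacity = maximum_cached_engines
        self.engines = OrderedDict()
        self.builds = 0  # counts the expensive engine builds

    def get_engine(self, input_shape):
        key = tuple(input_shape)
        if key in self.engines:
            # Cache hit: mark this engine as most recently used.
            self.engines.move_to_end(key)
            return self.engines[key]
        # Cache miss: "build" a new engine for this shape.
        self.builds += 1
        engine = f"engine{key}"
        self.engines[key] = engine
        if len(self.engines) > self.capacity:
            # Evict the least recently used engine.
            self.engines.popitem(last=False)
        return engine


cache = EngineCache(maximum_cached_engines=2)
shapes = [(1, 224, 224, 3), (8, 224, 224, 3), (1, 224, 224, 3),
          (16, 224, 224, 3), (8, 224, 224, 3)]
for shape in shapes:
    cache.get_engine(shape)
print("engine builds:", cache.builds)
```

In this model, a workload that cycles through more distinct input shapes than the cache can hold keeps rebuilding engines, so the limit trades memory for build time.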

Running the object detection example

Image classification and object detection examples can be found at github.com/tensorflow/tensorrt. The object detection example provides performance output for various models and configurations with and without TensorRT.

  1. Get the example source code [1]:
    # git clone https://github.com/tensorflow/tensorrt --recursive
    
  2. Set up the environment:
    # conda create -n tf-trt tensorflow-gpu requests pillow cython -y
    # conda activate tf-trt
    # cd tensorrt
    # pushd tftrt/examples/object_detection
    # ./install_dependencies.sh
    # popd
    
  3. Download the COCO validation data set:
    # python
    >>> from tftrt.examples.object_detection import download_dataset
    >>> download_dataset('val2017', output_dir='coco')
    
  4. Create a “test.json” file. See the verified models for additional options.
    {
        "model_config": {
           "model_name": "ssd_inception_v2_coco",
           "output_dir": "models"
        },
        "optimization_config": {
           "use_trt": true,
           "precision_mode": "FP16"
        },
        "benchmark_config": {
            "images_dir": "coco/val2017",
            "annotation_path": "coco/annotations/instances_val2017.json",
            "batch_size": 1,
            "image_shape": [600, 600],
            "num_images": 2048,
            "output_path": "stats/ssd_inception_v2_coco_trt_fp16.json"
        }
    }
    
  5. For additional test configuration options, run the following:
    # python
    >>> import tftrt.examples.object_detection as object_detection
    >>> help(object_detection.test)
    >>> help(object_detection.optimize_model)
    >>> help(object_detection.benchmark_model)
    
  6. Run the test:
    # python -m tftrt.examples.object_detection.test test.json
  7. Below are results from three different runs of the object_detection example: native (no TensorRT), FP32 (TensorRT optimized), and FP16 (TensorRT optimized). The TensorRT optimized models show an increase in performance with minimal to no loss of precision. These results were gathered on an AC922 with 16GB NVIDIA Tesla V100 GPUs:

    Metric              Native              FP32                FP16
    avg_latency_ms      19.106557061364345  14.386320138001466  13.284107108970543
    avg_throughput_fps  52.3380532027989    69.51047873309172   75.27792359674037
    map                 0.273695258874402   0.273695258874402   0.273
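
The relative gains can be computed directly from the reported numbers. The sketch below copies the latency and throughput values from the results above; the helper names are hypothetical and not part of the example repository.

```python
# Values copied from the reported results (native, FP32, and FP16 runs).
results = {
    "Native": {"avg_latency_ms": 19.106557061364345,
               "avg_throughput_fps": 52.3380532027989},
    "FP32": {"avg_latency_ms": 14.386320138001466,
             "avg_throughput_fps": 69.51047873309172},
    "FP16": {"avg_latency_ms": 13.284107108970543,
             "avg_throughput_fps": 75.27792359674037},
}


def throughput_speedup(baseline, candidate):
    """Return the candidate throughput as a multiple of the baseline."""
    return candidate["avg_throughput_fps"] / baseline["avg_throughput_fps"]


def latency_reduction_pct(baseline, candidate):
    """Return the latency reduction relative to the baseline, in percent."""
    saved = baseline["avg_latency_ms"] - candidate["avg_latency_ms"]
    return 100.0 * saved / baseline["avg_latency_ms"]


native = results["Native"]
for name in ("FP32", "FP16"):
    run = results[name]
    print(f"{name}: {throughput_speedup(native, run):.2f}x throughput, "
          f"{latency_reduction_pct(native, run):.1f}% lower latency")
```

For these runs, the TensorRT FP32 model delivers roughly 1.33 times the native throughput and the FP16 model roughly 1.44 times.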

Summary

Converting a model to a TensorRT optimized model is a straightforward process and can enhance performance with little to no loss of accuracy. The image classification and object detection examples can be easily run to compare the performance of different models, with or without TensorRT.

[1]: This was verified with commit 3ddfab
