Using TensorRT models with TensorFlow Serving on IBM WML CE

IBM® Watson™ Machine Learning Community Edition (WML CE) 1.6.1 added packages for both NVIDIA TensorRT and TensorFlow Serving. These two packages provide functions that can be used for inference work. This tutorial describes the steps that a user should perform to use TensorRT-optimized models and to deploy them with TensorFlow Serving.

Introduction to using TensorRT models with TensorFlow Serving

NVIDIA TensorRT provides a method to optimize a TensorFlow graph to be used for inference operations, and improve the run-time performance of the inference requests performed with the model.

Deep learning discussions often emphasize training models. This tutorial instead shows how TensorRT and TensorFlow Serving can be combined to take trained models and deploy them for inference work.

Background information on TensorRT and TensorFlow Serving

The notes in the TensorRT in TensorFlow repository provide introductory and detailed information on how TensorFlow with TensorRT (TF-TRT) is used to optimize TensorFlow graphs using TensorRT. The TF-TRT component provides the command line interface (CLI) for converting a model into a TensorRT model.

The inference work is accelerated using an NVIDIA GPU. The NVIDIA user’s guide for Accelerating Inference in TF-TRT explains how TensorRT accelerates inference operations on an NVIDIA GPU.

The TensorFlow Serving server is used to serve a model for inference work in a production environment. The TensorFlow Serving repository notes explain how TensorFlow Serving relates to inference usage of trained models.

Introduction of the examples

This tutorial shows two example cases for using TensorRT with TensorFlow Serving.

  • Example one shows the steps for converting and preparing a TensorRT model to be served by TensorFlow Serving.
  • The second example shows the steps for running the TensorFlow Serving server in one container and the application client in a second container.

Preparing the required environment for the examples

For this introductory example, the steps can all be performed on the host system. This example requires the following environment:

  • The WML CE conda environment.
  • The tensorflow-gpu conda packages – Follow the steps in the Knowledge Center for Installing TensorFlow GPU packages. Installing the tensorflow-gpu conda package will install the required tensorrt conda package. This example can be run in the base (default) Conda environment, or in a separate Conda environment.
  • The tensorflow-serving conda package – The tensorflow-serving-api conda package provides the client APIs. Follow the steps in the Knowledge Center for Installing TensorFlow Serving packages. Install the tensorflow-serving and tensorflow-serving-api packages into the same conda environment the example will be run in.
  • Only a single conda environment is required to run the first example.
  • Example code – All of the code for the examples is available on the IBM/powerai GitHub repository.

Download the code:

cd $HOME
git clone
cd powerai/examples/tfs-tftrt-example

(The code can be reviewed at the internal repository

Converting and preparing a TensorRT model

Follow these steps to set up the directories and get the pre-trained model that will be used for this introductory example:

mkdir -p $HOME/saved-models
cd $HOME/saved-models
tar --no-same-owner -xzvf resnet_v1_fp32_savedmodel_NCHW.tar.gz

The tar command will extract the pre-trained TensorFlow model files for the conversion steps.

TensorFlow Serving serves models in the saved model format. The following steps convert the saved TensorFlow graph into a TensorRT model that can be served with the TensorFlow Serving server:

cd $HOME
mkdir -p $HOME/inference-models/resnet_v1_50_fp32
saved_model_cli convert --dir $HOME/saved-models/resnet_v1_fp32_savedmodel_NCHW/1538686577 --output_dir $HOME/inference-models/resnet_v1_50_fp32/0000000001 --tag_set serve tensorrt --precision_mode FP32 --max_batch_size 1 --is_dynamic_op False


  • The saved_model_cli CLI was installed when the tensorflow-gpu conda packages were installed.
  • The saved_model_cli command will be in the $PATH environment.
  • The convert sub-command runs the conversion steps.
  • The tensorrt keyword parameter indicates that the conversion should be performed for a TensorRT model.
  • The --tag_set serve parameter is required for the conversion of the saved model.
  • The --precision_mode FP32 parameter selects the FP32 precision mode. See the NVIDIA User’s Guide for information on using precision modes other than FP32.
  • Run saved_model_cli convert -h for additional information on the saved_model_cli parameters.
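
When the conversion has to be repeated for several models, the command can be assembled from a script. The following is a minimal sketch and is not part of the example repository: the helper name build_convert_command is hypothetical, and it reuses only the flags shown in the command above. It builds the argument list without running it.

```python
import os


def build_convert_command(saved_model_dir, output_dir,
                          precision_mode="FP32", max_batch_size=1):
    """Return the saved_model_cli argument list used in this tutorial.

    Hypothetical helper: only the flags shown in the tutorial are used.
    """
    return [
        "saved_model_cli", "convert",
        "--dir", saved_model_dir,
        "--output_dir", output_dir,
        "--tag_set", "serve",
        "tensorrt",
        "--precision_mode", precision_mode,
        "--max_batch_size", str(max_batch_size),
        "--is_dynamic_op", "False",
    ]


home = os.path.expanduser("~")
command = build_convert_command(
    os.path.join(home, "saved-models",
                 "resnet_v1_fp32_savedmodel_NCHW", "1538686577"),
    os.path.join(home, "inference-models", "resnet_v1_50_fp32", "0000000001"),
)
print(" ".join(command))
# To run the conversion on a system with saved_model_cli installed:
#   import subprocess; subprocess.run(command, check=True)
```

Keeping the arguments in a list avoids shell quoting problems if a model path contains spaces.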

Model conversion optimizations

Check the model conversion messages to verify the model optimizations that were applied by TensorRT during the model conversion:

2019-06-10 13:36:50.051400: I tensorflow/core/grappler/optimizers/] Optimization results for grappler item: tf_graph
2019-06-10 13:36:50.051518: I tensorflow/core/grappler/optimizers/]   constant folding: Graph size after: 476 nodes (-267), 490 edges (-267), time = 325.207ms.
2019-06-10 13:36:50.051540: I tensorflow/core/grappler/optimizers/]   layout: Graph size after: 476 nodes (0), 490 edges (0), time = 52.528ms.
2019-06-10 13:36:50.051562: I tensorflow/core/grappler/optimizers/]   constant folding: Graph size after: 476 nodes (0), 490 edges (0), time = 125.532ms.
2019-06-10 13:36:50.051580: I tensorflow/core/grappler/optimizers/]   TensorRTOptimizer: Graph size after: 12 nodes (-464), 10 edges (-480), time = 385.028ms.

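The negative numbers in parentheses show how many nodes and edges each optimization pass removed. In this run, the TensorRTOptimizer pass reduced the graph from 476 nodes to 12 by combining nodes for the TensorRT converted graph.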
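
To check the conversion messages from a script, the summary lines can be parsed with a regular expression. This is a sketch under the assumption that the messages keep the layout shown above; the pattern and the parse_pass helper are illustrative and are not part of TensorFlow.

```python
import re

# Matches the per-optimizer summary lines printed during the conversion.
PASS_LINE = re.compile(
    r"\]\s+(?P<name>[A-Za-z ]+): Graph size after: "
    r"(?P<nodes>\d+) nodes \((?P<node_delta>-?\d+)\), "
    r"(?P<edges>\d+) edges \((?P<edge_delta>-?\d+)\)"
)


def parse_pass(line):
    """Return (optimizer, nodes_after, node_delta), or None if no match."""
    match = PASS_LINE.search(line)
    if match is None:
        return None
    return (match.group("name").strip(),
            int(match.group("nodes")),
            int(match.group("node_delta")))


sample = ("2019-06-10 13:36:50.051580: I tensorflow/core/grappler/optimizers/]   "
          "TensorRTOptimizer: Graph size after: 12 nodes (-464), "
          "10 edges (-480), time = 385.028ms.")
print(parse_pass(sample))
```

A TensorRTOptimizer line that reports no removed nodes would suggest that no part of the graph was converted.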

The examples use a pre-trained model for simplicity. In a customer environment, users would train their own model before serving it and running inference against it.

Why is this model being used in the example? The Resnet V1 50 model is used because it is one of the models that NVIDIA has verified for TensorRT (see the NVIDIA Verified Models list).

TensorFlow Serving serves a saved model, not a TensorFlow frozen graph. The output of the saved_model_cli convert command is a saved model. The TensorRT converted model will also be used in the second example.

Serving the converted model

The converted model is ready to be served. Use the tensorflow_model_server command to start the server with the model:

tensorflow_model_server --model_base_path=$HOME/inference-models/resnet_v1_50_fp32 --model_name=resnet_v1_50_fp32 # PredictionService started on the default port 8500


  • The tensorflow_model_server command was installed when the tensorflow-serving Conda package was installed.
  • The PredictionService gRPC endpoint listens on port 8500 by default.
  • This example uses only gRPC calls; the REST API is not shown.
  • Specify the GPU to use by changing the CUDA_VISIBLE_DEVICES environment variable.
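
Before starting the client, it can help to confirm that the server is listening on the gRPC port. The following sketch uses only the Python standard library; the wait_for_port helper is hypothetical. A successful TCP connection shows only that the port is open, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, else False."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # 8500 is the default gRPC port of tensorflow_model_server.
    ready = wait_for_port("localhost", 8500, timeout=2.0)
    print("model server reachable:", ready)
```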

Running inference from a client

From a second command line, run the Python client code to run the inference against the converted TensorRT model running on the TensorFlow Serving server:

cd $HOME/powerai/examples/tfs-tftrt-example

The prediction results were:

predictions:  [('n02109525', 'Saint_Bernard', 0.33126747608184814), ('n02093256', 'Staffordshire_bullterrier', 0.20472775399684906), ('n02088094', 'Afghan_hound', 0.14217598736286163)]

The predictions are a bit off, but within tolerance for this pre-trained model. In this introductory example, the image was of a yellow Labrador Retriever. Additional training of the model with more dog images may improve the accuracy.

Explanation of the client code

In this introductory example, tf.keras APIs are used to transform the input image into a format that is compatible with the requirements of the pre-trained model.

The Keras pre-processing function load_img loads the jpg image as a Python PIL image and resizes it to 224 x 224 pixels.

image = load_img(filename, target_size=(224, 224))

Some models accept a jpg image for the input, but this pre-trained model requires a scaled NumPy array of the image. The Keras pre-processing function img_to_array converts the image in PIL format into a NumPy array for input into the pre-trained model.

image = img_to_array(image)

The Keras pre-processing function preprocess_input normalizes the pixels in the NumPy array to between -1 and +1.

image = preprocess_input(image)

The data has now been transformed into the HWC format (Height-Width-Channels), but one additional transformation must be performed on the data to put it into the NHWC format (Batch size-Height-Width-Channels) required for input into the model.

image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))

For this introductory example, the batch size is 1, with a single input image.
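
The effect of the reshape can be checked without loading an image. The sketch below assumes NumPy is installed and uses an array of zeros as a stand-in for the output of img_to_array; it also shows numpy.expand_dims as an equivalent way to add the batch dimension.

```python
import numpy as np

# Stand-in for the output of img_to_array: a 224 x 224 RGB image in HWC layout.
image = np.zeros((224, 224, 3), dtype=np.float32)

# Same reshape as in the client code: prepend a batch dimension of 1.
batch = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))

# np.expand_dims is an equivalent way to add the leading batch axis.
same_batch = np.expand_dims(image, axis=0)

print(image.shape, batch.shape, same_batch.shape)
```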

How do I know what inputs are required for the model? The saved_model_cli can be used to examine the model to determine the model input and output.

saved_model_cli show --dir $HOME/inference-models/resnet_v1_50_fp32/0000000001 --all

In the output, the shape shows that the maximum batch size for the model is 64 and that each input image has the shape [224, 224, 3], which matches the HWC input array. DT_FLOAT is the TensorFlow data type that matches the FP32 precision.

  The given SavedModel SignatureDef contains the following input(s):
    inputs['input'] tensor_info:
        dtype: DT_FLOAT
        shape: (64, 224, 224, 3)
        name: input_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['classes'] tensor_info:
        dtype: DT_INT64
        shape: (64)
        name: ArgMax:0
    outputs['probabilities'] tensor_info:
        dtype: DT_FLOAT
        shape: (64, 1001)
        name: softmax_tensor:0
  Method name is: tensorflow/serving/predict

Why were the tf.keras APIs used? The tf.keras APIs are used on the client side because they operate on NumPy arrays, which require only the Python runtime. Other TensorFlow APIs are not used because the model is evaluated on the server, not in the client. The TensorFlow runtime would require a TensorFlow Session() object, which would in turn be evaluated on a GPU, and the client system that sends the request does not have one.

Executing the gRPC request

With the data transformed into the NHWC format, the image is ready to be sent to the TensorFlow Serving server. This example uses gRPC, the remote procedure call framework from Google, to communicate with the server, so a gRPC request must be prepared first.

  tf.contrib.util.make_tensor_proto(image, dtype=tf.float32, shape=[1, image.shape[1], image.shape[2], image.shape[3]]))

result = stub.Predict(request, 30.0) # 30 secs timeout

The dtype=tf.float32 parameter on make_tensor_proto matches the FP32 precision of the TensorRT converted model. The shape matches the NHWC format of the image data.

The stub.Predict call is a synchronous request. To build an asynchronous request to the TensorFlow Serving server, use the stub.Predict.future API with add_done_callback.
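
The callback pattern can be tried without a running server. The sketch below is an illustration only: it uses the standard library's concurrent.futures in place of gRPC, and fake_predict is a hypothetical stand-in for the remote call. The future object returned by a gRPC call offers add_done_callback and result methods with the same names, so the client structure would be similar, but the gRPC call itself is not shown here.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_predict(request):
    """Hypothetical stand-in for the remote Predict call."""
    return {"request": request, "status": "ok"}


results = []
done = threading.Event()


def on_done(future):
    # Called when the request completes; result() re-raises any error.
    results.append(future.result())
    done.set()


with ThreadPoolExecutor(max_workers=1) as executor:
    future = executor.submit(fake_predict, "image-batch-0")
    future.add_done_callback(on_done)
    # The caller is free to do other work here while the request is in flight.

done.wait(timeout=5.0)
print(results)
```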

Interpreting the prediction results

The result from the prediction model is an array of tf.float32 numbers. The tf.keras decode_predictions API is used to return the top three predictions. The dimension of the output probabilities array matches the number of classes that were used to train the Resnet V1 50 model.

print("predictions: ", decode_predictions(probabilities, top=3)[0])

Because there was only a single image with a batch size of 1, only array row [0] is interpreted for the result predictions.

Summary of example one

Example one showed the steps for converting a pre-trained model into a TensorRT converted model, starting the TensorFlow Serving server to serve the converted model, and building and running an inference request from a client.

An application design for using TensorRT and TensorFlow Serving

The second example more closely resembles an application design, with the TensorFlow Serving server running in one container and the client running in a second, separate container. Two application containers are built for the second example, each from a different pre-built PowerAI Docker Hub image. The images are pulled from Official docker images for IBM PowerAI.

Configuring the serving container

The TensorFlow Serving Docker Hub image has the required TensorRT and TensorFlow Serving packages pre-installed. First build the serving application container:

cd $HOME/powerai/examples/tfs-tftrt-example
docker build --build-arg=SERVING_USER=$USER -t tfs-tftrt-server server

Show the built Docker image with docker images tfs-tftrt-server. The Python 3.6 Docker Hub images are used in the second example.

The TensorFlow Serving server runs in a container named $USER-tf-serving-gpu. Start the container with the image that was built:

docker run --interactive --detach --tty --rm --name $USER-tf-serving-gpu --user $USER --volume $HOME/inference-models/resnet_v1_50_fp32:/models/inference-models/resnet_v1_50_fp32:z --env CUDA_VISIBLE_DEVICES=1 --env MODEL_BASE_PATH=/models/inference-models --env MODEL_NAME=resnet_v1_50_fp32 tfs-tftrt-server


  • The TensorRT converted model that was converted during example one will be reused for example two. The steps for creating the TensorRT converted model are explained above.
  • The path to the TensorRT converted model on the host system is defined with the --volume parameter.
  • Inside the container, the TensorRT converted model is mounted at /models/inference-models/resnet_v1_50_fp32.

Use docker exec to inspect inside the container at any time.

docker exec -it --user $USER $USER-tf-serving-gpu /bin/bash # Only for inspecting inside the container

In the server container, the TensorFlow Serving server runs as a plain (non-root) user that was added during the Docker build. For example two, the server runs as the current user.

Specify the GPU to use by changing the CUDA_VISIBLE_DEVICES environment variable. The MODEL_BASE_PATH and MODEL_NAME environment variables define the path and the name of the model to be served by the TensorFlow Serving server.

The TensorFlow Serving server will show the startup messages when the server has been started and is ready to serve the model.

2019-06-10 19:55:11.299377: I tensorflow_serving/core/] Successfully loaded servable version {name: resnet_v1_50_fp32 version: 1}
2019-06-10 19:55:11.926276: I tensorflow_serving/model_servers/] Running gRPC ModelServer at ...

Use docker logs $USER-tf-serving-gpu to view the container logs.

Configuring the client container

A second Docker Hub image is required for running the client code in the second example. From a second command line, run the commands to build the client application container:

cd $HOME/powerai/examples/tfs-tftrt-example
docker build --build-arg=CLIENT_USER=$USER -t tfs-tftrt-client client

Show the built Docker image with docker images tfs-tftrt-client. The TensorFlow container image is required for the client because the tf.keras APIs are also used in the client in the second example.

The TensorFlow Serving client runs in a container named $USER-client. Start the container with the image that was built:

docker run --interactive --detach --tty --rm --name $USER-client tfs-tftrt-client

As with the TensorFlow Serving server container, the client commands run as a plain (non-root) user. For example two, the client also runs as the current user.

Both the $USER-tf-serving-gpu and the $USER-client containers are now running and active. Use docker ps to check the running containers at any time.

Connecting the client container to the server container

The TensorFlow Serving server is running as a microservice in the $USER-tf-serving-gpu container. The docker network commands connect the $USER-client container to it. For example two, a network named $USER-tf-serving-network is created and both containers are attached to it.

docker network create $USER-tf-serving-network
docker network connect $USER-tf-serving-network $USER-tf-serving-gpu --alias server
docker network connect $USER-tf-serving-network $USER-client --alias client

Use docker network inspect to review the Docker container network.

docker network inspect $USER-tf-serving-network

Running inference from the client container

For example two, the client commands are run from inside the $USER-client container. Use docker exec to get the /bin/bash command line in the client container:

docker exec -it --user $USER $USER-client /bin/bash # Run client commands interactively

Running a batch request from the client container

In example two, a batch request for predictions is sent from the $USER-client container to the TensorFlow Serving server running in the $USER-tf-serving-gpu container. Run the client request from the command line in the $USER-client container:

cd $HOME/example
python --data_dir testdata --max_test_images 20

The prediction results for the batch request were:

predictions:  [('n04285008', 'sports_car', 0.9523131847381592), ('n04037443', 'racer', 0.035011447966098785), ('n03100240', 'convertible', 0.010458654724061489)]
predictions:  [('n03594945', 'jeep', 0.2605750858783722), ('n03417042', 'garbage_truck', 0.19224104285240173), ('n03478589', 'half_track', 0.17961420118808746)]

There is some variance in the prediction results. The input images for example two are automobiles and other vehicles.

Explanation of the client code

How does the client locate the TensorFlow Serving server container? There are two pieces of information that link the client code to the TensorFlow Serving server container. First, note the --alias server parameter on the docker network connect $USER-tf-serving-network $USER-tf-serving-gpu --alias server command. Second, the client code specifies server and port 8500 on the gRPC request:

server = 'server:8500'
channel = grpc.insecure_channel(server)
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

The name server connects the client to the TensorFlow Serving server container with the alias server.

For the second example, the prediction request images are the actual images that the Resnet V1 50 model was pre-trained with.


r = requests.get(_IMAGENET_SYNSET)

The ImageNet repository provides a REST API for retrieving the list of training images. The client calls the REST API to return the list of images defined for the specified ID, n02958343. Check the list of ImageNet IDs for the classes of the images.

Batch request input

The batch request input of images is a NumPy array.

predict_images = numpy.array(predict_images)

For the second example, the images are first collected into a Python list object, which is then converted into a single NumPy array.
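
The following sketch shows what the conversion does to the shapes. It assumes NumPy is installed and uses constant arrays as stand-ins for three preprocessed images; the real client fills the list with images retrieved from ImageNet.

```python
import numpy as np

# Stand-ins for three preprocessed images, each in HWC layout.
predict_images = [np.full((224, 224, 3), value, dtype=np.float32)
                  for value in (0.0, 0.5, 1.0)]

# Converting the list stacks the images along a new leading batch axis,
# which gives the NHWC layout expected by the model.
batch = np.array(predict_images)

print(batch.shape)
print(batch.dtype)
```

All images in the list must have the same shape for the result to be a single four-dimensional array.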

TensorRT engine optimization messages

At run time, TensorRT builds a new engine that is optimized for the batch size of the incoming request.

2019-06-11 17:01:26.853602: I external/org_tensorflow/tensorflow/compiler/tf2tensorrt/kernels/] Building a new TensorRT engine for resnet_model/TRTEngineOp_0 input shapes: [[10,224,224,3]]

For this request, the shape shows the batch size was 10. Use docker logs $USER-tf-serving-gpu to view the server container logs.
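
The batch size can also be pulled out of the server log from a script. This sketch assumes the message layout shown above; the engine_batch_sizes helper and its pattern are illustrative and not part of TensorFlow Serving.

```python
import ast
import re

# Captures the nested list that follows "input shapes:" in the log message.
SHAPES = re.compile(r"input shapes: (\[\[.*\]\])")


def engine_batch_sizes(line):
    """Return the leading dimension of each input shape in a log line."""
    match = SHAPES.search(line)
    if match is None:
        return []
    shapes = ast.literal_eval(match.group(1))
    return [shape[0] for shape in shapes]


sample = ("2019-06-11 17:01:26.853602: I external/org_tensorflow/tensorflow/"
          "compiler/tf2tensorrt/kernels/] Building a new TensorRT engine for "
          "resnet_model/TRTEngineOp_0 input shapes: [[10,224,224,3]]")
print(engine_batch_sizes(sample))
```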

Interpreting the prediction results

For the ImageNet images sent to the server for the inference requests, the ImageNet IDs have to be used when interpreting the prediction results. The 1001 probabilities in the outputs['probabilities'] array returned from the inference request are interpreted to return the IDs with the highest probabilities.

def decode_predictions(predictions, top=3):

The pre-trained model returns 1001 classes in the style of the Inception model labels, with a background class at index 0. The ImageNet trained networks use zero-based label IDs for 1000 classes, so the index must be adjusted when looking up a label:

result_class = tuple(imagenet_class_index[str(prediction-1)]) + (predictions[prediction],)

Subtract one from the prediction index.
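
The index adjustment can be seen in a small self-contained sketch. This is not the client's decode_predictions function: decode_top and the three-entry class index are hypothetical, and skipping the background class at index 0 is an assumption made here so that the lookup never uses the key "-1".

```python
def decode_top(probabilities, class_index, top=3):
    """Return the top entries as (wnid, label, probability) tuples.

    probabilities: list with the background class at index 0.
    class_index: dict mapping zero-based string IDs to [wnid, label].
    """
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    decoded = []
    for prediction in ranked:
        if prediction == 0:
            continue  # assumption: the background class has no label
        wnid, label = class_index[str(prediction - 1)]
        decoded.append((wnid, label, probabilities[prediction]))
        if len(decoded) == top:
            break
    return decoded


# Hypothetical three-class index; the real index has 1000 entries.
toy_index = {"0": ["n00000001", "cat"],
             "1": ["n00000002", "dog"],
             "2": ["n00000003", "car"]}
toy_probabilities = [0.05, 0.10, 0.60, 0.25]  # index 0 is the background
print(decode_top(toy_probabilities, toy_index, top=2))
```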

With the batch request, there are up to 64 arrays of 1001 probabilities returned from the inference request; one row for each image that is evaluated:

num_batch_probabilities = result.outputs['probabilities'].tensor_shape.dim[0].size
num_probabilities = result.outputs['probabilities'].tensor_shape.dim[1].size
probabilities_shape = (num_batch_probabilities, num_probabilities)

The response, printed in protobuf text format, begins as follows:

outputs {
  key: "probabilities"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 10
      }
      dim {
        size: 1001
      }

The above was returned for a batch of 10 images.

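The prediction probabilities are returned in the float_val field of the response as a flat list of values, not as a TensorFlow tensor. Converting the list to a NumPy array and reshaping it allows the probabilities to be examined one row per image.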

probabilities = numpy.array(result.outputs['probabilities'].float_val)
probabilities = numpy.reshape(probabilities, probabilities_shape)
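
To see what the reshape does, the same row-major split can be written in plain Python. This is a sketch with a hypothetical response for 2 images and 3 classes; the real response holds 1001 values per image, and reshape_rows is an illustrative helper rather than part of the client.

```python
def reshape_rows(flat_values, num_rows, num_columns):
    """Split a flat, row-major list into num_rows lists of num_columns."""
    if len(flat_values) != num_rows * num_columns:
        raise ValueError("value count does not match the requested shape")
    return [flat_values[row * num_columns:(row + 1) * num_columns]
            for row in range(num_rows)]


# Hypothetical flat response for 2 images and 3 classes.
float_val = [0.1, 0.7, 0.2, 0.6, 0.3, 0.1]
rows = reshape_rows(float_val, 2, 3)

# Index of the most probable class for each image.
best = [row.index(max(row)) for row in rows]
print(rows)
print(best)
```

Each row holds the probabilities for one image, in the order in which the images were sent.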

Summary of example two

Example two showed an application example with the TensorFlow Serving server running in a Docker container as a microservice. The client code for example two showed how a batch request for multiple images can be sent to the model running in the TensorFlow Serving server, and how to interpret the batched prediction results returned from the server.

Removing the example containers

Because the containers were started with the --rm option, stopping them also deletes them. After the containers are deleted, the Docker network can be removed.

docker stop $USER-tf-serving-gpu
docker stop $USER-client
docker network rm $USER-tf-serving-network

Next steps

This tutorial showed how to use the saved_model_cli command to optimize a TensorFlow saved model for inference operations. Users can run the same command on their own saved models to convert and optimize the TensorFlow graph. The tutorial also showed how the TensorFlow Serving server can be configured to run as a microservice in a Docker container. Users can reuse the same configuration on their own systems.