Ready to deploy your trained model in a production environment? TensorFlow Serving hosts your trained model so that client-side applications can send inference requests to the TensorFlow Serving server and get back predictions. TensorFlow Serving is now available as both a conda package and a Docker container in WML CE 1.6.1.

This blog will show you how to generate a saved model that can be served, how to install and run the TensorFlow Serving model server, and how clients can send inference requests to the server.

Prerequisites And Environment Setup

  • Install Anaconda and add the WML CE Conda channel:
$ conda config --prepend channels \
               https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
  • Install the NVIDIA GPU driver (optional)
  • Install Docker and nvidia-docker (optional)

Check the WML CE 1.6.1 Knowledge Center for details on installing the above.

Download the code used for this exercise:

$ git clone https://github.com/IBM/powerai/
$ cd powerai/examples/tensorflow_serving/

Generating a Saved Model to Serve

For this walk-through we will use the TensorFlow example Text classification with movie reviews to predict if a review is positive or negative.

To do training we need a conda environment with TensorFlow 1.14 or the TensorFlow 2.0 beta. This example uses TensorFlow 1.14 built with GPU support.

$ conda create -n tensorflow_env python=3.6 tensorflow-gpu
$ conda activate tensorflow_env

The program that does the training (movie_reviews_training.py) is basically the same as the Jupyter notebook, but with two
changes that make it easier to send inference requests to TensorFlow Serving:

1) We provide names to the input and output layers. These names need to be used with the gRPC client when sending inference requests.

  model.add(keras.layers.Embedding(vocab_size, 16, name="input_layer"))
  model.add(keras.layers.GlobalAveragePooling1D())
  model.add(keras.layers.Dense(16, activation='relu'))
  model.add(keras.layers.Dense(1, activation='sigmoid', name="output_layer"))

2) We save the trained model to disk at the end so that it can be served. Note that each version of the model is saved to a new directory.

  save_directory = os.path.join(args.model_base_path,str(args.model_version))
  tf.saved_model.save(model, save_directory)

Run movie_reviews_training.py to train our first model to serve:

$ python movie_reviews_training.py --training_iteration 40 --model_version 1 \
                                 --model_base_path /tmp/movie_reviews

When finished, it shows the final loss and accuracy based on the training data:

Final Results
loss = 0.311046
acc = 0.874880

The saved model will be on the file system at /tmp/movie_reviews/1.

Now we are ready to serve the trained model with TensorFlow Serving.

Installing and Running TensorFlow Serving

The TensorFlow Serving binary is called tensorflow_model_server. There are a few options available for running your model: It can be run as a Docker container or as a standalone process; each option can be used with or without GPUs.

Running TensorFlow Serving using the ibmcom/powerai Docker container

The easiest way to run TensorFlow Serving is in its own Docker container. Containers are available for TensorFlow Serving built with GPU support (1.6.1-tensorflow-serving-ubuntu18.04) and with CPU-only support (1.6.1-tensorflow-serving-cpu-ubuntu18.04). The nvidia-docker container runtime is needed to start a container with GPUs.

For this example we will use the container built with GPU support. These commands can be run in another terminal session to simulate your serving environment.

$ docker pull ibmcom/powerai:1.6.1-tensorflow-serving-ubuntu18.04
$ docker run -t --rm -p 8500:8500 -p 8501:8501 \
    --user 2051:2051 \
    -v "/tmp/movie_reviews:/models/movie_reviews:z" \
    -e MODEL_NAME=movie_reviews \
    ibmcom/powerai:1.6.1-tensorflow-serving-ubuntu18.04

A few notes on the parameters used:

  • -p 8500:8500 -p 8501:8501 exposes ports 8500 and 8501 in the container as ports 8500 and 8501 on the host. This allows us to send requests from the host or another container to the TensorFlow model server.
  • --user 2051:2051 runs the container as the pwrai user ID instead of root.
  • -v "/tmp/movie_reviews:/models/movie_reviews:z" mounts the volume ‘/tmp/movie_reviews’ on the host to ‘/models/movie_reviews’ in the container. The ‘:z’ suffix is needed only with SELinux.
  • -e MODEL_NAME=movie_reviews sets the model to serve. TensorFlow ModelServer looks for the model under /models and expects client requests to reference this model name.

Also note that -e CUDA_VISIBLE_DEVICES=X, where X is a GPU ID, can be used to limit the TensorFlow model server to a single GPU.

Running the TensorFlow Serving binary

The other option for running the TensorFlow model server is to install the conda package and run the binary directly. The TensorFlow model server does not require TensorFlow to be installed.

The package specification tensorflow-serving=*=gpu* designates that we want the GPU variant of the latest tensorflow-serving version.

$ conda create -n tf_serving_env tensorflow-serving=*=gpu*
$ conda activate tf_serving_env
$ tensorflow_model_server --port=8500 --rest_api_port=8501 \
                        --model_base_path=/tmp/movie_reviews \
                        --model_name=movie_reviews

A few notes on the parameters used:

  • --port=8500 is optional; if not specified, port 8500 is used for the gRPC interface by default.
  • --rest_api_port=8501 is optional; however, if it is not specified, the REST API interface cannot be used.
  • --model_base_path=/tmp/movie_reviews is required and is the path to the model to be served.
  • --model_name=movie_reviews is required and is the model name that will be used by client requests for this model.

Also note that the environment variable CUDA_VISIBLE_DEVICES=X, where X is a GPU ID, can be used to limit the TensorFlow model server to a single GPU.
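
Whichever way you run it, you can verify that the model server is up and the model has loaded before sending any inference requests. Here is a minimal check, assuming the REST port 8501 shown above, using TensorFlow Serving's model status REST endpoint:

  import requests

  # Query the model status endpoint for the movie_reviews model. An HTTP 200 response
  # with a model_version_status entry in state "AVAILABLE" means the server is ready.
  status = requests.get("http://localhost:8501/v1/models/movie_reviews")
  status.raise_for_status()
  print(status.json())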

Run an Inference Request Against the Server

Inference requests can be made using either the REST API or the gRPC API, with clients written in any language. For this example we will use Python to submit the client-side request. Each example takes the text of a movie review and outputs a prediction of whether the review is positive or negative.

First we will look at the common code before the inference request is made.

Machine learning works with linear algebra and numbers, not words, so we need to convert the review text into numbers, using the same word-to-integer mapping that was used to train the model. To do this we get the word index from TensorFlow’s keras.datasets.imdb.get_word_index() method and adjust it for the reserved tokens used during training. After converting the review text to lower case, removing the non-alphabetical characters, and splitting the review into a list of words, we convert each word into its integer representation. Finally, since our model requires all movie reviews to be the same length, we pad the review with the <PAD> value (0) to match the length of the training data. At this point we are ready to make the client request.

  # Get the word index used during training and apply the same offsets as the training
  # notebook: indexes 0-3 are reserved for <PAD>, <START>, <UNK>, and <UNUSED>
  word_index = keras.datasets.imdb.get_word_index()
  word_index = {word: (index + 3) for word, index in word_index.items()}
  word_index["<PAD>"] = 0
  word_index["<START>"] = 1
  word_index["<UNK>"] = 2
  word_index["<UNUSED>"] = 3

  # Put the review in lower case, remove anything that isn't a letter, split into a list of words
  review=review.lower()
  review=re.sub(r"[^a-z]", " ", review)
  review=review.split()

  # Convert the review into a list of integers corresponding to the words. Use 2 (<UNK>)
  # for unknown words. The model was only trained with word indexes below vocab_size, so
  # any index outside that range is also mapped to <UNK> to avoid an out-of-range error.
  coded_review = []
  for i in review:
    int_value = word_index.get(i, word_index["<UNK>"])
    if int_value >= vocab_size:
      int_value = word_index["<UNK>"]
    coded_review.append(int_value)

  # Pad the length of the review to match the length of the tensors the model was trained with
  coded_review = keras.preprocessing.sequence.pad_sequences([coded_review], value=word_index["<PAD>"],
                                                            padding='post', maxlen=max_length)

REST API

To run inference with the REST API we will use the json module to format the data and the requests module to submit an HTTP request to the server:

  import json
  import requests

As the return value from keras.preprocessing.sequence.pad_sequences is a numpy array, we use the tolist() method to convert it into a Python list that can be passed to json.dumps. This gives us a formatted JSON request body. Note: TensorFlow Serving is designed to accept inference requests in batches. In this example, we pass a list containing a single review to the server.

  # Create the rest inference request
  request_body = '{"instances" :  %s }' % json.dumps(coded_review.tolist())
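
For illustration only, a batch containing one short review (hypothetical word IDs, padded with zeros) produces a body in TensorFlow Serving's batched "instances" format:

  import json

  # Illustrative only: a batch of one short, padded review with made-up word IDs
  example_batch = [[12, 14, 33, 6, 2, 0, 0, 0]]
  request_body = '{"instances" :  %s }' % json.dumps(example_batch)
  # request_body == '{"instances" :  [[12, 14, 33, 6, 2, 0, 0, 0]] }'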

Then we construct the URL to the server using the server host name and port (localhost:8501 for example) and the model name. We post the request to the server and wait for a response.

  # Submit the request to the server
  SERVER_URL = "http://%s/v1/models/%s:predict" % (args.server, model_name)
  response = requests.post(SERVER_URL, data=request_body)
  response.raise_for_status()

To get our prediction, we query the JSON response for predictions. Because the results come back as a batch of predictions, we take the first element of that list; and because a model can have multiple output values, we take the first element of the inner list as our prediction. In our trained model, positive movie reviews are labeled 1 and negative movie reviews are labeled 0, so the closer the prediction is to 1, the more likely the review is positive. We will consider anything greater than or equal to 0.50 to be a positive review.

  # Display the results
  prediction = response.json()['predictions'][0][0]
  print("Confidence level the review is a positive one: %f" % prediction)
  if prediction >= 0.50:
    print("Review is considered positive!")
  else:
    print("Review is considered negative!")
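
For reference, the decoded JSON response for a batch of one review is a nested list, which is why two indexes are needed (the value below is illustrative):

  # Illustrative only: one review in the batch, one output value per review
  example_response = {"predictions": [[0.197184]]}
  prediction = example_response["predictions"][0][0]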

To run the example, make up your own movie reviews or find movie reviews online. Because we use the TensorFlow API to retrieve the IMDB dataset, this needs to be run in a conda environment with TensorFlow installed. Also ensure that the requests module is installed.

$ conda activate tensorflow_env
$ conda install requests
$ python rest_client.py --server localhost:8501 --review "I liked the original enough \
to purchase it for my home collection and was really looking forward to the sequel. \
Whatever magic the original had was lost in the sequel. Maybe the word original is \
the problem. The first movie was unique in every way. The sequel did nothing to make \
it stand out from the first movie."

The review above is for a movie that left me disappointed. When the inference request was sent to the TensorFlow model server it predicted a 19.7% chance the review was positive.

Confidence level the review is a positive one: 0.197184
Review is considered negative!

gRPC API

The gRPC API is the alternative to the REST API for making client-side requests. Before we run inference with the gRPC API, let’s look at our saved model.

Run the saved_model_cli show command to inspect the saved model:

$ saved_model_cli show --dir /tmp/movie_reviews/1 --signature_def serving_default --tag_set serve
The given SavedModel SignatureDef contains the following input(s):
  inputs['input_layer_input'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, -1)
      name: serving_default_input_layer_input:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['output_layer'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall:0
Method name is: tensorflow/serving/predict

Note in the output above:

  1. The signature definition is ‘serving_default’.
  2. Our input layer is called ‘input_layer_input’. (TensorFlow appends _input to input layers, which is why it is different from the name we specified in the model.)
  3. Our output layer is called ‘output_layer’.
  4. The data type expected for both input and output is a float.

All four of these values are needed for the gRPC client. With that information, let’s take a look at the code. To run inference with the gRPC API, we will use the TensorFlow Serving APIs (from the tensorflow-serving-api conda package) to format the request and the grpc module to create the channel to the server.

  from tensorflow_serving.apis import predict_pb2
  from tensorflow_serving.apis import prediction_service_pb2_grpc
  import grpc

We use the grpc module to open a channel to our server host name and port (localhost:8500 for example), then use the TensorFlow serving APIs to create a prediction request with the model name and the model signature, found in the output of saved_model_cli above.

  # Create the grpc inference request
  channel = grpc.insecure_channel(args.server)
  stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
  request = predict_pb2.PredictRequest()
  request.model_spec.name = model_name
  request.model_spec.signature_name = 'serving_default'

Next, we provide the input for the request. We use ‘input_layer_input’ as the input name, which comes from the saved_model_cli output above and is based on the name we gave the layer in the trained model, and we use the data type tf.float32, also from the saved_model_cli output.

Note: In the future, the method ‘tf.compat.v1.make_tensor_proto’ can be replaced with tf.tensor_util.make_tensor_proto(…), but that change was not included in TensorFlow 1.14 or the 2.0 beta.

  # Convert the coded_review into a tensor
  request.inputs['input_layer_input'].CopyFrom(tf.compat.v1.make_tensor_proto(coded_review,dtype=tf.float32))

With our channel to the server already created, submitting the request is straightforward:

  # Submit the request to the server
  result = stub.Predict(request, 10.0) # 10 secs timeout

The results are stored in the layer we named ‘output_layer’ when we created our model. Since our prediction request is a batch of one review, we need to get the value of the first element of the list:

  # Display the results
  prediction = result.outputs['output_layer'].float_val[0]
  print("Confidence level the review is a positive one: %f" % prediction)
  if prediction >= 0.50:
    print("Review is considered positive!")
  else:
    print("Review is considered negative!")

To run the example, as with the REST API client, we use the TensorFlow API to retrieve the IMDB dataset, so this needs to be run in a conda environment with TensorFlow installed. We also need to install the tensorflow-serving-api conda package.

$ conda activate tensorflow_env
$ conda install tensorflow-serving-api
$ python grpc_client.py --server localhost:8500 --review "I liked the original enough \
to purchase it for my home collection and was really looking forward to the sequel. \
Whatever magic the original had was lost in the sequel. Maybe the word original is \
the problem. The first movie was unique in every way. The sequel did nothing to make \
it stand out from the first movie."

Note: the prediction result is the same as when the REST API was used. The type of client used makes no difference in the prediction results.

Confidence level the review is a positive one: 0.197184
Review is considered negative!

Version Handling

One of the cool features of TensorFlow Serving is that we can deploy a new version of a trained model without having to restart the TensorFlow model server. This means we can put an improved model into production without an outage.

To demonstrate this, we are going to regress our model. The example program batch_rest_client_example.py takes 100 records from the IMDB data set’s test data and sends a batch inference request to get back a prediction for each sample. Each prediction is then compared to the expected result from the test data. The model was trained only on the training data, so it has never seen any of the test data.

When we run the program, we provide a seed to numpy’s random generator. As long as we provide the same seed, we will get the same 100 records. This ensures that any accuracy difference between runs is because the model changed and not because the test data changed.
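
The full program is in the repository; a minimal sketch of its core logic (assuming the same vocabulary size and padded length used for training, with the seed hardcoded for brevity; the actual batch_rest_client_example.py may differ in details) looks like this:

  import json

  import numpy as np
  import requests
  from tensorflow import keras

  vocab_size = 10000   # must match the vocabulary size used for training
  max_length = 256     # must match the padded review length used for training

  # Load the IMDB test split, already encoded as word indexes.
  (_, _), (test_data, test_labels) = keras.datasets.imdb.load_data(num_words=vocab_size)

  # A fixed seed makes the 100-sample selection repeatable between runs.
  np.random.seed(618)
  picks = np.random.choice(len(test_data), 100, replace=False)
  reviews = [test_data[i] for i in picks]
  labels = [test_labels[i] for i in picks]

  # Pad every review with the <PAD> index (0) to the length the model was trained with.
  padded = keras.preprocessing.sequence.pad_sequences(reviews, value=0,
                                                      padding='post', maxlen=max_length)

  # Send all 100 reviews to the model server in a single batch request.
  request_body = '{"instances" :  %s }' % json.dumps(padded.tolist())
  response = requests.post("http://localhost:8501/v1/models/movie_reviews:predict",
                           data=request_body)
  response.raise_for_status()

  # A prediction of 0.50 or higher counts as positive; compare against the test labels.
  predictions = response.json()['predictions']
  correct = sum(int((p[0] >= 0.50) == bool(label)) for p, label in zip(predictions, labels))
  print("%d out of 100 correct" % correct)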

$ python batch_rest_client_example.py --server localhost:8501 --seed 618
86 out of 100 correct

Since we are specifying a seed, no matter how many inference requests we make, the results will always be the same. 86% correct is about as good as this model can get with the amount of data it has been trained with. This was trained with 40 iterations. What if we only trained our model for 5 iterations? What would the results look like then?

$ python movie_reviews_training.py --training_iteration 5 --model_version 2 \
                                 --model_base_path /tmp/movie_reviews

Here we trained version 2 of our model and now, without restarting the TensorFlow Serving binary or Docker container, we can see that the change has been picked up:

$ python batch_rest_client_example.py --server localhost:8501 --seed 618
78 out of 100 correct

With the same set of test data, we have gone from 86% accurate to 78% accurate without restarting the TensorFlow model server. We can fix that by training a version 3 with more iterations, and maybe further iterate on the model by making improvements in the code.
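
To confirm which version is now being served, you can query the same model status endpoint used earlier; with the default version policy (serve the latest version), version 2 should be reported as AVAILABLE. A quick check, assuming the REST port 8501:

  import requests

  # List the versions TensorFlow Serving currently reports for the movie_reviews model.
  status = requests.get("http://localhost:8501/v1/models/movie_reviews").json()
  for version in status["model_version_status"]:
      print(version["version"], version["state"])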

Conclusion

TensorFlow Serving is now available as part of WML CE 1.6.1 and is an easy way to host a trained model for inference requests. For more information about TensorFlow Serving see the documentation at tensorflow.org. For examples that reference containers, replace ‘tensorflow/serving’ with the WML CE container ‘ibmcom/powerai:1.6.1-tensorflow-serving-ubuntu18.04’.

Now let’s get those models into production!
