Vision TensorRT inference samples

As part of the IBM® Maximo Visual Inspection 1.2.0 (formerly PowerAI Vision) labeling, training, and inference workflow, you can export models that can be deployed on edge devices (such as FRCNN and SSD object detection models that support NVIDIA TensorRT conversion). To help you start inferencing on edge devices as quickly as possible, we created a repository of samples that illustrate how to use Maximo Visual Inspection models with edge devices.

This repository contains samples that perform object detection, either from image files or a camera, and then output object classes and bounding boxes. The output can optionally be logged as text, stored as image files with bounding box overlays, or displayed in a modeless window. This capability is supported on NVIDIA Jetson devices, as well as other systems such as IBM Power® servers with GPUs enabled.

In this article, we describe the samples and explain how to use them as templates for an embedded inference workflow based on local input and output. We also demonstrate how the samples can be modified to customize batch size, native model resolution, and floating point precision, and to perform additional actions with the inference results, such as sending them to a shared folder or a remote location. This article is intended for developers familiar with C/C++ and Python development; for simplicity, it assumes access to an NVIDIA Jetson TX2 device, but the examples can be applied to other NVIDIA Jetson-class devices.

The samples are based on NVIDIA's TensorRT C/C++ samples, which are described at https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html and stored locally at /usr/src/tensorrt/samples, and they have been modified for ease of use. The Python samples are functionally equivalent.

Note that the source code might change in the future without explicit prior notice.


Common use

The common usage pattern is to either process image files locally or to use a camera [onboard, USB, or Real-Time Streaming Protocol (RTSP) based] in an inference loop, and to optionally display, store, or transmit the results to a remote location. Typical uses are displaying the results in a window for demonstration purposes or modifying the TensorRT parameters to evaluate their impact on accuracy and performance. Both the graphical display and the debugging code need to be commented out for best performance.

The described workflow consists of the following steps:

  1. Starting with a trained SSD/FRCNN model in Maximo Visual Inspection, export the model.
  2. Extract the TensorRT model files from the exported .zip file and the embedded .gz file, typically *_trt.prototxt and *.caffemodel, and copy them to a location on the Jetson file system such as /home/nvidia/Downloads.
  3. Modify the sample’s source code specifically for a given model, such as file folders, resolution, batch size, precision, and so on.
  4. Build a sample.
  5. Deploy the sample for single or batch image file input from a folder, or deploy it for inference from a camera by passing the argument value Camera instead of the file name or names.

The input can be image files of any type and resolution, passed in as command line arguments. Alternatively, some cameras can be used by adjusting a gstreamer string for the onboard, USB, or streaming camera. For more information, read http://developer2.download.nvidia.com/embedded/L4T/r28_Release_v1.0/Docs/Jetson_TX2_Accelerated_GStreamer_User_Guide.pdf.
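
The camera source is typically opened through OpenCV with a GStreamer pipeline string. The following minimal Python sketch shows how such strings might look for the three camera types; the exact pipeline elements, device paths, and stream addresses depend on your camera and L4T release, so treat these strings as illustrative rather than definitive.

import cv2

# Illustrative GStreamer pipeline strings; adjust to match your camera and L4T release.
ONBOARD_CAM = ("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! "
               "videoconvert ! video/x-raw, format=BGR ! appsink")
USB_CAM = "v4l2src device=/dev/video1 ! videoconvert ! video/x-raw, format=BGR ! appsink"
RTSP_CAM = ("rtspsrc location=rtsp://<camera-address>/stream ! rtph264depay ! h264parse ! "
            "omxh264dec ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! "
            "video/x-raw, format=BGR ! appsink")

# Requires OpenCV built with GStreamer support (the JetPack build includes it).
cap = cv2.VideoCapture(ONBOARD_CAM, cv2.CAP_GSTREAMER)
ok, frame = cap.read()   # frame is a BGR numpy array ready for preprocessing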

In file mode, multiple images can be processed simultaneously, as long as the number of images matches the value specified for batch size.

Note: If a camera is used from within containers, the Docker run options need to enable access to the underlying camera hardware. The required syntax and options are beyond the scope of this article.

The output is a list of classes and bounding boxes (bboxes) per image, debug messages, and images with the bounding boxes marked. The code outputs the classes as debug text, and the bounding boxes as image file overlays or in a modeless window.
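
As a rough illustration of this output stage (not the samples' exact code), the overlays and the debug logging can be produced with a few OpenCV calls once the inference results are available:

import cv2

def draw_detections(image, detections, class_names):
    # detections: list of (class_id, confidence, (x1, y1, x2, y2)) tuples in pixel coordinates
    for cls_id, conf, (x1, y1, x2, y2) in detections:
        label = "{} {:.2f}".format(class_names[cls_id], conf)
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(image, label, (x1, max(y1 - 5, 0)),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
        print("Detected:", label, (x1, y1, x2, y2))   # debug text output
    return image

# Store the overlay as an image file, or display it in a modeless window:
# cv2.imwrite("result_overlay.jpg", image)
# cv2.imshow("detections", image); cv2.waitKey(1)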

The model must meet these requirements:

  • Any custom Maximo Visual Inspection trained SSD or FRCNN model (YOLO and GoogLeNet are currently not supported)
  • The <model_name>_trt.prototxt and <model_name>.caffemodel files must be present, with the appropriate names specified in the source code.
  • Batch size can be any custom value as long as it fits in device memory, and it is adjustable in the source code. For camera input, this value should be 1 to reduce latency. A bigger batch size increases throughput (bandwidth) at the expense of increased latency.

Customizations

You can customize these samples in the following ways:

  • Floating point precision can be changed. This value affects the accuracy, speed, and memory footprint. It is adjustable in the source code.
  • The model resolution can be changed. However, it needs to match the (average) aspect ratio of the images. Within the same aspect ratio, a bigger resolution produces better accuracy at the expense of latency and bandwidth, and uses more GPU memory. Native FRCNN resolution (width x height) is 1000×600 and SSD resolution is 512×512.
  • The number and names of classes can be changed based on the model. This is also adjustable in the source code. The class names could be read from a label file, but due to variations in file syntax, this function is currently not included. The number of classes is always one more than in the label file because there is one background class.
  • The confidence level can be adjusted. This determines the number of objects recognized.
  • The inference loop used in the camera mode can be modified to transmit results to a local or remote location.
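
These adjustment points typically amount to a handful of values near the top of the sample source. The following Python-style sketch uses hypothetical names to show the kind of values involved; the actual variable names differ between the samples.

# Hypothetical customization block; the real samples expose equivalent values.
BATCH_SIZE = 1                           # use 1 for camera input to minimize latency
INPUT_W, INPUT_H = 1000, 600             # native FRCNN resolution; SSD uses 512 x 512
USE_FP16 = True                          # precision trade-off: accuracy vs. speed and memory
CONF_THRESHOLD = 0.5                     # minimum confidence for a detection to be reported
CLASS_NAMES = ["background", "1", "2"]   # always one more entry than the label file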

Common prerequisites

Following are the prerequisites to have the correct native or Docker environment for Jetson, for both build and runtime:

  • If building natively on Jetson TX2, follow the steps for the NVIDIA JetPack manager installation (currently 4.2.2). This requires a Linux® host machine to initially flash the Jetson board.
    1. Download the JetPack manager from https://developer.nvidia.com/nvsdk-manager and install it using the command sudo apt install sdkmanager_0.9.14-4961_amd64.deb.
    2. Invoke the JetPack manager with sdkmanager &.
    3. During installation to the target Jetson board, you have the option to have the prerequisites installed for you. Deselect the host machine, and optionally deselect TensorFlow. Select Jetson TX2, Nano, or Xavier, depending on your environment.
    4. Once installed, follow the rest of the prerequisite instructions from the NVIDIA /usr/src/tensorrt/samples/README.md.
  • For Python samples, the only additional step is to install pip and then run pip install pycuda.
  • If building on Power within Docker:
    1. Start with the nvidia/cuda-ppc64le:10.1-cudnn7-devel-ubuntu18.04 Docker image and add the latest TensorRT SDK (currently 5.1.3.2, CUDA 10.1, cuDNN 7.5, for Power).
    2. Install or build OpenCV version 3.3.1 or later.
    3. Extract the TensorRT archive.
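
Once the prerequisites are installed (natively or inside the container), a quick way to confirm that the Python environment is usable is to import the relevant packages and print their versions. This is only a sanity check, not part of the samples:

# Quick sanity check of the Python inference prerequisites.
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit   # creates a CUDA context on the default device
import cv2

print("TensorRT:", trt.__version__)
print("OpenCV:", cv2.__version__)
print("CUDA device:", cuda.Device(0).name())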

Compiling and running

Compiling and running involves the following steps:

  1. Compile the C/C++ samples from source first and run them from the bin directory with command line parameters. It is assumed that CUDA, cuDNN, TensorRT, GCC, and OpenCV are preinstalled and that the environment variables are set using the NVIDIA JetPack described above.
  2. Copy this repository's sample source code files and makefiles over the respective TensorRT sample directories in /usr/src/tensorrt/samples. You may need to change the folder owner to the current user (nvidia). Modify the code to match the desired model, batch size, floating point precision, image folder, and class names (more below).
  3. Compile the source code using the make command from the respective samples directory, which usually takes a few seconds.
  4. Run the binary (release or debug) from the bin folder and pass in the file names as name1.ext name2.ext without the folder path. On the initial run, if the TensorRT engine for the model has not been built before, it takes a little while (about a minute) to parse the model and serialize the engine. On subsequent runs, if no changes were made to the model or engine parameters, the engine is deserialized from the earlier saved one.
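
The caching behavior described in step 4 follows a common TensorRT pattern: serialize the engine to a file after the first (slow) build and deserialize it on later runs. A minimal Python sketch of that pattern is shown below; the file name and builder function are placeholders, not the samples' actual code.

import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ENGINE_PATH = "model.engine"          # placeholder cache file name

def load_or_build_engine(build_engine_fn):
    # Deserialize a previously saved engine if one exists; otherwise build and cache it.
    runtime = trt.Runtime(TRT_LOGGER)
    if os.path.exists(ENGINE_PATH):
        with open(ENGINE_PATH, "rb") as f:
            return runtime.deserialize_cuda_engine(f.read())   # fast path on reruns
    engine = build_engine_fn()        # slow path: parse the Caffe model (about a minute)
    with open(ENGINE_PATH, "wb") as f:
        f.write(engine.serialize())   # cache for subsequent runs
    return engine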

Samples

Samples are published at https://github.com/IBM/powerai/tree/master/vision/tensorrt-samples

/samples/sampleMultiGPUbatch/

/samples/sampleMultiGPUbatch/ contains an inferenceloop.py script that instantiates and tests multiple inference containers for Power benchmark testing. The script allows a multi-threaded client to instantiate many instances of the inference-only server per node, up to the cumulative size of the GPUs' memory. The parameters include the model files, the model configuration, an image file to test with, and the transport (GET or POST). The number of instances varies depending on the model, batch size, model resolution, and GPU generation and memory. While an increased batch size accelerates inference within an individual instance (higher bandwidth but higher latency), it also requires more memory and therefore decreases the number of instances, so there is a balance to sizing the batch. The native resolution and image file resolution also affect the speed; with optimal settings, expect up to 350 FPS per Power AC922 with four Tesla V100 16 GB GPUs, and around 240 FPS per Power IC922 with six T4 16 GB GPUs, for 1080p images using POST.
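
To illustrate what such a multi-threaded client can look like, the sketch below posts a test image to several inference server instances in parallel. The endpoint URLs and the form field name are assumptions for illustration only; they are not taken from inferenceloop.py.

import threading
import requests

# Hypothetical endpoints: one inference-only server instance per port.
ENDPOINTS = ["http://localhost:500{}/inference".format(i) for i in range(1, 5)]
IMAGE_FILE = "test_1080p.jpg"

def worker(url, iterations=100):
    for _ in range(iterations):
        with open(IMAGE_FILE, "rb") as f:
            response = requests.post(url, files={"files": f})   # POST transport
        response.raise_for_status()

threads = [threading.Thread(target=worker, args=(url,)) for url in ENDPOINTS]
for t in threads:
    t.start()
for t in threads:
    t.join()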

/samples/python/sampleFasterRCNN/

/samples/python/sampleFasterRCNN/ is an FRCNN model project suitable for slower (2 FPS at 1000×600) but more accurate object detection inference than SSD or YOLO. It contains a vision_model_deploy.py script that takes input parameters such as model files, model configuration information (batch size, resolution), inference configuration (confidence), and input files or camera information. The model files include the prototxt file, the .caffemodel weights file, a JSON file, and a label file.

Model configuration includes the batch size and the resolution (which does not need to match the resolution the model was trained at or the input image resolution); inference configuration includes the confidence threshold.

Input configuration is either a list of input files or the keyword Camera. In camera mode, the gstreamer string can be adjusted in the code to reflect the onboard camera, a USB-attached camera, or a streaming (RTSP and so on) source.

Display and debug configuration is in the code and can be adjusted or commented out; it stores image overlays with the detected bounding boxes or displays them in the modeless window, and this is also the place to add code that sends results to a remote client or server.

The batch size can be set to any value that fits in GPU memory; larger batches increase the throughput (bandwidth) but also the latency. In camera mode, which operates in a loop, the batch size should be set to 1.
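
In camera mode, the script essentially runs a loop like the following sketch, processing one frame per iteration. The run_inference function stands in for the TensorRT execution step and is a placeholder, not the script's real function name.

import cv2

def run_inference(frames):
    # Placeholder for the sample's TensorRT execution step.
    raise NotImplementedError

GST_STR = ("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! "
           "videoconvert ! video/x-raw, format=BGR ! appsink")
cap = cv2.VideoCapture(GST_STR, cv2.CAP_GSTREAMER)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    detections = run_inference([frame])   # batch of one frame keeps latency low
    # ... draw overlays, log, or transmit the detections here ...
cap.release()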

Below are the places in the code that can be customized for particulars of the model and images:

gst_str = ('nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink')

The invocation would be:

python vision_model_deploy.py --net_file <prototxt> --model_file <weights.caffemodel> --json_file <exported json file> --label_file <exported label.prototxt> --batch <1-20> --resolution <1000×600 or other> --confthre <0.5 or other> --image_name <value>

Note that the value for --image_name can be one of these:

  • <full path, comma separated (if batch)>
  • Camera

/samples/python/sampleSSD

/samples/python/sampleSSD/ contains a detector_deploy.py script for an SSD model, suitable for faster (10 FPS at 512×512) but less accurate object detection inference than FRCNN. Below are the places in the code that can be customized for the particulars of the model and images:

gst_str = ('nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink')

The invocation would be:

python detector_deploy.py --model_def <prototxt> --model_weights <weights.caffemodel> --label_map <exported label.prototxt> --batch <1-20> --resolution <512×512 or other> --confthre <0.5 or other> --image_name <value>

Note that the value for --image_name can be one of these:

  • <full path, comma separated (if batch)>
  • Camera

/samples/sampleFasterRCNN/

sampleFasterRCNN.cpp is an FRCNN model project suitable for slower (2 FPS at 1000×600) but more accurate object detection inference than SSD or YOLO. Below are the places in the code that can be customized for the particulars of the model and images:

static const int INPUT_H = 600;         // native model input height
static const int INPUT_W = 1000;        // native model input width
static const int OUTPUT_CLS_SIZE = 3;   // number of classes, including background
const int N = 1;                        // batch size
const std::string CLASSES[OUTPUT_CLS_SIZE]{"background", "1", "2"};
std::vector<std::string> dirs{"data/samples/faster-rcnn/", "data/faster-rcnn/"};
caffeToTRTModel("faster_rcnn_test_iplugin.prototxt", "VGG16_faster_rcnn_final.caffemodel",
cap.open("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink");

The invocation would be one of these:

./sampleFasterRCNN <image_file> <image_file> <image_file>

or

./sampleFasterRCNN Camera

/samples/sampleSSD/

sampleSSD.cpp is an SSD model project suitable for faster (10 FPS at 512×512) but less accurate object detection inference than FRCNN. Below are the places in the code that can be customized for particulars of the model and images:

static const int kINPUT_H = 512;         // native model input height
static const int kINPUT_W = 512;         // native model input width
static const int kOUTPUT_CLS_SIZE = 3;   // number of classes, including background
const int N = 1;                         // batch size
static const std::vector<std::string> kDIRECTORIES{"data/samples/ssd/", "data/ssd/", "data/int8_samples/ssd/", "int8/ssd/"};
const std::string gCLASSES[kOUTPUT_CLS_SIZE]{"background", "1", "2"};
caffeToTRTModel("ssd.prototxt", "VGG_VOC0712_SSD_300x300_iter_120000.caffemodel",
cap.open("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink");

The invocation would be one of these:

./sampleSSD <image_file> <image_file> <image_file>

or

./sampleSSD Camera

Conclusion

The repository described in this article currently contains four samples that do not require you to install any deep learning frameworks or libraries; they use TensorRT as a common inference engine and cover FRCNN and SSD models in both C/C++ and Python forms. Using the samples requires some basic modifications related to the locations of the trained model files, the TensorRT parameters, and the command line parameters for the image input files. The samples have been tested on both Jetson TX2 and IBM POWER9™ processor-based servers. Support will be extended to additional models, such as YOLO and GoogLeNet, in the future.