As part of PowerAI Vision’s labeling, training, and inference workflow, you can export models that can be deployed on edge devices (such as FRCNN and SSD object detection models that support TensorRT conversions). To enable you to start performing inferencing on edge devices as quickly as possible, we created a repository of samples that illustrate how to use PowerAI Vision with edge devices.

This repository contains samples that perform object detection on image files or camera input and output object classes and bounding boxes. The output can optionally be logged as text, stored as image files with bounding box overlays, or displayed in a modeless window. This capability is supported on NVIDIA Jetson devices, as well as other systems, such as Power servers with GPUs enabled.

In this blog, we describe the samples and explain how to use them as templates for an embedded inference workflow based on local input and output. We also demonstrate how the samples can be modified to customize the batch size, native model resolution, and floating point precision, and to perform additional actions with the inference results, such as sending them to a shared folder or a remote location. This blog is intended for developers familiar with C/C++ and Python development; for simplicity it assumes access to an NVIDIA Jetson TX2 device, but the examples below can be applied to other NVIDIA Jetson-class devices.

The samples are based on NVIDIA’s TensorRT C/C++ samples described at https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html and stored locally at /usr/src/tensorrt/samples. They have been modified for ease of use. The Python samples are written to be functionally equivalent.

Note that the source code may change in the future without prior notice.

Common Use

The common usage pattern is either to process image files locally or to run an inference loop against a camera (onboard, USB, or RTSP based), and optionally display, store, or transmit results to a remote location. Results can be displayed in a window for demonstration purposes, or the TensorRT parameters can be modified to evaluate their impact on accuracy and performance. For best performance, both the graphical display and the debugging code should be commented out.

The described workflow consists of the following steps:

  • Starting with a trained SSD/FRCNN model in Vision, export the model.
  • Extract the TensorRT model files from the zip and the embedded gz file, typically *_trt.prototxt and *.caffemodel, and copy them onto the Jetson file system, for example to /home/nvidia/Downloads (see the extraction sketch after this list).
  • Modify the sample source code for the given model: file folders, resolution, batch size, precision, and so on.
  • Build a sample.
  • Deploy the sample for single or batch image file input from a folder, or deploy it for inference from a camera by passing the argument value “Camera” instead of file name(s).
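As a rough illustration of the extraction step above, the following Python sketch unpacks an exported model archive. The archive name, destination folder, and member names are placeholders and will differ for your model; it also assumes the embedded archive is a gzipped tar.

import glob
import tarfile
import zipfile

# Placeholder names; substitute the actual export downloaded from PowerAI Vision.
exported_zip = "my_model_export.zip"
target_dir = "/home/nvidia/Downloads/my_model"

# The export is a zip file containing an embedded gz archive with the TensorRT files.
with zipfile.ZipFile(exported_zip) as zf:
    zf.extractall(target_dir)

for gz in glob.glob(target_dir + "/*.gz"):
    with tarfile.open(gz, "r:gz") as tf:
        tf.extractall(target_dir)

# The files of interest are typically <model_name>_trt.prototxt and <model_name>.caffemodel.
print(glob.glob(target_dir + "/*_trt.prototxt") + glob.glob(target_dir + "/*.caffemodel"))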

The input can be image files of any type and resolution, passed in as command line arguments. Alternatively, a camera can be used by adjusting a gstreamer string for the onboard, USB, or streaming camera. For more information, read http://developer2.download.nvidia.com/embedded/L4T/r28_Release_v1.0/Docs/Jetson_TX2_Accelerated_GStreamer_User_Guide.pdf.
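For example, the gstreamer string passed to OpenCV might look like one of the following. These pipelines are illustrative only; the exact elements, device paths, and addresses depend on your camera, the L4T release, and how OpenCV was built.

import cv2

# Illustrative pipelines; adjust device paths, addresses, and elements for your setup.
onboard_cam = ("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! "
               "videoconvert ! video/x-raw, format=BGR ! appsink")
usb_cam = "v4l2src device=/dev/video1 ! videoconvert ! video/x-raw, format=BGR ! appsink"
rtsp_cam = ("rtspsrc location=rtsp://<camera-address>/stream ! rtph264depay ! h264parse ! "
            "omxh264dec ! nvvidconv ! videoconvert ! video/x-raw, format=BGR ! appsink")

cap = cv2.VideoCapture(onboard_cam, cv2.CAP_GSTREAMER)
ok, frame = cap.read()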

In file mode, multiple images can be handled simultaneously, as long as the number of images matches the value specified for “batch size”. Note: If a camera is used from within containers, the Docker run options need to enable access to underlying camera hardware. The required syntax and options are beyond the scope of this blog.

The output is a list of classes and bounding boxes per image, along with debug messages. The code outputs the classes as debug text and the bounding boxes as overlays on images, either stored as image files or shown in a modeless window.

The model must meet these requirements:

  • Any custom IBM PowerAI Vision trained SSD or FRCNN model (YOLO and GoogLeNet are currently not supported)
  • The files <model_name>_trt.prototxt and <model_name>.caffemodel must be present, with the appropriate names specified in the source code.
  • The batch size, which is adjustable in the source code, can be any custom value as long as it fits in device memory. For camera input, this value should be one to reduce latency. A bigger batch size gives increased bandwidth at the expense of increased latency.

Customizations

You can customize these samples in the following ways:

  • Floating point precision can be changed. This value affects the accuracy, speed, and memory footprint, and it is adjustable in the source code (see the sketch after this list).
  • The model resolution can be changed. However, it needs to match the (average) aspect ratio of the images. Within the same aspect ratio, a bigger resolution produces better accuracy at the expense of latency and bandwidth, and uses more GPU memory. The native FRCNN resolution is 1000×600 and the native SSD resolution is 512×512 (width × height).
  • The number and names of classes can be changed based on the model; this is also adjustable in the source code. The class names could be read from a label file, but due to variations in file syntax, this function is currently not included. The number of classes is always one more than the number of entries in the label file, since there is an additional background class.
  • The confidence level can be adjusted. This determines the number of objects recognized.
  • The inference loop used in the camera mode can be modified to transmit results to a local or remote location.
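As a rough sketch of how the precision and batch size customizations map onto the TensorRT (5.x) Python API, the snippet below parses a Caffe model and builds an engine. The file names are placeholders, and the actual samples set the equivalent options in their own source code.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

builder = trt.Builder(TRT_LOGGER)
network = builder.create_network()
parser = trt.CaffeParser()

# Placeholder file names; use the files exported from PowerAI Vision.
parser.parse(deploy="model_trt.prototxt", model="model.caffemodel",
             network=network, dtype=trt.float32)

builder.max_batch_size = 4        # batch size: more images per pass, more GPU memory
builder.fp16_mode = True          # precision: FP16 trades a little accuracy for speed and memory
builder.max_workspace_size = 1 << 30

# The input resolution comes from the prototxt, so changing it means editing that file.
engine = builder.build_cuda_engine(network)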

Common Prerequisites

Following are the prerequisites for setting up the correct native (Jetson) or Docker environment, for both build and runtime:

  • If building natively on Jetson TX2, follow the steps described in the NVIDIA JetPack SDK Manager installation (currently 4.2.2). This requires a Linux host machine to initially flash the Jetson board.
    • The SDK Manager is installed by downloading it from https://developer.nvidia.com/nvsdk-manager and installing it via sudo apt install sdkmanager_0.9.14-4961_amd64.deb.
    • Once installed, it is invoked via sdkmanager &.
    • During installation onto the target Jetson board, you have the option to have the prerequisites installed for you. Deselect the host machine, optionally deselect TensorFlow, and select Jetson TX2, Nano, or Xavier, depending on your environment.
    • Once installed, follow the rest of the prerequisite instructions from the NVIDIA /usr/src/tensorrt/samples/README.md.
  • For Python samples, the only additional step is to install pip and then run pip install pycuda (a quick import check is sketched after this list).
  • If building on Power within Docker:
    • Start with the nvidia/cuda-ppc64le:10.1-cudnn7-devel-ubuntu18.04 Docker image and add the latest TensorRT SDK (currently 5.1.3.2, CUDA 10.1, cuDNN 7.5, for Power).
    • Install or build OpenCV version 3.3.1, along with the above.
    • Unzip the TensorRT archive.
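A quick way to confirm that the environment is usable, whether native or in Docker, is to check that the key Python packages import and report sensible versions. This is only a sanity check, not part of the samples:

import cv2
import pycuda.autoinit          # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

print("OpenCV:", cv2.__version__)
print("TensorRT:", trt.__version__)
print("GPU:", cuda.Device(0).name())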

Compilation and Running

  • The C/C++ samples first have to be compiled from source and run from the bin directory with command line parameters. It is assumed that CUDA, cuDNN, TensorRT, GCC, and OpenCV are preinstalled and the environment variables are set via the NVIDIA JetPack described above.
  • Copy the sample source code files and makefiles from this repository over the respective TensorRT sample directories in /usr/src/tensorrt/samples. You may need to change the folder owner to the current user (nvidia). Modify the code to match the desired model, batch size, floating point precision, image folder, and class names (more below).
  • Compile the source code by running make from the respective sample directory; this usually takes a few seconds.
  • Run the binary (release or debug) from the bin folder and pass in the file names as “name1.ext” “name2.ext” without the folder path. On the initial run, if the TensorRT engine for the model has not been built before, parsing and serializing it takes a little while (about a minute). On subsequent runs, if no changes were made to the model or engine parameters, the engine is deserialized from the earlier saved copy, as sketched below.
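The engine caching behavior can be illustrated with the following minimal Python sketch; the cache file name and the build function are placeholders, and the C/C++ samples implement the equivalent logic internally.

import os
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
ENGINE_PATH = "model.engine"      # placeholder cache file name

def load_or_build_engine(build_fn):
    # Reuse a previously serialized engine if one exists on disk.
    if os.path.exists(ENGINE_PATH):
        with open(ENGINE_PATH, "rb") as f:
            runtime = trt.Runtime(TRT_LOGGER)
            return runtime.deserialize_cuda_engine(f.read())
    # Otherwise parse and build the engine (slow, about a minute), then cache it.
    engine = build_fn()
    with open(ENGINE_PATH, "wb") as f:
        f.write(engine.serialize())
    return engine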

Samples

Samples are published at https://github.com/IBM/powerai/tree/master/vision/tensorrt-samples

/samples/sampleMultiGPUbatch/

/samples/sampleMultiGPUbatch/ contains an inferenceloop.py script to instantiate and test multiple inference containers for Power benchmark testing. The script allows a multi-threaded client to instantiate many instances of the inference-only server per node, up to the cumulative size of the GPUs’ memory. The parameters include the model files, the model configuration, an image file to test with, and the transport (GET or POST).
Depending on the model, batch size, model resolution, and GPU generation and memory, the number of instances will vary. While an increased batch size accelerates individual instance inference speed (higher bandwidth but higher latency), it also requires more memory and therefore decreases the number of instances, so batch sizing is a balance. The native resolution and the image file resolution also affect the speed; with optimal settings, it is expected to reach up to 350 fps per AC922 with 4 V100 16GB GPUs and around 240 fps for an IC922 with 6 T4 16GB GPUs for 1080p images using POST.
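The client pattern can be sketched roughly as follows. The endpoint URLs, request format, and field names below are assumptions for illustration only and do not match the actual inferenceloop.py parameters:

import concurrent.futures
import requests

# Hypothetical endpoints, one per inference server instance started on the node.
ENDPOINTS = ["http://localhost:%d/inference" % port for port in range(5000, 5004)]
IMAGE_FILE = "test.jpg"

def send(url):
    # POST the test image and return the detected classes and bounding boxes as JSON.
    with open(IMAGE_FILE, "rb") as f:
        resp = requests.post(url, files={"image": f}, timeout=30)
    return url, resp.json()

# One thread per server instance keeps all the GPUs busy at the same time.
with concurrent.futures.ThreadPoolExecutor(max_workers=len(ENDPOINTS)) as pool:
    for url, result in pool.map(send, ENDPOINTS):
        print(url, result)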

/samples/python/sampleFasterRCNN/

/samples/python/sampleFasterRCNN/ is an FRCNN model project suitable for slower (2 fps at 1000×600) but more accurate object detection inference than SSD or YOLO. It contains a vision_model_deploy.py script that takes input parameters such as model files, model configuration information (batch size, resolution), inference configuration (confidence threshold), and input files or camera information. The model files include the prototxt file, the .caffemodel weights file, a json file, and a label file.

The model configuration includes the batch size and the resolution (which does not need to match the resolution the model was trained at or the input image resolution); the inference configuration includes the confidence threshold.

The input configuration is either a list of input files or the keyword “Camera”; in the latter case, the gstreamer string can be adjusted in the code to reflect the onboard camera, a USB-attached camera, or a stream (RTSP, etc.).

The display and debug configuration is in the code and can be commented in or out to store images with the detected bounding boxes overlaid, or to display them in a modeless window. This is also the place to send any results to a remote client/server.

The batch size can be set to any value that fits in GPU memory; a bigger batch increases bandwidth but also latency. In camera mode, which operates in a loop, the batch size should be set to 1.
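To make the camera mode concrete, the loop below is a minimal sketch of what the sample does with each frame. Here run_inference() is a stand-in stub for the sample’s actual TensorRT inference call and returns no detections:

import cv2

def run_inference(frame):
    # Stub: replace with the sample's TensorRT inference call. Expected to return
    # a list of (class_name, confidence, (x1, y1, x2, y2)) tuples for the frame.
    return []

gst_str = ("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! "
           "videoconvert ! video/x-raw, format=BGR ! appsink")
cap = cv2.VideoCapture(gst_str, cv2.CAP_GSTREAMER)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    for cls, conf, (x1, y1, x2, y2) in run_inference(frame):
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, "%s %.2f" % (cls, conf), (x1, y1 - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow("detections", frame)      # comment out for best performance
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()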

Below are the places in the code that can be customized for the particulars of the model and images:

gst_str = ("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink")

The invocation would be:

python vision_model_deploy.py --net_file <prototxt> --model_file <weights.caffemodel> --json_file <exported json file> --lable_file <exported label.prototxt> --batch <1-20> --resolution <1000×600 or other> --confthre <0.5 or other> --image_name <value>

Note that the value for --image_name can be one of these:

  • <full path, comma separated (if batch)>
  • Camera

/samples/python/sampleSSD/

/samples/python/sampleSSD/ contains a detector_deploy.py script for an SSD model, suitable for faster (10 fps at 512×512) but less accurate object detection inference than FRCNN. Below are the places in the code that can be customized for the particulars of the model and images:

gst_str = ("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink")

The invocation would be:

python detector_deploy.py --model_def <prototxt> --model_weights <weights.caffemodel> --label_map <exported label.prototxt> --batch <1-20> --resolution <512×512 or other> --confthre <0.5 or other> --image_name <value>

Note that the value for --image_name can be one of these:

  • <full path, comma separated (if batch)>
  • Camera

/samples/sampleFasterRCNN/

sampleFasterRCNN.cpp is an FRCNN model project suitable for slower (2 fps at 1000×600) but more accurate object detection inference than SSD or YOLO. Below are the places in the code that can be customized for the particulars of the model and images:

static const int INPUT_H = 600;          // model input height
static const int INPUT_W = 1000;         // model input width
static const int OUTPUT_CLS_SIZE = 3;    // number of classes, including background
const int N = 1;                         // batch size
const std::string CLASSES[OUTPUT_CLS_SIZE]{"background", "1", "2"};                // class names
std::vector<std::string> dirs{"data/samples/faster-rcnn/", "data/faster-rcnn/"};   // model/image search folders
caffeToTRTModel("faster_rcnn_test_iplugin.prototxt", "VGG16_faster_rcnn_final.caffemodel",   // model file names
cap.open("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink");   // camera gstreamer pipeline

The invocation would be one of these:

./sampleFasterRCNN <image_file> <image_file> <image_file>

or

./sampleFasterRCNN Camera

/samples/sampleSSD/

sampleSSD.cpp is an SSD model project suitable for faster (10 fps at 512×512) but less accurate object detection inference than FRCNN. Below are the places in the code that can be customized for the particulars of the model and images:

static const int kINPUT_H = 512;          // model input height
static const int kINPUT_W = 512;          // model input width
static const int kOUTPUT_CLS_SIZE = 3;    // number of classes, including background
const int N = 1;                          // batch size
static const std::vector<std::string> kDIRECTORIES{"data/samples/ssd/", "data/ssd/", "data/int8_samples/ssd/", "int8/ssd/"};   // model/image search folders
const std::string gCLASSES[kOUTPUT_CLS_SIZE]{"background", "1", "2"};   // class names
caffeToTRTModel("ssd.prototxt", "VGG_VOC0712_SSD_300x300_iter_120000.caffemodel",   // model file names
cap.open("nvarguscamerasrc ! nvvidconv ! video/x-raw, format=BGRx ! videoconvert ! video/x-raw, format=BGR ! appsink");   // camera gstreamer pipeline

The invocation would be one of these:

./sampleSSD <image_file> <image_file> <image_file>

or

./sampleSSD Camera

Concluding Remarks

The repository described above currently contains four samples that do not require installing any deep learning frameworks or libraries; they use TensorRT as a common inference engine and cover FRCNN and SSD models in both C/C++ and Python forms. Using the samples requires some basic modifications related to the locations of the trained model files and the TensorRT parameters, plus command line parameters for the image input files. The samples have been tested on both Jetson TX2 and POWER9. The supported models will be extended in the future with YOLO, GoogLeNet, and others.
