Establishing an AI model benchmark for manufacturing quality inspection using edge computing

This article is the third in a series about using artificial intelligence (AI) for quality assurance in manufacturing with edge computing. In the first article, we gave an architecture overview of a manufacturing quality inspection system using AI and edge computing. In the second article, we explored the solution in detail, including our AI models and inferencing. In this third article, we move the focus to the inferencing performance of production-ready AI models on edge devices.

In today’s world, there is a wide range of AI models and edge devices. To achieve optimal performance, it is important to carefully choose the appropriate edge devices for specific AI models. In our work, we characterize the inferencing performance in model-device matrices and provide reference points for selecting edge devices and balancing edge workload for actual production deployment.

A series of benchmarking test cases was conducted in an environment similar to the actual production environment where the edge inferencing workload would run. The resulting data serves as a guide for balancing inference time against overall workload requirements.

Benchmarking test environment

We set up an environment for the benchmarking testing as shown in Figure 1, where an inferencing server executes tests with an image as input and JSON text as output. Scripts were run to measure inference time and resource usage.

Figure 1. Environment Diagram for Benchmarking Test

Environment diagram that shows benchmarking test script and the inferencing service
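
The measurement scripts themselves are not reproduced in this article. As a minimal sketch of how per-image inference time could be measured, the following Python function times repeated calls to an inference callable after a few untimed warm-up calls, so that one-time costs such as model loading stay out of the result. The names measure_inference_time, infer, and fake_infer are hypothetical; fake_infer only stands in for a request to the inferencing service (image in, JSON text out).

```python
import statistics
import time
from typing import Callable, Dict


def measure_inference_time(
    infer: Callable[[bytes], str],
    image: bytes,
    iterations: int = 10,
    warmup: int = 2,
) -> Dict[str, float]:
    """Time repeated calls to `infer` and return summary statistics in seconds."""
    # Warm-up calls are not timed, so one-time costs stay out of the result.
    for _ in range(warmup):
        infer(image)

    durations = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer(image)
        durations.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(durations),
        "min": min(durations),
        "max": max(durations),
    }


def fake_infer(image: bytes) -> str:
    """Hypothetical stand-in for the inferencing service: image in, JSON text out."""
    time.sleep(0.01)
    return '{"detections": []}'


if __name__ == "__main__":
    print(measure_inference_time(fake_infer, b"image-bytes", iterations=5))
```

In a real run, infer would wrap the call to the inferencing service on the device under test, and the iteration count would be chosen to smooth out run-to-run variation.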

We used NVIDIA Jetson TX2, IBM Power AC922, and IBM Power IC922 as the edge devices. In this article, we use "edge devices" as a general term for the inferencing hardware at the edge, even though the IBM Power AC922 and IC922 are technically servers rather than devices. See Table 1 for the platform details and CUDA library versions of the three devices.

Table 1. Hardware detail

| | Jetson TX2 | IC922 | AC922 |
|---|---|---|---|
| arch | Aarch64 | ppc64le | ppc64le |
| OS | NAME="Ubuntu" VERSION="18.04.4 LTS (Bionic Beaver)" | NAME="Red Hat Enterprise Linux Server" VERSION="7.9 (Maipo)" | NAME="Red Hat Enterprise Linux Server" VERSION="7.9 (Maipo)" |
| CUDA Version | 10.0 | 10.2 | 10.2 |
| GPU | 1 x 256-core NVIDIA Pascal GPU | 4 x Tesla T4 | 4 x Tesla V100-SXM2 |
| MEM | 7860MB | 15109MB | 16160MB |

Benchmarking tests

We described the details on how AI models are used for parts inspection in our previous articles. Different AI models were chosen based on their accuracy, resources needed to run inference on different edge devices, and environments. In this article, we take a closer look at how those models perform and compare.

The benchmarking tests consist of three parts:

  • Part 1. Measure inference time and system resources comparing two object detection models on the Jetson TX2 and Power devices.
  • Part 2. Measure inference time and system resources for two object detection models running on Jetson TX2 device.
  • Part 3. Measure inference time and system resources for three object detection models running on Power servers.

Part 1: Inference time comparison for AI models

Two types of models were chosen: Faster R-CNN and YOLOv3, both trained on the same dataset. The objective is to characterize the performance difference between Faster R-CNN and YOLOv3 on the TX2 versus the Power servers (GPUs). Therefore, we measured only pure inference time and excluded model loading time from our reporting.

The details of the two models are shown in Table 2.

Table 2. Model details for Part 1 benchmarking

| | Faster R-CNN | YOLOv3 |
|---|---|---|
| MVI version | | |
| Accuracy | 0.9701 | 0.9937 |
| Model Size | 484.6 MB | 436.6 MB |

The results of the benchmarking tests are shown in Table 3.

Table 3. Inference time for Faster R-CNN vs YOLOv3

| Hardware | Faster R-CNN | YOLOv3 |
|---|---|---|
| AC922 | 0.983 sec | 0.066 sec |
| IC922 | 1.215 sec | 0.196 sec |
| Jetson TX2 | 4.657 sec | 0.719 sec |

The results show that inference time on the Power servers is roughly a quarter or less of that on the Jetson TX2 (between about 9% and 27%, depending on the model and server), due to the significant difference in hardware resources. Between the two types of Power servers, AC922 showed better inferencing performance than IC922. The YOLOv3 model is faster than Faster R-CNN on all three platforms, as expected given the nature of the two networks: YOLOv3 is a single-stage detector, whereas Faster R-CNN is a two-stage detector.
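
The ratios behind this comparison can be recomputed directly from Table 3. The short Python sketch below expresses each Power server inference time as a percentage of the Jetson TX2 time for the same model. The function name relative_to_tx2 is only illustrative; the numbers are copied from Table 3.

```python
# Inference times in seconds, copied from Table 3.
INFERENCE_SEC = {
    "AC922": {"Faster R-CNN": 0.983, "YOLOv3": 0.066},
    "IC922": {"Faster R-CNN": 1.215, "YOLOv3": 0.196},
    "Jetson TX2": {"Faster R-CNN": 4.657, "YOLOv3": 0.719},
}


def relative_to_tx2(times):
    """Return each Power server time as a percentage of the Jetson TX2 time."""
    baseline = times["Jetson TX2"]
    return {
        server: {
            model: round(100 * seconds / baseline[model], 1)
            for model, seconds in per_model.items()
        }
        for server, per_model in times.items()
        if server != "Jetson TX2"
    }


if __name__ == "__main__":
    for server, per_model in relative_to_tx2(INFERENCE_SEC).items():
        for model, percent in per_model.items():
            print(f"{server} {model}: {percent}% of the TX2 time")
```

The percentages come out at about 21% and 9% for AC922 and about 26% and 27% for IC922 (Faster R-CNN and YOLOv3 respectively), so the largest relative gain in these tests is YOLOv3 on AC922.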

Part 2: Characterization of memory, CPU, and network limits for inferencing in TX2

This part involved two object detection models, Faster R-CNN and Single Shot Detector (SSD), trained on different datasets. The objective of Part 2 of our benchmarking tests was to characterize memory, CPU, and network limits when using TX2 as the edge device for production workloads.

The benchmarking in Part 2 consisted of monitoring CPU, memory, and network utilization over a span of 10 minutes for each of the following scenarios, which involve the two models detailed in Table 4.

In each scenario, two backup sizes 2.5M and 50M were tested:

  • Scenario S2.1: Baseline (no inferencing)
  • Scenario S2.2: One inferencing workload – Faster R-CNN
  • Scenario S2.3: One inferencing workload – Single Shot Detector (SSD)
  • Scenario S2.4: Two simultaneous inferencing workloads – Faster R-CNN + SSD

This particular SSD model is the only non-production model used in this benchmarking study. SSD has lower accuracy and is generally not suitable for quality inspection with small feature size. However, SSD is smaller in size and is less resource-hungry, so we chose to use it for benchmarking on the TX2 edge device.

Because backing up data from edge devices to a central storage location is a required function for a quality inspection operation, network monitoring was added to understand the impact of backup jobs on system performance on the TX2.
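
The results for each scenario are reported as an average and a peak over the monitoring window. As a rough illustration of that reduction, and not the monitoring script used in the tests, the Python sketch below samples a utilization reading at a fixed interval and summarizes the samples. The names collect_samples, summarize, and read_percent are hypothetical, and the sampling interval is an assumption because the article does not state one.

```python
import time
from typing import Callable, Dict, List


def collect_samples(
    read_percent: Callable[[], float],
    num_samples: int,
    interval_sec: float,
) -> List[float]:
    """Call `read_percent` at a fixed interval and return the readings."""
    samples = []
    for _ in range(num_samples):
        samples.append(read_percent())
        time.sleep(interval_sec)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Return the average and peak of a list of utilization readings."""
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        "average": round(sum(samples) / len(samples), 2),
        "peak": max(samples),
    }


if __name__ == "__main__":
    # Hypothetical readings; a real probe would query the operating system.
    readings = iter([8.0, 12.5, 24.0, 9.5])
    samples = collect_samples(lambda: next(readings), num_samples=4, interval_sec=0.0)
    print(summarize(samples))
```

For a 10-minute window sampled once per second (an assumed interval), the call would use num_samples=600 and interval_sec=1.0, with read_percent replaced by a probe for CPU, memory, or network utilization.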

Table 4. Model details for Part 2 benchmarking

| | Faster R-CNN | SSD |
|---|---|---|
| Accuracy | 0.97 | 0.48 |
| Model Size | 469 MB | 107 MB |

Table 5 summarizes the results.

| Scenario | CPU Average | CPU Peak | Memory Average | Memory Peak | Network (2.5M backup) | Network (950M backup) |
|---|---|---|---|---|---|---|
| S2.1 | 0.24% | 1.25% | 21.83% | 21.88% | 8.3% | 90.93% |
| S2.2 | 9.74% | 24.18% | 34.82% | 70.57% | 8.47% | 90.14% |
| S2.3 | 8.16% | 16.04% | 29.10% | 51.69% | 8.36% | 91.21% |
| S2.4 | 17.49% | 47.87% | 37.08% | 83.19% | 10.62% | 91.16% |

The data suggests that the NVIDIA Jetson TX2 is a robust platform that can run a single, moderately sized inferencing workload without problems, even with network backups running. However, running two simultaneous inferencing workloads (one Faster R-CNN and one SSD) pushed peak memory utilization on the TX2 to about 83%, close to its limit. The TX2 was able to run two simultaneous workloads in our test partly because the SSD model is lighter and requires fewer resources. Because peak memory utilization for a single Faster R-CNN workload was already about 71%, we conclude that the TX2 would not be able to run two simultaneous Faster R-CNN workloads due to memory limitations. Because the complexity and resource requirements of models vary across use cases, we recommend only a single model workload per TX2 if it is chosen as the edge device for production deployment.
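
The memory argument can be checked with simple arithmetic on the peak memory figures measured in scenarios S2.1 to S2.3. The Python sketch below uses a deliberately simple additive model, which is our assumption for illustration and not a method described in the tests: each workload's increase over the baseline is stacked on top of the baseline, as if all peaks occurred at the same moment.

```python
# Peak memory utilization in percent, from scenarios S2.1 to S2.3.
BASELINE_PEAK = 21.88  # S2.1, no inferencing
FRCNN_PEAK = 70.57     # S2.2, one Faster R-CNN workload
SSD_PEAK = 51.69       # S2.3, one SSD workload


def additive_peak_estimate(baseline, *single_workload_peaks):
    """Estimate combined peak memory by stacking each workload's increase
    over the baseline. Assumes all peaks coincide, which is a worst case."""
    increases = [peak - baseline for peak in single_workload_peaks]
    return round(baseline + sum(increases), 2)


if __name__ == "__main__":
    print("Faster R-CNN + Faster R-CNN:",
          additive_peak_estimate(BASELINE_PEAK, FRCNN_PEAK, FRCNN_PEAK))
    print("Faster R-CNN + SSD:",
          additive_peak_estimate(BASELINE_PEAK, FRCNN_PEAK, SSD_PEAK))
```

Under this assumption, two Faster R-CNN workloads would need about 119% of the available memory. The same estimate for Faster R-CNN plus SSD is about 100%, whereas the measured peak in S2.4 was 83.19%, so the additive model is conservative; even so, it leaves no plausible headroom for a second Faster R-CNN workload.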

These benchmarking tests also suggest that network utilization was primarily affected by backup size and was independent of the memory and CPU utilization from inferencing. Therefore, data size and network bandwidth should be the primary considerations when designing the data backup solution.

Part 3: Comparison of memory, CPU, and network limits for inferencing in Power Servers

Part 3 is a continuation of Part 2 and involves three object detection models, Detectron, YOLOv3, and Faster R-CNN, each trained on a different dataset. The objective of Part 3 is to characterize memory, CPU, and network limits when running multiple models on Power servers.

In the actual production environment, different use cases with models trained with different datasets would run on the same AC922 or IC922 (as an edge server). Therefore, the test cases in Part 3 were designed with three production-ready models that were trained from different datasets (details in Table 6).

Table 6. Model details for Part 3 benchmarking

| | Detectron | YOLOv3 | Faster R-CNN |
|---|---|---|---|
| Accuracy | 1.0 | 1.0 | 0.970 |
| Model Size | 318.2 MB | 436.6 MB | 521.53 MB |

Prior to the characterization of memory, CPU, and network utilization, inference timing measurements for the YOLOv3 and Detectron models were collected as reference data. The results are shown in Table 7.

Table 7. Inference time for YOLOv3 and Detectron

| Hardware | YOLOv3 | YOLOv3 (4 Parallel Inferencing for 1000 iterations) | Detectron |
|---|---|---|---|
| AC922 | 0.253 sec | 1 min, 9.952 sec | 0.390 sec |
| IC922 | 0.335 sec | 5 min, 3.185 sec | 0.366 sec |

The Detectron model was not tested in Part 1, but the results in Part 3 suggest that Detectron can be expected to have inferencing performance similar to that of YOLOv3 on AC922 and IC922 in production.

A special test case that ran four parallel YOLOv3 inferencing workloads (one per GPU) for 1000 iterations was included to characterize parallel inferencing performance, which would be a typical production workload, on AC922 and IC922. The results showed that parallel inferencing on AC922 is 4.33 times faster than on IC922, whereas single inferencing on AC922 is only 1.32 times faster. This suggests that AC922 has a non-linear performance advantage over IC922 for parallel and continuous inferencing workloads.
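
The two speedup factors quoted above follow from the YOLOv3 timings measured on the two servers. As a small worked check, the Python sketch below converts the parallel run times to seconds and divides the IC922 time by the AC922 time. The helper names to_seconds and speedup are only illustrative.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert a 'minutes, seconds' duration to seconds."""
    return minutes * 60 + seconds


# YOLOv3 timings measured on the two Power servers.
SINGLE_SEC = {"AC922": 0.253, "IC922": 0.335}
PARALLEL_SEC = {
    "AC922": to_seconds(1, 9.952),  # 4 parallel workloads, 1000 iterations
    "IC922": to_seconds(5, 3.185),
}


def speedup(times, faster="AC922", slower="IC922"):
    """Return how many times faster `faster` is than `slower`."""
    return round(times[slower] / times[faster], 2)


if __name__ == "__main__":
    print("single inference speedup:", speedup(SINGLE_SEC))
    print("parallel inference speedup:", speedup(PARALLEL_SEC))
```

The parallel run takes 303.185 seconds on IC922 and 69.952 seconds on AC922, which gives the 4.33 factor, while the single-inference times give 1.32.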

CPU, memory, and GPU utilization were then monitored over a span of 10 minutes for each of the following scenarios involving the three models in Table 6.

  • Scenario S3.1: Baseline (no inferencing)
  • Scenario S3.2: One inferencing workload – Detectron
  • Scenario S3.3: One inferencing workload – YOLOv3
  • Scenario S3.4: One inferencing workload – Faster R-CNN
  • Scenario S3.5: Two simultaneous inferencing workloads – Detectron + YOLOv3 on different GPUs
  • Scenario S3.6: Two simultaneous inferencing workloads – Detectron + YOLOv3 on same GPU
  • Scenario S3.7: Two simultaneous inferencing workloads – Faster R-CNN + YOLOv3 on different GPUs
  • Scenario S3.8: Two simultaneous inferencing workloads – Faster R-CNN + YOLOv3 on same GPU

We designed an extensive set of test cases that covered as many real user scenarios as possible; for example, running one inference or two inferences in parallel, running two inferences on separate GPUs and on the same GPU, and running with and without a backup process in parallel with inferencing.

The results are summarized in Tables 8 and 9.

Table 8. Summary of CPU, memory, and GPU utilization benchmarking on AC922

| Scenario | CPU Average | CPU Peak | Memory Average | Memory Peak | GPU 1 Average | GPU 1 Peak | GPU 2 Average | GPU 2 Peak |
|---|---|---|---|---|---|---|---|---|
| S3.1 | 0.02% | 0.19% | 3.44% | 3.44% | 0.00% | 0.00% | | |
| S3.2 | 1.05% | 2.79% | 3.46% | 3.48% | 8.90% | 21.00% | | |
| S3.3 | 0.10% | 0.26% | 3.47% | 3.48% | 29.67% | 37.00% | | |
| S3.4 | 0.40% | 0.65% | 3.57% | 4.05% | 15.76% | 81.00% | | |
| S3.5 | 1.18% | 2.78% | 3.48% | 3.51% | 4.41% | 11.00% | 9.76% | 21.00% |
| S3.6 | 1.19% | 3.47% | 3.56% | 3.60% | 12.73% | 33.00% | | |
| S3.7 | 0.41% | 0.69% | 3.59% | 4.06% | 9.16% | 81.00% | 1.99% | 12.00% |
| S3.8 | 0.49% | 0.62% | 3.59% | 4.07% | 16.24% | 83.00% | | |

Table 9. Summary of CPU, memory, and GPU utilization benchmarking on IC922

| Scenario | CPU Average | CPU Peak | Memory Average | Memory Peak | GPU 1 Average | GPU 1 Peak | GPU 2 Average | GPU 2 Peak |
|---|---|---|---|---|---|---|---|---|
| S3.1 | 0.03% | 0.34% | 2.62% | 2.64% | 0.00% | 0.00% | | |
| S3.2 | 2.28% | 4.36% | 3.37% | 4.64% | 20.14% | 58.00% | | |
| S3.3 | 0.28% | 0.59% | 4.63% | 4.67% | 37.51% | 40.00% | | |
| S3.4 | 0.52% | 0.80% | 2.81% | 2.95% | 21.71% | 91.00% | | |
| S3.5 | 2.19% | 3.68% | 4.63% | 4.72% | 17.58% | 40.00% | 20.83% | 58.00% |
| S3.6 | 2.22% | 3.99% | 4.67% | 4.76% | 29.84% | 76.00% | | |
| S3.7 | 0.62% | 0.94% | 3.45% | 3.60% | 18.52% | 91.00% | 6.78% | 45.00% |
| S3.8 | 0.61% | 0.90% | 3.47% | 3.61% | 27.00% | 91.00% | | |

Because AC922 and IC922 are much higher-grade computing hardware than the TX2 edge device, the memory constraint seen on the TX2 in Part 2 was not a concern for AC922 and IC922.

GPU allocation, which is controlled by the inferencing service, became the key consideration when running parallel inferencing workloads on AC922 and IC922. The data shows that both AC922 and IC922 can handle parallel inferencing workloads well, even when running two simultaneous inferencing workloads on the same GPU. With 4 GPUs available and 2 workloads per GPU, both AC922 and IC922 would be able to support at least 8 simultaneous inferencing workloads.


Our benchmarking study shows that TX2, IC922, and AC922 are all proven choices, in terms of inferencing performance, for edge hardware for AI model inferencing in a distributed AI architecture. The selection between TX2 and IC922 or AC922 depends on performance requirements, the number of inspection stations, and the IT budget of a particular use case.

  • TX2 still has a considerable gap in inferencing time: in our tests it ranged from about 0.7 seconds (YOLOv3) to about 4.7 seconds (Faster R-CNN). If the use case requires inference times below 0.5 seconds, then IC922 or AC922 would be the better choice.
  • From a hardware cost point of view, one IC922 costs approximately as much as 50 TX2s and one AC922 approximately as much as 150 TX2s. We tested only 2 simultaneous inferencing workloads per GPU in our benchmarking tests. The maximum number of concurrent inferences on the same GPU will vary from one use case to another because of different memory usage, and multiple inferences on a GPU can include a combination of AI models. On hardware cost alone, TX2 would appear to be the cost-effective choice, even though one TX2 is recommended for only a single model workload while IC922 and AC922 can support multiple simultaneous inferencing workloads across their GPUs. However, a large number of TX2 edge devices would add IT operational costs to the production environment. Therefore, hardware cost and IT operational costs need to be considered together with the number of inspection stations when selecting an appropriate edge device.
  • AC922 is purpose-built for AI model training workloads, with faster GPUs. Although our benchmarking tests showed better inferencing performance than IC922, as expected, the performance gain probably does not justify the hardware cost of AC922 as a production-level edge device for inferencing. The IC922, by contrast, is purpose-built for enterprise inferencing.
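
The hardware cost comparison in the list above can be turned into a rough sizing calculation. The Python sketch below is only an illustration under stated assumptions: cost is expressed in "TX2 units" using the approximate equivalences given above, a TX2 carries one workload, and each Power server carries 8 workloads (4 GPUs with the 2 workloads per GPU that we tested). It ignores IT operational costs and latency requirements, and the function name hardware_cost is hypothetical.

```python
import math

# Approximate hardware cost in "TX2 units" (one IC922 ~ 50 TX2s, one AC922 ~ 150 TX2s).
COST_IN_TX2_UNITS = {"TX2": 1, "IC922": 50, "AC922": 150}

# Assumed simultaneous workloads per device: one for TX2 (recommended), and
# 4 GPUs x 2 tested workloads per GPU for the Power servers.
WORKLOADS_PER_DEVICE = {"TX2": 1, "IC922": 8, "AC922": 8}


def hardware_cost(device: str, workloads: int) -> dict:
    """Return the device count and hardware cost needed for `workloads`."""
    devices = math.ceil(workloads / WORKLOADS_PER_DEVICE[device])
    return {
        "devices": devices,
        "cost_in_tx2_units": devices * COST_IN_TX2_UNITS[device],
    }


if __name__ == "__main__":
    for device in COST_IN_TX2_UNITS:
        print(device, hardware_cost(device, workloads=20))
```

For 20 workloads, this gives 20 TX2s (20 units), 3 IC922s (150 units), or 3 AC922s (450 units). Because 8 workloads per server is only the level we tested and not a measured maximum, the Power server figures are an upper bound on hardware cost under these assumptions.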

You can contact Christine Ouyang for more information on this solution.