Deep learning models make real-time annotation of a video stream possible, although it’s a challenging task that requires an efficient mechanism and a careful system design. Here, I’ll demonstrate one way to make this work. The proposed method labels each frame of the video, and the labels can be computed with any heavyweight deep learning model. As an example, I use a facial age estimation model to annotate each frame of a real-time video stream with the detected faces (bounding boxes) and the estimated ages of the people in it.

The cost of inference for both the face detection and age estimation models is far higher than the cost of capturing frames from a webcam. With the help of visual tracking algorithms such as Kernelized Correlation Filters (KCF), TLD, and MedianFlow, we can hide the latency between the moment a video frame is fed to a deep model and the moment the model returns an inference result. The visual tracker and the deep model run simultaneously, so each video frame is annotated with bounding box information produced either by the tracker or by the model. When the tracker’s bounding box drifts from the actual location, we use the latest result from the deep learning model to reinitialize the tracker’s bounding box. By combining annotations from the visual tracker and the deep learning model, we resolve the latency issue for real-time video annotation. The framework can be applied to any model with a high computational cost, such as video-based object detection, video segmentation, and so on.

I’ll be using the facial age estimator model from the Model Asset eXchange to demonstrate this approach. The Model Asset eXchange (MAX) is a collection of open source deep learning models that makes it possible for you to deploy pre-trained models without writing any code. With MAX, you can also train models using your own data! Check out the MAX landing page to browse the full collection of models that are free and available for you to use.

Latency is an issue

Suppose we have a deep learning model that can estimate a person’s age when given an image containing human faces, like the facial age estimator model from the Model Asset eXchange. If we change the input from a still image to a video source, for example a webcam, we encounter a latency problem between the live stream and the deep learning model. The latency occurs because a webcam captures frames in real time (normally 30 – 120 frames per second), while a deep learning model (in this case, the age estimator) processes only about 1 – 2 frames per second (fps). Frames therefore accumulate faster than the model can annotate them.
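To make the gap concrete, here is a minimal sketch of how many new frames a webcam delivers while the model processes a single one (the frame rates are illustrative values taken from the ranges above):

```python
def frames_behind(camera_fps: float, model_fps: float) -> int:
    """Number of new frames the camera produces during one model inference."""
    # One inference takes 1 / model_fps seconds; during that interval the
    # camera emits camera_fps * (1 / model_fps) frames.
    return round(camera_fps / model_fps)

# A 30 fps webcam paired with a 1.5 fps age estimator:
print(frames_behind(30, 1.5))  # → 20
```

This back-of-the-envelope number is why the framework below sends only 1 in 20 frames to the model.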

Facial age estimator example

Using the facial age estimator as the example, I’ll demonstrate how the latency can be reduced or hidden in video streaming annotation. The model treats age estimation as a regression problem and reduces quantization error through a mechanism called dynamic range. It processes the input in multiple stages using a coarse-to-fine strategy, and the whole pipeline is trained end to end as a soft stagewise regression network.


An illustration of the facial age estimator model.

The entire framework

We know that there’s a large gap in processing time between the webcam and the facial age estimator model. The age estimator typically takes approximately 0.3 – 0.5 seconds to process each frame, while a webcam delivers a frame in less than 0.1 seconds. Likewise, in a web application, the browser displays each frame in less than 0.1 seconds, similar to the webcam’s rate.

Let’s start with the first frame captured by the webcam. While the webcam sends that frame to the age estimator model, it keeps capturing the second, third, and subsequent frames. If the browser in our web app waits for the result generated by the age estimator model before displaying anything, every displayed frame lags behind the live stream. What we really want is to show the estimated results instantly. How can we do this without any latency?

  • Instead of sending every frame to the model, we send 1 in 20 frames to the age estimator model.
  • While the model is processing, the browser displays ages and bounding boxes propagated from the previous model result by the visual tracking algorithm.
  • When an estimated result arrives from the model, the browser updates to the latest bounding boxes and ages, then relies on visual tracking for the next 19 frames.
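The steps above can be sketched as a dispatch loop. `Tracker` and the `model` callable below are hypothetical stand-ins of my own, not the post’s actual code; in a real app you would swap in the MAX age estimator client and a real tracker such as OpenCV’s KCF implementation:

```python
MODEL_EVERY = 20  # send 1 in 20 frames to the deep model

class Tracker:
    """Hypothetical stand-in for a visual tracker (e.g., KCF)."""
    def __init__(self):
        self.box = None

    def init(self, frame, box):
        self.box = box  # re-anchor on a fresh model result

    def update(self, frame):
        return self.box  # a real tracker would follow the face here

def annotate_stream(frames, model, tracker):
    """Yield (frame, box) pairs; the slow model runs on every 20th frame."""
    for i, frame in enumerate(frames):
        if i % MODEL_EVERY == 0:
            box = model(frame)           # slow but accurate
            tracker.init(frame, box)     # reset the tracker's position
        else:
            box = tracker.update(frame)  # fast, approximate
        yield frame, box
```

This synchronous version still stalls on the model call itself every 20th frame; that stall is what the multithreading described next removes.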

With the help of visual tracking, this real-time annotation mechanism hides the latency smoothly. Because the technique is general purpose, it can pair any real-time video stream with any model. Beyond visual tracking, we also use multiple threads so that the tracker and the age estimator model run in parallel.
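Here is one way that parallelism could look, sketched with Python’s standard `threading` and `queue` modules. The worker, queue, and tracker names are my own assumptions, not the post’s actual code: a background thread runs the slow model, the main loop keeps tracking every frame, and the tracker is re-anchored whenever a fresh model result appears.

```python
import queue
import threading

def inference_worker(model, frames_in: queue.Queue, results_out: queue.Queue):
    """Run the slow deep model off the main thread."""
    while True:
        frame = frames_in.get()
        if frame is None:                     # sentinel: shut down
            break
        results_out.put(model(frame))

def annotate(frames, model, tracker):
    """Yield (frame, box) pairs; the main loop never blocks on the model."""
    frames_in, results_out = queue.Queue(maxsize=1), queue.Queue()
    worker = threading.Thread(
        target=inference_worker, args=(model, frames_in, results_out), daemon=True
    )
    worker.start()
    box = None
    for i, frame in enumerate(frames):
        try:
            box = results_out.get_nowait()    # fresh model result available?
            tracker.init(frame, box)          # re-anchor the tracker
        except queue.Empty:
            box = tracker.update(frame)       # keep tracking meanwhile
        if i % 20 == 0 and not frames_in.full():
            frames_in.put(frame)              # hand 1 in 20 frames to the model
        yield frame, box
    frames_in.put(None)                       # stop the worker
    worker.join()
```

The bounded input queue (`maxsize=1`) is a deliberate choice: if the model falls behind, stale frames are skipped rather than queued up, so the model always works on a recent frame.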


The proposed framework of reducing latency for real-time video annotation.

Learn more about the facial age estimator

By reducing the latency, we have made real-time video annotation of a live stream possible. This post not only introduces the proposed framework, but also a practical web application (web app) that demonstrates how smoothly you can annotate streaming video. For more technical details, look at our code pattern, which lets you get the model, test the API, try the web app, and run the model in a Node-RED flow. I’ve also included below the full poster that I presented at Think 2019, which further illustrates the material in this blog.


Please visit IBM Cloud for more services.