The counting cars code pattern
Using computer vision to annotate videos is a fun and useful exercise. I recently had a chance to try it out while working on this code pattern. I learned a lot (you might, too), and I think that if you try it out, you’ll think of many useful applications.
For a use case, I started with the idea of counting objects in motion and how to apply that to business applications. Whether it is car traffic, foot traffic, or products on a conveyor belt, there are many applications for keeping track of potential customers, actual customers, products, or other assets. With video cameras everywhere, a business can extract useful information from them with some computer vision. Applying this technology to video is much more practical than older methods (for example, special hardware or a person standing roadside counting vehicles).
Most use cases can be solved if you can do the following four things:
- Recognize objects of interest
- Keep track of the objects as they move
- Determine if they enter or exit a specific region of interest
- Annotate the video (if needed)
In the example code pattern, I am able to recognize and track cars on a highway. In this case, the region of interest is the bottom of the screen where I count cars as they exit the video (a.k.a., the “finish line”).
Recognize objects of interest
In my example, I wanted to recognize cars and be able to locate them in the video. I used IBM PowerAI Vision and its Video Data Platform to create an object detection API for cars. PowerAI Vision makes deep learning remarkably easy to apply: I created, trained, and deployed the model without writing any code, and the deployed model can then be called from any application. In this instance, I used a Jupyter Notebook to show how to use it with Python and a video.
Because I am using video, I was able to use the Video Data Platform and automatic labeling. This replaced a lot of manual effort and greatly improved the accuracy of the model. The basic idea is that you do a little bit of manual labeling (draw some bounding boxes around cars on some images using your mouse in a web UI), and then the Video Data Platform uses that model for inference to do automatic labeling on many more frames sampled from the video. You simply validate the results, decide if you need to make manual adjustments, and voilà — you get a much more accurate model. Whether you use the model on video or on still images, having so many examples with slightly different angles and lighting really improves the accuracy of the model.
PowerAI Vision is built to run on Power Systems and uses GPUs to accelerate deep learning. You can run it on-premises or try it out in the cloud.
So, how do you recognize cars in the video? Well, the Jupyter Notebook (Python code) uses OpenCV to take a video and extract the frames. You take a sampling of the frames and send them to your deployed PowerAI Vision API to get back JSON describing the cars that were detected and the coordinates for the bounding boxes around them. With so many frames per second, it isn’t necessary to run inference on every frame. You can tune that to your use case. Either way, you’ll want to keep track of these cars from frame to frame — so keep reading.
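Sketching the frame-sampling and response-parsing step in Python (the frame grab itself would use OpenCV's `cv2.VideoCapture`; the JSON field names below are assumptions based on a typical object detection response, so check them against your deployed API):

```python
import json

def parse_detections(response_json):
    """Pull car bounding boxes out of an inference response.

    Assumes a response shaped roughly like an object detection API's:
    {"classified": [{"label": ..., "xmin": ..., "ymin": ...,
                     "xmax": ..., "ymax": ...}, ...]}.
    The exact field names may differ in your deployment.
    """
    boxes = []
    for det in response_json.get("classified", []):
        if det.get("label") == "car":
            boxes.append((det["xmin"], det["ymin"], det["xmax"], det["ymax"]))
    return boxes

def frames_to_infer(total_frames, every_n):
    """Run inference on every Nth frame instead of all of them."""
    return [i for i in range(total_frames) if i % every_n == 0]

# Example: a mocked response with two detections, one of them a car
response = json.loads(
    '{"classified": ['
    '{"label": "car", "xmin": 10, "ymin": 20, "xmax": 50, "ymax": 60},'
    '{"label": "sign", "xmin": 0, "ymin": 0, "xmax": 5, "ymax": 5}]}'
)
print(parse_detections(response))   # [(10, 20, 50, 60)]
print(frames_to_infer(10, 4))       # [0, 4, 8]
```

In the notebook, the real loop reads frames with `cv2.VideoCapture(...).read()` and only sends the sampled frames to the API; everything in between is handled by the trackers.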
Keep track of the objects as they move
[Images: a car detected in one frame, then followed down the road by the tracker]
If I only cared about the number of cars in an image or the average number of cars in a set of images, then it would be easier — no tracking would be needed. But in most cases with video, objects start and stop and move around. Consider the example of counting people in line at a grocery store. If you only care about what the average line length is, then you just count and average. If you care about the number of customers served, then you need to somehow capture movement. The line length could always stay the same, but the customers flow through. In the car counter example, I used a tracker, and I count the cars as they cross the finish line in each lane. Therefore, I can demonstrate not only how to recognize, annotate, and count the cars in a frame but also how to track them and count them as they move frame to frame and enter a region of interest.
So how do you track them? Well, OpenCV has a tracking API that does that for you. You start with bounding boxes that came from the PowerAI Vision inference and create an OpenCV tracker for each new box. It took a little code to decide whether the box was new, but otherwise, OpenCV does a pretty good job of keeping track of that box as the object (it doesn’t need to know it is a car) moves from frame to frame. There are several algorithms to choose from for tracking. The Kernelized Correlation Filter (KCF) and Multiple Instance Learning (MIL) look like the most useful today.
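Deciding whether a detected box is new can be done with a simple intersection-over-union check before creating a tracker for it. This is a sketch of only the box-matching logic; the actual tracker creation call (such as `cv2.TrackerKCF_create()`) varies by OpenCV version, so it is left as a comment:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (xmin, ymin, xmax, ymax) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    return inter / union if union else 0.0

def is_new_box(box, tracked_boxes, threshold=0.5):
    """A detection is 'new' if it doesn't overlap any tracked box enough."""
    return all(iou(box, t) < threshold for t in tracked_boxes)

tracked = [(10, 10, 50, 50)]
print(is_new_box((12, 12, 52, 52), tracked))    # False: mostly the same car
print(is_new_box((200, 10, 240, 50), tracked))  # True: a different car
# For each new box, you would then create a tracker, e.g.:
#   tracker = cv2.TrackerKCF_create()  # or cv2.legacy.TrackerKCF_create()
#   tracker.init(frame, (xmin, ymin, width, height))
```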
If you try my example, you might notice that the tracker sometimes loses a car when the car images overlap (the accuracy of the model has some influence on that, so your results might vary). In my case, changing KCF to MIL in the code fixes that tracking loss, but then MIL stumbles in other areas. So I left it with a loss counter and can suggest a couple of easy fixes as an exercise. The easiest is to subtract the losses from the total count; the lost car is generally re-detected further down the road. In general, however, a lost object might not reappear later in the video, so you'd need to know whether you are double-counting. For the highway video, I added a starting line to avoid adding cars in the distance where the overlap problem happens. That fixes the problem (with KCF), but I left it turned off in the code because I think it is an interesting problem for developers to see.
Another option is to write your own tracker, most likely with more frequent object detection — possibly every frame. This would certainly give you more accurate bounding boxes from PowerAI Vision as well as control over how objects are tracked. It’s just more code. Your approach should be dictated by your business requirements.
Determine if they enter or exit a specific region of interest
The approach I took assumes that I want to follow objects into some “region of interest” or across some goal line. In the sample video, you could decide that the average number of cars on the road is good enough to describe that video, but as discussed earlier, you can apply this to many more use cases if you can track objects in motion, count them as they enter your area of interest, and calculate a rate or a latency. For the cars-on-the-highway example, this is done with a finish line, a lane delimiter, and frames per second. These cars don’t stop, so counting them as they cross the line is quite easy. The only trick is to make sure you don’t count them again in a later frame. Not hard. The tracker gives me consistent enough bounding boxes and midpoints to count each car as its midpoint reaches the line and to skip it in later frames once it has been counted.
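A minimal sketch of the finish-line counting logic, assuming a horizontal finish line and a single x coordinate splitting two lanes (the real code pattern derives these from the video; the names and coordinates here are hypothetical):

```python
def midpoint(box):
    """Center point of an (xmin, ymin, xmax, ymax) box."""
    xmin, ymin, xmax, ymax = box
    return ((xmin + xmax) / 2, (ymin + ymax) / 2)

class FinishLineCounter:
    """Count each tracked object once as its midpoint crosses a line."""

    def __init__(self, line_y, lane_split_x):
        self.line_y = line_y              # y coordinate of the finish line
        self.lane_split_x = lane_split_x  # x coordinate dividing the lanes
        self.counts = {"left": 0, "right": 0}
        self.counted = set()              # IDs already past the line

    def update(self, car_id, box):
        x, y = midpoint(box)
        if y >= self.line_y and car_id not in self.counted:
            self.counted.add(car_id)      # never count the same car twice
            lane = "left" if x < self.lane_split_x else "right"
            self.counts[lane] += 1

counter = FinishLineCounter(line_y=400, lane_split_x=320)
counter.update(1, (100, 350, 160, 390))  # midpoint y=370: not there yet
counter.update(1, (100, 380, 160, 430))  # midpoint y=405: crosses, left lane
counter.update(1, (100, 410, 160, 460))  # already counted: ignored
counter.update(2, (400, 390, 460, 440))  # midpoint y=415: right lane
print(counter.counts)  # {'left': 1, 'right': 1}
```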
Adding a cars-per-second metric seemed like a great idea, as it is much more useful than a count of cars over an unspecified time period. OpenCV helps out with that: there is a simple call to get the frames per second from the video, and then the math is easy.
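The math is a one-liner once you have the frame rate, which OpenCV exposes as `cv2.VideoCapture(...).get(cv2.CAP_PROP_FPS)`:

```python
def cars_per_second(cars_counted, frames_processed, fps):
    """Rate of cars over the time span the processed frames cover.

    fps would normally come from
    cv2.VideoCapture(path).get(cv2.CAP_PROP_FPS).
    """
    seconds = frames_processed / fps
    return cars_counted / seconds

# e.g. 30 cars over 900 frames of 30 fps video = 30 seconds of footage
print(cars_per_second(30, 900, 30.0))  # 1.0 cars/second
```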
Annotating the video
I ended up with the total number of cars detected, bounding boxes from inference and from the trackers, a car count for each lane as cars reach the finish line, and a cars-per-second metric. Maybe I shouldn’t mention the “cars lost”? That’s there too. Just having the numbers might meet some business requirements, but I like to see the proof, and an annotated video is just more fun. How did I do that? Mostly OpenCV, with a little IPython to show the animation in the notebook and some FFmpeg to recreate a video and an animated GIF.
The text, lines, boxes, and stats are added to each frame as I process them using OpenCV. I save the annotated frames in a directory. The OpenCV commands, docs, and examples are pretty easy to follow, and hopefully the example notebook will help with that as well. Displaying a sequence number on each tracked car helps humans see what the trackers are doing.
To play the frames in the Jupyter Notebook as an animation, I use IPython’s display, Image, and clear_output. Some OpenCV commands help with the encoding. This frame loop plays like a video, and it is nice to see your results, but you don’t get the full frame rate of a true video.
To put the frames back together into an MP4 video, I use the ffmpeg command. You can run FFmpeg from your notebook (depending on your environment) or copy that directory of output frames to wherever you have FFmpeg installed. The code pattern GitHub repo includes a tools directory. There is one script to create the video from the frames and another one to create an animated GIF from the video. It’s nice to see the results of your work and be able to share them for “show and tell.”
Well, that was a great exercise, but to really get something out of it, you need to try it yourself. Go take a look at the code pattern. It includes a detailed README and the Jupyter Notebook. I hope you get a chance to run it, change it, take it, and innovate!