By Gil Vernik Published January 31, 2019
Ants are incredible creatures. They have fascinated biologists for years and inspired computer scientists to invent efficient algorithms that simulate their behavior. An ant colony moves fast. It is coordinated and efficient, and its ants work toward the same overall goal. Examining ants closely, I can’t help but be amazed at how well the colony organizes its parallel efforts to achieve a single mission. But this blog isn’t about ants. I’m going to focus on how to coordinate parallel serverless computing to achieve a single goal for complex tasks. Using serverless computing to separate a single mission into parallel tasks has enormous benefits when it comes to the data preprocessing and the transformations needed for AI and machine learning compute engines. This blog explains how to set up image preprocessing that takes about 35 seconds by using IBM Cloud Functions, as compared to 2,200 seconds using a traditional approach on a single computer.
Digital cameras, mobile phones, applications, and IoT sensors all generate boatloads of data. The original data generated by these devices is what we usually refer to as raw, unprocessed data. Collecting this data is only the first step: before the different analytics systems can use it, the raw data needs to be preprocessed. One example is face recognition, as discussed below. Here, the raw data might contain “noise,” such as background imagery that must be removed, or it might need other preprocessing before being consumed by AI frameworks. In the following example, the image on the left is a raw 2-MB image that was captured by a digital camera, while the image on the right is a processed image, without a background and with an aligned face. The processed image consumes only 12 KB of storage.
Transforming raw data into something we can further analyze is what we call the “data preparation phase.” Clearly, this process depends on what we want to process. It might be raw video data that needs to be preprocessed to generate specific frames, log records that need to be anonymized, and so forth.
As an example, let’s look more closely at the need for face alignment in facial recognition. The process of aligning an image is pretty simple and can be done using the Dlib library and its face landmark predictor. Only a few lines of Python code are required to apply the face landmark predictor and preprocess a single image. But what about activating the same function to process the millions of raw images stored in IBM Cloud Object Storage, an ultimate storage solution both for structured and unstructured data? Extending our code to process millions of images involves learning how to find the images inside the object storage, alongside reading and writing the images. We also need to address the challenge of how and where to scale the data processing for the entire set of images. Obviously, the processing can’t be practically done sequentially over a standard computer. It must be done in parallel by many computers.
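To make the idea of alignment concrete, here is a minimal sketch of the geometry behind it, under stated assumptions. It does not use Dlib and is not the code from this post. It assumes that a landmark predictor has already returned the two eye centers as (x, y) pixel coordinates, and it rotates them so that the eyes sit on a horizontal line. The function names and the sample coordinates are hypothetical.

```python
import math


def eye_angle(left_eye, right_eye):
    """Angle, in degrees, of the line through both eyes relative to horizontal."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.degrees(math.atan2(dy, dx))


def rotate_point(point, center, degrees):
    """Rotate a point around a center by the given angle."""
    theta = math.radians(degrees)
    x = point[0] - center[0]
    y = point[1] - center[1]
    return (
        center[0] + x * math.cos(theta) - y * math.sin(theta),
        center[1] + x * math.sin(theta) + y * math.cos(theta),
    )


def level_eyes(left_eye, right_eye):
    """Rotate both eye landmarks around their midpoint so they share one y value."""
    angle = eye_angle(left_eye, right_eye)
    center = (
        (left_eye[0] + right_eye[0]) / 2,
        (left_eye[1] + right_eye[1]) / 2,
    )
    return (
        rotate_point(left_eye, center, -angle),
        rotate_point(right_eye, center, -angle),
    )


# Hypothetical eye centers from a tilted face.
new_left, new_right = level_eyes((100, 120), (160, 150))
```

In a real pipeline, the same rotation would be applied to the whole image rather than to two points, and the result would then be cropped and resized.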
The scenario described above is a perfect fit for IBM Cloud Functions, which is the IBM Function as a Service platform. Using IBM Cloud Functions, we can get the resources we need for the processing and pay only for the resources actually used. However, we still need to understand how to effectively scale the serverless processing against a dataset of images in IBM Cloud Object Storage, how to monitor all the executions, and how to run all of the tasks as a single “logical” job in IBM Cloud Functions.
To tackle this challenge and easily integrate IBM Cloud Functions with image preprocessing, we turned to the IBM-PyWren framework. This framework is designed to scale Python applications inside IBM Cloud Functions. The fascinating part is that IBM-PyWren requires only minimal code modifications to benefit from IBM Cloud Functions and run massively parallel workloads. You can read about IBM-PyWren in the following posts:
With just five lines of new code, we can integrate the IBM-PyWren framework into the existing image preprocessing code. The five lines of code activate the IBM-PyWren framework, which internally scales the image preprocessing code for the entire set of images stored in IBM Cloud Object Storage. This is completely transparent to the user, making the overall process an ideal user experience.
We decided to run an experiment where we processed 1,054 images, all stored in IBM Cloud Object Storage. Using IBM-PyWren, we completed the entire processing in about 35 seconds, compared to 2,200 seconds running over a local computer with an Intel Core i7 having four cores. But it’s not just the faster running times that made such a big impression on us. Without the IBM-PyWren framework, we needed close to 100 lines of additional “boilerplate” code to find and loop over the images, read and write the images, and so forth. Because our past blogs explained the benefit of IBM-PyWren, we decided to focus here on the internal architecture to understand the real magic that happens inside the framework.
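For a sense of scale, the timings reported for this experiment can be turned into a speedup and a per-image cost. The short calculation below uses only the three numbers quoted in this post. Because the parallel run is reported as “about 35 seconds,” treat the results as approximate.

```python
images = 1054          # images processed in the experiment
local_seconds = 2200   # sequential run on the local four-core machine
pywren_seconds = 35    # approximate IBM-PyWren run time

# How many times faster the parallel run finished.
speedup = local_seconds / pywren_seconds

# Average wall-clock time spent per image in each setup.
per_image_local = local_seconds / images
per_image_parallel = pywren_seconds / images

print(f"speedup: about {speedup:.1f}x")
print(f"local: about {per_image_local:.2f} s per image")
print(f"parallel: about {per_image_parallel:.3f} s per image (effective)")
```

The local machine spends roughly two seconds on each image, so the gain comes from processing many images at once rather than from processing any single image faster.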
The following image demonstrates the internal flow activated by the IBM-PyWren framework. A user’s align() function takes a single image and applies the Dlib alignment. The IBM-PyWren client first serializes the align() function and stores it in IBM Cloud Object Storage. Then, using the massive spawning method, which is unique to the IBM-PyWren framework, the IBM-PyWren client invokes a server runtime deployed inside IBM Cloud Functions. Each such invocation is responsible for spawning a large number of further invocations, each of which is assigned to process a single image from IBM Cloud Object Storage. This architecture allows us to efficiently process a vast number of images in parallel, while benefiting from fast invocations within IBM Cloud Functions.
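The flow described above can be imitated locally to see how the pieces fit together. The sketch below is a simulation, not IBM-PyWren code, and it calls no IBM Cloud API. A Python dict stands in for the object storage bucket, thread pools stand in for function invocations, and `marshal` stands in for whatever serializer the framework actually uses. The names `storage`, `FUNC_KEY`, `worker`, `spawner`, and `client`, and the placeholder body of `align()`, are all hypothetical.

```python
import marshal
import types
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for an IBM Cloud Object Storage bucket.
storage = {f"raw/img_{i}.jpg": f"  face {i}  ".encode() for i in range(10)}

FUNC_KEY = "jobs/align.bin"  # hypothetical key for the serialized function


def align(raw_bytes):
    # Placeholder for the user's Dlib-based alignment of one image.
    return raw_bytes.strip()


def serialize(func):
    # Stand-in serializer: only handles simple functions without closures.
    return marshal.dumps(func.__code__)


def deserialize(blob):
    # Rebuild a callable from the stored code object.
    return types.FunctionType(marshal.loads(blob), globals())


def worker(image_key):
    # One simulated invocation: load the function, process exactly one image.
    func = deserialize(storage[FUNC_KEY])
    out_key = image_key.replace("raw/", "aligned/", 1)
    storage[out_key] = func(storage[image_key])
    return out_key


def spawner(image_keys):
    # One simulated first-level invocation that spawns one worker per image.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(worker, image_keys))


def client(func, prefix, group_size):
    # Store the serialized function, list the images, then fan out in two levels.
    storage[FUNC_KEY] = serialize(func)
    keys = sorted(k for k in storage if k.startswith(prefix))
    groups = [keys[i:i + group_size] for i in range(0, len(keys), group_size)]
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        results = list(pool.map(spawner, groups))
    return [key for group in results for key in group]


aligned_keys = client(align, "raw/", group_size=3)
```

The point of the two levels is that the client issues only a handful of calls, one per group, while the fan-out to one worker per image happens on the far side, close to the data.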
You can read more about IBM-PyWren in this recent paper published at Middleware 2018.
To demonstrate how all this works, we created a Watson™ Studio example notebook that contains complete end-to-end instructions. This includes how to create images, process the images with IBM-PyWren over IBM Cloud Functions, and use the Watson Machine Learning service to run actual machine learning analytics on the preprocessed data. You can follow the steps and explanations here.
Data preprocessing is challenging, and it is a perfect use case for IBM Cloud Functions, with IBM Cloud Object Storage as the storage solution. We plan to explore additional data preprocessing use cases and post more blogs and papers. Stay tuned!