PyWren for IBM Cloud

Open Source Project

What problem we solve

Say you want to use a machine learning framework like TensorFlow to process data for your AI needs. The general process involves writing Python code for the data preparation phase and testing that code on sample data. Even if this all goes smoothly, how do you run the same code at massive scale, in parallel, on terabytes of data? You are probably not in a position to start learning the intricacies of cloud IT or the best ways to set up VMs or containers. Nor do you want to become an expert in scaling Python code to terabytes of data and pick up a whole new set of developer skills along the way.

Serverless to the rescue

This is exactly where serverless computing comes to the rescue. With serverless computing, you pay the cloud provider only for the actual amount of resources consumed by an application, as opposed to pre-purchased units of capacity. In our example, deploying your Python code as a serverless action resolves most of the issues. The serverless action is executed in the cloud, there’s no need to set up VMs, and you are billed only for the resources your code consumes while it runs. But some challenges remain. For example, how can you scale the serverless action to an input of millions of data objects, such that each invocation processes a single data object? How do you collect the outputs and monitor the concurrent executions?

The bottom line is you want to run your existing code on a large data set, get the results, and consider the value of insights gained. How can you benefit from, and integrate, serverless computing with a minimal impact on your existing code or flows?

How we do it

To address the challenge of integrating serverless computing without major disruptions to your system or code rewrites, IBM Research developed the PyWren-IBM-Cloud framework and released it as open source. Based on the open source PyWren project, this framework offers users a new “push to the cloud” experience. It allows them to focus strictly on writing their Python code, while PyWren deploys the code as a serverless action to IBM Cloud Functions (based on Apache OpenWhisk), runs it with a large amount of parallelism, and monitors its execution. PyWren-IBM-Cloud is not, however, a mere reimplementation of PyWren’s API atop IBM Cloud Functions. Rather, it should be viewed as an advanced extension of PyWren that runs broader MapReduce jobs and adds features of its own, such as seamless integration with Python notebooks, partition discovery for processing large amounts of data stored in cloud object storage, and faster initialization, among others.
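
To make the idea concrete, here is a minimal local sketch of the pattern described above: split a large object into partitions, run one function call per partition in parallel, and collect the results. It is not PyWren's API and does not touch IBM Cloud Functions; a thread pool from the Python standard library stands in for the serverless invocations, and the names partition_ranges, count_bytes and run_map are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def partition_ranges(total_size, chunk_size):
    """Split an object of total_size bytes into (start, end) byte ranges.

    The end offset is exclusive; the last range may be shorter than chunk_size.
    """
    return [(start, min(start + chunk_size, total_size))
            for start in range(0, total_size, chunk_size)]


def count_bytes(byte_range):
    """Hypothetical per-partition task: it only reports the partition size.

    Real data preparation code would read and process the bytes in this range.
    """
    start, end = byte_range
    return end - start


def run_map(func, items, max_workers=8):
    """Call func once per item concurrently and return results in input order.

    Assumption: local threads stand in for parallel serverless invocations.
    """
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which input position each future belongs to.
        futures = {pool.submit(func, item): i for i, item in enumerate(items)}
        for future in as_completed(futures):
            # future.result() re-raises any exception from the task.
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    ranges = partition_ranges(total_size=1000, chunk_size=300)
    sizes = run_map(count_bytes, ranges)
    print(ranges)
    print(sizes)
```

The per-partition function stays ordinary Python with no knowledge of how or where it is executed. That separation is what lets a framework such as PyWren move the execution to the cloud without changes to the user's code.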

Try it yourself

You can easily get the code to experiment with PyWren on IBM Cloud Functions. Install it locally or use it from IBM Watson Studio. The project page contains working examples and explains how to set up PyWren.

Why should I contribute?

The contribution process is easy, and contributing code to PyWren-IBM-Cloud is a good way to improve your skills. It is also a great way to gain experience with, and a deeper understanding of, the world of serverless computing.