This tutorial gets you up and running with PyWren so that you can quickly and easily scale your parallel workloads. Your code is run in parallel, as individual functions on a serverless platform.

PyWren is an open source project that enables Python developers to massively scale the execution of Python code and then monitor and combine the results of those executions. One of PyWren’s goals is to simplify the push-to-cloud experience, bringing the compute power of serverless computing to everyone.

PyWren is great for various use cases: processing data in object storage, running embarrassingly parallel compute jobs like Monte Carlo simulations, or enriching data with more attributes.
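For example, an embarrassingly parallel job fits in a few lines. The following sketch, which assumes PyWren is already installed and configured as described later in this tutorial, estimates pi by fanning a Monte Carlo simulation out across ten parallel invocations:

```
import pywren_ibm_cloud as pywren

def estimate_pi(num_samples):
    # Runs remotely: sample points in the unit square and count the
    # ones that land inside the quarter circle.
    import random
    inside = 0
    for _ in range(num_samples):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

pw = pywren.ibm_cf_executor()
pw.map(estimate_pi, [100000] * 10)  # ten parallel invocations
estimates = pw.get_result()
print(sum(estimates) / len(estimates))  # approaches 3.14159...
```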

In this tutorial, you use PyWren to count the occurrences of words in a set of text documents in an object store. You set up a Cloud Object Storage instance and add the .txt documents. Next, you set up PyWren and run some Python code to count the words. The use case is simple, but it shows the benefit of scaling out the count action across a set of functions that run in parallel.

You can use these word counts to find patterns or entities across a large set of documents or to help expand the variation of vocabulary in a set of essays.

Architecture to scale out functions

Learning objectives

In this tutorial, you learn:

  • The benefits of running parallel workloads as individual functions in a serverless computing platform
  • How to set up and use PyWren with IBM Cloud Functions
  • How to provide credentials for various cloud services to PyWren

Prerequisites

To complete this tutorial, you need the following tools:

  • An IBM Cloud account
  • Python 3 (this tutorial uses a Python 3.6 runtime)
  • Git

Estimated time

  • This tutorial takes approximately 10 minutes, assuming that you have already installed the prerequisites.

Steps

First, you create text files and an instance of IBM Cloud Object Storage to hold those files. Then, you set up PyWren and create Python code for running a word count at scale.

Create two text files

For this tutorial, create two .txt files that contain the text whose words you count.

  1. Create one file named sixteenwords.txt, and paste in the following text:

     These are just some words there are sixteen.
     These are just some words there are sixteen.
    
  2. Create another file named eightwords.txt, and paste in the following text:

     These are just some words there are eight.
    
  3. Save these files.
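If you prefer to script this step, a short Python snippet (using the same file names and text as the steps above) creates both files:

```
# Create the two sample input files for the tutorial.
files = {
    'sixteenwords.txt': 'These are just some words there are sixteen.\n' * 2,
    'eightwords.txt': 'These are just some words there are eight.\n',
}
for name, text in files.items():
    with open(name, 'w') as f:
        f.write(text)
```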

Create a Cloud Object Storage instance and required buckets

Next, you run some data analysis against objects that are stored in a Cloud Object Storage bucket. Start by creating the Cloud Object Storage service and its buckets.

  1. Go to the IBM Cloud Object Storage page in the IBM Cloud Catalog.

  2. Choose a Service name and click Create.

  3. Click Buckets on the left menu and type a bucket name such as words. Choose the resiliency and the location. For this example, use Regional resiliency in the us-south location.

  4. Click Create Bucket.

  5. After the bucket is created, add your two .txt files by clicking Upload in the upper right corner. You can also drag the files to the bucket.

  6. Create another bucket that PyWren uses to store results in. On the left menu, click Buckets. Provide a bucket name, such as pywrenresults. Choose the same resiliency and location options as the words bucket, and click Create Bucket.

  7. Click Endpoints on the left menu and notice the public Regional us-south endpoint. It should be something like: s3.us-south.cloud-object-storage.appdomain.cloud.

  8. Click Service Credentials on the left menu. If you do not have a service credential yet, click Create. After you have a service credential, make note of its apikey value.
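If you would rather upload the files from code than through the web UI, the COS SDK for Python works too. The following sketch is optional, not part of the required steps: it assumes that you installed the ibm-cos-sdk package (pip install ibm-cos-sdk), and the placeholder values stand in for the endpoint and credentials you noted above.

```
import ibm_boto3
from ibm_botocore.client import Config

# Placeholders: apikey and resource_instance_id come from Service Credentials;
# the endpoint URL comes from the Endpoints page.
cos = ibm_boto3.client(
    's3',
    ibm_api_key_id='<COS_API_KEY>',
    ibm_service_instance_id='<COS_INSTANCE_CRN>',
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.us-south.cloud-object-storage.appdomain.cloud',
)

# Upload the two text files to the words bucket.
for name in ('sixteenwords.txt', 'eightwords.txt'):
    cos.upload_file(Filename=name, Bucket='words', Key=name)

# List the bucket contents to confirm the upload.
print([obj['Key'] for obj in cos.list_objects_v2(Bucket='words').get('Contents', [])])
```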

Install PyWren locally

  1. Clone the pywren-ibm-cloud repository:

     git clone https://github.com/pywren/pywren-ibm-cloud.git
    
  2. Navigate to the pywren folder inside the pywren-ibm-cloud folder:

     cd pywren-ibm-cloud/pywren
    
  3. Check out the most recent stable release, which is listed on the repository’s Releases tab:

     git checkout 1.0.3
    
  4. Build and install the project:

     python3 setup.py install --force
    
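To confirm that the install succeeded, you can import the package from a Python 3 shell. (The version attribute is an assumption about the package; if it is missing, a clean import is confirmation enough.)

```
import pywren_ibm_cloud as pywren
print(pywren.__version__)  # expect '1.0.3' if the attribute exists
```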

Configure PyWren to have access to your Cloud Object Storage and Cloud Functions instances

  1. Copy the pywren/ibmcf/default_config.yaml.template into a file named ~/.pywren_config:

     cp ibmcf/default_config.yaml.template ~/.pywren_config
    
  2. Edit the ~/.pywren_config file with the information you saved earlier from Cloud Object Storage:

     ibm_cos:
         # use the full endpoint URL, including the https:// prefix,
         # for example https://s3.us-south.cloud-object-storage.appdomain.cloud
         endpoint   : <COS_API_ENDPOINT>
         api_key    : <COS_API_KEY>
    
  3. Edit the ~/.pywren_config file with the bucket name for storing the results from PyWren:

     pywren:
         storage_bucket: <BUCKET_NAME>
    
  4. You also need to give PyWren an endpoint, a namespace, and an API key from Cloud Functions. You can find that information at the API Key page.

     ibm_cf:
         # Obtain all values from https://console.bluemix.net/openwhisk/learn/api-key

         # endpoint is the value of 'host'
         # make sure to use https:// as prefix
         endpoint    : <CLOUD_FUNCTIONS_API_ENDPOINT>
         # namespace is the value of CURRENT NAMESPACE
         namespace   : <CLOUD_FUNCTIONS_NAMESPACE>
         api_key     : <CLOUD_FUNCTIONS_API_KEY>
    
  5. Save the file.
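As an optional sanity check, the following sketch (assuming PyYAML is installed: pip install pyyaml) parses ~/.pywren_config and reports any of the keys from the steps above that are missing or empty:

```
import os
import yaml

# Load the PyWren config and check the keys edited in the steps above.
with open(os.path.expanduser('~/.pywren_config')) as f:
    cfg = yaml.safe_load(f)

required = {
    'ibm_cos': ('endpoint', 'api_key'),
    'pywren': ('storage_bucket',),
    'ibm_cf': ('endpoint', 'namespace', 'api_key'),
}
for section, keys in required.items():
    missing = [k for k in keys if not cfg.get(section, {}).get(k)]
    if missing:
        print('Missing in {}: {}'.format(section, ', '.join(missing)))
```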

Deploy PyWren to Cloud Functions

The PyWren main action is responsible for executing Python functions inside PyWren’s runtime environment within Cloud Functions.

To deploy the runtime, navigate into the runtime folder and then run the deploy_runtime script:

```
cd ../runtime
./deploy_runtime
```

This script automatically creates a Python 3.6 action named pywren_3.6, which is based on the python:3.6 Docker image. PyWren uses this action to run your functions on Cloud Functions.

Create Python code for running a word count at scale

  1. Create a file named word_counter.py.

  2. Copy and paste the following code into the file:

     import pywren_ibm_cloud as pywren
    
     bucketname = 'words'
    
     def my_map_function(bucket, key, data_stream):
         print('I am processing the object {}/{}'.format(bucket, key))
         counter = {}
    
         data = data_stream.read()
    
         for line in data.splitlines():
             for word in line.decode('utf-8').split():
                 if word not in counter:
                     counter[word] = 1
                 else:
                     counter[word] += 1
    
         return counter
    
     def my_reduce_function(results):
         final_result = {}
         for count in results:
             for word in count:
                 if word not in final_result:
                     final_result[word] = count[word]
                 else:
                     final_result[word] += count[word]
    
         return final_result
    
     chunk_size = 4*1024**2  # 4MB
    
     pw = pywren.ibm_cf_executor()
     pw.map_reduce(my_map_function, bucketname, my_reduce_function, chunk_size)
     print(pw.get_result())
    
  3. Update the bucketname variable (around line 3) to point to your own bucket containing the .txt documents, which you created earlier.

  4. Inspect the code. You can see a map function that counts each instance of a particular word and a reduce function that combines those partial results into one final count. PyWren runs each of the map functions as a separate cloud function, so when you process a large data set, this approach can greatly speed up the computation by running it in parallel. As you can see, the code that kicks off these parallel functions is straightforward (a minimal single-invocation example appears after this list):

     pw = pywren.ibm_cf_executor()
     pw.map_reduce(my_map_function, bucketname, my_reduce_function, chunk_size)
    
  5. Run the Python file that you just created:

     python3 word_counter.py
    
  6. You should see the following result, with a count for each word:

     {'These': 3, 'are': 6, 'just': 3, 'some': 3, 'words': 3, 'there': 3, 'sixteen.': 2, 'eight.': 1}
    
  7. You can check out the invocations on the Monitor page of the IBM Cloud Functions UI. As shown in the following screen capture, you should see three new invocations: one for each of the .txt documents that are processed, and one that combines the results.

    Monitor page in the Cloud Functions dashboard

  8. You can also check out the results in the pywrenresults bucket that you created, in the Cloud Object Storage dashboard:

    Results in the Cloud Object Storage dashboard
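As promised in step 4, here is the smallest possible PyWren call for comparison with map_reduce: one function, one invocation, run on Cloud Functions through the same executor API. (The double function here is just an illustration, not part of the word count example.)

```
import pywren_ibm_cloud as pywren

def double(x):
    # Runs as a single Cloud Functions invocation.
    return x * 2

pw = pywren.ibm_cf_executor()
pw.call_async(double, 21)
print(pw.get_result())  # 42
```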

Summary

In this tutorial, you set up PyWren and used it to scale out functions that count the occurrences of words in a set of text documents stored in a Cloud Object Storage instance. While this is a simple use case, you can use it as a basis for your next project. We’re excited to see how you use PyWren to build on IBM Cloud Functions.