Distributed REST Calls to Watson Services using REST Data Source on Apache Spark


Unstructured data continues to grow at an exponential rate and businesses, across industries, are actively exploring or leveraging Artificial Intelligence (AI) technologies to extract insights from the vast amounts of data they have access to. IBM Watson offers a variety of ready-to-consume and customizable AI services that are designed to extract insights from unstructured data, be it in the form of audio (speech), images, or text.

Most often, we find that businesses capture the best insights by integrating Watson AI services in an analytics environment optimized for the complete life cycle of data. IBM Watson Data Platform offers such an optimized analytics environment with Data Catalog, Data Refinery, Data Science Experience, and Watson Machine Learning that enables data professionals to handle big data through all the stages of its life cycle from collection to curation, analysis, enrichment, insights, and deployment of trained machine learning (ML) models.

The Watson AI services provide developers with the tools required to analyze unstructured data (speech, text, images, video, etc.) and extract metadata (data about data), which empowers businesses to extract insights from the vast amounts of data they own. Businesses can leverage such insights to advance various objectives, such as improving customer service, understanding the sentiment or tone of customer interactions, and personalizing the customer experience. The following blogs/tutorials discuss two common use cases that leverage Watson AI services within Data Science Experience:

  • Discover hidden Facebook usage insights: This pattern combines the power of a Jupyter Notebook, PixieDust, and IBM Watson cognitive services to glean useful marketing insight from a vast body of unstructured Facebook data. To help improve brand perception, product performance, customer satisfaction, and audience engagement, the notebook shows how to take data from a Facebook Analytics export, enrich it with Watson Visual Recognition, Natural Language Understanding, and Tone Analyzer, and create interactive charts to outline the findings.
  • Extract insights from social media posts with Watson and Spark in Data Science Experience: This blog steps through the complete journey of acquiring data, curating and cleansing the data, analyzing and visualizing the data, and enriching the data using Watson AI services to drive value. It illustrates how businesses can drive value by understanding consumers’ sentiment towards their brand and delivering better personalized messaging to consumers based on their personalities and their social media presence.

In this blog, we focus on one fundamental topic: the performance and scalability of solutions that leverage Watson AI services. To address it, we examine three different approaches to using Watson AI services to enrich the unstructured text of sample tweets, and we discuss the trade-offs of each approach.

For illustration purposes, we start with a data set of collected tweets that mention popular singers, and then we enrich these tweets using Watson Natural Language Understanding (NLU) to extract sentiment and the top referenced keywords. Given a piece of text, a URL, or HTML content, Watson NLU can extract several features, including sentiment, emotion, keywords, entities, concepts, relations, semantic roles, categories, and metadata.

In practice, businesses can follow the same outlined approach to extract insights from various types of data, be it the emotional tone of calls in their call-center recordings or the sentiment expressed in tweets and other social media posts that mention their brands. If you’d like to try the notebooks below and need to collect sample tweets on a topic of interest, consult this sample notebook.

To better compare and contrast the performance, we published three notebooks to this github repository for calling NLU to enrich the tweets. We compare three variations:

  • A local NLU notebook that runs on your own machine.
  • A Data Science Experience (DSX) NLU notebook that runs on the Apache Spark engine in DSX.
  • A DSX NLU notebook that uses the REST Data Source for Apache Spark extension to distribute the NLU calls across Spark executors.

The rest of the blog and notebooks assume the tweets are available in a DB2 Warehouse on the Cloud instance, which is a fully-managed, enterprise-class cloud data warehouse service available on IBM Cloud.

Local NLU Notebook

The first approach we examine is running a Jupyter notebook locally on your machine. To do so, you need to install the IBM DB driver on your machine so that you can connect to the DB2 Warehouse on the Cloud instance. Following the instructions in this blog, we install the Python ibm_db driver. The nlu-local.ipynb notebook includes cells to install ibm_db and set the DYLD_LIBRARY_PATH environment variable, but if you prefer, you can run these steps from your command-line terminal as follows:

  • On your command line terminal, install ibm_db using pip:
pip install ibm_db

This installs the IBM DB clidriver under your Python site-packages directory.

  • On a Mac, set the DYLD_LIBRARY_PATH environment variable to point to the lib and lib/icc folders of clidriver.
export DYLD_LIBRARY_PATH=<path-to-python-site-packages>/clidriver/lib:<path-to-python-site-packages>/clidriver/lib/icc:$DYLD_LIBRARY_PATH
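To confirm that the driver and the library path are set up correctly, you can run a quick check in Python (a minimal sketch; the printed module path will vary by environment):

# Sanity check: import the driver after installing ibm_db and setting DYLD_LIBRARY_PATH.
# If the import fails with a library-loading error, re-check the clidriver paths above.
import ibm_db
print(ibm_db.__file__)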


Next, you can start a Jupyter notebook and execute the Python code in the different cells to:

  • Connect to our DB2 Warehouse on the Cloud instance and load the data into a Pandas dataframe.
# Create a DB2 connection to your DB2 Warehouse on the Cloud instance.
import ibm_db
import ibm_db_dbi
import pandas

conn = ibm_db.connect("DATABASE=$dbname;HOSTNAME=$hostname;PORT=$port;PROTOCOL=$protocol;UID=$userid;PWD=$password;", "", "")
pconn = ibm_db_dbi.Connection(conn)
df = pandas.read_sql('SELECT * FROM $table', pconn)

# Our data set consists of a table called DASH6296.DSX_CLOUDANT_SINGERS_TWEETS.
# You can find the required credentials under the Service Credentials tab of your DB2
# Warehouse instance. You can also simply copy the dsn field and run conn = ibm_db.connect(dsn, "", "")
  • Next, we enrich the tweets using Watson NLU by importing the required libraries for the Watson Python SDK and calling NLU with the sentiment and keywords features.
# To get NLU credentials, create a Natural Language Understanding service on IBM Cloud
# by following the instructions on the NLU documentation page.
credentials_json = {
    "nlu_url": "YOUR_NLU_URL",
    "nlu_username": "YOUR_NLU_USERNAME",
    "nlu_password": "YOUR_NLU_PASSWORD",
    "nlu_version": "2017-02-27"
}
  • Take a sample of the data. As the NLU Lite plan (free tier) has limits on the number of calls allowed per day, we need to take a sample of N records from the data set and query NLU for that set (the example notebook takes a sample of N=500 records).
  • Run NLU enrichment on all the records in the sample (a minimal sketch of this step is shown after this list).
  • Compute and record the total time it takes for running NLU enrichment.
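
For reference, below is a minimal sketch of the enrichment step. It assumes the watson_developer_cloud Python SDK and the credentials_json dictionary defined above; the TEXT column name and the sampling call are illustrative and may differ from the actual notebook code.

# A minimal sketch (not the exact notebook code): enrich each sampled tweet with
# sentiment and keywords using the Watson Python SDK.
from watson_developer_cloud import NaturalLanguageUnderstandingV1
from watson_developer_cloud.natural_language_understanding_v1 import Features, SentimentOptions, KeywordsOptions

nlu = NaturalLanguageUnderstandingV1(
    username=credentials_json["nlu_username"],
    password=credentials_json["nlu_password"],
    version=credentials_json["nlu_version"])

sample_df = df.sample(n=500)       # sample of N=500 records, as in the example notebook
enrichments = []
for text in sample_df["TEXT"]:     # "TEXT" is an assumed column name for the tweet text
    response = nlu.analyze(
        text=text,
        features=Features(sentiment=SentimentOptions(), keywords=KeywordsOptions(limit=5)),
        language="en")
    # Depending on the SDK version, the result is a dict or needs .get_result();
    # it includes the document sentiment and a list of keywords.
    enrichments.append(response)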

The nlu-local.ipynb notebook includes all the outlined steps; it only needs the DB2 Warehouse and NLU credentials. Running this on a local MacBook Pro took 211.86 seconds to complete.

Data Science Experience NLU Notebook

The second approach we present involves leveraging IBM Cloud Data Science Experience (DSX) and its Apache Spark engine to execute a Jupyter notebook that enriches tweets with the sentiment and keywords returned by the Watson NLU service.

Create Data Science Experience on IBM Cloud

First, we need to create an instance of Data Science Experience on the IBM Cloud by executing the steps referenced in the DSX documentation.

Setup Data Science Experience

Execute the following steps to launch DSX, set up a new project, create a data connection to your DB2 Warehouse on the Cloud instance, and run the notebook steps to enrich the tweets with Watson NLU keywords and sentiment.

  • Launch DSX.
  • Create a new project; call it nlu-dsx.
  • Create a data connection to the DB2 Warehouse on the Cloud.
  • Read data from DB2 Warehouse into a Spark dataframe.
  • Specify the Watson NLU credentials.
  • Execute some steps for data curation.
  • Map the Spark dataframe to a Pandas Dataframe.
  • Execute the code for Watson NLU enrichment.

Note that you can actually import the notebook from our github repository directly into your DSX project and execute the different cells in the notebook. You will only need to provide the DB2 Warehouse and Watson NLU credentials.
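
For illustration, the read and conversion steps above could look like the following minimal sketch using Spark's generic JDBC reader; the JDBC URL, user, and password are placeholders, and the connection code that DSX generates for you may differ.

# A minimal sketch (assumptions: the DB2 JDBC driver is available to Spark;
# the URL, user, and password below are placeholders).
tweets_spark_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:db2://<hostname>:<port>/<dbname>") \
    .option("driver", "com.ibm.db2.jcc.DB2Driver") \
    .option("dbtable", "DASH6296.DSX_CLOUDANT_SINGERS_TWEETS") \
    .option("user", "<userid>") \
    .option("password", "<password>") \
    .load()

# Map the Spark dataframe to a Pandas dataframe for the NLU enrichment step.
tweets_pandas_df = tweets_spark_df.toPandas()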

Lastly, record the execution time required for running Watson NLU on the sample of N records from the original tweets. Our run reports a total time of 72.75 seconds.

This approach offers multiple advantages over the first approach:

  • The notebook runs on an Apache Spark engine rather than your local machine.
  • You can leverage built-in connectors for DB2 Warehouse on the Cloud. DSX also offers connectors for most common data sources, including third-party sources.
  • You can add collaborators to your project seamlessly. In this simple example there was no need for collaborators, but consider a scenario where a data engineer owns curating and enriching the data while a data scientist owns analyzing the enriched data and building predictive ML models on top of it. In DSX, the data engineer would run the notebook we referenced and then simply add the data scientist as a collaborator to pick up the enriched data and continue with the analysis and insights.

Data Science Experience NLU Notebook with REST Data Source Extension

While the second approach showed improved performance, the solution thus far does not really leverage the power of distributed computing offered by an Apache Spark engine. This is a well-known challenge when combining Spark with REST APIs, as highlighted in this blog, which provides good insight into the limitations of the second approach and presents a Data Science extension for calling REST-based APIs/services. Our third approach illustrates how to leverage the REST Data Source for Apache Spark extension to call the Watson NLU service to enrich the text.

We’ll illustrate the third approach using the nlu-dsx-spark-REST.ipynb notebook running on DSX. But first, you need to upload the REST Data Source extension code to your Apache Spark engine by following the steps outlined in the REST Data Source for Apache Spark github repository. Specifically, execute the steps under the “Using Rest Data Source in IBM Data Science Experience (DSX)” section:

  • First, get your free account in Data Science Experience (DSX).

Use this link to get your free account for Data Science Experience. By default, it comes with a Spark cluster where you can try out this Data Source. This also automatically creates an IBM Cloud account for you.

  • Next, create your first project in Data Science Experience (DSX).

Go to the Get Started link and create a new project. Creating the project also creates a new Spark service. Note the name of that Spark service.

  • Get the credentials for your Spark service.

Log in to IBM Cloud and check the Dashboard. You should see the name of your Spark service (the service offering is Apache Spark).

Click your Spark service. This opens a window with a Service Credentials link in the left pane. Click it to go to the credentials window, where you will see the available service credentials. Click ‘View Credential’ to display the credentials in JSON format. Copy the JSON string for use in the next step.

  • Upload the jar file to Data Science Experience (DSX).

First, download the jar file for this library from the release link of the parent repository Data-Science-Experience. Follow the steps outlined in the section ‘Accessing the binary/jar file for this library already available in the release’ of this document.

Alternatively, you can create the library by following the steps mentioned in the section ‘Building the jar file’ of this document.

Next, you can upload the jar file to Data Science Experience by following the guidance in this link.

The typical command for the upload looks like the one below. The values of tenant_id, tenant_secret, instance_id, and cluster_master_url come from the credentials JSON you copied in the last step. The file spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar is created in the target folder when you run the build instructions, and you need to run the command below from that folder. This uploads the jar file to your Spark instance:

curl \
    -X PUT \
    -k \
    -u ${tenant_id}:${tenant_secret} \
    -H "X-Spark-service-instance-id: ${instance_id}" \
    --data-binary "@./spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar" \
    ${cluster_master_url}/tenant/data/libs/spark-datasource-rest_2.11-2.1.0-SNAPSHOT.jar

Next, import the nlu-dsx-spark-REST.ipynb notebook into your DSX project and execute the steps in the notebook, which consist of:

  • Load data from DB2 Warehouse into a Spark dataframe.
  • Explore and curate the data to remove unneeded columns and make sure the text doesn’t include any unwanted characters.
  • Take a sample of N records from the data.
  • Set up the parameters for the REST Data Source. At a minimum, you need to specify the REST endpoint url, the HTTP method, and the credentials (username, password) for the REST API (in this case, Watson NLU). The REST Data Source for Apache Spark github repository includes details on the different parameters that control the functionality of this extension (under the Features section).
  • Run Watson NLU enrichment on the sample data using the REST Data Source (a minimal sketch of this call is shown after this list).
  • Collect and record the time required to extract sentiment and keywords from the N sample tweets.
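
For illustration, here is a minimal sketch of the REST Data Source call. The data source class and option names follow the examples in the REST Data Source for Apache Spark github repository at the time of writing (see its Features section for the authoritative list); the temporary table, column names, and the exact way the NLU parameters are passed are assumptions for illustration only.

# A minimal sketch: fan out NLU calls across Spark executors with the REST Data Source.
from pyspark.sql.functions import lit

# tweets_sample_df is assumed to be the sampled Spark dataframe from the earlier steps.
# Each row of the input table supplies one set of request parameters for the REST call.
nlu_input_df = tweets_sample_df.select("text").withColumn("features", lit("sentiment,keywords"))
nlu_input_df.createOrReplaceTempView("nlu_input_tbl")

nlu_params = {
    "url": "https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27",
    "input": "nlu_input_tbl",
    "method": "POST",
    "userId": credentials_json["nlu_username"],
    "userPassword": credentials_json["nlu_password"],
    "partitions": "10",
    "connectionTimeout": "2000",
    "readTimeout": "10000"
}

# The REST Data Source distributes the calls across executors and returns one row per
# input row, with the API response attached as a nested column.
enriched_df = spark.read \
    .format("org.apache.dsext.spark.datasource.rest.RestDataSource") \
    .options(**nlu_params) \
    .load()
enriched_df.printSchema()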

In our run, we recorded 42.45 seconds for enriching ~500 records with Watson NLU using the third approach. In general, the execution time shows approximately a 2x improvement in total time, which is expected given that the Apache Spark engine we’re using includes 2 executors. As we add more executors, we expect further improvement.

Conclusion

As observed, leveraging the Spark engine in Data Science Experience with the REST Data Source extension provides a scalable analytics solution that uses the Spark executors to call Watson NLU and enrich a large number of unstructured text entries. It is worth noting that the approaches we presented apply equally well to all the Watson AI services (as well as other REST services).

Lastly, we note that many APIs enforce concurrency and rate limits that cap how many requests they can serve simultaneously. While the third approach presents a scalable solution that leverages the distributed Spark engine, the number of enrichments it can perform in a given period of time is still bounded by the concurrency and rate limits of the API itself.

Learn more about our Code Patterns used for data analytics
