Joseph Kozhaya, Rafi Kurlansik | Published August 9, 2017
Artificial intelligence, Data science, Knowledge discovery, Platform as a Service, Python
Data Science Experience (DSX) is now Watson Studio. Although the name has changed and some images may show the previous name, the steps and processes in this tutorial will still work.
According to statistics from Excelacom, every internet minute (as of April 2016) sees 20.8 million WhatsApp messages, 2.4 million Google searches, and 347,222 tweets. Much of this massive volume of data is unstructured: text, speech, images, and video. To manage and use this data, a new computing paradigm is needed, a cognitive computing paradigm that helps extract insights from big data.
Cognitive computing systems are defined as systems that learn and interact naturally with humans to help them analyze, understand, and extract insights from big data. The IBM Watson Developer Cloud is a platform that offers a wide variety of cognitive services that are designed to extract knowledge from unstructured data in all possible formats: text, speech, and images. Several cognitive solutions have used the Watson Developer Cloud services to address various business problems such as:
In several cognitive solutions, we find the most impactful results are achieved by combining Watson Developer Cloud services with analytics solutions that are optimized for big data. In this tutorial, we explain how to develop a cognitive solution that combines Watson Developer Cloud services with custom machine learning solutions using IBM Watson Studio. This tutorial references the Python TwitterInsightsWatsonDSX notebook, which you upload and run in Watson Studio.
To be able to complete the tutorial, you will need:
You will need the following Twitter credentials to run through this tutorial.
If you already have these Twitter credentials, you can skip this section. If not, here are quick instructions to get these credentials. To begin, you must have a Twitter account. If you don’t have an account, you can sign up for one at https://twitter.com/signup. After you have a Twitter account, run the following steps:
At this point, you should have the required credentials to connect to Twitter’s APIs.
To work with this tutorial, you must have an IBM Cloud account so that you can provision Watson and database cloud services. To create an IBM Cloud account:
Now that you have an IBM Cloud account, you can run the following steps from your terminal to create the required services for this tutorial, namely Natural Language Understanding and Personality Insights.
cf create-service natural-language-understanding free dsxnlu
cf create-service-key dsxnlu svcKey
cf service-key dsxnlu svcKey
cf create-service personality_insights lite dsxpi
cf create-service-key dsxpi svcKey
cf service-key dsxpi svcKey
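The `cf service-key` command prints a short human-readable header line followed by the credentials as JSON. A minimal sketch for pulling the username and password out of that output in Python (the exact key names are an assumption based on the classic username/password-style Watson service keys; newer keys may use an `apikey` field instead):

```python
import json

def parse_service_key(cf_output):
    """Extract credentials from `cf service-key` output.

    The command prints a header line before the JSON payload,
    so skip ahead to the first '{' and parse from there.
    """
    json_start = cf_output.index("{")
    creds = json.loads(cf_output[json_start:])
    return creds["username"], creds["password"]

# Example with the kind of output `cf service-key dsxnlu svcKey` produces
# (values are placeholders, not real credentials):
sample = '''Getting key svcKey for service instance dsxnlu ...

{
  "password": "xxxxxxxx",
  "url": "https://gateway.watsonplatform.net/natural-language-understanding/api",
  "username": "xxxx-xxxx"
}'''
user, pwd = parse_service_key(sample)
```

You can paste the extracted username and password directly into the credentials cell of the notebook.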
To create a Watson Studio service:
Alternatively, you can launch Watson Studio by pointing your browser to https://www.ibm.com/cloud/watson-studio, selecting Sign In (top right), and logging in with your IBM Cloud user name and password.
This tutorial looks at the problems of brand analytics, user segmentation, and personalized messaging. Our solution involves collecting social media posts that reference a brand, understanding the sentiment toward the brand, and segmenting the consumers based on multiple parameters such as number of followers, number of posts, sentiment, and personality profile. Given this more granular segmentation, the brand manager and marketing teams can then provide targeted messaging and marketing to reach consumers in a more personal way.
Because we want to keep this tutorial generic, we won't reference any specific brand. Instead, we'll collect data on three popular musicians and run our analysis on that data.
Before we dive into the details of the solution, we'll describe the various tools and services that we use. Specifically, we rely on Twitter, the Watson Developer Cloud, Db2 Warehouse on Cloud, and Watson Studio. In this section, we cover each of these tools and services individually.
Social media are computer-mediated technologies that make it easier to create and share information, ideas, and thoughts by using virtual communities and networks. Some of the popular social media platforms include Facebook, Twitter, LinkedIn, Pinterest, and Snapchat.
It’s become a common practice for brands to connect with their consumers and better understand how these consumers perceive the brand by listening to what’s being said on social media. Social media listening refers to collecting social media posts from various platforms and analyzing them to understand overall consumer perception.
In this tutorial, we collect Twitter data on three popular musicians (@katyperry, @justinbieber, and @taylorswift13). It is worth noting that my approach can work with other social media platforms or other data sources where consumers share their opinions on brands, events, or entities of interest.
The IBM Watson Developer Cloud is a platform of cognitive services that enables you to build cognitive solutions to extract insights from big data. Watson Developer Cloud services offer a wide range of capabilities to understand and extract insights from unstructured data, including text, speech, and images. In this tutorial, we use Natural Language Understanding to extract the sentiment and keywords expressed in tweets, and Personality Insights to build personality profiles of the users sharing those tweets.
IBM Db2 Warehouse on Cloud is a database that is designed for performance and scale and is compatible with a wide range of tools. The massively parallel processing (MPP) options enable increased performance and scale by adding more servers to your cluster. The dynamic in-memory columnar store technology minimizes I/O and delivers an order-of-magnitude speedup compared to row-store databases.
IBM Watson Studio is a cloud-based social workspace that helps you create, consolidate, and collaborate on building solutions for capturing insights from data across multiple open source tools such as R, Python, and Scala. IBM Watson Studio helps data explorers use a rich set of open source capabilities to analyze large data sets and collaborate with colleagues in a social collaborative data-driven environment.
Your Watson Studio account includes an Apache Spark service (provisioned on IBM Cloud) by default. Apache Spark is a fast open source cluster computing engine for efficient large-scale data processing. Apache Spark technology enables programs to run up to 100 times faster in memory than Hadoop MapReduce, or 10 times faster on disk. Spark consists of multiple components:
In Watson Studio, you can use Spark for your Python, Scala, or R notebooks.
Watson Studio includes a rich set of community-contributed resources such as data science articles, sample notebooks, public data sets, and various tutorials that make it easy to use Watson Studio and Apache Spark.
Additionally, your Watson Studio account includes an Object Storage service that is provisioned on IBM Cloud under a free plan that includes one service instance with a limit of 5 GB of storage. (The Object Storage plan can be upgraded without disruption.) The Object Storage service provides an unstructured cloud data store where you can store your files, including images, documents, and more.
To summarize, Watson Studio provides a social collaborative environment where you can upload large data sets into an Object Storage service and use the fast Apache Spark computing engine to efficiently explore, analyze, visualize, and extract insights from large structured and unstructured data sets. It also offers an easy and seamless connection to GitHub where you can upload and share your notebooks. Watson Studio’s community feature makes it easy to share and explore various notebooks, data sets, and tutorials that are built by all Watson Studio community members.
In this tutorial, we focus on using Watson Studio to build Python notebooks to analyze Twitter data and integrate that data with Watson Developer Cloud services.
The following two notebooks will help you quickly get started with Python and Apache Spark:
For reference, a Jupyter notebook is a web-based environment for interactive computing. You can run code and view results of your computation interactively. Notebooks include all building blocks needed to work with data, including the data, the code to process the data, visualization of results, and text and rich media to document your solution and enhance your understanding.
The following image shows the solution architecture: tweets are collected from Twitter and saved into a Cloudant database. The Cloudant database is warehoused into Db2 Warehouse on Cloud, which is then imported into Object Storage. The notebook in Watson Studio ingests data from Object Storage and uses Spark for data curation, analysis, and visualization. Furthermore, the notebook connects to Watson services (Natural Language Understanding and Personality Insights) to enrich the tweets and extract sentiment, keywords, and user personality traits. Finally, the notebook uses Spark MLlib to cluster the users based on several features, including personality traits.
Given these user clusters, the application can identify the right message to send to users.
The first step is always to acquire relevant data, understand it, and process it into the right format. As mentioned earlier, for this tutorial we collect social media data, specifically Twitter data that references three musicians: @katyperry, @justinbieber, and @taylorswift13. Next, we explore the data to get a better understanding of what it represents. We look at the schema and visualize the data to understand it better. After that, we run some preprocessing to get the data in an adequate format for further processing.
There are various third-party services for acquiring Twitter data such as Twitter GNIP.
In this tutorial, we use Twitter Streaming APIs to collect tweets mentioning “@katyperry,” “@justinbieber,” or “@taylorswift13” and process them to capture metadata of interest before saving them to a Cloudant database as described in the https://github.com/joe4k/twitterstreams notebook.
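The metadata of interest can be pulled out of each raw tweet before it is written to Cloudant. A minimal sketch of that flattening step, assuming the standard Twitter Streaming API payload fields (`text`, `created_at`, and the nested `user` object):

```python
def extract_tweet_metadata(tweet):
    """Flatten a raw tweet dict to the fields analyzed later in the notebook."""
    user = tweet.get("user", {})
    return {
        "text": tweet.get("text", ""),
        "created_at": tweet.get("created_at"),
        "user_screen_name": user.get("screen_name"),
        "user_followers_count": user.get("followers_count", 0),
        "user_statuses_count": user.get("statuses_count", 0),
    }

# Toy payload in the Streaming API shape (values are illustrative):
raw = {
    "text": "Loving the new single by @katyperry!",
    "created_at": "Wed Jul 05 20:14:09 +0000 2017",
    "user": {"screen_name": "fan123", "followers_count": 250,
             "statuses_count": 4300},
}
record = extract_tweet_metadata(raw)
```

Each flattened record maps naturally onto a Cloudant JSON document, which is why the warehouse step later produces clean relational columns.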
After the tweets are in a Cloudant database, we follow the instructions referenced previously to create a Db2 Warehouse on Cloud warehouse for that Cloudant database. To proceed with this tutorial, you must have a Db2 Warehouse on Cloud service instance that is populated with the tweets you collected mentioning “@katyperry,” “@justinbieber,” or “@taylorswift13.”
Make sure to note the name of the Db2 Warehouse on Cloud service instance you’re using as a warehouse for your Cloudant database to host all the tweets.
Assuming that you have collected tweets in a Db2 Warehouse on Cloud database, you can proceed by running the following steps:
So far, you’ve created a new project in Watson Studio and created a connection to a Db2 Warehouse on Cloud service instance that includes approximately 200,000 tweets that mention “@katyperry,” “@justinbieber,” or “@taylorswift13” between 05-12 July, 2017. In this tutorial, I’m limiting the number of tweets. In practice, you can collect millions of tweets and run the analysis on those.
Next, I’ll run some analytics to evaluate and explore the tweets data.
Now that we have collected relevant tweets and ingested them into a Spark DataFrame, I’ll focus on enriching the data by using Watson Developer Cloud services.
In particular, we extract sentiment and keywords in tweets by using the Watson Natural Language Understanding service. We also use Watson Personality Insights to extract personality profiles for the users sharing these tweets.
nlu = watson_developer_cloud.NaturalLanguageUnderstandingV1(
    version=nlu_version,
    username=nlu_username,
    password=nlu_password)
To get Natural Language Understanding credentials, you must provision a Watson Natural Language Understanding service on IBM Cloud as explained previously. For reference, you can find detailed instructions on the Natural Language Understanding Getting started page.
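With the service provisioned, each tweet can be sent to Natural Language Understanding with the sentiment and keywords features enabled, and the interesting fields pulled out of the JSON response. A minimal sketch of the parsing step, assuming the documented NLU v1 response shape; the live `nlu.analyze(...)` call is shown only in a comment because it requires credentials:

```python
def parse_nlu_response(response):
    """Pull the sentiment label/score and top keywords out of an NLU response."""
    sentiment = response["sentiment"]["document"]
    keywords = [kw["text"] for kw in response.get("keywords", [])]
    return sentiment["label"], sentiment["score"], keywords

# With a live service, the response would come from something like:
# response = nlu.analyze(text=tweet_text,
#                        features=Features(sentiment=SentimentOptions(),
#                                          keywords=KeywordsOptions(limit=5)))

# Sample response in the documented shape (values are illustrative):
sample_response = {
    "sentiment": {"document": {"label": "positive", "score": 0.83}},
    "keywords": [{"text": "new single", "relevance": 0.92},
                 {"text": "concert", "relevance": 0.78}],
}
label, score, keywords = parse_nlu_response(sample_response)
```

The label and score become the SENTIMENT columns used in the visualizations below, and the keywords feed the word cloud.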
After extracting sentiment and keywords from the unstructured data (tweets), we can use these enrichments to visualize tweet trends, sentiment, and keywords. We can do this for each brand separately to provide insights to the brand manager and marketing team on consumers’ perceptions toward the brand. We can also compare and contrast the results across brands.
Run Step 7 of the notebook to plot sentiment and trends of the tweets over time.
We separate the tweets by brand so that we can plot sentiment and trends for each brand individually; this makes it easy to compare and contrast trends, sentiment, and keywords across the three musicians.
Here are some of the visualizations that we can produce with the data we have after enriching with Watson services.
The following figure shows the overall sentiment distribution (positive, negative, neutral) of the tweets toward the three musicians.
The timeline plot in the following figure shows the trend (number of tweets) for all three brands (musicians). It also shows the positive, negative, and total number of tweets for each brand.
The keywords word cloud plot in the following figure shows the most relevant keywords that are mentioned in the tweets for the brands.
Next, we focus on the users sharing these tweets. Traditional segmentation methods might focus on creating clusters of users based on the number of tweets that they post or the number of followers that they have. In this notebook, we show how you can enrich the users’ information with their personality profile, which in turn allows you to create finer segmentation that accounts for users’ personality profiles.
To do so, we first identify all unique users who are contributing posts to the list of tweets we collected. We use Watson Personality Insights to create personality profiles for the users based on their tweets. This tutorial explains how to extract unique users based on the USER_SCREEN_NAME. Then, for each user, it shows how to use Twitter to collect enough tweets for that user, which are then passed to Personality Insights to obtain the personality profile. We limit the analysis to 100 users simply to illustrate the approach. In practice, you want to create personality profiles for all users (or maybe all users with a certain number of followers or posts). Furthermore, for each user, you want to collect a large enough sample of tweets for accurate Personality Insights results as explained in the Personality Insights documentation. In this tutorial, we limit it to 100 tweets per user.
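Once Personality Insights returns a profile for a user, the Big 5 trait percentiles can be pulled out of the response. A minimal sketch, assuming the documented Personality Insights v3 profile shape (a top-level `personality` array of traits with `name` and `percentile` fields; note that v3 labels Neuroticism as "Emotional range"):

```python
def extract_big5(profile):
    """Map a Personality Insights profile to the Big 5 trait percentiles."""
    return {trait["name"]: trait["percentile"]
            for trait in profile.get("personality", [])}

# Sample profile fragment in the documented v3 shape (values illustrative):
sample_profile = {
    "personality": [
        {"trait_id": "big5_openness", "name": "Openness", "percentile": 0.71},
        {"trait_id": "big5_conscientiousness", "name": "Conscientiousness",
         "percentile": 0.42},
        {"trait_id": "big5_extraversion", "name": "Extraversion",
         "percentile": 0.65},
        {"trait_id": "big5_agreeableness", "name": "Agreeableness",
         "percentile": 0.58},
        {"trait_id": "big5_neuroticism", "name": "Emotional range",
         "percentile": 0.33},
    ]
}
big5 = extract_big5(sample_profile)
```

These five percentiles are the extra feature columns joined onto each user record before clustering.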
Run Step 8 in the notebook to extract useful information about users such as the number of unique users in the given data set of tweets and which users expressed negative sentiment versus positive sentiment. Some useful commands include:
df.sample(False, fraction, seed)  # sample a fraction of rows without replacement
Step 8 in the notebook also extracts the Big 5 personality traits (also referred to as OCEAN) for each unique user in the sample of users you’re working with. These personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism) are helpful in better understanding the users and reaching out to them.
Now that I’ve collected and enriched relevant social media posts regarding the brand, I can use the rich set of machine learning algorithms available with Spark MLlib for user segmentation.
In particular, we use a Kmeans clustering algorithm to group users based on their personality profile, number of followers, and number of posts. To illustrate the difference, we actually run Kmeans clustering using two different feature sets, one without personality traits and one with personality traits:
Run Step 9 in the notebook to get the data into the correct format and then run the Kmeans algorithm for clustering the users. Some of the useful commands include:
Transform a column into a vector:
assembler_field = VectorAssembler(inputCols=["FIELD_NAME"], outputCol="vector_field_name")
assembled_field = assembler_field.transform(df)
assembled_field = assembled_field.select("FIELD_NAME_1","FIELD_NAME","vector_field_name")
Scale a field using MinMaxScaler:
scaler_field = MinMaxScaler(inputCol="vector_field_name", outputCol="scaled_field_name")
scalerModel_field = scaler_field.fit(assembled_field)
scaledData_field = scalerModel_field.transform(assembled_field)
Select specific features for clustering and map to a Vector:
df_noPI = df_scaled.select('SENTIMENT','SCALED_USER_FOLLOWERS_COUNT','SCALED_USER_STATUSES_COUNT')
df_wPI = df_scaled.select('SENTIMENT','SCALED_USER_FOLLOWERS_COUNT', 'SCALED_USER_STATUSES_COUNT', \
'OPENNESS', 'CONSCIENTIOUSNESS','EXTRAVERSION', 'AGREEABLENESS','NEUROTICISM')
from pyspark.mllib.linalg import Vectors
df_noPI = df_noPI.rdd.map(lambda x: Vectors.dense([c for c in x]))
df_wPI = df_wPI.rdd.map(lambda x: Vectors.dense([c for c in x]))
Kmeans Clustering (base and PI_ENRICHED):
from pyspark.ml.clustering import KMeans
baseKMeans = KMeans(featuresCol = "BASE_FEATURES", predictionCol = "BASE_PREDICTIONS").setK(5).setSeed(206)
piKMeans = KMeans(featuresCol = "PI_ENRICHED_FEATURES", predictionCol = "PI_PREDICTIONS").setK(5).setSeed(206)
baseClustersFit = baseKMeans.fit(userPersonalityDF.select("BASE_FEATURES"))
enrichedClustersFit = piKMeans.fit(userPersonalityDF.select("PI_ENRICHED_FEATURES"))
userPersonalityDF = baseClustersFit.transform(userPersonalityDF)
userPersonalityDF = enrichedClustersFit.transform(userPersonalityDF)
After creating user clusters based on both structured metadata (such as the number of followers and the number of posts) and enriched metadata that is extracted from unstructured data (such as the sentiment of the tweet and the personality traits of the users), we can run some visualizations to understand the differences between the segmentation solutions.
At a simplistic level, to illustrate that the results are different, we can plot a pie chart that shows the number of users in each cluster for both scenarios, without and with personality traits.
Run Step 10 in the notebook to show some visualizations of the Kmeans clustering solution for both scenarios, without and with personality traits extracted with Personality Insights.
The pie chart in the following figure shows the number of users in each cluster with and without personality traits. This is a very simplistic visualization to show that the clustering solutions are different when including personality traits.
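Counting cluster sizes from the prediction columns is straightforward. A minimal pure-Python sketch of the counting step (in the notebook, the cluster IDs would be collected from the BASE_PREDICTIONS and PI_PREDICTIONS columns, and matplotlib's `plt.pie` would take the resulting counts):

```python
from collections import Counter

def cluster_sizes(predictions):
    """Count how many users fall in each cluster ID, ordered by cluster ID."""
    counts = Counter(predictions)
    return [counts[k] for k in sorted(counts)]

# Illustrative cluster IDs, as if collected from a predictions column:
base_preds = [0, 1, 1, 2, 0, 3, 1, 4, 2, 1]
sizes = cluster_sizes(base_preds)
# plt.pie(sizes) would render the pie chart for this scenario.
```

Running the same count on both prediction columns gives the two pie charts being compared.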
Typically, you would visualize clusters by plotting some aggregate measure of the data, then coloring the data points based on the cluster ID. However, in the absence of aggregate metrics, we can use Principal Component Analysis (PCA) to compress the data set down to two dimensions. After performing PCA, we can plot the values of the two components on the x- and y-axes to form a scatterplot. The following figures show the clustering results with base features only (figure 1) and with both base features and personality traits (figure 2). Note that clustering results for your run might be different.
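In the notebook, this projection could be done with `pyspark.ml.feature.PCA`; the same idea in a minimal NumPy sketch (center the feature matrix, take the top two principal directions via SVD, and project):

```python
import numpy as np

def pca_2d(features):
    """Project an (n_users x n_features) matrix onto its top 2 principal components."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:2].T

# Toy example: 4 users, 3 features (e.g. followers, statuses, openness).
coords = pca_2d([[0.1, 0.9, 0.7],
                 [0.2, 0.8, 0.6],
                 [0.9, 0.1, 0.2],
                 [0.8, 0.2, 0.3]])
# coords has one (x, y) row per user; plot coords[:, 0] against
# coords[:, 1], coloring each point by its cluster ID.
```

Because the data is centered first, the scatterplot is centered on the origin, which makes visually comparing the two clustering scenarios easier.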
Given these user clusters, the brand manager and marketing teams can craft personalized messages to reach out to these users. They can track these user clusters over time to see how the users respond to various metrics such as purchase history, click patterns, or the response to different ad campaigns.
In this tutorial, we explained how you can go through the complete journey of acquiring data, curating and cleansing the data, analyzing and visualizing the data, and enriching the data to drive value. In our example scenario, the value was in delivering better personalized messaging to consumers by understanding their personalities and their social media presence. Although we used small data samples in this analysis, the referenced technology (Watson Studio, Spark, Object Storage) scales to handle big data efficiently.