Qamar un Nisa, Mike Ashley Cedric | Updated July 9, 2018 - Published January 8, 2018
Analytics, Artificial intelligence, Data science, Python, Retail
Today, much of the industry’s focus has shifted toward data. The amount of data generated and consumed grows every day, by somewhere around 5 exabytes. Everything we do generates data, whether it’s turning a light on and off or commuting from home to work. This data can be mined for insights, used to extract patterns, and used to make predictions. Data mining, or data science, is the term that has the industry abuzz: it is the process of discovering patterns, insights, and associations in data.
In this how-to guide, we’ll learn how to take a data set and build a predictive model on it to get insights. Our intended audience includes developers, general users with basic programming knowledge, and organizations that want to enhance customer experience. The guide shows how to create a predictive model in Watson Studio, a cloud-based environment for data scientists. With it, users can predict and optimize their Twitter interactions, which should lead to better traffic on their tweets.
After completing this how-to, the reader will be able to:
This tutorial should take around 45 minutes to complete.
The first thing we’ll need to do is get a set of tweets to analyze. This step walks through how to retrieve them, but if you’re not interested in doing that, we provide a sample data set:
If you’re using the sample data, then skip to Step 3.
Before we use tweepy to get tweets, we need to generate OAuth consumer and access token keys and secrets. There are various guides that show how to do this, but the Twitter UI changes over time, so it’s best to go to https://developer.twitter.com and follow along there. In the end you’ll have these keys and secrets:
Consumer API Key
Consumer API Secret
Access Token
Access Token Secret
These can be revoked and regenerated, but as with any other key, you should keep these secret.
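One way to keep the keys and secrets out of the source file is to read them from environment variables. The sketch below is only an illustration: the variable names and the load_twitter_credentials helper are hypothetical, not part of tweepy or of this tutorial’s script.

```python
import os

# Hypothetical environment variable names; any names will do as long
# as they match what you export in your shell.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_KEY",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    # collect every variable that is unset or empty
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be assigned to the credential variables in the script instead of pasting the keys into the file.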
Again, if you’re using the sample data, then skip to Step 3.
Now that we’ve got our Twitter API keys and secrets, we can use tweepy to save tweets into a CSV file. Free developer accounts on Twitter limit the number of tweets that can be retrieved, but that’s enough for our purposes.
If you don’t have Python, then download and install the latest version, and then install tweepy. This can be done using pip install tweepy, if you have pip installed.
pip install tweepy
Copy the code below into a new file and save it as tweets.py. There are a few lines to update at the top: add values to the variables for the keys, secrets, and the Twitter handle you want to analyze.
import csv

import tweepy

# Twitter API credentials
consumer_key = ""
consumer_secret = ""
access_key = ""
access_secret = ""
# Twitter handle of the account to analyze
screen_name = ""


def get_all_tweets():
    # initialize tweepy
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_key, access_secret)
    api = tweepy.API(auth)
    alltweets = []
    # request first 200 tweets, the max allowed
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    # keep grabbing tweets until the 3200 tweet limit is hit
    while len(new_tweets) > 0:
        alltweets.extend(new_tweets)
        print("...%s tweets downloaded so far" % (len(alltweets)))
        # only request tweets older than the oldest one seen so far
        oldest = alltweets[-1].id - 1
        print("getting tweets before id: %s" % (oldest))
        new_tweets = api.user_timeline(screen_name=screen_name,
                                       count=200, max_id=oldest)
    return alltweets


if __name__ == '__main__':
    # download the tweets for the account set in screen_name
    tweets = get_all_tweets()
    # transform the tweepy tweets into an array
    outtweets = [[tweet.id_str, tweet.created_at, tweet.text,
                  tweet.retweet_count,
                  tweet.favorite_count] for tweet in tweets]
    # write the csv
    with open('%s_tweets.csv' % screen_name, 'w',
              newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "text", "Retweets", "Favorites"])
        writer.writerows(outtweets)
Run the script with python tweets.py in a terminal. A CSV file will be output, containing the tweets and information about them, for example:
You can remove the id and created_at columns, and remove empty rows to clean the data a bit.
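The cleanup can also be scripted. The sketch below uses only the standard library and makes two assumptions that are not confirmed by this tutorial: that created_at was written in Python’s default datetime format (for example 2018-01-08 14:05:00), and that the hour column used in the prediction step later on is simply the hour of created_at. The function names clean_rows and clean_csv are hypothetical.

```python
import csv
from datetime import datetime

OUTPUT_COLUMNS = ["text", "Retweets", "Favorites", "hour"]


def clean_rows(rows):
    """Yield cleaned rows: no id/created_at, no empty rows, plus an hour column."""
    for row in rows:
        # skip rows that are empty or have no timestamp
        if not row or not (row.get("created_at") or "").strip():
            continue
        # assumption: created_at looks like "2018-01-08 14:05:00"
        created = datetime.fromisoformat(row["created_at"].strip())
        yield {
            "text": row["text"],
            "Retweets": row["Retweets"],
            "Favorites": row["Favorites"],
            "hour": created.hour,
        }


def clean_csv(in_path, out_path):
    """Read the CSV written by the script and write a cleaned copy."""
    with open(in_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=OUTPUT_COLUMNS)
        writer.writeheader()
        writer.writerows(clean_rows(csv.DictReader(src)))
```

Calling clean_csv with the path of the generated file and a new output path produces a smaller file containing only the columns kept above.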
IBM Watson Studio is an easy-to-use, collaborative, cloud-based environment for data scientists, where they can use tools like Scala, R, Jupyter Notebooks, etc.
Log into https://dataplatform.cloud.ibm.com/ and choose to create a New Project. The Complete option will work for this tutorial.
At the new project wizard, enter a Name and Description. You will also be required to create a new Object Storage service or choose an existing one during project creation. Once the project is created, you’ll be able to see a project overview, for example:
From the project overview, we can add an asset by clicking Add to project. In this case, we’ll click Model to add a new model.
Add to project
Give your model a Name and Description. We will also set the Model type option to Model builder and choose Manual for this exercise.
Before proceeding, we need to associate two services: an Apache Spark service and a Machine Learning service. You can use the UI to create a new one or select an existing one. For an example of how to do that with Apache Spark, refer to this IBM Code Tutorial. The steps for the Machine Learning service are the same.
We’re now going to add the CSV file to the model. Click Add Data Assets and browse to either the generated CSV file or the saved sample CSV file. The data should appear in the dashboard, for example:
Add Data Assets
Click on the Next button to continue. Loading the data may take a few minutes.
For this example we’re trying to predict the best time to send a tweet, so let’s set the Column value to predict to hour. Leave the Feature columns unchanged and set to All. The important choice here is the technique: we’ll be using the Regression technique. We’ll also be leaving the Validation Split unchanged.
Column value to predict
Note that because the column to predict is hour, which has around 20 distinct values, Watson Studio will suggest Multiclass classification. In this case, however, the best technique for our data is Regression.
We also need to add estimators. To do that, click on Add Estimators and select all available choices, then click Add.
Once we have our technique and estimators selected, we can click Next. This starts training and testing the model on the data, and will take a few minutes to complete.
The results show how accurate each estimator is, with the best-performing estimator at the top. Here it is Isotonic Regression. Click on the first one and select the Save option, for example:
Once saved, you will be redirected to an overview of the model, for example:
From here, we can create a web deployment so our model is accessible over a REST call.
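Once a deployment exists, calling it is an ordinary HTTPS POST with a JSON body. The sketch below only builds such a request and does not send it. The URL and token are placeholders for the values shown on your deployment’s page, the build_scoring_request helper is hypothetical, and the fields/values payload layout is an assumption that should be checked against the request example Watson Studio displays for your deployment.

```python
import json
import urllib.request


def build_scoring_request(scoring_url, token, fields, values):
    """Build (but do not send) a POST request for a deployed model.

    The {"fields": [...], "values": [[...]]} layout is an assumption;
    compare it with the sample request shown for your deployment.
    """
    payload = {"fields": list(fields), "values": [list(row) for row in values]}
    return urllib.request.Request(
        scoring_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# Placeholder URL and token; replace with the values from your deployment.
request = build_scoring_request(
    "https://example.com/placeholder-scoring-endpoint",
    "placeholder-token",
    ["text", "Retweets", "Favorites"],
    [["Hello world", 12, 30]],
)
# To actually send it, pass the request to urllib.request.urlopen().
```

Keeping the request construction separate from the network call makes it easy to inspect the payload before pointing it at a real endpoint.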
In this tutorial we learned how to extract a user’s tweets from Twitter and build a predictive model on them to optimize future tweeting and grow the user’s audience. The same approach to building a model in Watson Studio can be applied to any other CSV file, and the model can be deployed in a web application. We also learned how to deploy the model as a web deployment so it can be reached through REST calls.