Build a predictive model on Watson Studio using a CSV data set of tweets
Create a predictive model on IBM’s Watson Studio in this easy-to-follow guide, which uses Twitter account data to predict the optimal time to post tweets.
We live in an era in which the focus has shifted toward data. The amount of data generated and consumed grows every day, by somewhere around 5 exabytes. Everything we do generates data, whether it is turning a light on and off or commuting from home to work. This data can be mined for patterns and used to make predictions. Data mining, or data science, is the term that has set the industry abuzz: it is the process of discovering patterns, insights, and associations in data.
In this how-to guide, we’ll learn how to take a data set and build a predictive model on it to get insights. The intended audience includes developers, general users with basic programming knowledge, and organizations that want to enhance customer experience. The guide shows how to create a predictive model on Watson Studio, a cloud-based environment for data scientists. With the resulting model, a user can predict and optimize their Twitter interactions and drive more traffic to their tweets.
After completing this how-to, the reader will be able to:
- Use Watson Studio to build a predictive model from any CSV data.
- Extract user information from Twitter.
- Use Twitter data to predict and optimize their Twitter interactions.
Prerequisites
- An IBM Cloud account (sign up if you don’t have one yet)
- A Twitter account
- A Twitter Developer account
This tutorial should take around 45 minutes to complete.
Use sample data or get your own?
The first thing we need is a set of tweets to analyze. Steps 1 and 2 show how to download tweets yourself, but if you’re not interested in doing that, we provide a sample data set:
- ufone_tweets.csv: Tweets from Ufone, a phone operator, cleaned up and ready for Watson Studio. (Use this one!)
- ufone_tweets_raw.csv: Same as above, but raw, taken directly from tweepy. (Only added for completeness.)
Step 1: Getting Twitter API access (optional)
If you’re using the sample data, then skip to Step 3.
Before we use tweepy to get tweets, we need to generate the OAuth consumer and access token keys and secrets. Various guides show how to do this, but the Twitter UI changes over time, so it’s best to follow the current instructions at https://developer.twitter.com. In the end you’ll have these keys and secrets:
Consumer API Key
Consumer API Secret
Access Token
Access Token Secret
These can be revoked and regenerated, but as with any other key, you should keep these secret.
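One way to keep the keys and secrets out of source files is to read them from environment variables. The sketch below is only an illustration: the variable names are hypothetical, and the helper `load_twitter_credentials` is not part of tweepy or of the original script.

```python
import os

# Hypothetical environment variable names; any naming scheme works.
REQUIRED_VARS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_credentials(env=os.environ):
    """Return the credentials as a dict, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

The returned values can then be assigned to the credential variables at the top of the download script instead of pasting the secrets into the file.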
Step 2: Saving Tweets to CSV format (optional)
Again, if you’re using the sample data, then skip to Step 3.
Now that we’ve got our Twitter API keys and secrets, we can use tweepy to save tweets into a CSV file. Free developer accounts on Twitter limit the number of tweets that can be retrieved, but that’s enough for our purposes.
Copy the code below into a new file and save it as tweets.py. There are a few lines to update at the top: add values to the variables for the keys, the secrets, and the Twitter handle you want to analyze.
    import csv

    import tweepy

    # Twitter API credentials
    consumer_key = ""
    consumer_secret = ""
    access_key = ""
    access_secret = ""

    # Twitter handle of the account to analyze
    screen_name = ""


    def get_all_tweets():
        # initialize tweepy
        auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
        auth.set_access_token(access_key, access_secret)
        api = tweepy.API(auth)

        alltweets = []

        # request the first 200 tweets, the maximum allowed per call
        new_tweets = api.user_timeline(screen_name=screen_name, count=200)
        alltweets.extend(new_tweets)

        # keep grabbing older tweets until the 3200 tweet limit is hit
        while len(new_tweets) > 0:
            # ask only for tweets older than the oldest one seen so far
            oldest = alltweets[-1].id - 1
            print("getting tweets before id: %s" % oldest)
            new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
            alltweets.extend(new_tweets)
            print("...%s tweets downloaded so far" % len(alltweets))

        return alltweets


    def write_tweets_to_csv(tweets):
        # transform the tweepy tweets into rows for the CSV file
        outtweets = [[tweet.id_str, tweet.created_at, tweet.text,
                      tweet.retweet_count, tweet.favorite_count]
                     for tweet in tweets]

        # write the csv; newline='' avoids blank rows on some platforms
        with open('%s_tweets.csv' % screen_name, 'w', newline='', encoding='utf-8') as f:
            writer = csv.writer(f)
            writer.writerow(["id", "created_at", "text", "Retweets", "Favorites"])
            writer.writerows(outtweets)


    if __name__ == '__main__':
        # download the tweets of the account set in screen_name
        tweets = get_all_tweets()
        write_tweets_to_csv(tweets)
Run the script with python tweets.py in a terminal. It writes a CSV file containing the tweets and information about each of them.
You can remove the created_at columns, and remove empty rows to clean the data a bit.
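The cleanup can also be scripted. The sketch below assumes the column names written by the download script (id, created_at, text, Retweets, Favorites) and assumes that the hour column used later in the tutorial is the hour of day taken from created_at. The tutorial does not state how the sample file was prepared, so treat that derivation, the choice of columns to keep, and the function names as illustrative.

```python
import csv
from datetime import datetime

OUTPUT_FIELDS = ["hour", "text", "Retweets", "Favorites"]


def clean_rows(rows):
    """Drop rows without text and add an 'hour' column from 'created_at'."""
    cleaned = []
    for row in rows:
        # Skip rows that are empty or whose text is blank.
        if not row or not (row.get("text") or "").strip():
            continue
        # Assumption: created_at looks like "2018-06-01 14:05:00",
        # optionally followed by a UTC offset such as "+00:00".
        created = datetime.fromisoformat(row["created_at"])
        cleaned.append({
            "hour": created.hour,
            "text": row["text"],
            "Retweets": int(row["Retweets"]),
            "Favorites": int(row["Favorites"]),
        })
    return cleaned


def clean_csv(in_path, out_path):
    """Read the raw tweets CSV and write a cleaned copy."""
    with open(in_path, newline="", encoding="utf-8") as src:
        rows = clean_rows(csv.DictReader(src))
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=OUTPUT_FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling `clean_csv` with the path of the raw file and a new output path produces a file with one row per non-empty tweet. The hour is in whatever time zone created_at uses, so convert it first if local posting time matters.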
Step 3: Log into Watson Studio
IBM Watson Studio is an easy-to-use, collaborative, cloud-based environment for data scientists, where they can use tools such as Scala, R, and Jupyter Notebooks.
Log into https://dataplatform.cloud.ibm.com/ and choose to create a New Project. The Complete option will work for this tutorial.
In the new project wizard, enter a Description. You will also be required to create a new Object Storage service or choose an existing one during project creation. Once the project is created, you’ll see a project overview.
Next, add an asset by clicking Add to project. In this case, click Model to add a new model.
Step 4: Create a new model
Give your model a Description. Set the Model type option to Model builder and choose Manual for this exercise.
Before proceeding, we need to associate two services: an Apache Spark service and a Machine Learning service. You can use the UI to create a new one or select an existing one. For an example of how to do that with Apache Spark, refer to this IBM Code Tutorial. The process for Machine Learning is the same.
Step 5: Add data to the model
We’re now going to add the CSV file to the model. Click Add Data Assets and browse to either the generated CSV file or the saved sample CSV file. The data should then appear in the dashboard.
Click the Next button to continue. Loading the data may take a few minutes.
Step 6: Select a training technique
For this example we’re trying to predict the best time to send a tweet, so set Column value to predict to hour. Leave Feature columns unchanged, set to All. The important choice here is the technique: we’ll use Regression. We’ll also leave Validation Split unchanged.
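A validation split generally divides the data into a part used for training, a part used for testing, and a holdout part kept aside for a final check. The sketch below illustrates that general idea only. The 60/20/20 fractions and the function name `split_rows` are assumptions for illustration, not Watson Studio’s implementation, and its defaults may differ.

```python
import random


def split_rows(rows, train=0.6, test=0.2, seed=42):
    """Shuffle rows and split them into train, test, and holdout lists."""
    if not 0 < train + test < 1:
        raise ValueError("train + test must leave room for a holdout set")
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )
```

Every row ends up in exactly one of the three lists, so the model is never evaluated on rows it was trained on.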
Note that because the column to predict is hour, which has around 20 distinct values, Watson Studio will suggest Multiclass classification. In this case, however, the best technique for our data is Regression.
We also need to add estimators. To do that, click Add Estimators, select all available choices, and confirm your selection.
Once the technique and estimators are selected, click Next. This starts training and testing on the data. This step will take a few minutes to complete.
Step 7: Wrapping up
The results show how accurate each estimator is, with the best-performing estimator at the top. Here it is Isotonic Regression. Click on the first one and select the Save option.
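For readers curious about what isotonic regression does: it fits the best non-decreasing sequence to a series of values, in the least-squares sense. A common way to compute it is the pool-adjacent-violators algorithm, sketched below in plain Python. This is a teaching sketch with a hypothetical function name, not the implementation Watson Studio runs, and it assumes the input values are already ordered by the feature.

```python
def isotonic_fit(ys):
    """Return the non-decreasing least-squares fit to ys."""
    # Each block is [sum of values, count]; its fitted value is the mean.
    blocks = []
    for y in ys:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean is larger
        # than the mean of the block that was just added or merged.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand every block back to one fitted value per input point.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted
```

For example, `isotonic_fit([1, 3, 2, 4])` averages the out-of-order pair 3 and 2 and returns `[1.0, 2.5, 2.5, 4.0]`.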
Once saved, you will be redirected to an overview of the model.
From here, we can create a web deployment so our model is accessible over a REST call.
Congratulations! Your model is saved and deployed, and you can start testing it out with REST calls to the deployment.
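A REST call to a deployed model is typically a JSON POST request. The sketch below only builds such a request, using the Python standard library. The URL, the token, the bearer-token header, the feature names, and the fields/values payload shape are all assumptions for illustration; take the real endpoint, authentication details, and payload format from your deployment’s page.

```python
import json
import urllib.request

# Placeholders: replace with the values shown for your own deployment.
SCORING_URL = "https://example.com/your-deployment/online"  # hypothetical
ACCESS_TOKEN = "your-access-token"  # hypothetical


def build_scoring_request(fields, rows, url=SCORING_URL, token=ACCESS_TOKEN):
    """Build (but do not send) a JSON POST request for the deployed model."""
    # Assumed payload shape: column names plus one list of values per row.
    payload = {"fields": list(fields), "values": [list(row) for row in rows]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


request = build_scoring_request(
    ["text", "Retweets", "Favorites"],
    [["Hello world", 10, 25]],
)
# To send it, pass the request to urllib.request.urlopen() and parse
# the JSON body of the response.
```

Building the request separately from sending it makes it easy to inspect the payload before any network call is made.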
In this tutorial, we learned how to extract an account’s tweets from Twitter and build a predictive model on that data to optimize future tweeting and grow the user’s audience. The same approach to building a model on Watson Studio applies to any other CSV file. We also learned how to deploy the model as a web service so that it can be reached with REST calls, for example from a web application.