In today’s environment, much of the focus has shifted toward data. The volume of data generated and consumed grows every day, with roughly 5 exabytes added daily. Everything we do generates data, whether turning a light on or off or commuting to work. This data can yield information, which in turn can provide insights that predict behavior and reveal patterns. Data mining, or data science, is the process of discovering patterns, insights, and associations in data. This tutorial shows you how to build a predictive model on data to gather insights. You’ll learn how to create a predictive model using AutoAI on IBM® Watson™ Studio, a cloud-based environment for data scientists. Specifically, you’ll learn how to predict and optimize your Twitter interactions so that your tweets attract the most traffic.
Learning objectives
This tutorial explains how you can extract data, create a CSV file and upload it to IBM Cloud Object Storage, create a data connection from Watson Studio to IBM Cloud Object Storage, and then refine the data and use it to build, deploy, and test the predictive model with AutoAI.
After completing this tutorial, you understand how to:
- Work with IBM Cloud Functions to extract data from Twitter
- Create and upload a CSV file to IBM Cloud Object Storage from an IBM Cloud Function
- Use Watson Studio and AutoAI to build a predictive model using CSV data
- Use Twitter to predict and optimize your Twitter interactions
Prerequisites
To follow this tutorial, you need:

- An IBM Cloud account
- A Twitter Developer account (only if you want to extract your own tweets instead of using the sample data)
- Python installed locally, along with the IBM Cloud CLI
Estimated time
It should take you approximately 60 minutes to complete this tutorial.
Steps
Use sample data or get your own?
The first thing that you need is tweets to analyze. This step explains how to get these tweets. However, if you don’t want to get your own tweets, you can use the `ufone_tweets.csv` sample data set. If you use the sample data set, then skip the Twitter API access and IBM Cloud Function sections of this tutorial.
Step 1: Get Twitter API access
If you’re using the sample data, then skip to Step 2.
Before using tweepy to get tweets, you must generate your Consumer API keys. Go to your Twitter Developer account, hover over your name in the upper right, and create your app. Complete the required information.
After your app is created, select the Keys and tokens tab. You see your Consumer API key and Consumer API secret key, which you’ll use later in the tutorial. These keys can be revoked and regenerated, but as with any other key, you should keep them secret. (In this tutorial, you won’t be using the API tokens, so you can ignore them.)
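If you want to confirm that the keys work before wiring everything together, a quick check with tweepy (installed locally with `pip install tweepy`) might look like the following sketch. The placeholder keys are hypothetical; substitute the values from your Keys and tokens tab.

```python
import tweepy

# Hypothetical placeholders: use your own Consumer API keys
consumer_key = "<YOUR_CONSUMER_API_KEY>"
consumer_secret = "<YOUR_CONSUMER_API_SECRET_KEY>"

# App-only authentication, the same scheme used later in this tutorial
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
api = tweepy.API(auth)

# Fetch one tweet to verify that the credentials are valid
for status in tweepy.Cursor(api.user_timeline, screen_name="@CharlizeAfrica").items(1):
    print(status.text)
```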
Step 2: Create a Cloud Object Storage service
- Log in to your IBM Cloud account.
- Click Create Resource, and search for Object Storage.
- Choose the free Lite plan, change the name if you want, and click Create. You can now find the Cloud Object Storage instance in your resource list under Storage.
- After you open your instance, click Buckets in the left-side pane, then click Create bucket (you can choose any type of bucket). Make sure to note the name of your bucket after you create it.
- Go to Service Credentials, and select the service credential that was just created. If nothing is showing, click New credential to generate one. Click the arrow to expand the credentials, and note the `api_key`, `iam_serviceid_crn`, and `resource_instance_id`.
- Go to Endpoint, choose your resiliency and location, and note the `Private url` because you’ll need it for the other steps.
Your bucket is now ready. Make sure to have your:
- Bucket name
- API Key
- Service ID
- Resource Instance ID
- Endpoint URL
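Before moving on, you can optionally sanity-check these values with the `ibm-cos-sdk` Python package (`pip install ibm-cos-sdk`). This is a minimal sketch, assuming the placeholder credentials are replaced with the ones you noted above.

```python
import ibm_boto3
from botocore.client import Config

# Hypothetical placeholders: fill in the credentials from your service credential
cos = ibm_boto3.client(service_name='s3',
                       ibm_api_key_id='<COS_API_KEY>',
                       ibm_service_instance_id='<COS_RESOURCE_INSTANCE_ID>',
                       config=Config(signature_version='oauth'),
                       endpoint_url='https://<COS_ENDPOINT_URL>')

# List the objects in your bucket; an empty listing still proves that access works
print(cos.list_objects_v2(Bucket='<BUCKET_NAME>'))
```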
Again, if you’re using the sample data, then you can directly upload the file in your bucket and skip Step 3 (jump to Step 4).
Step 3: Create IBM Cloud Functions
This step is only valid if you started with Step 1.
IBM Cloud Functions is IBM’s Function-as-a-Service (FaaS) programming platform where you write simple, single-purpose functions, known as Actions, that can be attached to Triggers, which execute the function when a specific, defined event occurs.
Create an Action
Usually, you create Actions directly from IBM Cloud, but in this case, you want to use tweepy, an external Python library for accessing the Twitter API. Because tweepy is not included in the IBM Cloud Functions runtime environment, you must write your Python code, package it with a local virtual environment in a .zip file, and then push it to IBM Cloud.
If you don’t have Python, then download and install the latest version. After it’s installed, make sure to install `virtualenv`.

```sh
pip install virtualenv
```
Create a directory that you can use to create your virtual environment. In this tutorial, it’s named `twitterApp`.

```sh
cd desktop; mkdir twitterApp; cd twitterApp
```
From the `twitterApp` directory, create a virtual environment named `virtualenv`. Your virtual environment must be named `virtualenv`.

```sh
virtualenv virtualenv
```
From your directory (in this case, `twitterApp`), activate your `virtualenv` virtual environment.

```sh
source virtualenv/bin/activate
```
Install the `tweepy` module.

```sh
pip install tweepy
```
Deactivate the virtual environment.

```sh
deactivate
```
Copy the following code, save it to a file called `main.py` in the `twitterApp` directory, and add the corresponding credentials that you got from Step 1 (Consumer keys) and Step 2 (Cloud Object Storage credentials). Additionally, you can change the Twitter handle that you want to analyze. (This tutorial uses Charlize Theron’s Twitter handle.) This code gets the data from Twitter, creates a CSV file that contains the data, and uploads the file to the object storage service that you created at the beginning. After you run this function, a CSV file containing the tweet information is uploaded to your bucket in Cloud Object Storage.

```python
import tweepy
import pandas as pd
from botocore.client import Config
import ibm_boto3

# Twitter API credentials
consumer_key = "<YOUR_CONSUMER_API_KEY>"
consumer_secret = "<YOUR_CONSUMER_API_SECRET_KEY>"
screen_name = "@CharlizeAfrica"  # you can put your own Twitter username here

def main(dict):
    tweets = get_all_tweets()
    createFile(tweets)
    return {"message": "success"}

def get_all_tweets():
    # initialize tweepy with app-only authentication
    auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
    api = tweepy.API(auth)

    # collect up to 3,200 tweets (the user timeline limit)
    alltweets = []
    for status in tweepy.Cursor(api.user_timeline, screen_name=screen_name).items(3200):
        alltweets.append(status)
    return alltweets

def createFile(tweets):
    # keep only the fields used for the model
    outtweets = []
    for tweet in tweets:
        outtweets.append([tweet.created_at.hour, tweet.text,
                          tweet.retweet_count, tweet.favorite_count])

    # connect to Cloud Object Storage with the credentials from Step 2
    client = ibm_boto3.client(service_name='s3',
                              ibm_api_key_id="<COS_API_KEY>",
                              ibm_service_instance_id="<COS_SERVICE_ID>",
                              config=Config(signature_version='oauth'),
                              endpoint_url="https://<COS_ENDPOINT_URL>")

    # build the CSV
    cols = ['hour', 'text', 'retweets', 'favorites']
    table = pd.DataFrame(columns=cols)
    for i in outtweets:
        table = table.append({'hour': i[0], 'text': i[1],
                              'retweets': i[2], 'favorites': i[3]},
                             ignore_index=True)
    table.to_csv('tweets_data.csv', index=False)

    # upload the CSV to your bucket
    try:
        res = client.upload_file(Filename="tweets_data.csv",
                                 Bucket="<BUCKET_NAME>", Key='tweets.csv')
    except Exception as e:
        print(Exception, e)
    else:
        print('File Uploaded')
```
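Optionally, before packaging the code, you can test the entry point locally from the `twitterApp` directory (with the virtual environment activated and, for a local run, `pandas` and `ibm-cos-sdk` installed as well; the IBM Cloud runtime provides these, which is why only `tweepy` is bundled). This sketch simply calls the same function that IBM Cloud Functions invokes:

```python
# test_local.py -- a hypothetical helper, not part of the deployment package
from main import main

# IBM Cloud Functions passes the action parameters as a dict; none are needed here
print(main({}))  # expect {'message': 'success'} and 'File Uploaded' on stdout
```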
From the `twitterApp` directory, create a .zip archive of the `virtualenv` folder and the `main.py` file. These files must be in the top level of your .zip file.

```sh
zip -r twitterApp.zip virtualenv main.py
```
Push this function to IBM Cloud by logging in to your IBM Cloud account, making sure to target your organization and space. (The IBM Cloud Functions documentation covers this process in more detail.)

```sh
ibmcloud login
```
Create an action called `twitterAction` using the .zip file that you just created (right-click the file, and check Get Info on a Mac or Properties on Windows™ to get the path), specifying the entry point, which is the `main` function in the code, and the `--kind` flag for the runtime.

```sh
ibmcloud fn action create twitterAction </path/to/file/>twitterApp.zip --kind python:3.7 --main main
```
Go back to IBM Cloud, and click Cloud Functions on the left side of the window.
Click Actions, making sure that the right namespace is selected. You see the action that was created. Click it, and then click Invoke to run it.
You can also run it directly from the terminal by using the following command.

```sh
ibmcloud fn action invoke twitterAction --result
```
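If the action succeeds, the `--result` flag prints the JSON that the `main` function returns:

```json
{
    "message": "success"
}
```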
If you go to your bucket in the Cloud Object Storage service that you created at the beginning of the tutorial, you see that a `tweets.csv` file has been uploaded. This is the file that contains the tweets extracted by IBM Cloud Functions.
Create a Trigger
Now, create a Trigger that invokes your Action.
Choose Triggers from the left pane, and click Create Trigger.
Choose Periodic as the trigger type. This means that your event is time-based: the function is invoked at the specific times you define.
Name your trigger, define a timer, and click Create. In this example, the timer is set for Sundays: every Sunday at 4:00 AM GMT+4, the trigger fires and invokes the action to fetch the Twitter data and create a new CSV file with the new tweets.
Click Add to connect this trigger to the Action.
Choose the Select Existing tab, select your Action, and click Add. Now, your Action is connected to this Trigger and gets fired based on the time that you specified.
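If you prefer working from the terminal, the same periodic setup can be sketched with the built-in alarms feed; the trigger name, rule name, and cron expression below are hypothetical examples.

```sh
# Fire every Sunday at 00:00 UTC (cron fields: minute hour day-of-month month day-of-week)
ibmcloud fn trigger create twitterTrigger \
  --feed /whisk.system/alarms/alarm \
  --param cron "0 0 * * 0"

# A rule connects the trigger to the action
ibmcloud fn rule create twitterRule twitterTrigger twitterAction
```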
Step 4: Create a Watson Studio service
Similar to how you created the Cloud Object Storage service at the beginning of the tutorial, you’ll use the same process to create a Watson Studio service.
Search for Watson Studio, and select the Lite plan to create it. You can find it instantiated under services in resource summary (the main dashboard of your IBM Cloud account). Click it, and then click Get Started. This launches the Watson Studio platform.
Click Create Project, and then Create an empty project.
Name the project, and give it a description. Make sure to choose the Cloud Object Storage service that you created previously.
Step 5: Create a connection to Cloud Object Storage
Click Add to project. Here, you see the many assets that you can use in Watson Studio. You want to create a connection to your Cloud Object Storage service so that you can access the `tweets.csv` file and, through it, the data inside. You’ll use this data to build your machine learning model with AutoAI.

Click Connection to start creating your connection to your Cloud Object Storage service.
Click Cloud Object Storage.
Name your connection, and complete the information with the credentials that you got from Step 2 (Cloud Object Storage credentials). Just add the `API_KEY`, `Resource Instance ID`, and `Login URL` (which is the endpoint); you can leave the other fields empty.

Click Add to project, and click Connected data. Select your source, which is the connection created in the previous step, then select your bucket and choose the `tweets.csv` file. Name your asset, and click Create.
Step 6: Refine the data
The data is already prepared, but you must convert the hour and favorites columns to the Integer type and remove the retweets column because it won’t be used for the prediction.

Start with hour: click the 3 dots, select Convert column, and then choose Integer. Repeat the same process for favorites, and then remove the retweets column.
Click Save and create a job when you’re finished.
Name the job, and click Create and Run.
This job creates a new data set based on the one that you already have, but with your refinements applied (the converted columns and the removed retweets column). As you can see, the output of this job is a file named `Tweets_shaped.csv`. Wait until the status of the job shows Completed.
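For reference, the same refinement could be reproduced locally with pandas; this is just a sketch of what Data Refinery does for you.

```python
import pandas as pd

df = pd.read_csv('tweets.csv')

# Convert the numeric columns to integers
df['hour'] = df['hour'].astype(int)
df['favorites'] = df['favorites'].astype(int)

# Drop retweets, which is not used for the prediction
df = df.drop(columns=['retweets'])

df.to_csv('Tweets_shaped.csv', index=False)
```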
Now, you should see three assets, just like in the following image. The `Tweets_shaped.csv` file is now the main file that you’ll use in AutoAI to create your predictive model.
Step 7: Create an AutoAI experiment
Click Add to project, and choose AutoAI experiment.
Name your project, and choose a machine learning instance. This is needed so that you can deploy your model at the end. If you don’t have one, Watson Studio asks you to create one directly, and you can then proceed normally.
Add your data by selecting the `Tweets_shaped.csv` file that was generated by Data Refinery.

You want to predict the highest number of interactions that you can get when you share your tweets, so choose favorites as the prediction column. You see that the prediction type is Regression because you’re predicting a continuous value, and that the optimized metric is RMSE (root mean squared error). You can change and customize your experiment by clicking Experiment settings.
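For reference, RMSE measures the average prediction error in the same units as the target (favorites, in this case), penalizing large errors more heavily:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

where $y_i$ is the actual number of favorites and $\hat{y}_i$ is the model’s prediction.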
In Experiment settings, go to Prediction. Here, you can see all of the algorithms available to your experiment, and you can change how many of them are used. For example, if you choose 3, the experiment uses the top 3 algorithms. For every algorithm, AutoAI generates four pipelines: the first is the regular one with no enhancements, the second adds hyperparameter optimization (HPO), the third adds HPO and feature engineering, and the fourth adds HPO, feature engineering, and a second round of HPO. Because you are using three algorithms, you get a total of 12 pipelines (3 × 4 = 12), so AutoAI builds 12 candidates to find your best model.
Step 8: Build and evaluate the models
AutoAI generates the 12 candidate models for your use case, and there are different ways to understand and visualize the results. The following image shows the Relationship Map, which shows how AutoAI builds and generates the pipelines. Every color represents a type of algorithm, and each algorithm has its own four pipelines, as discussed in the previous step.
You can click Swap view to check the Progress Map, which is another way to visualize how AutoAI is generating your pipelines in a sequential way.
You can see the Pipeline leaderboard to check which model is the best. In this case, Pipeline 11 is the best model using Extra Trees Regressor with two enhancements (first HPO and Feature Engineering).
AutoAI shows you the comparison between all of these pipelines. If you click Pipeline comparison, you see a metric chart that compares your candidates.
Because Pipeline 11 is the best model, click it to get a better understanding of it. For example, you can check its Feature Importance to see which features matter most in the model’s decisions. In this example, NewFeature_0 is the most important factor for the prediction. NewFeature_0, like NewFeature_3, is a newly generated feature: a combination of different features (for example, a combination of text and favorites) created by feature engineering to enhance the model.
Step 9: Save and deploy the model
Now, save and deploy the model so that you can start using it.
Click Save as, and choose Model. This saves the model, and you can now access it from the main dashboard of your project in Assets under the Models section.
Click this newly created model, select the Deployments tab, and click Add Deployment to create the deployment (you must give it a name). This is a web deployment that can be accessed through a REST call.
Wait until the status of the deployment is Ready in the Deployments tab, then click the deployment’s name.
Step 10: Test the model
The model is now ready for you to start using.
Select the Test tab, and enter data in the fields. You can provide the data in JSON format if you prefer (this is easier when you have many fields; here, you have only two).
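For example, with the two input fields of this model, the JSON payload might look like the following sketch (the field names come from the `Tweets_shaped.csv` columns; the tweet text is a made-up example).

```json
{
    "fields": ["hour", "text"],
    "values": [[16, "Excited to share some news with you all soon!"]]
}
```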
Click Predict, and you see the result under `values`. In this example, the value is 930, meaning that Charlize (remember, you’re using Charlize Theron’s data) would probably get approximately 930 favorites if she shared a tweet at 4:00 PM (hour is 16 in this example). You can put your own username in the IBM Cloud Function if you want to predict for your own account.
If you want to implement this model in your application, click the Implementation tab. It shows the endpoint URL and code snippets in different programming languages (cURL, Java, JavaScript, Python, and Scala) that you can use in your application.
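As a rough illustration, calling the deployment from Python with the `requests` package might look like the following sketch. The scoring URL and token are placeholders (copy the real endpoint from the Implementation tab and generate an IAM access token from your IBM Cloud API key), and the exact payload shape depends on your Watson Machine Learning API version, so treat the snippet on the Implementation tab as authoritative.

```python
import requests

# Hypothetical placeholders: use the endpoint URL from the Implementation tab
# and a valid IAM access token
SCORING_URL = "<YOUR_DEPLOYMENT_ENDPOINT_URL>"
IAM_TOKEN = "<YOUR_IAM_ACCESS_TOKEN>"

payload = {"fields": ["hour", "text"],
           "values": [[16, "Excited to share some news with you all soon!"]]}

headers = {"Authorization": "Bearer " + IAM_TOKEN,
           "Content-Type": "application/json"}

response = requests.post(SCORING_URL, json=payload, headers=headers)
print(response.json())  # the predicted number of favorites appears under "values"
```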
Summary
In this tutorial, you learned to extract data from Twitter, create a CSV file that contains this data, and upload it to IBM Cloud Object Storage using IBM Cloud Functions. Then, you learned how to create a predictive model on this data to optimize future tweeting and increase the user’s audience using Watson Studio and AutoAI.