This tutorial is part of the Getting started with Watson Studio learning path.
| Level | Topic | Type |
| 100 | Introduction to IBM Watson Studio | Article |
| 101 | Data visualization, preparation, and transformation using IBM Watson Studio | Tutorial |
| 201 | Automate model building in IBM Watson Studio | Tutorial |
| 301 | Creating SPSS Modeler flows in IBM Watson Studio | Tutorial |
| 401 | Build models using Jupyter Notebooks in IBM Watson Studio | Tutorial |
This tutorial explains how to set up and run Jupyter Notebooks from within IBM® Watson™ Studio. We start with a data set for customer churn that is available on Kaggle. The data set has a corresponding Customer Churn Analysis Jupyter Notebook (originally developed by Sandip Datta), which walks through the typical steps in developing a machine learning model:
Import the data set.
Analyze the data by creating visualizations and inspecting basic statistical parameters (for example, mean or standard deviation).
Prepare the data for machine learning model building (for example, by transforming categorical features into numeric features and by normalizing the data).
Split the data into training and test data to be used for model training and model validation.
Train the model by using various machine learning algorithms for binary classification.
Evaluate the various models for accuracy and precision using a confusion matrix.
Select the model that’s the best fit for the given data set, and analyze which features have a low or a significant impact on the outcome of the prediction.
Use Watson Machine Learning to save and deploy the model so that it can be accessed outside of the notebook.
The notebook consists of 35 Python cells and requires familiarity with the main libraries it uses: scikit-learn for machine learning, NumPy for scientific computing, pandas for managing and analyzing data structures, and matplotlib and seaborn for visualizing the data.
To complete the tutorials in this learning path, you need an IBM Cloud account. You can obtain a free trial account, which gives you access to IBM Cloud, IBM Watson Studio, and the IBM Watson Machine Learning Service.
It should take you approximately 30 minutes to complete this tutorial.
The steps to set up your environment for the learning path are explained in the Data visualization, preparation, and transformation using IBM Watson Studio tutorial. These steps show how to:
- Create an IBM Cloud Object Storage service.
- Create an IBM Watson Studio project.
- Provision IBM Cloud services.
- Upload the data set.
You must complete these steps before continuing with the learning path. If you have finished setting up your environment, continue with the next step, creating the notebook.
NOTE: The Watson Machine Learning service is required to run the notebook.
Create the notebook
Create a Jupyter Notebook for predicting customer churn and change it to use the data set that you have uploaded to the project.
In the Asset tab, click Add to Project.
Select the Notebook asset type.
On the New Notebook page, configure the notebook as follows:
Select the From URL tab:
Enter the name for the notebook (for example, ‘customer-churn-kaggle’).
Select the Python 3.6 runtime system.
Enter the following URL for the notebook:
- Click Create Notebook. This initiates the loading and running of the notebook within IBM Watson Studio.
Run the notebook
The notebook page should be displayed.
If the notebook is not currently open, you can start it by clicking the Edit icon displayed next to the notebook in the Asset page for the project:
NOTE: If you run into any issues completing the steps to execute the notebook, a completed notebook with output is available for reference at the following URL: https://github.com/IBM/watson-studio-learning-path-assets/blob/master/examples/example-customer-churn.ipynb.
From the notebook page, make the following changes:
Scroll down to the third cell, and select the empty line in the middle of the cell. If the Files subpanel is not already open, click the 1001 data icon at the top of the page to open it.
In the right part of the page, select the Customer Churn data set. Click Insert to code, and select Insert pandas DataFrame. This adds code to the data cell for reading the data set into a pandas DataFrame.
Change the generated variable name df_data_1 for the data frame to df, which is used in the rest of the notebook. When displayed in the notebook, the data frame appears as the following:
Select File > Save to save the notebook.
Run the cells of the notebook one by one, and observe the effect and how the notebook is defined.
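The variable rename in the steps above can be illustrated with a minimal sketch. The code that Watson Studio generates reads the file from the project's object storage; here an in-memory CSV with hypothetical column names stands in for the uploaded file, so only the names df_data_1 and df come from the tutorial.

```python
import io

import pandas as pd

# Stand-in for the uploaded churn CSV; these columns and rows are hypothetical.
csv_text = """State,Account Length,Intl Plan,Churn
KS,128,no,False
OH,107,no,False
NJ,137,yes,True
"""

# The generated cell assigns the DataFrame to df_data_1 ...
df_data_1 = pd.read_csv(io.StringIO(csv_text))

# ... but the rest of the notebook expects the name df.
df = df_data_1

# Show the first rows, as the notebook does after loading the data.
print(df.head())
```

Renaming the variable directly in the generated line works just as well; the assignment above only makes the relationship between the two names explicit.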
Background on running notebooks
When a notebook is run, each code cell in the notebook is executed, in order, from top to bottom.
Each code cell is selectable and is preceded by a tag in the left margin. The tag format is In [x]:. Depending on the state of the notebook, the x can be:
- A blank, which indicates that the cell has never been run
- A number, which represents the relative order in which this code step was run
- A *, which indicates that the cell is running
There are several ways to run the code cells in your notebook:
- One cell at a time. Select the cell, and then press Play in the toolbar.
- Batch mode, in sequential order. From the Cell menu, there are several options available. For example, you can Run All cells in your notebook, or you can Run All Below, which starts running from the first cell under the currently selected cell, and then continues running all of the cells that follow.
- At a scheduled time. Press the Schedule button that is located in the upper-right section of your notebook page. Here, you can schedule your notebook to be run once at some future time or repeatedly at your specified interval.
Data understanding and visualization
During the data understanding phase, the initial set of data is collected. The phase then proceeds with activities that enable you to become familiar with the data, identify data quality problems, and discover first insights into the data. In the Jupyter Notebook, these activities are done using pandas and the matplotlib-based plotting functions built into pandas. The describe function of pandas is used to generate descriptive statistics for the features, and the plot function is used to generate diagrams showing the distribution of the data.
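As a rough illustration of this kind of exploration, the following sketch runs describe on a small hand-made DataFrame. The column names and values are hypothetical and are not taken from the Kaggle data set.

```python
import pandas as pd

# Small hypothetical sample; the real data set has more rows and columns.
df = pd.DataFrame({
    "day_minutes": [265.1, 161.6, 243.4, 299.4, 166.7, 223.4],
    "service_calls": [1, 1, 0, 2, 3, 0],
    "churn": ["no", "no", "no", "yes", "yes", "no"],
})

# Descriptive statistics (count, mean, std, min, quartiles, max) per numeric feature.
summary = df[["day_minutes", "service_calls"]].describe()
print(summary)

# Class distribution of the target; in a notebook,
# churn_counts.plot(kind="bar") would draw it (this requires matplotlib).
churn_counts = df["churn"].value_counts()
print(churn_counts)
```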
Data preparation
The data preparation phase covers all activities that are needed to construct the final data set that will be fed into the machine learning service. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleansing of data for the modeling tools. In the Jupyter Notebook, this involves turning categorical features into numerical ones, normalizing the features, and removing columns that are not relevant for prediction (such as the phone number of the client). The following image shows a subset of the operations.
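The three preparation tasks named above can be sketched with plain pandas. This is an illustration under assumptions: the column names are hypothetical, and z-score scaling is used as one common way to normalize, which may differ from the scaler the notebook applies.

```python
import pandas as pd

# Hypothetical raw records with an identifier column and yes/no categories.
df = pd.DataFrame({
    "phone": ["382-4657", "371-7191", "358-1921", "375-9999"],
    "intl_plan": ["no", "yes", "no", "yes"],
    "day_minutes": [100.0, 200.0, 300.0, 200.0],
    "churn": ["no", "no", "yes", "yes"],
})

# 1. Remove columns that carry no predictive information.
prepared = df.drop(columns=["phone"])

# 2. Turn categorical yes/no features into numeric 0/1 values.
for col in ["intl_plan", "churn"]:
    prepared[col] = prepared[col].map({"no": 0, "yes": 1})

# 3. Normalize the numeric feature to zero mean and unit variance.
col = "day_minutes"
prepared[col] = (prepared[col] - prepared[col].mean()) / prepared[col].std(ddof=0)

print(prepared)
```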
Modeling and evaluation
In the modeling phase, various modeling techniques are selected and applied, and their parameters are calibrated to achieve an optimal prediction. Typically, there are several techniques that can be applied, and some techniques have specific requirements on the form of the data. Therefore, going back to the data preparation phase is often necessary. In the model evaluation phase, the goal is to confirm that the model has high quality from a data analysis perspective. Before proceeding to final deployment of the model, it’s important to thoroughly evaluate it and review the steps that were executed to create it, to be certain that the model properly achieves the business objectives.
In the Jupyter Notebook, this involves splitting the data set into training and testing data sets (using stratified cross-validation) and then training several models using distinct classification algorithms such as GradientBoostingClassifier, support vector machines, random forest, and K-Nearest Neighbors.
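The following sketch shows what training the four classifier families could look like with scikit-learn. It is not the notebook's code: synthetic data from make_classification stands in for the prepared churn features, and a single stratified hold-out split (train_test_split with stratify) is used in place of the stratified cross-validation that the notebook mentions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced binary data standing in for the prepared churn features.
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=0
)

# stratify=y keeps the class ratio roughly equal in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns the mean accuracy on the held-out test data.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

Stratification matters for churn data because the churning customers are usually the minority class; an unstratified split could leave the test set with too few of them to evaluate the models meaningfully.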
Following this step, we continue with printing the confusion matrix for each algorithm to get a more in-depth view of the accuracy and precision offered by the models.
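To make the relationship between the confusion matrix and the two metrics concrete, here is a small worked example with hand-written labels (1 means churn). The labels are made up for illustration and are not results from the notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for ten customers.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

# For binary labels, the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy: share of all predictions that are correct.
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Precision: share of predicted churners that actually churned.
precision = tp / (tp + fp)

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Here 8 of 10 predictions are correct (accuracy 0.80), and 3 of the 4 customers flagged as churners actually churned (precision 0.75).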
Deploying your model to Watson Machine Learning
In the last section of the notebook, we save and deploy the model to the Watson Machine Learning service. To access the service, we need to copy and paste the Machine Learning credentials into this notebook cell:
After the model is saved and deployed to Watson Machine Learning, we can access it in a number of ways.
In the Jupyter Notebook, we can pass data to the model scoring endpoint to test it.
If we go back to the Watson Studio console, we can see in the Assets tab that the new model is listed in the Models section.
If we click on the Deployments tab, we can see that the model has been successfully deployed.
Click on the deployment to get more details. If you click the Implementation tab, you will see the scoring endpoint. In the Code Snippets section, you can see examples of how to access the scoring endpoint programmatically.
On the Test tab, we can pass in a scoring payload JSON object to score the model (similar to what we did in the notebook). After supplying the data, press Predict to score the model.
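A scoring payload is a JSON object that lists the feature names and one or more rows of values. The sketch below builds such an object with a hypothetical helper, build_scoring_payload. The "fields"/"values" layout and the feature names are assumptions for illustration; the exact shape depends on the Watson Machine Learning API version (newer versions wrap this structure in an "input_data" list), so check the Code Snippets section of your deployment for the authoritative format.

```python
import json


def build_scoring_payload(fields, rows):
    """Hypothetical helper: pair feature names with rows of values."""
    for row in rows:
        if len(row) != len(fields):
            raise ValueError("each row must have one value per field")
    return {"fields": list(fields), "values": [list(row) for row in rows]}


# Feature names and values are made up; use the columns your model was trained on.
payload = build_scoring_payload(
    ["intl_plan", "day_minutes", "service_calls"],
    [[0, 265.1, 1], [1, 120.4, 4]],
)

# Serialized form, suitable for pasting into the Test tab or sending in a request.
payload_json = json.dumps(payload, indent=2)
print(payload_json)
```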
SPSS model notebook
Now that you have learned how to create and run a Jupyter Notebook in Watson Studio, you can revisit the Scoring machine learning models using the API section in the SPSS Modeler Flow tutorial. It has instructions for running a notebook that accesses and scores the SPSS model that you deployed in Watson Studio.
This tutorial covered the basics for running a Jupyter Notebook in Watson Studio, which includes:
- Creating a project
- Provisioning and assigning services to the project
- Adding assets such as data sets to the project
- Importing Jupyter Notebooks into the project
- Loading and running the notebook
The purpose of the notebook is to build a machine learning model that predicts customer churn. Other tutorials in this learning path discuss alternative, non-programmatic ways to accomplish the same objective, using tools and features built into Watson Studio.