Overview

Skill Level: Beginner

This recipe shows various ways of predicting customer churn using IBM Watson Studio, ranging from a semi-automated approach using the Model Builder, through a diagrammatic approach using SPSS Modeler Flows, to a fully programmed style using Jupyter notebooks.

Ingredients

Software Requirements

To obtain an IBM Cloud Account and get access to the IBM Cloud and to IBM Watson Studio, please follow the instructions outlined here:

Step-by-step

  1. Introduction

    This recipe demonstrates various ways of using IBM Watson Studio to predict customer churn, ranging from a semi-automated approach using the Model Builder, through a diagrammatic approach using SPSS Modeler Flows, to a fully programmed style using Jupyter notebooks for Python.

    The recipe will follow the main steps of data science (and data mining) methods such as CRISP-DM (Cross Industry Standard Process for Data Mining) and the IBM Data Science Methodology, and will focus on the tasks of data understanding, data preparation, modeling, evaluation and deployment of a machine learning model for predictive analytics. It is based on a data set and notebook for customer churn available on Kaggle, and then demonstrates alternative ways of solving the same problem using the Model Builder, the SPSS Modeler and the IBM Watson Machine Learning service provided by IBM Watson Studio. At the same time the recipe will also dive into the use of the profiling tool and the dashboards of IBM Watson Studio to support data understanding, as well as the Refine tool to solve straightforward data preparation and transformation tasks.

    The recipe comprises the following sections:

    • Section 2 provides a short overview of the methodology and tools used as well as an introduction to the notebook on Kaggle thus setting the scene for the recipe.
    • Section 3 provides the steps needed to create and configure a project, import the artifacts and get the notebook from Kaggle running inside IBM Watson Studio.
    • Section 4 focuses on getting insights into the data set used by using the profile tool and the dashboard capabilities of IBM Watson Studio.
    • Section 5 will briefly introduce the Refine component for defining transformations. This step is optional.
    • Section 6 gets you to create and evaluate a Watson Machine Learning model with a few user interactions using the Model Builder.
    • Section 7 will continue with deployment and test of the model using the IBM Watson Machine Learning service.
    • Section 8 will repeat the steps for creating a model but using SPSS Modeler Flows and will demonstrate the capabilities of this tool for data understanding, preparation, model creation and evaluation.
    • Section 9 will let you test the SPSS model using a Jupyter Notebook for Python and the IBM Watson Machine Learning service's REST API.
  2. Setting the Scene

    IBM has defined a Data Science Methodology that consists of 10 stages that form an iterative process for using data to uncover insights. Each stage plays a vital role in the context of the overall methodology. At a certain level of abstraction it can be seen as a refinement of the workflow outlined by the CRISP-DM (Cross Industry Standard Process for Data Mining) method for data mining.

    02.01-CRISP-DM-1

    According to both methodologies, every project starts with Business Understanding, where the problem and objectives are defined. This is followed in the IBM Data Science Method by the Analytical Approach phase, where the data scientist defines the approach to solving the problem. The IBM Data Science Method then continues with three phases called Data Requirements, Data Collection and Data Understanding, which in CRISP-DM are represented by a single Data Understanding phase. Once the data scientists have an understanding of their data and sufficient data to get started, they move on to the Data Preparation phase. This phase is usually very time consuming: a data scientist spends about 80% of their time here, performing tasks such as data cleaning and feature engineering. The term “data wrangling” is often used in this context. During and after cleaning the data, the data scientist generally performs exploration, such as descriptive statistics to get an overall feel for the data and clustering to look at the relationships and latent structure of the data. This process is often iterated several times until the data scientist is satisfied with the data set. The model training stage is where machine learning is used to build a predictive model. The model is trained and then evaluated by statistical measures such as prediction accuracy, sensitivity and specificity. Once the model is deemed sufficient, it is deployed and used for scoring on unseen data. The IBM Data Science Methodology adds an additional Feedback stage for obtaining feedback from using the model, which is then used to improve it. Both methods are highly iterative by nature.

    In this recipe we will focus on the phases starting with data understanding, and from there continue with preparing the data, building a model, evaluating the model, and deploying and testing it. The purpose is to develop models that predict customer churn. Analyzing the causes of churn in order to improve the business is, on the other hand, out of the scope of this recipe. This means that we will be working with various kinds of classification models that can, given an observation of a customer defined by a set of features, predict whether this specific client is at risk of churning or not.

    For all tasks we will use IBM Watson Studio. IBM Watson Studio provides users with environment and tools to solve business problems by collaboratively working with data. Users can choose the tools needed to analyze and visualize data, to cleanse and shape data, to ingest streaming data, or to create, train, and deploy machine learning models.

    02.2-Watson-Studio

    The main functionality it offers relates to components for:

    • Create Projects to organize the resources (such as data connections, data assets, collaborators, notebooks) to achieve an analytics goal.
    • Access data from Connections to your cloud or on-premises data sources. Upload files to the project’s object storage.
    • Create and maintain Data Catalogs to discover, index, and share data.
    • Refine data by cleansing and shaping the data to prepare it for analysis.
    • Perform Data Science tasks by creating Jupyter notebooks for Python or Scala to run code that processes data and then view the results inline. Alternatively use RStudio for R.
    • Ingest and Analyze Streams data with the Streams Designer tool.
    • Create, test and deploy Machine Learning and Deep Learning models.
    • Classify images by training deep learning models to recognize image content.
    • Create and share Dashboards of data visualizations without coding.

     

    IBM Watson Studio is technically based on a variety of Open Source technology and IBM products as depicted in the following diagram:

    02.3-Watson-Studio-Architecture

    In the context of data science, IBM Watson Studio can be viewed as an integrated, multi-role collaboration platform that supports the developer, data engineer, business analyst and, last but not least, the data scientist in the process of solving a data science problem. For the developer role, other components of the IBM Cloud platform may be relevant as well when building applications that utilize machine learning services. The data scientist, however, can build the model using a variety of tools: RStudio and Jupyter Notebooks adopting a programmatic style, SPSS Modeler Flows adopting a diagrammatic style, or the Model Builder component of the IBM Watson Machine Learning service, which supports a semi-automated style of generating machine learning models. Beyond those three main components you will also get to use IBM Cloud Object Storage for storing the data set used to train and test the model, Data Refinery for transforming the data set, and IBM Watson Studio dashboards for generating visualizations. A key component is of course the IBM Watson Machine Learning service and its set of REST APIs, which can be called from any programming language to interact with a machine learning model. The focus of the IBM Watson Machine Learning service is deployment, but you can use IBM SPSS Modeler or IBM Watson Studio to author and work with models and pipelines. Both SPSS Modeler and IBM Watson Studio use Spark MLlib and Python scikit-learn and offer various modeling methods taken from machine learning, artificial intelligence, and statistics.

    02.02-Watson-Studio

    In the recipe we will start out with a dataset for Customer Churn available on Kaggle. The dataset is accompanied by a corresponding Customer Churn Analysis Jupyter Notebook from Sandip Datta that shows the archetypical way of developing a machine learning model by going through the following essential steps:

    1. Import the dataset.
    2. Analyze the data by creating visualizations and inspecting basic statistical parameters (mean, standard deviation, etc.).
    3. Prepare the data for machine learning model building, e.g. by transforming categorical features into numeric features and by normalizing the data.
    4. Split the data into training and test sets to be used for model training and model validation respectively.
    5. Train models using various machine learning algorithms for binary classification.
    6. Evaluate the various models for accuracy and precision using a confusion matrix.
    7. Select the model best fitting the given data set and analyze which features have low and which have significant impact on the outcome of the prediction.
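    The essential steps above can be sketched with pandas and scikit-learn. The following is only a minimal illustration on a small synthetic frame; the column names are borrowed from the Kaggle data set, but the values are made up:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1-2. Import and inspect the data (here: a tiny synthetic frame).
df = pd.DataFrame({
    "total day minutes": [180.1, 160.6, 243.4, 120.2, 200.0, 99.9] * 10,
    "international plan": ["no", "yes", "no", "no", "yes", "no"] * 10,
    "churn": [False, True, False, False, True, False] * 10,
})
print(df.describe())

# 3. Prepare: encode categorical features, normalize numeric ones.
df["international plan"] = (df["international plan"] == "yes").astype(int)
X = StandardScaler().fit_transform(df.drop(columns="churn"))
y = df["churn"].astype(int)

# 4. Split into training and test sets, preserving the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# 5-6. Train a binary classifier and evaluate it on the held-out data.
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

    The actual notebook goes considerably further (several algorithms, confusion matrices, feature importance), but the overall shape of the code is the same.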

     

    The notebook is defined in terms of 25 Python cells and requires familiarity with the main libraries used: scikit-learn for machine learning, numpy for scientific computing, pandas for managing and analyzing data structures, and last but not least matplotlib and seaborn for visualization of the data. An outline of the notebook is given by the screenshots in the table below (to be read row by row). More details of the notebook will be briefly covered in the next section, where you will download and run the notebook once you have created a project to manage the relevant assets:

    02.03-Data-Import

    02.04-Data-Understanding
    02.05-Data-Preparation
    02.06-Model-Training-and-Evaluation
    02.07-Model-Evaluation-1
    02.08-Model-Selection

     

    One objective of this recipe is to show how IBM Watson Studio offers, in addition to Jupyter Notebooks for Python, Scala or R, alternative ways of going through a similar process that may be faster and can be achieved without programming skills. These mechanisms are in essence SPSS Modeler Flows, which allow a data scientist to create a model purely graphically by defining a flow, and the Model Builder inside IBM Watson Studio, which goes one step beyond SPSS by providing a semi-automatic approach to creation, evaluation, deployment and testing of a machine learning model. At the same time we shall demonstrate how IBM Watson Studio provides out-of-the-box capabilities for profiling, visualizing and transforming the data, again without any programming required.

    Following the recipe you will create a project that contains the artifacts shown in the following screenshot.

    02.09-Data-Assets

    The artifacts will be created as follows:

    • Section 3 of the recipe will get you started by creating the project and importing the assets from Kaggle so that you can run the imported notebook named ‘Class – Customer Churn – Kaggle’.
    • Section 4 will let you perform tasks related to the Data Understanding phase, which includes profiling the imported data set to view the distribution and statistical measures such as minimum, maximum, mean and standard deviation for numerical features. Moreover, you will create a ‘Customer Churn Dashboard’ and a couple of visualizations.
    • Section 5 will cover the Data Preparation phase and will briefly introduce the Refine component where you will create a Data Refinery Flow to transform the input data set. This step is optional.
    • Section 6 will continue with the Modeling and Evaluation phase and will get you to create and evaluate a Watson Machine Learning model with a few user interactions using the Model Builder.
    • Section 7 will continue with Deployment and Test. You will deploy the Machine Learning model as a web service and then test it using test data presented in form of JSON objects.
    • Section 8 will repeat the steps but using SPSS Modeler Flows.
    • Section 9 will let you deploy the SPSS model and then create a Jupyter Notebook for Python that uses the IBM Watson Machine Learning service's REST API to request predictions for specific observations.
  3. Getting Started

    We will assume that you have already gained access to IBM Cloud and IBM Watson Studio (see the “Prerequisites” section at the beginning of the recipe for the links needed for registering). If in doubt about how to gain access to IBM Watson Studio, you can also follow the instructions in section 3 of the recipe “Analyze archived IoT device data using IBM Cloud Object Storage and IBM Watson Studio”.

    In this section of the recipe you will get started by doing the following:

    1. Create a project.
    2. Provision the IBM Watson Machine Learning, Apache Spark and IBM Cognos Dashboard Embedded services for later use.
    3. Download the dataset from Kaggle and import it to the project.
    4. Download, modify and run the Jupyter notebook for Python that sets the scene for this recipe.

     

    Create IBM Watson Studio Project

    To create the project do the following:

    1. Sign into IBM Watson Studio.
    2. Click Create a project.
    3. In the next page, select the Standard Project template and click Create Project.
    4. In the New Project dialog, give a name to the project such as “Watson Machine Learning” and click Create.
    5. Wait until the project has been created.

     

    Provision IBM Cloud Services

    To provision the Machine Learning Service and associate it as a service to the current project do the following:

    1. Select the Settings tab for the project at the top of the page.
    2. Scroll down to the Associated Services section.
      03.05-Add-ML-Service
    3. Click the Add Service button.
    4. Select the Watson Menu item.
    5. On the next page, select the Watson Machine Learning Service and click Add.
    6. On the next page, select the New tab to create a new service.
    7. Keep the Lite plan for now (you can change it later if necessary).
    8. Scroll down and click Create to create the service.
    9. Next, the Confirm Creation dialog will appear, letting you specify the details of the service such as the region, the plan, the resource group and the service name.
      03.07-Confirm-Creation
    10. Enter a proper name for the service instance e.g. by prefixing the generated name with “Watson Machine Learning”.
    11. Click Confirm.

     

    You may choose to use the default resource group for the services, but you may as well use a dedicated one that you have created in IBM Cloud. You can create new resource groups in IBM Cloud using the menu Manage > Account, then navigating to Account Resources > Resource Groups in the toolbar to the left. The Create button can be found in the top right corner of the page.

    Continue in a similar way to create an instance of the Apache Spark service and the IBM Cognos Dashboard Embedded service. Whenever possible, use the Lite plan, and provide the same prefix to the auto-generated service name as above.

     

    Upload Data Set

    Next download the data set from Kaggle and upload it to IBM Watson Studio:

    1. Go to the URL for the data set on Kaggle (https://www.kaggle.com/sandipdatta/customer-churn-analysis) and download the file to your local desktop.
    2. Rename the file to something more meaningful, e.g. ‘Customer Churn – Kaggle.csv’.
    3. In IBM Watson Studio, select the Assets tab.
      03.10-Upload-data-set
    4. Drag and drop the file onto the area for uploading data to IBM Watson Studio in the upper right corner of the page.
    5. Wait until the file has been uploaded.

     

    Import and Test Jupyter Notebook

    Finally create a Jupyter notebook for predicting customer churn and change it to use the data set that you have uploaded to the project.

    1. In the Asset tab, click the command Add to Project.
      03.11-Add-to-project
    2. Select the Notebook asset type.
      03.12-Enter-Notebook-Details
    3. In the New Notebook dialog, configure the notebook as follows:
      1. Select the “From URL” tab and enter ‘https://github.com/EinarKarlsen/ibm-watson-machine-learning/blob/master/Class%20-%20Customer%20Churn%20-%20Kaggle.ipynb’ as the URL for the notebook.
      2. Enter the name for the notebook, e.g. “Class – Customer Churn – Kaggle”.
      3. Select the runtime system (e.g. the default Python runtime system, which is free).
      4. Optionally, enter a short description for the notebook.
    4. Click Create Notebook.
    5. Scroll down to the third cell and select the empty line in the middle of the cell.
      03.13-Modify-Notebook
    6. In the right part of the window, select the Customer Churn data set. Click Insert to code and select Insert pandas DataFrame. This will add code to the data cell that reads the data set into a pandas DataFrame.
      03.14-Inserted-Code
    7. Change the generated variable name df_data_1 for the data frame to df, which is used in the rest of the notebook as shown above.
    8. Save the notebook by invoking File > Save.

     

    Run the cells of the notebook one by one and observe the effect and how the notebook is defined.
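    Schematically, the generated and modified cell boils down to reading the CSV into a pandas DataFrame. The exact boilerplate Watson Studio inserts depends on your object storage credentials, so the following sketch uses an in-memory CSV as a stand-in for the generated stream:

```python
import io
import pandas as pd

# Watson Studio's generated cell includes object-storage credentials and
# a streaming body; here an in-memory CSV stands in for that stream.
body = io.StringIO("state,total day minutes,churn\nKS,180.1,False\nOH,160.6,True\n")
df_data_1 = pd.read_csv(body)

# Step 7 of the instructions: rename the generated variable to `df`,
# the name used throughout the rest of the notebook.
df = df_data_1
print(df.head())
```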

  4. Data Understanding and Visualization

    During the data understanding phase, the initial set of data is collected. The phase then proceeds with activities that enable you to become familiar with the data, identify data quality problems and discover first insights into the data. In the Jupyter notebook these activities are done using pandas and its built-in matplotlib plotting functions. The describe function of pandas is used to generate descriptive statistics for the features, and the plot function is used to generate diagrams showing the distribution of the data:

    02.04-Data-Understanding
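    As a minimal illustration of these two pandas functions, on a made-up frame standing in for the Kaggle data:

```python
import pandas as pd

# describe() yields the same statistics the profiler shows, and
# value_counts() shows the class distribution of the target column.
df = pd.DataFrame({
    "total day minutes": [265.1, 161.6, 243.4, 299.4, 166.7],
    "churn": [False, True, False, False, True],
})
print(df["total day minutes"].describe())  # count, mean, std, min, quartiles, max
print(df["churn"].value_counts())          # distribution of churn vs no-churn

# In a notebook, a histogram of the feature would be drawn with:
# df["total day minutes"].plot(kind="hist")
```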

    We can achieve the same in IBM Watson Studio by simple user interactions without a single line of code by using out-of-the-box functionality. To view the data set in IBM Watson Studio, simply locate the data asset and then click the name of the data set to open it:

    04.01-Data-Preview

    IBM Watson Studio will show you a preview of the data in the Preview tab. The Profile tab, on the other hand, provides you with profiling information that shows the distribution of the values and, for numerical features, also the maximum, minimum, mean and standard deviation of the feature:

    04.2-Data-set-profile

    Notice that although the numerical columns are identified as being of type varchar, the profiler is sufficiently smart to recognize them as numerical columns, convert them implicitly and compute the mean and the standard deviation.

    To generate the profile the first time simply do the following:

    1. Select the Profile tab.
    2. Then invoke the command Create Profile.
    3. Wait a short while and then refresh the page.

     

    Notice that the churn column does not provide a balanced distribution of churn and no-churn observations, as already observed in the notebook on Kaggle. This calls for cross-validation strategies to be adopted during the model building and evaluation phase.
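    As a sketch of why stratification matters, the following splits an imbalanced label vector (an illustrative 15% churn rate, not the exact rate in the data set) with scikit-learn's StratifiedKFold, which preserves the churn/no-churn ratio in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# An imbalanced label vector: 85 no-churn (0) vs 15 churn (1) observations.
y = np.array([0] * 85 + [1] * 15)
X = np.arange(len(y)).reshape(-1, 1)  # dummy feature matrix

# Each of the 5 test folds keeps the overall 15% churn rate.
for train_idx, test_idx in StratifiedKFold(
        n_splits=5, shuffle=True, random_state=42).split(X, y):
    print("test fold churn rate:", y[test_idx].mean())
```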

    We can look further into the dataset by creating a dashboard with associated visualizations. This basically requires three steps: 1) create an empty dashboard, 2) add a data source to be used for visualizations, and 3) add appropriate visualizations to the dashboard.

    To create the dashboard do the following:

    1. Click the Add to project button at the top of the page.
    2. In the next dialog, click Dashboard to create a new dashboard.
      04.03-Create-Dashboard
    3. On the next page titled New Dashboard do the following:
      1. Enter a Name for the dashboard, e.g. ‘Customer Churn Dashboard’.
      2. Provide a Description for the dashboard (optional).
      3. As Cognos Dashboard Embedded Service, select the dashboard service that you created in the previous section.
      4. Click Save to save the dashboard.
    4. On the next page select the Freeform template.
      04.04-Free-form-diagram
    5. Keep the default setting that will create a Tabbed dashboard.
    6. Click OK to create an empty freeform dashboard with a single Tab.

     

    To add a data connection, go through the following steps:

    1. Click the “Add a source” button in the upper left part of the page:
      04.05-Select-Source
    2. On the next page select the data source named ‘Customer Churn – Kaggle.csv’.
    3. Optionally, preview the data source by clicking the eye icon to the right of the data source name.
    4. Click Select to select the data source.
    5. Back in the dashboard, select the newly imported data source.
      04.06-Data-Source-Columns
    6. Expand the data source by clicking > so that you can view the columns.

     

    Notice that you can view and change the properties of the columns. Simply click the 3 dots to the right of the column name, then select Properties in the popup menu. This will display a dialog as shown above and allow you to alter the default settings for Usage (Identifier, Attribute, Measure) and Aggregate Function (Count, Count Distinct, Maximum, Minimum etc.). For now we should be fine with the default settings.

    To create a visualization that shows the distribution of churns and no-churns as a pie chart do the following:

    1. Select the Visualizations icon in the toolbar to the left.
    2. Select a Pie chart.
    3. This will create a form for specifying the properties of the pie chart using e.g. columns of the data set.
      04.07-Create-Visualization
    4. Select the Sources icon in the toolbar to the left (it is the one located above the Visualizations icon).
    5. Drag and drop the churn column onto the Segments property of the pie chart.
    6. Drag and drop the churn column onto the Size column of the pie chart.
      04.08-Visualization-Props
    7. Click the Collapse arrow in the top right of the form as shown above. This will minimize the pie chart and render it on the dashboard.
      04.09-Initial-Dashboard
    8. Select the Tab to the top left, then click the Edit the title button.
    9. Provide a title for the tab (e.g. ‘Customer Churn’).

     

    Continue this way creating two more visualizations:

    • A Stacked Column Chart showing State (visualization property Bars) and Churn (Length, Color) on the X and Y axis respectively.
    • A Pie Chart showing the distribution of International Plan (Segments, Length).

     

    This should result in a dashboard like the one below. Notice that you can move visualizations on the dashboard using the Move widget command located at the top of each visualization:

    04.10-Final-Dashboard

    The dashboards are dynamic by nature and support exploration of the data using e.g. filters. In the visualization showing ‘International Plan’, click the slice associated with the value ‘yes’. This will create a filter that applies to all other (connected) visualizations on the current dashboard as well:

    04.11-Filtered-Dashboard

    Notice that the slice for churn in the visualization to the left has increased significantly. This tells us that clients on an international plan are more likely to churn than clients who are not. To remove the filter, simply click the filter icon for the visualization in the top right corner, then select the delete filter button that pops up as a result (the icon is a cross in a circle). Simply clicking the slice again will achieve the same effect.
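    The same kind of insight can also be checked programmatically; as a sketch, a pandas groupby on an illustrative frame (made-up values, not the actual data set) yields the churn rate per plan:

```python
import pandas as pd

# Churn rate by international plan, mirroring the dashboard filter insight.
df = pd.DataFrame({
    "international plan": ["yes", "yes", "no", "no", "no", "no", "yes", "no"],
    "churn": [True, False, False, False, True, False, True, False],
})
rates = df.groupby("international plan")["churn"].mean()
print(rates)  # a higher mean for "yes" indicates a higher churn rate
```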

  5. Data Preparation and Transformation using Refine

    The data preparation phase covers all activities needed to construct the final dataset that will be fed into the machine learning service. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record, and attribute selection, as well as transformation and cleaning of data for the modeling tools. In the original notebook on Kaggle this involved turning categorical features into numerical ones, normalizing the features, and removing columns not relevant for prediction (such as the phone number of the client). A subset of the operations is shown below:

    02.05-Data-Preparation
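    A minimal sketch of these preparation steps in pandas and scikit-learn; the column names mirror the Kaggle data set, but the values are made up:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative frame with the same kinds of columns as the Kaggle data.
df = pd.DataFrame({
    "phone number": ["382-4657", "371-7191", "358-1921"],
    "international plan": ["no", "yes", "no"],
    "total day minutes": [265.1, 161.6, 243.4],
    "churn": [False, True, False],
})

# Remove columns with no predictive value, such as the phone number.
df = df.drop(columns=["phone number"])

# Turn categorical features into numeric ones ...
df["international plan"] = (df["international plan"] == "yes").astype(int)

# ... and normalize the numeric features (zero mean, unit variance).
df[["total day minutes"]] = StandardScaler().fit_transform(df[["total day minutes"]])
print(df)
```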

    If we would just like to create a model semi-automatically or fully automatically using the IBM Watson Model Builder and Machine Learning service, no more activity would actually be needed during data preparation (for the current data set), since the Model Builder service takes care of such operations under the hood. We will show how this is done in the next section.

    However, IBM Watson Studio offers a service called Data Refinery that allows us to clean up and transform data without any programming required. We will briefly introduce the service so that you can get a feeling for how it works. However, this step is not strictly necessary for the process:

    1. Click the Add to project button in the top bar of the page.
    2. In the Choose asset type dialog, select Data Refinery Flow to create a new flow.
    3. On the next page, select the Customer Churn data set and click Add.
    4. This will open up the data source for you so that you can transform and view it.
      05.02-Initial-Data-Set

     

    Notice the tabs in the top left, which provide you with capabilities for viewing the data in tabular form, for profiling it (as in the previous section) and for creating custom visualizations of the data.

    To transform the data do the following:

    1. Select the 3 dots in the “phone number” column and invoke the Remove command in the pull-down menu. This will delete the column.
    2. Select the “total day minutes” feature column. This is really a String type but should be numeric.
    3. Click the Operations button in the upper left corner. This will show you some available transformations:
      05.03-Transformation-Operations

     

    You could for example convert the column to another type (say float or integer). However, we will not do this for now, since the Machine Learning service will do it for us behind the scenes automatically. In principle, though, you could decide to turn the “total day minutes” column into an integer column and round it to show zero decimals, or alternatively convert it into a floating-point type. For now, let's just continue executing the flow just defined and view the result:

    1. Click the Run Data Refinery flow button in the toolbar. Its icon is an arrow head.
    2. On the next page you can give a name to the flow as well as the resulting output file. However, leave the default names for now.
    3. Click Save and Run to run the flow.
    4. In the next dialog named “What’s Next?” select the View Flow command.
      05.04-Flow-Information

     

    The resulting window shows the input file, the output file and the runs. Notice that there is also a tab where you can schedule the flow so that it is executed automatically. Go back to your project and check that the output file and the flow are now part of your project assets.

    Data Refinery Flows allow a user to perform quick transformations of data without the need for programming. It is of course in no way a replacement for Jupyter notebooks and the powerful capabilities of libraries such as numpy and pandas, but for a quick cleanup process it comes in quite handy. For more complex transformations and computations, one should revert to other means such as Jupyter notebooks or SPSS Modeler flows (which we will cover in a later section).
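    For illustration, the kind of type conversion discussed above can be expressed in pandas in a few lines (a sketch on made-up values, not part of the Refinery flow itself):

```python
import pandas as pd

# "total day minutes" arrives as strings; the same conversions the
# Refinery operations offer can be expressed directly in pandas.
col = pd.Series(["265.1", "161.6", "243.4"])

as_float = col.astype(float)             # convert to a floating-point column
as_int = as_float.round(0).astype(int)   # or round to zero decimals and convert to integer
print(as_float.tolist(), as_int.tolist())
```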

  6. Modeling and Evaluation using the IBM Watson Studio Model Builder

    In the modeling phase, various modeling techniques are selected and applied, and their parameters are calibrated to achieve an optimal prediction. Typically, there are several techniques that can be applied, and some techniques have specific requirements on the form of the data. Therefore, going back to the data preparation phase is often necessary. In the subsequent evaluation phase, the goal is to ensure that the model has high quality from a data analysis perspective. Before proceeding to final deployment of the model, it is important to thoroughly evaluate it and review the steps executed to create it, to be certain the model properly achieves the business objectives.

    In the Jupyter notebook on Kaggle this boiled down to splitting the data set into training and testing data sets (using stratified cross-validation) and then training several models using distinct classification algorithms such as Gradient Boosting Classifier, Support Vector Machines, Random Forest and K-Nearest Neighbors:

    06.01-Model-Training
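    Schematically, training and comparing these four classifiers with scikit-learn looks as follows; a synthetic data set stands in for the churn data so the sketch is self-contained:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the churn set.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# The four algorithm families used in the notebook.
models = {
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Support Vector Machine": SVC(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

# Train each model and report its accuracy on the held-out test set.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```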

    Following this step, model evaluation continued by printing out the confusion matrix for each algorithm to get a more in-depth view of the accuracy and precision offered by the models:

    06.02-Model-Evaluation
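    A minimal confusion-matrix computation with scikit-learn, on made-up labels, looks like this:

```python
from sklearn.metrics import confusion_matrix, precision_score

# A confusion matrix breaks predictions down into true/false
# positives and negatives; illustrated on a small example.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: actual class, columns: predicted class
print("precision:", precision_score(y_true, y_pred))
```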

    Using the Model Builder of IBM Watson Studio we can get to a model and an evaluation of its accuracy a bit faster, without any programming required. The Model Builder in IBM Watson Studio is an interactive tool that guides you, step by step, through building a machine learning model by uploading training data, choosing a machine learning technique and algorithms, and finally training and evaluating the model.

    To create a new model using IBM Watson Studio do the following:

    1. Select the Assets tab for your IBM Watson Studio project.
    2. Locate the Models section and invoke the command New Watson Machine Learning model.
      06.03-Create-Model
    3. In the New Model dialog:
      1. Enter the Name of the machine learning model (e.g. ‘Customer Churn – Manual’).
      2. Select the Watson Machine Learning service that you created in section 3 as the Machine Learning Service.
      3. For the Runtime, select the Apache Spark service that you created in section 3.
      4. Specify Manual as the approach for training the models.
    4. Click Create.
    5. On the next page titled “Select data asset”, simply select the data set that you imported in section 3 (you do not need to use the file that was preprocessed using Refine in the previous section).
    6. Click Next which will take you to the next page where you can select the Machine Learning algorithms to be used for the classification.
      06.05-Machine-Learning-Configuration
    7. On the page titled Select a technique, do the following:
      1. Select ‘churn’ as the column value to predict.
      2. Leave the default of using all feature columns for the prediction.
      3. Select Binary Classification.
      4. Keep the default settings for the test-validation-hold-out split of the data set.
      5. On the top right of the page select Add Estimators.
      6. Select Random Forest Classifier and click Add.
      7. Repeat the same step for Gradient Boosted Tree Classifier.
    8. Click Next and wait until the models have been trained.
      06.06-Model-Evaluation
    9. Evaluate the model performance in terms of the area under the ROC and PR curves. The figures may be slightly different from the figures shown above, but the performance of the two estimators should be the same (in the range from Excellent to Good).
    10. Keep Random Forest Classifier as the selected approach and click Save to save the model.
    11. Should IBM Watson Studio ask you for confirmation, e.g. whether to save the model or not, click Save.
    12. The resulting page will provide you with information about the model and its evaluation results.

    06.07-Model-Evaluation-Metrics

    The model evaluation report does not provide exactly the same set of classification approaches and evaluation metrics as the Jupyter notebook did, but it arrived at a result significantly faster.
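    In a notebook, comparable evaluation metrics can be computed with scikit-learn. Below is a minimal sketch on synthetic data; the data and estimator names are illustrative, not the recipe's actual assets:

```python
# Sketch: computing the areas under the ROC and PR curves for a trained
# classifier, as the Model Builder reports them. Synthetic data only.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Area under ROC curve:", roc_auc_score(y_test, probs))
print("Area under PR curve:", average_precision_score(y_test, probs))
```
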

    I find the Model Builder component of IBM Watson Studio extremely useful for creating an initial machine learning model that can be evaluated with respect to prediction performance, and tested as well, without time-consuming programming efforts. The simple performance rating delivered by the service (Excellent, Good, Fair, Poor) is also helpful in getting an initial idea of whether the data set at hand is at all useful for the purpose that we intend to use it for. Another advantage, which can be observed from the page above, is that it is possible to configure performance monitoring of the model. This will provide you with the ability to monitor the execution of the model as it is used and to retrain the model on the fly as feedback data are gathered. For an example of how to do this, see the tutorial “Build, deploy, test, and retrain a predictive machine learning model” or the video “Build a Continuous Learning Model” that is part of the IBM Watson Machine Learning course on developerWorks.

    You can try out this way of using the Model Builder by creating a model using a data set for customer churn that is available in the IBM Watson Studio community. Do the following to get this data set into your project:

    1. Select the Community tab in the toolbar of IBM Watson Studio.
    2. Enter ‘Telco’ as search term.
    3. Select the filter icon titled All filters.
    4. Enable ‘Data Sets’ only so that you only see the data sets.
    5. Select the ‘Customers of a telco including services used’ dataset.
    6. Click the + button in the right bottom corner to import the dataset into your project.
    7. Select your project in the Add to project menu.
    8. Click Add and wait for the import to finish.
    9. Select the View Project button to get back to your project.
    10. Select the Assets tab to get back to the page that shows your assets and locate the imported data asset.

     

    You can now continue quickly with data understanding and model building. Open the imported data set to view the attributes. Then repeat the steps to build a model from this data set using a binary classification estimator and ‘churned’ as the target attribute. Wait a few minutes and you will get the feedback on the performance of the estimators. It is likely to be Poor for the given data set.

    06.09-Model-Evaluation

  7. Deployment and Test using the IBM Watson Machine Learning Service

    According to the IBM process for Data Science, once a satisfactory model has been developed and is approved by the business sponsors, it is deployed into the production environment or a comparable test environment. Usually it is deployed in a limited way until its performance has been fully evaluated.

    With the Model Builder and Machine Learning service of IBM Watson Studio, we can deploy a model in three different ways: as a web service, as a batch program or as a real-time streaming prediction. In this recipe we shall simply deploy it as a web service and then continue immediately by testing it interactively.

    To deploy the model do the following within the resulting model evaluation page from the previous step. Alternatively, locate the model in the Model section of the Assets tab for the project and click the name of the model to open it:

    1. Select the Deployments tab.
    2. Click Add Deployments in the upper right part of the page.
    3. On the Create Deployment page do the following:
      07.01-Create-Web-Deployment
      1. Enter a Name for the deployment, e.g. ‘Customer Churn – Manual – Web Deployment’.
      2. Keep the default Web Service Deployment type setting.
      3. Enter an optional Description.
    4. Click Save to save the deployment.
    5. Wait until IBM Watson Studio sets the STATUS field to DEPLOYMENT_SUCCESS.
      07.02-Deployed-Model

     

    The model is now deployed and can be used for prediction. However, before using it in a production environment it may be worthwhile to test it using real data. This can be done interactively or programmatically using the API of the IBM Watson Machine Learning service. We shall look into using the API in an upcoming section of the recipe and will continue in this section by testing the model interactively.

    The Model Builder provides you with two options for testing the prediction: by entering the values one by one in distinct fields (one for each feature), or by specifying the feature values as a JSON object. We shall use the second option since it is the most convenient one when tests are performed more than once (which is usually the case) and when a large set of feature values is needed. To get hold of a predefined test data set, do the following:

    1. Download the test data from GitHub in the file ibm-watson-machine-learning/Customer Churn Test Data.txt.
    2. Open the file and copy the value.

     

    Notice that the JSON object defines the names of the fields first, followed by a sequence of observations to be predicted – each in the form of a sequence:

    {"fields": ["state", "account length", "area code", "phone number", "international plan", "voice mail plan", "number vmail messages", "total day minutes", "total day calls", "total day charge", "total eve minutes", "total eve calls", "total eve charge", "total night minutes", "total night calls", "total night charge", "total intl minutes", "total intl calls", "total intl charge", "customer service calls"], "values": [["NY",161,415,"351-7269","no","no",0,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4]]}

    Be aware that some of the features such as state (and phone number) are expected to be in the form of strings (which should be no surprise), whereas the true numerical features can be provided as integers or floats as appropriate for the given feature. 
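    If you prefer to assemble such a payload in code rather than editing the JSON by hand, it can be built as a plain Python dictionary; the field names and values below are taken from the example above:

```python
import json

# Field names as expected by the deployed model (from the JSON object above).
fields = ["state", "account length", "area code", "phone number",
          "international plan", "voice mail plan", "number vmail messages",
          "total day minutes", "total day calls", "total day charge",
          "total eve minutes", "total eve calls", "total eve charge",
          "total night minutes", "total night calls", "total night charge",
          "total intl minutes", "total intl calls", "total intl charge",
          "customer service calls"]

# One observation to score: strings for categorical features such as state
# and phone number, ints/floats for the numerical ones.
observation = ["NY", 161, 415, "351-7269", "no", "no", 0, 332.9, 67, 56.59,
               317.8, 97, 27.01, 160.6, 128, 7.23, 5.4, 9, 1.46, 4]

payload = {"fields": fields, "values": [observation]}
print(json.dumps(payload))
```
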

    To test the model at runtime do the following:

    1. Select the deployment that you just created by clicking the link named by the deployment (e.g. ‘Customer Churn – Manual – Web’).
    2. This will open a new page providing you with an overview of the properties of the deployment (e.g. name, creation date, status).
    3. Select the Test tab.
      07.03-Test-Deployment
    4. Select the icon above that allows you to enter the values using JSON.
    5. Paste the JSON object in the downloaded ‘Customer Churn Test Data.txt’ file into the Enter input data field.
    6. Click the Predict button.
      07.04-Test-Results

     

    The result of the prediction is given in terms of the probability that the customer will churn (True) or not (False). You can try it with other values, e.g. by substituting the values with values taken from the ‘Customer Churn – Kaggle.csv’ file. Another test would be to change the phone number to e.g. “XYZ” and then run the prediction again. The result of the prediction should be the same, since the phone number carries no information relevant to churn.

  8. Modeling and Evaluation using the SPSS Modeler Flows

    IBM Watson Studio Modeler flows provide an interactive environment for quickly building machine learning pipelines that flow data from ingestion to transformations and model building and evaluation – without needing any code.

    We shall briefly introduce the component in this section of the recipe by going through the following steps:

    • Create a new model flow from an existing model flow on GitHub.
    • Change the model flow's input file and then run it.
    • Get into the main details of the flow to understand how it works and what kind of features the modeler flow provides for defining machine learning pipelines and models.
    • Deploy the flow to the IBM Watson Machine Learning service.

     

    Once the model has been deployed we will test it in the next section using a Jupyter notebook for Python.

    To create an initial machine learning flow, do the following:

    1. From the Assets page, click Add to project.
    2. In the Choose asset type dialog, select Modeler Flow.
    3. On the next page titled Modeler, select the ‘From File’ tab.
      08.01-Create-Flow
    4. Download the model flow named ‘Customer Churn Flow.str’ from https://github.com/EinarKarlsen/ibm-watson-machine-learning.
    5. Drag and drop the downloaded modeler flow file into the upload area. This will also set the name for the flow (see above screenshot).
    6. Change the name and provide a description for the machine learning flow if you like (optional).
    7. Click Create. This opens the Flow Editor that can be used to create a machine learning flow.

     

    You have now imported an initial flow that we will explore in the remainder of this section.

    08.02-Initial-Modeler-Flow

    You can get an overview of the various supported modeling techniques from the Palette to the right of the page. The first one is the Auto Classifier, which will try several techniques and then present you with the results of the best one.

    The main flow itself defines a pipeline consisting of several steps:

    • A Data Asset node for importing the data set.
    • A Type node for defining meta data for the features, including a selection of the target attribute for the classification.
    • An Auto Data Prep node for preparing the data for modeling.
    • A Partition node for partitioning the data into a training set and a testing set.
    • An Auto Classifier node called ‘churn’ for creating and evaluating the model.

     

    Additional nodes have been associated with the main pipeline for viewing the input and output respectively. These are:

    • A Table output node called ‘Input Table’ for previewing the input data.
    • A Data Audit node called ’21 fields’ (default name) for auditing the quality of the input data set (min, max, standard deviation etc.).
    • An Evaluation node for evaluating the generated model.
    • A Table output node called ‘Result Table’ for previewing the results of the test prediction.

     

    We will go through the details one by one in the remainder of this section before we finally deploy the model to the IBM Watson Machine Learning Service. But first you will need to run the flow, and before doing so you must connect the flow to the appropriate test data set available in your project. Consequently, do the following:

    1. Select the 3 dots of the Data Asset node to the left of the flow (the input node).
    2. Invoke the Open command from the menu. This will show the attributes of the node in the right part of the page.
      08.03-Data-Asset-Properties
    3. Click the Change data asset button to change the input file.
    4. On the next page, select your CSV file containing customer churn and click OK.
    5. Click Save.
    6. Click the Run button (the arrow head) in the toolbar to run the flow.
      08.04-Output-Section

    Running the flow will create a number of outputs or results that can be inspected in more detail.

    If we follow the flow in the original Jupyter notebook on Kaggle, then the first step following data import is to view the data. To achieve this do the following:

    1. Select the Input Table node.
    2. Select the 3 dots in the upper right corner and invoke the Preview command from the popup menu.

     

    The last interaction may run part of the flow again but has the advantage that the page provides a Profile tab for profiling the data and a Visualization tab for creating dashboards:

    08.05-Preview-Data-Set

     

    The Jupyter notebook then continues by providing a description of each column, listing their minimum, maximum, mean and standard deviation, amongst others. To achieve a similar result with the current flow, do the following:

    1. Select the command View outputs and versions from the top right of the toolbar.
      08.04-Output-Section
    2. Select the Output tab.
    3. Double-click the output for the node named “21 Fields”. Alternatively, select the 3 dots associated with the output and invoke Open from the popup menu.

     

    This will provide you with the following overview:

    08.06-Automatic-Profiling

    For each feature it shows the distribution in graphical form and whether the feature is categorical or continuous. For numerical features the computed min, max, mean, standard deviation and skewness are shown as well. From the column named Valid we observe that there are 3333 valid values, meaning that no values are missing for the listed features, so we do not need any further preprocessing to filter or transform columns with missing values.
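    For comparison, the Kaggle notebook obtains the same kind of summary with pandas. A minimal sketch on a toy frame follows; in the notebook you would read the actual CSV file with pd.read_csv('Customer Churn - Kaggle.csv') instead:

```python
import pandas as pd

# Toy frame standing in for the churn data set; the column names mimic
# two of its numerical features.
df = pd.DataFrame({
    "account length": [161, 128, 84],
    "total day minutes": [332.9, 160.6, 243.4],
})

# Non-missing values per column -- the equivalent of the 'Valid' column
# in the Data Audit node.
print(df.count())

# Min, max, mean, standard deviation (and quartiles) per numeric column.
print(df.describe())
```
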

    You can actually change the initial assessment of the features made by the import using the Type node which happens to be the next node in the pipeline. To achieve this do the following:

    1. Go back to the Flow Editor by selecting ‘Customer Churn Flow’ in toolbar.
    2. Select the Type node.
    3. Invoke the Open command from the popup menu.

     

    This will provide a table showing the features (i.e. fields), their kind (continuous, flag etc.) and role, amongst others:

    08.07-Type-Node

    The Measure can be changed if needed using this node and it is also possible to specify the role of a feature. In this case the role of the churn feature (which is a Flag with True and False values) has been changed to Target. The Check column may give you more insight into the values of the field.

    The Jupyter notebook continued by transforming categorical fields into numerical ones using label encoders and by normalizing the fields. The same can be achieved with very little work using the Auto Data Prep node. To continue, simply:

    1. Click Cancel to close the property editor for the Type node.
    2. Select the Auto Data Prep node in the flow editor.
    3. Invoke Open from the popup menu.

     

    This node offers a multitude of settings, e.g. for defining the objective of the transformation (optimize for speed or for accuracy).

    08.08-Auto-Data-Prep

    The screenshot above shows that the transformation has been configured to exclude fields with too many missing values (the threshold being 50) and to exclude fields with too many unique categories. I assume that the latter applies to the phone numbers and have therefore decided not to worry more about them.
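    For comparison, the notebook-style equivalent of this preparation step, label encoding plus normalization, can be sketched with scikit-learn on toy data; the column names below are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy data mimicking one categorical and one numerical churn feature.
df = pd.DataFrame({
    "international plan": ["no", "yes", "no", "yes"],
    "total day minutes": [332.9, 160.6, 243.4, 299.4],
})

# Encode the categorical field as integers (what the notebook's
# label encoders do).
df["international plan"] = LabelEncoder().fit_transform(df["international plan"])

# Normalize the numerical field to zero mean and unit variance.
df[["total day minutes"]] = StandardScaler().fit_transform(df[["total day minutes"]])

print(df)
```
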

    The next node in the pipeline is the Partition node, which splits the data set into a training set and a testing set. For the current Partition node an 80-20 split has been used:

    08.09-Partition-Node

    Having transformed and partitioned the data, the notebook continues by training the model. In the Modeler flow this is achieved by the Auto Classifier node, which provides various settings, e.g. for ranking and discarding the generated models (using a threshold accuracy).

    08.10-Model-Node

    Notice that the property Default number of models to use is set to 3, which is the default value. Feel free to change it to 5 and then click Save to save the changes.

    To get more details about the generated model do the following:

    1. Select the yellow model icon.
    2. Invoke the View Model command from the menu.

     

    This overview section will provide you with a list of 3 selected classifier models and their accuracy.

    08.11-Model-Evaluation

    The estimator with the least accuracy is the C&R Tree Model. To dive into the details do the following:

    1. Select the name C&RT (it is a link).
    2. On the next page select the Tree Diagram link to the left to get the tree diagram for the estimator.

     

    You can now hover over either one of the nodes or one of the branches in the tree to get more detailed information about the decision made at a given point:

    08.16-Tree-Diagram

    Go back by clicking the left arrow in the top left corner. Then select the Random Tree estimator to get the details for that estimator:

    08.11-Model-Evaluation-Details

    The confusion matrix shows the distribution for the training data set. There are other tabs, e.g. for getting more detailed metrics regarding the model evaluation, as well as the predictor importance.

    Notice that the current pipeline performs a simple split of test and training data using the Partition node. It is also possible to use cross validation and stratified cross validation to achieve slightly better model performance but at the cost of complicating the pipeline. We refer to the article ‘k-fold Cross-validation in IBM SPSS Modeler‘ by Kenneth Jensen for details on how this can be achieved.
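    For reference, the stratified variant mentioned above takes only a few lines in a notebook; a minimal scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced data standing in for churn/no-churn.
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=42)

# Stratified 5-fold cross-validation: each fold preserves the class
# proportions of the full data set.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y, cv=cv)

print("Accuracy per fold:", np.round(scores, 3))
print("Mean accuracy:", scores.mean())
```
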

    Showing predictor importance was the last step in the original notebook on Kaggle. To get that information for the Random Tree classifier select the Predictor Importance tab to the left:

    08.12-Model-Predictor-Importance
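    In a notebook the same information is available from a trained estimator's feature_importances_ attribute; a minimal scikit-learn sketch on synthetic data, with illustrative feature names:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=42)
feature_names = [f"feature_{i}" for i in range(5)]

model = RandomForestClassifier(random_state=42).fit(X, y)

# Importances sum to 1; sorting descending mirrors the predictor
# importance chart.
importance = pd.Series(model.feature_importances_, index=feature_names)
print(importance.sort_values(ascending=False))
```
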

    There are two more ways of viewing the results of the evaluation.

    1. Go back to the flow editor for the Customer Churn Flow.
    2. Select View outputs and versions from the top toolbar.
    3. Select the output named ‘Evaluation of [$XF-churn] : Gains’ by double-clicking it.

     

    You will see the generated outputs for the model. Moreover, select the output node named Evaluation, then double-click it to get the Gains information:

    08.13-Gains

    After you create, train, and evaluate a model, you can deploy it.

    08.18-Save-Branch-as-Model-1

     

    To deploy the SPSS model do the following:

    1. Go back to the flow editor for the model flow.
    2. Select the output node shown above (or one of the other output nodes).
    3. Invoke the command ‘Save branch as model’ from the popup menu.
    4. A new window opens.
      08.19-Save-Model
    5. Type a model name, e.g. ‘Customer Churn – SPSS Model’.
    6. Click Save.
    7. The model is saved to the current project.

     

    If you are interested in seeing other examples of using the SPSS Modeler to predict customer churn, please see the tutorial ‘Predict Customer Churn by Building and Deploying Models Using Watson Studio Flows’.

  9. Scoring Machine Learning Models using the API

    In section 7 we tested the Machine Learning service interactively. In this section we shall see how the service can be used for predicting customer churn using the Machine Learning Service API and a Jupyter notebook for Python. The notebook is quite simple and consists of 4 code cells:

    09.01-Python-Notebook

    The first code cell imports the libraries needed for submitting REST requests. The second defines the credentials for the IBM Watson Machine Learning service. The third cell defines the payload for the scoring – basically the same payload that you used in section 7 to test the model generated by the Model Builder. The fourth cell constructs an HTTP POST request and sends it to the server to get the scoring for the payload. The request needs the credentials for the IBM Watson Machine Learning service and the API scoring endpoint of the created model.
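    A rough sketch of these four cells follows. The token endpoint and the shape of the credentials follow the WML v3 API of the time; treat them as assumptions and verify them against your own service's template code – the URL, user name and password are placeholders:

```python
import base64
import json
import urllib.request


def score(wml_credentials, scoring_endpoint, payload):
    """Send a scoring request to a deployed Watson ML model.

    Mirrors the notebook's cells: credentials, token, payload, POST.
    The /v3/identity/token endpoint is an assumption based on the WML
    v3 API; check it against your service's documentation.
    """
    # Obtain a bearer token using HTTP basic authentication.
    basic = base64.b64encode(
        (wml_credentials["username"] + ":" +
         wml_credentials["password"]).encode()).decode()
    token_req = urllib.request.Request(
        wml_credentials["url"] + "/v3/identity/token",
        headers={"Authorization": "Basic " + basic})
    with urllib.request.urlopen(token_req) as resp:
        token = json.load(resp)["token"]

    # POST the payload to the scoring endpoint.
    score_req = urllib.request.Request(
        scoring_endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": "Bearer " + token})
    with urllib.request.urlopen(score_req) as resp:
        return json.load(resp)


# Example call (placeholders -- substitute your own credentials and the
# endpoint shown on your deployment's Implementation tab):
# result = score({"url": "https://ibm-watson-ml.mybluemix.net",
#                 "username": "***", "password": "***"},
#                "https://.../v3/wml_instances/.../deployments/.../online",
#                {"fields": [...], "values": [[...]]})
```
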

    To get the notebook to run in your environment you will need to do the following:

    1. Deploy the machine learning model and get the code template for calling the API endpoint for scoring using Python.
    2. Obtain the credentials for your IBM Watson Machine Learning service.
    3. Create a new Jupyter notebook for Python from the basis of a notebook on GitHub.
    4. Modify the notebook to use the endpoint of your machine learning model and IBM Watson Machine Learning service.
    5. Run the notebook.

     

    To deploy the model and get the template code for scoring the model do the following:

    1. Locate the Watson Machine Learning Models that you have created and open the one named ‘Customer Churn – SPSS Model’.
    2. Select the Deployment tab.
    3. Create a new Web service deployment named ‘Customer Churn – SPSS Model – Web Service’.
    4. Wait until the deployment has been created, then open the deployment by clicking on the name.
    5. Select the Implementation tab.
    6. Select the Python tab to render the Python template code for using the API to get a prediction.
      09.02-Template-Python-Code
    7. Save the code for later use.

     

    The code defines the API endpoint, the payload for scoring as well as the header to be passed to the POST request to get the prediction. This header will need the credentials for the IBM Watson Machine Learning service.

    1. Go back to your Watson Studio Project.
    2. From the toolbar select Services > Watson Services. This will provide you with a list of all IBM Cloud Watson services that you have used.
    3. Select the Watson Machine Learning Service that you are using in this project. This will open the dashboard for the service.
    4. Select the Service credentials tab to the left of the dashboard.
    5. Click the New Credential button to the right to create the credentials.
    6. Copy the credentials (including username, password and API key) to a local file.

     

    If you are in doubt which IBM Watson Machine Learning service you are using in the project, simply select Settings from the IBM Watson Studio toolbar and you will get a list of all services associated with the project.

    Next import a notebook from GitHub and modify the notebook to use the credentials and endpoint for your model:

    1. In the Asset tab of your IBM Watson Studio project, select the command New Notebook.
    2. Select the From URL tab.
    3. Click the following hyperlink ‘Test SPSS Customer Churn Machine Learning Model‘ and copy the URL. Then paste it into the URL field.
    4. Select the Free Python runtime system.
    5. Click Create Notebook.
    6. Copy your Machine Learning service credentials into the second code cell as shown in the first screenshot in this section.
    7. Replace the content of the 4th cell with the corresponding code fragment for your deployment (the important part of the code to replace is the API endpoint).
    8. Invoke File > Save.

     

    Having modified the code you can run the cells one by one and finally get the score. Feel free to test the prediction with other values.

  10. Conclusion

    In this recipe we have briefly presented 3 approaches for creating machine learning models in IBM Watson Studio: Jupyter notebooks with Python, SPSS Modeler Flows and last but not least the Model Builder.

    The Model Builder provides the highest degree of automation and makes it possible to generate a machine learning model that can be evaluated, deployed and tested within a few minutes by simple user interactions with IBM Watson Studio. It does not, however, give much insight into what is going on behind the scenes with regard to data preparation and transformation, the training process or the detailed evaluation metrics. It is nevertheless very useful for generating models quickly that can be used right away in a business context, or for getting an indication of whether the data set at hand can be used at all (in its raw form) as a basis for training models. This component is complemented by capabilities of IBM Watson Studio such as dashboards and Refine that come in handy during the Data Understanding and Data Preparation phases when the transformations needed are of limited complexity.

    The SPSS Modeler Flow provides a graph editor for composing machine learning pipelines with an extensive palette of operations for data transformation (cleansing, filtering, normalization etc.) as well as a large set of data science estimators to choose from. One of these is the Auto Classifier, which will automatically train several models at once, enabling the user to pick the most suitable one at the end. This is backed up by an extensive set of capabilities supporting the Data Understanding and Model Evaluation phases – all using a graphical notation and without the need to get deeply involved in any kind of programming. Straightforward pipelines can therefore be built in a short time, and the approach provides significantly more transparency and control than e.g. the Model Builder.

    In the context of more intensive data transformation needs during the Data Preparation phase, or specific approaches for e.g. model training and model evaluation during the Modeling phase (such as stratified cross-validation), Jupyter notebooks with Python, numpy, pandas and scikit-learn are probably still the place to be. However, this does not necessarily imply that everything needs to be done in Python as in the original notebook. Tasks such as Data Understanding can more easily be undertaken using e.g. the Profiler and Dashboard capabilities of IBM Watson Studio. Final deployment of machine learning models can also be achieved using e.g. IBM Watson Machine Learning – although this capability has been out of scope for the current recipe. Last but not least, once deployed, the models can be monitored and retrained using the capabilities of the IBM Watson Machine Learning service.

     

  11. Acknowledgement

    This recipe started out with a dataset and a corresponding Jupyter Notebook for predicting customer churn from Sandip Datta available on Kaggle. I would like to thank Sandip Datta for making both assets – of very good quality – available for use by others.
