
Creating SPSS Modeler flows in Watson Studio

This tutorial is part of the Introduction to IBM Watson Studio learning path.

Introduction

This tutorial explains how to graphically build and evaluate machine learning models by using the SPSS Modeler flow feature in IBM® Watson™ Studio. IBM Watson SPSS Modeler flows in Watson Studio provide an interactive environment for quickly building machine learning pipelines that flow data from ingestion to transformation to model building and evaluation, without needing any code. This tutorial introduces the SPSS Modeler components and explains how you can use them to build, test, evaluate, and deploy models.

As with the other tutorials in this learning path, we use a customer churn data set that is available on Kaggle.

Prerequisites

To complete the tutorials in this learning path, you need an IBM Cloud account. You can obtain a free trial account, which gives you access to IBM Cloud, IBM Watson Studio, and the IBM Watson Machine Learning Service.

Estimated time

It should take you approximately 60 minutes to complete this tutorial.

Steps

The SPSS Modeler flow feature is also available on IBM Watson Studio Desktop. The same steps to create an SPSS Modeler flow mentioned below for Watson Studio on IBM Cloud also apply to Watson Studio Desktop. To skip to Watson Studio Desktop, see SPSS Modeler flow using Watson Studio Desktop.

The steps to set up your environment for the learning path are explained in the Data visualization, preparation, and transformation using IBM Watson Studio tutorial. These steps show how to:

  1. Create an IBM Cloud Object Storage service.
  2. Create an IBM Watson Studio project.
  3. Provision IBM Cloud services.
  4. Upload the data set.

You must complete these steps before continuing with the learning path. If you have finished setting up your environment, continue with the next step, creating a model flow.

Create model flow

To create an initial machine learning flow:

  1. From the Assets page, click Add to project.

  2. In the Choose asset type page, select Modeler Flow.

  3. On the Modeler page, select the ‘From File’ tab.

    create-flow

  4. Download the model flow that is named ‘customer-churn-flow.str’ from https://github.com/IBM/watson-studio-learning-path-assets/data.

  5. Drag the downloaded modeler flow file to the upload area. This also sets the name for the flow.

  6. Change the name and provide a description for the machine learning flow (optional).

  7. Click Create. This opens the Flow Editor that can be used to create a machine learning flow.

You have now imported an initial flow that we’ll explore in the rest of this tutorial.

initial-flow

Under the Modeling drop-down menu, you can see the various supported modeling techniques. The first one is Auto Classifier, which tries several techniques and then presents the results of the best one.

The main flow itself defines a pipeline consisting of several steps:

  • A Data Asset node for importing the data set
  • A Type node for defining metadata for the features, including a selection of the target attributes for the classification
  • An Auto Data Prep node for preparing the data for modeling
  • A Partition node for partitioning the data into a training set and a testing set
  • An Auto Classifier node called ‘churn’ for creating and evaluating the model

Additional nodes have been associated with the main pipeline for viewing the input and output. These are:

  • A Table output node called ‘Input Table’ for previewing the input data
  • A Data Audit node called ‘21 fields’ (default name) for auditing the quality of the input data set (for example, min, max, and standard deviation)
  • An Evaluation node for evaluating the generated model
  • A Table output node called ‘Result Table’ for previewing the results of the test prediction

Other input and output types can be viewed by selecting the Outputs drop-down menu.

Assign data asset and run the flow

To run the flow, you must first connect the flow to the appropriate data set in your project.

  1. Select the three dots of the Data Asset node to the left of the flow (the input node).

  2. Select the Open command from the menu. This shows the attributes of the node in the right part of the page.

    data-asset-properties

  3. Click Change data asset to change the input file.

  4. On the next page, select the .csv file that contains the customer churn data, and click OK.

  5. Click Save.

  6. Click Run (the arrow head) in the toolbar to run the flow.

    run-command

Running the flow creates a number of outputs or results that can be inspected in more detail.

run-and-output

Understanding the data

Now that you have run the flow, take a closer look at the data.

  1. Select the Input Table node at the top of the flow diagram.

  2. Select the three dots in the upper-right corner and invoke the Profile command from the pop-up menu.

    preview-input-data-option

The last interaction might run part of the flow again but has the advantage that the page provides a Profile tab for profiling the data and a Visualization tab for creating dashboards.

preview-data-set

Now, let’s take a closer look at each of the data columns, such as the values for their minimum, maximum, mean, and standard deviation:

  1. Click one level back in the bread crumb list at the top of the page to return to your flow.

    bread-crumb

  2. Select the View outputs and versions command from the upper-right portion of the toolbar.

  3. Select the Outputs tab.

    outputs-tab

  4. Double-click the output for the “data audit” node named “21 Fields.” Alternatively, select the three dots associated with the output and select Open from the pop-up menu.

    21-fields-option

This gives you an overview like the one in the following image.

21-fields-output

For each feature, the overview shows the distribution in graphical form and whether the feature is categorical or continuous. For numerical features, the computed min, max, mean, standard deviation, and skewness are shown as well. The column named Valid shows 3333 valid values for each listed feature, which means that no values are missing. You therefore do not need to filter or transform columns with missing values during preprocessing.
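The statistics reported by the data audit can be approximated outside the tool. The following is a minimal sketch that uses only the Python standard library; the sample values are made up for illustration, and the skewness formula is one common sample estimator (adjusted Fisher-Pearson), which is not necessarily the exact formula that SPSS Modeler applies.

```python
import statistics

# Hypothetical sample of one numeric column; None marks a missing value.
total_day_minutes = [265.1, 161.6, 243.4, 299.4, 166.7, None, 218.2]


def audit(values):
    """Summarize one numeric field, roughly as a data audit would."""
    valid = [v for v in values if v is not None]
    n = len(valid)
    mean = statistics.fmean(valid)
    std = statistics.stdev(valid)  # sample standard deviation
    # Adjusted Fisher-Pearson sample skewness (an assumed choice of estimator).
    skewness = (n / ((n - 1) * (n - 2))) * sum(((v - mean) / std) ** 3 for v in valid)
    return {
        "valid": n,
        "missing": len(values) - n,
        "min": min(valid),
        "max": max(valid),
        "mean": mean,
        "std": std,
        "skewness": skewness,
    }


summary = audit(total_day_minutes)
for name, value in summary.items():
    print(name, value)
```

A field whose valid count equals the number of rows, as in the 3333 rows of the churn data set, has no missing values.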

Data preparation

You can change the initial assessment of the features made by the import by using the Type node, which happens to be the next node in the pipeline. To achieve this:

  1. Go back to the Flow Editor by selecting ‘customer-churn-flow’ in the toolbar.

  2. Select the Type node.

  3. Select the Open command from the pop-up menu.

This provides a table that shows the features (that is, the fields), their measure (for example, continuous or flag), and their role, along with other properties.

type-node-output

The Measure can be changed if needed using this node and it is also possible to specify the role of a feature. In this case, the role of the churn feature (which is a Flag with True and False values) has been changed to Target. The Check column might give you more insight into the values of the field.

Click Cancel to close the property editor for the Type node.

The next node in the pipeline is the Auto Data Prep node. This node automatically transforms the data, such as converting categorical fields into numerical ones. To view its results:

  1. Select the Auto Data Prep node in the flow editor.

  2. Select Open from the pop-up menu.

This node offers a multitude of settings, for example, for defining the objective of the transformation (optimize for speed or for accuracy).

auto-data-prep

The previous image shows that the transformation has been configured to exclude fields with too many missing values (the threshold is 50 percent) and to exclude fields with too many unique categories. The latter most likely applies to the phone numbers, so you do not need to handle that field yourself.
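The two exclusion rules can be illustrated with a short sketch. This is not the Auto Data Prep implementation; the rows, the `fields_to_exclude` helper, and the category limit of 3 are hypothetical and chosen only to make the behavior visible on a tiny sample.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"state": "NY", "phone number": "351-7269", "notes": None},
    {"state": "OH", "phone number": "371-7191", "notes": None},
    {"state": "NY", "phone number": "358-1921", "notes": "vip"},
    {"state": "OH", "phone number": "375-9999", "notes": None},
]

MAX_MISSING_PCT = 50  # threshold described in the text
MAX_CATEGORIES = 3    # assumed value, for illustration only


def fields_to_exclude(rows, max_missing_pct, max_categories):
    """Return the fields that either rule would drop, with the reason."""
    excluded = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        missing_pct = 100 * sum(v is None for v in values) / len(values)
        categories = {v for v in values if v is not None}
        if missing_pct > max_missing_pct:
            excluded[field] = "too many missing values"
        elif len(categories) > max_categories:
            excluded[field] = "too many unique categories"
    return excluded


excluded = fields_to_exclude(rows, MAX_MISSING_PCT, MAX_CATEGORIES)
print(excluded)
```

In this sample, the phone number field is dropped because every value is unique, which mirrors why a phone number carries little predictive signal.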

The next node in the pipeline is the Partition node, which splits the data set into a training set and a testing set. For the current Partition node, an 80-20 split has been used.

partition-node
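As a rough illustration of an 80-20 split, the following sketch shuffles record indexes with a fixed seed and cuts the list at 80 percent. The `partition` function and the seed are assumptions for this example; the Partition node's own sampling method may differ, so the exact set sizes it produces can vary.

```python
import random


def partition(records, train_fraction=0.8, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(3333))  # stand-ins for the 3333 rows of the data set
train, test = partition(records)
print(len(train), len(test))
```

Keeping the seed fixed means that rerunning the split yields the same training and testing sets, which makes model comparisons repeatable.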

Training the model

The next node in the SPSS Modeler flow is the Auto Classifier node named “churn.” This node trains the model based on various build options, such as how to rank and discard generated models (using threshold accuracy).

auto-classifier-node

If you Open the node and select the BUILD OPTIONS option from the drop-down menu, you see the property Number of models to use is set to 3, which is the default value. Feel free to change it to a higher number, and then click Save to save the changes.

NOTE: Remember to rerun the flow if you change any build settings.
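The idea of ranking candidate models, discarding those below an accuracy threshold, and keeping a fixed number can be sketched as follows. The accuracy figures and the `select_models` helper are invented for illustration and do not come from the tutorial's results or from the Auto Classifier implementation.

```python
# Hypothetical candidates with made-up accuracies (percent).
candidates = [
    ("C&R Tree", 88.0),
    ("Random Forest", 95.2),
    ("Model D", 74.5),
    ("Neural Network", 91.4),
]


def select_models(candidates, number_to_use=3, min_accuracy=0.0):
    """Drop models below the threshold, rank the rest, and keep the top ones."""
    kept = [c for c in candidates if c[1] >= min_accuracy]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept[:number_to_use]


selected = select_models(candidates, number_to_use=3, min_accuracy=80.0)
for name, accuracy in selected:
    print(name, accuracy)
```

Raising the number of models to use widens the set that is kept, while raising the threshold narrows it.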

Evaluating the model

To get more details about the generated model:

  1. Select the yellow model icon.

  2. Select View Model from the drop-down menu.

    view-model-option

This overview section gives you a list of classifier models and their accuracy. In this example, the Number of models to use property was set to 10.

model-evaluation

As you navigate through this overview section, you’ll notice that the number of options and views that are associated with each estimator varies. In some cases, a hyperlink is provided to dig down into more details.

For example, take a look at the poorly performing ‘C&R Tree’ model by clicking its name in the table.

On the next page, select the Tree Diagram link to the left to get the tree diagram for the estimator.

You can now hover over either one of the nodes or one of the branches in the tree to get more detailed information about a decision made at a given point.

tree-diagram

Go back by clicking the left arrow in the upper-left part of the page. Then, select the MLP Neural Network link to get the details for that estimator. Note that it has different options than the tree model.

Click the Feature Importance tab.

feature-importance

This graphs the relative importance of each predictor in estimating the model.

Click the Confusion Matrix tab.

model-eval-confusion-matrix

The table compares what was predicted with what was observed. The numbers of correct predictions are shown in the cells along the main diagonal.
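To make the layout of a confusion matrix concrete, here is a minimal sketch that counts observed and predicted pairs. The two lists are made-up values, not results from the churn model.

```python
from collections import Counter

# Hypothetical observed and predicted churn flags for eight customers.
observed = [True, False, False, True, False, False, True, False]
predicted = [True, False, False, False, False, True, True, False]

counts = Counter(zip(observed, predicted))
labels = [False, True]

# Rows are observed values, columns are predicted values.
matrix = [[counts[(obs, pred)] for pred in labels] for obs in labels]

# Correct predictions sit on the main diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))

for label, row in zip(labels, matrix):
    print(label, row)
print("correct:", correct, "of", len(observed))
```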

If you would like to get the confusion matrix for the complete data set, you can add a Matrix Output node to the canvas.

  1. Go back to the flow.

  2. Add a Matrix node from the Outputs menu.

    matrix-output

  3. Attach the matrix node to the specified model output node.

    add-matrix

    NOTE: To attach the new node, click the right-side bubble of the existing ‘churn’ model output node and drag the connector to the new matrix node.

  4. Open the Matrix node.

  5. Put the target attribute ‘churn’ in the Rows and the binary prediction ‘$XF-churn’ in the Columns.

    matrix-columns

  6. For Cell contents, select Cross-tabulations.

  7. Click Appearance and select Counts, Percentage of Row, Percentage of Column, and Include row and column totals.

    matrix-appearance

  8. Click Save.

  9. Run the Matrix node.

  10. Select View outputs and versions in the upper-right corner.

  11. Open the output for the Matrix node (named ‘churn x $XF-churn’) by double-clicking it.

    confisuion-matrix-pct

The cells on the main diagonal contain the recall values as the row percentages and the precision values as the column percentages (that is, 100 times the proportions that are generally used for these metrics). The F1 statistics and the weighted versions of precision and recall over both categories need to be calculated manually. The results shown are the combined results of applying all three algorithms. If you want to see the results for the Random Forest only, go back to the Auto Classifier node, open it, and clear the check boxes for all models other than Random Forest. Then, rerun the flow.

If you want only the confusion matrix counts, open the Matrix Output node and clear ‘Percentage of Row’ and ‘Percentage of Column’ in the Appearance section. Then, repeat steps 8-11 above.

confusion-matrix
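The metrics that the Matrix output leaves for manual calculation can be derived from the cell counts. The following sketch uses invented counts (not the tutorial's results) and computes precision, recall, and F1 per category, followed by averages weighted by the number of observed cases in each category.

```python
# Hypothetical counts keyed by (observed churn, predicted churn).
matrix = {
    (False, False): 560, (False, True): 10,
    (True, False): 30, (True, True): 67,
}
labels = [False, True]


def metrics(matrix, label):
    """Return precision, recall, F1, and observed count for one category."""
    hits = matrix[(label, label)]
    row_total = sum(n for (obs, _), n in matrix.items() if obs == label)
    col_total = sum(n for (_, pred), n in matrix.items() if pred == label)
    recall = hits / row_total      # proportion behind the row percentage
    precision = hits / col_total   # proportion behind the column percentage
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, row_total


per_label = {label: metrics(matrix, label) for label in labels}
total = sum(matrix.values())

# Weight each category by its share of the observed cases.
weighted_precision = sum(p * n for p, _, _, n in per_label.values()) / total
weighted_recall = sum(r * n for _, r, _, n in per_label.values()) / total
weighted_f1 = sum(f * n for _, _, f, n in per_label.values()) / total

print(per_label[True][:3])
print(weighted_precision, weighted_recall, weighted_f1)
```

With this weighting, the weighted recall equals the overall accuracy, because each category's recall is multiplied by its own observed count.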

A more graphical way of showing the confusion matrix can be achieved by using SPSS visualizations. For that purpose, you need to select the Result Table output node, then select the Profile option in the drop-down menu.

results-profile

Click the Visualizations tab. Then, click more options (the double arrow icon) to view the types of charts available. Select the Treemap chart.

treemap-selector

Set the Columns values to churn and $XF-churn, and select Count in the Summary.

treemap-isualization

Notice that the current pipeline performs a simple split of test and training data using the Partition node. It’s also possible to use cross-validation and stratified cross-validation to achieve slightly better model performance, but at the cost of complicating the pipeline. See the article k-fold Cross-validation in IBM SPSS Modeler for details on how this can be achieved.
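For readers unfamiliar with stratified cross-validation, the following sketch shows the core idea in plain Python: each fold receives roughly the same share of every class, and each fold is held out once for testing. The `stratified_folds` function and the class counts are hypothetical, and this is not how the referenced article implements it in SPSS Modeler.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign record indexes to k folds while preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indexes of each class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical labels: 15 churners and 85 non-churners.
labels = [True] * 15 + [False] * 85
folds = stratified_folds(labels, k=5)

for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    churners = sum(labels[i] for i in test_idx)
    print(held_out, len(train_idx), len(test_idx), churners)
```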

There are two more ways of viewing the results of the evaluation.

  1. Go back to the flow editor for the Customer Churn Flow.

  2. Select View outputs and versions from the top toolbar.

  3. Double-click the output named Evaluation of [$XF-churn] : Gains to select it.

    eval-xf-churn-gains

The generated outputs for the model appear.

model-gains

Saving and deploying the model

Note: The model deployment feature is not part of Watson Studio Desktop (Subscription). However, you can download the SPSS Modeler flow stream from Watson Studio Desktop and import it into Watson Studio on IBM Cloud. You can then run it again and create the model, which you can deploy by using the following steps.

After you create, train, and evaluate a model, you can save and deploy it.

To save the SPSS model:

  1. Go back to the flow editor for the model flow.

  2. Select the Predicted Output node and open its pop-up menu by selecting the three dots in the upper-right corner.

  3. Select Save branch as model from the pop-up menu.

    save-branch-as-model

    A new window opens.

    save-model

  4. Type a model name (for example, ‘customer-churn-spss-model’).

  5. Click Save.

    The model is saved to the current project.

The model should now appear in the Models section of the Assets tab for the project.

model-list

To deploy the SPSS model:

  1. Click the saved model in the project Models list.

  2. Select the Deployments tab.

  3. Click Add Deployment to create a new web service deployment named ‘customer-churn-spss-model-web-service’.

  4. Set the deployment type to Web Service.

  5. Click Save.

    model-deploy-success

  6. Wait until the deployment status is DEPLOY_SUCCESS.

Testing the model

The model is now deployed and can be used for prediction. However, before using it in a production environment, it might be worthwhile to test it using real data. You can do this interactively or programmatically by using the API for the IBM Watson Machine Learning service. For now, we test it interactively.

The UI provides two options for testing the prediction: by entering the values one by one in distinct fields (one for each feature) or by specifying the feature values using a JSON object. We use the second option because it is the most convenient one when tests are performed more than once (which is usually the case), and when a large set of feature values is needed. To get a predefined test data set:

  1. Download the test data from GitHub in the file customer-churn-test-data.txt.

  2. Open the file and copy the value.

Notice that the JSON object defines the names of the fields first, followed by a sequence of observations to be predicted, each in the form of a sequence of values in the same order as the fields:

{"input_data":[{"fields": ["state", "account length", "area code", "phone number", "international plan", "voice mail plan", "number vmail messages", "total day minutes", "total day calls", "total day charge", "total eve minutes", "total eve calls", "total eve charge", "total night minutes", "total night calls", "total night charge", "total intl minutes", "total intl calls", "total intl charge", "customer service calls"], "values": [["NY",161,415,"351-7269","no","no",0,332.9,67,56.59,317.8,97,27.01,160.6,128,7.23,5.4,9,1.46,4]]}]}

Note that some of the features, such as state and phone number, are expected to be in the form of strings (which should be no surprise), whereas the true numerical features can be provided as integers or floats as appropriate for the given feature.
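If you prefer to generate such a JSON object rather than edit it by hand, the structure can be built from a dictionary per observation. This is a minimal sketch; the `build_payload` helper is hypothetical, and the observation reuses the values from the test data shown above.

```python
import json

# One observation, using the values from the test data shown in the tutorial.
observation = {
    "state": "NY", "account length": 161, "area code": 415,
    "phone number": "351-7269", "international plan": "no",
    "voice mail plan": "no", "number vmail messages": 0,
    "total day minutes": 332.9, "total day calls": 67, "total day charge": 56.59,
    "total eve minutes": 317.8, "total eve calls": 97, "total eve charge": 27.01,
    "total night minutes": 160.6, "total night calls": 128, "total night charge": 7.23,
    "total intl minutes": 5.4, "total intl calls": 9, "total intl charge": 1.46,
    "customer service calls": 4,
}


def build_payload(observations):
    """Build the fields/values structure from a list of observation dictionaries."""
    fields = list(observations[0])  # dictionaries keep insertion order
    values = [[obs[name] for name in fields] for obs in observations]
    return {"input_data": [{"fields": fields, "values": values}]}


payload = build_payload([observation])
print(json.dumps(payload))
```

Strings stay quoted and numbers stay unquoted in the output, which matches the typing that the model expects for each feature.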

To test the model at run time:

  1. Select the deployment that you just created by clicking the deployment name (for example, ‘customer-churn-spss-model-web-service’).

  2. This opens a new page that shows an overview of the properties of the deployment (for example, name, creation date, or status).

  3. Select the Test tab.

  4. Select the file icon, which then lets you enter the values using JSON.

  5. Paste the JSON object from the downloaded customer-churn-test-data.txt file into the Enter input data field.

  6. Click Predict to view the results.

deploy-est-result

The prediction result is given in terms of the probability that the customer will churn (True) or not (False). You can try it with other values, for example, by substituting the values with values taken from the customer-churn-kaggle.csv file. Another test is to change the phone number to something like “XYZ” and then run the prediction again. The prediction result should be the same, which indicates that the feature is not a factor in the prediction.

If you are interested in seeing other examples of using SPSS Modeler to predict customer churn, look at the tutorial Predict Customer Churn by Building and Deploying Models Using Watson Studio Flows.

Scoring machine learning models using the API

As mentioned previously, you can also access the model using the IBM Watson Machine Learning API. One way to do this is with a Jupyter Notebook, which is discussed in our next tutorial – Running Jupyter Notebooks in IBM Watson Studio.

After you have completed that tutorial and feel comfortable running Jupyter Notebooks, you can try out a sample notebook that will score the SPSS model you just created.

When creating the notebook, use the From URL option and enter:

https://github.com/IBM/watson-studio-learning-path-assets/blob/master/notebooks/spss-customer-churn.ipynb

To run the notebook, you will need to update it with:

  • Your Watson Machine Learning credentials, which are located in the Service Credentials tab of your service in IBM Cloud.

    ml-creds

  • The Scoring End Point URL for your deployed model, which is located in the Implementation tab for your deployed model.

    deploy-python-code
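As a rough sketch of what a scoring call looks like, the following code prepares an HTTP POST request with the Python standard library without sending it. The endpoint URL and token are placeholders, and the bearer-token header is an assumption about how the service authenticates; the sample notebook and the Implementation tab of your deployment remain the authoritative sources for the actual call.

```python
import json
import urllib.request

# Placeholders: replace with the scoring endpoint and an access token for your service.
SCORING_URL = "https://example.com/your-scoring-endpoint"
ACCESS_TOKEN = "your-access-token"


def build_scoring_request(url, token, payload):
    """Prepare (but do not send) a JSON POST request for a scoring endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,  # assumed authentication scheme
        },
    )


# Shortened payload for illustration; a real request needs all 20 fields.
payload = {"input_data": [{"fields": ["state"], "values": [["NY"]]}]}
request = build_scoring_request(SCORING_URL, ACCESS_TOKEN, payload)
print(request.get_method(), request.full_url)

# To send the request once the placeholders are replaced:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```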

SPSS Modeler flow using Watson Studio Desktop

The SPSS Modeler flow feature is also available in Watson Studio Desktop. Watson Studio Desktop brings the power of best-in-class data science and AI tools from IBM to Windows and MacOS, empowering business leaders and data scientists alike. For more information, see IBM Watson Studio Desktop.

Steps

  1. Download and install Watson Studio Desktop. You get a free 30-day trial of Watson Studio Desktop, which also includes a trial for SPSS Modeler.
  2. Log in using your IBM Cloud credentials. If you don’t have an IBM Cloud account, you can sign up for one.
  3. Create a project.
  4. To create a model, use the same steps described previously for Watson Studio. See Create model flow.

Conclusion

This tutorial covered the basics of using the SPSS Modeler flow feature in Watson Studio, which included:

  • Creating a project
  • Provisioning and assigning services to the project
  • Adding assets to the project, such as data sets
  • Creating a Modeler flow
  • Using the Modeler flow editor to run and examine the model
  • Training and evaluating the model
  • Deploying the model as a web service
  • Scoring the machine learning model with test data

Using the SPSS Modeler flow feature of Watson Studio provides a non-programming approach to creating a model to predict customer churn. It provides an alternative to the fully programmed style of using a Jupyter Notebook, as described in the next learning path tutorial, Running Jupyter Notebooks in IBM Watson Studio.

Rich Hagarty
Einar Karlsen