Build, train, and evaluate a machine learning product-based classifier on a customer profile

Building, training, and evaluating machine learning models can be done graphically by using the SPSS® Modeler flow feature in IBM® Watson™ Studio Desktop. IBM Watson SPSS Modeler flows in Watson Studio Desktop provide an interactive environment for quickly building machine learning pipelines that flow data from ingestion to transformation to model building and evaluation, without needing any code.

This tutorial uses the SPSS Modeler components in IBM Watson Studio Desktop to build, train, and evaluate a machine learning product-based classifier on a customer profile to predict whether a customer would subscribe to a mortgage, savings, or pension account. As an alternative, you can use IBM Watson Studio in IBM Cloud.

In this tutorial, we use existing financial data to train and evaluate the model.

Prerequisites

To complete this tutorial, you need IBM Watson Studio Desktop. You can get a free trial version of Watson Studio Desktop, which also includes the SPSS Modeler flow feature.

As an alternative to Watson Studio Desktop, you can use IBM Cloud and create services such as Watson Studio, IBM Watson Machine Learning, and IBM Cloud Object Storage. For this option, you need an IBM Cloud account. You can get a free trial account if you don’t already have one.

Estimated time

It should take you approximately 60 minutes to complete this tutorial.

Steps

If you need to understand the basics of the SPSS Modeler flow, read the Creating SPSS Modeler flows in Watson Studio tutorial.

Create the model flow

First, upload the initial flow. After installing Watson Studio Desktop:

  1. Create a new empty project and give it a name.

    Create Project

  2. Click Add to Project, and select Modeler Flow. From the From File tab, provide a name and an optional description and upload the modeler flow (finance-products-promotion.str). Click Create to create the modeler flow.

    Import modeler flow

Assign data assets

To run the flow, you must first connect the flow to the appropriate data set available in your project.

  1. Select the Data Asset node to the left of the flow (the input node).

  2. Select Open from the menu. The following image shows the attributes of the node to the right.

    Import modeler flow

  3. Click Change data asset to change the input file.

  4. Select your .csv file that contains the financial data, and click OK.

  5. Click Save.

Data background

In 2016, a retail bank sold several products (mortgage accounts, savings accounts, and pension accounts) to its customers. It kept a record of all historical data, and this data is available for analysis and reuse. Following a merger in 2017, the bank has new customers and wants to launch some marketing campaigns.

From the historical data, you can train a machine learning product-based classifier on a customer profile (age, income, or account) to predict whether a customer would subscribe to a mortgage, savings, or pension account. You can apply this predictive model to the new customer data and predict what products they will subscribe to.

Understanding the data

The following steps show how we analyze the historical data of customers from 2016. Because we are promoting financial products to customers, let’s see how many customers have subscribed to each product through the flow.

  1. Select the Mortgage data node. It’s a Select node, which filters rows by a condition, much like a WHERE clause in an SQL query. Click Open.

    Mortgage node

  2. There is a condition where Mortgage = 1. This selects all of the customers who bought a Mortgage product. Click the Mortgage data node again, and select Run. This gives you the output by applying the condition to the data set.

    Mortgage node run

  3. Select the three dots on the output, and click Open. This gives you the result and shows the total count of Mortgage, which indicates the number of customers who bought the Mortgage product.

    Mortgage node

You can use these same steps to find out how many customers bought Savings and Pension products. Use the Savings and Pension nodes instead of the Mortgage node.
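Outside the flow, the same count can be sketched in a few lines of Python. This is only an illustration of what the Select condition does, not how SPSS Modeler implements it. The records below are hypothetical toy values, and the field names Mortgage, Savings, and Pension are assumed to match the columns in your .csv file.

```python
# Hypothetical toy records for illustration only; the real data comes from your .csv file.
customers = [
    {"age": 34, "income": 52000, "Mortgage": 1, "Savings": 0, "Pension": 0},
    {"age": 58, "income": 87000, "Mortgage": 1, "Savings": 1, "Pension": 1},
    {"age": 45, "income": 61000, "Mortgage": 0, "Savings": 1, "Pension": 0},
    {"age": 29, "income": 38000, "Mortgage": 0, "Savings": 0, "Pension": 0},
    {"age": 63, "income": 74000, "Mortgage": 0, "Savings": 1, "Pension": 1},
]


def count_subscribers(rows, product):
    """Count the rows whose product flag is 1, like a Select node with `product = 1`."""
    return sum(1 for row in rows if row[product] == 1)


# One count per product, mirroring the Mortgage, Savings, and Pension nodes.
counts = {p: count_subscribers(customers, p) for p in ("Mortgage", "Savings", "Pension")}
print(counts)
```

Each call plays the role of one Select node followed by a row count on its output.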

Next, we’ll see how many customers bought multiple products.

  1. Click the nb_product derive node, then click Open. In the expression field, you see the formula for this new derived column: the sum of the values in the Mortgage, Savings, and Pension fields. A 1 in one of these product fields means that the customer has subscribed to that product; otherwise, the value is 0.

    Multiple product derive node

  2. After the derived field is defined, add a Select node with a condition that selects the rows where the new derived field (nb_product) has a value greater than 1. These rows are the customers who bought more than one product.

    Multiple product select node

  3. Click Run by selecting the three dots on the same derive node. Then, click Open on the output from the right pane to see the result. You see a new derived field with the sum of the values.

    Multiple product run

    The output shows the total number of customers who bought more than one product.

    Multiple product result

Now, we’ll find out how many customers bought all three products.

  1. Connect the same derive node to a Select node. The condition for this use case selects the rows where the new derived field (nb_product) has a value equal to 3, because a customer who bought all three products has a 1 in each of the three product fields.

    All product

  2. Click Run by selecting the three dots on the same derive node. Then, click Open on the output from the right pane to see the result. You see a new derived field with the sum of the values.

    All product run

    The output shows the total count of customers who bought all three products.

    All product result
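The derive-then-select logic from the two procedures above can be sketched in Python as follows. The records are hypothetical toy values, and the helper name add_nb_product is made up for this illustration; only the nb_product field name comes from the flow.

```python
# Hypothetical toy flags for illustration only (1 = subscribed, 0 = not subscribed).
customers = [
    {"Mortgage": 1, "Savings": 0, "Pension": 0},
    {"Mortgage": 1, "Savings": 1, "Pension": 1},
    {"Mortgage": 0, "Savings": 1, "Pension": 0},
    {"Mortgage": 0, "Savings": 0, "Pension": 0},
    {"Mortgage": 0, "Savings": 1, "Pension": 1},
]


def add_nb_product(rows):
    """Return copies of the rows with nb_product = Mortgage + Savings + Pension."""
    return [
        {**row, "nb_product": row["Mortgage"] + row["Savings"] + row["Pension"]}
        for row in rows
    ]


derived = add_nb_product(customers)

# Select condition nb_product > 1: customers with more than one product.
multiple = [row for row in derived if row["nb_product"] > 1]

# Select condition nb_product = 3: customers with all three products.
all_three = [row for row in derived if row["nb_product"] == 3]

print(len(multiple), len(all_three))
```

Because each product field is a 0/1 flag, the sum can only be 0, 1, 2, or 3, so a value of 3 identifies customers who hold every product.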

Understanding historical data through plotting

Using SPSS Modeler flow, it’s possible to visualize the data using Graph node types.

Let’s add a few graph nodes to analyze the data visually. For Mortgage, let’s visually analyze members_in_household versus loan_accounts.

  1. Connect the Mortgage node to a Plot graph node. Click the Plot node and select Open.

  2. Choose members_in_household as the X field and loan_accounts as the Y field. You can use the defaults for the rest. Click Save.

    Mortgage visualization

  3. Click the Plot node again, and click Run. Select the output, and click Open to see the graph.

    Mortgage visualization

    Mortgage visualization

In the visualization, you see the behavior of the customers in 2016 for the Mortgage product. The darker color indicates that a customer bought a product. The depth of the color indicates the number of purchases.

You can apply this same approach to Pension and Savings to visualize them using graphs.

From the analysis, you see that:

  • The greater a customer’s income, the more likely that they will buy a savings account.

  • The older a customer is, the more likely that they will buy a pension account.

  • There is a correlation between the number of people in a customer’s household, the number of loan accounts that are held by the customer, and the likelihood that a customer buys a mortgage account. To see the correlation, look at the upper right and lower left corners of the mortgage chart.
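To make the last observation concrete, here is a rough Python sketch that tabulates the share of mortgage buyers for each combination of members_in_household and loan_accounts, which is roughly what the color depth in the plot conveys. The tuples are hypothetical toy values chosen for illustration, not figures from the bank’s data set.

```python
from collections import Counter

# Hypothetical toy rows: (members_in_household, loan_accounts, Mortgage flag).
records = [
    (1, 0, 0),
    (1, 0, 0),
    (2, 1, 0),
    (2, 1, 1),
    (4, 3, 1),
    (4, 3, 1),
    (5, 4, 1),
]

# Count all customers and mortgage buyers in each (x, y) cell of the plot.
totals = Counter((members, loans) for members, loans, _ in records)
buyers = Counter((members, loans) for members, loans, bought in records if bought == 1)

# Share of customers in each cell who bought a mortgage.
rates = {cell: buyers[cell] / totals[cell] for cell in totals}

for cell in sorted(rates):
    print(cell, rates[cell])
```

With real data, cells toward one corner of the grid showing consistently higher shares than cells toward the opposite corner would correspond to the pattern described above.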

Predicting 2017 customer behavior

To predict the future behavior of the customers, we need to build and train a machine learning model that predicts what the new clients will buy. We use the Auto Classifier node from the Modeling category of SPSS Modeler. In this tutorial, we create three models, one for each product.

We use the Type node to understand the metadata of the data. The metadata shows what kind of data we are dealing with. The data could be continuous, categorical, Flag, Nominal, or Ordinal.

Metadata
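As a rough intuition for what this metadata captures, the sketch below guesses a measurement level from a column’s values. It is a simplified heuristic written for this tutorial, not the rule set that the Type node actually applies, and the function name guess_measurement is hypothetical.

```python
def guess_measurement(values):
    """Guess a measurement level from raw values (simplified heuristic, not SPSS logic)."""
    distinct = set(values)
    if len(distinct) == 2:
        # Exactly two distinct values, for example 0/1 or yes/no.
        return "flag"
    if all(isinstance(v, (int, float)) for v in distinct):
        # Numeric values with more than two distinct entries.
        return "continuous"
    # Anything else is treated as an unordered set of categories.
    return "nominal"


print(guess_measurement([0, 1, 1, 0]))               # a product field such as Mortgage
print(guess_measurement([23, 45, 61, 38]))           # a field such as age
print(guess_measurement(["north", "south", "east"])) # a hypothetical text field
```

A heuristic like this cannot tell ordinal from nominal data, which is one reason the Type node lets you review and override the measurement level of each field.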

Now, let’s prepare the data for creating three different models. We will use the Auto Data Prep node. This node prepares the data for training before the machine learning algorithm is applied.

  1. Click the three dots in the Auto Data Prep node, and click Open. In the Fields section, select Mortgage as the Target, and add age, income, members_in_household, and loan_accounts as input fields. Selecting Mortgage as the Target means that we are modeling the Mortgage field so that it can be predicted for future data.

    Mortgage Model

  2. Choose the algorithm for modeling. We use the Auto Classifier node from the Modeling category and connect it to the Auto Data Prep node. The Auto Classifier node builds several candidate models, ranks them by the metric you select, and keeps the best ones (the top three in this flow) to produce the prediction. In this case, we are using the defaults.

    Mortgage Model

  3. Click the Data Asset node, then click Run to apply the algorithm to create the machine learning model for the Mortgage product.

    Mortgage Model

    Mortgage Model

  4. After the building and training is complete, the process creates a model node that can then be used for analysis and evaluation. Let’s add a Table node and see the data after training. This creates two extra fields called $XS_Mortgage_transformed and $XSC_Mortgage_transformed. The $XS_Mortgage_transformed field is the predicted target value of the Auto Classifier node’s ensemble of top three models. The $XSC_Mortgage_transformed field is the “confidence” of the prediction for each record.

    Mortgage Model

    Mortgage Model

  5. To evaluate the model, let’s use the Evaluation graph node. Click the node, and select Open. Keep all of the defaults, because the node uses the _transformed columns for evaluation. Click Run on the same node.

    Mortgage Evaluation

    To see the output, select the output that it generates after clicking Run.

    Mortgage Evaluation

The previous graph is a Cumulative Gains chart used to evaluate the model created. For cumulative charts, higher lines indicate better models, especially on the left side of the chart. In many cases, when comparing multiple models the lines cross, so one model is higher in one part of the chart and another line is higher in a different part of the chart. In this case, you must consider what portion of the sample you want (which defines a point on the x axis) when deciding which model to choose.
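The points on a cumulative gains curve can be computed by hand, which helps when reading the chart. The sketch below is a minimal illustration under stated assumptions, not the Evaluation node’s implementation: the labels and scores are hypothetical toy values, and each score is assumed to be the model’s estimated propensity that the customer buys the product (which is not necessarily the same as the confidence of the predicted value).

```python
def cumulative_gains(actual, scores):
    """Return (fraction of sample, fraction of buyers found) points for a gains curve.

    actual: 0/1 labels, where 1 means the customer bought the product.
    scores: model scores, where higher means more likely to buy.
    """
    total_positives = sum(actual)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")

    # Rank customers from the highest score to the lowest.
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    found = 0
    for i, (_, label) in enumerate(ranked, start=1):
        found += label
        points.append((i / len(ranked), found / total_positives))
    return points


# Hypothetical toy labels and scores for ten customers.
actual = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.7, 0.1, 0.3, 0.6, 0.5, 0.05]

points = cumulative_gains(actual, scores)
for x, y in points:
    print(f"{x:.1f} of sample -> {y:.2f} of buyers")
```

In this toy example, contacting the top 20% of customers by score reaches half of the buyers, whereas a random selection of 20% would be expected to reach about 20% of them. The further the curve sits above that diagonal on the left side, the better the model ranks customers.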

The model is now created, and we are ready to apply the model to test data. Because the trial version of Watson Studio Desktop doesn’t allow you to save the model, applying the model to test data is not part of this tutorial. To apply the model, you must use the perpetual version of Watson Studio Desktop.

Conclusion

This tutorial walked you through the steps to use the SPSS Modeler components in Watson Studio Desktop to build, train, and evaluate a machine learning product-based classifier on a customer profile to predict whether a customer would subscribe to a mortgage, savings, or pension account.