Analytics

Learning objectives

In this how-to, you will prepare data and build a predictive model with IBM SPSS Modeler that assesses the risk of a loan application, so that the application can be either approved, giving the loan to the customer, or rejected.

The loan risk dataset used in this how-to is free, open-source, and available on the BigML website.

The dataset contains details about customers applying for loans. Some of the details available are checking status, duration, credit history, purpose, credit amount, savings status, employment, and more.

Prerequisites

IBM SPSS is available on IBM Watson Studio as one of many options to build predictive models.

If you want more flexibility in preparing your data and building your models than Watson Studio’s Automatic Modeler offers, but still want the ease of a graphical interface with less code and complexity, you can use IBM SPSS Modeler.

Estimated time

It takes approximately 1 hour to read and follow the steps in this how-to.

Steps

Upload the dataset to IBM Watson Studio

The first step is to upload the dataset to IBM Watson Studio. To upload the dataset, make sure that you have a right-side panel open where you will find an area prompting you to import your data. You can drag and drop the dataset (.csv file) from your computer to that area, or click Browse to open file explorer on your computer where you can select the desired file. If you don’t have the right-side panel open by default, click the Data tab (upper right, second-to-last) in the toolbar to trigger the panel to open.

Note: The name of the dataset in the screenshot below is different from the dataset you are using.

uploading Watson Studio

Create a new flow

Go to the Modeler Flows section in the main dashboard of IBM Watson Studio.

main dashboard: Create New Flow

Name your Modeler task

Type a name for your Modeler task, select Modeler Flow as the flow type and IBM SPSS Modeler as the runtime, then click Create.

modeler task: Name your Modeler Task

Import the dataset

Click Import on the left-side panel to expand this section’s options.

spss: Import the Dataset

Add Data Assets to the Modeler canvas

Drag and drop the Data Assets node onto the Modeler canvas.

spss: Select Data Asset

Edit the Data Assets node

Double-click the Data Assets node on the canvas to edit its properties. This opens a right-side panel with all configuration options for the selected node.

spss: Edit Data Assets Node

Define the source of the data

In the Data section of the right-side panel, click Change Data Asset to define the data source.

spss: Define Data Source

Select the data asset

All the data assets available in this project are listed. Select your dataset by its name. In this example, I named the dataset customers_credit_status.csv. Note: Names of datasets in the screenshots might be different from what you have.

After you select the dataset, click Ok.

data sources: Select the Data Asset

If the name of your dataset appears in the Data section under Source Location, the dataset was imported successfully. Click Save at the bottom of the panel.

spss: Check Data is Loaded

Check the data quality

Before you start to work with your data, check its quality and get an overview of what’s inside. To do that, use the Data Audit functionality, found in the Outputs section in the nodes (left-side) panel. Drag and drop the Data Audit node onto the canvas.

spss: Check Data Quality

Connect the nodes

To connect the nodes, drag from the right end of the Data Asset node to the left end of the Data Audit node.

spss: Connect the Nodes

Run the flow to get the results of the data audit

Right-click the Data Audit node on the canvas, and select Run from the menu.

spss: Run the Flow

Wait for the process to finish. Depending on the size of your data, this might take a couple of minutes. This example should take about a minute.

spss: Flow Running

After the flow finishes running, open the Outputs tab on the right-side panel and select the most recent output (the most recent is always on top) to view the results of running the flow so far.

spss: Selecting Outputs Tab

The results provide an idea about the fields contained in the dataset, some statistics of each field, and the distribution of the features. Note that the data needs normalization. Scrolling down provides more insights about the problems in the dataset including missing values, outliers and extremes, and the methods used to fix these problems.

outputs: View Output
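As a rough illustration of what a data audit reports, the following Python sketch counts missing values and computes basic statistics per field. It is not how the Data Audit node is implemented. The field names follow the dataset description, while the sample rows and the `audit` helper are made up for this example.

```python
import statistics

# Hypothetical sample rows: field names follow the dataset description,
# values are invented for illustration. None marks a missing value.
rows = [
    {"duration": 6, "credit_amount": 1169, "purpose": "radio/tv"},
    {"duration": 48, "credit_amount": 5951, "purpose": "radio/tv"},
    {"duration": 12, "credit_amount": None, "purpose": "education"},
    {"duration": 42, "credit_amount": 7882, "purpose": None},
]


def audit(rows):
    """Summarise each field: missing-value count plus basic statistics."""
    report = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        present = [v for v in values if v is not None]
        summary = {"missing": len(values) - len(present)}
        if present and all(isinstance(v, (int, float)) for v in present):
            # Continuous field: report range, mean, and spread.
            summary["min"] = min(present)
            summary["max"] = max(present)
            summary["mean"] = statistics.mean(present)
            summary["stdev"] = statistics.stdev(present) if len(present) > 1 else 0.0
        else:
            # Categorical field: report the number of distinct values.
            summary["distinct"] = len(set(present))
        report[field] = summary
    return report


report = audit(rows)
for field, summary in report.items():
    print(field, summary)
```

A nonzero `missing` count or a wide gap between `min` and `max` is the kind of signal that suggests a field needs cleaning or rescaling before modeling.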

Go back to SPSS Modeler and continue working there. In the breadcrumb in the upper toolbar, click the name of your Modeler flow to go back to the canvas. In the image below, I navigate back to Loan Approval SPSS Modeler.

outputs: Go Back to Canvas

Partition the data

To split the data into Train and Test sets, go to the Field Operations section from the nodes (left-side) panel, then drag and drop the Partition node onto the canvas. Connect the Partition node to the Data Asset node.

One of the important steps in preparing data to train a model is splitting it into Train and Test sets. Some approaches split the data before pre-processing, and others split it after pre-processing. I split the data first in this example.

spss: Partition the Data

Double-click the Partition node to see its configuration options. For now, leave everything as-is.

spss: Configure the Partition Node
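Conceptually, partitioning assigns each record to either the Train or the Test set at random. The Python sketch below shows one simple way to do this. The 70/30 ratio, the seed, and the `partition` helper are assumptions chosen for illustration and are not the Partition node's actual settings.

```python
import random


def partition(records, train_percent=70, seed=42):
    """Shuffle a copy of the records and split it into train and test lists.

    train_percent and seed are illustrative values, not SPSS Modeler defaults.
    """
    shuffled = list(records)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


records = list(range(10))  # stand-ins for ten loan applications
train, test = partition(records)
print(len(train), "training records,", len(test), "test records")
```

Fixing the seed keeps the same records in each set across runs, which makes later model comparisons fair.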

Find and fix issues

Now, it’s time to fix any problems present in the dataset and prepare it for the modeling step. From the nodes (left-side) panel, go to the Field Operations section and drag and drop the Auto Data Prep node onto the canvas. Connect the Auto Data Prep node to the Partition node.

spss: Auto Data Prep

Attach another Data Audit node to access the results of the Auto Data Prep process. Go to Outputs in the nodes (left-side) panel, drag and drop the Data Audit node onto the canvas, and connect it to the Auto Data Prep node. Right-click the Data Audit node and select Run to run the flow built so far.

spss: Data Audit

To view the results, go to the Outputs tab in the right-side panel. Select the most recent Data Audit results, which will be on top.

spss: Select Output

Note that the data is normalized (Mean is 0 and Standard Deviation is 1, for fields with continuous values) and the problems in the dataset are fixed.

outputs: View Data Audit Results
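A mean of 0 and a standard deviation of 1 is what z-score normalization produces. The sketch below shows that transformation in Python for a single continuous field. The sample values and the `zscore` helper are hypothetical, and the use of the population standard deviation is an assumption, since the exact calculation inside Auto Data Prep is not described here.

```python
import statistics


def zscore(values):
    """Rescale values so they have mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)  # population standard deviation (assumed)
    if stdev == 0:
        # A constant field carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mean) / stdev for v in values]


durations = [6, 12, 24, 48]  # hypothetical loan durations in months
scaled = zscore(durations)
print(scaled)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

After scaling, fields measured in very different units (months versus currency amounts, for example) contribute on a comparable scale to the model.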

Select a model

Now it’s time for the modeling steps. To select a model, go to the Modeling section in the nodes (left-side) panel. Many models are listed, and the one you choose depends on your dataset and the problem you’re trying to solve. For this example, use LSVM (Linear Support Vector Machine), a model used for data classification. Because we are classifying loan applications as approved or rejected, LSVM is appropriate for this use case, and it is well suited to datasets that have a large number of predictor fields.

Drag and drop the LSVM node onto the canvas and connect it to the Auto Data Prep node so the model is fed the cleaned, normalized version of the data.

spss: Model Selection

Double-click the LSVM node from the canvas to change its configuration. The most important step here is to define the model, the predictor fields (features), and related targets (labels). In the Fields tab in the right-side panel, check Use custom field roles, then select class_transformed as the target (labels) column and click Save.

The class_transformed field contains the class of customers (good is more likely to pay on time, bad is not likely to pay on time) in the current dataset.

Select all other columns as Inputs. These are the predictor fields, the data that affects the class of the customer.

The model uses the inputs to find a formula that relates all the input fields with the output (customer class) and uses that same formula on new data to predict outputs.

spss: LSVM Model
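To make the idea of a learned formula more concrete, here is a minimal Python sketch of how a linear classifier such as LSVM could score a record: a weighted sum of the inputs plus a bias, with the sign of the result choosing the class. The weights, bias, and two-field records are invented for illustration. They are not values produced by SPSS Modeler, and the node's internals may differ.

```python
def predict(features, weights, bias):
    """Classify one record with a linear decision function."""
    # Weighted sum of the (normalized) input fields plus a bias term.
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return "good" if score >= 0 else "bad"


# Hypothetical learned parameters for two normalized fields:
# [duration, credit_amount]. Training would normally produce these.
weights = [-0.8, -0.5]
bias = 0.2

long_large_loan = [0.5, 1.0]     # above-average duration and amount
short_small_loan = [-1.0, -0.5]  # below-average duration and amount

print(predict(long_large_loan, weights, bias))
print(predict(short_small_loan, weights, bias))
```

Training is the process of finding the weights and bias. Prediction on new data then only needs this one formula.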

Run the flow

Right-click the LSVM node on the canvas and when a menu opens, choose Run.

spss: Run the Model Flow

Wait for the process to finish. It might take a few minutes. The model is being fed the data and using that data as input for training.

spss: Model Training

After the model finishes training, it produces a new node that holds information about the performance of the model. The new node is placed just under the related model node and connected to that node by default.

spss: Model Output

Analyze the model output

To look at the model output, you need to add a node to extract the output into a readable format. Go to the Outputs section from the nodes (left-side) panel and drag and drop the Analysis node onto the canvas.

spss: Analysis

Connect the Analysis node to the model output node. Right-click the Analysis node on the canvas and select Run from the menu. This re-runs the flow, so it might take a couple of minutes.

spss: Run Analysis

To view the output, go to the Outputs tab on the right-side panel and select the Analysis result, which should be on top (because it’s the most recent output).

spss: View Analysis output

Now you can see information about how the model performed, the number of correct and wrong predictions in each data split, and the accuracy of the model, which is shown as a percentage.

outputs: Results
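The figures in the Analysis output boil down to counting matches between actual and predicted classes in each partition. The following Python sketch shows that calculation. The label lists and the `accuracy` helper are hypothetical and do not come from the loan dataset.

```python
def accuracy(actual, predicted):
    """Return (correct, wrong, percent correct) for paired label lists."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    wrong = len(actual) - correct
    return correct, wrong, 100.0 * correct / len(actual)


# Hypothetical actual and predicted classes for each data split.
train_actual = ["good", "good", "bad", "good", "bad"]
train_predicted = ["good", "good", "bad", "bad", "bad"]
test_actual = ["good", "bad", "good", "bad"]
test_predicted = ["good", "good", "good", "bad"]

train_result = accuracy(train_actual, train_predicted)
test_result = accuracy(test_actual, test_predicted)
print("Train:", train_result)
print("Test:", test_result)
```

Comparing the two percentages is a quick check for overfitting: a model that scores much higher on the Train split than on the Test split has likely memorized the training data.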

I added a few other models for comparison. Feel free to try your own combinations, using the steps in the Select a model section as a guide.

spss: Added models

Compare models

Now there are many models and you want to select the best one for deployment. A comparison will give you detailed information about each model used. Right-click the LSVM model node and choose View model to see more information about different performance metrics.

First, the Model Evaluation tab shows details about overall model accuracy. These details include false positives, false negatives, model precision, recall, and F1 score. The overall accuracy of the LSVM model here is 80.9%, which is fair.

model evaluation: Model Evaluation

The Confusion Matrix tab shows you the percentages of correct predictions for each class.

model evaluation: Confusion Matrix
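The evaluation metrics listed above can all be derived from the four cells of a confusion matrix. The Python sketch below shows the standard formulas, treating one class as the positive class. The counts are invented for illustration and are not the results of the LSVM model in this how-to.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts: true positives, false positives,
# false negatives, true negatives.
metrics = classification_metrics(tp=60, fp=15, fn=10, tn=15)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

For loan approval, the two error types carry different costs: a false positive may grant a loan that is not repaid, while a false negative turns away a reliable customer. Looking at precision and recall, not only accuracy, makes that trade-off visible.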

The Predictor Importance tab ranks the fields by how much impact they had on the predictions.

model evaluation: Predictor Importance

Use the steps above in the Analyze the model output section to check the Random Forest Classifier model’s performance. Right-click the model node on the canvas and select View Model.

spss: Compare Models

Model evaluation shows an overall accuracy of 68.9%, which is noticeably lower than the LSVM model’s 80.9%.

model evaluation: Model Evaluation Random Forest

Confusion Matrix.

model evaluation: Confusion Matrix Random Forest

Predictor Importance.

model evaluation: Predictor Importance Random Forest

Save the model

The first model (LSVM) has better overall performance. Use the top breadcrumb to navigate back to your Modeler flow. Right-click the LSVM node and select Save branch as model.

spss: Choose Best Model

Type a name for your model. A machine learning service should be detected (refer to the Prerequisites section) and added automatically. Click Save.

save model: Save Model

The model is saved successfully, which means the trained model is published to a repository on the cloud that is tied to your IBM Cloud account. Your IBM Cloud account also has all the models you previously trained and saved. This step is important to allow for model deployment later.

save model: Success

You can access your saved model in the Models panel in the main dashboard. Deploying and using the model on the cloud is just a couple of clicks away.

main dashboard: View Saved Models

You can iterate through these steps and tweak the configuration options of each node to achieve better accuracy.

Summary

In this how-to, you learned how to implement a complete data-science workflow, which typically includes importing data, cleaning the data, and then selecting a suitable model to train on the data. You also learned how to compare models based on their evaluation metrics and save the best-performing model as a precursor to deploying it on the cloud.