Learning objectives
In this how-to, you will prepare data and build a predictive model using IBM SPSS Modeler to assess the risk of a loan application and decide whether to approve the application and give the customer the loan, or reject it.
The loan risk dataset used in this how-to is free, open-source, and available on the BigML website.
The dataset contains details about customers applying for loans. Some of the details available are checking status, duration, credit history, purpose, credit amount, savings status, employment, and more.
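If you'd like to preview the data before uploading it, here is a minimal pandas sketch, assuming you downloaded the file as customers_credit_status.csv (the name used later in this how-to):

```python
import pandas as pd

# Load the loan risk dataset; the file name matches the one used later in this how-to.
df = pd.read_csv("customers_credit_status.csv")

# Preview the first rows and the available fields
# (checking_status, duration, credit_history, purpose, ...).
print(df.head())
print(df.columns.tolist())
```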
Prerequisites
- An IBM Cloud Account
- A running Object Storage Service Instance from the IBM Cloud catalog
- A running Machine Learning Service Instance from the IBM Cloud catalog
- A Watson Studio Service Instance from the IBM Cloud catalog
IBM SPSS is available on IBM Watson Studio as one of many options to build predictive models.
If you want more flexibility in preparing your data and building your models than Watson Studio’s Automatic Modeler offers, but still want the ease of a GUI with less code writing and complexity, you can use IBM SPSS Modeler.
Estimated time
It takes approximately 1 hour to read and follow the steps in this how-to.
Steps
Upload the dataset to IBM Watson Studio
The first step is to upload the dataset to IBM Watson Studio. To upload the dataset, make sure that the right-side panel is open; it contains an area prompting you to import your data. You can drag and drop the dataset (.csv file) from your computer to that area, or click Browse to open a file explorer where you can select the file. If the right-side panel is not open by default, click the Data tab (upper right, second-to-last icon) in the toolbar to open it.
Note: The name of the dataset in the screenshot below is different from the dataset you are using.
Create a new flow
Go to the Modeler Flows section in the main dashboard of IBM Watson Studio.
Name your Modeler task
Type a name for your Modeler task, select Modeler Flow as the flow type, select IBM SPSS Modeler as the runtime, and then click Create.
Import the dataset
Click Import on the left-side panel to expand this section’s options.
Add Data Assets to the Modeler canvas
Drag and drop the Data Assets node onto the Modeler canvas.
Edit the Data Assets node
Double-click the Data Assets node on the canvas to edit its properties. This opens a right-side panel with all configuration options for the selected node.
Define the source of the data
From the right-side panel and in the Data section, click Change Data Asset to define the data source.
Select the data asset
All the data assets available in this project are listed. Select your dataset by its name; in this example, I named the dataset customers_credit_status.csv.
Note: Names of datasets in the screenshots might be different from what you have.
After you select the dataset, click Ok.
If the name of your dataset appears in the Data section under Source Location, the dataset imported successfully. Click Save at the bottom of the panel.
Check the data quality
Before you start to work with your data, check its quality and get an overview of what’s inside. To do that, use the Data Audit functionality, found in the Outputs section in the nodes (left-side) panel. Drag and drop the Data Audit node onto the canvas.
Connect the nodes
Drag from the right end of the Data Asset node to the left end of the Data Audit node to connect the two nodes.
Run the flow to get the results of the data audit
Right-click the Data Audit node on the canvas, and select Run from the menu.
Wait for the process to finish. Depending on the size of your data, this might take a couple of minutes; this example should take about a minute.
After the flow finishes running, open the Outputs tab on the right-side panel and select the most recent output (the most recent is always on top) to view the results of running the flow so far.
The results give you an idea of the fields contained in the dataset, statistics for each field, and the distribution of the features. Note that the data needs normalization. Scrolling down provides more insight into the problems in the dataset, including missing values, outliers, and extremes, and the methods used to fix these problems.
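For reference, this is roughly the kind of summary the Data Audit node computes. A minimal pandas sketch, assuming the same customers_credit_status.csv file:

```python
import pandas as pd

df = pd.read_csv("customers_credit_status.csv")

# Per-field statistics, similar to the Data Audit report.
print(df.describe(include="all").T)

# Count missing values per field.
print(df.isna().sum())

# Flag numeric values more than 3 standard deviations from the mean
# as potential outliers or extremes.
numeric = df.select_dtypes(include="number")
outliers = (numeric - numeric.mean()).abs() > 3 * numeric.std()
print(outliers.sum())
```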
Go back to SPSS Modeler and continue working there. In the upper toolbar, use the breadcrumb to navigate to the name of your SPSS Modeler flow and return to the canvas. In the image below, I navigate back to Loan Approval SPSS Modeler.
Partition the data
To split the data into Train and Test sets, go to the Field Operations section from the nodes (left-side) panel, then drag and drop the Partition node onto the canvas. Connect the Partition node to the Data Asset node.
One of the important steps in preparing the data to feed into a model for training is splitting it into Train and Test sets. Some approaches split the data before pre-processing and others split it after pre-processing; I split the data first in this example.
Double-click the Partition node to see its configuration options. For now, leave everything as-is.
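In code, the Partition node's job corresponds to a train/test split. Here is a minimal scikit-learn sketch; the 80/20 ratio is illustrative, not necessarily the node's default setting:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("customers_credit_status.csv")

# Hold out 20% of the rows for testing; fixing random_state makes
# the split reproducible across runs.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
print(len(train_df), len(test_df))
```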
Find and fix issues
Now, it’s time to fix any problems present in the dataset and prepare it for the modeling step. From the nodes (left-side) panel, go to the Field Operations section and drag and drop the Auto Data Prep node onto the canvas. Connect the Auto Data Prep node to the Partition node.
Attach another Data Audit node to access the results of the Auto Data Prep process. Go to the Outputs section in the nodes (left-side) panel and select the Data Audit node. Drag and drop the node onto the canvas and connect it to the Auto Data Prep node. Right-click the Data Audit node and select Run from the menu to run the flow built so far.
To view the results, go to the Outputs tab in the right-side panel. Select the most recent Data Audit results, which will be on top.
Note that the data is normalized (Mean is 0 and Standard Deviation is 1, for fields with continuous values) and the problems in the dataset are fixed.
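Standardization is the same transformation you could apply yourself with scikit-learn. A minimal sketch over the numeric fields, assuming the same CSV file:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers_credit_status.csv")
numeric_cols = df.select_dtypes(include="number").columns

# Rescale each continuous field to mean 0 and standard deviation 1,
# matching what the audit reports after Auto Data Prep.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df[numeric_cols].mean().round(2))  # approximately 0 for every field
print(df[numeric_cols].std().round(2))   # approximately 1 for every field
```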
Select a model
Now it’s time for the modeling steps. To select a model, go to the Modeling section in the nodes (left-side) panel. Many models are listed, and the model you choose depends on your dataset and the problem you’re trying to solve. For this example, use LSVM (Linear Support Vector Machine), a model used for classification. Because we are classifying loan applications as approved or rejected, LSVM is appropriate for this use case; it is also well suited to datasets with a large number of predictor fields.
Drag and drop the LSVM node onto the canvas and connect it to the Auto Data Prep node so the model is fed the cleaned, normalized version of the data.
Double-click the LSVM node from the canvas to change its configuration. The most important step here is to define the predictor fields (features) and the related target (labels). In the Fields tab in the right-side panel, check Use custom field roles, then select class_transformed as the target (labels) column and click Save.
The class_transformed field contains the class of customers (good is more likely to pay on time, bad is not likely to pay on time) in the current dataset.
Select all other columns as Inputs; these are the predictor fields, the data that affects the class of the customer.
The model uses the inputs to find a formula that relates the input fields to the output (customer class), then applies that same formula to new data to predict outputs.
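A rough scikit-learn equivalent of this step is sketched below. It assumes the target column is named class_transformed (in the raw CSV it may have a different name before Auto Data Prep renames it) and one-hot encodes categorical predictors, which Auto Data Prep handles for you in Modeler:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("customers_credit_status.csv")

# Target (labels) column as used in this how-to; all other columns are inputs.
y = df["class_transformed"]  # assumed name; adjust to match your file
X = pd.get_dummies(df.drop(columns=["class_transformed"]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a linear support vector machine on the training partition.
model = LinearSVC()
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.1%}")
```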
Run the flow
Right-click the LSVM node on the canvas and choose Run from the menu.
Wait for the process to finish. It might take a few minutes. The model is being fed the data and using that data as input for training.
After the model finishes training, it produces a new node that holds information about the performance of the model. The new node is placed just under the related model node and connected to that node by default.
Analyze the model output
To look at the model output, add a node that extracts the output into a readable format. Go to the Outputs section in the nodes (left-side) panel and drag and drop the Analysis node onto the canvas.
Connect the Analysis node to the model output node. Right-click on the Analysis node from the canvas and select Run from the menu. This re-runs the flow, so it might take a couple of minutes.
To view the output, go to the Outputs tab on the right-side panel and select the Analysis result which should be on top (because it’s the most recent output).
Now you can see information about how the model performed: the number of correct and wrong predictions in each data split, and the accuracy of the model, shown as a percentage.
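This is the same bookkeeping you could do by hand. Continuing from the LSVM sketch above, here is a minimal version of what the Analysis node reports:

```python
from sklearn.metrics import accuracy_score

# Count correct and wrong predictions in each partition, as the Analysis node does.
for name, X_part, y_part in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    preds = model.predict(X_part)
    correct = int((preds == y_part).sum())
    print(f"{name}: {correct} correct, {len(y_part) - correct} wrong, "
          f"accuracy {accuracy_score(y_part, preds):.1%}")
```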
I added a few other models for comparison. Feel free to try your own combinations, using the steps in the Select a model section as a guide.
Compare models
Now there are many models and you want to select the best one for deployment. A comparison will give you detailed information about each model used. Right-click the LSVM model node and choose View model to see more information about different performance metrics.
First, from the Model Evaluation tab, you see details about overall model accuracy, including false positives, false negatives, model precision, recall, and F1 score. The overall accuracy of the LSVM model here is 80.9%, which is fair.
The Confusion Matrix tab shows you the percentages of correct predictions for each class.
From the Predictor Importance tab, you can see which fields had the highest impact on the predictions or outputs.
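Continuing the same scikit-learn sketch, the corresponding metrics look like this; note that the coefficient magnitudes are only a rough stand-in for Modeler's predictor importance computation:

```python
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

preds = model.predict(X_test)

# Per-class correct and incorrect counts, and precision/recall/F1 per class.
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))

# Rough predictor importance: magnitude of each linear SVM coefficient
# (not the same computation Modeler's Predictor Importance tab uses).
importance = pd.Series(abs(model.coef_[0]), index=X.columns)
print(importance.sort_values(ascending=False).head(10))
```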
Use the steps above in the Analyze the model output section to check the Random Forest Classifier model’s performance. Right-click the model node on the canvas and select View Model.
Model evaluation shows an overall accuracy of 68.9%, which is not as good.
Confusion Matrix.
Predictor Importance.
Save the model
The first model (LSVM) has better overall performance. Use the top breadcrumb to navigate back to your modeler name. Right-click the LSVM node and select Save branch as model.
Type a name for your model. A machine learning service should be detected (refer to the Prerequisites section) and added automatically. Click Save.
The model is saved successfully, which means the trained model is published to a repository on the cloud tied to your IBM Cloud account. Your IBM Cloud account also holds all the models you previously trained and saved. This step is important because it allows for model deployment later.
You can access your saved model in the Models panel in the main dashboard. Deploying and using the model on the cloud is just a couple of clicks away.
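Outside of Watson Studio, the analogous step is persisting the trained model so a deployment process can reload it later. Here is a minimal sketch with joblib, continuing from the LSVM example (this stands in for, and is not, the Watson Machine Learning repository):

```python
import joblib

# Serialize the trained model to disk so it can be reloaded for deployment.
joblib.dump(model, "loan_approval_lsvm.joblib")

# Later, at deployment time, restore the model and score new applications.
restored = joblib.load("loan_approval_lsvm.joblib")
print(restored.predict(X_test[:5]))
```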
You can easily iterate through these steps and do some tweaks in the configuration options of each step/node to achieve better accuracy.
Summary
In this how-to, you learned how to implement a complete data-science workflow: importing data, cleaning the data, and selecting a suitable model to train on the data. You also learned how to compare models based on their evaluation metrics and select the best-performing model to save, as a precursor to deploying the model on the cloud.