In the wake of the financial crisis and a limited credit supply, banks tightened their lending practices and turned to machine learning to identify risky loans more accurately. Major banks hope that such methods will help them better control lending risk. In particular, a bank might want to predict whether a loan will default based on various attributes of the borrower.
James, a bank account manager, would like to rate customers’ credit and ultimately establish a personal credit assessment model for a commercial bank. The model helps to justify approving or denying a loan application.
The C5.0 algorithm is one of the more accurate of the widely used decision tree algorithms in data mining and machine learning. Decision trees are classification models: they assign each record (observation), described by several attributes, to one of a set of classes. Although other machine learning techniques, such as topological analysis or neural networks, perform much better in many cases, the C5.0 algorithm is practical. It’s easy to understand and applicable to a wide range of problems. The C5.0 decision tree is well suited to the classification problem that James is trying to solve.
Data preparation
First, James accesses the bank data, which contains information on loans obtained from a credit agency in Germany. The data includes 1,000 observations and 21 features that are a combination of factors and integers.
Data with these characteristics is available in a data set that was donated to the UCI Machine Learning Data Repository by Hans Hofmann of the University of Hamburg.
James imports the data and builds the model with the Modeler product, as follows:
One field is “default,” which indicates whether the loan applicant was unable to meet the agreed payment terms and went into default. James opens the ‘default’ field to analyze it.
Customer credit ratings fall into two categories: “good” (1.0) and “poor” (2.0). A “good” customer is granted credit by the credit institution, which expects such customers to repay their debts on time. By contrast, “poor” customers are not expected to be able to repay their debts, so the credit institution is reluctant to grant them consumer credit. The distribution of “good” (1.0) and “poor” (2.0) customers in the sample set is shown in Table 1.
Creating random training and test data sets
To build the C5.0 model and assess its accuracy reliably, James splits the bank’s data into two portions: a training data set and a test data set. The training data set is used to build the decision tree. The test data set is used to evaluate how the model performs on new data. He uses 70% of the data for training and 30% for testing, which provides him with 283 records to simulate new applicants.
A high rate of default is undesirable for a bank because it means that the bank is unlikely to fully recover its investment. The model must therefore be able to identify applicants who are at high risk of default, allowing the bank to reject their credit requests.
James specifies the output type as ‘Decision tree’ in the C5.0 node to build the model, as shown in Table 2.
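The 70/30 split can be sketched in a few lines of Python. This is only an illustration, not the Modeler implementation: the function name `partition`, the seed, and the integer stand-ins for loan records are all hypothetical. It assumes that each record is assigned to a partition independently at random, which would explain why the test set holds roughly, rather than exactly, 30% of the 1,000 records.

```python
import random

def partition(records, train_fraction=0.7, seed=42):
    """Assign each record independently to the training or test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for record in records:
        # Each record lands in training with probability train_fraction.
        if rng.random() < train_fraction:
            train.append(record)
        else:
            test.append(record)
    return train, test

records = list(range(1000))  # hypothetical stand-ins for the 1,000 loans
train, test = partition(records)
print(len(train), len(test))  # sizes are close to 700 and 300, not exact
```

Because the assignment is random per record, a different seed gives slightly different partition sizes.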
The following table includes the analysis results:
The structure diagram on the left in Table 3 shows: “If the checking account balance is unknown or greater than 200 DM, then classify as ‘not likely to default.’ Otherwise, if the checking account balance is less than zero DM or between 1 and 200 DM, then classify as ‘likely to default.’”
The predictor importance histogram reflects the relative importance of each predictor in estimating the model. In Table 3, the most important factor is ‘checking_balance,’ which means that this field has the largest impact on the model. Because the values are relative, the values for all predictors on the display sum to 1.0. Predictor importance does not relate to model accuracy. It reflects only how much each predictor contributes to making a prediction, not whether the prediction is accurate.
It is especially important to evaluate a decision tree model with a test data set. James uses the 30% test data to verify the correctness of the model, and exports the accuracy and the error rate shown in Table 4.
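The top-level split that the tree describes can be written as a small function. This is a minimal sketch of that one rule only, not the full C5.0 tree. The function name and the category labels (`"unknown"`, `"> 200 DM"`, `"< 0 DM"`, `"1 - 200 DM"`) are assumptions about how the checking balance field is encoded.

```python
def classify_by_checking_balance(checking_balance):
    """Apply the first split of the tree to one applicant."""
    # Assumed category labels for the checking balance field.
    low_risk_balances = {"unknown", "> 200 DM"}
    if checking_balance in low_risk_balances:
        return "not likely to default"
    # Remaining categories: "< 0 DM" and "1 - 200 DM".
    return "likely to default"

for balance in ["unknown", "> 200 DM", "< 0 DM", "1 - 200 DM"]:
    print(balance, "->", classify_by_checking_balance(balance))
```

A real tree continues splitting below this node, so the second branch is a simplification.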
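To illustrate why relative importance values sum to 1.0, the sketch below normalizes a set of raw scores. The raw numbers and the field name `other_field` are invented for illustration and are not taken from the model output. The way the product computes raw importance is not described here, so only the normalization step is shown.

```python
def normalize_importance(raw_scores):
    """Scale raw importance scores so that they sum to 1.0."""
    total = sum(raw_scores.values())
    return {name: score / total for name, score in raw_scores.items()}

# Hypothetical raw scores; only the field names echo the article.
raw_scores = {
    "checking_balance": 6.0,
    "months_loan_duration": 3.0,
    "other_field": 1.0,
}
importance = normalize_importance(raw_scores)
most_important = max(importance, key=importance.get)
print(importance)
print(most_important)
```

The ranking of predictors is unchanged by the normalization. Only the scale changes.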
Of the 283 test loan application records, his model classified 198 correctly and 85 incorrectly, resulting in an accuracy of 69.96 percent and an error rate of 30.04 percent. The accuracy of the model is not very high, so he wants to know where the problem is. To further understand the performance of the algorithm, James exports coincidence matrices.
Coincidence matrices display the pattern of matches between each generated (predicted) field and its target field for categorical targets (“default”). The pattern of matches is useful for identifying systematic errors in the prediction. The matrix table is displayed with rows defined by the actual value (169 records) and columns defined by the predicted value (66 records), and each cell shows the number of records that have that pattern. In other words, the bank’s model identified 66 customers who do not meet the loan conditions as good customers. To move one step ahead, James tries to figure out how to reduce this error rate so that the bank makes the right decisions on high-risk loans.
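The reported rates can be checked with simple arithmetic. Reading the two counts as 198 correct and 85 incorrect classifications out of 283 test records, the accuracy and error rate follow directly:

```python
total_records = 283
correct = 198
incorrect = total_records - correct  # 85

accuracy = correct / total_records * 100
error_rate = incorrect / total_records * 100

print(f"accuracy:   {accuracy:.2f}%")    # accuracy:   69.96%
print(f"error rate: {error_rate:.2f}%")  # error rate: 30.04%
```

The two rates always add up to 100%, so either one is enough to describe overall performance. Neither shows which kind of error the model makes.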
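A coincidence matrix is a count of (actual, predicted) pairs. The sketch below builds one from two short label lists. The labels and the six toy records are invented for illustration and are not the bank’s results.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

# Toy labels: "yes" means the loan defaulted, "no" means it did not.
actual    = ["no", "no",  "yes", "yes", "yes", "no"]
predicted = ["no", "yes", "no",  "yes", "no",  "no"]

matrix = coincidence_matrix(actual, predicted)

# Defaults that the model predicted as safe: the costly error for a bank.
missed_defaults = matrix[("yes", "no")]

for actual_value in ["no", "yes"]:
    row = [matrix[(actual_value, predicted_value)] for predicted_value in ["no", "yes"]]
    print(actual_value, row)
print("missed defaults:", missed_defaults)
```

In the article’s test results, the cell that corresponds to `("yes", "no")` is the one holding 66 records.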
Fortunately, there are a couple of simple ways to adjust the C5.0 algorithm that might help him improve the performance of the model.
Boosting the accuracy of decision trees
The C5.0 algorithm has a special method for improving its accuracy, called boosting. Boosting builds several models in sequence and uses a weighted voting procedure to combine their separate predictions into one overall prediction.
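The weighted voting step can be sketched as follows. This is a generic illustration of weighted voting, not the exact scheme that C5.0 uses: the function name, the three tree predictions, and the weights are all hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

# Hypothetical predictions from three boosted trees for one applicant.
tree_predictions = ["default", "no default", "no default"]
tree_weights = [0.6, 0.25, 0.15]  # more accurate trees get a larger say

overall = weighted_vote(tree_predictions, tree_weights)
print(overall)
```

With these weights, the single most trusted tree outvotes the other two. With equal weights, the majority label would win instead.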
James compares the accuracy of the training sample model and the boosting training sample model.
The training sample model accuracy is 80.06%, as shown in Table 5.
The boosted training sample model accuracy is 94.28%, as shown in Table 6.
By using boosting, James reduces the training model’s error rate from 19.94% to 5.72%. He finds that boosting fits the decision tree model more closely to the training data, which improves the classification accuracy on the training samples.
For the original model on the test data, the number of false positive records is 66, as shown in Table 7.
For the boosted model on the test data, the number of false positive records is 26, as shown in Table 8.
James compares the adjusted C5.0 tree model with the original one, as summarized in the previous tables. The number of ‘bad customer’ records that are classified as good falls by more than 60% (from 66 to 26). The false positive rate and the error rate are significantly better than without optimization, which implies that the adjusted C5.0 tree model performs better.
He also wants to know which rule conditions identify the customers who do not meet the loan conditions. Rule Set models present their rules grouped by consequent (predicted category), which helps James get to the classification results more quickly.
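The size of the improvement can be checked with a short calculation. The helper name `relative_reduction` is hypothetical, and the inputs are the figures quoted in this section.

```python
def relative_reduction(before, after):
    """Percentage by which a value fell from `before` to `after`."""
    return (before - after) / before * 100

# False positives on the test data: original model vs. boosted model.
false_positive_drop = relative_reduction(66, 26)

# Error rate on the training data: original model vs. boosted model.
training_error_drop = relative_reduction(19.94, 5.72)

print(f"false positives fell by {false_positive_drop:.1f}%")  # 60.6%
print(f"training error fell by {training_error_drop:.1f}%")
```

The first figure confirms the "more than 60%" claim. The second applies to training data only, so it says less about how the model behaves on new applicants.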
Using a rule set to show customer rules
James selects “Rule set” in the model options, as shown in Table 9, and quickly gets the results that he expected, shown in Table 10.
He gets the result: “If the checking account balance is unknown or the loan duration is less than 39 months, then classify as ‘not likely to default.’ Otherwise, if the checking account balance is less than zero DM or 1 – 200 DM, then classify as ‘likely to default.’” In other words, customers who meet the conditions of rule 2 do not meet the conditions of the loan, so the bank does not give loans to those customers.
James establishes a personal credit rating model for commercial banks. He then conducts an empirical test on the personal credit data of a German bank by using the C5.0 decision tree model and the rule set model. The classification results of the C5.0 decision tree model show good accuracy. The model helps banks avoid loan risks and improve the accuracy of lending decisions, which will help him make the next decision.
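A rule set like this one can be represented as an ordered list of conditions and consequents. The sketch below is an illustration rather than the exported model: the field names, category labels, first-match evaluation order, and the fallback class for applicants that no rule covers are all assumptions.

```python
# Each rule pairs a consequent with a condition on one applicant record.
RULES = [
    ("not likely to default",
     lambda a: a["checking_balance"] == "unknown"
     or a["months_loan_duration"] < 39),
    ("likely to default",
     lambda a: a["checking_balance"] in {"< 0 DM", "1 - 200 DM"}),
]

def apply_rule_set(applicant, rules=RULES, fallback="not likely to default"):
    """Return the consequent of the first rule whose condition holds."""
    for consequent, condition in rules:
        if condition(applicant):
            return consequent
    return fallback  # assumed default class when no rule fires

long_overdrawn = {"checking_balance": "< 0 DM", "months_loan_duration": 48}
short_overdrawn = {"checking_balance": "< 0 DM", "months_loan_duration": 12}

print(apply_rule_set(long_overdrawn))   # likely to default
print(apply_rule_set(short_overdrawn))  # not likely to default
```

Because the first rule is checked first, the second rule only applies to applicants with a known balance and a loan duration of 39 months or more.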
How to identify risky bank loans with the C5.0 algorithm in Watson Studio
As the number of loans increased, the bank’s data volume increased exponentially. James hopes the C5.0 decision tree can help him build a credit evaluation model, classify customers, and judge the credibility of the model by its precision. Most importantly, the approach must support the processing of big data.
Fortunately, James heard from a friend that the distributed C5.0 analysis in IBM Watson Studio can help him analyze this kind of data, so he wrote a notebook that uses Scala for this use case. The notebook can be viewed in IBM Watson Studio.
The following steps create the model and produce the forecasts that answer the previous questions in the Watson Studio notebook. The notebook uses Scala 2.11 with Spark 2.1 as its kernel. You can also get the full code here.
The C5.0 tree model builds a single decision tree. In general, it grows the tree to its full depth and then prunes it to avoid overfitting. Use the training data set to build a C5.0 tree model as follows:
In the StatXML, the estimated model parameters appear in the following <ModelExplanation> section:
<ModelExplanation> is displayed with rows defined by the actual value (499 records) and columns defined by the predicted value (13 records), and each cell shows the number of records that have that pattern. In other words, the bank has identified 13 customers who do not meet the loan conditions as good customers.
Using the training data, build a C5.0 ruleset model to identify risky bank loans. Build the C5.0 ruleset model by setting “setBuildRuleSet” to true.
Export the C5.0 rule set model as follows:
He gets the result: “If the checking account balance is unknown or ‘> 200 DM’, then classify as ‘not likely to default.’ Otherwise, if the checking account balance is less than zero DM or 1 – 200 DM, then classify as ‘likely to default.’” In other words, customers who meet the rule conditions do not meet the conditions of the loan, so the bank does not give them loans.