The Analyze bank marketing data using XGBoost code pattern is for anyone new to Watson Studio and machine learning (ML). It’s also useful for anyone interested in using XGBoost to build a scikit-learn-based classification model for a data set with class imbalance, which is very common in practice.
The code pattern uses the bank marketing data set from the UCI repository. The data is related to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls, and often more than one contact with the same client was required to determine whether the product (a bank term deposit) would be subscribed to (‘yes’) or not (‘no’).
Why do data scientists need to care about class imbalance?
For context on why XGBoost and class imbalance are relevant topics, it’s important to understand that most real-life data sets contain very few samples of interest. In this case, the number of people who purchased the bank’s term deposit is very small compared to the number of people who didn’t. When we built a supervised machine learning model with this data set, the model didn’t perform well on the class we care about: during training, it is rewarded for predicting most of the negative samples correctly, so it appears to be doing great. However, as users, we are mostly interested in the positive samples.
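To make the problem concrete, here is a minimal sketch in plain Python. The labels are made up (a 10% positive rate is an assumption for illustration, not the real ratio in the bank data), and the "model" simply predicts the majority class every time.

```python
# Toy labels: 1 = client subscribed, 0 = client did not.
# The 10/90 split is invented for illustration only.
y_true = [1] * 10 + [0] * 90

# A "model" that always predicts the majority (negative) class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == 1 for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy = {accuracy(y_true, y_pred):.2f}")  # accuracy = 0.90
print(f"recall   = {recall(y_true, y_pred):.2f}")    # recall   = 0.00
```

The accuracy looks respectable, yet the model never finds a single subscriber, which is exactly the sample we care about.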
What is XGBoost, and why use it?
We chose to use XGBoost because:
- XGBoost is an extreme gradient boosting algorithm based on trees that tends to perform very well out of the box compared to other ML algorithms.
- XGBoost is popular with data scientists and is one of the most common ML algorithms used in Kaggle competitions.
- XGBoost exposes many tunable parameters, such as tree depth, learning rate, and number of trees.
- XGBoost supports parallel processing, so training can use multiple CPU cores.
Explore, analyze, and predict a bank client’s CD subscription
This code pattern walks data scientists through the steps needed, such as:
- How to explore data sets to get insights
- How to do data pre-processing and preparation
- How to build a scikit-learn-based ML pipeline
- How to perform multiple iterations of model tuning and improve performance
Our goal is to create a classifier that can predict whether a bank’s client will buy the certificate of deposit from the bank. We use a data set from a marketing campaign of a Portuguese bank.
As data scientists, the first thing we do before building the model is explore the data set to get preliminary insights. In this example, we use Seaborn and Matplotlib because they provide a plethora of options for visualizing the data.
Data scientists typically spend more than 60% of their time on data preparation and cleaning. It’s an important step because a bad data set with erroneous data can affect the model’s performance significantly. We use scikit-learn utilities for data processing and data transformations.
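As a rough sketch of that first look, the snippet below checks the class balance and the subscription rate per job category with pandas. The rows and the column names (`age`, `job`, `y`) are made up for illustration and are assumptions about the layout, not an excerpt from the real data. The `plot_class_balance` helper is a hypothetical name of ours; it imports Seaborn and Matplotlib lazily so the rest of the snippet runs without them.

```python
import pandas as pd

# Made-up rows that mimic a few columns of a bank marketing data set.
# Column names ("age", "job", "y") are assumptions for this sketch.
df = pd.DataFrame(
    {
        "age": [30, 45, 52, 28, 39, 61, 33, 47],
        "job": ["admin", "technician", "retired", "student",
                "admin", "retired", "services", "technician"],
        "y": ["no", "no", "yes", "no", "no", "no", "no", "yes"],
    }
)

# Share of each class in the target column: a first look at imbalance.
class_share = df["y"].value_counts(normalize=True)
print(class_share)

# Subscription rate per job category.
rate_by_job = (df["y"] == "yes").groupby(df["job"]).mean()
print(rate_by_job)


def plot_class_balance(frame):
    """Draw the class counts as a bar chart (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.countplot(data=frame, x="y")
    plt.title("Target class counts")
    plt.show()
```

Calling `plot_class_balance(df)` in a notebook would draw the same counts as a bar chart.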
Now we know how to prepare our data. The next question we ask is “Hey, I did all this preprocessing on the training data. Do I have to repeat all those steps at model scoring time? That seems repetitive.” Scikit-learn provides a great collection of transformers that can be strung into a machine learning pipeline like beads on a necklace. We build an ML pipeline so that we don’t have to redo those operations by hand at scoring time.
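Here is a minimal sketch of such a pipeline. It is not the notebook’s actual code: the rows and column names are made up, the choice of `StandardScaler` and `OneHotEncoder` is an assumption about which transformers fit, and scikit-learn’s `GradientBoostingClassifier` stands in for XGBoost so the example needs only scikit-learn and pandas.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows; column names are assumptions for this sketch.
train = pd.DataFrame(
    {
        "age": [30, 45, 52, 28, 39, 61, 33, 47],
        "job": ["admin", "technician", "retired", "student",
                "admin", "retired", "services", "technician"],
        "y": [0, 0, 1, 0, 0, 1, 0, 1],
    }
)
X_train, y_train = train[["age", "job"]], train["y"]

# Scale the numeric column and one-hot encode the categorical one.
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["age"]),
        # Categories unseen during training are encoded as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocess", preprocess),
        # Stand-in classifier; the code pattern itself uses XGBoost.
        ("model", GradientBoostingClassifier(random_state=0)),
    ]
)
pipeline.fit(X_train, y_train)

# At scoring time the same preprocessing is applied automatically.
new_clients = pd.DataFrame({"age": [58, 25], "job": ["retired", "housemaid"]})
predictions = pipeline.predict(new_clients)
print(predictions)
```

Because the transformers live inside the pipeline, `pipeline.predict` accepts raw rows, and the scaling and encoding learned on the training data are reused without any extra code.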
If your data set is balanced, that is, the number of positive samples is about the same as the number of negative samples, accuracy is a reasonable way to judge a classifier. However, often there is a class imbalance. For example, in our training data set, the number of clients who bought the bank’s product is low compared to the total number of bank clients. Because of this, we can’t use accuracy as the model performance metric. Instead, we use the confusion matrix, the ROC curve, and the precision-recall (PR) curve. For our use case, the PR curve is the best fit, and recall in particular is the metric we optimize for.
We run three iterations, one after another, to improve our classifier’s performance, analyzing each in detail with various statistics:
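The sketch below shows how these metrics can be computed with `sklearn.metrics`. The labels, scores, and the two thresholds are made up for illustration, and `evaluate` is a hypothetical helper name of ours.

```python
from sklearn.metrics import (
    confusion_matrix,
    precision_recall_curve,
    precision_score,
    recall_score,
)

# Made-up labels and predicted probabilities for ten clients (1 = subscribed).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.15, 0.32, 0.70, 0.90]


def evaluate(threshold):
    """Confusion-matrix counts, precision, and recall at one threshold."""
    y_pred = [int(s >= threshold) for s in y_score]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "tn": int(tn), "fp": int(fp), "fn": int(fn), "tp": int(tp),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }


at_050 = evaluate(0.5)  # misses one subscriber
at_030 = evaluate(0.3)  # finds every subscriber, with more false alarms
print(at_050)
print(at_030)

# Precision and recall across all thresholds, ready to plot as a PR curve.
precisions, recalls, thresholds = precision_recall_curve(y_true, y_score)
```

On this toy data, lowering the threshold from 0.5 to 0.3 raises recall from 2/3 to 1.0 while precision drops from 2/3 to 0.5. That is the trade-off the PR curve makes visible.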
- Naive approach
- Weighted samples approach
- Feature selection with weighted samples approach
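As a sketch of the idea behind the weighted samples approach, the snippet below derives a positive-class weight from made-up labels. It assumes the weighting is passed to XGBoost through its `scale_pos_weight` parameter or as per-row `sample_weight` values at fit time; the notebook may do this differently. The other values in `xgb_params` are illustrative, not tuned settings from the code pattern.

```python
from collections import Counter

# Made-up training labels: 1 = subscribed, 0 = did not.
y_train = [1] * 100 + [0] * 800

counts = Counter(y_train)

# Ratio of negatives to positives: each positive row counts this many times.
scale_pos_weight = counts[0] / counts[1]
print(scale_pos_weight)  # 8.0

# Illustrative parameters; with xgboost installed they could be passed as
# xgboost.XGBClassifier(**xgb_params).
xgb_params = {
    "n_estimators": 200,
    "max_depth": 4,
    "learning_rate": 0.1,
    "n_jobs": 4,  # number of parallel threads
    "scale_pos_weight": scale_pos_weight,
}

# The same idea as per-row weights, usable as fit(X, y, sample_weight=...).
sample_weight = [scale_pos_weight if label == 1 else 1.0 for label in y_train]

pos_total = sum(w for w, label in zip(sample_weight, y_train) if label == 1)
neg_total = sum(w for w, label in zip(sample_weight, y_train) if label == 0)
print(pos_total, neg_total)  # 800.0 800.0
```

With these weights, the two classes contribute equally to the training loss, so the model can no longer score well by ignoring the positives.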
Finally, we test the trained model on the test (held-out) data set to see its efficacy. In the end, we point the reader to the literature on SMOTE (Synthetic Minority Over-sampling Technique), such as the original SMOTE paper, for creating better classifiers.
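For readers curious about the core idea behind SMOTE, here is a simplified NumPy sketch: it creates synthetic minority rows by interpolating between a minority sample and one of its nearest minority neighbours. It is not a full implementation and is not part of the code pattern. The function name `smote_sketch`, the random choice of base samples, and the toy data are our own assumptions, and it handles numeric features only.

```python
import numpy as np


def smote_sketch(X_minority, n_synthetic, k=3, seed=0):
    """Interpolate new rows between minority samples and their neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(n)                         # random minority row
        neighbour = neighbours[base, rng.integers(k)]  # one of its k nearest
        gap = rng.random()                             # position along the segment
        synthetic[i] = X_minority[base] + gap * (
            X_minority[neighbour] - X_minority[base]
        )
    return synthetic


# Toy minority class: four points on the corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_rows = smote_sketch(X_min, n_synthetic=5)
print(new_rows.shape)  # (5, 2)
```

Each synthetic row lies on the line segment between two real minority rows, so the new samples stay inside the region the minority class already occupies.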
Awesome job going through the blog! Now go try and take this further or apply it to a different use case!
With this information, I encourage you to use a different data set to apply the various techniques discussed. I hope you’ll take the time to check out my code, follow along, build upon it, and beat the score!