Class imbalance, where one class of data (positive) has far fewer samples than another class (negative), is a common problem in data science. As a data scientist, you want to handle this problem and still build a classifier that performs well. This code pattern uses XGBoost, scikit-learn, and Python in IBM Watson™ Studio with a highly imbalanced data set to predict whether a client will purchase a certificate of deposit (CD) from a banking institution.
Most machine learning algorithms work best when each class has roughly the same number of instances. Problems can appear when one class greatly outnumbers the other, as it does when positive samples are far rarer than negative ones. XGBoost (eXtreme Gradient Boosting) is a widely used gradient-boosted decision tree library for classification and regression, but it often needs some tuning to produce good classification models for imbalanced data sets.
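One common adjustment is to weight the positive class more heavily. The sketch below computes the ratio of negative to positive samples, which the XGBoost documentation suggests as a typical starting value for its `scale_pos_weight` parameter. The label values and the helper name `suggested_scale_pos_weight` are assumptions made for illustration; they are not taken from the pattern's notebook.

```python
from collections import Counter


def suggested_scale_pos_weight(labels):
    """Return negatives / positives for binary labels encoded as 0 and 1."""
    counts = Counter(labels)
    positives = counts[1]
    negatives = counts[0]
    if positives == 0:
        raise ValueError("labels contain no positive samples")
    return negatives / positives


# Hypothetical label column: 1 = client bought a CD, 0 = client did not.
labels = [1] * 50 + [0] * 950

weight = suggested_scale_pos_weight(labels)
print(f"positive share: {labels.count(1) / len(labels):.1%}")
print(f"suggested scale_pos_weight: {weight:.1f}")

# The value would then be passed to the classifier, for example:
#   xgboost.XGBClassifier(scale_pos_weight=weight)
```

The ratio is only a starting point. In practice it would be treated as one more hyperparameter to tune against a metric that reflects the minority class.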
In this code pattern, we show how to perform machine learning classification with XGBoost, which is often a stronger choice than logistic regression and other techniques for this kind of problem. We use a real-life data set that is highly imbalanced. In our case, the imbalance arises because the bank has a large number of customers, but only a small number of them actually purchase a CD.
When you have completed this code pattern, you will have worked through these conceptual steps:
- Data set description
- Exploratory analysis to understand the data
- Preprocessing techniques to clean and prepare the data
- Naive XGBoost to run the classification
- Cross-validation to train and assess the model
- Plots of the precision-recall curve and the ROC curve
- Tuning the model and weighting positive samples to improve classification performance
- Oversampling of the minority class and undersampling of the majority class
- SMOTE algorithms
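The resampling steps in the list above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the notebook's code and not the implementation found in any library: the function names, the two-feature toy data, and the neighbour count `k` are all hypothetical. The `smote_like` function shows only the core idea of SMOTE, which is to create synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours.

```python
import random


def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class, without replacement."""
    return rng.sample(majority, n_keep)


def random_oversample(minority, n_total, rng):
    """Duplicate randomly chosen minority rows until n_total rows exist."""
    extra = [rng.choice(minority) for _ in range(n_total - len(minority))]
    return minority + extra


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points, each on the line segment between a
    minority point and one of its k nearest minority neighbours."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k closest other minority points to the chosen base point.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


rng = random.Random(0)

# Hypothetical two-feature data: 4 minority rows and 20 majority rows.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority = [(rng.uniform(3, 6), rng.uniform(3, 6)) for _ in range(20)]

under = random_undersample(majority, n_keep=len(minority), rng=rng)
over = random_oversample(minority, n_total=len(majority), rng=rng)
synthetic = smote_like(minority, n_new=16, k=2, rng=rng)

# Majority size after undersampling, minority size after oversampling,
# and minority size after adding the synthetic points.
print(len(under), len(over), len(minority) + len(synthetic))
```

Whichever approach is used, resampling would normally be applied to the training split only, so that the evaluation data keeps the original class distribution.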
The flow of this code pattern is as follows:
- Log in to IBM Watson Studio.
- Upload the data as a data asset into Watson Studio.
- Start a notebook in Watson Studio and input the data asset previously created.
- Use pandas to read the data file into a DataFrame for the initial data exploration.
- Use Matplotlib and Seaborn, a higher-level package built on top of it, to create various visualizations.
- Use scikit-learn to build the ML pipeline that prepares the data for XGBoost.
- Use XGBoost to create and train the ML model.
- Evaluate the model's predictive performance.
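The evaluation step deserves care on imbalanced data, because accuracy alone can be misleading. The sketch below compares a trivial classifier that always predicts "no purchase" with a hypothetical model, using precision and recall computed from the confusion counts. All labels and predictions here are made up for illustration; in the notebook, scikit-learn's `sklearn.metrics` module (for example `precision_recall_curve` and `roc_curve`) would typically be used instead of hand-written helpers.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels encoded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Compute accuracy, precision, and recall for the positive class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Precision and recall are reported as 0.0 when undefined.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical test set: 1000 clients, of whom 50 bought a CD.
y_true = [1] * 50 + [0] * 950

# Baseline that never predicts a purchase.
always_negative = [0] * 1000

# Hypothetical model: finds 30 of the 50 buyers and raises 20 false alarms.
model_pred = [1] * 30 + [0] * 20 + [1] * 20 + [0] * 930

baseline = summarize(y_true, always_negative)
model = summarize(y_true, model_pred)
print("baseline:", baseline)
print("model:   ", model)
```

The two classifiers differ by only one percentage point of accuracy, yet the baseline never identifies a single buyer. This is why the pattern relies on the precision-recall and ROC curves rather than on accuracy.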
Find the detailed steps for this pattern in the README. The steps will show you how to:
- Sign up for Watson Studio.
- Create a new Watson Studio project.
- Create the Spark Service.
- Create the notebook.
- Upload data.
- Run the notebook.
- Save and share.