Predicting fraud using skewed data


Predictive analytics uses historical data to predict future events. Typically, historical data is used to build a mathematical model that captures important trends. That model is then applied to current data to predict what will happen next or to suggest actions that lead to better outcomes. This code pattern applies the same approach to credit card fraud detection. Using a predictive model, we can automatically identify and prioritize likely fraudulent activity, so fraud units can investigate only the incidents most likely to require it. The pattern shows how to handle skewed data using different sampling techniques and how to generate accurate predictions using different statistical algorithms.


Credit card fraud is a growing problem worldwide and costs billions of dollars per year. It is a wide-ranging term for theft and fraud committed using or involving a payment card, such as a credit card or debit card, as a fraudulent source of funds in a transaction. According to 2016 data released by ACI Worldwide and financial industry consultant Aite Group, nearly 1 in 3 consumers globally have been a victim of card fraud in the past five years. The benchmark survey also reported that 14 of the 17 countries surveyed experienced an increase in card fraud between 2014 and 2016. A 2016 iovation/Aite Group study projected that credit card fraud losses might climb as high as $10 billion in the United States alone by 2020. It is therefore imperative to use technology to try to reduce these alarming numbers.

Fraudulent transactions are costly, but investigating every transaction for fraud is too expensive and inefficient. Even if it were possible, investigating innocent customers could create a poor customer experience and lead some clients to leave. Using a predictive model, you can automatically identify and prioritize likely fraudulent activity. Compared to the other solutions available, this is an efficient and accurate approach that limits human error. The goal is to minimize both false positives, where fraud is predicted but the transaction is legitimate, and false negatives, where the transaction is fraudulent but fraud is not predicted.
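To make the two error types concrete, here is a minimal sketch that counts false positives and false negatives on made-up labels. It is not taken from the pattern's notebook. The labels, the hypothetical model output, and the function names `confusion_counts` and `precision_recall` are all assumptions for illustration. It also shows why plain accuracy can mislead on skewed data: a model that never predicts fraud still scores 99% here.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels where 1 = fraud, 0 = legitimate."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # fraud, predicted fraud
        "fp": counts[(0, 1)],  # legitimate, predicted fraud (false positive)
        "fn": counts[(1, 0)],  # fraud, predicted legitimate (false negative)
        "tn": counts[(0, 0)],  # legitimate, predicted legitimate
    }


def precision_recall(c):
    """Precision drops with false positives; recall drops with false negatives."""
    flagged = c["tp"] + c["fp"]
    frauds = c["tp"] + c["fn"]
    precision = c["tp"] / flagged if flagged else 0.0
    recall = c["tp"] / frauds if frauds else 0.0
    return precision, recall


# Made-up data: 1,000 transactions, 10 of them fraudulent.
actual = [1] * 10 + [0] * 990
# Hypothetical model output: catches 8 frauds, misses 2,
# and wrongly flags 5 legitimate transactions.
predicted = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985

counts = confusion_counts(actual, predicted)
precision, recall = precision_recall(counts)
accuracy = (counts["tp"] + counts["tn"]) / len(actual)

# For comparison: a "model" that never predicts fraud.
never_fraud = confusion_counts(actual, [0] * len(actual))
baseline_accuracy = (never_fraud["tp"] + never_fraud["tn"]) / len(actual)

print(counts)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")
print(f"never-fraud baseline accuracy={baseline_accuracy:.3f}")
```

On data this skewed, precision and recall usually say more about a fraud model than accuracy does, because they track the two error types directly.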

After you have completed this code pattern, you will understand how to:

  • Build predictive models using bagging and boosting statistical techniques
  • Run different statistical models and evaluate the results
  • Sample the data to create a balance between the majority and minority populations to handle skewed data
  • Demonstrate how the sampling techniques can increase the accuracy of the predictive model
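
As a rough illustration of balancing the majority and minority populations, the sketch below implements two common techniques, random undersampling and random oversampling, using only the Python standard library. These may or may not match the techniques used in the pattern's notebook. The `is_fraud` field, the `amount` field, and the generated rows are hypothetical.

```python
import random


def split_by_class(rows, label_key="is_fraud"):
    """Separate rows into minority (fraud) and majority (legitimate) lists."""
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    return minority, majority


def random_undersample(rows, rng, label_key="is_fraud"):
    """Keep a random subset of majority rows equal in size to the minority class."""
    minority, majority = split_by_class(rows, label_key)
    kept = rng.sample(majority, k=len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


def random_oversample(rows, rng, label_key="is_fraud"):
    """Duplicate random minority rows (with replacement) until the classes match."""
    minority, majority = split_by_class(rows, label_key)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    balanced = majority + minority + extra
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 1,000 transactions, the first 10 marked as fraud.
# Both helpers assume fraud is the smaller class, as it is here.
rng = random.Random(42)
rows = [
    {"amount": round(rng.uniform(1, 500), 2), "is_fraud": 1 if i < 10 else 0}
    for i in range(1000)
]

under = random_undersample(rows, rng)
over = random_oversample(rows, rng)

print(len(under), sum(r["is_fraud"] for r in under))
print(len(over), sum(r["is_fraud"] for r in over))
```

Undersampling discards most of the legitimate transactions, while oversampling repeats the few fraud rows many times, so each has a cost. Resampling is normally applied to the training split only, so that evaluation still reflects the real class ratio.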



  1. Log in to Watson Studio and create an instance that includes object storage.
  2. Upload the .CSV file to the object storage.
  3. Import a Jupyter Notebook from the URL.
  4. Run the statistical models and sampling techniques in the notebook.
  5. Export the predictive modeling results to the object storage.
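
Before running the models in step 4, it can help to confirm how skewed the uploaded data actually is. The sketch below counts rows per class in a CSV file. The `Class` column name and the inline sample rows are assumptions for illustration, not details confirmed by this pattern. In a notebook you would pass the opened data file instead of the inline sample.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file, label_column="Class"):
    """Return {label: (row_count, share_of_total)} for the label column."""
    reader = csv.DictReader(csv_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Inline stand-in for the uploaded file; the column names are assumed.
sample = io.StringIO(
    "Amount,Class\n"
    "12.50,0\n"
    "99.99,0\n"
    "4.20,0\n"
    "250.00,1\n"
    "18.75,0\n"
)

balance = class_balance(sample)
for label, (n, share) in sorted(balance.items()):
    print(f"class {label}: {n} rows ({share:.1%})")
```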


Find the detailed instructions in the README. The steps show you how to:

  1. Create an account with IBM Cloud.
  2. Create a new Watson Studio project.
  3. Create the notebook.
  4. Add the data.
  5. Insert the dataframe.
  6. Run the notebook.
  7. Analyze the results.