Accelerate tree-based model training with Watson Machine Learning Accelerator and Snap ML

Introduction

Machine learning (ML) plays an increasingly important role in the everyday business of many enterprises. Businesses increasingly value its ability to reduce running costs through automation, to increase revenue through higher productivity, and even to safeguard against security and regulatory concerns. Once a suitable ML model is identified for a particular application, it typically needs to be tuned, that is, trained multiple times with different hyper-parameter values, until it achieves the desired level of predictive accuracy. The ability to train ML models quickly not only saves resource time and thus cost; it also increases productivity, because the model can be put to work serving customers sooner, and it allows models to adapt to rapidly changing environments.
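
To make the tuning loop concrete, here is a minimal, generic sketch using scikit-learn's GridSearchCV on synthetic data. The dataset, grid values, and model settings below are purely illustrative and are not part of this tutorial's workload:

# Illustrative only: hyper-parameter tuning means training the same model
# several times with different settings and keeping the best one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 10,000 examples, 18 features (mirroring the
# dataset dimensions used later in this tutorial).
X, y = make_classification(n_samples=10000, n_features=18, random_state=42)

# Each parameter combination triggers a full training run; this grid
# (4 combinations x 3 folds = 12 trainings) is purely illustrative.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [50, 100], "max_depth": [4, 8]},
    cv=3,
)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print("Best cross-validation accuracy: %.3f" % search.best_score_)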

We will demonstrate how to train a random forest model, a popular ML model, roughly twice as fast as with scikit-learn, the ubiquitous open-source ML library used by data scientists today. We achieve this by using a new library for ML training called Snap Machine Learning, or Snap ML. The Snap ML library offers GPU acceleration and distributed-computing capabilities that speed up the training of ML models and make it possible to handle large datasets efficiently. We will show how to deploy Snap ML within the IBM Watson™ Machine Learning Accelerator (WMLA) platform, IBM's offering for managing the entire ML/DL project pipeline.

We have chosen to showcase an application from the financial services sector, namely credit default risk prediction. This is a critical application in every bank or lending institution that provides loans to customers: the institution captures data about applicants' past financial activity in order to determine the likelihood that they will repay a loan. ML can improve the efficiency of handling loan applications and increase the lender's profit by avoiding less credit-worthy customers.

Learning objectives

In this tutorial, we will train a random forest model on a credit default risk dataset to solve a binary classification task: predicting whether a credit applicant will default. The dataset consists of 1.1 GB of credit data in which each example is a credit application described by features such as credit history, transaction amount, account type, and state. It contains 10 million examples and 18 features. We will explain how to configure WMLA to run a Jupyter notebook in which we use Snap ML to accelerate the training of random forest models.

This tutorial consists of two parts:

Part 1: Installation and Configuration

  • Download the Anaconda installer
  • Import the Anaconda installer into WMLA
  • Deploy the newly configured Anaconda distribution
  • Create a Conda environment using the newly deployed Anaconda
  • Create a Notebook environment
  • Create a Spark Instance Group (SIG) for the Notebook
  • Create the notebook server and upload a notebook where we run Snap ML

Part 2: Running a random forest model for Credit Default Risk Prediction (Snap ML vs. scikit-learn)

Prerequisites

The tutorial requires access to a GPU-accelerated IBM Power Systems server, model AC922 or S822LC. Besides acquiring a server, there are multiple options for accessing Power Systems servers, listed on the PowerAI Developer Portal.

Part 1: Installation and configuration

0. Download, install, and configure the IBM Watson Machine Learning Accelerator Evaluation

  1. Download the IBM Watson Machine Learning Accelerator Evaluation software from the IBM software repository. This is a 4.9 GB download and requires an IBM ID.
  2. Install and configure Watson Machine Learning Accelerator using the instructions listed in the IBM Knowledge Center or the OpenPOWER Power-Up User Guide.

1. Download the Anaconda installer

Download the following installer script to your workstation. You can use wget or your browser's download option for the URL.

wget https://repo.continuum.io/archive/Anaconda3-2018.12-Linux-ppc64le.sh

2. Import the Anaconda installer into WMLA

  1. Open the Spark Anaconda Management panel by using the Spectrum Conductor management console: Workload > Spark > Anaconda Management.

  2. Add a new Anaconda by clicking Add and fill in the details.

    a) Distribution name: Anaconda-2018
    b) Use Browse to find and select the Anaconda installer downloaded in Task 1.
    c) Anaconda version: 2018.12
    d) Python version: 3
    e) Operating system: Linux on Power 64-bit little endian (LE)

The fields in c), d), and e) are filled in automatically after you select the Anaconda installer in step b).

  3. Click Add to begin the Anaconda upload. Uploading and extracting the distribution package can take several minutes depending on your network speed.

3. Deploy the newly configured Anaconda distribution

  1. On all compute nodes, create a directory on the local disk space for an Anaconda deployment. In this example, the local disk space is /localhome/egoadmin, and the execution user used in the SIG is egoadmin. Your local disk space and execution user may be different.
    mkdir -p /localhome/egoadmin/anaconda
    chown egoadmin:egoadmin /localhome/egoadmin/anaconda
  2. Select the Anaconda distribution created in Task 2 and click Deploy.

  3. Fill in the required information. Start with the Deployment Settings field.

The deployment directory matches the one created in Task 3.1. The consumer in this example is the Root Consumer; in your environment, it may be different.

  4. Click the Environment Variables tab, add PATH and IBM_POWERAI_LICENSE_ACCEPT, and then click Deploy. The deployment may take several minutes depending on your network speed. When it succeeds, you should see the following:

If the deployment fails, check the log files. Make sure your deployment directory is on the local (not shared) disks of the compute nodes (Task 3.1).

4. Create a Conda environment using the newly deployed Anaconda

  1. Download or create a powerai16.yml file on your workstation with the following content (note the indentation). This YAML file is used to create the Conda environment. If you do not have a YAML-enabled editor, you can verify that the file is valid with a quick local check like the one shown after the listing, or by pasting the contents into an online YAML validation tool.
name: powerai16
channels:
  - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/linux-ppc64le/
  - defaults
dependencies:
  - conda=4.5.12
  - jupyter
  - tornado=5.1.1
  - sparkmagic
  - numpy
  - numba
  - openblas
  - pandas
  - python=3.6.8
  - keras
  - matplotlib
  - scikit-learn
  - scipy
  - cuml
  - cudf
  - powerai=1.6.1
  - cudatoolkit-dev
  - pai4sk=1.4.0
  - pip:
    - sparkmagic

If you want to include other Conda or pip packages, add them to the dependencies and pip lists in the YAML file.
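
As a quick local alternative to an online validator, the following sketch parses the file with PyYAML and prints a couple of its fields. The file name matches the listing above; the PyYAML dependency is an assumption (it ships with most Python distributions, but you may need to install it):

# Quick sanity check that powerai16.yml is syntactically valid YAML.
# Requires the PyYAML package (pip install pyyaml).
import yaml

with open("powerai16.yml") as f:
    env = yaml.safe_load(f)

print("Environment name:", env["name"])
print("Dependencies listed:", len(env["dependencies"]))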

  2. Select the Anaconda distribution deployed in Task 3 and click Add to add a Conda environment.

The Conda environment is now being created; it contains over 200 packages. If Add fails, check the logs and verify that the YAML file is formatted correctly, then retry Add once the issue is resolved. This operation takes several minutes to complete. Once it completes successfully, you should see the following (here, 236 Conda packages were installed on the compute hosts):

Note: The YAML file used in this tutorial creates a Conda environment with 236 packages. Not all of them are required to run our Snap ML vs. scikit-learn demo; however, we included as many packages as possible so that you can reuse the environment for other ML tasks as well.

5. Create a Notebook environment

  1. We use the notebook provided with IBM Spectrum Conductor. Open the Spark Notebook Management panel by using the Spectrum Conductor management console: Workload > Spark > Notebook Management.

  2. Notice that there is a notebook called Jupyter, version 5.4.0. If you select it and click Configure, you can view the settings for this notebook: the notebook package name, the scripts in use, the use of SSL, and the required Anaconda distribution.

Currently, due to a RAPIDS package dependency called faiss, we need to apply a patch to the standard Jupyter 5.4.0 notebook's deploy.sh script. Download the patched notebook to your workstation and replace the one that comes with Conductor by clicking Browse and selecting the patched notebook, then clicking Update Notebook.

6. Create a SIG for the Notebook

  1. On either node, create a directory within the shared filesystem that will store the data for the execution user. In this example, the shared disk space is /home/egoadmin, and the execution user for the SIG is egoadmin. Your shared disk space and execution user may be different.

    mkdir -p /home/egoadmin/notebook-snapml
    chown egoadmin:egoadmin /home/egoadmin/notebook-snapml
    
  2. Create a new SIG and include the added notebook. Workload > Spark > Spark Instance Groups.

  3. Click New to create a SIG for the notebook added in Task 5.

  4. Fill in the information with the following values:
    a. Instance group name: Notebook-SnapML
    b. Deployment directory: /localhome/egoadmin/notebook-snapml (a local disk folder)
    c. Spark version: use the latest one available.

  5. Select the Jupyter 5.4.0 notebook and set the following properties:
    a. Base data directory: /home/egoadmin/notebook-snapml (created in Task 6.1)
    b. Anaconda distribution instance: the one deployed in Task 3
    c. Conda environment: the one created in Task 4

  6. Scroll down to the Consumers section of the Spark Instance Group. The process automatically creates a consumer (in this example, /Notebook-SnapML) that we need to change:

  7. Scroll down until you find the suggested consumer name (in this example, /Notebook-SnapML) and click the X to delete it:

  8. In this example, we select the root consumer and create the new SIG Notebook-SnapML as its child. In your case, the new SIG may need to be the child of a different consumer. Click Create > Select.

  9. Scroll down to the Resource Groups and Plans section and select the GPUHosts resource group for Spark Executors (GPU slots). Do not change the other fields.

  10. Click Create and Deploy Instance Group at the bottom of the page. Watch as the newly created instance group Notebook-SnapML gets deployed.

  11. After the deployment completes, start the SIG by clicking Start.

7. Create the notebook server and upload a notebook where we run Snap ML

  1. After the SIG has started, go to the Notebook tab and click Create Notebooks for Users.

  2. Select the users who should get a notebook server and click Create.

  3. After the notebook has been created, refresh the screen to see My Notebooks. Clicking it shows the list of notebook servers created for this SIG. Select the Jupyter 5.4.0 notebook to bring up the notebook server URL.

Sign on to the notebook server using your credentials.

Download the tutorial-snap-ml-credit-risk-rf-notebook and upload it to the notebook server by clicking Upload. You must click Upload again after specifying the notebook to upload.

Part 2: Running the random forest model for Credit Default Risk Prediction (Snap ML vs. scikit-learn)

The notebook uploaded in Task 7 downloads a credit risk dataset, pre-processes it (e.g., handles the categorical features), splits it into train and test sets, trains a random forest model on CPUs using Snap ML, and evaluates the trained model on the test set. The notebook also trains a random forest model using scikit-learn, so that we can compare the training time and prediction accuracy of Snap ML against the popular open-source library. The core of this comparison looks roughly like the sketch below.
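
The following is a minimal sketch of the comparison, not the notebook itself. The pai4sk import is the Snap ML interface installed by the powerai16.yml environment above, but the file name, column name, and all hyper-parameter values here are illustrative assumptions:

import time
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as SklearnRF
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical file and label column names -- the actual notebook defines its own.
df = pd.read_csv("credit_risk.csv")
X = pd.get_dummies(df.drop(columns=["default"])).astype("float32")  # one-hot encode categoricals
y = df["default"].values
X_train, X_test, y_train, y_test = train_test_split(
    X.values, y, test_size=0.3, random_state=42)

# Snap ML random forest (pai4sk); hyper-parameters shown are illustrative.
from pai4sk import RandomForestClassifier as SnapRF
snap_rf = SnapRF(n_estimators=100, max_depth=8, n_jobs=40, random_state=42)
t0 = time.time()
snap_rf.fit(X_train, y_train)
print("Snap ML training time: %.1f s" % (time.time() - t0))
print("Snap ML accuracy: %.4f" % accuracy_score(y_test, snap_rf.predict(X_test)))

# scikit-learn random forest with comparable settings.
sk_rf = SklearnRF(n_estimators=100, max_depth=8, n_jobs=40, random_state=42)
t0 = time.time()
sk_rf.fit(X_train, y_train)
print("scikit-learn training time: %.1f s" % (time.time() - t0))
print("scikit-learn accuracy: %.4f" % accuracy_score(y_test, sk_rf.predict(X_test)))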

Snap ML trains a random forest model on a training dataset with 7 million examples and 51 features in about 120 seconds using 40 CPU threads, while scikit-learn requires almost 230 seconds to train a model of similar accuracy. As these results show, Snap ML is nearly 2 times faster (230 s / 120 s ≈ 1.9x) at training a random forest model that predicts credit default risk.

Summary

The Snap ML library accelerates the training of ML models. In this tutorial, we trained a random forest classifier to predict credit default risk, a highly relevant application for financial companies. Snap ML speeds up this training workload by nearly a factor of two, reducing execution time from about 230 seconds (with scikit-learn) to about 120 seconds.

Kelvin Lui
Andreea Anghel
Haris Pozidis