Train XGBoost models within Watson Machine Learning Accelerator

IBM Watson® Machine Learning Accelerator is a software solution that bundles IBM Watson Machine Learning Community Edition, IBM Spectrum Conductor®, IBM Spectrum Conductor Deep Learning Impact, and support from IBM for the whole stack, including the open source machine learning and deep learning frameworks. Watson Machine Learning Accelerator provides an end-to-end machine learning and deep learning platform for data scientists. This includes complete lifecycle management, from installation and configuration, through data ingest and preparation, to building, optimizing, and distributing the training model, and finally moving the model into production. Watson Machine Learning Accelerator excels when you expand your machine learning and deep learning environment to include multiple compute nodes. There’s even a free evaluation available. See the prerequisites in the introductory tutorial, Classify images with Watson Machine Learning Accelerator.

XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides parallel tree boosting (also known as GBDT, GBM) that helps solve many data science problems in a fast and accurate way. XGBoost is maintained in this GitHub repository, and V0.82 is available in Watson Machine Learning Community Edition 1.6.1.

Learning objectives

This is the fifth tutorial in the IBM Watson Machine Learning Accelerator education series. After completing this tutorial, you’ll understand how to:

  • Download the Anaconda installer
  • Import the Anaconda installer into Watson Machine Learning Accelerator and create a Conda environment
  • Create a Jupyter Notebook environment
  • Create an Apache Spark instance group with a Notebook that uses the Conda environment
  • Start the Notebook server and upload a Notebook to train an XGBoost model on a CPU or a GPU

Estimated time

This end-to-end tutorial takes approximately two hours, including about 30 minutes of model training, plus installation, configuration, and driving the model through the GUI.

Prerequisites

The tutorial requires access to a GPU-accelerated IBM Power System AC922 server. If you don't have access to such a server, there are multiple options for accessing Power Systems servers through the PowerAI Developer Portal.

Steps

Step 1. Download, install, and configure the IBM Watson Machine Learning Accelerator evaluation

  1. Download the IBM Watson Machine Learning Accelerator evaluation software from the IBM software repository. This is a 4.9 GB download and requires an IBM ID.

  2. Install and configure IBM Watson Machine Learning Accelerator using the instructions listed in the IBM Knowledge Center or the OpenPOWER Power-Up User Guide.

Step 2. Configure operating system user

  1. At the OS level, as root on all nodes, create an OS group and user for the OS execution user:

    groupadd egoadmin
    useradd -g egoadmin -m egoadmin
  2. The GID and UID of the created user/group must be the same on all nodes.
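
To confirm that the IDs match, run the following on each node and verify that the uid and gid values in the output are identical:

    id egoadmin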

Step 3: Import Anaconda installer and create Conda environment

  1. Refer to https://github.com/IBM/wmla-assets/blob/master/runbook/WMLA_installation_configuration.md and follow Steps 5.1 to 5.5.

  2. The following image shows the screen after you’ve successfully added the Anaconda distribution. Click Close.

    Successfully added Anaconda

  3. Select the newly added Anaconda distribution and click Deploy.

    Deploy Anaconda

  4. On the Deployment Settings tab, provide the following:

    • Instance name: Anaconda3-2019
    • Deployment directory: /home/egoadmin/Anaconda3-2019
    • Consumer: / (Root Consumer)
    • Resource group: ComputeHosts
    • Execution user: egoadmin

      Don’t click Deploy before completing the next step.

      Deploy Anaconda distribution

  5. Refer to https://github.com/IBM/wmla-assets/blob/master/runbook/WMLA_installation_configuration.md and follow Steps 5.9 to 5.11.

  6. Download the xgb.yml file. This file contains the IBM Watson Machine Learning Community Edition Conda package channel, packages required for the Python-based GPU package for XGBoost, and other packages required for proper functioning of the Jupyter Notebook environment in IBM Watson Machine Learning Accelerator.

     name: py-xgb-gpu
     channels:
       - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
       - defaults
     dependencies:
       - conda
       - jupyter
       - tornado=5.1.1
       - python=3.6.8
       - pyyaml
       - py-xgboost-gpu
    
  7. Click Add to create a new Conda environment.

    Adding new Conda environment

  8. Select the downloaded YML file and click Add. If there are any errors in the format of the file, an error message will display.

    Adding conda environment

  9. The next screen shows that the request to add the Conda environment was successfully submitted. Click Close.

    Add environment success

  10. Watch as the environment is created. The duration varies based on the list of Conda packages and dependencies that need to be installed in the new Anaconda environment.

    Watch environment created

This creates an environment with about 91 packages. If there are errors, you can check the logs on the hosts — in this case, the logs are placed in /home/egoadmin/Anaconda3-2019/operationlogs on all the hosts.

Anaconda environment

Step 4: Create a Notebook environment

  1. Select the Resource Groups section.

    Select Resource groups

  2. Select Create a Resource Group.

    Create a resource group

  3. Create a new GPU Resource Group called rg-gpus.

    • Enter ngpus in the Advanced formula field and select Static for the resource selection method.
    • Select all the member hosts to include in this new resource group.

      Creating resource group

  4. Select Workload > Instance Groups to create a new Spark instance group (SIG) and enable Jupyter Notebook in it.

  5. Click Create a Spark Instance Group.

    Create Spark Instance Group

  6. Fill in the following values:

    • Instance Group name: xgb-sig
    • Deployment directory: /home/egoadmin/xgb-sig
    • Execution user for instance group: egoadmin
    • Spark Version: Spark 2.3.1

      Required fields

  7. Select the Jupyter 5.4.0 Notebook and set the following properties:

    1. Select the Anaconda distribution instance for the instance name you created in Step 3 — in this case, Anaconda3-2019.
    2. Also, select the Conda environment created in Step 3.
    3. You could also give a custom path for the base data directory. Here, it has been kept empty to choose the default value of {DEPLOY_DIR_OF_SIG}/{NOTEBOOK_NAME}-{NOTEBOOK_VERSION}, where SIG refers to the Spark Instance Group. In this case, because {DEPLOY_DIR_OF_SIG} is /home/egoadmin/xgb-sig and {NOTEBOOK_NAME}-{NOTEBOOK_VERSION} is Jupyter-5.4.0, the base data directory created on the hosts is /home/egoadmin/xgb-sig/Jupyter-5.4.0.

      Basic settings

  8. Scroll down and select the rg-gpus resource group for Jupyter 5.4.0. Do not change anything else. Click Create and Deploy Instance Group.

    Selecting resource group

  9. Watch as your instance group gets deployed.

    Deploying instance

  10. Click Continue to Instance Group to see the SIG deployment status.

    Deployment status

  11. After the deployment completes, start the SIG by clicking Start.

    Starting SIG

  12. You’ll observe that the SIG processing begins.

    Process start

  13. Soon you should see that the SIG has started.

    SIG started

Step 5: Create the Jupyter Notebook server for users and upload a Notebook to train an XGBoost model

  1. After the SIG is started, go to the Notebook tab and click Create Notebooks for Users.

    Creating notebook

  2. Select the users and click Create.

    Selecting users

  3. After the Notebook is created, you should see the success message on screen.

    Success message

  4. Click the refresh button on the top panel of the screen to see My Notebooks.

    My notebooks

  5. Clicking the My Notebooks drop-down shows the list of Notebook servers created for this SIG. Click My Notebooks > Jupyter 5.4.0 – Owned by Admin.

    List of notebooks

  6. This opens the Jupyter login page. Provide the login credentials for Admin and click Log in.

    Jupyter log in page

Step 6: Test XGBoost installation

  1. Create a new Python 3 Notebook.

    Creating Python notebook

  2. In it, import XGBoost and check its version.

    Import XGBoost

  3. Download the XGBoost example notebook.

  4. Click Upload to upload the XGBoost example notebook downloaded in the previous step.

    Uploading xgboost

  5. Select the xgboost-demo.ipynb file and click Upload.

    Selecting file

  6. Click on the notebook to open it and execute the cells.

  7. This next screenshot shows the import of required Python modules, including XGBoost. Then the covertype classification dataset used for training is downloaded using scikit-learn. The dataset is then split into train and test sets and converted to the DMatrix format, an internal data structure used by XGBoost.

    Import of Python modules
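
The following is a minimal sketch of that flow, assuming the dataset is downloaded with scikit-learn's fetch_covtype; the exact code in the example notebook may differ:

    # Sketch of the data preparation described above
    import xgboost as xgb
    from sklearn.datasets import fetch_covtype
    from sklearn.model_selection import train_test_split

    print(xgb.__version__)  # expect 0.82 with Watson ML CE 1.6.1

    # Download the covertype classification dataset (7 forest cover types)
    data = fetch_covtype()
    X, y = data.data, data.target - 1  # shift labels from 1..7 to 0..6

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

    # DMatrix is XGBoost's internal data structure, optimized for
    # memory efficiency and training speed
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)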

To train a higher-accuracy model, num_rounds has to be increased to a large value, such as 3000, as shown here, but training then takes a long time on a CPU. For this demonstration, the example notebook therefore reduces num_rounds to a low value of 20 so that CPU training completes in a reasonable time. You can try larger values of num_rounds.

Specify boosting iterations

The next screen shows training of the model on the GPU and the XGBoost parameters used to perform that training. Through the XGBoost parameters, you can control the number of GPUs used for training. In this example, 'n_gpus':1 and 'gpu_id':0 have been specified, which uses one GPU with device ID 0 on the host.

Model training
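
As a sketch of what those parameters look like in XGBoost 0.82 ('n_gpus':1 and 'gpu_id':0 come from the example; the remaining values are illustrative assumptions):

    # GPU training: gpu_hist is the GPU-accelerated histogram algorithm
    param = {
        'objective': 'multi:softmax',  # multiclass classification
        'num_class': 7,                # covertype has 7 classes
        'eval_metric': 'merror',       # multiclass classification error rate
        'tree_method': 'gpu_hist',     # train on the GPU
        'n_gpus': 1,                   # number of GPUs to use
        'gpu_id': 0,                   # use the GPU with device ID 0
    }
    num_rounds = 20
    gpu_model = xgb.train(param, dtrain, num_rounds, evals=[(dtest, 'test')])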

The next screen shows training the same model on the CPU and the XGBoost parameters used to perform that training on the CPU.

Training same model
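
For CPU training, a sketch under the same assumptions drops the GPU-specific parameters and switches to the CPU histogram algorithm:

    # Same training on the CPU: swap gpu_hist for the CPU hist algorithm
    cpu_param = dict(param, tree_method='hist')
    cpu_param.pop('n_gpus', None)   # remove GPU-specific parameters
    cpu_param.pop('gpu_id', None)
    cpu_model = xgb.train(cpu_param, dtrain, num_rounds, evals=[(dtest, 'test')])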

We submitted another training run with 500 iterations of boosting. test-merror is the multiclass classification error rate, calculated as #(wrong cases)/#(all cases). We have displayed the test-merror value after the first, 100th, 200th, 300th, 400th, and 500th rounds of training. Observe that the error decreases with each round of training and that the accuracy of the CPU and GPU models is comparable.

But there is a significant difference in training time between CPU and GPU: Training the model takes about 949 seconds (~16 minutes) on the CPU, while it completes within 49 seconds on the GPU.

[0]test-merror:0.254831
..
[99]test-merror:0.130063
..
[199]test-merror:0.090828
..
[299]test-merror:0.072591
..
[399]test-merror:0.061851
..
[499]test-merror:0.054994
CPU Training Time: 949.3834052085876 seconds

[0]test-merror:0.254804
..
[99]test-merror:0.131302
..
[199]test-merror:0.090752
..
[299]test-merror:0.073623
..
[399]test-merror:0.064446
..
[499]test-merror:0.055069
GPU Training Time: 48.26804542541504 seconds
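
You can reproduce this comparison by wrapping each xgb.train call in a simple timer, for example:

    import time

    start = time.time()
    xgb.train(cpu_param, dtrain, 500, evals=[(dtest, 'test')])
    print('CPU Training Time: %s seconds' % (time.time() - start))

    start = time.time()
    xgb.train(param, dtrain, 500, evals=[(dtest, 'test')])
    print('GPU Training Time: %s seconds' % (time.time() - start))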

Conclusion

IBM Watson Machine Learning Accelerator is an excellent accelerated AI platform that drives high performance and throughput for machine learning and deep learning training. The IBM Power Systems AC922 with NVIDIA Tesla V100 GPUs is custom-built hardware for machine learning and deep learning workloads. The combination of IBM Watson Machine Learning Accelerator and the AC922 accelerates the execution time of XGBoost training on GPU compared to CPU while driving data scientist productivity.

Kelvin Lui
Sangeeth Keeriyadath
Sivakumar Krishnasamy