Expedite retail price prediction with Watson Machine Learning Accelerator hyperparameter optimization

The adoption of artificial intelligence (AI) has been increasing across all business sectors as more industry leaders understand the value that data and machine learning models can bring to their business. Benefits that cut across many sectors of the economy include lower operational costs due to process automation, higher revenue thanks to better productivity and enhanced user experiences, and improved compliance and reinforced security.

In particular, AI in retail can provide benefits with optimization, automation, and scale. Retailers today are using data to understand customers and enhance existing offerings to differentiate from the competition. They can also better understand shopping behavior data, anticipate customer needs and interests, and respond with personalized offers and experiences to increase the effectiveness of their promotions and boost sales.

One key workload for every retailer is price optimization, for example, the determination of a suitable offering price for a particular item. Here, the opportunity that AI brings is optimization across a wide assortment of items based on various factors. AI models can be used to determine the best price for each item using data on seasonality along with real-time inputs on inventory levels and competitive products and prices. AI can also show retailers likely outcomes of different pricing strategies so that they can create the best promotional offers, acquire more customers, and increase sales.

To realize these potential benefits, it’s crucial to design and deploy machine learning models that are able to predict the most suitable price for each item with the highest possible accuracy. It’s widely acknowledged today that Gradient Boosting Machine (GBM) is among the most powerful machine learning models, offering the highest generalization accuracy for most tasks that involve tabular data sets. With that motivation, this tutorial focuses on GBM as the machine learning model of choice for the price optimization task. We use a public data set from Kaggle, the Mercari price optimization competition. We then use a popular GBM implementation, XGBoost. To achieve good generalization accuracy with XGBoost, we perform hyperparameter tuning (HPT), that is, trying different hyperparameter sets and selecting the ones that give the best accuracy on a validation data set. To do that, we use the Watson Machine Learning Accelerator suite, a resource orchestrator and task scheduler that can seamlessly distribute the HPT task across a cluster of nodes and GPUs.

In this tutorial, you use Watson Machine Learning Accelerator on premises to experience its ease of use and high resource efficiency for distributed machine learning jobs, as well as the power of the HPT process, which produces an XGBoost model with higher generalization accuracy on unseen data.

Prepare cluster for HPO job submission: Create conda environment (on all nodes)

To prepare a POWER cluster:

  1. Create a conda environment and install a pre-built XGBoost library (as the egoadmin user).

     conda create --name dli-xgboost python=3.6 py-xgboost-gpu

To prepare an x86 cluster:

  1. Compile XGBoost.

     git clone --recursive https://github.com/dmlc/xgboost
     cd xgboost/
     mkdir build
     cd build
     cmake3 .. -DUSE_CUDA=ON
     make -j
  2. Create a conda environment and install XGBoost into it.

     conda create --name dli-xgboost --yes pip python=3.7
     conda activate dli-xgboost
     conda install numpy scikit-learn scipy
     cd ../python-package
     python setup.py install
     conda deactivate

Create XGBoost BYOF plug-in (on management node only)

To create the plug-in:

export DLI_EGO_TOP=/opt/ibm/spectrumcomputing
  1. Check the value of DL_NFS_PATH.

     $ cat $DLI_EGO_TOP/dli/conf/dlpd/dlpd.conf | grep DL_NFS_PATH
         "DL_NFS_PATH": "/dlishared",
     export DL_NFS_PATH=/dlishared
  2. Create a file called XGboost.conf with the following content.

     {
         "desc":
         [{" ": "XGboost. Currently in development phase."},
          {" ": "Examples:"},
          {" ": "$ python dlicmd.py --exec-start XGboost <connection-options> --ig <ig> --model-main XGboost_Main.py"}
         ],
         "deployMode Desc": "Optional",
         "deployMode": "cluster",
         "appName Desc": "This is required",
         "appName": "dlicmdXGboost",
         "numWorkers Desc": "Optional number of workers",
         "numWorkers": 1,
         "maxWorkers Desc": "User can't specify more than this number",
         "maxWorkers": 1,
         "maxGPUPerWorker Desc": "User can't specify more than this number",
         "maxGPUPerWorker": 10,
         "egoSlotRequiredTimeout Desc": "Optional",
         "egoSlotRequiredTimeout": 120,
         "workerMemory Desc": "Optional",
         "workerMemory": "2G",
         "frameworkCmdGenerator Desc": "",
         "frameworkCmdGenerator": "XGboostCmdGen.py"
     }
  3. Move it to the following directory.

     sudo mv XGboost.conf $DLI_EGO_TOP/dli/conf/dlpd/dl_plugins
  4. Create a file called XGboost_wrapper.sh with the following content.

     #!/bin/bash
     source activate dli-xgboost
     python "$@"
  5. Create a file called XGboostCmdGen.py with the following content.

     #!/usr/bin/env python2
     import os.path, sys
     from os import environ

     def main():
        cmd = ""
        if "DLI_SHARED_FS" in os.environ:
           cmd = environ.get('DLI_SHARED_FS') + "/tools/spark_tf_launcher/launcher.py"
        else:
           print("Error: environment variable DLI_SHARED_FS must be defined")
           sys.exit(1)
        if "APP_NAME" in os.environ:
           cmd = cmd + " --sparkAppName=" + environ.get('APP_NAME')
        else:
           print("Error: environment variable APP_NAME must be defined")
           sys.exit(1)
        if "MODEL" in os.environ:
           cmd = cmd + " --model=" + environ.get('MODEL')
        else:
           print("Error: environment variable MODEL must be defined")
           sys.exit(1)
        if "REDIS_HOST" in os.environ:
           cmd = cmd + " --redis_host=" + environ.get('REDIS_HOST')
        else:
           print("Error: environment variable REDIS_HOST must be defined")
           sys.exit(1)
        if "REDIS_PORT" in os.environ:
           cmd = cmd + " --redis_port=" + environ.get('REDIS_PORT')
        else:
           print("Error: environment variable REDIS_PORT must be defined")
           sys.exit(1)
        if "GPU_PER_WORKER" in os.environ:
           cmd = cmd + " --devices=" + environ.get('GPU_PER_WORKER')
        else:
           print("Error: environment variable GPU_PER_WORKER must be defined")
           sys.exit(1)
        cmd = cmd + " --work_dir=" + os.path.dirname(environ.get('MODEL'))
        cmd = cmd + " --app_type=executable"
        cmd = cmd + " --model=" + environ.get('DLI_SHARED_FS') + "/tools/dl_plugins/XGboost_wrapper.sh --"
        cmd = cmd + " " + environ.get('MODEL')
        # Append any user-supplied arguments
        for i in range(1, len(sys.argv)):
           cmd += " " + sys.argv[i]
        # Expected result in JSON
        print('{"CMD" : "%s"}' % cmd)

     if __name__ == "__main__":
        main()
  6. Move those files and make them executable.

     sudo mv XGboost_wrapper.sh $DL_NFS_PATH/tools/dl_plugins
     sudo mv XGboostCmdGen.py $DL_NFS_PATH/tools/dl_plugins
     sudo chmod +x $DL_NFS_PATH/tools/dl_plugins/XGboost_wrapper.sh
     sudo chmod +x $DL_NFS_PATH/tools/dl_plugins/XGboostCmdGen.py
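
To sanity check the generator before submitting a job, you can reproduce its command-building logic locally. This sketch mirrors the logic of XGboostCmdGen.py with sample values; the paths and host names here are placeholders for illustration, not values from a real deployment.

```python
import os
import json

# Placeholder values standing in for the variables the DLI plug-in
# framework would set for a real job.
env = {
    "DLI_SHARED_FS": "/dlishared",
    "APP_NAME": "dlicmdXGboost",
    "MODEL": "/dlishared/models/XGboost_Main.py",
    "REDIS_HOST": "mgmt-node",
    "REDIS_PORT": "6379",
    "GPU_PER_WORKER": "1",
}

# Build the launcher command exactly as XGboostCmdGen.py does
cmd = env["DLI_SHARED_FS"] + "/tools/spark_tf_launcher/launcher.py"
cmd += " --sparkAppName=" + env["APP_NAME"]
cmd += " --model=" + env["MODEL"]
cmd += " --redis_host=" + env["REDIS_HOST"]
cmd += " --redis_port=" + env["REDIS_PORT"]
cmd += " --devices=" + env["GPU_PER_WORKER"]
cmd += " --work_dir=" + os.path.dirname(env["MODEL"])
cmd += " --app_type=executable"
cmd += " --model=" + env["DLI_SHARED_FS"] + "/tools/dl_plugins/XGboost_wrapper.sh --"
cmd += " " + env["MODEL"]

# The framework expects the result as JSON on stdout
print(json.dumps({"CMD": cmd}))
```

The wrapper script path ends up as the final `--model` argument, which is why XGboost_wrapper.sh must be executable and on the shared file system.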

Download and prepare the data set

mkdir $DL_NFS_PATH/datasets/higgs
cd $DL_NFS_PATH/datasets/higgs
wget --no-check-certificate https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
gunzip HIGGS.csv.gz
mkdir train
mkdir val
mkdir test
  1. Create the following Python script (preprocess.py) in the current folder.

     import pandas as pd
     from sklearn.model_selection import train_test_split
     import xgboost as xgb
     df = pd.read_csv("HIGGS.csv", header=None)
     data = df.values
     y = data[:,0]
     X = data[:,1:]
     X_tmp, X_test, y_tmp, y_test = train_test_split(X,y, random_state=42)
     X_train, X_val, y_train, y_val = train_test_split(X_tmp,y_tmp, random_state=42)
     print("Number of features: %d" % (X_train.shape[1]))
     print("Number of training  examples: %d" % (X_train.shape[0]))
     print("Number of validation examples: %d" % (X_val.shape[0]))
     print("Number of test       examples: %d" % (X_test.shape[0]))
     dx_train = xgb.DMatrix(X_train, y_train)
     dx_val   = xgb.DMatrix(X_val, y_val)
     dx_test  = xgb.DMatrix(X_test, y_test)
     # Save the DMatrix files so the training scripts can load them directly
     dx_train.save_binary("train/pp_train.dmatrix")
     dx_val.save_binary("val/pp_val.dmatrix")
     dx_test.save_binary("test/pp_test.dmatrix")
  2. Execute the pre-processing script to generate the data files.

     conda activate dli-xgboost
     conda install pandas
     python preprocess.py

    You should see the following output:

     Number of features: 28
     Number of training   examples: 6187500
     Number of validation examples: 2062500
     Number of test       examples: 2750000
  3. Check the value of DLI_DATA_FS.

     $ cat $DLI_EGO_TOP/dli/conf/dlpd/dlpd.conf | grep DLI_DATA_FS
         "DLI_DATA_FS": "/dlidata/",
  4. Copy the train and validation data sets to DLI_DATA_FS.

     $ pwd
     $ ls -lt
     -rw-rw-r-- 1 egoadmin egoadmin 210025976 Nov  7 14:38 pp_val.dmatrix
     -rw-rw-r-- 1 egoadmin egoadmin 630075452 Nov  7 14:38 pp_train.dmatrix
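
As a quick sanity check on the split sizes reported by the pre-processing script: HIGGS contains 11,000,000 rows, and `train_test_split` defaults to holding out 25 percent, applied here twice.

```python
# Reproduce the split arithmetic from preprocess.py
total = 11_000_000          # rows in HIGGS.csv

test = total // 4           # first split: 25% held out for test
tmp = total - test          # remaining 75%
val = tmp // 4              # second split: 25% of the remainder for validation
train = tmp - val

print(train, val, test)     # 6187500 2062500 2750000
```

These counts match the script output above, confirming the two-stage 75/25 split.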

Run XGBoost with default parameters

  1. Create the train_xgb_default.py file.

     from sklearn.metrics import roc_auc_score
     import xgboost as xgb
     import argparse
     # Parse the data file locations
     CLI = argparse.ArgumentParser()
     CLI.add_argument("--trainFile", type=str, default="")
     CLI.add_argument("--valFile", type=str, default="")
     CLI.add_argument("--testFile", type=str, default="")
     args = CLI.parse_args()
     # Set params
     params = {
       'tree_method': 'gpu_hist',
       'max_bin': 64,
       'objective': 'binary:logistic',
     }
     # Load data
     dtrain = xgb.DMatrix(args.trainFile)
     ddev = xgb.DMatrix(args.valFile)
     dtest = xgb.DMatrix(args.testFile)
     # Get labels
     y_train = dtrain.get_label()
     y_dev = ddev.get_label()
     y_test = dtest.get_label()
     # Train
     gbm = xgb.train(params, dtrain)
     # Inference
     p1_train = gbm.predict(dtrain)
     p1_dev = gbm.predict(ddev)
     p1_test = gbm.predict(dtest)
     # Evaluate
     auc_train = roc_auc_score(y_train, p1_train)
     auc_dev = roc_auc_score(y_dev, p1_dev)
     auc_test = roc_auc_score(y_test, p1_test)
     print("auc_train: %f, auc_val: %f, auc_test: %f" % (auc_train, auc_dev, auc_test))
  2. Run the model with the default parameters (using the dli-xgboost environment as before).

(dli-xgboost)# python train_xgb_default.py --trainFile  /dlidata/dataset/price_prediction/pp_train.dmatrix --testFile /dlidata/dataset/price_prediction/pp_val.dmatrix
[04:16:57] 833433x93 matrix with 77509269 entries loaded from /dlidata/dataset/price_prediction/pp_train.dmatrix
[04:16:57] 277812x93 matrix with 25836516 entries loaded from /dlidata/dataset/price_prediction/pp_val.dmatrix
mse_test: 1231.55

Tune XGBoost with Watson Machine Learning Accelerator hyperparameter optimization

Let’s see whether we can do better with Watson Machine Learning Accelerator hyperparameter optimization.

  1. Install and configure Watson Machine Learning Accelerator by running Steps 1 – 4 of the runbook.

  2. Download the 05_tuning_xgboost_with_hpo.ipynb notebook, and open the notebook with your preferred tool.

  3. Download the model file from the folder: xgb-model/main.py.

  4. Update the first cell of the notebook including:

     - hostname
     - username, password
     - protocol (http or https)
     - http or https port
     - sigName
     - Dataset location
  5. Update the second cell of the notebook including:

     - maxJobNum: the total number of tuning jobs to run
     - maxParalleJobNum: the number of tuning jobs to run in parallel, typically set to the total number of GPUs available in the cluster

In this notebook, we tune five parameters of the XGBoost model. Run the notebook to start your parallel model tuning jobs.
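
Conceptually, the tuner samples candidate hyperparameter sets from a search space, trains and evaluates a model for each candidate (in parallel, up to maxParalleJobNum), and keeps the best one. The standalone sketch below illustrates that idea with a simple random search over the five tuned parameters; the parameter ranges and the toy objective are illustrative assumptions, not the notebook's exact configuration.

```python
import random

# Illustrative search space for the five tuned XGBoost hyperparameters.
# These ranges are assumptions for the sketch, not the notebook's values.
search_space = {
    "learning_rate": (0.01, 1.0),
    "num_rounds": (100, 1000),
    "max_depth": (4, 16),
    "lambda": (1.0, 2000.0),
    "colsample_bytree": (0.3, 1.0),
}

def sample_params(rng):
    """Draw one candidate hyperparameter set from the search space."""
    return {
        "learning_rate": rng.uniform(*search_space["learning_rate"]),
        "num_rounds": rng.randint(*search_space["num_rounds"]),
        "max_depth": rng.randint(*search_space["max_depth"]),
        "lambda": rng.uniform(*search_space["lambda"]),
        "colsample_bytree": rng.uniform(*search_space["colsample_bytree"]),
    }

def random_search(objective, max_jobs=10, seed=42):
    """Evaluate max_jobs candidates and keep the lowest-metric one.
    WML-A runs these evaluations in parallel across GPUs; this is serial."""
    rng = random.Random(seed)
    best_params, best_metric = None, float("inf")
    for _ in range(max_jobs):
        params = sample_params(rng)
        metric = objective(params)  # in a real job: validation MSE of a training run
        if metric < best_metric:
            best_params, best_metric = params, metric
    return best_params, best_metric

# Stand-in objective so the sketch runs without training anything
toy = lambda p: (p["learning_rate"] - 0.3) ** 2 + p["max_depth"] / 100.0
best, metric = random_search(toy, max_jobs=20)
print(best, metric)
```

In the real tuning job, the objective is the validation metric reported back by main.py for each training run, and the scheduler dispatches candidate runs across the cluster.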

  1. Run the fourth cell to monitor the job progress. The recommended optimal set of parameters with the best metric is returned.

     Hpo task Admin-hpo-83966261958354 state RUNNING progress 56%
     Hpo task Admin-hpo-83966261958354 completes with state FINISHED
         "best": {
             "appId": "Admin-84189701779622-1370733872",
             "driverId": "driver-20191108160851-0342-bacfbcb3-ed76-4f70-92f5-65062f92d1cb",
             "endTime": "2019-11-08 16:11:07",
             "hyperParams": [
                 {
                     "dataType": "double",
                     "fixedVal": "0.9597590292372464",
                     "name": "learning_rate",
                     "userDefined": false
                 },
                 {
                     "dataType": "int",
                     "fixedVal": "565",
                     "name": "num_rounds",
                     "userDefined": false
                 },
                 {
                     "dataType": "int",
                     "fixedVal": "13",
                     "name": "max_depth",
                     "userDefined": false
                 },
                 {
                     "dataType": "double",
                     "fixedVal": "1584.7191653582931",
                     "name": "lambda",
                     "userDefined": false
                 },
                 {
                     "dataType": "double",
                     "fixedVal": "0.47",
                     "name": "colsample_bytree",
                     "userDefined": false
                 }
             ],
             "id": 6,
             "maxiteration": 0,
             "metricVal": 1036.0960693359375,
             "startTime": "2019-11-08 16:08:51",
             "state": "FINISHED"
         }
  2. Create the train_xgb_tuned.py file with parameters returned from the tuning job.

     import xgboost as xgb
     import argparse
     import numpy as np
     from sklearn.metrics import mean_squared_error
     # Parse the data file locations
     CLI = argparse.ArgumentParser()
     CLI.add_argument("--trainFile", type=str, default="")
     CLI.add_argument("--testFile", type=str, default="")
     args = CLI.parse_args()
     # Set params as found by WML-A HPO
     params = {
       'tree_method': 'gpu_hist',
       'learning_rate': 0.9597590292372464,
       'num_rounds': 565,
       'max_depth': 13,
       'lambda': 1584.7191653582931,
       'colsample_bytree': 0.47
     }
     # Load training and test data
     dtrain = xgb.DMatrix(args.trainFile)
     dtest = xgb.DMatrix(args.testFile)
     # Training
     gbm = xgb.train(params, dtrain, params['num_rounds'])
     # Evaluate
     true_price = np.expm1(dtest.get_label())
     pred_price = np.expm1(gbm.predict(dtest))
     mse_test = mean_squared_error(true_price, pred_price)
     # Output
     print("mse_test: %.2f" % (mse_test))

Run XGBoost with tuned parameters

Use the following code to run XGBoost with tuned parameters.

(dli-xgboost)# python train_xgb_tuned.py --trainFile  /dlidata/dataset/price_prediction/pp_train.dmatrix --testFile /dlidata/dataset/price_prediction/pp_val.dmatrix
[04:23:26] 833433x93 matrix with 77509269 entries loaded from /dlidata/dataset/price_prediction/pp_train.dmatrix
[04:23:26] 277812x93 matrix with 25836516 entries loaded from /dlidata/dataset/price_prediction/pp_val.dmatrix
mse_test: 1036.10


Mean squared error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual values. A lower MSE indicates a smaller approximation error and better generalization accuracy. In our experiment, the model trained with the parameters found by Watson Machine Learning Accelerator hyperparameter optimization achieves an MSE of 1036.10, a clear improvement over the 1231.55 obtained with the default parameters.
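
To make the metric concrete, here is the MSE computation on made-up prices, including the expm1 back-transform used by the evaluation script above (all numbers are illustrative).

```python
import numpy as np

# Hypothetical true prices and model predictions; the model is trained on
# log1p-transformed prices, so its raw output lives in log space.
true_price = np.array([10.0, 25.0, 40.0, 55.0])
pred_log = np.log1p(np.array([12.0, 24.0, 35.0, 60.0]))  # model output (log space)

# Map predictions back to price space, as in train_xgb_tuned.py
pred_price = np.expm1(pred_log)

# Mean squared error: average of squared differences
# (equivalent to sklearn.metrics.mean_squared_error)
mse = float(np.mean((true_price - pred_price) ** 2))
print("mse_test: %.2f" % mse)  # squared errors 4, 1, 25, 25 -> mean 13.75
```

The back-transform matters: computing MSE in log space would understate large absolute errors on expensive items.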

In this tutorial, we demonstrated Watson Machine Learning Accelerator’s ease of use and efficiency in automating parallel hyperparameter tuning jobs, and showed how it delivers more accurate retail price predictions through better generalization accuracy of XGBoost models.