
Perform parallel machine learning (ML) inferencing using H2O Driverless AI on IBM AIX

Introduction

In the machine learning (ML) workflow with the H2O Driverless AI artificial intelligence platform, many different training pipelines can be created with different options. Before the ML pipeline is deployed in the production environment, these different pipelines can be tested in order to determine the best for use (that is, in terms of accuracy or run time, or both). This testing can be done in an automated parallel manner on IBM Power servers running IBM AIX.

This tutorial describes the steps to:

  • Load training and test data in IBM Db2 running on AIX
  • Create multiple ML scoring pipelines on a Linux logical partition (LPAR), pulling data from IBM Db2
  • Run multiple scoring pipelines in parallel on AIX for scoring and evaluation, pulling data from IBM Db2

Prerequisites

It is assumed that the user has some experience with and a valid license for the H2O Driverless AI product on Linux, as well as access to Linux and AIX LPARs on an IBM Power system. Experience with Db2 on AIX is helpful but not required.

Estimated time

Installation and data-loading steps can take 1 to 2 hours. The time to generate and store the ML scoring pipelines, as well as the time to run the test data, will vary, depending on the size of the training and test data, the number of scoring pipelines created, and the options used in generating the pipelines.

Steps

The steps for this tutorial are:

  1. Set up and load the data into Db2 on AIX for both training and testing.
  2. Install and run H2O Driverless AI on Linux to obtain various MOJO scoring pipelines.
  3. Run the test data through those MOJO scoring pipelines in parallel on the AIX system.

Set up and load data into Db2 on AIX

We will use the IBM Db2 database to store the test and training data, and then retrieve that data using a JDBC connection. For this tutorial, we will use the publicly available ‘Churn’ data. Download this data from: https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv
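
If the AIX LPAR has Internet access, you can pull the file directly from the command line (an optional convenience; this assumes the curl utility is available on the LPAR, otherwise download the file on another system and copy it over):

    # curl -O https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv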

  1. Download and install Db2 on the AIX LPAR of your IBM Power system. Refer to https://www.ibm.com/support/pages/step-step-db2-installation-aix-server for details.

  2. Create Db2 tables for training data and test data, and then load the training and test data. For this tutorial, we will use the first 1000 records from Telco-Customer-Churn.csv for training, and the remaining records for testing the scoring pipelines.

     # su - db2inst1
     # db2start
     # db2 create database telco
     # db2 connect to telco
     # head -n 1001 Telco-Customer-Churn.csv | tail -n 1000 > churn_training.csv
     # tail -n 6043 Telco-Customer-Churn.csv > churn_testing.csv
     # db2 drop table CHURN_TRAINING
     # db2 "CREATE TABLE CHURN_TRAINING (CUSTOMERID VARCHAR(25), GENDER VARCHAR(12), SENIORCITIZEN INT, PARTNER VARCHAR(4), DEPENDENTS VARCHAR(4), TENURE INT, PHONESERVICE VARCHAR(30), MULTIPLELINES VARCHAR(30), INTERNETSERVICE VARCHAR(30), ONLINESECURITY VARCHAR(30), ONLINEBACKUP VARCHAR(30), DEVICEPROTECTION VARCHAR(30), TECHSUPPORT VARCHAR(30), STREAMINGTV VARCHAR(30), STREAMINGMOVIES VARCHAR(32), CONTRACT  VARCHAR(30), PAPERLESSBILLING VARCHAR(3), PAYMENTMETHOD VARCHAR(100), MONTHLYCHARGES DECFLOAT, TOTALCHARGES DECFLOAT, CHURN VARCHAR(3))"
     # db2 import from churn_training.csv of del replace into CHURN_TRAINING
     # db2 drop table CHURN_TESTING
     # db2 "CREATE TABLE CHURN_TESTING (CUSTOMERID VARCHAR(25), GENDER VARCHAR(12), SENIORCITIZEN INT, PARTNER VARCHAR(4), DEPENDENTS VARCHAR(4), TENURE INT, PHONESERVICE VARCHAR(30), MULTIPLELINES VARCHAR(30), INTERNETSERVICE VARCHAR(30), ONLINESECURITY VARCHAR(30), ONLINEBACKUP VARCHAR(30), DEVICEPROTECTION VARCHAR(30), TECHSUPPORT VARCHAR(30), STREAMINGTV VARCHAR(30), STREAMINGMOVIES VARCHAR(32), CONTRACT  VARCHAR(30), PAPERLESSBILLING VARCHAR(3), PAYMENTMETHOD VARCHAR(100), MONTHLYCHARGES DECFLOAT, TOTALCHARGES DECFLOAT, CHURN VARCHAR(3))"
     # db2 import from churn_testing.csv of del replace into CHURN_TESTING
     # db2 "list tables"
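
     To confirm that both loads succeeded, you can optionally check the row counts while still connected to the TELCO database (a quick sanity check; CHURN_TRAINING should hold about 1000 rows and CHURN_TESTING the remainder):

     # db2 "select count(*) from CHURN_TRAINING"
     # db2 "select count(*) from CHURN_TESTING"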
    

H2O Driverless AI installation on Linux

On a Linux partition (LPAR), install either the native H2O Driverless AI product or the H2O Driverless AI Docker image. Additionally, set up the JDBC connection to the Db2 database on AIX to retrieve the data for training.

Native installation

  1. Obtain and install the H2O Driverless AI installation image for the IBM Power server from https://www.h2o.ai/download/

  2. Set up a JDBC connector for H2O to connect to the AIX host for retrieving the data from the Db2 database.

    1. Download the Db2 JDBC 4.0 driver from https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads based on the version of Db2 running on AIX. Extract the contents of the file to find the db2jcc4.jar file.
    2. Create the config.toml file and add the following text in it. Update the hostname, port, and file path as appropriate for your configuration.

      enabled_file_systems = "file, upload, jdbc"
      jdbc_app_configs = '{"db2": { "url": "jdbc:db2://hostname.ibm.com:50000/TELCO", "jarpath": "/jdbc/db2jcc4.jar", "classpath": "com.ibm.db2.jcc.DB2Driver"}}'
      
  3. Start the H2O Driverless AI application, and connect to the H2O Driverless AI GUI (http://<hostname>:12345/)
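
     The exact start procedure depends on the package type you installed. As a rough sketch, an RPM or DEB install is typically started through systemd, while a TAR SH install provides a run script in its installation directory (check the H2O documentation for your package type before relying on these commands):

     # systemctl start dai      # RPM or DEB package
     # ./run-dai.sh             # TAR SH package, run from the installation directory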

Docker installation

  1. Obtain the H2O Driverless AI Docker image for the IBM Power server from https://www.h2o.ai/download/. For example, to download the 1.8.10 release:

    # wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.8.10-18/ppc64le-centos7/dai-docker-centos7-ppc64le-1.8.10-10.0.tar.gz
    
  2. Load the H2O Driverless AI Docker image.

    # docker load < dai-docker-centos7-ppc64le-1.8.10-10.0.tar.gz

  3. Set up a JDBC connector for H2O to connect to the AIX host for retrieving the data from the Db2 database.

    1. Download the Db2 JDBC 4.0 driver from https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads based on the version of Db2 running on AIX. Extract the contents of the file to find the db2jcc4.jar file.
    2. Create the config.toml file and add the following text in it. Update the hostname, port, and file path as appropriate for your configuration.

      enabled_file_systems = "file, upload, jdbc"
      jdbc_app_configs = '{"db2": { "url": "jdbc:db2://hostname.ibm.com:50000/TELCO", "jarpath": "/jdbc/db2jcc4.jar", "classpath": "com.ibm.db2.jcc.DB2Driver"}}'
      
  4. Start the Docker container with the H2O Driverless AI image.

    1. Create three folders (tmp, log, and license) in your home directory and copy the H2O license key to the license/license.sig file.

      # mkdir tmp log license

    2. Start the container.

       # docker run --pid=host --init --rm -u `id -u`:`id -g` -e DRIVERLESS_AI_CONFIG_FILE=/config/config.toml -e CLASSPATH=/jdbc -p 12345:12345 -v `pwd`/data:/data -v `pwd`/log:/log -v `pwd`/license:/license -v `pwd`/tmp:/tmp -v `pwd`/config/config.toml:/config/config.toml -v `pwd`/jdbc/jdbc_sqlj/db2jcc4.jar:/jdbc/db2jcc4.jar h2oai/dai-centos7-ppc64le:1.8.10-cuda10.0
      
  5. Connect to the H2O Driverless AI GUI (http://<hostname>:12345/).
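
     If the GUI does not come up, you can verify from another terminal session that the container is running (a standard Docker check):

     # docker ps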

Generating scoring pipelines on Linux

Load your data, generate the MOJO scoring pipelines from the GUI, and then download and save the MOJO files.

  1. On the H2O Driverless AI GUI, click ADD DATASET and then click JDBC. Fill in the appropriate fields based on your environment (an example is shown after this list).
  2. After the dataset is retrieved from Db2, click PREDICT to launch the training.
  3. Select and set the different parameters as appropriate, based on the training settings of your choice. For example, change the accuracy, time, interpretability, or scoring settings, or navigate to EXPERT SETTINGS to change the model types and modes to be used.
  4. Click LAUNCH EXPERIMENT to start the training.
  5. After the training is complete, click DOWNLOAD MOJO SCORING PIPELINE and save the ZIP file with a name specific to the settings from step 3. Transfer this ZIP file to the AIX host for testing.
  6. Repeat the tasks from step 2 (selecting different parameters as in step 3) to create additional MOJO scoring pipelines to test.
  7. After completing these tasks, copy the MOJO scoring pipeline ZIP files to the AIX host where the testing will happen. The Linux LPAR can be shut down if needed.
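
     For step 1, the JDBC connector needs the same connection details that were placed in config.toml, plus a SQL query that selects the training data. With the Db2 setup from earlier in this tutorial, the entries might look like the following (illustrative values; the exact field names depend on your Driverless AI version):

      Connection URL:  jdbc:db2://hostname.ibm.com:50000/TELCO
      Username:        db2inst1
      Password:        db2inst1
      Query:           SELECT * FROM CHURN_TRAINING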

Run MOJO scoring pipelines on AIX

After successfully creating a number of MOJO scoring pipelines on the Linux LPAR and transferring them to the AIX host, we will now test these against the larger test dataset.

  1. Extract the MOJO ZIP files into their own directories.
  2. Copy the db2jcc4.jar file (downloaded earlier for the JDBC connector) to the working directory on the AIX host.
  3. Modify and compile the Main.java program (see Appendix B: Main.java).

    # javac -cp mojo-pipeline/mojo2-runtime.jar Main.java

  4. Modify and then run the MOJO script (see Appendix A: tutorial_do_mojo_dirs.sh); an example covering the extraction and a parallel run is shown after this list.
  5. Review the output to determine the MOJO pipelines that work best for your data in terms of accuracy, time of execution, and so on.
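
     As a concrete sketch of steps 1 and 4, assume that the MOJO ZIP files, the compiled Main class, db2jcc4.jar, and the Appendix A script all sit in the directory the script uses as MOJO_PATH (it defaults to /root/Churn). The unzip -j option flattens each archive so that pipeline.mojo and mojo2-runtime.jar land directly in each per-MOJO directory, which is the layout the script expects; the last command starts four parallel clients that each score two MOJO pipelines (adjust paths and counts for your environment):

     # cd /root/Churn
     # for z in *.zip; do d=${z%.zip}; mkdir -p "$d"; unzip -j -o "$z" -d "$d"; done
     # ./tutorial_do_mojo_dirs.sh 4 2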

Summary

Automated and parallel testing of different MOJO scoring pipelines can be an important part of the ML workflow, and the example in this tutorial showed how that can be done on an AIX LPAR alongside an existing Db2 database.

Appendix A: tutorial_do_mojo_dirs.sh

Source code for script:

#!/bin/bash
# tutorial_do_mojo_dirs.sh [ # clients ] [ # mojos_each ] [ java_path ]
## this will start #clients in parallel, each doing #mojos_each from the MOJO_PATH

CLIENTS=${1:-1}
MOJOS_EACH=${2:-1}
JAVA=${3:-java}

# restrict java, since the MOJO scorer is single threaded
JHEAP=${JHEAP:-"-Xmx4g"}
JPERF=${JPERF:-"-XX:ActiveProcessorCount=8"}

# build the list of MOJO directories
MOJO_PATH=/root/Churn
MOJO_LIST=()
for i in ${MOJO_PATH}/*;do
    if [ ! -d $i ]; then continue; fi
    if [ ! -f ${i}/pipeline.mojo ]; then continue; fi
    MOJO_LIST+=(${i##$MOJO_PATH/})
done
echo ${#MOJO_LIST[@]} "mojos:" ${MOJO_LIST[@]}

DB2HOST=localhost
NUMRECS=6000

CMD='${JAVA} ${JHEAP} ${JPERF} -cp "${MOJO_PATH}/${M}/mojo2-runtime.jar:db2jcc4.jar:." Main ${MOJO_PATH}/${M}/pipeline.mojo ${DB2HOST} 1 ${NUMRECS}'

do_client()
{
    CLIENT=$1
    START=$((MOJOS_EACH*CLIENT))
    echo $CLIENT $START
    for ((m=0; m<$MOJOS_EACH;m++)); do
        M=${MOJO_LIST[$((m+START))]}
        echo ${CMD}
        eval ${CMD} > output_${CLIENT}_${M}.out 2>&1
    done
}

for ((c=0;c<$CLIENTS;c++));do
    do_client $c &
done
wait

echo "Run complete, clients: $CLIENTS   mojos each: $MOJOS_EACH   java: $JAVA"

Appendix B: Main.java

Source code for Java module:

import java.io.IOException;
import java.sql.*;

import ai.h2o.mojos.runtime.MojoPipeline;
import ai.h2o.mojos.runtime.frame.MojoFrame;
import ai.h2o.mojos.runtime.frame.MojoFrameBuilder;
import ai.h2o.mojos.runtime.frame.MojoRowBuilder;
import ai.h2o.mojos.runtime.lic.LicenseException;
import ai.h2o.mojos.runtime.utils.CsvWritingBatchHandler;
import com.opencsv.CSVWriter;
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Main {

  public static void main(String[] args) throws IOException, LicenseException {
    String url;
    String query;
    String user;
    String password;
    String csvlocation;
    Connection con;
    Statement stmt;
    ResultSet rs;
    String prediction="";
    if (args.length<4)
    {
       System.err.println("1st Argument is mojo model path");
       System.err.println("2nd Argument is host name where DB2 is running");
       System.err.println("3rd Argument is start row number");
       System.err.println("4th Argument is row count");
       System.err.println("\n Usage: java Main <H2O MOJO Path> <DB2 HOST> <start #> <count #>");
       System.exit(1);
    }
    try
    {
        // Load Model
        MojoPipeline model = MojoPipeline.loadFrom(args[0]);
        System.out.println("**** Loaded the H2O Mojo Pipeline Model");

        Class.forName("com.ibm.db2.jcc.DB2Driver");
        System.out.println("**** Loaded the JDBC driver");

        //url = "jdbc:db2://<ipaddress>:<port>/<database>";
        url = "jdbc:db2://" + args[1] + ":50000/TELCO";
        System.out.println("**** url \"" + url + "\"");
        user = "db2inst1";
        password = "db2inst1";

        con = DriverManager.getConnection (url, user, password);

        // Commit changes manually
        con.setAutoCommit(false);
        System.out.println("**** Created a JDBC connection to the data source");
        // Create the Statement
        stmt = con.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);

        int row_start = Integer.parseInt(args[2]);
        int row_count = Integer.parseInt(args[3]);
        // Execute a query and generate a ResultSet instance
        query = "select * from TELCO_CUSTOMERS_FULL50K";
        System.out.println("**** query \"" + query + "\"");
        rs = stmt.executeQuery( query );
        rs.absolute(row_start);
        System.out.println("**** Reading from DB2 Database and Scoring on rows");
        for (int l = 0;l < row_count; l++)
        {
            // Get and fill the input columns
            MojoFrameBuilder frameBuilder = model.getInputFrameBuilder();
            MojoRowBuilder rowBuilder = frameBuilder.getMojoRowBuilder();
            rowBuilder.setValue("GENDER", rs.getString(2));
            rowBuilder.setValue("SeniorCitizen",rs.getString(3));
            rowBuilder.setValue("PARTNER", rs.getString(4));
            rowBuilder.setValue("DEPENDENTS", rs.getString(5));
            rowBuilder.setValue("TENURE", rs.getString(6));
            rowBuilder.setValue("PHONESERVICE", rs.getString(7));
            rowBuilder.setValue("MULTIPLELINES", rs.getString(8));
            rowBuilder.setValue("INTERNETSERVICE", rs.getString(9));
            rowBuilder.setValue("ONLINESECURITY", rs.getString(10));
            rowBuilder.setValue("ONLINEBACKUP", rs.getString(11));
            rowBuilder.setValue("DEVICEPROTECTION", rs.getString(12));
            rowBuilder.setValue("TECHSUPPORT", rs.getString(13));
            rowBuilder.setValue("STREAMINGTV", rs.getString(14));
            rowBuilder.setValue("STREAMINGMOVIES", rs.getString(15));
            rowBuilder.setValue("CONTRACT", rs.getString(16));
            rowBuilder.setValue("PAPERLESSBILLING", rs.getString(17));
            rowBuilder.setValue("PAYMENTMETHOD", rs.getString(18));
            rowBuilder.setValue("MONTHLYCHARGES", rs.getString(19));
            rowBuilder.setValue("TOTALCHARGES", rs.getString(20));

            frameBuilder.addRow(rowBuilder);

            // Create a frame which can be transformed by MOJO pipeline
            final MojoFrame iframe = frameBuilder.toMojoFrame();

            // Transform input frame by MOJO pipeline
            final MojoFrame oframe = model.transform(iframe);

            // Output the prediction as CSV on standard output
            Writer writer = new BufferedWriter(new OutputStreamWriter(System.out));
            CSVWriter csvWriter = new CSVWriter(writer);
            CsvWritingBatchHandler.csvWriteFrame(csvWriter, oframe, true);
            csvWriter.flush();

            rs.next();
        } // for row_count
        rs.close();
        stmt.close();
        con.commit();

        con.close();

    }
    catch (ClassNotFoundException e)
    {
      System.err.println("Could not load JDBC driver");
      System.out.println("Exception: " + e);
      e.printStackTrace();
    }
    catch(SQLException ex)
    {
      System.err.println("SQLException information");
      while(ex!=null) {
        System.err.println ("Error msg: " + ex.getMessage());
        System.err.println ("SQLSTATE: " + ex.getSQLState());
        System.err.println ("Error code: " + ex.getErrorCode());
        ex.printStackTrace();
        ex = ex.getNextException(); // For drivers that support chained exceptions
      }
    }
  }
}