
Setting up a machine learning environment using H2O Driverless AI on IBM AIX

Introduction

Automated machine learning (ML) is, at the time of writing this tutorial, the latest industry trend for applying ML to real-world problems. Automated ML automates the pipeline end to end, with little or no intervention from a data scientist. H2O Driverless AI is one such artificial intelligence platform for automated ML. H2O Driverless AI can run on any Linux® system running on an IBM® Power Systems™ server.

In an ML pipeline, after a model is trained, it is deployed in the production environment to get insights from business data. This business data very often resides on IBM Power Systems servers running IBM AIX®. The process of getting insight from data using a trained model is called scoring or inferencing. IBM Power Systems servers running IBM AIX are a highly resilient platform and, in many cases, businesses want to use AI capabilities without moving the data off these servers.

This tutorial describes how to:

  • Train an ML model using an H2O Driverless AI platform on a Linux logical partition (LPAR) in IBM Power® servers using data from IBM DB2® on AIX.
  • Deploy the model on AIX for scoring or inferencing.

This tutorial uses the IBM DB2 database to retrieve data for training and inferencing; however, any database that runs on AIX can be used.

H2O Driverless AI installation

H2O Driverless AI can be installed on a Power server that runs Linux in multiple ways: using Docker, RPM, or a TAR download. This tutorial uses an H2O Driverless AI Docker image to install and configure the Driverless AI platform. Refer to H2O Installation for other installation methods.

H2O Driverless AI can be installed on any LPAR running Linux within the same IBM Power server (for example, IBM PowerVM® virtualized IBM Power Systems servers such as IBM Power System E980, E950, and S924) where AIX is also running. Therefore, no new infrastructure is needed to run H2O Driverless AI. This setup ensures faster and safer data transfer within the system.

Scoring pipelines in H2O Driverless AI

Driverless AI provides several scoring pipelines for the models it generates. They are:

  • A stand-alone Python Scoring Pipeline
  • A low-latency, standalone MOJO (Model Object, Optimized) Scoring Pipeline that works with both Java™ and C++ as backend

The Python Scoring Pipeline is implemented as a Python WHL file. The scoring service is generally implemented as a client/server architecture and supports interfaces for TCP and HTTP.

The MOJO Scoring Pipeline provides a stand-alone scoring pipeline that converts experiments to MOJOs, which can be scored in real time. The MOJO Scoring Pipeline is available as either a Java runtime or a C++ runtime. This tutorial uses the MOJO Scoring Pipeline to run scoring on AIX.

The objective of this tutorial is to run scoring on AIX with an H2O model using the following steps, which are explained in detail in subsequent sections:

  1. Configure and load H2O Docker image on Linux LPAR running on a Power server.
  2. Set up a JDBC connector to connect to an IBM AIX host for retrieving data from the DB2 database for training.
  3. Start the H2O container and launch the GUI.
    Note: The instructions for steps 1, 2, and 3 are covered in the Installing and configuring H2O on Linux on IBM Power Systems servers section.
  4. Train using the data extracted from DB2 on AIX and generate a model in the MOJO format. Refer to the Data loading and MOJO model generation using H2O section for details.
  5. Deploy the MOJO pipeline on AIX and run scoring on new data using Java code. Refer to the Deploying and scoring on AIX using the H2O MOJO model section for details.

Use case

The example used in this tutorial deals with a communications company database for predicting customer churn. The model is built on Linux running on an IBM Power enterprise server using H2O Driverless AI with data from DB2 running on AIX. After the training is completed, the model is saved in the MOJO format and deployed on the AIX system for scoring using Java. This is explained in detail in the following sections.

The communications company database has data about customers who use services such as phone, internet, movie streaming, online backup, and so on. Each customer record has a column called Churn, which indicates whether the customer left within the last month. This data needs to be analysed and a model must be generated to predict the customers who might churn in the coming months. Using this prediction, the communications service provider can develop customer retention programs for a specific customer or a group of customers. The data set that is used for training can be downloaded from the Churn database.

In this database, there are 7043 records out of which 6943 records will be used for training and the remaining 100 records for scoring.
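
The 6943/100 split described above can be reproduced with standard tools. The following is a minimal sketch, assuming the downloaded data set is a CSV file with exactly one header row and a trailing newline. The function name split_churn_csv and the input file name full_churn.csv are hypothetical; the churn_dataset.csv and new_customers_telco.csv files provided with this tutorial may have been prepared differently.

```shell
# split_churn_csv FULL_CSV TRAIN_CSV SCORE_CSV [SCORE_ROWS]
# Copies the header row into both output files, then writes the last
# SCORE_ROWS data rows (default 100) to SCORE_CSV and the rest to TRAIN_CSV.
split_churn_csv() {
    full=$1 train=$2 score=$3 score_rows=${4:-100}
    total=$(( $(wc -l < "$full") - 1 ))        # data rows, header excluded
    train_rows=$(( total - score_rows ))
    if [ "$train_rows" -lt 0 ]; then
        echo "error: $full has only $total data rows" >&2
        return 1
    fi
    head -n 1 "$full" > "$train"               # header for the training file
    head -n 1 "$full" > "$score"               # header for the scoring file
    tail -n +2 "$full" | head -n "$train_rows" >> "$train"
    tail -n "$score_rows" "$full" >> "$score"
    echo "training rows: $train_rows, scoring rows: $score_rows"
}
```

Under those assumptions, running split_churn_csv full_churn.csv churn_dataset.csv new_customers_telco.csv on a file with 7043 records should report 6943 training rows and 100 scoring rows.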

Loading the training data in DB2 on AIX

We are using the IBM DB2 database to store the data, which will be retrieved using a JDBC connection for training. This section explains how to set up the IBM DB2 Python module and insert the training data into the DB2 database. The IBM DB2 Python module must be installed so that the Python script can connect to the DB2 database.

  1. Install the IBM DB2 Python module.
    # python3 -m pip install ibm_db

  2. Insert the data to be used for training into the DB2 database running on AIX. A script is provided to insert the data from the churn_dataset.csv file into the DB2 database. You can run it as follows:

    Login as the db2inst1 user.
    # su - db2inst1

    Run the script to insert data.
    # ./create_churn_db.sh

    After running the above script, the records present in the churn_dataset.csv file will be inserted into a new database called TELCO in the TELCO_CUSTOMERS table.

    Details of the scripts and the data files are given in Table 1.

Table 1. Files used to load the training and scoring data into the DB2 database on AIX
File name | Purpose
create_churn_db.sh | Shell script that creates a database called TELCO with a table called TELCO_CUSTOMERS and inserts the data from the churn_dataset.csv file into it on AIX. (Download create_churn_db.sh)
churn_dataset.csv | Data used for training. This file contains the first 6943 records of the churn database.
create_churn_db_newcustomers.sh | Shell script that creates a new table called telco_new_customers and inserts the data from new_customers_telco.csv on AIX. (Download create_churn_db_newcustomers.sh)
new_customers_telco.csv | Data used for scoring on AIX. This file contains the remaining 100 records of the churn database, which are not in churn_dataset.csv.
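
After the scripts in Table 1 have run, it can be useful to confirm that the tables hold the expected number of rows. The following is a minimal sketch, assuming the db2 command line processor is available to the db2inst1 user and that each CSV record became one table row. The helper names expect_rows and verify_telco_load are hypothetical and are not part of the provided scripts.

```shell
# expect_rows EXPECTED ACTUAL
# Compares two row counts after stripping the blanks that the DB2 command
# line processor may print around a number.
expect_rows() {
    expected=$1
    actual=$(printf '%s' "$2" | tr -d '[:space:]')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $actual rows"
    else
        echo "MISMATCH: expected $expected rows, found '$actual'"
        return 1
    fi
}

# verify_telco_load
# Intended to be run on the AIX host as the instance owner (db2inst1).
# It is only defined here; call it manually once the db2 command is available.
verify_telco_load() {
    db2 connect to TELCO > /dev/null || return 1
    # -x suppresses the column heading so that only the count is printed
    expect_rows 6943 "$(db2 -x "SELECT COUNT(*) FROM TELCO_CUSTOMERS")"
    status=$?
    db2 connect reset > /dev/null
    return "$status"
}
```

The same comparison with an expected count of 100 applies to the telco_new_customers table after create_churn_db_newcustomers.sh has run.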

Installing and configuring H2O on Linux on IBM Power Systems servers

This section explains how to install and configure H2O Driverless AI on Linux on IBM Power Systems servers and how to set up a JDBC connection to the DB2 database on AIX to retrieve the data for training. The steps given in this section can be performed on any Linux system running on an IBM Power server.

  1. Obtain the H2O Docker image for the IBM Power server (assume /home/H2O is the directory from where this command is run).

    # wget https://s3.amazonaws.com/artifacts.h2o.ai/releases/ai/h2o/dai/rel-1.7.1-36/ppc64le-centos7/dai-docker-centos7-ppc64le-1.7.1-10.0.tar.gz

    For this tutorial, we are using version 1.7.1.

  2. Load the H2O Docker image using the following command. Output is shown in Figure 1.

    # docker load < dai-docker-centos7-ppc64le-1.7.1-10.0.tar.gz

    Figure 1. Output of the Docker load command on H2O image


  3. Set up a JDBC connector for H2O. This is needed to connect to the AIX host for retrieving the data from DB2 database.

    1. Download the DB2 JDBC 4.0 driver file from https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads. The file name is db2jcc4.jar.

      The version of the db2jcc4.jar file must be selected based on the DB2 version running on AIX.

      For example, if the DB2 version running on AIX is 11.1.1, the file downloaded from the above location has a name in the format v11.1.1fp1_jdbc_sqlj.tar.gz.

      Create a directory named jdbc and copy the downloaded GZ file into it. Extract the contents of the GZ file, which creates a directory called jdbc_sqlj. Inside the jdbc_sqlj directory, there is a zip file, db2_db2driver_for_jdbc_sqlj.zip. Extract this zip file to retrieve the DB2 JAR file, db2jcc4.jar. The location of the db2jcc4.jar file relative to your home directory will be ~/jdbc/jdbc_sqlj/db2jcc4.jar.

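    2. Define the configuration file for H2O to enable the JDBC connector. Create a file called config.toml with the following content. Because the docker run command in step 4 mounts `pwd`/config/config.toml, save the file in a config directory under the directory from which you start the container (/home/H2O/config/config.toml in this tutorial):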

      enabled_file_systems = "file, upload, jdbc"
      jdbc_app_configs = '{"db2": { "url": "jdbc:db2://hostname.ibm.com:50000/TELCO", "jarpath": "/jdbc/db2jcc4.jar", "classpath": "com.ibm.db2.jcc.DB2Driver"}}'
      
      • jdbc_app_configs is the variable that the H2O server reads to find the JDBC connection settings. This variable must be set to enable the JDBC option in the H2O GUI.
      • url contains the host name and port of the DB2 server (50000 is the standard port on which the DB2 server listens), followed by the database name. TELCO is the database name for the churn data.
      • jarpath is the path inside the container where the JDBC JAR file is found. The same path must be given in the Docker run command.
      • classpath: “com.ibm.db2.jcc.DB2Driver” is the standard class name of the DB2 JDBC driver.
  4. Start the container with the H2O image loaded in step 2.

    1. Create three folders in /home/H2O.

      # mkdir tmp log license

    2. Copy the H2O license key to the license/license.sig file.

    3. Start the container as shown in Figure 2.

      # docker run --pid=host --init --rm -u `id -u`:`id -g` -e DRIVERLESS_AI_CONFIG_FILE=/config/config.toml -e CLASSPATH=/jdbc -p 12345:12345 -v `pwd`/data:/data -v `pwd`/log:/log -v `pwd`/license:/license -v `pwd`/tmp:/tmp -v `pwd`/config/config.toml:/config/config.toml -v `pwd`/jdbc/jdbc_sqlj/db2jcc4.jar:/jdbc/db2jcc4.jar h2oai/dai-centos7-ppc64le:1.7.1-cuda10.0
      

      The DRIVERLESS_AI_CONFIG_FILE variable tells Driverless AI where to find the config.toml file inside the container.

      In -v `pwd`/config/config.toml:/config/config.toml, the path before the colon refers to the path on the host and the path after the colon refers to the path in the container. The -v option maps the host path into the container.

      In -p 12345:12345, port 12345 of the host is mapped to port 12345 of the container.

      If the -d option is given, the container starts in detached mode. If -d is not given, the container runs in the foreground and prints the following output, which blocks the terminal.

      Figure 2. Starting the Docker container with H2O image

    4. Check the H2O container as shown in Figure 3.

      # docker container ls

      Figure 3. Listing the Docker containers

  5. Check if your H2O container is correctly started (only for debugging purposes).

    1. Locate and check the dai.log file using the find . -name dai.log command.

    2. Check the h2oai_server.log file using the tail -f tmp/h2oai_server.log command to see what the server did while starting.

  6. Connect to the H2O GUI using http://<hostname>:12345. Any user name and password can be given to log in to the H2O dashboard (for example, admin/admin). After you log in, the screen shown in Figure 4 is displayed. Because we have not yet pulled data from anywhere, the data sets page is empty.

    Figure 4. H2O GUI page after logging in
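
The docker run command in step 4 mounts several host paths with the -v option, so it can help to confirm that they all exist before starting the container. Docker normally creates a missing -v source path as an empty directory, which would leave /config/config.toml or /jdbc/db2jcc4.jar as a directory inside the container instead of a file. The following is a minimal sketch of such a check. The function name check_dai_layout is hypothetical, and the list of paths is taken from the -v options and the license step shown above.

```shell
# check_dai_layout [BASE_DIR]
# Reports every host path that the docker run command expects under BASE_DIR
# (default: current directory) and returns non-zero if any path is missing.
check_dai_layout() {
    base=${1:-.}
    missing=0
    # Directories mounted as /data, /log, /license and /tmp
    for dir in data log license tmp; do
        [ -d "$base/$dir" ] || { echo "missing directory: $base/$dir"; missing=1; }
    done
    # Files: the license key, the configuration file and the DB2 JDBC driver
    for file in license/license.sig config/config.toml jdbc/jdbc_sqlj/db2jcc4.jar; do
        [ -f "$base/$file" ] || { echo "missing file: $base/$file"; missing=1; }
    done
    [ "$missing" -eq 0 ] && echo "layout OK: $base"
    return "$missing"
}
```

For example, check_dai_layout /home/H2O can be run before the docker run command, and any path that it reports can be created or copied into place first.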

Data loading and MOJO model generation using H2O

This section explains the steps to load the data from DB2 running on AIX to H2O to start the training process for creating a MOJO model.

  1. To retrieve the data from a DB2 host running on AIX using the JDBC connector, click Add Dataset in the H2O Driverless AI GUI and click JDBC. The screen as shown in Figure 5 is displayed. Some fields are automatically populated (as given in config.toml).

    • JDBC username and JDBC password: the credentials for the DB2 database on the AIX host (for example, db2inst1/db2inst1).
    • Destination name: the name under which the retrieved data is saved in H2O.
    • JDBC query: the query to run on the TELCO database. Its result is saved under the name given in the Destination name field.

    For example, in the following figure, all the data from telco_customers in the TELCO database will be saved in H2O with the file name as churn_customers_data.

    Figure 5. JDBC query page in H2O GUI

  2. After the data is retrieved from the DB2 host running on AIX, we need to train a model on this data. As shown in Figure 6, the data is saved as churn_customers_data. Click PREDICT to launch the training.

    Figure 6. Datasets page in H2O

    The interface shown in Figure 7 is displayed.

    Figure 7. Training page in H2O GUI

    The target column that needs to be predicted must be selected manually by the user. In our training data, CHURN is the column name which is our target feature that needs to be predicted.

  3. After selecting the appropriate target column, click LAUNCH EXPERIMENT to start the training.

    The training process for this experiment and data set might take 15 to 20 minutes, depending on the available resources. After the training process is completed, the screen shown in Figure 8 is displayed.

    Figure 8. Final page after training in H2O GUI

    Figure 8 shows that training is complete for this data with an accuracy of 0.8492 from the XGBOOSTGBM model.

  4. Click BUILD MOJO SCORING PIPELINE to generate the MOJO version of the model.

  5. After building the MOJO scoring pipeline, download the MOJO file and copy the mojo.zip file to the AIX host.

Deploying and scoring on AIX using the H2O MOJO model

On the AIX host, we can run scoring on the new data (in DB2 database) to predict if a customer will churn or not using the MOJO model. To run this example, the AIX operating system version should be either 7.1 or 7.2 and the Java version should be 8. Perform the following steps to deploy and run the H2O MOJO model on AIX:

  1. Extract the mojo.zip file and read the README.txt file to learn about the files contained in mojo.zip (in /home/H2O/experiment). The list of files after extracting the mojo.zip file is shown in Figure 9.

    Figure 9. Listing the contents of the MOJO pipeline on the AIX system

    Table 2 lists the files used for our demonstration.

    Table 2. List of files to run scoring on AIX

    File name | Purpose
    run_example.sh | A bash script to score a sample test set
    pipeline.mojo | Stand-alone scoring pipeline in the MOJO format
    mojo2-runtime.jar | MOJO Java runtime for Little Endian platforms
    mojo2-runtime-2.1.4-all.jar | MOJO Java runtime for Big Endian platforms
    example.csv | Sample test set
    Main.java | Java code to retrieve data from DB2 running on AIX and run scoring using pipeline.mojo. (Download Main.java)
  2. Download the new runtime that can help load the MOJO model on Big Endian platforms. The default runtime mojo2-runtime.jar file that comes with mojo.zip does not work on Big Endian platforms. Download the runtime file from the following location on the AIX system.

    http://artifacts.h2o.ai.s3.amazonaws.com/releases/ai/h2o/mojo2-runtime/2.1.4/any/mojo2-runtime-2.1.4-all.jar

  3. Download the DB2 JDBC driver. This is used to connect to DB2 from a Java program on AIX.

    You can download the DB2 JDBC 4.0 driver file from https://www.ibm.com/support/pages/db2-jdbc-driver-versions-and-downloads. The file name is db2jcc4.jar. The version of db2jcc4.jar must be selected based on the DB2 version running on AIX.

  4. Set the environment variables.

    export DRIVERLESS_AI_LICENSE_FILE=/home/H2O/license.sig
    export CLASSPATH=/home/H2O/java_scoring:/home/H2O/mojo2-runtime-2.1.4-all.jar:/home/H2O/jdbc/jdbc_sqlj/db2jcc4.jar
    
  5. Run the example Java code, which connects to DB2 running on AIX and retrieves the data to run scoring using the MOJO model.

    # javac Main.java (compile)
    # java Main (execute)

    Usage:

    java Main <H2O Model Path> <DB2 HOST> <Database Name>

    Example:

    # java Main /home/H2O/experiment3/mojo-pipeline/pipeline.mojo hostname TELCO

    Figure 10. Scoring on AIX to predict churn for a customer

    The output of the command is shown in Figure 10. The output shows a probability of 87.48% that a particular customer will not churn.
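
Before compiling and running Main.java, the prerequisites listed in this section (Java 8 and the CLASSPATH entries set in step 4) can be checked with a short script. The following is a minimal sketch written for a POSIX shell such as ksh on AIX. The function names is_java8 and check_classpath are hypothetical, and the version test assumes that the first line printed by java -version contains a quoted version string beginning with 1.8.

```shell
# is_java8 VERSION_LINE
# Succeeds when the first line printed by `java -version` reports a 1.8.x release.
is_java8() {
    case $1 in
        *'"1.8.'*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_classpath CLASSPATH
# Prints every colon-separated entry that does not exist on disk and
# returns non-zero if at least one entry is missing.
check_classpath() {
    missing=0
    old_ifs=$IFS
    IFS=:                       # split the argument on colons
    for entry in $1; do
        [ -e "$entry" ] || { echo "missing classpath entry: $entry"; missing=1; }
    done
    IFS=$old_ifs
    return "$missing"
}
```

A possible use is to run is_java8 "$(java -version 2>&1 | head -n 1)" and check_classpath "$CLASSPATH" first, and to start java Main only when both succeed. A missing mojo2-runtime-2.1.4-all.jar or db2jcc4.jar entry would otherwise typically surface only later as a class loading error.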

Conclusion

AIX users can take advantage of machine learning models for their data using the method explained in this tutorial. The data on AIX can be used to train a model using H2O Driverless AI on an IBM Power system running Linux, and that model, in the MOJO format, can be deployed on AIX for scoring. The training data used for this tutorial is stored in DB2 on AIX and is retrieved using a JDBC connection. The data can also reside in any other database that is supported on AIX, such as Oracle. The method to score using Java APIs on AIX is also explained in this tutorial. Existing Java applications running on AIX can easily incorporate the APIs to run predictions on their data. This tutorial demonstrated how to run the H2O MOJO model on AIX by using a customer churn database. Applications running on AIX can benefit from running real-time predictions using the model on AIX with very low latency.