Train Keras and MLlib models within a Watson Machine Learning Accelerator custom notebook

IBM Watson Machine Learning Accelerator is a software solution that bundles IBM Watson Machine Learning Community Edition, IBM Spectrum Conductor, IBM Spectrum Conductor™ Deep Learning Impact, and support from IBM for the whole stack, including the open source deep learning frameworks. Watson Machine Learning Accelerator provides an end-to-end deep learning platform for data scientists. This includes complete lifecycle management, from installation and configuration, to data ingest and preparation, to building, optimizing, and distributing the training model, and to moving the model into production. Watson Machine Learning Accelerator excels when you expand your deep learning environment to include multiple compute nodes. There’s even a free evaluation available. See the prerequisites from our first introduction tutorial, Classify images with Watson Machine Learning Accelerator.

This article has been updated for Watson Machine Learning Accelerator v1.2.x. It leverages the Anaconda and Notebook environment creation functions provided by IBM Spectrum Conductor.

Learning objectives

This is the second tutorial of this IBM Watson Machine Learning Accelerator education series.

Tasks

  • Configure the resource groups
  • Configure the roles
  • Configure the Consumer
  • Create a user
  • Import the Anaconda installer into WLM-A and create a conda environment.
  • Create a Notebook environment
  • Create a Spark instance group with a notebook that uses the Anaconda environment.
  • Start the notebook server and upload a notebook to train a Keras model.
  • Connect to a Hadoop cluster from a notebook and execute a Spark MLlib model.

Estimated time

It should take you about two hours to complete this tutorial. This includes roughly 30 minutes of model training; the remainder is spent on installation, configuration, and working with the model through the GUI.

Prerequisites

The tutorial requires access to a GPU-accelerated IBM Power® Systems server model AC922 or S822LC. In addition to acquiring a server, there are multiple options to access Power Systems servers listed on the IBM PowerAI developer portal.

Task 1: Configure the resource groups

  1. Log on as the cluster Admin user.

  2. Open the Resource Group configuration.

    Resource Group configuration

  3. Select the ComputeHosts resource group.

    ComputeHosts

  4. Set the number of slots to a value that makes sense for your hardware. If the server is an 8-thread capable system, use 7 for the number of processors. If it’s a 4-thread capable system, go with 3.

    slots

  5. Optionally (but recommended), change the resource selection method to static, and then select only the servers that will provide computing power (processor power) to the cluster.

    static resource selection

  6. Click Apply to commit the changes.

  7. Create a new resource group.

    new resource group

  8. Call it GPUHosts.

    GPUHosts

  9. For the number of slots, use the advanced formula and set it equal to the number of GPUs on the systems by using the keyword ngpus.

    GPU slots

  10. Optionally (but recommended), change the resource selection method to static and select the nodes that are GPU-capable.

    static GPUs

  11. Under the Members Host column, click preferences and select the attribute ngpus to be displayed.

    nGPUs preferences nGPUs preference details

  12. Click Apply and validate that the Members Host column now displays ngpus.

    Apply nGPUs

  13. Finish the creation of the resource group by clicking Create.

  14. Go to Resources -> Resource Planning (slot) -> Resource Plan.

    Resource Plan

  15. Change the allocation policy of the ComputeHosts resource group to balanced.

    balanced policy

Task 2: Configure the roles

  1. To start, we create a Chief Data Scientist role. The goal is a role with intermediate privileges between an Admin account and a Data Scientist account: it has the authority of a data scientist plus additional privileges to start and stop instance groups. That way, users do not need to go to a cluster Admin to start or stop their instance groups; the Chief Data Scientist can do it for them.

  2. Go to Systems & Services -> Users -> Roles.

    User Roles

  3. Select the Data Scientist role and duplicate it by clicking the duplicate button.

    Data Scientist role

  4. Call the new role Chief Data Scientist.

    Chief Data Scientist role

  5. Select the Chief Data Scientist role and add two privileges: a. Conductor -> Spark Instance Groups -> Control b. Ego Services -> Services -> Control (shown below)

    role privileges

  6. Click Apply to commit the changes.

Task 3: Configure the Consumer

  1. At the OS level, as root, on all nodes, create an OS group and user for the OS execution user. a. groupadd demoexec b. useradd -g demoexec -m demoexec

  2. The GID and UID of the created user / group must be the same on all nodes.

  3. Now go to Resources -> Consumers.

    Consumers

  4. Click Create a consumer.

    Create consumer

  5. Name your consumer DemoConsumer (as a best practice, start names with a capital letter), and add demoexec to the list of users.

    demo consumer

  6. Scroll down and enter demoexec as the OS user for execution, and select the Management, Compute, and GPU resource groups.

    OS user

  7. Click Create to save.

  8. On the left-side column, click the DemoConsumer consumer that you just created, and then click Create a consumer.

    create a consumer

  9. Name your consumer Anaconda3-DemoConsumer (as a best practice, start names with a capital letter). Leave Inherit the user list and group list from parent consumer selected.

    name the consumer

  10. Scroll down and use demoexec as the operating system user for workload execution, and make sure all of the resource groups are selected.

    os user for consumer

  11. Your Anaconda3-DemoConsumer should now appear as a child of DemoConsumer.

Task 4: Create a user

  1. Go to Systems & Services -> Users -> Accounts.

    User accounts

  2. Click Create New user account.

    new user

  3. Create a demonstration account called DemoUser.

    DemoUser

  4. Go to Systems & Services -> Users -> Roles.

    Roles

  5. Select your newly defined user (make sure you do not unselect Admin in the process), and then assign it to the DemoConsumer consumer that you created in Task 3.

    Assign role

  6. Click OK and then Apply to commit the changes. Do not forget to click Apply!

Task 5: Import Anaconda installer into WLM-A and create an environment

  1. Download the following file to your workstation. You can use wget or a browser download option for the URL.

     wget https://repo.continuum.io/archive/Anaconda3-2019.03-Linux-ppc64le.sh
    
  2. Open the Spark Anaconda Management panel by using the Spectrum Conductor management console.

    cluster management console

  3. Add a new Anaconda.

    anaconda management panel

  4. Fill in the details for the Anaconda and click Add.

    • Distribution name is Anaconda3

    • Use Browse to find and select the Anaconda installer that you downloaded in step 1

    • Anaconda version: 2019.03

    • Python version: 3

    • Operating system: Linux on Power 64-bit little endian (LE)

      Add Anaconda Distribution window

  5. Click Add to begin the Anaconda upload. The upload time varies based on your network speed.

After the Anaconda add is complete, you can deploy it and create an environment for it.

  1. Deploy preparation: On all nodes, create a directory on the local disk space for an Anaconda deployment. In this example, the local disk space is /cwslocal, and the execution user we are going to use in the Spark Instance Group is demoexec. Your local disk location and execution user might differ.

    1. mkdir -p /cwslocal/demoexec/anaconda
    2. chown demoexec:demoexec /cwslocal/demoexec/anaconda
  2. Now, select the distribution you just created, and click Deploy.

    Anaconda Management panel

  3. Fill in the required information.

    In this example, the instance name follows a pattern of [Anaconda Name]-[Consumer]-[PowerAI]. The deployment directory matches the one that we created in the previous step. The consumer follows a pattern of [Anaconda Name]-[Consumer].

    • Instance name: Anaconda3-DemoConsumer-PowerAI
    • Deployment directory: /cwslocal/demoexec/anaconda
    • Consumer: Anaconda3-DemoConsumer (created in Task 3)
    • Resource group: ComputeHosts
    • Execution user: demoexec

      Deploy Anaconda Distribution window

  4. Click on the Environment Variables tab.

    Deploy Anaconda Distribution window

  5. Add the variables for PATH and IBM_POWERAI_LICENSE_ACCEPT using the Add a Variable button.

    | Name | Value |
    | ---- | ----- |
    | PATH | $PATH:/usr/bin |
    | IBM_POWERAI_LICENSE_ACCEPT | yes |

    Click Configure to complete the Anaconda deployment.

    Configure Anaconda Distribution window

  6. Click Deploy, and watch as your Anaconda environment gets deployed.

    Configure Anaconda Distribution window

  7. Download or create a powerai16.yml file on your workstation with the following content (note the indentation in the file). This is a YAML file that is used to create an Anaconda environment. If you do not have a YAML-enabled editor, consider verifying that the file format is valid by pasting the contents into an online YAML verification tool.

     name: powerai161
     channels:
       - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
       - defaults
     dependencies:
       - conda=4.6.11
       - jupyter
       - pyyaml
       - tornado=5.1.1
       - sparkmagic
       - numpy
       - numba
       - openblas 
       - pandas
       - python=3.6.8
       - keras
       - matplotlib
       - scikit-learn
       - scipy
       - cuml
       - cudf
       - powerai=1.6.1
       - cudatoolkit-dev
       - pip:
         - sparkmagic==0.12.8
    

    You might have additional conda and pip packages that you want installed. Those packages can be added to the dependencies and pip list in the file.
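
    If you have PyYAML installed on your workstation, you can also sanity-check the file locally before uploading it. The following is a minimal sketch that assumes the file was saved as powerai16.yml in the current directory; it only confirms that the YAML parses.

     # Minimal local check that the environment file parses as valid YAML.
     # Assumes the file is named powerai16.yml and sits in the current directory.
     import yaml

     with open("powerai16.yml") as f:
         env = yaml.safe_load(f)

     # Print the environment name and a rough dependency count as a quick check.
     print(env["name"], "-", len(env["dependencies"]), "top-level dependencies")
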

  8. Select the Anaconda3 distribution that you created. Click Add to add a conda environment.

    Configure Anaconda Distribution window

  9. Create a new environment from the YAML file: use the Browse button to select the powerai16.yml file that you created, and then click Add.

    Add Conda Environment window

Watch the environment get created; it installs over 200 packages. If Add fails, check the logs and verify that the YAML file is formatted correctly. Retry the Add after the issue is resolved.

Task 6: Create a Notebook environment

  1. We use the IBM Spectrum Conductor-provided notebook. You can see it in Workload -> Spark -> Notebook Management.

    Notebook environment

  2. Notice that there is a notebook called Jupyter, version 5.4.0. If you select it and click Configure, you can view the settings for this notebook.

    Configure notebook

    The settings show properties such as:

    • The notebook package name
    • The scripts in use
    • Use of SSL
    • Anaconda required (make sure this setting is selected)

      Notebook settings

  3. At the moment, due to a RAPIDS package dependency called faiss, we need to apply a patch to the standard Jupyter 5.4.0 deploy.sh script. The patched version can be found here. Download this file to your workstation and replace the one that comes with Conductor by clicking Browse and selecting the patched version.

    Update notebook

  4. Click Update Notebook.

In the next task, we show how to create a new Spark Instance Group that uses the notebook.

Task 7: Create a Spark Instance Group (SIG) for the notebook

  1. SIG preparation: On either node, create the data directory for the execution user within the shared filesystem. For this example, the shared filesystem is /cwsshare.

    a. mkdir -p /cwsshare/demoexec/ b. chown -R demoexec:demoexec /cwsshare/demoexec/

  2. Create a new SIG and include the added notebook. Go to Workload -> Spark -> Spark Instance Groups.

    Spark instance group window

  3. Click New.

    New SIG

  4. Fill in the information with the following values: a. Instance group name: Notebook-DemoConsumer b. Deployment directory: /cwslocal/demoexec/notebook-democonsumer c. Spark version: use the latest one available

    SIG name

  5. Select the Jupyter 5.4.0 notebook and set the following properties: a. set the data directory to /cwsshare/demoexec/notebook-democonsumer b. select the Anaconda environment that you created in Task 5.

    SIG notebook

  6. Scroll down and click on the standard consumer that the process creates. We need to change it.

    SIG consumer

  7. Scroll down until you find the standard suggested consumer name and click the X to delete it.

    SIG consumer edit

  8. Look for the DemoConsumer consumer, select it, and create a child named Notebook-DemoConsumer. Click Create and then Select.

    SIG consumer edit

  9. Your consumer should now look something like this.

    SIG consumer edit

  10. Scroll down and select the GPUHosts resource group for Spark Executors (GPU slots). Do not change anything else.

    SIG resource groups and plans window

  11. Click Create and Deploy Instance Group at the bottom of the page.

  12. Watch as your instance group gets deployed.

    SIG deployment

  13. After the deployment completes, start the SIG by clicking Start.

    SIG start

Task 8: Create the notebook server for users and upload a notebook to train a Keras model

  1. After the SIG is started, go to the Notebook tab and click Create Notebooks for Users.

    SIG notebook window

  2. Select the users for the notebook server.

    Screen showing my notebooks button

  3. After the notebook has been created, refresh the screen to see the My Notebooks button. Clicking it shows the list of notebook servers created for this SIG.

  4. Select the Jupyter 5.4.0 notebook to bring up the notebook server URL.

  5. Sign on to the notebook server.

    Notebook selection window

  6. Download the tf_keras_fashion_mnist.ipynb notebook and upload it to the notebook server by clicking Upload. You must click Upload again after selecting the notebook file.

    cell execution

  7. Select the notebook and begin executing the cells. The Keras model is defined in cell [13] and is trained in cell [15].

    hadoop integration

The test of the model shows an accuracy of more than 86 percent after being trained for five epochs.
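
For reference, the notebook's cells follow the same general pattern as the minimal sketch below; the exact layer sizes and options in tf_keras_fashion_mnist.ipynb may differ, so treat this as an outline of the approach rather than a copy of the notebook.

     # A small Keras Fashion-MNIST classifier, sketched for illustration;
     # layer sizes and options are assumptions, not copied from the notebook.
     import tensorflow as tf

     # Load and normalize the Fashion-MNIST images (28x28 grayscale, 10 classes).
     (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
     x_train, x_test = x_train / 255.0, x_test / 255.0

     # A compact fully connected classifier.
     model = tf.keras.Sequential([
         tf.keras.layers.Flatten(input_shape=(28, 28)),
         tf.keras.layers.Dense(128, activation="relu"),
         tf.keras.layers.Dense(10, activation="softmax"),
     ])
     model.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])

     # Train for five epochs, then evaluate on the held-out test set.
     model.fit(x_train, y_train, epochs=5)
     test_loss, test_acc = model.evaluate(x_test, y_test)
     print("Test accuracy:", test_acc)
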

Task 9: Connect to a Hadoop cluster from a notebook and execute a Spark MLlib model

This next section explains how to use the notebook to connect to a Hadoop data lake that has an Apache Livy service deployed. The following image shows the Hadoop integration.

Hadoop integration

Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It supports long-running Spark sessions and multi-tenancy. To install it on your Hadoop cluster, see your Hadoop vendor documentation like this one from Hortonworks. To get the Spark MLlib notebook to connect and run, make the following two changes on the Hortonworks HDP cluster.

  1. Disable the Livy CSRF check by setting livy.server.csrf_protection.enabled=false in the HDP Spark2 configuration. Stop and Start all services to pick up the changes.

  2. Install the numpy package via pip.

    1. yum -y install python-pip
    2. pip install numpy

Sparkmagic runs in a Jupyter Notebook. It includes a set of tools for interactively working with remote Spark clusters through Livy. It is installed through pip and enabled in the notebook by running a Jupyter command.
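
The cells below sketch one common way to load sparkmagic in a notebook and register a Livy endpoint; the host name and port are placeholders, and the sample notebook may create its session with different options.

     # Load the sparkmagic IPython magics in a notebook cell.
     %load_ext sparkmagic.magics

     # Open the interactive widget to register the Livy endpoint
     # (for example, http://<your-hadoop-host>:8998) and start a PySpark session.
     %manage_spark
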

Sign on to the notebook server and import the hadoop_livy2_spark_mllib_test.ipynb notebook provided by this tutorial and execute it.

  • Notebook cell [1] verifies that the sparkmagic module can be loaded.
  • Notebook cell [2] verifies that the Spark session can be created. Edit the URL to point to your Hadoop host and port for the Livy service.
  • Notebook cell [3] downloads the data and puts it into the HDFS /tmp directory.
  • Notebook cell [4] runs a Spark MLlib k-means clustering model (a sketch of this kind of code appears after this list).
  • Notebook cell [5] cleans up the Spark session running on the Livy service. It is important to clean up the session and associated Hadoop cluster resources.
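
To give a flavor of what cell [4] does, here is a minimal sketch of a k-means clustering job written against the DataFrame-based pyspark.ml API. The toy data and column names are made up for illustration, and the actual notebook reads its data from HDFS and may use the RDD-based pyspark.mllib API instead; inside a sparkmagic %%spark cell, the spark variable is the remote SparkSession provided through Livy.

     # Illustrative k-means clustering with Spark's DataFrame-based ML API.
     # The in-line dataset and column names are placeholders; inside a %%spark
     # cell, `spark` is the remote SparkSession created through Livy.
     from pyspark.ml.clustering import KMeans
     from pyspark.ml.feature import VectorAssembler

     # Assemble numeric columns into the "features" vector that KMeans expects.
     df = spark.createDataFrame(
         [(0, 1.0, 1.1), (1, 1.2, 0.9), (2, 9.0, 9.1), (3, 8.8, 9.2)],
         ["id", "x", "y"])
     features = VectorAssembler(inputCols=["x", "y"],
                                outputCol="features").transform(df)

     # Fit a two-cluster model and print the cluster centers.
     model = KMeans(k=2, seed=1).fit(features)
     print(model.clusterCenters())
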

Running the notebook

Conclusion

You have now learned how to customize and install Anaconda and Notebook environments in Watson Machine Learning Accelerator. You also learned how to use the notebook server to run a notebook with a Keras model, and how to run a notebook that connects to a Hadoop data lake and executes a Spark MLlib model.

Kelvin Lui
Jim Van Oosten
Rodrigo Ceron