Train Keras and MLlib models within a Watson Machine Learning Accelerator custom notebook
Customize a notebook package to include Anaconda, PowerAI, and sparkmagic, and use it to connect to a Hadoop cluster and execute a Spark MLlib model
IBM Watson Machine Learning Accelerator is a software solution that bundles IBM Watson Machine Learning Community Edition, IBM Spectrum Conductor, IBM Spectrum Conductor™ Deep Learning Impact, and support from IBM for the whole stack, including the open source deep learning frameworks. Watson Machine Learning Accelerator provides an end-to-end deep learning platform for data scientists. This includes complete lifecycle management, from installation and configuration, to data ingest and preparation, to building, optimizing, and distributing the training model, to moving the model into production. Watson Machine Learning Accelerator excels when you expand your deep learning environment to include multiple compute nodes. There’s even a free evaluation available. See the prerequisites from our first introduction tutorial, Classify images with Watson Machine Learning Accelerator.
This article has been updated for Watson Machine Learning Accelerator v1.2.x. It leverages the Anaconda and notebook environment creation function provided by IBM Spectrum Conductor.
This is the second tutorial in the IBM Watson Machine Learning Accelerator education series. In it, you learn how to:
- Configure the resource groups
- Configure the roles
- Configure the Consumer
- Create a user
- Import the Anaconda installer into WLM-A and create a conda environment
- Create a Notebook environment
- Create a Spark instance group with a notebook that uses the Anaconda environment
- Start the notebook server and upload a notebook to train a Keras model
- Connect to a Hadoop cluster from a notebook and execute a Spark MLlib model
It should take you about two hours to complete this tutorial, including roughly 30 minutes of model training, plus installation, configuration, and working through the steps in the GUI.
The tutorial requires access to a GPU-accelerated IBM Power® Systems server model AC922 or S822LC. In addition to acquiring a server, there are multiple options to access Power Systems servers listed on the IBM PowerAI developer portal.
Task 1: Configure the resource groups
Log on as the cluster Admin user.
Open the Resource Group configuration.
Select the ComputeHosts resource group.
Configure the number of slots to a value that makes sense for your hardware. If the server is an 8-thread capable system, use 7 processors; if it’s a 4-thread capable system, use 3 processors.
Optionally (but recommended), change the resource selection method to static, and then select only the servers that will provide computing (processor) power to the cluster.
Click Apply to commit the changes.
Create a new resource group.
Call it GPUHosts.
For the number of slots, use the advanced formula and set it equal to the number of GPUs on the systems by using the keyword ngpus.
Optionally (but recommended), change the resource selection method to static and select the nodes that are GPU-capable.
Under the Members Host column, click preferences and select the attribute ngpus to be displayed.
Click Apply and validate that the Members Host column now displays ngpus.
Finish the creation of the resource group by clicking Create.
Go to Resources -> Resource Planning (slot) -> Resource Plan.
Change the allocation policy of the ComputeHosts resource group to balanced.
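To sanity-check the slot values you entered above, you can inspect a node from the command line. This is a sketch, not part of the product: `nproc` and `nvidia-smi` are standard Linux/NVIDIA tools, and the headroom calculation simply mirrors the "threads minus one" guidance from Task 1.

```shell
# Inspect a node to pick sensible slot values.
# nvidia-smi is only present on GPU-capable nodes; on CPU-only
# nodes the GPU count below resolves to 0.
total_threads=$(nproc)

# Leave one processor free for system overhead, per the guidance above
compute_slots=$(( total_threads - 1 ))

# GPU count -- this is what the ngpus keyword resolves to on each host
gpu_count=$(nvidia-smi -L 2>/dev/null | wc -l)

echo "threads=${total_threads} compute_slots=${compute_slots} gpus=${gpu_count}"
```

Run this on each host before setting static resource selection so the slot counts in the GUI match what the hardware actually provides.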
Task 2: Configure the roles
To start, we create a Chief Data Scientist role with privileges intermediate between an Admin account and a Data Scientist account. This role has the authority of a data scientist plus additional privileges to start and stop instance groups. The idea is that users do not need to go to a cluster Admin to start or stop their instance groups; instead, they have the Chief Data Scientist do so.
Go to Systems & Services -> Users -> Roles.
Select the Data Scientist role and duplicate it by clicking the duplicate button.
Call the new role Chief Data Scientist.
Select the Chief Data Scientist role and add two privileges:
a. Conductor -> Spark Instance Groups -> Control
b. Ego Services -> Services -> Control
Click Apply to commit the changes.
Task 3: Configure the Consumer
At the OS level, as root, on all nodes, create an OS group and user for the OS execution user.
groupadd demoexec
useradd -g demoexec -m demoexec
The GID and UID of the created user / group must be the same on all nodes.
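One way to keep the numeric IDs consistent is to pick an unused GID/UID up front and pass it explicitly on every node. The sketch below only prints the commands (dry run); the ID 2001 and the hostnames are placeholders, not values from this tutorial.

```shell
# Dry-run sketch: print the per-node commands to run as root.
# DEMO_ID (2001) and the hostnames are placeholders -- choose an ID
# that is free on every node so the numeric GID/UID match cluster-wide.
DEMO_ID=2001
for host in node1 node2 node3; do
  echo "ssh ${host} \"groupadd -g ${DEMO_ID} demoexec && useradd -u ${DEMO_ID} -g demoexec -m demoexec\""
done
```

After creating the user, `id demoexec` on each node should report identical uid and gid values.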
Now go to Resources -> Consumers.
Click Create a consumer.
Name your consumer DemoConsumer (for best practices, use starting capital letters), and use demoexec in the list of users.
Scroll down and enter demoexec as the OS user for execution, and select the Management, Compute, and GPU resource groups.
Click Create to save.
On the left-side column, click the DemoConsumer consumer that you just created, and then click Create a consumer.
Name your consumer Anaconda3-DemoConsumer (for best practices, use starting capital letters). Leave the Inherit the user list and group list from parent consumer selected.
Scroll down and use demoexec as the operating system user for workload execution, and make sure all of the resource groups are selected.
Your Anaconda3-DemoConsumer should now appear as a child of DemoConsumer.
Task 4: Create a user
Go to Systems & Services -> Users -> Accounts.
Click Create New user account.
Create a demonstration account called DemoUser.
Go to Systems & Services -> Users -> Roles.
Select your newly defined user (make sure you do not unselect Admin in the process), and then assign it to the DemoConsumer consumer you created in Task 3.
Click OK and then Apply to commit the changes. Do not forget to click Apply!
Task 5: Import the Anaconda installer into WLM-A and create an environment
Download the Anaconda installer file to your workstation. You can use wget or a browser download option for the URL.
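For example, the ppc64le installer matching the version details below is published in the public Anaconda archive. The URL is an assumption based on Anaconda's standard archive naming; the command is printed rather than executed here, so verify it matches your target version before running it for real.

```shell
# Assumed installer name and archive URL -- verify against the
# Anaconda version (2019.03) and architecture (ppc64le) used in this task
INSTALLER=Anaconda3-2019.03-Linux-ppc64le.sh
URL="https://repo.anaconda.com/archive/${INSTALLER}"
echo "wget ${URL}"   # drop the echo to perform the actual download
```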
Open the Spark Anaconda Management panel by using the Spectrum Conductor management console.
Add a new Anaconda.
Fill in the details for the Anaconda and click Add.
Distribution name is Anaconda3
Use browse to find and select the Anaconda installer that you downloaded earlier in this task
Anaconda version: 2019.03
Python version: 3
Operating system: Linux on Power 64-bit little endian (LE)
Click Add to begin the Anaconda upload. The upload time varies based on your network speed.
After the Anaconda add is complete, you can deploy it and create an environment for it.
Deploy preparation: On all nodes, create a directory on the local disk space for an Anaconda deployment. In this example, the local disk space is /cwslocal, and the execution user we are going to use in the Spark Instance Group is demoexec. Your shared disk location and execution user might differ.
- mkdir -p /cwslocal/demoexec/anaconda
- chown demoexec:demoexec /cwslocal/demoexec/anaconda
Now, select the distribution you just created, and click Deploy.
Fill in the required information.
In this example, the instance name follows a pattern of [Anaconda Name]-[Consumer]-[PowerAI]. The deployment directory matches the one that we created in the previous step. The consumer follows a pattern of [Anaconda Name]-[Consumer].
- Instance name: Anaconda3-DemoConsumer-PowerAI
- Deployment directory: /cwslocal/demoexec/anaconda
- Consumer: Anaconda3-DemoConsumer (created in Task 3)
- Resource group: compute hosts
Execution user: demoexec
Click on the Environment Variables tab.
Add the variables for PATH and IBM_POWERAI_LICENSE_ACCEPT using the Add a Variable button.
| Name | Value |
| ----- | ------ |
| PATH | $PATH:/usr/bin |
| IBM_POWERAI_LICENSE_ACCEPT | yes |
Click Configure to complete the Anaconda deployment.
Click Deploy, and watch as your Anaconda environment gets deployed.
Download or create a powerai161.yml file on your workstation with the following content (note the indentation in the file). This is a YAML file that is used to create an Anaconda environment. If you do not have a YAML-enabled editor, consider verifying that the file format is valid by pasting the contents into an online YAML validation tool.
```yaml
name: powerai161
channels:
  - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
  - defaults
dependencies:
  - conda=4.6.11
  - jupyter
  - pyyaml
  - tornado=5.1.1
  - sparkmagic
  - numpy
  - numba
  - openblas
  - pandas
  - python=3.6.8
  - keras
  - matplotlib
  - scikit-learn
  - scipy
  - cuml
  - cudf
  - powerai=1.6.1
  - cudatoolkit-dev
  - pip:
    - sparkmagic==0.12.8
```
You might have additional conda and pip packages that you want installed. Those packages can be added to the dependencies and pip list in the file.
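Before uploading, you can sanity-check the file locally with PyYAML (which the environment itself lists as a dependency, and which you would need installed on your workstation). The inline string below is a trimmed copy just to show the expected shape; point `safe_load` at your real powerai161.yml instead.

```python
# Local sanity check of the conda environment file before uploading it.
# ENV_YML is a trimmed, illustrative copy -- replace it with
# open("powerai161.yml") to validate the real file.
import yaml

ENV_YML = """\
name: powerai161
channels:
  - https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/
  - defaults
dependencies:
  - python=3.6.8
  - sparkmagic
  - pip:
    - sparkmagic==0.12.8
"""

env = yaml.safe_load(ENV_YML)
print(env["name"], len(env["dependencies"]))
```

If `safe_load` raises an error, fix the indentation before attempting the Add in the GUI; a malformed file is the most common cause of a failed environment creation.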
Select the Anaconda3 distribution that you created. Click Add to add a conda environment.
Create a new environment from the powerai161.yml file that you created: use the Browse button to select the file, and then click Add.
Watch the environment get created. It creates an environment with over 200 packages. If Add fails, check the logs and verify that the YAML file is formatted correctly. Retry the Add after the issue is resolved.
Task 6: Create a Notebook environment
We use the IBM Spectrum Conductor-provided notebook. You can see it in Workload -> Spark -> Notebook Management.
Notice that there is a notebook called Jupyter, version 5.4.0. If you select it and click Configure, you can view the settings for this notebook.
The settings show properties such as:
- The notebook package name
- The scripts in use
- Use of SSL
- Anaconda required (make sure this setting is selected)
At the moment, due to a RAPIDS package dependency called faiss, we need to apply a patch to the standard Jupyter 5.4.0 deploy.sh script. This patched version can be found here. Download this file to your workstation and replace the one that comes with Conductor by clicking Browse and selecting the patched version.
Click Update Notebook.
In the next step, we show how to create a new Spark Instance Group that uses the notebook.
Task 7: Create a Spark Instance Group (SIG) for the notebook
SIG preparation: On either node, create the data directory for the execution user within the shared filesystem. For this example, the shared filesystem is /cwsshare.
- mkdir -p /cwsshare/demoexec/
- chown -R demoexec:demoexec /cwsshare/demoexec/
Create a new SIG and include the added notebook. Go to Workload -> Spark -> Spark Instance Groups.
Fill in the information with the following values:
a. Instance group name: Notebook-DemoConsumer
b. Deployment directory: /cwslocal/demoexec/notebook-democonsumer
c. Spark version: use the latest one available
Select the Jupyter 5.4.0 notebook and set the following properties:
a. Data directory: /cwsshare/demoexec/notebook-democonsumer
b. Anaconda environment: select the environment that you created in Task 5
Scroll down and click on the standard consumer that the process creates. We need to change it.
Scroll down until you find the standard suggested consumer name and click the X to delete it.
Look for the DemoConsumer consumer, select it, and create a child named Notebook-DemoConsumer. Click Create and then Select.
Your consumer should now look something like this.
Scroll down and select the GPUHosts resource group for Spark Executors (GPU slots). Do not change anything else.
Click Create and Deploy Instance Group at the bottom of the page.
Watch as your instance group gets deployed.
After the deployment completes, start the SIG by clicking Start.
Task 8: Create the notebook server for users and upload a notebook to train a Keras model
After the SIG is started, go to the Notebook tab and click Create Notebooks for Users.
Select the users for the notebook server.
After the notebook has been created, refresh the screen to see My Notebooks. Clicking this shows the list of notebook servers created for this SIG.
Select the Jupyter 5.4.0 notebook to bring up the notebook server URL.
Sign on to the notebook server.
Download the tf_keras_fashion_mnist.ipynb notebook and upload it to the notebook server by clicking Upload. You must click Upload again after selecting the notebook file.
Select the notebook and begin executing the cells. The Keras model is defined in cell  and is trained in cell .
The test of the model shows an accuracy of more than 86 percent after being trained for five epochs.
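For orientation, a model of the kind this notebook trains on fashion-MNIST can be sketched in a few lines of Keras. The layer sizes here are illustrative assumptions, not values copied from the tutorial notebook; the actual cells in tf_keras_fashion_mnist.ipynb remain the reference.

```python
# Minimal sketch of a fashion-MNIST classifier in tf.keras.
# Layer sizes are illustrative, not taken from the tutorial notebook.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),    # 28x28 grayscale images
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),  # 10 clothing classes
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training for five epochs, as in the tutorial, would look like:
# model.fit(train_images, train_labels, epochs=5)
print(model.output_shape)
```

Because the SIG's GPU resource group backs the Spark executors, the training cells run GPU-accelerated without any change to code like this.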
Task 9: Connect to a Hadoop cluster from a notebook and execute a Spark MLlib model
This next section explains how to use the notebook to connect to a Hadoop data lake that has an Apache Livy service deployed. The following image shows the Hadoop integration.
Apache Livy is a service that enables easy interaction with a Spark cluster over a REST interface. It supports long-running Spark sessions and multi-tenancy. To install it on your Hadoop cluster, see your Hadoop vendor documentation like this one from Hortonworks. To get the Spark MLlib notebook to connect and run, make the following two changes on the Hortonworks HDP cluster.
Disable the Livy CSRF check by setting livy.server.csrf_protection.enabled=false in the HDP Spark2 configuration. Stop and start all services to pick up the change.
Install the numpy package via pip.
yum -y install python-pip
pip install numpy
Sparkmagic runs in a Jupyter Notebook. It includes a set of tools for interactively working with remote Spark clusters through Livy. It is installed through pip and enabled in the notebook by running a Jupyter command.
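Under the hood, sparkmagic drives Livy's REST API. The sketch below only builds the request payloads to show the protocol's shape; the host, port, and session id are placeholders, and in the notebook the %spark magics issue the equivalent calls for you.

```python
# Shape of the Livy REST exchanges that sparkmagic performs.
# your-hadoop-host and the session id are placeholders; 8999 is
# Livy's default port.
import json

livy_url = "http://your-hadoop-host:8999"
create_session = {"kind": "pyspark"}      # POST {livy_url}/sessions
statement = {"code": "sc.parallelize(range(10)).sum()"}
                                          # POST {livy_url}/sessions/<id>/statements
# DELETE {livy_url}/sessions/<id> releases the session and its cluster resources

print(json.dumps(create_session), json.dumps(statement))
```

This is also why the cleanup cell in the tutorial notebook matters: an undeleted Livy session keeps holding YARN resources on the Hadoop cluster.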
Sign on to the notebook server and import the hadoop_livy2_spark_mllib_test.ipynb notebook provided by this tutorial and execute it.
- Notebook cell  verifies that the sparkmagic module can be loaded.
- Notebook cell  verifies that the Spark session can be created. Edit the URL to point to your Hadoop host and port for the Livy service.
- Notebook cell  downloads the data and puts it in the hdfs /tmp directory.
- Notebook cell  runs a Spark MLlib kmeans clustering model.
- Notebook cell  cleans up the Spark session running on the Livy service. It is important to clean up the session and associated Hadoop cluster resources.
You have now learned how to customize and install Anaconda and notebook environments in Watson Machine Learning Accelerator. You also learned how to use the notebook server to run a notebook with a Keras model, and how to run a notebook that connects to a Hadoop data lake and executes a Spark MLlib model.