Snap ML is available as part of IBM Watson Machine Learning Community Edition (WML CE) 1.6.1, a component in WML Accelerator 1.2.1. By setting up a WML Accelerator environment that can execute snap-ml-spark APIs, you can complete the following Snap ML operations in WML Accelerator:

  • Running snap-ml-spark applications through spark-submit
  • Enabling snap-ml-spark APIs inside Jupyter Notebooks

To perform these operations, you must configure IBM Spectrum Conductor 2.3 according to the information shown below. IBM Spectrum Conductor 2.3 is a component of IBM Watson Machine Learning Accelerator 1.2.1 for deploying frameworks and services for a multitenant enterprise environment.

  • Overview
  • Configure IBM Spectrum Conductor for Snap ML
  • Create the Anaconda environment
  • Create the Spark instance group (SIG)
  • Running snap-ml-spark applications through spark-submit in WML Accelerator 1.2.1
  • Running Jupyter Notebooks in WML Accelerator 1.2.1 using snap-ml-spark

Overview

The snap-ml-spark package provides a Python API for easy-to-use training of Generalized Linear Models (GLMs).
The snap-ml-spark package offers distributed training of models across a cluster of machines. The library is exposed to the user via a spark.ml-like interface that can be seamlessly integrated into an existing pySpark application.
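
The snippet below is a minimal sketch of what such a pySpark application can look like. It is illustrative only: the estimator and parameter names (LogisticRegression, max_iter, regularizer, use_gpu) follow the Snap ML documentation and the examples shipped with WML CE, so verify them against the examples directory referenced later in this article.

    # Minimal snap-ml-spark training sketch (illustrative; verify estimator and
    # parameter names against the examples shipped with WML CE).
    from pyspark.sql import SparkSession
    from snap_ml_spark import LogisticRegression

    spark = SparkSession.builder.appName("snap-ml-spark-glm").getOrCreate()

    # Load a libsvm-format dataset into a DataFrame; the path is illustrative.
    train = spark.read.format("libsvm").load("/wmla-nfs/data/train.libsvm")

    # Train a GLM on GPUs; use_gpu mirrors the --use_gpu flag used by the
    # spark-submit example later in this article.
    lr = LogisticRegression(max_iter=50, regularizer=1.0, use_gpu=True)
    model = lr.fit(train)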

Configure IBM Spectrum Conductor for Snap ML

Create the gpus resource group:

  1. Log in to the cluster management console as an administrator.
  2. From the cluster management console, navigate to Resources > Resource Planning (Slot) > Resource Groups.

  3. Under Global Actions, click Create a Resource Group.

  4. Create a resource group called gpus with Advanced formula ngpus.

  5. Navigate to Resources > Consumers and select a consumer, such as SampleApplications, then open the Consumer Properties tab. Under the section Specify slot-based resource groups, select the new gpus resource group and click Apply.

  6. Navigate to Resources > Resource Planning (Slot) > Resource Plan. Select the gpus resource group and select Exclusive as the Slot allocation policy. Exclusive indicates that when IBM Spectrum Conductor allocates resources from this resource group, it uses all free slots from a host. For example, assuming there are four GPUs on a host, a request for 1, 2, 3, or 4 GPUs would take the whole host. Click Apply.

Create the Anaconda environment

Create an Anaconda environment first, then during the creation of the Spark instance group (SIG), select Jupyter Notebook, and then select the appropriate Anaconda distribution environment name.

  1. Navigate to Workload > Spark > Anaconda Management. Look for the ppc64le Anaconda distribution name, such as Anaconda3-2019-03-Linux-ppc64le. If you do not find it, follow these steps:
    a. Download the Anaconda3-2019.03-Linux-ppc64le.sh installer from the Anaconda archive (https://repo.anaconda.com/archive/).
    b. From the Anaconda Management window, click Add.
    c. Upload Anaconda3-2019.03-Linux-ppc64le.sh.

  2. Select Anaconda3-2019-03-Linux-ppc64le and click Deploy.

  3. Specify a name, such as myAnaconda, for the Anaconda distribution and a deployment directory, such as /home/egoadmin/myAnaconda.
  4. Select the Environment Variables tab, click Add Variable, and add the following variables:
    IBM_POWERAI_LICENSE_ACCEPT=yes
    PATH=$PATH:/usr/bin or PATH=$PATH:/bin, depending on where bash exists on your system.
  5. Click Deploy.
  6. Click Continue to Anaconda Distribution Instance.
  7. Prepare the conda environment yml file from the existing dlipy3 conda environment by following these steps:
    a. Log in or ssh to the WML Accelerator master host with the user ID that was used to install WML Accelerator (such as root) and run the following commands:
    source activate dlipy3
    conda env export | grep -v "^prefix: " > /tmp/dlipy3_env.yml
    Note that you might first have to run a command similar to export PATH=/opt/anaconda3/bin:$PATH so that the correct conda command is picked up.
    b. Add the following lines under the dependencies: section, above the - pip: entry:

    - jupyter=1.0.0
    - jupyter_client=5.2.2
    - jupyter_console=5.2.0 
    - jupyter_core=4.4.0
    - jupyterlab=0.31.5
    - jupyterlab_launcher=0.10.2
    - notebook=5.6.0
    - conda=4.5.12
    

    Change the tornado version to 5.1.1 in /tmp/dlipy3_env.yml, as shown below:

    - tornado=5.1.1=py36h7b6447c_0
    

    Copy the file /tmp/dlipy3_env.yml to your local system so that you can upload this yml file when creating the conda environment through the GUI.
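
    For reference, after these edits the top of /tmp/dlipy3_env.yml looks roughly like the excerpt below. This is only a sketch of where the additions land; the exported file contains the full dlipy3 package list and channels, which vary by installation.

    name: dlipy3
    channels:
      # ...channels as exported...
    dependencies:
      - jupyter=1.0.0
      # ...the remaining added jupyter lines and the exported packages,
      # including the tornado pin:
      - tornado=5.1.1=py36h7b6447c_0
      - pip:
        # ...pip packages as exported...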

  8. From the Anaconda Distribution Instance page, select the ppc64le Anaconda distribution name Anaconda3-2019-03-Linux-ppc64le.
  9. In the wizard, select the Anaconda distribution instance myAnaconda, then under Conda environments, click Add.

  10. Select Create environment from a yaml file and click Browse. Select the dlipy3_env.yml file and click Add. This creates a conda environment named dlipy3 with the IBM Watson Machine Learning Community Edition components installed in it. You select this dlipy3 conda environment when you create the Spark instance group in the next section.

Create the Spark instance group (SIG)

To use snap-ml-spark, configure the SIG as follows:

  1. Navigate to Workload > Spark > Spark Instance Groups > New. For Spark version, select Spark 2.3.1.
  2. Click the Configuration link near Spark 2.3.1 to open the configuration wizard and set these configuration properties:
  3. Set SPARK_EGO_CONF_DIR_EXTRA to /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/conductor_spark/conf.
  4. Set SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX to the number of GPUs available on each host in the cluster. For example, SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=4.
  5. Go to Additional Parameters and click Add a Parameter. Add the parameter spark.jars with the value
    /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/lib/snap-ml-spark-v1.3.0-ppc64le.jar.
  6. In the Spark instance group creation page, set the following configuration options:
    From Enable notebooks, select Jupyter 5.4.0 and specify the following:

    • Provide a shared directory as the base data directory (such as /wmla-nfs/ravigumm-2019-03-07-00-03-03-dli-shared-fs/data).
    • Select the Anaconda distribution instance (myAnaconda) and the Conda environment (dlipy3).
  7. Click the Configuration link for Jupyter 5.4.0, go to the Environment Variables tab, and click Add a variable. Add the variable JUPYTER_SPARK_OPTS with appropriate values. For example, to use eight GPUs with eight partitions for notebooks in a SIG with two hosts and four GPUs on each host, specify the following: --conf spark.ego.gpu.app=true --conf spark.ego.gpu.executors.slots.max=4 --conf spark.default.parallelism=8
  8. Under the Resource Groups and Plans section, ensure that the gpus resource group is selected for Spark executors (GPU slots). Select the ComputeHosts resource group for everything else in the section, then click Create and Deploy Instance Group.
  9. After the SIG is deployed and started, to run Jupyter Notebooks, click on the SIG, go to the Notebooks tab, and click Create Notebooks for Users. Select users (for example, Admin and other users as required) and click Create.
    • Stop and start the Jupyter 5.4.0 notebook that you created in order to get the sample notebooks (snap_ml_spark_example_notebooks) on the home page when you log in.
    • Start this Jupyter 5.4.0 notebook only when Jupyter notebooks are to be executed. This ensures that GPUs are not allocated to the notebook unnecessarily.

Running snap-ml-spark applications through spark-submit in WML Accelerator 1.2.1

In the cluster management console, navigate to Workload > Spark > My Applications And Notebooks. Click Run Application for spark-submit.
A sample spark-submit command takes the following arguments, entered in the box in the Run Application wizard:

--master ego-client --conf spark.ego.gpu.app=true /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/examples/example-criteo45m/example-criteo45m.py --data_path /wmla-nfs/criteoData --num_partitions 8 --use_gpu

Note: Replace ego-client with ego-cluster to submit the Spark job in cluster mode instead of client mode.

This spark-submit command uses one of the samples shipped with WML CE, in this case example-criteo45m.py. The /wmla-nfs/criteoData/data/ directory contains the input Criteo data. Because this data directory must be accessible on the cluster host where the Spark application driver runs, consider placing it on a shared file system.
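
The flags after the script path (--data_path, --num_partitions, --use_gpu) are consumed by the example script itself, not by Spark. The sketch below shows how a pySpark script can pick them up with standard argparse; it is illustrative only, and the actual example-criteo45m.py shipped with WML CE may parse its arguments differently.

    # Sketch: parsing the application arguments passed after the script path
    # in the spark-submit line above (illustrative only).
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--data_path", required=True)           # e.g. /wmla-nfs/criteoData
    parser.add_argument("--num_partitions", type=int, default=8)
    parser.add_argument("--use_gpu", action="store_true")
    args = parser.parse_args()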

Details about running this example and its related data set can be found in /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/examples/example-criteo45m/README.md. You can find more examples in /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/examples/.

When the application runs, the cluster management console shows it using eight GPUs across two hosts.

Running Jupyter Notebooks in WML Accelerator 1.2.1 using snap-ml-spark

  1. Click on the SIG and go to the Notebooks tab.
  2. Start the notebook if it is not in the Started state.
  3. Click My Notebooks and click the entry similar to Jupyter 5.4.0, owned by Admin. This opens a new window that lets you log in to the Notebooks home page.
  4. Log in as Admin, click the snap_ml_spark_example_notebooks folder, then select any of the sample notebooks to open and run. Read the instructions at the beginning of the notebook before running the notebook.
  5. Click New in the Jupyter Notebooks home page and select Spark Cluster to create a new IPython notebook where snap-ml-spark can be imported and its API can be used; a quick sanity check is sketched below.
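
As a quick sanity check in such a notebook, you can confirm that the options set through JUPYTER_SPARK_OPTS reached the Spark session. This sketch assumes the notebook kernel predefines a spark session, as standard pySpark notebook kernels do:

    # Confirm the Spark options set via JUPYTER_SPARK_OPTS (step 7 of the SIG
    # setup) are active in the notebook's Spark session.
    conf = spark.sparkContext.getConf()
    print(conf.get("spark.default.parallelism"))
    print(conf.get("spark.ego.gpu.executors.slots.max"))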

As shown above, with this setup you can use snap-ml-spark in WML Accelerator 1.2.1 to train models distributed across a cluster of machines.

For further details, refer to WML Accelerator 1.2.1 in the IBM Knowledge Center at http://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.1.
