Working with Snap ML in WML Accelerator 1.2.0

Spectrum Conductor in WML Accelerator 1.2.0 provides the capability to set up a Spark cluster automatically. To run an application that uses the snap-ml-spark APIs in the Spectrum Conductor environment in IBM WML Accelerator, either:

  1. Run a snap-ml-spark application through spark-submit in IBM WML Accelerator, OR
  2. Enable the snap-ml-spark APIs inside Jupyter Notebooks in IBM WML Accelerator

To perform these operations, a set of configuration changes is required in IBM Spectrum Conductor.

Configuring IBM Spectrum Conductor with Spark for IBM PowerAI SnapML

Create gpus Resource Group

  1. Log in to the IBM Spectrum Computing Cluster Management Console as an Administrator.
  2. From the cluster management console, navigate to Resources > Resource Planning (Slot) > Resource Groups.
     

     
  3. Under Global Actions, click Create a Resource Group.
     

     

  4. Create a resource group called gpus with Advanced formula ngpus.
     

     
  5. Navigate to Resources > Consumers and select a consumer, such as SampleApplications. After selecting it, click the Consumer Properties tab. Under the section Specify slot-based resource groups, select the check box for the resource group that was just created (gpus) and click Apply.
     

     
  6. Navigate to Resources > Resource Planning (Slot) > Resource Plan. Select Resource Group: gpus from the drop-down box and Exclusive as the Slot allocation policy. Exclusive indicates that when IBM Spectrum Conductor with Spark allocates resources from this resource group, it uses all free slots from a host. For example, assuming there are 4 GPUs on a host, a request for 1, 2, 3, or 4 GPUs would take the whole host. Click Apply.
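The rounding behavior of the Exclusive policy can be modeled with a small shell sketch; the numbers and variable names here are illustrative, not Conductor settings:

```shell
# Model of the Exclusive slot allocation policy: any GPU request is
# rounded up to whole hosts (illustrative numbers, not a Conductor API).
GPUS_PER_HOST=4
requested=3
# hosts needed = ceiling(requested / GPUS_PER_HOST)
hosts_needed=$(( (requested + GPUS_PER_HOST - 1) / GPUS_PER_HOST ))
gpus_allocated=$(( hosts_needed * GPUS_PER_HOST ))
echo "request=${requested} -> ${gpus_allocated} GPU slots (${hosts_needed} whole host(s))"
```

With 4 GPUs per host, a request for 1, 2, 3, or 4 GPUs all yields 4 allocated slots, matching the example above.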
     

     

Create the Anaconda environment

Create an anaconda environment first. Then, during creation of the Spark Instance Group (SIG), select Jupyter Notebook, and select the anaconda distribution and the anaconda environment name from the drop-down boxes.

  1. Navigate to Workload -> Spark -> Anaconda Management. Look for a ppc64le anaconda distribution name such as Anaconda3-2018-12-Linux-ppc64le. If you do not find it, follow these steps:

     

  2. Select Anaconda3-2018-12-Linux-ppc64le and click Deploy.
     

     
  3. Specify a name (such as myAnaconda) for the anaconda distribution and a deployment directory (such as /home/egoadmin/myAnaconda).
     

     
  4. Click the Environment Variables tab, click Add Variable, and add the following variables:
    • IBM_POWERAI_LICENSE_ACCEPT=yes
    • PATH=$PATH:/usr/bin or PATH=$PATH:/bin, depending on where bash exists on your system.

     

  5. Click Deploy.
  6. Click on Continue to Anaconda Distribution Instance.
  7. Prepare the conda environment yml file from the existing dlipy3 conda environment by following these steps:
    • Log in or ssh to the WML Accelerator master host with the user ID that was used to install WML Accelerator (this may be root) and run the following commands:
      source activate dlipy3
      conda env export | grep -v "^prefix: " > /tmp/dlipy3_env.yml
      Note that you may have to run a command similar to export PATH=/opt/anaconda3/bin:$PATH first so that the correct conda command takes effect.
    • Add the following lines under dependencies: but above the - pip: entry:

      - jupyter=1.0.0
      - jupyter_client=5.2.2
      - jupyter_console=5.2.0
      - jupyter_core=4.4.0
      - jupyterlab=0.31.5
      - jupyterlab_launcher=0.10.2
      - notebook=5.6.0
      - conda=4.5.12

    • Copy the file /tmp/dlipy3_env.yml to your local system so that you can upload this yml file when creating the conda environment through the GUI.
  8. From the Anaconda Distribution Instance page, select ppc64le anaconda distribution name Anaconda3-2018-12-Linux-ppc64le.
  9. In the wizard, click the Anaconda distribution instance myAnaconda, and then click Add under Conda environments.
     
  10. Select Create environment from a yaml file and click Browse. Select the dlipy3_env.yml file and click Add. This creates a conda environment named dlipy3 with the PowerAI Base components installed in the environment. This dlipy3 conda environment is selected when you create the Spark Instance Group in the next steps.
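For reference, the edited dlipy3_env.yml is laid out roughly as follows; the name, channels, and the packages other than the added jupyter/notebook/conda lines come from your own conda env export output and are shown here only as placeholders:

```yaml
name: dlipy3
channels:
  - defaults
dependencies:
  # ...packages exported from the dlipy3 environment...
  - jupyter=1.0.0
  - jupyter_client=5.2.2
  - jupyter_console=5.2.0
  - jupyter_core=4.4.0
  - jupyterlab=0.31.5
  - jupyterlab_launcher=0.10.2
  - notebook=5.6.0
  - conda=4.5.12
  - pip:
      # ...pip packages exported from the dlipy3 environment...
```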
     

Create the Spark Instance Group (SIG)

To use snap-ml-spark, the Spark Instance Group (SIG) must be configured in Spectrum Conductor with the following specific settings:

  1. Navigate to Workload -> Spark -> Spark Instance Groups -> New. From the Spark version field, select Spark 2.3.1 from the drop-down box.
     

     
  2. Click the Configuration link near Spark 2.3.1 to open the configuration wizard and set the configuration properties SPARK_EGO_CONF_DIR_EXTRA, SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX, and spark.jars:
    • Set the value for the property SPARK_EGO_CONF_DIR_EXTRA to be /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/conductor_spark/conf.
       

       
    • Set the value for the property SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX to be the number of GPUs available on each host in the cluster. For example, SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=4.
       

       
    • Go to Additional Parameters and click Add a Parameter. Add the parameter spark.jars with the value /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/lib/snap-ml-spark-v1.2.0-ppc64le.jar.
       

       
  3. In the Spark Instance Group creation page, set the following configuration options:
    • Under Enable notebooks, select Jupyter 5.4.0.
    • Provide a shared directory as the base data directory (such as /wmla-nfs/ravigumm-2019-03-07-00-03-03-dli-shared-fs/data).
    • Select the Anaconda distribution instance (myAnaconda) and the conda environment (dlipy3) from the drop-down boxes.
       

       
  4. Click the Configuration link for Jupyter 5.4.0, go to the Environment Variables tab, and click Add a variable. Add the variable JUPYTER_SPARK_OPTS with the value --conf spark.ego.gpu.app=true --conf spark.ego.gpu.executors.slots.max=4 --conf spark.default.parallelism=8. In a SIG that has 2 hosts with 4 GPUs on each host, this uses 8 GPUs with 8 partitions for notebooks.
     

     
  5. Under the Resource Groups and Plans section, select the gpus resource group for Spark executors (GPU slots) and the ComputeHosts resource group for everything else. Click Create and Deploy Instance Group.
     

     
  6. After the SIG is deployed and started, to run Jupyter Notebooks, click on the SIG, go to Notebooks tab, and click Create Notebooks for Users. Select users (for example, Admin and other users, as required) and click Create.
  7. Stop and start the Jupyter 5.4.0 notebook that you created in order to get the sample notebooks (snap_ml_spark_example_notebooks) onto the home page when you log in. Start this Jupyter 5.4.0 notebook only when notebooks are to be executed, to make sure that GPUs are not allocated to the notebook unnecessarily.
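The GPU sizing used for JUPYTER_SPARK_OPTS above follows directly from the cluster shape. This shell sketch derives the same value for 2 hosts with 4 GPUs each; HOSTS and GPUS_PER_HOST are illustrative helper variables, and only the resulting string is what Conductor consumes:

```shell
# Derive the JUPYTER_SPARK_OPTS value from the cluster shape.
# HOSTS and GPUS_PER_HOST are illustrative helper variables.
HOSTS=2
GPUS_PER_HOST=4
TOTAL_GPUS=$(( HOSTS * GPUS_PER_HOST ))   # 8 GPUs across the SIG
JUPYTER_SPARK_OPTS="--conf spark.ego.gpu.app=true --conf spark.ego.gpu.executors.slots.max=${GPUS_PER_HOST} --conf spark.default.parallelism=${TOTAL_GPUS}"
echo "$JUPYTER_SPARK_OPTS"
```

Here spark.ego.gpu.executors.slots.max is the per-host GPU count, while spark.default.parallelism matches the total number of GPUs across the SIG.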

How to run snap-ml-spark applications through spark-submit in WML Accelerator 1.2.0

In the Cluster Management Console (GUI), navigate to Workload -> Spark -> My Applications And Notebooks. Click Run Application for spark-submit.

A sample spark-submit command takes the following arguments in the box in the Run Application wizard:

--master ego-client --conf spark.ego.gpu.app=true /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/examples/example-criteo45m/example-criteo45m.py --data_path /wmla-nfs/criteoData --num_partitions 8 --use_gpu

In the previous spark-submit command arguments, replace ego-client with ego-cluster to submit the Spark job in cluster mode instead of client mode.
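The full argument string for the Run Application wizard can be assembled as below; MODE and SNAP_HOME are illustrative shell variables, and the paths assume the myAnaconda deployment directory used earlier in this article:

```shell
# Assemble the spark-submit arguments for the Run Application wizard.
MODE=ego-cluster   # use ego-client for client mode instead
SNAP_HOME=/home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark
ARGS="--master ${MODE} --conf spark.ego.gpu.app=true ${SNAP_HOME}/examples/example-criteo45m/example-criteo45m.py --data_path /wmla-nfs/criteoData --num_partitions 8 --use_gpu"
echo "$ARGS"
```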
 

 
This spark-submit command uses one of the samples, in this case example-criteo45m.py, which is shipped with the PowerAI base package. The /wmla-nfs/criteoData/data/ directory contains the input criteo data. Because this data directory must be visible on the host in the cluster where the Spark application driver runs, consider using a shared file system for it. Details about running this example and its related dataset can be found in /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/examples/example-criteo45m/README.md. More examples are available in /home/egoadmin/myAnaconda/anaconda/envs/dlipy3/snap-ml-spark/examples/.

The following image shows a running application that is using 8 GPUs from 2 hosts:
 

 

How to run Jupyter Notebooks in WML Accelerator 1.2.0 using snap-ml-spark

  1. Click the Spark Instance Group (SIG) and go to the Notebooks tab.
  2. Start the notebook, if it is not in the started state.
  3. Click the My Notebooks drop-down box and click the entry similar to Jupyter 5.4.0 – owned by Admin. This opens a new window with the login to the notebooks home page.
  4. Log in as the Admin user, click the snap_ml_spark_example_notebooks folder, and select any of the sample notebooks to open and run. Read the instructions at the beginning of the notebook before running it.
  5. Click the New drop-down box on the Jupyter notebooks home page and select Spark Cluster to create a new IPython notebook in which snap-ml-spark can be imported and its APIs used.

More Details

WML Accelerator 1.2.0 Knowledge Center can be found here: https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.2.0
