Working with Snap ML in PowerAI Enterprise 1.1.2

IBM Spectrum Conductor in PowerAI Enterprise 1.1.2 provides the capability to set up a Spark cluster automatically. To execute an application that uses the snap-ml-spark APIs in the Spectrum Conductor environment of IBM PowerAI Enterprise, either:

  1. run a snap-ml-spark application through spark-submit in PowerAI Enterprise, or
  2. enable the snap-ml-spark APIs inside Jupyter notebooks in PowerAI Enterprise.

Both approaches require a set of configuration changes in IBM Spectrum Conductor.

Configuring IBM Spectrum Conductor with Spark for IBM PowerAI Snap ML

Create gpus Resource Group

  1. Log in to the IBM Spectrum Computing Cluster Management Console as an Administrator.
  2. From the cluster management console, navigate to Resources > Resource Planning (Slot) > Resource Groups.
  3. Under Global Actions, click Create a Resource Group.
  4. Create a resource group named gpus, with the Advanced formula ngpus.
  5. Navigate to Resources > Consumers and select a consumer, such as SampleApplications. Click the Consumer Properties tab and, under Specify slot-based resource groups, select the check box for the resource group that you just created (gpus). Click Apply.
  6. Navigate to Resources > Resource Planning (Slot) > Resource Plan. Select Resource Group: gpus from the drop-down box and Exclusive as the slot allocation policy. Exclusive means that when IBM Spectrum Conductor with Spark allocates resources from this resource group, it takes all free slots on a host. For example, if a host has 4 GPUs, a request for 1, 2, 3, or 4 GPUs takes the whole host. Click Apply.

Create an Anaconda environment

Create an Anaconda environment first; then, during creation of the Spark Instance Group (SIG), select Jupyter Notebook and choose this Anaconda distribution and environment name from the drop-down boxes.

  1. Navigate to Workload > Spark > Anaconda Management. Select a ppc64le Anaconda distribution name, such as Anaconda5-1-0-Python3-Linux-ppc64le, and click Deploy.
  2. Specify a name (such as myAnaconda) for the Anaconda distribution instance and a deployment directory (such as /home/egoadmin/myAnaconda). Click Deploy.
  3. Click the Anaconda distribution instance myAnaconda and, under Conda environments, click Add to open the Add Conda Environment wizard. Deselect Create environment from a yaml file, provide an environment name (such as env1), and click Add.
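
As an optional sanity check (not part of the official procedure), once a notebook has been created on the SIG later in this setup, a cell like the following can confirm that the kernel runs from the deployed environment. The expected path is an assumption based on the myAnaconda deployment directory and env1 environment names used above:

# Optional sanity check: confirm the notebook kernel uses the deployed
# conda environment. The expected path below is an assumption.
import sys
print(sys.executable)  # expect a path under /home/egoadmin/myAnaconda containing env1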

Create the Spark Instance Group (SIG)

To use snap-ml-spark, configure the Spark Instance Group (SIG) in Spectrum Conductor with the specific settings given below.

  1. Navigate to Workload > Spark > Spark Instance Groups and click New. In the Spark version field, select Spark 2.3.1 from the drop-down box.
  2. Click the Configuration link near Spark 2.3.1 to set the configuration properties SPARK_EGO_CONF_DIR_EXTRA, SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX, and spark.jars:
    1. Set SPARK_EGO_CONF_DIR_EXTRA to /opt/DL/snap-ml-spark/conductor_spark/conf.
    2. Set SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX to the number of GPUs available on each host in the cluster; for example, SPARK_EGO_GPU_EXECUTOR_SLOTS_MAX=4.
    3. Go to Additional Parameters and click Add a Parameter. Add the parameter spark.jars with the value /opt/DL/snap-ml-spark/lib/snap-ml-spark-v1.1.0-ppc64le.jar.
  3. On the Spark Instance Group creation page, set the following configuration options:
    1. Under Enable notebooks, select Jupyter 5.4.0.
    2. Provide a shared directory as the base data directory (such as /paie-nfs/data).
    3. Select the Anaconda distribution instance and the Conda environment from the drop-down boxes.
  4. Click the Configuration link for Jupyter 5.4.0, go to the Environment Variables tab, and click Add a Variable. Add the variable JUPYTER_SPARK_OPTS with the value --conf spark.ego.gpu.app=true --conf spark.ego.gpu.executors.slots.max=4 --conf spark.default.parallelism=8. For a SIG with 2 hosts and 4 GPUs on each host, these settings let notebooks use all 8 GPUs with 8 partitions. (A quick way to verify such settings from a running notebook is sketched after this list.)
  5. In the Resource Groups and Plans section, select the gpus resource group for Spark executors (GPU slots) and the ComputeHosts resource group for everything else. Click Create and Deploy Instance Group.
  6. After the SIG is deployed and started, click the SIG, go to the Notebooks tab, and click Create Notebooks for Users to enable running Jupyter notebooks. Select users (for example, Admin and other users, as required) and click Create.
  7. Stop and restart the Jupyter 5.4.0 notebook that you created so that the sample notebooks (snap_ml_spark_example_notebooks) appear on the home page when you log in. Start this notebook only when notebooks are actually going to be executed, to make sure that GPUs are not allocated to it unnecessarily.
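
Once the SIG is running, one quick way to confirm that the Snap ML jar and the GPU settings from steps 2 and 4 reached the Spark session is to read them back from a notebook. This is an optional sketch, not part of the official procedure; it assumes pyspark is importable in the notebook's Spark Cluster kernel and that a SparkSession can be created there:

# Optional sketch: read back SIG-level settings from a notebook session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.getConf().get("spark.jars"))          # Snap ML jar
print(spark.conf.get("spark.ego.gpu.app", "not set"))          # GPU app flag
print(spark.conf.get("spark.default.parallelism", "not set"))  # default partitions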

How to run snap-ml-spark applications through spark-submit in PowerAI Enterprise 1.1.2

In the Cluster Management Console (GUI), navigate to Workload > Spark > My Applications And Notebooks. Click Run Application to run an application through spark-submit.

A sample spark-submit command takes the following arguments, entered in the box in the Run Application wizard:

--master ego-client --conf spark.ego.gpu.app=true /opt/DL/snap-ml-spark/examples/example-criteo45m/example-criteo45m.py --data_path /tmp/criteoData --num_partitions 8 --use_gpu

To submit the Spark job in cluster mode instead of client mode, replace ego-client with ego-cluster in the arguments above, for example:
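
--master ego-cluster --conf spark.ego.gpu.app=true /opt/DL/snap-ml-spark/examples/example-criteo45m/example-criteo45m.py --data_path /tmp/criteoData --num_partitions 8 --use_gpu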
 
This spark-submit command runs example-criteo45m.py, one of the examples shipped with the PowerAI base package. The /tmp/criteoData/data/ directory must contain the input Criteo data; this data directory lives on the host where the selected Spark master (chosen on the Run Application page) is running. Details on how to run this example and obtain its dataset are in /opt/DL/snap-ml-spark/examples/example-criteo45m/README.md. More examples are available under /opt/DL/snap-ml-spark/examples.
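
For orientation, here is a minimal sketch of what a snap-ml-spark application of this shape can look like. The snap_ml_spark import path and the estimator's parameters are assumptions, and the real example loads its data differently; treat the shipped examples and README as the authoritative reference:

# Minimal sketch of a snap-ml-spark application (assumed API; see the
# shipped examples for authoritative usage).
import argparse
from pyspark.sql import SparkSession
from snap_ml_spark import LogisticRegression  # assumed import path

parser = argparse.ArgumentParser()
parser.add_argument("--data_path", required=True)
parser.add_argument("--num_partitions", type=int, default=8)
parser.add_argument("--use_gpu", action="store_true")
args = parser.parse_args()

spark = SparkSession.builder.getOrCreate()

# Load training data and match the partition count to the available GPUs.
train = (spark.read.format("libsvm")
         .load(args.data_path)
         .repartition(args.num_partitions))

# use_gpu routes training onto the GPU slots allocated by Conductor.
lr = LogisticRegression(max_iter=50, use_gpu=args.use_gpu)  # assumed parameters
model = lr.fit(train)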

Once submitted, the running application can be seen under Workload > Spark > My Applications And Notebooks.

How to run Jupyter Notebooks in PowerAI Enterprise 1.1.2 using snap-ml-spark

  1. Click the Spark Instance Group (SIG) and go to the Notebooks tab.
  2. Start the notebook if it is not already in the Started state.
  3. Click the My Notebooks drop-down box and click the entry similar to Jupyter 5.4.0 – owned by Admin. This opens a new window with the login page for the notebooks home page.
  4. Log in as the Admin user, click the snap_ml_spark_example_notebooks folder, and then open and run any of the sample notebooks.
  5. Click the New drop-down box on the Jupyter notebooks home page and select Spark Cluster to create a new IPython notebook in which snap-ml-spark can be imported and its API used (a minimal first cell is sketched below).
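
In that new notebook, a first cell along these lines can confirm that the Snap ML Spark bindings are visible to the kernel; the module name is inferred from the package name and is an assumption as far as this check goes:

# Quick import check (module name assumed from the snap-ml-spark package).
import snap_ml_spark
print(snap_ml_spark.__file__)  # where the bindings were loaded from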

More Details

PowerAI Enterprise 1.1.2 Knowledge Center can be found here: https://www.ibm.com/support/knowledgecenter/en/SSFHA8_1.1.2
