Deep learning training: Accelerate your learning with Watson Studio and Watson Machine Learning Accelerator

Model training and hyperparameter search are iterative processes that can take days, weeks, or even months. Data scientists can spend a significant amount of time training models to achieve the desired accuracy.

Together, IBM® Watson™ Studio Local 2.0.2 and IBM Watson Machine Learning Accelerator 1.2.0 form an enterprise AI platform that accelerates the model training process, combining speed and accuracy to drive value and shorten a model’s time to market. Model training is GPU-accelerated and can scale up automatically, allocating more GPUs where they are available. With this enterprise AI platform, a data scientist can get results faster and reach the required accuracy level.

The technologies work in concert, as follows:

  • Watson Studio 2.0.2 supports the collaborative development of models.
  • Watson Machine Learning Accelerator helps data scientists optimize the speed of training by automating hyperparameter searches in parallel.
  • The elastic distributed training capability in Watson Machine Learning Accelerator helps distribute model training across multiple GPUs and compute nodes. The distribution of training jobs is elastic: GPUs are dynamically allocated and can be added to or removed from a running job without killing it. Because the scheduler allocates the GPUs dynamically, you don’t need to code GPU topology into the model. Instead, elastic distributed training handles the distribution for models that are built to run on a stand-alone system and makes the distribution transparent to the user. A minimal sketch of such a stand-alone training loop follows this list.
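
The following is a minimal sketch, assuming nothing about the tutorial's actual training definition, of the kind of stand-alone, single-device PyTorch loop that elastic distributed training can scale out. Note that the model code contains no multi-GPU or topology logic; with elastic distributed training, the scheduler supplies the distribution.

```python
# Illustrative stand-alone training loop (not the tutorial's training definition).
# There is deliberately no multi-GPU code here; elastic distributed training
# distributes this style of single-device program for you.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Random tensors standing in for MNIST batches (28x28 grayscale images, 10 classes)
images = torch.randn(256, 1, 28, 28)
labels = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(images, labels), batch_size=32, shuffle=True)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(2):
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```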

Learning objectives

In this tutorial, you:

  1. Install and configure Watson Studio 2.0.2, Watson Machine Learning 2.0.2, and Watson Machine Learning Accelerator 1.2.0
  2. Submit a single-node deep learning model for training in Watson Studio 2.0.2
  3. Offload this training to Watson Machine Learning Accelerator
  4. Run the distributed training with elastic distributed training
  5. Deploy your deep learning model

Installing and configuring

To install and configure IBM Watson Studio 2.0.2 and IBM Watson Machine Learning Accelerator, follow the steps in this run book.

Creating the elastic distributed training PyTorch deep learning experiment

In this section, you create a deep learning experiment that analyzes handwriting. To create and run the experiment, you must have access to the following:

  • A data set to use for training and testing the model. This tutorial uses an MNIST data set for analyzing handwriting samples.
  • A training definition that contains model building code and metadata about how to run the experiment. For information on coding a training definition file, see Coding guidelines for deep learning programs.
  • A training execution command. The execution command must reference the Python code, pass the names of the training files, and can optionally specify metrics. A hedged sketch of this contract appears below.

The tutorial includes these components and instructions for downloading them and adding them to your experiment.
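
To make that contract concrete, here is a hypothetical training entry point; the script name, argument names, and metric line are illustrative assumptions, not the interface of the pytorch_onnx.zip training definition used later in this tutorial.

```python
# pytorch_mnist.py (hypothetical file name) -- illustrative only; the real training
# definition for this tutorial ships as pytorch_onnx.zip with its own interface.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Illustrative MNIST training entry point")
    # Training file names passed on the execution command line; the defaults are the
    # standard MNIST archive names, used here purely as an example.
    parser.add_argument("--trainImagesFile", default="train-images-idx3-ubyte.gz")
    parser.add_argument("--trainLabelsFile", default="train-labels-idx1-ubyte.gz")
    parser.add_argument("--testImagesFile", default="t10k-images-idx3-ubyte.gz")
    parser.add_argument("--testLabelsFile", default="t10k-labels-idx1-ubyte.gz")
    parser.add_argument("--epochs", type=int, default=5)
    args = parser.parse_args()

    # ... load the data, build the model, and train it here ...
    print(f"would train for {args.epochs} epochs on {args.trainImagesFile}")

    # Optionally print a metric so the experiment dashboard has something to chart.
    print("accuracy: 0.0")  # placeholder value

if __name__ == "__main__":
    main()
```

A matching execution command would then look something like `python3 pytorch_mnist.py --trainImagesFile train-images-idx3-ubyte.gz --trainLabelsFile train-labels-idx1-ubyte.gz --epochs 5`; again, this is an assumed command, not the one required by pytorch_onnx.zip.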

Create a project for the experiment

To begin the tutorial, log in to Watson Studio 2.0.2 and create a project for the experiment.

  1. Log in to IBM Watson Studio 2.0.2. Note: The user credentials must match the Watson Machine Learning user account created for Watson Machine Learning Accelerator.

  2. Create a new standard project.

  3. Enter a project name and description.

  4. Click Add to Project in the Action Bar, and select Experiment.

  5. Because your project is not associated with a Watson Machine Learning deployment space, you are prompted to associate it now. A deployment space is where you create and manage Watson Machine Learning deployments.

  6. Enter the details of the new deployment space, and click Associate.

Defining and training the experiment

After associating the deployment space, you are returned to the New Experiment page where you can define and run the experiment.

  1. In the New Experiment page, enter a name and description for your experiment and upload the sample data set as follows:

    1. Download the MNIST data set from https://github.com/IBM/wmla-assets/tree/master/Accelerate-Deep-Learning-Training-with-Watson-Studio-Watson-ML-and-Watson-ML-Accelerator.
    2. Upload the MNIST data set to the Watson Machine Learning Accelerator data set NFS mount point. For example, /gpfs/dli_data_fs/pytorch-mnist.
    3. In the field Source files folder, enter the path to your source files relative to the Watson Machine Learning Accelerator data set NFS mount point. For example, /gpfs/dli_data_fs/pytorch-mnist.
  2. Click Add Training Definition to create a new training definition to run in your experiment.

  3. Ensure that the New training definition tab is active. Enter a name and, optionally, a description for your training definition.

  4. Click the browse button and select the training definition source file, pytorch_onnx.zip. You can download it from https://github.com/IBM/wmla-assets/tree/master/Accelerate-Deep-Learning-Training-with-Watson-Studio-Watson-ML-and-Watson-ML-Accelerator. A rough sketch of the kind of ONNX export such a script might perform appears after these steps.

  5. Choose the framework for your training definition.

  6. Enter the execution command for your training definition. The execution command must reference the Python code, pass the names of the training files, and can optionally specify metrics.

  7. In the Compute Configuration drop-down menu, choose the compute configuration that will be used to run your training definition. If you want to use Elastic Distributed Training along with PyTorch, be sure to select the single GPU option here. Otherwise, you can choose any number of GPUs.

  8. If you selected the PyTorch framework and the single GPU compute configuration, you have the option of choosing the Distributed training type. To use distributed training, select Elastic distributed training from the drop-down menu. Otherwise, keep the selection at None. If you select Elastic distributed training, you can then choose the number of nodes to use for the distributed training from the Number of nodes drop-down menu. Nodes are GPUs, so if you select 8 nodes for distributed training, the training is distributed across 8 GPUs.

  9. Click Create to create your training definition and to return to the New Experiment page.

  10. Click Create and run to create the experiment and start the training process.

  11. You are sent to the experiment details page where you can monitor the training process.

  12. During the experiment execution, you can compare training runs and view experiment details.

  13. Click a training run to monitor its progress and review details specific to the run.

  14. After a training run has completed successfully, you can select Save from the action menu to create a new model in your Watson Machine Learning deployment space.

  15. Enter the model name and, optionally, a description.

  16. Click the link in the successful model save notification to view model details in the Watson Studio interface.
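
As noted in step 4, the name of the training definition archive (pytorch_onnx.zip) suggests that the training script exports the trained PyTorch model to ONNX. The following is a rough, hypothetical illustration of such an export step, not the actual contents of the archive.

```python
# Hypothetical ONNX export step; the real pytorch_onnx.zip script defines its own
# model architecture and export logic.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder MNIST classifier
model.eval()

dummy_input = torch.randn(1, 1, 28, 28)  # one MNIST-shaped example for tracing
torch.onnx.export(model, dummy_input, "mnist.onnx")  # writes the ONNX graph to disk
```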

Deploying your deep learning model

In this section, you deploy your model to make it available for use.

  1. Click Open in deployment space to view the model in the Watson Machine Learning deployment space, where it can be deployed.

  2. Click the Deployments tab, then click Add Deployment to create a new web service deployment for your model. With a web service deployment, you pass a payload file to the model and get results back immediately; a rough scoring sketch appears after these steps.

  3. Enter the deployment name and, optionally, a description, then click Save.

  4. After you create the deployment, view the deployment details by clicking the deployment link from the list of deployments.

Your model is now ready to be scored.
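
As a rough illustration of scoring the web service from Python: the endpoint URL, access token, and payload layout below are placeholder assumptions for this sketch; take the real scoring endpoint and expected input format from the deployment details page.

```python
# Hypothetical scoring client; the URL, token, and payload schema are placeholders,
# not the documented Watson Machine Learning interface.
import requests

scoring_url = "https://<your-cluster>/<scoring-endpoint-from-deployment-details>"
headers = {
    "Authorization": "Bearer <access-token>",  # Watson Machine Learning access token
    "Content-Type": "application/json",
}

# One flattened 28x28 grayscale image worth of pixel values (zeros as a stand-in)
payload = {"values": [[0.0] * 784]}

response = requests.post(scoring_url, json=payload, headers=headers)
print(response.json())  # prediction returned by the deployed model
```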

Conclusion

Watson Studio, Watson Machine Learning, and Watson Machine Learning Accelerator together form a strong foundation for an enterprise AI platform. This foundation can help data scientists get results faster and improve the accuracy of their models. In this tutorial, we demonstrated how data scientists can develop models collaboratively with Watson Studio and then accelerate model training with Watson Machine Learning Accelerator elastic distributed training. For more details, refer to the Watson Studio 2.0.2 and Watson Machine Learning Accelerator 1.2.0 documentation.

Kelvin Lui
Chris J Jones
Calin Cocan
Prabhu S Padashetty
Anil Kumar Tallapragada
James Van Oosten
Ashley Zhao
Julianne Forgo
Helena Krolak