Elastic Distributed Training in Watson Machine Learning Accelerator

This article is part of the Learning path: Get started with Watson Machine Learning Accelerator series.

Level | Topic | Type
100 | An introduction to Watson Machine Learning Accelerator | Article
101 | Classify images with Watson Machine Learning Accelerator | Article + notebook
201 | Elastic Distributed Training in Watson Machine Learning Accelerator | Article + notebook
202 | Drive higher GPU utilization and throughput | Tutorial
301 | Accelerate your deep learning and machine learning | Article + notebook

Introduction

Watson Machine Learning Accelerator Elastic Distributed Training (EDT) simplifies the distribution of training workloads for the data scientist. The distribution of the model is transparent to the end user, who does not need to know the topology of the distribution. Usage is simple: define a maximum GPU count for training jobs, and Watson Machine Learning Accelerator schedules the jobs simultaneously across the existing cluster resources. GPU allocation for multiple jobs can grow and shrink dynamically, based on fair share or priority scheduling, without interrupting running jobs.

EDT enables multiple data scientists to share GPUs dynamically, increasing both productivity and overall GPU utilization.

Description

In this article, we use a Jupyter Notebook to walk through taking a PyTorch model from the community and making the code changes required to distribute its training with Elastic Distributed Training. The article and notebook cover:

  • Training the PyTorch model with Elastic Distributed Training
  • Monitoring the running job status and debugging any issues

Instructions

The detailed steps for this article can be found in the associated Jupyter Notebook. Within this notebook, you’ll:

  • Make changes to your code
  • Make your data set available
  • Set up an API endpoint and log on
  • Submit a job through an API (a rough sketch of these two API steps follows this list)
  • Monitor a running job
  • Retrieve output and saved models
  • Debug any issues
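
The API interaction in the notebook follows a common REST pattern: authenticate against the cluster, submit the packaged model with its training arguments, and then poll the job status. The snippet below is only a rough sketch of that flow; the host, port, REST paths, payload fields, and response shape shown here are illustrative assumptions, so take the real values from the notebook and your cluster's documentation.

```python
import requests

# All values below are illustrative placeholders; use the real host, port,
# REST paths, and payload fields from the associated notebook for your cluster.
WMLA_HOST = "https://wmla-host.example.com:9243"              # hypothetical host:port
DL_REST_BASE = WMLA_HOST + "/platform/rest/deeplearning/v1"  # assumed base path

session = requests.Session()
session.auth = ("wmla-user", "wmla-password")   # or token-based auth, depending on the cluster
session.verify = False                          # only if the cluster uses a self-signed certificate

# Verify that the endpoint is reachable and the credentials work (assumed path).
resp = session.get(DL_REST_BASE + "/conf")
resp.raise_for_status()

# Submit a training job: upload the packaged model and pass the training arguments.
# The archive name and field names below are placeholders, not a definitive schema.
with open("pytorch-edt-model.tar.gz", "rb") as model_archive:
    resp = session.post(
        DL_REST_BASE + "/execs",
        files={"file": model_archive},
        data={"args": "--exec-start PyTorch --model-main main.py"},
    )
resp.raise_for_status()
job = resp.json()
print("Submitted job:", job)

# Poll the running job (assumed path and response shape).
status = session.get(DL_REST_BASE + "/execs/" + str(job.get("id"))).json()
print("Current state:", status.get("state"))
```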

Changes to your code

Note that the following images compare the original and EDT-enabled versions of the code using the diff command. A consolidated sketch of the resulting script structure follows the steps below.

  1. Import the additional libraries required for Elastic Distributed Training and set up the environment variables. Note that the additional EDT helper scripts edtcallback.py, emetrics.py, and elog.py are required; these must be copied to the same directory as your modified code. Sample versions can be found in the tarball in the tutorial repository, and they can also be downloaded from http://ibm.biz/WMLA-samples.

    Adding additional libraries

  2. Replace the data loading functions with EDT-compatible functions that return a tuple containing two items of type torch.utils.data.Dataset.

    Replace data loading functions

  3. Replace the training and testing loops with the EDT-equivalent function. This requires creating a main function. You can also optionally specify parameters in the API call and pass them into the model.

    Replace training and testing

  4. Instantiate the Elastic Distributed Training object and launch the EDT job with the following parameters:

    • epoch
    • effective_batch_size
    • maximum number of GPUs per EDT training job
    • checkpoint creation frequency (in epochs)

    Instantiate the EDT instance
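
Taken together, the four changes give the modified script roughly the shape shown below. This is a minimal sketch rather than the notebook's exact code: the fabric_model.FabricModel import, its constructor, and the train() argument order are assumptions based on the sample tarball, and the dataset, network, and hyperparameter values are placeholders.

```python
# Minimal sketch of an EDT-enabled PyTorch script. The EDT-specific names
# (fabric_model.FabricModel and the train() argument order) are assumptions
# based on the sample code; check the notebook and tarball for the real API.
import os

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms

# Step 1: EDT helper scripts (edtcallback.py, emetrics.py, elog.py) live next to
# this file, and the EDT wrapper is imported (module and class name assumed).
from fabric_model import FabricModel

# Environment variables can carry site-specific settings such as the data path.
DATA_DIR = os.environ.get("DATA_DIR", "/tmp/data")

# Step 2: the data loading function returns a tuple of two
# torch.utils.data.Dataset objects (training set, test set), not DataLoaders.
def get_datasets():
    transform = transforms.Compose([transforms.ToTensor()])
    train_ds = datasets.MNIST(DATA_DIR, train=True, download=True, transform=transform)
    test_ds = datasets.MNIST(DATA_DIR, train=False, download=True, transform=transform)
    return train_ds, test_ds

# Placeholder network standing in for the community model.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.flatten(x, 1)
        return self.fc2(F.relu(self.fc1(x)))

# Step 3: the hand-written training and testing loops are replaced by a main()
# function that hands the model to EDT. Hyperparameters could also be passed in
# as arguments on the job-submission API call.
def main():
    model = Net()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Step 4: instantiate the EDT object and launch the job with its parameters:
    # epochs, effective batch size, maximum GPUs per job, and checkpoint
    # frequency in epochs (argument names and order are assumed).
    edt_model = FabricModel(model, get_datasets, F.cross_entropy, optimizer)
    edt_model.train(10, 64, 4, 5)

if __name__ == "__main__":
    main()
```

With changes of this shape in place, the same script can be submitted through the API described earlier, and EDT decides at run time how many of the allowed GPUs the job actually receives.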

Conclusion

This article provided an overview of the Elastic Distributed Training feature of Watson Machine Learning Accelerator. It is part of the Learning path: Get started with Watson Machine Learning Accelerator series; to continue, see the next article, Drive higher GPU utilization and throughput.