This article is part of the Learning path: Get started with Watson Machine Learning Accelerator series.
| Level | Topic | Type |
|---|---|---|
| 100 | An introduction to Watson Machine Learning Accelerator | Article |
| 101 | Classify images with Watson Machine Learning Accelerator | Article + notebook |
| 201 | Elastic Distributed Training in Watson Machine Learning Accelerator | Article + notebook |
| 202 | Drive higher GPU utilization and throughput | Tutorial |
| 301 | Accelerate your deep learning and machine learning | Article + notebook |
Watson Machine Learning Accelerator Elastic Distributed Training (EDT) simplifies the distribution of training workloads for the data scientist. The model distribution is transparent to the end user, who does not need to know the topology of the distribution. Usage is simple: define a maximum GPU count for each training job, and Watson Machine Learning Accelerator schedules the jobs simultaneously on the existing cluster resources. GPU allocation for multiple jobs can grow and shrink dynamically, based on fair share or priority scheduling, without interrupting running jobs.
EDT enables multiple data scientists to share GPUs in a dynamic fashion, increasing productivity while also increasing overall GPU utilization.
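To make the grow-and-shrink behavior concrete, here is a minimal sketch of fair-share allocation under a per-job maximum GPU count. It is an illustration only, not the scheduler that Watson Machine Learning Accelerator uses: the function `fair_share`, the job names, and the round-robin policy are all assumptions made for this example.

```python
from typing import Dict


def fair_share(total_gpus: int, max_gpus: Dict[str, int]) -> Dict[str, int]:
    """Split total_gpus evenly across jobs, never exceeding a job's cap.

    Illustrative policy only (assumption): hand out one GPU at a time,
    round-robin, to every job that is still below its maximum GPU count.
    """
    allocation = {job: 0 for job in max_gpus}
    remaining = total_gpus
    while remaining > 0:
        # Jobs that can still accept another GPU.
        hungry = [job for job in allocation if allocation[job] < max_gpus[job]]
        if not hungry:
            break  # every job is at its cap; the remaining GPUs stay idle
        for job in hungry:
            if remaining == 0:
                break
            allocation[job] += 1
            remaining -= 1
    return allocation


if __name__ == "__main__":
    caps = {"job_a": 8}
    print(fair_share(8, caps))   # a single job may use the whole cluster
    caps["job_b"] = 4            # a second job arrives, so job_a shrinks
    print(fair_share(8, caps))
    del caps["job_a"]            # job_a finishes, so job_b keeps up to its cap
    print(fair_share(8, caps))
```

Recomputing the allocation whenever a job arrives or finishes is what lets a running job's share change over time; in this sketch the per-job cap plays the role of the maximum GPU count described above.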
In this article, we use a Jupyter Notebook to walk through the process of taking a PyTorch model from the community and making the required code changes to distribute the training using Elastic Distributed Training. The article and notebook cover:
- Training the PyTorch Model with Elastic Distributed Training
- Monitoring the running job status and showing how to debug any issues
The detailed steps for this article can be found in the associated Jupyter Notebook. Within this notebook, you’ll:
- Make changes to your code
- Make your data set available
- Set up an API endpoint and log on
- Submit a job through an API
- Monitor a running job
- Retrieve output and saved models
- Debug any issues
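The notebook contains the actual API calls. As a rough orientation, the submit and monitor steps might look like the following sketch, which uses only the Python standard library. Everything specific here is an assumption for illustration: the base URL, the `/execs` path, the `args` query parameter, the argument string, and the job state names are hypothetical placeholders, so take the real values from the notebook and your cluster.

```python
import base64
import time
import urllib.parse
import urllib.request
from typing import Callable

# Hypothetical endpoint (assumption): replace with your cluster's REST URL.
BASE_URL = "https://wmla.example.com/deeplearning/v1"

# Hypothetical state names (assumption): check what your cluster reports.
TERMINAL_STATES = {"FINISHED", "ERROR", "KILLED"}


def build_submit_request(user: str, password: str, job_args: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated POST that submits a job."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    query = urllib.parse.urlencode({"args": job_args})
    return urllib.request.Request(
        url=f"{BASE_URL}/execs?{query}",
        method="POST",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )


def wait_for_job(
    fetch_state: Callable[[], str],
    poll_seconds: float = 5.0,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_state() repeatedly until it reports a terminal state."""
    for _ in range(max_polls):
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")


if __name__ == "__main__":
    # Placeholder credentials and arguments; nothing is sent over the network.
    req = build_submit_request("alice", "secret", "--model-main train.py")
    print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. Keeping `fetch_state` as a plain callable means the polling loop can be exercised without a cluster, and the same loop works whichever call you use to read the job status.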
Changes to your code
Note that the following images show a comparison between the before and EDT-enabled versions of the code using the
Import the additional libraries required for Elastic Distributed Training and set up the environment variables. Note that additional EDT helper scripts, including elog.py, are required. These must be copied to the same directory as your modified code. Sample versions can be found in the tarball in the tutorial repo. They can also be downloaded from http://ibm.biz/WMLA-samples.
Replace the data loading functions with EDT-compatible functions that return a tuple containing two items of type
Replace the training and testing loops with the EDT-equivalent function. This requires the creation of a main function. You could also specify parameters in the API call and pass them into the model.
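As a minimal sketch of passing parameters into the model through a main function, the following uses the standard `argparse` and `os` modules. The flag names (`--epochs`, `--batch-size`, `--lr`), their defaults, and the environment variable names `DATA_DIR` and `RESULT_DIR` are assumptions chosen for illustration; the notebook shows the names that the sample code actually uses.

```python
import argparse
import os
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse training parameters; flag names are hypothetical examples."""
    parser = argparse.ArgumentParser(description="EDT-enabled training script (sketch)")
    parser.add_argument("--epochs", type=int, default=5)
    parser.add_argument("--batch-size", type=int, default=64)
    parser.add_argument("--lr", type=float, default=0.01)
    # parse_known_args ignores extra arguments instead of exiting with an error,
    # which helps if the submitting side appends flags this script does not know.
    args, _unknown = parser.parse_known_args(argv)
    return args


def main(argv: Optional[List[str]] = None) -> dict:
    args = parse_args(argv)
    config = {
        "epochs": args.epochs,
        "batch_size": args.batch_size,
        "lr": args.lr,
        # Assumed variable names; fall back to local folders when unset.
        "data_dir": os.environ.get("DATA_DIR", "./data"),
        "result_dir": os.environ.get("RESULT_DIR", "./result"),
    }
    # The EDT-equivalent training call from the helper scripts would go here,
    # receiving the values collected in config.
    return config


if __name__ == "__main__":
    print(main())
```

Returning the collected values from `main` keeps the parameter handling separate from the training call, so it can be checked locally before the script is submitted as a job.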
Instantiate the Elastic Distributed Training instance and launch the EDT job with specific parameters, including:
- Maximum number of GPUs per EDT training job
- Checkpoint creation frequency in number of epochs
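To illustrate how these launch parameters fit together, here is a small sketch that groups them and shows what a checkpoint frequency expressed in epochs means. The `LaunchParams` class, its field names, and the `epochs` field are hypothetical and exist only for this example; they are not part of the EDT helper scripts. The sketch also assumes that a checkpoint is written after every N-th completed epoch, which is one plausible reading of "frequency in number of epochs".

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LaunchParams:
    """Hypothetical container for EDT launch parameters (illustration only)."""

    epochs: int            # total number of training epochs (assumed field)
    max_gpus: int          # upper bound on GPUs the job may be given
    checkpoint_freq: int   # write a checkpoint every N epochs

    def __post_init__(self) -> None:
        # Reject values that cannot describe a valid training run.
        for name in ("epochs", "max_gpus", "checkpoint_freq"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")

    def checkpoint_epochs(self) -> List[int]:
        """Epochs (1-based) after which a checkpoint would be written."""
        return [e for e in range(1, self.epochs + 1) if e % self.checkpoint_freq == 0]


if __name__ == "__main__":
    params = LaunchParams(epochs=10, max_gpus=4, checkpoint_freq=3)
    print(params.checkpoint_epochs())
```

A lower checkpoint frequency value saves state more often at the cost of extra I/O; the maximum GPU count only bounds what the scheduler may assign and does not guarantee that many GPUs at any given moment.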
This article provided an overview of the Elastic Distributed Training feature of Watson Machine Learning Accelerator. The article is the final part of the Learning path: Get started with Watson Machine Learning Accelerator series.