by Seetharami Seelam, Geert Janssen, and Luis Lastras


Sequence-to-sequence models are used extensively in tasks such as machine translation, speech recognition, caption generation, and text summarization. Training these models to reasonable accuracy can take on the order of weeks even on a GPU, so it can take weeks to months to explore hyperparameters such as batch size, dropout rate, or the number of layers. IBM DDL (Distributed Deep Learning library) can reduce these long training times by applying multiple GPUs (from 2 to hundreds) to each training job and by optimizing the communication between the GPUs. These capabilities allow one to explore the hyperparameter space more quickly and easily.

In this article, we discuss how to use the IBM DDL technology to reduce the training times of sequence-to-sequence models with multi-GPU training. IBM DDL provides a compact set of APIs and simple steps for modifying training code written for a single GPU so that it runs on multiple GPUs. We demonstrate this DDL concept with the TensorFlow implementation of sequence-to-sequence models from the TensorFlow NMT tutorials.

NMT Sequence-to-Sequence models

The Neural Machine Translation (NMT) example implements a state-of-the-art model for translating sentences between languages. It uses an encoder-decoder architecture consisting of LSTMs. We have observed evidence of efforts to implement model parallelism by mapping different LSTMs to different GPUs, but our experiments show that model parallelism does not scale well for these workloads. Since the Python script does not offer any means of data parallelism, we discuss how to enable data parallelism with IBM DDL.

IBM Distributed Deep Learning (DDL) library

IBM Distributed Deep Learning (DDL) is a communication library that provides a set of collective functions much like MPI. These functions are exposed as TensorFlow operators. While constructing a TensorFlow model graph, DDL operators are injected to facilitate certain synchronization and communication actions. Usually the DDL operator insertion action is done automatically. A user Python script merely imports the wrapper module which overrides certain TensorFlow graph construction functions. In other situations, the integration effort might be a bit more involved and require modifications to comply with the recommendations described below.

The objective of using DDL is to enable distributed learning by spreading the computational work over many processors, typically GPUs. The type of parallelism employed by DDL is data parallelism, i.e., the data is partitioned and distributed to several identical workers. DDL is based on a very simple symmetric distribution model: all worker processes run the same Python script and there will be one worker per GPU. The worker processes are started and governed by MPI. This makes it easy to deploy a script across multiple host machines.

Whether done automatically, or under precise user guidance, the steps necessary to use DDL in a TensorFlow Python training script are as follows:

  1. Since there is one worker per GPU, the model parameters, a.k.a. the weights and biases (i.e., all so-called trainable TensorFlow variables) will reside in GPU memory which must be kept synchronized across the workers. This synchronization entails two actions: initialization and gradient updates. All trainable parameters must have identical initial values at the onset of computation. If these values are mere constants this is achieved by default. Often the initial values are randomly chosen from some probability distribution. In the random values case one could first initialize a single worker and then broadcast its values to all other workers. Typically, the root, or rank 0 worker, is elected as master and all workers perform a broadcast operation that also functions as a barrier.
  2. A gradient update to each worker's model parameters must be synchronized across all workers. Each worker computes its own (different) gradients based on the batch of input samples that it processes during one iteration. The gradients must then be combined (typically just summed or averaged) and the result must be distributed to all workers. A standard (MPI) operation to achieve this result is called all-reduce. DDL provides a highly tuned variant of all-reduce (tuned specifically for distributed deep learning) that makes the best possible use of the available communication channels of varying bandwidth, speed, and latency.
  3. Within this symmetric distribution scheme, workers need to be provided with a unique chunk of the overall dataset. Statistically it would be inefficient for workers to operate on the same input data at the same time as nothing is gained by doing the same learning across multiple workers. Therefore one must partition the dataset and hand out a different chunk to each worker. Quite often this is achieved by keeping data in a large buffer and assigning each worker a particular offset location to start reading in this buffer. Alternatively, the partitioning could be done upfront and each worker simply reads from its own dedicated file.
  4. The use of multiple workers changes the calculation and interpretation of some of the hyperparameters, especially the learning rate schedule. With N workers each processing a batch of B samples per iteration, the total number of samples processed per iteration step increases by a factor of N. The learning rate schedule may need to be adjusted to accommodate this increase in samples per iteration. A schedule that is unaware of the parallel execution would act too slowly.
  5. Quite often, a training script incorporates progress evaluations at regular intervals. This typically entails the recording of checkpoints. Also, there will be log messages to inform the user of specific events. In debugged code it makes little sense to replicate these actions in all workers; it suffices that only a single dedicated worker takes care of them.

Summarizing, to be successful at distributed deep learning, one must pay attention to:

  1. Uniform weight (and bias) initialization,
  2. Synchronized shared gradient updates,
  3. Partitioned data sets,
  4. Hyperparameter control adjustment, and
  5. Single-worker checkpointing, evaluation, and logging.
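
The first three points can be illustrated without any GPU. The following is a minimal sketch in plain Python that simulates the workers as entries of a list. It does not use the DDL API; the helper names (`broadcast`, `all_reduce_mean`, `local_gradient`, `train`) and the toy linear model are assumptions made purely for illustration.

```python
import random


def broadcast(values_per_worker, root=0):
    """Copy the root worker's parameters to every worker (uniform initialization)."""
    root_values = list(values_per_worker[root])
    return [list(root_values) for _ in values_per_worker]


def all_reduce_mean(grads_per_worker):
    """Average the gradients element-wise and hand the result to every worker."""
    n = len(grads_per_worker)
    mean = [sum(column) / n for column in zip(*grads_per_worker)]
    return [list(mean) for _ in grads_per_worker]


def local_gradient(weights, batch):
    """Gradient of the mean squared error for the toy model y = w0 * x + w1."""
    g0 = g1 = 0.0
    for x, y in batch:
        err = weights[0] * x + weights[1] - y
        g0 += 2 * err * x
        g1 += 2 * err
    return [g0 / len(batch), g1 / len(batch)]


def train(num_workers=4, steps=500, lr=0.05, seed=0):
    rng = random.Random(seed)
    # Toy dataset generated from y = 3x + 1.
    data = [(k / 10.0, 3.0 * (k / 10.0) + 1.0) for k in range(40)]
    # Every worker starts with its own random weights ...
    weights = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(num_workers)]
    # ... so the rank 0 values are broadcast to make all replicas identical.
    weights = broadcast(weights, root=0)
    # Partitioned data: worker r reads every num_workers-th sample from offset r.
    shards = [data[r::num_workers] for r in range(num_workers)]
    for _ in range(steps):
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        grads = all_reduce_mean(grads)  # synchronized gradient update
        weights = [[wi - lr * gi for wi, gi in zip(w, g)]
                   for w, g in zip(weights, grads)]
    return weights


final = train()
print(final[0])
```

Because every worker starts from the same values and applies the same averaged gradient, the replicas never drift apart, which is the property the broadcast and all-reduce steps are meant to guarantee.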

Enabling distribution in NMT Sequence-to-Sequence models

Typically most scripts create a single training graph and DDL can be integrated into that training graph following the approach described below. However, in NMT, the script creates 3 graphs: the train, inference, and eval graph. The latter 2 will have a slightly different top layer structure than the train graph, and will operate by loading the variables from checkpoints produced by the train graph. The implications for running multiple parallel scripts are:

  1. The train graph needs variable initialization; the inference and eval graphs need no variable initialization.
  2. In principle only one worker needs to create the inference and eval graphs.
  3. One worker will have to do the checkpointing.

The use of multiple graphs also means that DDL must treat the graphs differently: it will inject broadcast initializers for the train graph, but must not do so for the inference and eval graphs. DDL provides new API functions to enable and disable the insertion of broadcast nodes to solve this problem. Note that for simplicity the script still constructs the 3 graphs for each worker but makes sure that only the root or master worker deploys the eval and inference graphs; the other workers simply ignore those graphs. See here for an example:

Apart from importing the DDL wrapper module in the top-level script (see here), most other changes are insertions of if-statements that restrict certain code blocks to the rank 0 worker. More specifically:

  • The call that creates the train model is passed the total number of workers
    and the jobid as arguments, so that the data is partitioned correctly when
    the data input iterator is constructed.
  • If-statements are inserted at several places to enforce single-worker logging,
    checkpointing, and progress evaluation as described above.
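
The two kinds of changes listed above can be sketched in plain Python. This is not the NMT or DDL code: `shard`, `is_root`, and `run_worker` are hypothetical helpers, and the round-robin rule is an assumption chosen because it matches how `tf.data.Dataset.shard` selects elements. Only the argument names `num_workers` and `jobid` follow the text.

```python
def shard(samples, num_workers, jobid):
    """Keep every num_workers-th sample, starting at index jobid."""
    if not 0 <= jobid < num_workers:
        raise ValueError("jobid must be in [0, num_workers)")
    return samples[jobid::num_workers]


def is_root(jobid):
    """Only the rank 0 worker logs, checkpoints, and evaluates."""
    return jobid == 0


def run_worker(samples, num_workers, jobid, log, checkpoint_every=2):
    my_samples = shard(samples, num_workers, jobid)
    for step, _sample in enumerate(my_samples, start=1):
        # ... a training step on _sample would go here ...
        if is_root(jobid) and step % checkpoint_every == 0:
            log.append(f"step {step}: checkpoint written by worker {jobid}")
    return my_samples


sentences = [f"sentence-{i}" for i in range(8)]
log = []
shards = [run_worker(sentences, 4, jobid, log) for jobid in range(4)]
print(shards[1])  # the samples seen by worker 1
print(log)        # only worker 0 has written anything
```

Every sample lands in exactly one shard, and the log contains entries from worker 0 only, even though all four workers ran the same function.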

The final version of DDL enabled NMT code is available here:

Experimental Results & Discussion

The distributed NMT code described above was tested on a cluster of AC922 machines. Each machine has two POWER9 processors and 4 V100 GPUs, and the cluster had 4 machines for a total of 16 GPUs. The machines are connected by a 100 Gbps Ethernet network. Additional details about the system and the benchmark configuration are provided below:

System Model: IBM POWER9 AC922
GPUs: 4x V100
OS: Red Hat Enterprise Linux Server release 7.5 (Maipo)
Linux kernel: 4.14.0-49.el7a.ppc64le
CUDA/cuDNN: 9.2 / 7.1.4
MPI Version:
Dataset: WMT16 German-English data from here:

Run command:
mpirun -x LD_LIBRARY_PATH -x PATH -x DDL_OPTIONS="-mode p -dbg_level 2" -n -host \
python -m nmt.nmt \
--src=en --tgt=en \
--vocab_prefix=$DATA_DIR/vocab.bpe.32000 \
--train_prefix=$DATA_DIR/train.tok.clean.bpe.32000.shuf \
--dev_prefix=$DATA_DIR/newstest2013.tok.bpe.32000 \
--test_prefix=$DATA_DIR/newstest2015.tok.bpe.32000 \
--out_dir=$DATA_DIR/output2 \
--batch_size=64 \
--num_train_steps=10000 \
--steps_per_stats=100 \
--num_layers=4 \
--num_units=1024 \
--dropout=0.2 \
--metrics=bleu \
--beam_width=15 \
--encoder_type=bi \
--decay_scheme=luong10 \
--subword_option=bpe \
--num_gpus=1
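
To relate the flags in this command to the earlier point about hyperparameters, the short sketch below works out how many samples are processed per step and per run for each GPU count used in the experiments. It assumes that `--batch_size` is the per-worker batch size, in line with the "N workers each processing a batch of B samples" description; the function names are illustrative only.

```python
BATCH_SIZE = 64       # --batch_size from the run command (assumed per worker)
TRAIN_STEPS = 10000   # --num_train_steps from the run command


def global_batch(per_worker_batch, num_workers):
    """Samples processed per step by all workers together."""
    return per_worker_batch * num_workers


def samples_seen(per_worker_batch, num_workers, steps):
    """Samples processed over a whole run of `steps` steps."""
    return global_batch(per_worker_batch, num_workers) * steps


for n in (1, 2, 4, 8, 12, 16):
    print(n, global_batch(BATCH_SIZE, n), samples_seen(BATCH_SIZE, n, TRAIN_STEPS))
```

Under this assumption, the same 10000 steps cover 16 times as many samples on 16 GPUs as on one, which is why a step-based learning rate schedule may need to be revisited.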

Training details:

The training data is the same WMT German-English data used on the TensorFlow NMT website, with the following change: the training file (train.tok.clean.bpe.32000.shuf) is a random shuffling of the original BPE training data (train.tok.clean.bpe.32000). The original ordering of the data set groups similar sequences together, which causes an odd periodic training behavior as the epoch progresses; the shuffling removes that problem.


The graph below shows the words per second (wps) as a function of the number of GPUs employed for running the NMT application enhanced with the distribution capability on the AC922 systems. As noted above, each of these systems has 4 GPUs, so the data points for 1, 2, and 4 GPUs are on a single system. A single AC922 node (4 GPUs total) achieves a throughput of 51K wps. With 16 GPUs across 4 nodes, the AC922 systems achieve 160K wps, an effective scaling of about 8x compared to single-GPU performance.

To understand the performance scaling better, the table below shows the step-time. The step time for 1 GPU is determined by computation time and the rate of data exchange between the GPU and the CPU. Step times for 2 or more GPUs involve, in addition to computation and data exchange between the GPU and the CPU, the communication time to transfer gradients between the GPUs.

Number of GPUs        1     2     4     8     12    16
Time per step (ms)    360   450   560   680   690   700
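
The step times in the table are consistent with the roughly 8x scaling quoted for the throughput. The sketch below recomputes the speedup from the table, under the assumption that every worker processes the same amount of data per step, so that N GPUs complete N batches in one step; the function names are illustrative.

```python
# Time per step in milliseconds, keyed by number of GPUs (from the table above).
STEP_MS = {1: 360, 2: 450, 4: 560, 8: 680, 12: 690, 16: 700}


def speedup(num_gpus, step_ms=STEP_MS):
    """Throughput relative to 1 GPU: N batches per step, but each step is slower."""
    return num_gpus * step_ms[1] / step_ms[num_gpus]


def efficiency(num_gpus, step_ms=STEP_MS):
    """Fraction of ideal (linear) scaling that is actually achieved."""
    return speedup(num_gpus, step_ms) / num_gpus


for n in STEP_MS:
    print(f"{n:>2} GPUs: speedup {speedup(n):.2f}x, efficiency {efficiency(n):.0%}")
```

For 16 GPUs this gives 16 x 360 / 700, or about 8.2x, which agrees with the throughput figures reported for the graph.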


The gradients for this particular NMT model are about 717 MB in size, and they are exchanged among the GPUs once per step. In the AC922 system, pairs of GPUs are connected at 150 GB/s using NVLink 2 technology, and the communication time for 2 GPUs is about 90 ms (450 ms - 360 ms). When the communication occurs between systems (as in the case of 8, 12, and 16 GPUs), the data flows over the 100 Gbps network, yet the communication time grows only sub-linearly with the number of GPUs. From these data, we conclude that DDL-enabled NMT on AC922, with NVLink connecting the GPUs within a node and the 100 Gbps network connecting the nodes, substantially improves the scalability of this application and reduces the training time by a factor of about 8 on 16 GPUs. DDL therefore allows one to use additional GPUs to cut the execution time of NMT from a week to a day, and enables faster exploration of the hyperparameter space.
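
The same subtraction used for the 2-GPU case can be applied to every column of the step-time table. The sketch below does that and, for reference only, computes how long a single copy of the 717 MB gradients would take to cross each link at its nominal bandwidth. That second figure is an assumption-laden lower bound: it ignores how the all-reduce is actually scheduled and how many times the data moves, so it should not be read as a model of DDL.

```python
STEP_MS = {1: 360, 2: 450, 4: 560, 8: 680, 12: 690, 16: 700}
GRADIENT_MB = 717


def comm_overhead_ms(num_gpus):
    """Extra time per step relative to the single-GPU step time."""
    return STEP_MS[num_gpus] - STEP_MS[1]


def one_copy_transfer_ms(size_mb, bandwidth_gb_per_s):
    """Time to move one copy of the data over a link (1 GB = 1000 MB)."""
    seconds = size_mb / (bandwidth_gb_per_s * 1000)
    return seconds * 1000


for n in (2, 4, 8, 12, 16):
    print(f"{n:>2} GPUs: {comm_overhead_ms(n)} ms overhead per step")

nvlink_ms = one_copy_transfer_ms(GRADIENT_MB, 150)     # 150 GB/s NVLink
ethernet_ms = one_copy_transfer_ms(GRADIENT_MB, 12.5)  # 100 Gbps = 12.5 GB/s
print(f"one copy over NVLink: {nvlink_ms:.2f} ms, over Ethernet: {ethernet_ms:.2f} ms")
```

The overhead rises from 90 ms at 2 GPUs to 340 ms at 16 GPUs, and it changes by only 20 ms between 8 and 16 GPUs, which is the sub-linear growth described above.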

Try it yourself… 

In this blog, we discussed the steps to extend the TensorFlow NMT sequence-to-sequence model code with IBM DDL. Our results demonstrate that, with minimal changes to the user code, DDL allows one to efficiently distribute work across GPUs on a single node and across nodes. The efficient use of multiple GPUs can reduce the execution time by close to an order of magnitude (about 8x with 16 GPUs in our experiments) and enables easy exploration of the hyperparameter space.

Try this NMT example with IBM POWER AI enterprise (for those with IBM AC922 systems), on IBM Cloud, or in IBM Watson Studio.


See the references below for additional information on IBM DDL, POWER AI, NMT, etc.

Seetharami Seelam is a Research Staff Member at IBM Research. His research interests are at the intersection of cloud, systems, and cognitive computing. He is working to deliver systems and cognitive services as services in the cloud with ease of use and differentiated performance.

Geert Janssen is a Research Staff Member at IBM Research. His research interests are in developing parallel algorithms for distributed deep learning.

Luis Lastras is a Research Staff Member at IBM Research. His interests are in the area of information retrieval, semantic search, knowledge graphs, linked data, social networks, natural language processing and very large scale systems.
