The 1.5.3 release of PowerAI includes updates to IBM’s Distributed Deep Learning (DDL) framework that make it easier to distribute TensorFlow Keras training. In this article we walk through taking an existing TensorFlow Keras model, making the code changes necessary to distribute its training with DDL, and using ddlrun to execute the distributed script.

Our starting point is the Keras mnist_cnn.py example script.

Code Changes

1. Imports

The first step is to convert the keras imports to their tensorflow.python.keras equivalents. This is accomplished by replacing import keras with from tensorflow.python import keras as keras, and replacing imports of the form from keras.xxxxx import ... with from tensorflow.python.keras.xxxxx import .... We also have to import ddl and numpy as np. Importing ddl automatically distributes the gradient computation during training.

import keras                                                                  | from tensorflow.python import keras as keras
from keras.datasets import mnist                                              | from tensorflow.python.keras.datasets import mnist
from keras.models import Sequential                                           | from tensorflow.python.keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten                              | from tensorflow.python.keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D                                 | from tensorflow.python.keras.layers import Conv2D, MaxPooling2D
from keras import backend as K                                                | from tensorflow.python.keras import backend as K
                                                                              > import ddl
                                                                              > import numpy as np

2. Split the Training and Test Data

Next we have to split the training and test data so that each GPU works on a different shard of the data. This split is what actually divides the work across the DDL ranks (a small standalone sketch of the split follows the diff below).

  • x_test_full and y_test_full are added to be able to do a final model evaluation at the end.
  • np.array_split(x_train, ddl.size())[ddl.rank()] is used to split the training data into ddl.size() pieces and select the piece that corresponds to the current rank, ddl.rank(). The same is done for the test data and for the training and test labels.
                                                                              > # DDL: Save the full test data before splitting for final accuracy check.
                                                                              > x_test_full = x_test.astype('float32') / 255
                                                                              > y_test_full = keras.utils.to_categorical(y_test, num_classes)
                                                                              >
                                                                              > # DDL: Split the training & testing data.
                                                                              > x_train = np.array_split(x_train, ddl.size())[ddl.rank()]
                                                                              > x_test = np.array_split(x_test, ddl.size())[ddl.rank()]
x_train = x_train.astype('float32')                                             x_train = x_train.astype('float32')
x_test = x_test.astype('float32')                                               x_test = x_test.astype('float32')
x_train /= 255                                                                  x_train /= 255
x_test /= 255                                                                   x_test /= 255
print('x_train shape:', x_train.shape)                                          print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')                                        print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')                                          print(x_test.shape[0], 'test samples')

                                                                              > # DDL: Split the training & testing data.
                                                                              > y_train = np.array_split(y_train, ddl.size())[ddl.rank()]
                                                                              > y_test = np.array_split(y_test, ddl.size())[ddl.rank()]
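
To make the split concrete, here is a small standalone sketch (not part of the example script) of what np.array_split does with the MNIST training set; size and rank are stand-ins for ddl.size() and ddl.rank():

import numpy as np

# Toy illustration: shard 60000 MNIST training samples across 4 ranks,
# exactly as np.array_split(x_train, ddl.size())[ddl.rank()] does above.
x_train = np.zeros((60000, 28, 28, 1))
size, rank = 4, 2                            # stand-ins for ddl.size() and ddl.rank()
shard = np.array_split(x_train, size)[rank]
print(shard.shape)                           # (15000, 28, 28, 1)

With 4 ranks, each shard holds 15000 of the 60000 training images, which matches the per-rank "x_train shape: (15000, 28, 28, 1)" lines in the run output further down.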

3. Adjust the Learning Rate

The next change is to multiply the learning rate by the total number of GPUs. The intuition is as follows: since we are splitting the data and performing gradient descent across ddl.size() GPUs, each with a batch size of 128, the effective global batch size becomes 128 * ddl.size(). That reduces the number of gradient descent updates per epoch, slowing convergence by a factor of roughly the number of GPUs. To compensate, we scale the learning rate by ddl.size() (a short worked example follows the diff below).

                                                                              > # DDL: adjust learning rate based on number of GPUs.
model.compile(loss=keras.losses.categorical_crossentropy,                       model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),                          |               optimizer=keras.optimizers.Adadelta(lr=1.0 * ddl.size()),
              metrics=['accuracy'])                                                           metrics=['accuracy'])
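
As a quick sanity check of the arithmetic above, here is a minimal sketch (not part of the script), assuming 4 GPUs and the example's per-GPU batch size of 128:

per_gpu_batch_size = 128
num_gpus = 4                                            # ddl.size() at run time
effective_batch_size = per_gpu_batch_size * num_gpus    # 512 samples per global update
base_lr = 1.0                                           # Adadelta's default learning rate
scaled_lr = base_lr * num_gpus                          # 4.0, matching lr=1.0 * ddl.size() in the diff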

4. Add DDL Callbacks

DDL requires that two callbacks be added to the list of Keras callbacks. To ensure that metrics used for early stopping and other hyperparameter tuning remain in sync throughout training, we add ddl.DDLCallback() as the first callback in the list. To ensure that all global variables in the model are correctly initialized, we add ddl.DDLGlobalVariablesCallback() as the last callback in the list.

                                                                              > callbacks = list()
                                                                              >
                                                                              > # DDL: Add the DDL callback.
                                                                              > callbacks.append(ddl.DDLCallback())
                                                                              > callbacks.append(ddl.DDLGlobalVariablesCallback())
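
If you add callbacks of your own, the ordering requirement still applies: ddl.DDLCallback() stays first and ddl.DDLGlobalVariablesCallback() stays last. A hypothetical sketch (the EarlyStopping callback is our addition, not part of the example script):

callbacks = list()
callbacks.append(ddl.DDLCallback())                 # DDL callback must be first
callbacks.append(keras.callbacks.EarlyStopping(monitor='val_loss', patience=3))
callbacks.append(ddl.DDLGlobalVariablesCallback())  # DDL global-variables callback must be last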

5. Restrict Printing to Rank 0

There are usually some operations that we only want to perform from a single process, printing for example. To restrict such operations to rank 0 we can use if ddl.rank() == 0. Here we also use x_test_full and y_test_full to evaluate the model on the full test set for the final accuracy check displayed at the end.

                                                                              > # DDL: Only use verbose = 1 on rank 0.
model.fit(x_train, y_train,                                                     model.fit(x_train, y_train,
          batch_size=batch_size,                                                          batch_size=batch_size,
          epochs=epochs,                                                                  epochs=epochs,
          verbose=1,                                                          |           verbose=1 if ddl.rank() == 0 else 0,
          validation_data=(x_test, y_test))                                   |           validation_data=(x_test, y_test),
                                                                              >           callbacks=callbacks)
                                                                              > # DDL: Only do final accuracy check on rank 0.
                                                                              > if ddl.rank() == 0:
score = model.evaluate(x_test, y_test, verbose=0)                             |   score = model.evaluate(x_test_full, y_test_full, verbose=0)
print('Test loss:', score[0])                                                 |   print('Test loss:', score[0])
print('Test accuracy:', score[1])                                             |   print('Test accuracy:', score[1])
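
The same rank-0 guard is useful for any other operation that should only happen once, such as saving the trained model. A minimal sketch (the file name is our own choice, not from the script):

# DDL: Only save the model from rank 0 to avoid concurrent writes to the same file.
if ddl.rank() == 0:
    model.save('mnist-tf-keras-model.h5')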

Running the Script

To run the script across any number of nodes, we can use the following commands:

$ source /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-activate
$ /opt/DL/ddl-tensorflow/bin/ddl-tensorflow-install-samples ~/samples
$ ddlrun -H host1,host2,host3,host4,... python ~/samples/examples/keras/mnist-tf-keras.py

Run on a single host with 4 GPUs (so no -H host list is needed), the output looks like:

$ ddlrun python ~/samples/examples/keras/mnist-tf-keras.py
+ mpirun -x PATH -x LD_LIBRARY_PATH -x PYTHONPATH -tcp -disable_gpu_hooks --rankfile /tmp/DDLRUN/ddlrun.Rd3PDdkJYvRb/RANKFILE -x 'DDL_OPTIONS=-mode p:4x1x1x1 ' -n 4 python ~/samples/examples/keras/mnist-tf-keras.py
DDL: DDL_GROUP_SIZE=10000000.
2018-08-28 19:37:57.689450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.689548: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 2
2018-08-28 19:37:57.689856: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.689948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 1
2018-08-28 19:37:57.691164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.691221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 19:37:57.726137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
totalMemory: 15.75GiB freeMemory: 15.34GiB
2018-08-28 19:37:57.726350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 3
2018-08-28 19:37:58.078092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.078179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-08-28 19:37:58.078203: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-08-28 19:37:58.078863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14847 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-08-28 19:37:58.080687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.080722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      1
2018-08-28 19:37:58.080738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N
2018-08-28 19:37:58.081261: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14849 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-08-28 19:37:58.150432: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.150481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      2
2018-08-28 19:37:58.150495: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   N
2018-08-28 19:37:58.151084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
2018-08-28 19:37:58.405935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:37:58.406037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      3
2018-08-28 19:37:58.406065: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   N
2018-08-28 19:37:58.406917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
I 19:37:58.441 122001 122471 DDL:41  ] [MPI:0   ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
2018-08-28 19:37:59.918249: I ddl_MDR_ops.cc:826] [MPI:2   ]  name=Init local_gpuid=2 local_rank=2 local_size=4
2018-08-28 19:37:59.918246: I ddl_MDR_ops.cc:826] [MPI:3   ]  name=Init local_gpuid=3 local_rank=3 local_size=4
2018-08-28 19:37:59.918266: I ddl_MDR_ops.cc:826] [MPI:1   ]  name=Init local_gpuid=1 local_rank=1 local_size=4
2018-08-28 19:37:59.918266: I ddl_MDR_ops.cc:826] [MPI:0   ]  name=Init local_gpuid=0 local_rank=0 local_size=4
DDL: rank: 0, size: 4, gpuid: 0, hosts: 1
DDL: rank: 1, size: 4, gpuid: 1, hosts: 1
DDL: rank: 2, size: 4, gpuid: 2, hosts: 1
DDL: rank: 3, size: 4, gpuid: 3, hosts: 1
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
x_train shape: (15000, 28, 28, 1)
15000 train samples
2500 test samples
2018-08-28 19:38:00.963727: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 2
2018-08-28 19:38:00.963824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:00.963838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      2
2018-08-28 19:38:00.963851: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 2:   N
2018-08-28 19:38:00.964433: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0035:03:00.0, compute capability: 7.0)
Train on 15000 samples, validate on 2500 samples
2018-08-28 19:38:01.026421: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0
2018-08-28 19:38:01.026512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:01.026535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      0
2018-08-28 19:38:01.026565: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N
2018-08-28 19:38:01.027149: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14847 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0004:04:00.0, compute capability: 7.0)
2018-08-28 19:38:01.245015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 1
2018-08-28 19:38:01.245136: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:01.245160: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      1
2018-08-28 19:38:01.245180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 1:   N
2018-08-28 19:38:01.245765: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14849 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0004:05:00.0, compute capability: 7.0)
2018-08-28 19:38:02.113830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 3
2018-08-28 19:38:02.113934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-28 19:38:02.113949: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971]      3
2018-08-28 19:38:02.113963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 3:   N
2018-08-28 19:38:02.114639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14846 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0035:04:00.0, compute capability: 7.0)
Epoch 1/12
2018-08-28 19:38:04.161660: I ddl_MDR_ops.cc:357] [MPI:0   ]  name=training/Adadelta/AllReduceN _global_buf_size=1199882 _N=8
I 19:38:04.529 122001 122637 DDL:703 ] [MPI:0   ] selected algo: NCCLB   - NCCLB
15000/15000 [==============================] - 4s 282us/step - loss: 0.4831 - acc: 0.8497 - val_loss: 0.1190 - val_acc: 0.9596
Epoch 2/12
15000/15000 [==============================] - 2s 119us/step - loss: 0.1169 - acc: 0.9679 - val_loss: 0.0846 - val_acc: 0.9700
Epoch 3/12
15000/15000 [==============================] - 2s 133us/step - loss: 0.0805 - acc: 0.9760 - val_loss: 0.0731 - val_acc: 0.9728
Epoch 4/12
15000/15000 [==============================] - 2s 118us/step - loss: 0.0693 - acc: 0.9797 - val_loss: 0.0571 - val_acc: 0.9792
Epoch 5/12
15000/15000 [==============================] - 2s 122us/step - loss: 0.0514 - acc: 0.9843 - val_loss: 0.0443 - val_acc: 0.9832
Epoch 6/12
15000/15000 [==============================] - 2s 120us/step - loss: 0.0473 - acc: 0.9868 - val_loss: 0.0539 - val_acc: 0.9804
Epoch 7/12
15000/15000 [==============================] - 2s 120us/step - loss: 0.0408 - acc: 0.9869 - val_loss: 0.0510 - val_acc: 0.9844
Epoch 8/12
15000/15000 [==============================] - 2s 121us/step - loss: 0.0398 - acc: 0.9877 - val_loss: 0.0579 - val_acc: 0.9836
Epoch 9/12
15000/15000 [==============================] - 2s 122us/step - loss: 0.0373 - acc: 0.9893 - val_loss: 0.0485 - val_acc: 0.9840
Epoch 10/12
15000/15000 [==============================] - 2s 104us/step - loss: 0.0289 - acc: 0.9915 - val_loss: 0.0566 - val_acc: 0.9824
Epoch 11/12
15000/15000 [==============================] - 2s 111us/step - loss: 0.0291 - acc: 0.9907 - val_loss: 0.0565 - val_acc: 0.9816
Epoch 12/12
15000/15000 [==============================] - 2s 106us/step - loss: 0.0279 - acc: 0.9915 - val_loss: 0.0419 - val_acc: 0.9856
2018-08-28 19:38:26.596350: I ddl_MDR_ops.cc:270] [MPI:2   ] calling ddl_finalize

2018-08-28 19:38:26.598412: I ddl_MDR_ops.cc:270] [MPI:3   ] calling ddl_finalize

2018-08-28 19:38:26.655121: I ddl_MDR_ops.cc:270] [MPI:1   ] calling ddl_finalize

2018-08-28 19:38:27.320345: I ddl_MDR_ops.cc:270] [MPI:0   ] calling ddl_finalize

Test loss: 0.0279394076111
Test accuracy: 0.992


Complete Diff

'''Trains a simple convnet on the MNIST dataset.                                '''Trains a simple convnet on the MNIST dataset.

Gets to 99.25% test accuracy after 12 epochs                                    Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).                          (there is still a lot of margin for parameter tuning).
16 seconds per epoch on a GRID K520 GPU.                                        16 seconds per epoch on a GRID K520 GPU.
'''                                                                             '''

from __future__ import print_function                                           from __future__ import print_function
import keras                                                                  | from tensorflow.python import keras as keras
from keras.datasets import mnist                                              | from tensorflow.python.keras.datasets import mnist
from keras.models import Sequential                                           | from tensorflow.python.keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten                              | from tensorflow.python.keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D                                 | from tensorflow.python.keras.layers import Conv2D, MaxPooling2D
from keras import backend as K                                                | from tensorflow.python.keras import backend as K
                                                                              > import ddl
                                                                              > import numpy as np

batch_size = 128                                                                batch_size = 128
num_classes = 10                                                                num_classes = 10
epochs = 12                                                                     epochs = 12

# input image dimensions                                                        # input image dimensions
img_rows, img_cols = 28, 28                                                     img_rows, img_cols = 28, 28

# the data, split between train and test sets                                   # the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()                        (x_train, y_train), (x_test, y_test) = mnist.load_data()

if K.image_data_format() == 'channels_first':                                   if K.image_data_format() == 'channels_first':
    x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)              x_train = x_train.reshape(x_train.shape[0], 1, img_rows, img_cols)
    x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)                 x_test = x_test.reshape(x_test.shape[0], 1, img_rows, img_cols)
    input_shape = (1, img_rows, img_cols)                                           input_shape = (1, img_rows, img_cols)
else:                                                                           else:
    x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)              x_train = x_train.reshape(x_train.shape[0], img_rows, img_cols, 1)
    x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)                 x_test = x_test.reshape(x_test.shape[0], img_rows, img_cols, 1)
    input_shape = (img_rows, img_cols, 1)                                           input_shape = (img_rows, img_cols, 1)

                                                                              > # DDL: Save the full test data before splitting for final accuracy check.
                                                                              > x_test_full = x_test.astype('float32') / 255
                                                                              > y_test_full = keras.utils.to_categorical(y_test, num_classes)
                                                                              >
                                                                              > # DDL: Split the training & testing data.
                                                                              > x_train = np.array_split(x_train, ddl.size())[ddl.rank()]
                                                                              > x_test = np.array_split(x_test, ddl.size())[ddl.rank()]
x_train = x_train.astype('float32')                                             x_train = x_train.astype('float32')
x_test = x_test.astype('float32')                                               x_test = x_test.astype('float32')
x_train /= 255                                                                  x_train /= 255
x_test /= 255                                                                   x_test /= 255
print('x_train shape:', x_train.shape)                                          print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')                                        print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')                                          print(x_test.shape[0], 'test samples')

                                                                              > # DDL: Split the training & testing data.
                                                                              > y_train = np.array_split(y_train, ddl.size())[ddl.rank()]
                                                                              > y_test = np.array_split(y_test, ddl.size())[ddl.rank()]
# convert class vectors to binary class matrices                                # convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)                      y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)                        y_test = keras.utils.to_categorical(y_test, num_classes)

model = Sequential()                                                            model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),                                        model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',                                                              activation='relu',
                 input_shape=input_shape))                                                       input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))                                model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))                                       model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))                                                        model.add(Dropout(0.25))
model.add(Flatten())                                                            model.add(Flatten())
model.add(Dense(128, activation='relu'))                                        model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))                                                         model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))                             model.add(Dense(num_classes, activation='softmax'))

                                                                              > # DDL: adjust learning rate based on number of GPUs.
model.compile(loss=keras.losses.categorical_crossentropy,                       model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),                          |               optimizer=keras.optimizers.Adadelta(lr=1.0 * ddl.size()),
              metrics=['accuracy'])                                                           metrics=['accuracy'])

                                                                              > callbacks = list()
                                                                              >
                                                                              > # DDL: Add the DDL callback.
                                                                              > callbacks.append(ddl.DDLCallback())
                                                                              > callbacks.append(ddl.DDLGlobalVariablesCallback())
                                                                              >
                                                                              > # DDL: Only use verbose = 1 on rank 0.
model.fit(x_train, y_train,                                                     model.fit(x_train, y_train,
          batch_size=batch_size,                                                          batch_size=batch_size,
          epochs=epochs,                                                                  epochs=epochs,
          verbose=1,                                                          |           verbose=1 if ddl.rank() == 0 else 0,
          validation_data=(x_test, y_test))                                   |           validation_data=(x_test, y_test),
                                                                              >           callbacks=callbacks)
                                                                              > # DDL: Only do final accuracy check on rank 0.
                                                                              > if ddl.rank() == 0:
score = model.evaluate(x_test, y_test, verbose=0)                             |   score = model.evaluate(x_test_full, y_test_full, verbose=0)
print('Test loss:', score[0])                                                 |   print('Test loss:', score[0])
print('Test accuracy:', score[1])                                             |   print('Test accuracy:', score[1])
