Overview

PowerAI 1.5.3 supports Caffe as one of its deep learning frameworks. The Caffe shipped with PowerAI actually comes in two variants:

  • Caffe BVLC – Contains the upstream Caffe 1.0.0 release developed by the Berkeley Vision and Learning Center (BVLC) and other community contributors. BVLC has since been renamed BAIR (Berkeley Artificial Intelligence Research).
  • Caffe IBM – Built on top of Caffe BVLC and contains enhancements by IBM. By default, caffe points to the Caffe-IBM variant. To select a Caffe variant, source the corresponding caffe-activate script; for example, source /opt/DL/caffe/bin/caffe-activate activates the default (Caffe-IBM) variant.

To verify which Caffe variant is activated, check the system PATH variable; the /opt/DL/caffe-ibm/bin entry in the output below shows that the Caffe-IBM variant is active:
$ echo $PATH
/usr/lib64/qt-3.3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/ibutils/bin:/opt/anaconda2/bin/:/root/bin:/opt/DL/protobuf/bin:/opt/DL/mldl-spectrum/bin:/opt/DL/ddl/bin:/opt/DL/caffe-ibm/bin

PowerAI does not support activating multiple frameworks in the same login session, as doing so results in unpredictable behavior. If you want to activate a different Caffe variant, log out and start a new session. The Caffe ImageNet example (linked at the end of this article) provides steps for training an ImageNet model, but you can also use the Caffe framework shipped in the PowerAI software bundle. The next sections describe how to start training using the ImageNet example.
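For example, to switch to the Caffe BVLC variant, start a fresh login session and source that variant's activation script. A minimal sketch (the /opt/DL/caffe-bvlc path follows the usual PowerAI install layout; verify it on your system):

# in a new login session, activate the BVLC variant instead of the default
$ source /opt/DL/caffe-bvlc/bin/caffe-activate
$ which caffe    # should now resolve under /opt/DL/caffe-bvlc/bin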

Before starting the training, you need to download the dataset that you will use for training and evaluation.

Download the ImageNet Dataset

  1. Sign up for download access to the ImageNet dataset at the ImageNet website (image-net.org).
  2. After getting access permissions, you can download two tar files:
    • ILSVRC2012_img_train.tar
    • ILSVRC2012_img_val.tar
  3. Create train and val folders and extract the two tar files into them:
    mkdir -p train val
    tar -C train/ -xvf ILSVRC2012_img_train.tar
    tar -C val/ -xvf ILSVRC2012_img_val.tar
  4. The train tar expands into one tar file per image class; extract each of those into its own class folder (see the sketch after this list). The train folder will then have 1000 sub-folders, one per image category, each containing roughly 1300 JPEG images.
  5. Extracting the val tar produces 50,000 images in a flat directory, which you need to sort into the same kind of folder structure. Download valprep.sh from https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh, copy it into the val folder, and execute it. The script creates the class sub-folders under the “val” directory and moves each JPEG image into the right one, leaving 1000 sub-folders with 50 images each.
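A minimal sketch of steps 4 and 5, assuming the tar files were extracted as shown above and that each inner train tar is named after its image class (for example, n01440764.tar):

# step 4: unpack each per-class tar inside train/ into its own folder
cd train
for f in *.tar; do
    d="${f%.tar}"
    mkdir -p "$d"
    tar -xf "$f" -C "$d"
    rm "$f"        # optionally remove the inner tar once extracted
done
cd ..

# step 5: sort the flat val/ images into class sub-folders
cd val
wget https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
sh valprep.sh
cd ..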

Converting the Images to LMDB format

  1. PowerAI provides a script, caffe-install-samples, that copies the example scripts and models into a directory. The example below copies them all into a user-defined directory, test-dir.

    [testuser@dlw11 ~]$ caffe-install-samples test-dir
    Creating directory test-dir
    Copying data/ into test-dir...
    Copying examples/ into test-dir...
    Copying models/ into test-dir...
    Copying scripts/ into test-dir...
    Copying python/ into test-dir...
    Success
    [testuser@dlw11 ~]$ cd test-dir/
    [testuser@dlw11 test-dir]$ ls -al
    total 12
    drwxrwxr-x. 7 testuser testuser 77 Sep 26 01:59 .
    drwx------. 3 testuser testuser 128 Sep 26 01:59 ..
    drwxr-xr-x. 6 testuser testuser 62 Sep 26 01:59 data
    drwxr-xr-x. 18 testuser testuser 4096 Sep 26 01:59 examples
    drwxr-xr-x. 7 testuser testuser 144 Sep 26 01:59 models
    drwxr-xr-x. 3 testuser testuser 4096 Sep 26 01:59 python
    drwxr-xr-x. 3 testuser testuser 4096 Sep 26 01:59 scripts
    [testuser@dlw11 test-dir]$

  2. Navigate to test-dir/data/ilsvrc12 and run the get_ilsvrc_aux.sh script, which downloads the ImageNet mean file (imagenet_mean.binaryproto) and the training and validation label text files.
    $ sh get_ilsvrc_aux.sh
    Downloading...
    --2018-09-28 07:43:16-- http://dl.caffe.berkeleyvision.org/caffe_ilsvrc12.tar.gz
    Resolving dl.caffe.berkeleyvision.org (dl.caffe.berkeleyvision.org)... 169.229.222.251
    Connecting to dl.caffe.berkeleyvision.org (dl.caffe.berkeleyvision.org)|169.229.222.251|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 17858008 (17M) [application/octet-stream]
    Saving to: 'caffe_ilsvrc12.tar.gz'

    100%[======================================>] 1,78,58,008 11.3MB/s in 1.5s

    2018-09-28 07:43:18 (11.3 MB/s) - ‘caffe_ilsvrc12.tar.gz’ saved [17858008/17858008]

    Unzipping...
    Done.
    $ ls
    det_synset_words.txt imagenet_mean.binaryproto test.txt
    get_ilsvrc_aux.sh synsets.txt train.txt
    imagenet.bet.pickle synset_words.txt val.txt

  3. Navigate to test-dir/examples/imagenet and edit the create_imagenet.sh file as follows:
    1. Set EXAMPLE to the absolute path of the imagenet example directory
    2. Set DATA to the location of data/ilsvrc12
    3. Set TRAIN_DATA_ROOT to the absolute path of the ImageNet train folder
    4. Set VAL_DATA_ROOT to the absolute path of the validation val folder

    #!/usr/bin/env sh
    # Create the imagenet lmdb inputs
    # N.B. set the path to the imagenet train + val data dirs
    set -e

    EXAMPLE=/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/examples/imagenet
    DATA=/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/data/ilsvrc12

    # Check if CAFFE_BIN is unset
    if [ -z "$CAFFE_BIN" ]; then
      TOOLS=./build/tools
    else
      TOOLS=$CAFFE_BIN
    fi

    TRAIN_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/pytorch/train/
    VAL_DATA_ROOT=/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/pytorch/val/

    # Set RESIZE=true to resize the images to 256x256. Leave as false if images have
    # already been resized using another tool.
    RESIZE=false
    if $RESIZE; then
      RESIZE_HEIGHT=256
      RESIZE_WIDTH=256
    else
      RESIZE_HEIGHT=0
      RESIZE_WIDTH=0
    fi

  4. Execute the create_imagenet.sh file; you will see LMDB generation output like the following:
    $ sh ./create_imagenet.sh
    Creating train lmdb...
    I0928 07:45:58.968338 34452 convert_imageset.cpp:86] Shuffling data
    I0928 07:45:59.885766 34452 convert_imageset.cpp:89] A total of 1281167 images.
    I0928 07:45:59.887583 34452 db_lmdb.cpp:35] Opened lmdb /gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/examples/imagenet/ilsvrc12_train_lmdb
    I0928 07:46:46.930565 34452 convert_imageset.cpp:147] Processed 1000 files.
    I0928 07:47:23.977712 34452 convert_imageset.cpp:147] Processed 2000 files.
    I0928 07:47:58.416146 34452 convert_imageset.cpp:147] Processed 3000 files.
    I0928 07:48:31.446862 34452 convert_imageset.cpp:147] Processed 4000 files.
    I0928 07:49:03.587481 34452 convert_imageset.cpp:147] Processed 5000 files.
    I0928 07:49:36.566186 34452 convert_imageset.cpp:147] Processed 6000 files.
    I0928 07:50:08.293210 34452 convert_imageset.cpp:147] Processed 7000 files.
    I0928 07:50:40.633654 34452 convert_imageset.cpp:147] Processed 8000 files.
    I0928 07:51:12.149935 34452 convert_imageset.cpp:147] Processed 9000 files.
    I0928 07:51:44.529917 34452 convert_imageset.cpp:147] Processed 10000 files.

At the end of execution, the ilsvrc12_train_lmdb and ilsvrc12_val_lmdb folders are generated, each containing two files: data.mdb and lock.mdb. Note that the imagenet_mean.binaryproto file required for training the AlexNet model is located in test-dir/data/ilsvrc12/.
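A quick sanity check from inside test-dir (paths as set in create_imagenet.sh above):

$ ls examples/imagenet/ilsvrc12_train_lmdb examples/imagenet/ilsvrc12_val_lmdb
# each directory should list data.mdb and lock.mdb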

Training the AlexNet Model Using Caffe-IBM

Navigate to the models/bvlc_alexnet folder, where you will find the solver and network definition files for the AlexNet model.

[testuser@dlw11 models]$ cd bvlc_alexnet/
[testuser@dlw11 bvlc_alexnet]$ pwd
/home/testuser/test-dir/models/bvlc_alexnet
[testuser@dlw11 bvlc_alexnet]$ ls -al
total 20
drwxr-xr-x. 2 testuser testuser 95 Sep 26 01:59 .
drwxr-xr-x. 7 testuser testuser 144 Sep 26 01:59 ..
-rw-r--r--. 1 testuser testuser 3629 Sep 26 01:59 deploy.prototxt
-rw-r--r--. 1 testuser testuser 1146 Sep 26 01:59 readme.md
-rw-r--r--. 1 testuser testuser 297 Sep 26 01:59 solver.prototxt
-rw-r--r--. 1 testuser testuser 5351 Sep 26 01:59 train_val.prototxt
[testuser@dlw11 bvlc_alexnet]$

Before training the model, you need to set parameters in solver.prototxt and train_val.prototxt. In solver.prototxt, a few values together determine how long the model trains:

  • Number of Iterations
  • Training Batch size
  • Total number of Images
  • Number of GPUs used for training

Number of epochs = (Number of iterations × Batch size × Number of GPUs) / Total number of images

Suppose you want to train a model on the ImageNet data of 1.2 million images. You can set the parameter values as follows:

  1. Number of GPUs = 4
  2. Number of iterations = 25000
  3. Training batch size = 256
  4. Total number of images = 1200000
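Plugging these values into the formula above gives (25000 × 256 × 4) / 1200000 ≈ 21.3, so this configuration trains for roughly 21 epochs. To train for exactly 10 epochs instead, lower the iteration count to about 11719 (10 × 1200000 / (256 × 4)).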

With these values, the solver.prototxt file looks like the following:

net: "/home/testuser/test-dir/models/bvlc_alexnet/train_val.prototxt"
test_iter: 1000
test_interval: 1000
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 20000
display: 20
max_iter: 25000
momentum: 0.9
weight_decay: 0.0005
snapshot: 10000
snapshot_prefix: "/home/testuser/test-dir/models/bvlc_alexnet/caffe_alexnet_train"
solver_mode: GPU
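
Note that test_iter works together with the TEST-phase batch_size in train_val.prototxt: their product should cover the full validation set, here 1000 × 50 = 50000 validation images.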

Here you need to set the "net" value to the absolute path of the train_val.prototxt file, set "snapshot_prefix" to where snapshots should be written, and set "max_iter" to 25000. In train_val.prototxt, available in the same directory, give the absolute paths of the train and validation LMDB directories in the TRAIN and TEST phase data layers, and set the train batch size to 256 to match the calculation above.

name: "AlexNet"
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
crop_size: 227
mean_file: "/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/data/ilsvrc12/imagenet_mean.binaryproto"
}
data_param {
source: "/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/examples/imagenet/ilsvrc12_train_lmdb"
batch_size: 256
backend: LMDB
}
}
layer {
name: "data"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
mirror: false
crop_size: 227
mean_file: "/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/data/ilsvrc12/imagenet_mean.binaryproto"
}
data_param {
source: "/gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/examples/imagenet/ilsvrc12_val_lmdb"
batch_size: 50
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1

The Data layers in this example read the train and validation sets in LMDB format, so the absolute paths of the LMDB directories generated earlier go under the source values in the data_param sections. You also need to give the location of the mean file to be used for training under mean_file. Now you are all set, and you can train the model using the command:

caffe train --gpu=all --solver=solver.prototxt

You will see output like the following. (This sample run was launched through a site-specific wrapper script, which is why its iteration count, batch size, and GPU list differ from the values set above.)

Start Of Execution At: Tue Sep 18 05:13:02 EDT 2018
Using Configs:
LMDB_TRAIN_DIR = /gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/examples/imagenet/ilsvrc12_train_lmdb
LMDB_VAL_DIR = /gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/examples/imagenet/ilsvrc12_val_lmdb
LMDB_MEAN_FILE = /gpfs/gpfs_gl4_16mb/b8p226/b8p226zd/test-dir/data/ilsvrc12/imagenet_mean.binaryproto
ITERATIONS = 8000
TRAIN_BATCH_SIZE = 240
TEST_BATCH_SIZE = 64
GPUs = 0,1,2,3,4,5
RUN_MODE = non-lms
lms_size_threshold=
lms_exclude =
Caffe Training Started At: Tue Sep 18 05:13:02 EDT 2018
Running Caffe as : time numactl /opt/DL/caffe/bin/caffe train -gpu 0,1,2,3,4,5 --solver=/tmp/b8p226zd/alexnet-2018-09-18-05-13-02/solver_2018-09-18-05-13-02.prototxt --iterations 8000
I0918 05:13:02.492660 5945 caffe.cpp:335] Using GPUs 0, 1, 2, 3, 4, 5
I0918 05:13:04.695749 5945 caffe.cpp:340] GPU 0: Tesla V100-SXM2-16GB
I0918 05:13:04.697422 5945 caffe.cpp:340] GPU 1: Tesla V100-SXM2-16GB
I0918 05:13:04.699075 5945 caffe.cpp:340] GPU 2: Tesla V100-SXM2-16GB
I0918 05:13:04.700800 5945 caffe.cpp:340] GPU 3: Tesla V100-SXM2-16GB
I0918 05:13:04.702517 5945 caffe.cpp:340] GPU 4: Tesla V100-SXM2-16GB
I0918 05:13:04.704231 5945 caffe.cpp:340] GPU 5: Tesla V100-SXM2-16GB
I0918 05:13:04.712694 5945 common.cpp:226] NVidia Management Library loaded successfully
I0918 05:13:05.501705 5945 solver.cpp:45] Initializing solver from parameters:
test_iter: 1000
test_interval: 1000
base_lr: 0.01
display: 20
max_iter: 8000
lr_policy: "step"
gamma: 0.1
momentum: 0.9
weight_decay: 0.0005
stepsize: 20000
snapshot: 800
snapshot_prefix: "/tmp/b8p226zd/alexnet-2018-09-18-05-13-02/caffe_alexnet_gpu_train"
solver_mode: GPU
device_id: 0
net: "/tmp/b8p226zd/alexnet-2018-09-18-05-13-02/train_val_2018-09-18-05-13-02.prototxt"
train_state {
level: 0
stage: ""
}
I0918 05:13:05.502215 5945 solver.cpp:103] Creating training net from net file: /tmp/b8p226zd/alexnet-2018-09-18-05-13-02/train_val_2018-09-18-05-13-02.prototxt
I0918 05:13:05.503700 5945 net.cpp:531] The NetState phase (0) differed from the phase (1) specified by a rule in layer data
I0918 05:13:05.503762 5945 net.cpp:531] The NetState phase (0) differed from the phase (1) specified by a rule in layer accuracy
I0918 05:13:05.503785 5945 net.cpp:57] Initializing net from parameters:
name: "Alexnet"
state {

At the end of the run you will see output like the following:

I0918 05:38:51.565030 5945 sgd_solver.cpp:128] Iteration 7920, lr = 0.01
I0918 05:38:54.061976 5945 solver.cpp:244] Iteration 7940 (7.9922 iter/s, 2.50244s/20 iters), loss = 2.87838
I0918 05:38:54.062108 5945 solver.cpp:263] Train net output #0: loss = 2.87838 (* 1 = 2.87838 loss)
I0918 05:38:54.066254 5945 sgd_solver.cpp:128] Iteration 7940, lr = 0.01
I0918 05:38:56.531842 5945 solver.cpp:244] Iteration 7960 (8.09814 iter/s, 2.4697s/20 iters), loss = 2.75436
I0918 05:38:56.537919 5945 solver.cpp:263] Train net output #0: loss = 2.75436 (* 1 = 2.75436 loss)
I0918 05:38:56.538363 5945 sgd_solver.cpp:128] Iteration 7960, lr = 0.01
I0918 05:38:59.019044 5945 solver.cpp:244] Iteration 7980 (8.06125 iter/s, 2.48101s/20 iters), loss = 3.00122
I0918 05:38:59.019153 5945 solver.cpp:263] Train net output #0: loss = 3.00122 (* 1 = 3.00122 loss)
I0918 05:38:59.024619 5945 sgd_solver.cpp:128] Iteration 7980, lr = 0.01
I0918 05:39:01.430420 5945 solver.cpp:483] Snapshotting to binary proto file /tmp/b8p226zd/alexnet-2018-09-18-05-13-02/caffe_alexnet_gpu_train_iter_8000.caffemodel
I0918 05:39:02.009801 5945 sgd_solver.cpp:367] Snapshotting solver state to binary proto file /tmp/b8p226zd/alexnet-2018-09-18-05-13-02caffe_alexnet_gpu_train_iter_8000.solverstate
I0918 05:39:02.231539 5945 solver.cpp:332] Iteration 8000, loss = 2.65919
I0918 05:39:02.231578 5945 solver.cpp:352] Iteration 8000, Testing net (#0)
I0918 05:39:07.122040 5945 blocking_queue.cpp:49] Waiting for data
I0918 05:39:11.766402 6121 data_layer.cpp:86] Restarting data prefetching from start.
I0918 05:39:18.347111 5945 solver.cpp:431] Test net output #0: accuracy = 0.408531
I0918 05:39:18.347160 5945 solver.cpp:431] Test net output #1: loss = 2.72732 (* 1 = 2.72732 loss)
I0918 05:39:18.347172 5945 solver.cpp:337] Optimization Done.
I0918 05:39:24.157742 5945 caffe.cpp:421] Optimization Done.
Caffe Training Completed At: Tue Sep 18 05:39:26 EDT 2018
Generating Data for outdir-2018-09-18-05-13-02/caffe-run-alexnet.log
Running parse_log.sh to generate the data for train & test
End Of Execution At: Tue Sep 18 05:40:08 EDT 2018

Now you should be able to easily train any model using Caffe-IBM. If you have any questions, feel free to add them below. We’d love to hear from you!

Useful links

Caffe ImageNet example
PowerAI 1.5.3 Knowledge Center
