
IBM Power Systems

Machine Learning / Deep Learning performance proof-points

Training performance for deep learning networks such as AlexNet and VGGNet, using common frameworks.

IBM Data Science Experience (DSX) Local on IBM® Power® System AC922

Accelerate Data Scientist productivity and drive faster insights with IBM DSX Local on IBM Power System AC922

For the systems and workload compared:

  • Power AC922 (based on IBM POWER9™ processor technology with NVIDIA GPUs) completes a GPU-accelerated K-means clustering run on 15 GB of data in half the time of tested x86 systems (Skylake 6150 with NVIDIA GPUs).
  • Power AC922 delivers 2x faster insights for GPU-accelerated K-means clustering workload than Intel® Xeon® SP Gold 6150-based servers.
  • IBM Power Systems™ cluster with Power LC922 (CPU optimized) and Power AC922 (GPU accelerated) provides an optimized infrastructure for DSX Local.
IBM Data Science Experience (DSX) Local on IBM Power System AC922

System configuration

Power AC922:
  • IBM POWER9, 2x 20 cores / 3.78 GHz
  • 4x NVIDIA Tesla V100 GPUs with NVLink
  • 1 TB memory; each user assigned 180 GB in DSX Local
  • 2x 960 GB SSD
  • 10 GbE two-port
  • RHEL 7.5 for POWER9
  • Data Science Experience Local 1.2 fp3

Two-socket Intel Xeon Gold 6150:
  • Intel Xeon Gold 6150, 2x 18 cores / 2.7 GHz
  • 4x NVIDIA Tesla V100 GPUs
  • 768 GB memory; each user assigned 180 GB in DSX Local
  • 2x 960 GB SSD
  • 10 GbE two-port
  • RHEL 7.5
  • Data Science Experience Local 1.2 fp3

Notes:

  • The results are based on IBM internal testing of the core computational step to form five clusters using a 5270410 x 301 float64 data set (15 GB) running the K-means algorithm using Apache Python and TensorFlow. Results are valid as of 6/13/2018 and the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
  • Apache, Apache Python, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  • Download the workload and try it yourself: https://github.com/theresax/DSX_perf_eval/tree/master/clustering. Note: You will need to use your own data with dimensions similar to those described in the README.md file.
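The core computational step the notes above describe is ordinary Lloyd's K-means with five clusters. The published test ran it with Python and TensorFlow on the full 5270410 x 301 float64 matrix; the sketch below is only an illustrative, down-scaled plain-NumPy version of that step, with invented demo data.

```python
import numpy as np

def kmeans(data, k=5, iterations=20, seed=0):
    """Plain-NumPy sketch of Lloyd's K-means: assign each point to the
    nearest centroid, then recompute each centroid as the mean of its
    assigned points."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct random data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # Distance from every point to every centroid: shape (n, k).
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster goes empty.
        centroids = np.stack([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, labels

# Down-scaled demo; the published run used a 5270410 x 301 float64 matrix.
rng = np.random.default_rng(1)
data = rng.normal(size=(1000, 30))
centroids, labels = kmeans(data, k=5)
```

On a GPU-accelerated system, the distance computation and the per-cluster reductions are exactly the pieces that map onto the GPU.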

IBM Data Science Experience (DSX) Local on IBM® Power® System LC922

Accelerate Data Scientist productivity and drive faster insights with DSX Local on IBM Power System LC922

For the systems and workload compared:

  • Power LC922 running K-means clustering with 1 GB of data scales to 2x more users than tested x86 systems.
  • Power LC922 supports 2x more users at a faster response time than Intel® Xeon® SP Gold 6140-based servers.
  • Power LC922 delivers over 41% faster insights for the same number of users (four to eight).
IBM Data Science Experience (DSX) Local on IBM Power System LC922

System configuration

Power LC922:
  • IBM POWER9™, 2x 20 cores / 2.6 GHz
  • 512 GB memory
  • 10x 4 TB HDD
  • 10 GbE two-port
  • RHEL 7.5 for POWER9
  • Data Science Experience Local 1.1.2

Two-socket Intel Xeon SP Gold 6140:
  • Intel Xeon Gold 6140, 2x 18 cores / 2.4 GHz
  • 512 GB memory
  • 10x 4 TB HDD
  • 10 GbE two-port
  • RHEL 7.5
  • Data Science Experience Local 1.1.2

Notes:

  • The test results are based on IBM internal testing of the core computational step to form five clusters using a 350694 x 301 float64 data set (1 GB) running the K-means algorithm using Apache Python and TensorFlow. Results are valid as of 4/21/18 and the test was conducted under laboratory conditions with speculative execution controls to mitigate user-to-kernel and user-to-user side-channel attacks on both systems. Individual results can vary based on workload size, use of storage subsystems, and other conditions.
  • Apache, Apache Python, and associated logos are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Chainer on IBM POWER9™ with Nvidia Tesla V100 delivers a 3.7X reduction in AI model training time versus tested x86 systems

Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images using Chainer with LMS on POWER9 with Nvidia V100 GPUs.

For the systems and workload compared:

  • 3.7X reduction versus tested x86 systems in the runtime of 1000 training iterations on medical/satellite images
  • Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, etc., operate on more than just the GPU memory
    • NVLink 2.0 enables enhanced Host to GPU communication
    • LMS for deep learning from IBM enables seamless use of Host + GPU memory for improved performance
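The mechanism the bullets above describe — tensors live in large host memory and only an active working set is staged into the smaller GPU memory — can be illustrated with a cache-style sketch. Everything here (class name, slot counts, tensor names) is invented for illustration; a real LMS implementation moves data over NVLink and overlaps the copies with compute.

```python
from collections import OrderedDict
import numpy as np

class SwappingStore:
    """Illustrative sketch of the idea behind Large Model Support:
    all tensors live in (large) host memory, and an LRU-style working
    set of at most `gpu_slots` tensors is staged into (small) 'GPU'
    memory. Here 'GPU memory' is just another dict."""
    def __init__(self, gpu_slots=2):
        self.host = {}              # every tensor, by name
        self.gpu = OrderedDict()    # staged working set, in LRU order
        self.gpu_slots = gpu_slots
        self.swaps = 0              # host-to-GPU transfers performed

    def put(self, name, tensor):
        self.host[name] = tensor

    def fetch(self, name):
        """Return the tensor, staging it into 'GPU' memory if needed."""
        if name not in self.gpu:
            if len(self.gpu) >= self.gpu_slots:
                self.gpu.popitem(last=False)  # evict least recently used
            self.gpu[name] = self.host[name]
            self.swaps += 1
        self.gpu.move_to_end(name)            # mark as most recently used
        return self.gpu[name]

store = SwappingStore(gpu_slots=2)
for i in range(4):
    store.put(f"act{i}", np.zeros((8, 8)))
for i in range(4):          # a forward pass touches act0..act3 in order
    store.fetch(f"act{i}")  # act0 and act1 are evicted along the way
```

The faster the host-GPU link, the cheaper each swap — which is why the NVLink 2.0 host connection matters for this technique.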
Chainer on POWER9

System configuration

IBM Power System AC922:
  • POWER9 with NVLink 2.0
  • 40 cores (2x 20-core chips), 2.25 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • RHEL 7.4 Power LE (POWER9)
  • CUDA 9.1 / cuDNN 7

2x Intel Xeon E5-2640 v4:
  • Xeon E5-2640 v4
  • 20 cores (2x 10-core chips) / 40 threads, 2.4 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • Ubuntu 16.04
  • CUDA 9.0 / cuDNN 7

Caffe on IBM POWER9™ with Nvidia Tesla V100 delivers a 3.8X reduction in AI model training time versus tested x86 systems

Large Model Support (LMS) uses system memory and GPU memory to support more complex and higher resolution data. Maximize research productivity running training for medical/satellite images with Caffe with LMS on POWER9 with Nvidia V100 GPUs.

For the systems and workload compared:

  • 3.8X reduction versus tested x86 systems in the runtime of 1000 training iterations on 2240 x 2240 images
  • Critical machine learning (ML) capabilities such as regression, nearest neighbor, recommendation systems, clustering, etc., operate on more than just the GPU memory
    • NVLink 2.0 enables enhanced Host to GPU communication
    • LMS for deep learning from IBM enables seamless use of Host + GPU memory for improved performance
Caffe on POWER9

System configuration

IBM Power System AC922:
  • POWER9 with NVLink 2.0
  • 40 cores (2x 20-core chips), 2.25 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • RHEL 7.4 Power LE (POWER9)
  • CUDA 9.1 / cuDNN 7

2x Intel Xeon E5-2640 v4:
  • Xeon E5-2640 v4
  • 20 cores (2x 10-core chips) / 40 threads, 2.4 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • Ubuntu 16.04
  • CUDA 9.0 / cuDNN 7

Notes:

  • Results are based on IBM internal measurements running 1000 iterations of the Enlarged GoogLeNet model (mini-batch size = 5) on the Enlarged ImageNet dataset (2240 x 2240).
  • Software: IBM Caffe with LMS; source code: https://github.com/ibmsoe/caffe/tree/master-lms
  • Date of testing: November 26, 2017

IBM POWER9™ with Nvidia Tesla V100 delivers 35% more images/second on TensorFlow versus tested x86 systems

Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on IBM Power System AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0

For the systems and workload compared:

  • 35% more images processed per second vs tested x86 systems
  • ResNet50 testing on ILSVRC 2012 dataset (also known as ImageNet 2012)
    • Training on 1.2M images
    • Validation on 50K images
TensorFlow on POWER9

System configuration

IBM Power System AC922:
  • POWER9 with NVLink 2.0
  • 40 cores (2x 20-core chips), 2.25 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • RHEL 7.4 Power LE (POWER9)
  • TensorFlow 1.4.0 framework and HPM ResNet50

2x Intel Xeon E5-2640 v4:
  • Xeon E5-2640 v4
  • 20 cores (2x 10-core chips) / 40 threads, 2.4 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • Ubuntu 16.04
  • TensorFlow 1.4.0 framework and HPM ResNet50

Notes:

  • Results are based on IBM internal measurements running 1000 iterations of HPM ResNet50 on 1.2M images, with validation on 50K images, using the dataset from ILSVRC 2012 (also known as ImageNet 2012).
  • Software: TensorFlow 1.4.0 framework and HPM ResNet50 from https://github.com/tensorflow/benchmarks.git (commit f5d85aef), with the following parameters: batch size: 64 per GPU; iterations: 1100; data: ImageNet; local-parameter-device: gpu; variable-update: replicated
  • Date of testing: November 26, 2017
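The parameters listed in the notes map onto the tf_cnn_benchmarks script from the cited repository roughly as follows. This is a sketch, not the exact command used in the test: the flag spellings come from the upstream script, the GPU count matches the configuration below, and the data directory path is invented.

```shell
# Hedged sketch of a tf_cnn_benchmarks invocation with the parameters
# above; --data_dir would point at a local ImageNet copy (path invented).
python tf_cnn_benchmarks.py \
    --model=resnet50 \
    --num_gpus=4 \
    --batch_size=64 \
    --num_batches=1100 \
    --data_name=imagenet \
    --data_dir=/data/imagenet \
    --local_parameter_device=gpu \
    --variable_update=replicated
```

With `--variable_update=replicated`, each GPU keeps its own copy of the variables, so `--batch_size=64` is per GPU (256 images per step across the four V100s).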

Distributed Deep Learning: IBM POWER9™ with Nvidia Tesla V100 results in 2.3X more data processed on TensorFlow versus tested x86 systems

Maximize research productivity by training on more images in the same time with TensorFlow 1.4.0 running on a cluster of IBM Power System AC922 servers with Nvidia Tesla V100 GPUs connected via NVLink 2.0

For the systems and workload compared:

  • 2.3X more images processed per second vs tested x86 systems
  • PowerAI Distributed Deep Learning (DDL) library provides innovative distribution methods enabling AI frameworks to scale to multiple servers leveraging all attached GPUs
  • ResNet50 testing on ILSVRC 2012 dataset (also known as Imagenet 2012)
    • Training on 1.2M images
    • Validation on 50K images
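At its core, the synchronous data parallelism that DDL distributes across servers is a gradient all-reduce: every replica computes a gradient on its own shard of the batch, and all replicas then apply the same averaged gradient. A minimal single-process sketch of that step (the replica count and gradient values are invented for illustration):

```python
import numpy as np

def allreduce_average(grads):
    """Sketch of the gradient all-reduce at the heart of synchronous
    data-parallel training. A real DDL all-reduce runs across GPUs and
    nodes over NVLink and the network; here the 'replicas' are just
    rows of one array."""
    return np.mean(grads, axis=0)

# Four 'replicas' (one per GPU), each holding a gradient for a
# hypothetical 3-parameter model.
replica_grads = np.array([
    [0.2, 0.0, -0.4],
    [0.4, 0.2, -0.2],
    [0.0, 0.4, -0.6],
    [0.2, 0.2, -0.4],
])
avg = allreduce_average(replica_grads)  # identical on every replica
```

Because every replica ends each step with the same averaged gradient, the model stays consistent no matter how many servers and GPUs the work is spread across.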
TensorFlow on POWER9 cluster

System configuration

Four IBM Power System AC922 nodes, each with:
  • POWER9 with NVLink 2.0
  • 40 cores (2x 20-core chips), 2.25 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • RHEL 7.4 Power LE (POWER9)
  • TensorFlow 1.4.0 framework and HPM ResNet50

Four nodes of 2x Intel Xeon E5-2640 v4, each with:
  • Xeon E5-2640 v4
  • 20 cores (2x 10-core chips) / 40 threads, 2.4 GHz
  • 1024 GB memory
  • 4x Tesla V100 GPUs
  • Ubuntu 16.04
  • TensorFlow 1.4.0 framework and HPM ResNet50

Notes:

  • Results are based on IBM internal measurements running 5000 iterations of HPM ResNet50 with DDL on Power and 500 iterations of HPM ResNet50 on x86, on 1.2M images with validation on 50K images, using the dataset from ILSVRC 2012 (also known as ImageNet 2012).
  • Software: TensorFlow 1.4.0 framework and HPM ResNet50 from https://github.com/tensorflow/benchmarks.git (commit f5d85aef), with the following parameters: batch size: 64 per GPU; data: ImageNet; variable-update: distributed_replicated
  • Date of testing: December 2, 2017

Caffe/VGGNet on IBM Power System

For the systems and workload compared:

  • Power S822LC for High Performance Computing (HPC) with four P100 GPUs using Berkeley Vision and Learning Center (BVLC) Caffe is 17% faster compared to Intel Xeon E5-2640 v4 with eight M40 GPUs
  • Power S822LC for HPC with four P100 GPUs using IBM Caffe is 24% faster compared to Intel Xeon E5-2640 v4 with eight M40 GPUs

Graph using Berkeley Vision and Learning Center (BVLC) Caffe:

BVLC Caffe / VGGNet

Graph using IBM Caffe:

Caffe / VGGNet

System configuration

Power System S822LC for HPC:
  • POWER8, 20 cores, 3.9 GHz
  • 512 GB memory
  • 4x NVIDIA P100 GPUs
  • Ubuntu 16.04
  • CUDA 8.0.44 / cuDNN 5.1
  • BVLC Caffe 1.0.0-rc3 / ImageNet data

Competitor: Intel Xeon E5-2640 v4:
  • Xeon E5-2640 v4, 20 cores, 3.6 GHz
  • 512 GB memory
  • 8x NVIDIA M40 GPUs
  • Ubuntu 16.04
  • CUDA 8.0.44 / cuDNN 5.1
  • BVLC Caffe 1.0.0-rc3 / ImageNet data

Notes:

  • Competitor testing was done on 27-Sep-2016
  • Power S822LC testing was done on 04-Oct-2016

© IBM Corporation 2017

IBM, the IBM logo, ibm.com, POWER and POWER8 are trademarks of the International Business Machines Corp., registered in many jurisdictions worldwide. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Other product and service names may be the trademarks of IBM or other companies.

The content in this document (including any pricing references) is current as of July 22, 2015, and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION CONTAINED ON THIS WEBSITE IS PROVIDED ON AN “AS IS” BASIS WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

All information contained on this website is subject to change without notice. The information contained in this website does not affect or change IBM product specifications or warranties. IBM’s products are warranted according to the terms and conditions of the agreements under which they are provided. Nothing in this website shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties.

All information contained on this website was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

No licenses, expressed or implied, by estoppel or otherwise, to any intellectual property rights are granted by this website.