
IBM Power Systems

High Performance Computing (HPC) performance proof-points

Power Systems solutions deliver faster time to insight and offer accelerated performance for demanding HPC workloads.

GROMACS on IBM Power Systems

Achieve faster simulation using the Reaction-Field (RF) method on the IBM® Power® System AC922 server, which is based on IBM POWER9™ processor technology.

For the systems and workload compared:

  • IBM Power AC922 with four Tesla V100 GPUs is 1.76x faster than the previous-generation IBM Power System S822LC server with four Tesla P100 GPUs (a sample GPU-offloaded run is sketched below).
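For context, a GPU-offloaded GROMACS 2018 run of this kind is typically launched from the command line roughly as sketched below. This is a minimal illustration only: the input name (rf_benchmark) and the rank, thread, and GPU counts are placeholder assumptions, not the exact command used for the published result.

# Assumed layout: 4 thread-MPI ranks x 10 OpenMP threads, nonbonded work offloaded to GPUs 0-3
$ gmx mdrun -deffnm rf_benchmark -nb gpu -ntmpi 4 -ntomp 10 -gpu_id 0123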

System configuration

Power AC922 for HPC (with GPU):
  • IBM POWER9 with NVLink, 2.8 GHz, 44 cores
  • 1 TB memory
  • RHEL 7.5 for Power Little Endian (POWER9)
  • CUDA toolkit 10.0 / CUDA driver 410.37
  • NVIDIA Tesla V100 with NVLink GPU
  • NVIDIA NVLink 2.0
  • GNU 7.3.1 (IBM Advance Toolchain 11)

Power S822LC for HPC (with GPU):
  • IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads
  • 256 GB memory
  • RHEL 7.3
  • CUDA 8.0
  • NVIDIA Tesla P100 with NVLink GPU
  • NVIDIA NVLink 1.0
  • GNU 4.8.5 (OS default)

Notes:

  • Results on the IBM POWER9 system are based on IBM internal testing of GROMACS 2018.3, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs.
    • Date of testing: 30th November 2018
  • Results on the IBM POWER8® system are based on IBM internal testing of GROMACS 2016.3, benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100 GPUs.
    • Date of testing: 8th June 2017

Nanoscale Molecular Dynamics program (NAMD) on IBM Power Systems

For the systems and workload compared:

  • The GPU-accelerated NAMD application runs 2x faster on an IBM® Power® AC922 system compared to an IBM Power System S822LC system.

System configuration

Power AC922 for HPC (with GPU):
  • IBM POWER9 with NVLink, 2.8 GHz, 40 cores, 80 threads
  • 1 TB memory
  • RHEL 7.4 for Power Little Endian (POWER9)
  • CUDA toolkit 9.1 / CUDA driver 390.31
  • NVIDIA Tesla V100 with NVLink GPU
  • NVIDIA NVLink 2.0

Power S822LC for HPC (with GPU):
  • IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads
  • 256 GB memory
  • RHEL 7.3
  • CUDA 8.0
  • NVIDIA Tesla P100 with NVLink GPU
  • NVIDIA NVLink 1.0

Notes:

  • Results on the IBM POWER9™ system are based on IBM internal testing of NAMD 2.13 (Sandbox build dated 11th December 2017) and Charm 6.8.1, benchmarked on POWER9 processor-based systems installed with four NVIDIA Tesla V100 GPUs.
    • Test date: 16th Feb 2018
  • Results on the IBM POWER8® system are based on IBM internal testing of NAMD 2.12, benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100 GPUs.
    • Test date: 9th May 2017

POWER9 CORAL Systems – Summit: Oak Ridge National Laboratory (ORNL) reports 5-10X application performance with ¼ of the nodes versus Titan

According to ORNL, Summit is the next leap in leadership-class computing systems for open science.

  • ORNL reports 5-10X application performance with ¼ of the nodes vs Titan
  • Summit will deliver more than five times the computational performance of Titan’s 18,688 nodes, using only approximately 4,600 nodes.
  • Each Summit node will contain multiple IBM POWER9 CPUs and NVIDIA Volta GPUs all connected together with NVIDIA’s high-speed NVLink and a huge amount of memory.
  • Each node will have over half a terabyte of coherent memory (HBM “high bandwidth memory” + DDR4) addressable by all CPUs and GPUs, plus an additional 800 gigabytes of NVRAM.

Summit vs Titan

System configuration

Titan:
  • Application performance: baseline
  • Number of nodes: 18,688
  • Node performance: 1.4 TF/s
  • Memory per node: 32 GB DDR3 + 6 GB GDDR5
  • NV memory per node: 0
  • Total system memory: 710 TB
  • System interconnect (node injection bandwidth): Gemini (6.4 GB/s)
  • Interconnect topology: 3D torus
  • Processors: 1 AMD Opteron™, NVIDIA Kepler™
  • File system: 32 PB, 1 TB/s, Lustre®
  • Peak power consumption: 9 MW

Summit:
  • Application performance: 5-10x Titan
  • Number of nodes: ~4,600
  • Node performance: > 40 TF/s
  • Memory per node: 512 GB DDR4 + HBM
  • NV memory per node: 1,600 GB
  • Total system memory: > 10 PB (DDR4 + HBM + non-volatile)
  • System interconnect (node injection bandwidth): dual-rail EDR-IB (23 GB/s)
  • Interconnect topology: non-blocking fat tree
  • Processors: 2 IBM POWER9™, NVIDIA Volta™
  • File system: 250 PB, 2.5 TB/s, GPFS™
  • Peak power consumption: 15 MW


CPMD on IBM POWER9™ with NVLink 2.0 runs 2.9X faster than the tested x86 systems, reducing wait time and improving computational chemistry simulation execution time.

For the systems and workload compared:

  • IBM Power System AC922 delivers a 2.9X reduction in execution time compared to the tested x86 systems
  • IBM Power System AC922 delivers a 2.0X reduction in execution time compared to the prior-generation IBM Power System S822LC for HPC
  • POWER9 with NVLink 2.0 unlocks the performance of the GPU-accelerated version of CPMD by enabling lightning-fast CPU-GPU data transfers (a quick arithmetic check follows this list):
    • 3.3 TB of data movement required between CPU and GPU
    • ~70 seconds transfer time over NVLink 2.0
    • 300+ seconds transfer time over a traditional PCIe bus
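Those transfer times follow directly from the data volume and the effective link rates reported in the Notes below (roughly 50 GB/s over NVLink 2.0 and 10 GB/s over PCIe); a quick sketch of the arithmetic, assuming those rates:

$ echo "3300/50" | bc		# 3.3 TB (~3300 GB) at ~50 GB/s over NVLink 2.0: ~66 seconds
$ echo "3300/10" | bc		# 3.3 TB at ~10 GB/s over PCIe: ~330 seconds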

System configuration

IBM Power System AC922:
  • 40 cores (2 x 20c chips), POWER9 with NVLink 2.0
  • 2.25 GHz, 1024 GB memory
  • (4) Tesla V100 GPUs
  • Red Hat Enterprise Linux 7.4 for Power Little Endian (POWER9) with ESSL PRPQ
  • Spectrum MPI: PRPQ release, XLF: 15.16, CUDA 9.1

IBM Power System S822LC for HPC:
  • 20 cores (2 x 10c chips) / 160 threads, POWER8 with NVLink
  • 2.86 GHz, 256 GB memory
  • (4) Tesla P100 GPUs
  • RHEL 7.4 with ESSL 5.3.2.0
  • PE 2.2, XLF: 15.1, CUDA 8.0

2x Intel Xeon E5-2640 v4:
  • 20 cores (2 x 10c chips) / 40 threads
  • 2.4 GHz, 256 GB memory
  • (4) Tesla P100 GPUs
  • Ubuntu 16.04 with OpenBLAS 0.2.18
  • OpenMPI: 1.10.2, GNU-5.4.0, CUDA-8.0

Notes:

  • All results are based on running CPMD, a parallelized plane-wave/pseudopotential implementation of Density Functional Theory. A hybrid version of CPMD (MPI + OpenMP + GPU + streams) was used, and runs were made for a 256-water box with RANDOM initialization.
  • Results are reported in execution time (seconds). The effective measured data rate was 10 GB/s on the PCIe bus and 50 GB/s on NVLink 2.0.
  • Test date: November 27, 2017

GROMACS on IBM Power System S822LC

For the systems and workload compared:

  • The GPU-accelerated version of GROMACS runs 10.14x faster on an IBM® Power® System S822LC system than the CPU-only version.
(Chart: GROMACS on IBM Power System S822LC, 1.5M water benchmark)

System configuration

Power S822LC for HPC (with GPU):
  • IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 80 threads
  • 1000 GB memory
  • Ubuntu 16.04.2
  • CUDA 8.0 with driver 361.119
  • Four NVIDIA Tesla P100 GPUs with NVLink

Power S822LC for HPC (CPU-only):
  • IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 80 threads
  • 1000 GB memory
  • Ubuntu 16.04.2


Performance best practices for IBM Power System S822LC for HPC

To achieve peak performance, apply these three system and GPU settings:

$ sudo cpupower frequency-set -g performance	# Set the CPU frequency governor to performance
$ sudo nvidia-smi -pm ENABLED			# Enable GPU persistence mode
$ sudo nvidia-smi -ac 715,1480			# Set maximum GPU application clocks (memory,graphics in MHz)

After the validation runs, reset the GPU and CPU settings (if required) using these three commands:

$ sudo nvidia-smi -rac				# Reset GPU application clocks to their defaults
$ sudo nvidia-smi -pm DISABLED			# Disable GPU persistence mode
$ sudo cpupower frequency-set -g ondemand	# Restore the ondemand CPU frequency governor
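To confirm that the settings are in effect (and to record the defaults before changing them), the current CPU governor and GPU clocks can be inspected with the same tools; a quick sketch:

$ cpupower frequency-info -p		# Show the currently active CPU frequency governor/policy
$ nvidia-smi -q -d CLOCK		# Show current, application, and maximum GPU clocks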

For additional recommendations, read this article: Best practices and basic evaluation benchmarks: IBM Power System S822LC for High Performance Computing (HPC)

CPMD competitive performance on IBM Power System S822LC for HPC

For the systems and workload compared:

  • The CPU-only version of CPMD-4423 with a 128-water box is approximately 2x better on IBM® Power® System S822LC compared to Intel® Xeon® E5-2600 v4.
  • Performance on Power S822LC (with 2 P100 GPUs) is approximately 2X better than that observed on Intel Xeon E5-2600 v4.
  • CPU-GPU communication is up to 40% faster due to NVLink on Power S822LC compared to Intel Xeon E5-2600 v4 with PCIe (see the topology check below).
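The link type behind that communication gain can be verified on any node with the NVIDIA topology report: on a Power S822LC the GPU-to-CPU connections show up as NVLink, while a PCIe-attached x86 system reports PCIe hops. A quick check:

$ nvidia-smi topo -m		# Print the GPU/CPU interconnect matrix; NV# entries indicate NVLink connections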

System configuration

Power S822LC for HPC (8335-GTB):
  • 20-core, 3.9 GHz IBM POWER8®
  • 256 GB memory
  • (4) NVIDIA P100 GPUs, 16 GB HBM2
  • RHEL 7.3 (Maipo)
  • CUDA 8.0.53
  • XLF 15.1.5 / Spectrum MPI 10.1
  • LAPACK 3.5.0 / ESSL 5.5.0.0
  • CUDA 8.0

Competitor: Xeon E5-2640 v4:
  • 20-core, 3.40 GHz
  • 256 GB memory
  • (4) NVIDIA P100 GPUs, 16 GB HBM2
  • Ubuntu 16.04
  • CUDA 8.0.44
  • GFORTRAN 5.4 / OpenMPI 2.1.1
  • OpenBLAS 0.2.18
  • CUDA 8.0

Notes:

  • Test date: 17 February 2017

Genome alignment on IBM Power System S822LC for HPC

For the systems and workload compared:

Accelerate genome alignment on IBM Power System S822LC for High Performance Computing (HPC) with 2X better performance than an x86 accelerated solution:

  • Power S822LC for HPC with four Tesla P100 GPUs achieved 138 million base pairs per second
  • Intel Xeon processor E5-2640 v4 with four Tesla K80 GPUs achieved 69 million base pairs per second

Run your complete pipeline rather than wait for completion

  • Power S822LC for HPC delivered peak SOAP3-dp results with a processor utilization of 70% compared to 100% for the competition

System configuration

Power S822LC for HPC (8335-GTB):
  • 20-core, 3.9 GHz IBM POWER8®, 160 threads
  • 512 GB memory
  • 900 GB SATA HDD
  • Four P100-SXM2 GPUs
  • NVLink 1.0
  • SOAP3-dp
  • Ubuntu 16.04.1 LTS
  • CUDA 8.0

Competitor: Xeon E5-2640 v4:
  • 20-core, 2.40 GHz Xeon E5-2640 v4, 40 threads
  • 512 GB memory
  • 900 GB SATA HDD
  • Four K80 GPUs
  • PCIe Gen3
  • SOAP3-dp
  • Ubuntu 16.04 LTS
  • CUDA 8.0


Host to Device (H2D) Bandwidth

For the systems and workload compared:

  • Transfer data with up to 2.91X the CUDA Host-Device Bandwidth of x86 platforms
  • Support new applications with a faster interface from CPU to GPU

System configuration

Power S822LC for High Performance Computing:
  • (2) IBM POWER8 with NVLink, 2.86 GHz, 20 cores, 160 threads
  • 512 GB memory
  • (2) 1 TB 7200 RPM SATA HDD
  • NVIDIA Tesla P100 with NVLink GPU
  • NVIDIA NVLink
  • Ubuntu 16.04.1 LTS
  • CUDA 8.0

Xeon E5-2640 v4 competitor:
  • (2) Xeon E5-2640 v4 @ 2.40 GHz, 20 cores
  • 512 GB memory
  • (2) 800 GB Intel SSD DC S3510 Series 2.5″ SSD
  • NVIDIA Tesla K80 GPU (bandwidth test run on Tesla K80 device 0 of 0,1)
  • PCIe Gen3
  • Ubuntu 16.04 LTS
  • CUDA 8.0

Notes:

  • Results are based on IBM internal measurements running the CUDA Host to Device Bandwidth Test (a sample invocation is shown below)
  • Date of testing: 08/30/2016
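The test referenced above is the bandwidthTest utility shipped with the CUDA samples. A representative host-to-device invocation is sketched below; the build path and the pinned-memory/quick-mode options are illustrative assumptions, not necessarily the exact configuration used for the published measurement.

$ cd ~/NVIDIA_CUDA-8.0_Samples/1_Utilities/bandwidthTest && make	# Build the sample (default CUDA 8.0 samples install location assumed)
$ ./bandwidthTest --memory=pinned --mode=quick --htod --device=0	# Measure host-to-device bandwidth with pinned memory on GPU 0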

Lattice QCD Benchmark Performance on IBM Power System S822LC for High Performance Computing

For the systems and workload compared:

  • Run Lattice QCD BiCGStab solvers with, on average, 2.00X greater performance than x86 accelerated solutions

System configuration

Power S822LC for HPC:
  • (2) IBM POWER8 with NVLink, 2.86 GHz, 20 cores, 160 threads
  • 512 GB memory
  • (2) 1 TB 7200 RPM SATA HDD
  • (4) NVIDIA Tesla P100 with NVLink GPUs
  • NVIDIA NVLink
  • Ubuntu 16.04.1 LTS
  • CUDA 8.0

Xeon E5-2640 v4 competitor:
  • (2) Xeon E5-2640 v4 @ 2.40 GHz, 20 cores
  • 512 GB memory
  • (2) 800 GB Intel SSD DC S3510 Series 2.5″ SSD
  • (4) NVIDIA Tesla K80 GPUs
  • PCIe Gen3
  • Ubuntu 16.04 LTS
  • CUDA 8.0

Notes:

  • All results are based on running LatticeQCD BiCGStab Solvers and reported in GFLOPS
  • Date of testing: 08/12/2016

NAMD STMV Simulation

For the systems and workload compared:

  • Realize up to 10X faster time to solution running NAMD (a sample GPU launch is sketched after this list)
    • Up to 10X better performance with 4 Tesla GPUs
    • Achieve your desired performance with fewer systems
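For context, a multicore CUDA build of NAMD is typically launched against the STMV benchmark input roughly as sketched below; the thread count, device list, and file names are placeholder assumptions rather than the exact IBM benchmark command.

# Assumed layout: 40 worker threads with CPU affinity, offloading to the four GPUs
$ namd2 +p40 +setcpuaffinity +idlepoll +devices 0,1,2,3 stmv.namd > stmv.log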

System configuration

Power S822LC for HPC (with GPU):
  • IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads
  • 256 GB memory
  • RHEL 7.3
  • 1 TB 7200 RPM SATA HDD
  • CUDA 8.0
  • NVIDIA Tesla P100 with NVLink GPU
  • NVIDIA NVLink

Power S822LC for HPC (CPU only):
  • IBM POWER8 with NVLink, 4.023 GHz, 20 cores, 40 threads
  • 256 GB memory
  • RHEL 7.3
  • 1 TB 7200 RPM SATA HDD
  • CUDA 8.0

Notes:

  • Results are based on IBM internal testing of NAMD 2.12 benchmarked on POWER8 processor-based systems installed with four NVIDIA Tesla P100s
  • Date of testing: 05/09/2017

IBM Power System S812LC integer processing

SPECint_rate2006

For the systems and workload compared:

  • IBM Power System S812LC is 45% better than the best single processor Xeon E5-2650 V3 system.
(Graph: SPECint_rate2006, Power S812LC vs. best single-processor Xeon E5-2650 V3 system)

Notes:

  • Compared Power S812LC (2.92GHz, 1 processor, 10 cores, 40 threads) SPECint_rate result (642) with all published single processor Xeon E5-2650 V3 based systems (2.3GHz, 1 processor, 10 cores, 20 threads) as of June 20, 2016. For more details visit: www.spec.org.

IBM Power System S812LC floating point processing

SPECfp_rate2006

For the systems and workload compared:

  • IBM Power System S812LC is 31% better than the best single processor Xeon E5-2650 V3 system.
(Graph: SPECfp_rate2006, Power S812LC vs. best single-processor Xeon E5-2650 V3 system)

Notes:

  • Compared Power S812LC (2.92GHz, 1 processor, 10 cores, 40 threads) SPECfp_rate result (468) with all published single processor Xeon E5-2650 V3 based systems (2.3GHz, 1 processor, 10 cores, 20 threads) as of June 20, 2016. For more details visit: www.spec.org.

IBM Power System S822LC integer processing

SPECint_rate2006

For the systems and workload compared:

  • IBM Power System S822LC is 24% better than the best dual processor Xeon E5-2650 V3 system.
(Graph: SPECint_rate2006, Power S822LC vs. best dual-processor Xeon E5-2650 V3 system)

Notes:

  • Compared Power S822LC (2.92GHz, 2 processors, 20 cores, 80 threads) SPECint_rate result (1100) with all published dual processor Xeon E5-2650 V3 based systems (2.3GHz, 2 processors, 20 cores, 40 threads) as of June 20, 2016. For more details visit: www.spec.org.

IBM Power System S822LC floating point processing

SPECfp_rate2006

For the systems and workload compared:

  • IBM Power System S822LC is 24% better than the best dual processor Xeon E5-2650 V3 system.
(Graph: SPECfp_rate2006, Power S822LC vs. best dual-processor Xeon E5-2650 V3 system)

Notes:

  • Compared Power S822LC (2.92GHz, 2 processors, 20 cores, 80 threads) SPECfp_rate result (888) with all published dual processor Xeon E5-2650 V3 based systems (2.3GHz, 2 processors, 20 cores, 40 threads) as of June 20, 2016. For more details visit: www.spec.org.

© IBM Corporation 2017

IBM, the IBM logo, ibm.com, POWER and POWER8 are trademarks of the International Business Machines Corp., registered in many jurisdictions worldwide. Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. Other product and service names may be the trademarks of IBM or other companies.

The content in this document (including any pricing references) is current as of July 22, 2015 and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates

THE INFORMATION CONTAINED ON THIS WEBSITE IS PROVIDED ON AN “AS IS” BASIS WITHOUT ANY WARRANTY EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT.

In no event will IBM be liable for damages arising directly or indirectly from any use of the information contained in this document.

All information contained on this website is subject to change without notice. The information contained in this website does not affect or change IBM product specifications or warranties. IBM’s products are warranted according to the terms and conditions of the agreements under which they are provided. Nothing in this website shall operate as an express or implied license or indemnity under the intellectual property rights of IBM or third parties.

All information contained on this website was obtained in specific environments, and is presented as an illustration. The results obtained in other operating environments may vary.

No licenses, expressed or implied, by estoppel or otherwise, to any intellectual property rights are granted by this website.