This post summarizes how to run HPL (High Performance Linpack) with CUDA on a POWER9 AC922. It mostly follows the steps described on the site below.

How to compile HPL (LINPACK)

First, install the required packages as shown below.

[user1@ac922 files]$ sudo yum install openmpi openmpi-devel mpich openblas openblas-static mpich-3.0-devel atlas lapack

In addition to atlas, atlas-devel is also required; it ships on the Red Hat optional DVD. Since I did not have that DVD, I had no choice but to download the Fedora ppc64le builds of atlas-3.10.2 and atlas-devel-3.10.2 from rpmfind.net and install those instead.

[user1@ac922 files]$ wget https://rpmfind.net/linux/fedora-secondary/releases/25/Everything/ppc64le/os/Packages/a/atlas-3.10.2-12.fc24.ppc64le.rpm

[user1@ac922 files]$ wget https://rpmfind.net/linux/fedora-secondary/releases/25/Everything/ppc64le/os/Packages/a/atlas-devel-3.10.2-12.fc24.ppc64le.rpm

[user1@ac922 files]$ sudo rpm -Uvh atlas-3.10.2-12.fc24.ppc64le.rpm atlas-devel-3.10.2-12.fc24.ppc64le.rpm

Only liblapack.so.3.4.2 exists, not liblapack.so, so create the latter as a symbolic link.

[user1@ac922 files]$ sudo ln -s /usr/lib64/liblapack.so.3.4.2 /usr/lib64/liblapack.so
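The version suffix 3.4.2 is specific to the lapack package installed here. As a minimal sketch, assuming a different system might carry a different version, the helper below (ensure_unversioned_link is a hypothetical name, not part of any package) looks up the highest-versioned file and links to it only if the unversioned name is missing. It relies on GNU sort -V.

```shell
# Create DIR/NAME as a symlink to the highest-versioned DIR/NAME.* when NAME is missing.
# Hypothetical helper for illustration; run it with sudo for system directories.
ensure_unversioned_link() {
    dir=$1
    name=$2
    # Nothing to do when the unversioned name already exists.
    if [ -e "$dir/$name" ]; then
        return 0
    fi
    # Pick the highest version, e.g. liblapack.so.3.4.2 over liblapack.so.3.
    target=$(ls "$dir/$name".* 2>/dev/null | sort -V | tail -n 1)
    if [ -z "$target" ]; then
        echo "no versioned $name found in $dir" >&2
        return 1
    fi
    ln -s "$target" "$dir/$name"
}

# Usage on the system from this post:
# ensure_unversioned_link /usr/lib64 liblapack.so
```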

Next, download the source code for the CUDA version of HPL (it is the x86_64 version, but that is what we will use). It is available from the NVIDIA site below after logging in. You have to register to create a login ID, but registration is free.

https://developer.nvidia.com/rdp/assets/cuda-accelerated-linpack-linux64

After accepting the license terms there, you can download hpl-2.0_FERMI_v15.solitairetheme8. Despite the extension, it is a tar.gz file.

[user1@ac922 files]$ tar -zxvf hpl-2.0_FERMI_v15.solitairetheme8

[user1@ac922 files]$ cd hpl-2.0_FERMI_v15

First, cuda_dgemm.c is hard-wired to the Intel MKL library, so its source needs a few small changes.

[user1@ac922 hpl-2.0_FERMI_v15]$ vi ./src/cuda/cuda_dgemm.c

// handle2 = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle2 = dlopen ("libopenblas.so", RTLD_LAZY);

// dgemm_mkl = (void(*)())dlsym(handle, "dgemm");
dgemm_mkl = (void(*)())dlsym(handle, "dgemm_");

// handle = dlopen ("libmkl_intel_lp64.so", RTLD_LAZY);
handle = dlopen ("libopenblas.so", RTLD_LAZY);

// mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm");
mkl_dtrsm = (void(*)())dlsym(handle2, "dtrsm_");

Without these changes, run_linpack fails with runtime errors like the ones below. On ppc64le we use the open-source OpenBLAS instead of libmkl_intel_lp64, and OpenBLAS exports these routines under names with a trailing underscore (dgemm_, dtrsm_).

libmkl_intel_lp64.so: cannot open shared object file: No such file or directory
libopenblas.so.0: undefined symbol: dtrsm
libopenblas.so.0: undefined symbol: dgemm
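Before editing the dlsym calls, it can help to confirm which names the library actually exports. The following is a minimal sketch, assuming nm from binutils is installed and that its last output field is the symbol name; nm_lists_symbol is a hypothetical helper name, and the library path in the usage comment is an assumption about this system.

```shell
# Succeeds if SYMBOL is the last field of any line of `nm -D` output read from stdin.
# Hypothetical helper for illustration.
nm_lists_symbol() {
    awk -v sym="$1" '$NF == sym { found = 1 } END { exit !found }'
}

# Usage (library path is an assumption):
# nm -D --defined-only /usr/lib64/libopenblas.so | nm_lists_symbol dgemm_ && echo "dgemm_ is exported"
# nm -D --defined-only /usr/lib64/libopenblas.so | nm_lists_symbol dgemm  || echo "dgemm is not exported"
```

The match is exact, so dgemm does not match dgemm_, which mirrors the distinction that dlsym makes.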

Now edit Make.CUDA for the build. Not much changes just because the architecture is ppc64le. The long -L and -lmpich options below, used in place of libmpich.a, are again there because I did not have the Red Hat optional DVD, could not install mpich-devel, and therefore have no libmpich.a. Note in particular that -lopenblas replaces -lmkl.

[user1@ac922 hpl-2.0_FERMI_v15]$ vi Make.CUDA

#TOPdir = /home/mfatica/hpl-2.0_FERMI_v15
TOPdir = /home/user1/files/hpl-2.0_FERMI_v15

#MPdir = /opt/intel/mpi/3.0
#MPinc = -I$(MPdir)/include64
#MPlib = $(MPdir)/lib64/libmpi.a
#MPlib = $(MPdir)/lib64/libmpich.a
MPdir = /usr/lib64/openmpi
MPinc = -I /usr/include/openmpi-ppc64le
MPlib = -L /usr/lib64/openmpi/lib -lmpi -L /usr/lib64/mpich/lib -lmpich

#LAdir = $(TOPdir)/../../lib/em64t
#LAdir = /share/apps/intel/mkl/10.2.4.032/libem64t
#LAinc =
# CUDA
#LAlib = -L /home/cuda/Fortran_Cuda_Blas -ldgemm -L/usr/local/cuda/lib -lcublas -L$(LAdir) -lmkl -lguide -lpthread
LAdir = /usr/lib64
LAinc = -I /usr/include/openblas -I /usr/include
LAlib = ${LAdir}/libopenblas.a
LAlib = -L $(TOPdir)/src/cuda -ldgemm -L /usr/lib64/atlas -lsatlas -ltatlas -L /usr/local/cuda-9.1/targets/ppc64le-linux/lib/stubs -lcuda -lcublas -L /usr/local/cuda-9.1/lib64 -lcudart -L$(LAdir) -lpthread -lopenblas

#CC = mpicc
CC = /usr/lib64/openmpi/bin/mpicc

Now set the environment variables as shown below and run make arch=CUDA; the build goes straight through.

[user1@ac922 hpl-2.0_FERMI_v15]$ export PATH=/usr/lib64/openmpi/bin:$PATH
[user1@ac922 CUDA]$ export LD_LIBRARY_PATH=/usr/lib64/openmpi/lib:/usr/lib64/mpich/lib:$LD_LIBRARY_PATH

[user1@ac922 hpl-2.0_FERMI_v15]$ make arch=CUDA

/usr/lib64/openmpi/bin/mpicc -DAdd__ -DF77_INTEGER=int -DStringSunStyle -DCUDA -I/home/user1/files/hpl-2.0_FERMI_v15/include -I/home/user1/files/hpl-2.0_FERMI_v15/include/CUDA -I /usr/include/openblas -I /usr/include -I /usr/include/openmpi-ppc64le -I/usr/local/cuda/include -fomit-frame-pointer -O3 -funroll-loops -W -Wall -fopenmp -o /home/user1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl HPL_pddriver.o HPL_pdinfo.o HPL_pdtest.o /home/user1/files/hpl-2.0_FERMI_v15/lib/CUDA/libhpl.a -L /home/user1/files/hpl-2.0_FERMI_v15/src/cuda -ldgemm -L /usr/lib64/atlas -lsatlas -ltatlas -L /usr/local/cuda-9.1/targets/ppc64le-linux/lib/stubs -lcuda -lcublas -L /usr/local/cuda-9.1/lib64 -lcudart -L/usr/lib64 -lpthread -L /usr/lib64/openmpi/lib -lmpi -L /usr/lib64/mpich/lib -lmpich
make TOPdir=/home/user1/files/hpl-2.0_FERMI_v15 /home/user1/files/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat
make[3]: Entering directory `/home/user1/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
make[3]: `/home/user1/files/hpl-2.0_FERMI_v15/bin/CUDA/HPL.dat' is up to date.
make[3]: Leaving directory `/home/user1/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
touch dexe.grd
make[2]: Leaving directory `/home/user1/files/hpl-2.0_FERMI_v15/testing/ptest/CUDA'
make[1]: Leaving directory `/home/user1/files/hpl-2.0_FERMI_v15'

The executable is created as xhpl under bin/CUDA, as shown below.

[user1@ac922 hpl-2.0_FERMI_v15]$ cd bin/CUDA
[user1@ac922 CUDA]$ ls -l
total 264
-rw-r--r--. 1 user1 user1 1344 Jul 17 2012 HPL.dat
-rw-r--r--. 1 user1 user1 1333 Jul 17 2012 HPL.dat_example
-rw-r--r--. 1 user1 user1 6816 Jul 17 2012 output_example
-rwxr-xr-x. 1 user1 user1 607 Jul 17 2012 run_linpack
-rwxrwxr-x. 1 user1 user1 284552 Mar 22 17:56 xhpl

You do not run xhpl directly; instead, run the provided run_linpack script. Only HPL_DIR needs to be changed in it.

[user1@ac922 CUDA]$ vi run_linpack

#HPL_DIR=/home/mfatica/hpl-2.0_FERMI_v15
HPL_DIR=/home/user1/files/hpl-2.0_FERMI_v15

Next, edit HPL.dat, which serves as the input file. See the URL below for guidance.

http://www.netlib.org/benchmark/hpl/tuning.html

My goal was only to confirm that it runs, so for now I ran it with small problem sizes, as below.

[user1@ac922 CUDA]$ vi HPL.dat
HPL Linpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
6 device out (6=stdout,7=stderr,file)
3 # of problems sizes (N)
300 600 1000 Ns
5 # of NBs
768 1024 1152 1280 1536 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
2 # of process grids (P x Q)
1 1 Ps
1 1 Qs
16.0 threshold
3 # of panel fact
0 1 2 PFACTs (0=left, 1=Crout, 2=Right)
4 # of recursive stopping criterium
1 2 4 8 NBMINs (>= 1)
3 # of panels in recursion
2 3 4 NDIVs
3 # of recursive panel fact.
0 1 2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
1 SWAP (0=bin-exch,1=long,2=mix)
60 swapping threshold
1 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

Now run it.

[user1@ac922 CUDA]$ ./run_linpack
================================================================================
HPLinpack 2.0 -- High-Performance Linpack benchmark -- September 10, 2008
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N : 300 600 1000
NB : 768 1024 1152 1280 1536
PMAP : Row-major process mapping
P : 1 1
Q : 1 1
PFACT : Left Crout Right
NBMIN : 1 2 4 8
NDIV : 2 3 4
RFACT : Left Crout Right
BCAST : 1ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : no-transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0

Finished 3240 tests with the following results:
3240 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================

As you can see, it finished without any problems.
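The figure of 3240 tests in the summary is simply the number of parameter combinations listed in HPL.dat: 3 Ns x 5 NBs x 2 process grids x 3 PFACTs x 4 NBMINs x 3 NDIVs x 3 RFACTs x 1 BCAST x 1 DEPTH = 3240. A minimal sketch that computes this from the file is below. It assumes the standard HPL.dat layout used above, where the counts sit on lines 5, 7, 10, 14, 16, 18, 20, 22 and 24; count_hpl_tests is a hypothetical helper name.

```shell
# Multiply the "# of ..." counts of an HPL.dat file to get the number of test runs.
# Assumes the standard layout: counts on lines 5, 7, 10, 14, 16, 18, 20, 22, 24.
count_hpl_tests() {
    awk 'BEGIN { total = 1 }
         NR == 5 || NR == 7 || NR == 10 || NR == 14 || NR == 16 ||
         NR == 18 || NR == 20 || NR == 22 || NR == 24 { total *= $1 }
         END { print total }' "$1"
}

# Usage, from bin/CUDA:
# count_hpl_tests HPL.dat
```

Checking this number before a long run is a cheap way to notice that a sweep over many parameters will take far longer than a single tuned configuration.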

During the run, only one CPU core is used at 100%, as shown below.

x CPU Utilisation qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqx
x—————————+————————————————-+ x
xCPU User% Sys% Wait% Idle|0 |25 |50 |75 100| x
x 1 0.0 0.0 0.0 100.0| > | x
x 2 0.0 0.0 0.0 100.0| > | x
x 3 0.0 0.0 0.0 100.0| > | x
x 4 0.0 0.0 0.0 100.0| > | x
x 5 0.0 0.0 0.0 100.0| > | x
x 6 0.0 0.0 0.0 100.0| > | x
x 7 0.0 0.0 0.0 100.0| > | x
x 8 0.0 0.0 0.0 100.0| > | x
x 9 0.0 0.0 0.0 100.0| > | x
x 10 0.0 0.0 0.0 100.0| > | x
x 11 0.0 0.0 0.0 100.0| > | x
x 12 98.6 1.4 0.0 0.0|UUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUUU> x
x 13 0.0 0.0 0.0 100.0| > | x
x 14 0.0 0.0 0.0 100.0| > | x
x 15 0.0 0.0 0.0 100.0| > | x
x 16 0.0 0.0 0.0 100.0| > | x
x 17 0.0 0.0 0.0 100.0| > | x
x 18 0.0 0.0 0.0 100.0| > | x

GPU utilization is not high either, only occasionally reaching 1%. HPL.dat needs tuning.

Fri Mar 23 11:25:11 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.36                 Driver Version: 387.36                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
| N/A   38C    P0    52W / 300W |   2534MiB / 16128MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
| N/A   40C    P0    37W / 300W |     18MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
| N/A   37C    P0    36W / 300W |     18MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
| N/A   41C    P0    36W / 300W |     18MiB / 16128MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     68175      C   ...1/files/hpl-2.0_FERMI_v15/bin/CUDA/xhpl  2516MiB |
+-----------------------------------------------------------------------------+
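One likely reason for the low utilization is that the problem sizes in this test (N = 300 to 1000) are tiny, and in most cases smaller than the block sizes NB (768 to 1536). A commonly quoted rule of thumb, which is not something measured in this post, is to pick N so that the N x N matrix of 8-byte doubles fills about 80% of the available memory, rounded down to a multiple of NB. The sketch below only does that arithmetic; suggest_hpl_n is a hypothetical helper, the 0.8 fraction and the 64 GiB figure are placeholder assumptions, and which memory pool is the right bound for this CUDA build should be checked against the HPL tuning notes.

```shell
# Suggest a problem size N for HPL.dat from a memory budget.
# Usage: suggest_hpl_n MEM_MIB NB [FRACTION]   (FRACTION defaults to 0.8)
# Hypothetical helper; the rule of thumb is an assumption, not a measured result.
suggest_hpl_n() {
    awk -v mem_mib="$1" -v nb="$2" -v frac="${3:-0.8}" 'BEGIN {
        # An N x N matrix of 8-byte doubles should fit in frac * memory.
        n = int(sqrt(frac * mem_mib * 1024 * 1024 / 8))
        # Round down to a multiple of the block size NB.
        print int(n / nb) * nb
    }'
}

# Placeholder example: 64 GiB of memory and NB = 1024.
suggest_hpl_n 65536 1024
```

With a single N and a single NB in HPL.dat, the run also drops from thousands of small tests to one large one, which is closer to how the benchmark is normally used.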
