Overview

Skill Level: Intermediate

Steps to set up and configure NVIDIA Volta GPUs with CUDA 10.x on Ubuntu 18.04.x running on an IBM Power System AC922.

Ingredients

Operating system (OS) Setup

CUDA-10.1 toolkit

Step-by-step

  1. Operating system (OS) Setup

    Set up the machine:

    • Firmware update on AC922.
    • Install Ubuntu 18.04.2 on POWER9 DD2.2 hardware.
    • Check for NVIDIA Volta GPU devices.

    Firmware update on AC922:

    Make sure the latest POWER9 AC922 (920) firmware is applied; OP9_v2.0.10-2.22 was the latest firmware available at the time of writing this recipe.

    Install Ubuntu 18.04.2 on POWER9 DD2.2 hardware. The netboot installer images for ppc64el are available at:

    http://ports.ubuntu.com/ubuntu-ports/dists/bionic-updates/main/installer-ppc64el/current/images/netboot/

    root@xxxxx:~# cat /etc/os-release
    NAME="Ubuntu"
    VERSION="18.04.2 LTS (Bionic Beaver)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 18.04.2 LTS"
    VERSION_ID="18.04"
    HOME_URL="https://www.ubuntu.com/"
    SUPPORT_URL="https://help.ubuntu.com/"
    BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
    PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
    VERSION_CODENAME=bionic
    UBUNTU_CODENAME=bionic

     

    Check for NVIDIA Volta GPU devices:

    root@xxxxx:~# lspci | grep -i nvidia
    0004:04:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
    0004:05:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
    0035:03:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
    0035:04:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
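The device check above can be scripted. A minimal sketch, with the sample lines taken from the lspci output above (on the live system, pipe `lspci` directly instead):

```shell
# Count NVIDIA 3D controllers; an AC922 with four V100 GPUs should report 4.
# Sample lspci lines inlined for illustration -- on the host use:
#   count=$(lspci | grep -ci nvidia)
lspci_out='0004:04:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
0004:05:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
0035:03:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)
0035:04:00.0 3D controller: NVIDIA Corporation Device 1db5 (rev a1)'
count=$(printf '%s\n' "$lspci_out" | grep -ci nvidia)
echo "Detected $count NVIDIA devices"
```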

  2. Setup and configure CUDA-10.1 toolkit

    Download and install the CUDA 10.1 toolkit; the installer package can be downloaded from the NVIDIA CUDA downloads page.

    $ dpkg -i cuda-repo-ubuntu1804-10-1-local-10.1.91-418.29_1.0-1_ppc64el.deb
    $ apt-get update
    $ apt-get install csh numactl openssh-server build-essential libx11-dev freeglut3 freeglut3-dev libxi-dev libxi6 libxmu-dev libxmu6 libglew-dev
    $ apt-get install cuda
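After installation, the CUDA binaries and libraries are typically not on the default search paths. A sketch of the usual post-install environment setup, assuming the default /usr/local/cuda-10.1 install prefix (add these lines to ~/.bashrc to make them persistent):

```shell
# Standard post-install actions: put nvcc and the CUDA runtime libraries on
# the search paths. Paths assume the default CUDA 10.1 install prefix.
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
```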

     

    On POWER9 systems, the NVIDIA persistence daemon must be started at boot time. Edit the systemd unit file:

    vi /lib/systemd/system/nvidia-persistenced.service

    The file should contain the following; otherwise replace its contents with:

    [Unit]
    Description=NVIDIA Persistence Daemon
    Wants=syslog.target

    [Service]
    Type=forking
    PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
    Restart=always
    ExecStart=/usr/bin/nvidia-persistenced --verbose
    ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

    [Install]
    WantedBy=multi-user.target

     

    Enable persistence daemon service:

    systemctl enable nvidia-persistenced

    systemctl start nvidia-persistenced
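Once the daemon is running, persistence mode should read "Enabled" for every GPU (the "On" column in nvidia-smi output). A minimal check sketch; the sample output is inlined here, and on the AC922 the real query would be `nvidia-smi --query-gpu=persistence_mode --format=csv,noheader`:

```shell
# One line per GPU; all four should read "Enabled". Sample output inlined --
# on the host, capture it with:
#   modes=$(nvidia-smi --query-gpu=persistence_mode --format=csv,noheader)
modes='Enabled
Enabled
Enabled
Enabled'
if printf '%s\n' "$modes" | grep -qv '^Enabled$'; then
  echo "persistence mode is NOT enabled on at least one GPU"
else
  echo "persistence mode enabled on all GPUs"
fi
```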

     

    Reboot the system to apply all changes.

     

    Validation:

    Check that the Volta GPU devices are discovered by Ubuntu 18.04.2 on the host.

    Run the CUDA deviceQuery sample and make sure all GPU devices are discovered.

     

    root@ltc-wspoon12:~# nvidia-smi
    Mon Feb 18 13:11:16 2019
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 418.29       Driver Version: 418.29       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla V100-SXM2...  On   | 00000004:04:00.0 Off |                    0 |
    | N/A   41C    P0    46W / 300W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   1  Tesla V100-SXM2...  On   | 00000004:05:00.0 Off |                    0 |
    | N/A   42C    P0    41W / 300W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   2  Tesla V100-SXM2...  On   | 00000035:03:00.0 Off |                    0 |
    | N/A   39C    P0    42W / 300W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    |   3  Tesla V100-SXM2...  On   | 00000035:04:00.0 Off |                    0 |
    | N/A   41C    P0    41W / 300W |      0MiB / 32480MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

     

    Device Discovery:

    root@ltc-wspoon12:/usr/local/cuda/samples/bin/ppc64le/linux/release# ./deviceQuery
    deviceQuery Starting…

    CUDA Device Query (Runtime API) version (CUDART static linking)

    Detected 4 CUDA Capable device(s)

    Device 0: "Tesla V100-SXM2-32GB"
    CUDA Driver Version / Runtime Version 10.1 / 10.1
    CUDA Capability Major/Minor version number: 7.0
    Total amount of global memory: 32256 MBytes (33822867456 bytes)
    (80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
    GPU Max Clock rate: 1530 MHz (1.53 GHz)
    Memory Clock rate: 877 MHz
    Memory Bus Width: 4096-bit
    L2 Cache Size: 6291456 bytes
    Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
    Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 65536
    Warp size: 32
    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
    Maximum memory pitch: 2147483647 bytes
    Texture alignment: 512 bytes
    Concurrent copy and kernel execution: Yes with 4 copy engine(s)
    Run time limit on kernels: No
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Enabled
    Device supports Unified Addressing (UVA): Yes
    Device supports Compute Preemption: Yes
    Supports Cooperative Kernel Launch: Yes
    Supports MultiDevice Co-op Kernel Launch: Yes
    Device PCI Domain ID / Bus ID / location ID: 4 / 4 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 1: "Tesla V100-SXM2-32GB"
    CUDA Driver Version / Runtime Version 10.1 / 10.1
    CUDA Capability Major/Minor version number: 7.0
    Total amount of global memory: 32256 MBytes (33822867456 bytes)
    (80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
    GPU Max Clock rate: 1530 MHz (1.53 GHz)
    Memory Clock rate: 877 MHz
    Memory Bus Width: 4096-bit
    L2 Cache Size: 6291456 bytes
    Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
    Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 65536
    Warp size: 32
    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
    Maximum memory pitch: 2147483647 bytes
    Texture alignment: 512 bytes
    Concurrent copy and kernel execution: Yes with 4 copy engine(s)
    Run time limit on kernels: No
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Enabled
    Device supports Unified Addressing (UVA): Yes
    Device supports Compute Preemption: Yes
    Supports Cooperative Kernel Launch: Yes
    Supports MultiDevice Co-op Kernel Launch: Yes
    Device PCI Domain ID / Bus ID / location ID: 4 / 5 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 2: "Tesla V100-SXM2-32GB"
    CUDA Driver Version / Runtime Version 10.1 / 10.1
    CUDA Capability Major/Minor version number: 7.0
    Total amount of global memory: 32256 MBytes (33822867456 bytes)
    (80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
    GPU Max Clock rate: 1530 MHz (1.53 GHz)
    Memory Clock rate: 877 MHz
    Memory Bus Width: 4096-bit
    L2 Cache Size: 6291456 bytes
    Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
    Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 65536
    Warp size: 32
    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
    Maximum memory pitch: 2147483647 bytes
    Texture alignment: 512 bytes
    Concurrent copy and kernel execution: Yes with 4 copy engine(s)
    Run time limit on kernels: No
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Enabled
    Device supports Unified Addressing (UVA): Yes
    Device supports Compute Preemption: Yes
    Supports Cooperative Kernel Launch: Yes
    Supports MultiDevice Co-op Kernel Launch: Yes
    Device PCI Domain ID / Bus ID / location ID: 53 / 3 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

    Device 3: "Tesla V100-SXM2-32GB"
    CUDA Driver Version / Runtime Version 10.1 / 10.1
    CUDA Capability Major/Minor version number: 7.0
    Total amount of global memory: 32256 MBytes (33822867456 bytes)
    (80) Multiprocessors, ( 64) CUDA Cores/MP: 5120 CUDA Cores
    GPU Max Clock rate: 1530 MHz (1.53 GHz)
    Memory Clock rate: 877 MHz
    Memory Bus Width: 4096-bit
    L2 Cache Size: 6291456 bytes
    Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
    Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
    Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
    Total amount of constant memory: 65536 bytes
    Total amount of shared memory per block: 49152 bytes
    Total number of registers available per block: 65536
    Warp size: 32
    Maximum number of threads per multiprocessor: 2048
    Maximum number of threads per block: 1024
    Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
    Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
    Maximum memory pitch: 2147483647 bytes
    Texture alignment: 512 bytes
    Concurrent copy and kernel execution: Yes with 4 copy engine(s)
    Run time limit on kernels: No
    Integrated GPU sharing Host Memory: No
    Support host page-locked memory mapping: Yes
    Alignment requirement for Surfaces: Yes
    Device has ECC support: Enabled
    Device supports Unified Addressing (UVA): Yes
    Device supports Compute Preemption: Yes
    Supports Cooperative Kernel Launch: Yes
    Supports MultiDevice Co-op Kernel Launch: Yes
    Device PCI Domain ID / Bus ID / location ID: 53 / 4 / 0
    Compute Mode:
    < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
    > Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU1) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU2) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU0) -> Tesla V100-SXM2-32GB (GPU3) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU0) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU2) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU1) -> Tesla V100-SXM2-32GB (GPU3) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU0) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU1) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU2) -> Tesla V100-SXM2-32GB (GPU3) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU0) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU1) : Yes
    > Peer access from Tesla V100-SXM2-32GB (GPU3) -> Tesla V100-SXM2-32GB (GPU2) : Yes

    deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 4
    Result = PASS
    root@ltc-wspoon12:/usr/local/cuda/samples/bin/ppc64le/linux/release#
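For scripted validation, the deviceQuery summary can be parsed for the PASS verdict and the device count. A sketch, with the summary lines from the run above inlined (on the host, capture the real output of ./deviceQuery instead):

```shell
# Parse the deviceQuery summary: succeed only on "Result = PASS" and the
# expected number of devices. Summary lines inlined from the run above.
summary='deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.1, CUDA Runtime Version = 10.1, NumDevs = 4
Result = PASS'
ndev=$(printf '%s\n' "$summary" | sed -n 's/.*NumDevs = \([0-9]*\).*/\1/p')
if printf '%s\n' "$summary" | grep -q 'Result = PASS' && [ "$ndev" -eq 4 ]; then
  echo "deviceQuery validation OK ($ndev GPUs)"
else
  echo "deviceQuery validation FAILED"
fi
```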
