Overview

Skill Level: Intermediate

This recipe walks you through installing nvidia-docker 2.0 (NVIDIA's framework for exposing GPUs to containers), configuring the new Kubernetes device plugin to expose those GPUs to your cluster, and creating a sample job for your new environment.

Ingredients

  • Ubuntu 16.04 for ppc64le on a POWER8 system
  • NVIDIA's drivers installed on the host

See https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0) for version information.

  • Docker:

Install the latest release using the following instructions: https://docs.docker.com/install/linux/docker-ce/ubuntu/#os-requirements
When adding the repository, make sure you copy/paste from the "IBM Power (ppc64le)" tab.

  • Kubernetes 1.10:

For a guide on configuring Kubernetes, you can use https://kubernetes.io/docs/setup/independent/create-cluster-kubeadm.

Step-by-step

  1. Install nvidia-docker 2.0 from NVIDIA's repository

    Use the following instructions from NVIDIA: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)

    You don’t need to pin your version of docker when installing.

    Be sure to back up your docker engine config and service files before installing nvidia-docker 2.0. The nvidia-docker package will add nvidia-docker as a runtime by modifying the config file.
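
    For reference, the runtime entry the package adds to /etc/docker/daemon.json looks roughly like this (a sketch; the exact file contents and the binary path may differ on your system):

    {
        "runtimes": {
            "nvidia": {
                "path": "/usr/bin/nvidia-container-runtime",
                "runtimeArgs": []
            }
        }
    }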

    Verify that the runtime has been added correctly:

    $ docker info | grep Runtime
    Runtimes: nvidia runc

  2. Set default runtime to nvidia

    Edit your service file to set the following flag on the dockerd command line: --default-runtime=nvidia. This could be in your main service file or in a drop-in file, e.g. /etc/systemd/system/docker.service.d/override.conf.
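
    As a sketch, a systemd drop-in setting the flag might look like the following (the ExecStart line assumes the stock Ubuntu unit starts dockerd with -H fd://; adjust it to match your own unit):

    # /etc/systemd/system/docker.service.d/override.conf
    [Service]
    ExecStart=
    ExecStart=/usr/bin/dockerd -H fd:// --default-runtime=nvidia

    Reload systemd and restart Docker for the change to take effect:

    $ sudo systemctl daemon-reload
    $ sudo systemctl restart docker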

    Verify that the default runtime has been set to nvidia:

    $ docker info | grep Default
    Default Runtime: nvidia

    Note: There are multiple ways to configure this. For other options, refer to NVIDIA's documentation here: https://github.com/nvidia/nvidia-container-runtime#docker-engine-setup

  3. Configure the Kubernetes device plugin

    Follow the instructions from the kubernetes project here: https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#v1-8-onwards

    If you are using a Kubernetes version earlier than 1.10, set --feature-gates="DevicePlugins=true" on the kubelet (https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/#deploying-nvidia-gpu-device-plugin).

    When deploying, you'll use the manifest at https://raw.githubusercontent.com/nvidia/k8s-device-plugin/v1.10/nvidia-device-plugin.yml.

    Create the device plugin DaemonSet so that each node in your cluster exposes its GPUs to the kubelet:

    $ kubectl create -f nvidia-device-plugin.yml
    daemonset "nvidia-device-plugin-daemonset" created
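
    To confirm that the GPUs are now schedulable, you can check that your node advertises the nvidia.com/gpu resource; it should appear under both Capacity and Allocatable:

    $ kubectl describe nodes | grep nvidia.com/gpu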


  4. Deploy pod with exposed GPU

    Following https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus, first build your image:

    $ curl https://raw.githubusercontent.com/kubernetes/kubernetes/v1.7.11/test/images/nvidia-cuda/Dockerfile -o nvidia-cuda-vector_Dockerfile

    Edit the FROM line to read: FROM nvidia/cuda-ppc64le:9.2-devel-ubuntu16.04
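
    For example, that edit can be made with sed (assuming the file name used above):

    $ sed -i 's|^FROM .*|FROM nvidia/cuda-ppc64le:9.2-devel-ubuntu16.04|' nvidia-cuda-vector_Dockerfile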

    $ docker build -f nvidia-cuda-vector_Dockerfile -t cuda-vector-add:v0.1 .
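
    The next command assumes a pod spec named cuda-vector-pod.yml. A minimal sketch, based on the example in the Kubernetes scheduling-gpus page but pointing at the image you just built, could look like this; the nvidia.com/gpu limit is what requests a GPU from the scheduler:

    apiVersion: v1
    kind: Pod
    metadata:
      name: cuda-vector-add
    spec:
      restartPolicy: OnFailure
      containers:
        - name: cuda-vector-add
          image: cuda-vector-add:v0.1
          resources:
            limits:
              nvidia.com/gpu: 1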

    Deploy your application as a pod:

    $ kubectl create -f cuda-vector-pod.yml
    pod "cuda-vector-add" created

    Find the nvidia-device-plugin-daemonset container on your node and confirm that the plugin loaded properly, e.g.:

    $ docker logs k8s_nvidia-device-plugin-ctr_nvidia-device-plugin-daemonset-bqdnk_kube-system_d447ec3b-5ddd-11e8-94a4-98be9405a2a4_0
    2018/05/22 16:33:24 Loading NVML
    2018/05/22 16:33:24 Fetching devices.
    2018/05/22 16:33:24 Starting FS watcher.
    2018/05/22 16:33:24 Starting OS watcher.
    2018/05/22 16:33:24 Starting to serve on /var/lib/kubelet/device-plugins/nvidia.sock
    2018/05/22 16:33:24 Registered device plugin with Kubelet

    View the logs of your test job, for example:

    $ kubectl logs cuda-vector-add
    [Vector addition of 50000 elements]
    Copy input data from the host memory to the CUDA device
    CUDA kernel launch with 196 blocks of 256 threads
    Copy output data from the CUDA device to the host memory
    Test PASSED
    Done

  5. Further Reading

    For more on how you can control how GPUs are exposed inside containers, see https://github.com/nvidia/nvidia-container-runtime#environment-variables-oci-spec. These environment variables are set in the images provided by NVIDIA.
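
    For instance, the runtime honors NVIDIA_VISIBLE_DEVICES, so you can restrict a container to the first GPU like this (a quick smoke test, assuming the CUDA image used earlier; nvidia-smi should report only GPU 0):

    $ docker run --rm -e NVIDIA_VISIBLE_DEVICES=0 nvidia/cuda-ppc64le:9.2-devel-ubuntu16.04 nvidia-smi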
