Authors: Xun Pan, Zhaohui Ding

IBM Spectrum LSF is a mature, sophisticated workload manager for HPC. It runs several daemons on each node to manage nodes and jobs. Docker can simplify the deployment and management of LSF cluster services, which makes it easier for third-party container management infrastructures such as IBM Cloud Private (“ICP”) to manage LSF as a containerized service. For more information, see the IBM Cloud Private link at the end of this article.

By default, ICP creates a Docker container to run the LSF daemons on each node. These containers communicate with each other through the container network. Each container is treated as an LSF host, and jobs run inside that same container; see Figure 1(a). This article introduces a different usage: the LSF daemons still run in a container on each host, but users’ jobs run in independent containers on that host; see Figure 1(b).

Figure 1. The containerized LSF Daemons and Batch Jobs

The usage presented in Figure 1(b) has the following benefits:

  • LSF daemons and jobs are isolated from each other
  • LSF can use the full resources of the host for high-performance computing
  • Applications packaged as Docker images can run as Docker jobs, which simplifies deployment

However, Docker’s isolation exposes only a limited view of resources inside the container by default, so LSF daemons can detect only the resources within their own container. To manage the resources of the entire host, the LSF daemons need a complete view of the node’s resources. This article describes the steps to make the LSF daemons work properly in this setup.

Run LSF in Docker Container

LSF can be installed in a directory shared by all hosts in the cluster. Docker starts the LSF daemons in a container by mounting the installation directory, and the daemon binaries are started from the mounted path. This setup makes it easy to run LSF daemons in the container and to manage the LSF configuration in one central place.

Configure the Container for LSF Daemons

1. Account Mapping

The LSF administrator account must exist in the container. You can create the “lsfadmin” account and user group in the Docker image, or mount your own passwd and group files into the container with the “docker run” options “-v `pwd`/passwd:/etc/passwd -v /etc/group:/etc/group”. The following is an example of the passwd file.

$ cat ./passwd
root:x:0:0:root:/root:/bin/bash
lsfadmin:x:100001:100001:::
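
Before starting the daemon container, it can help to confirm that the account files you plan to mount are consistent with each other. The following is a minimal sketch, not part of LSF: check_account is a hypothetical helper name, and it assumes the standard colon-separated passwd and group formats. It checks that the user exists in the passwd file and that the user's primary GID is defined in the group file.

```shell
#!/bin/sh
# check_account is a hypothetical helper, not an LSF or Docker command.
# Usage: check_account USER PASSWD_FILE GROUP_FILE
check_account() {
    user=$1
    passwd_file=$2
    group_file=$3

    # Field 4 of a passwd entry is the primary GID.
    gid=$(awk -F: -v u="$user" '$1 == u { print $4; exit }' "$passwd_file")
    if [ -z "$gid" ]; then
        echo "user $user not found in $passwd_file"
        return 1
    fi

    # Field 3 of a group entry is the GID.
    if ! awk -F: -v g="$gid" '$3 == g { found = 1 } END { exit !found }' "$group_file"; then
        echo "gid $gid not found in $group_file"
        return 1
    fi

    echo "ok: $user has gid $gid"
}
```

For example, running check_account lsfadmin ./passwd /etc/group reports an error if GID 100001 from the passwd file above is missing from the host's group file.
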

2. Network Communication

LSF daemons on a host communicate with other hosts through the network. To improve performance, use the Docker host network by specifying the “docker run” option “--network=host” when starting the LSF daemons on each node.

3. Other Mappings Between the Container and the Host OS

  • Job PIDs
    LSF collects job PIDs for accounting. By default, Docker gives each container a private PID namespace, but LSF needs to see the job PIDs of the host OS. To share the host PID namespace, use the “docker run” option “--pid=host”.

  • Communication socket
    LSF starts a Docker container for each job, so the LSF daemons need to communicate with the dockerd daemon on the host. To allow this, mount the dockerd IPC socket, which is usually located at /var/run/docker.sock, into the LSF daemon container with the “docker run” option “-v /var/run/docker.sock:/var/run/docker.sock”.

  • cgroups
    LSF manages jobs with cgroups. Each job has one job cgroup for process tracking, accounting, and resource enforcement. LSF assumes that the cgroup hierarchy is located at /sys/fs/cgroup. To expose the host cgroup hierarchy at that path, use the “docker run” option “-v /sys/fs/cgroup:/sys/fs/cgroup”.
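
The options from sections 1 to 3 can be collected into a single command line. The sketch below only prints the command instead of running it, so it can be reviewed first. lsf_daemon_cmd is a hypothetical helper name, and the image name and passwd file path are placeholders passed in by the caller; lsfubt is the image name used in the example later in this article.

```shell
#!/bin/sh
# lsf_daemon_cmd is a hypothetical helper, not an LSF or Docker command.
# It prints (does not execute) a "docker run" command line that combines
# the account, network, PID, socket, and cgroup options described above.
# Usage: lsf_daemon_cmd IMAGE PASSWD_FILE
lsf_daemon_cmd() {
    image=$1
    passwd_file=$2
    echo docker run -it \
        --network=host \
        --pid=host \
        -v "$passwd_file:/etc/passwd" \
        -v /etc/group:/etc/group \
        -v /var/run/docker.sock:/var/run/docker.sock \
        -v /sys/fs/cgroup:/sys/fs/cgroup \
        "$image" bash
}

# Print the command for the image and passwd file used in this article.
lsf_daemon_cmd lsfubt "$PWD/passwd"
```

Printing the command first makes it easy to compare the options against your site's requirements before anything is started.
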

Configure LSF to run jobs on a host container

LSF starts a job by running its job file script, which it locates in the job spool directory. The spool directory must therefore be mounted into the LSF daemon container so that the daemons can read and write job files, and so that job containers can be started with the job file from the spool directory.
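
Because the job containers are created by the dockerd daemon on the host, bind-mount source paths are resolved on the host rather than inside the LSF daemon container. One simple way to keep the job file path valid in both places is to mount the spool directory at the same absolute path inside the container. The sketch below assumes this identical-path convention; spool_mount_arg is a hypothetical helper, and /scratch/job is the spool directory used in the example in the next section.

```shell
#!/bin/sh
# spool_mount_arg is a hypothetical helper, not an LSF command.
# It prints a "docker run" bind-mount option that maps the job spool
# directory to the same absolute path inside the container.
spool_mount_arg() {
    case $1 in
        /*) printf '%s %s:%s\n' "-v" "$1" "$1" ;;
        *)  echo "spool directory must be an absolute path: $1" >&2
            return 1 ;;
    esac
}

spool_mount_arg /scratch/job
```
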

Run LSF jobs

  • Start an LSF container node manually
    $ docker run -it --hostname=lsf -v `pwd`/passwd:/etc/passwd -v /etc/group:/etc/group -v /sys/fs/cgroup/:/sys/fs/cgroup/ -v /var/run/docker.sock:/var/run/docker.sock -v /scratch/job:/scratch/job --pid=host lsfubt bash
    root@lsf $ cd /lsf/conf
    root@lsf $ source profile.lsf
    root@lsf $ lsadmin limstartup
    Starting up LIM on  ...... done
    root@lsf $ lsadmin resstartup
    Starting up RES on  ...... done
    root@lsf $ badmin hstartup
    Starting up slave batch daemon on  ...... done
    root@lsf $ lsid
    IBM Spectrum LSF Advanced 10.1.0.3, Jul 31 2017
    Copyright International Business Machines Corp. 1992, 2016.
    US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
    
    My cluster name is lsf10-1
    My master name is lsf
    root@lsf$ lshosts
    HOST_NAME      type    model  cpuf ncpus maxmem maxswp server RESOURCES
    lsf          X86_64   PC6000 116.1     2      -   1.4G    Yes (mg docker)
    root@lsf$ bparams  -l | grep SPOOL
        JOB_SPOOL_DIR = /scratch/job
    
  • Run a container job
    root@lsf $ bapp -l
    
    APPLICATION NAME: docker
     -- docker job
    
    STATISTICS:
       NJOBS     PEND      RUN    SSUSP    USUSP      RSV
           0        0        0        0        0        0
    
    PARAMETERS:
    
    CONTAINER: docker[image(ubuntu) options(--rm)]
    
    lsfadmin@lsf $ bsub -app docker sleep 100
    Job <513> is submitted to default queue .
    lsfadmin@lsf $ bsub -app docker sleep 100
    Job <514> is submitted to default queue .
    lsfadmin@lsf $ bjobs
    JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
    513    lsfadmin RUN   normal     lsf         lsf         sleep 100  Sep 29 05:16
    514    lsfadmin RUN   normal     lsf         lsf         sleep 100  Sep 29 05:16 
    # another terminal
    lsfadmin@host $ docker ps
    CONTAINER ID        IMAGE               COMMAND                  CREATED             STATUS              PORTS               NAMES
    4a594455b7f5        ubuntu              "/scratch/job/1506..."   4 seconds ago       Up 3 seconds                            job.513
    b92679a80adf        ubuntu              "/scratch/job/1506..."   4 seconds ago       Up 3 seconds                            job.514
    c4058673f003        lsfubt              "bash"                   6 minutes ago       Up 6 minutes                            nervous_shirley
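
In the docker ps output above, the job containers are named job.<job ID>. Assuming that naming pattern holds (it is observed in this example, not a documented guarantee), the following sketch extracts the LSF job IDs of the running job containers from a list of container names. job_ids_from_names is a hypothetical helper.

```shell
#!/bin/sh
# job_ids_from_names is a hypothetical helper, not an LSF or Docker command.
# It reads container names on stdin, one per line, and prints the numeric
# job ID for each name of the form job.<id>. Other names are ignored.
job_ids_from_names() {
    sed -n 's/^job\.\([0-9][0-9]*\)$/\1/p'
}

# Example with the container names shown in the listing above:
printf '%s\n' job.513 job.514 nervous_shirley | job_ids_from_names
# prints:
# 513
# 514

# Against a live host (requires Docker):
#   docker ps --format '{{.Names}}' | job_ids_from_names
```

The resulting IDs can be compared with the JOBID column of bjobs to confirm that every running job has its own container.
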
    
    

Summary

Docker containers offer a flexible way to deploy LSF daemons. This article shows how to run the LSF daemons and users’ jobs in independent containers on the same host. This mode lets LSF fully manage the host’s resources, and the stronger isolation makes the system more stable and easier to recover if LSF or a job container fails. It also gives third-party container management software a flexible way to manage LSF clusters.

For More Information

IBM Spectrum LSF – https://www.ibm.com/spectrum-computing
IBM Cloud Private – https://www.ibm.com/cloud-computing/products/ibm-cloud-private/
Turbocharging Kubernetes Batch Job Management with IBM Spectrum LSF – https://www.ibm.com/developerworks/community/blogs/fe25b4ef-ea6a-4d86-a629-6f87ccf4649e/entry/September_5_2017_at_10_06_12_AM?lang=en
