Authors: Xun Pan, Zhaohui Ding
IBM Spectrum LSF is a mature and sophisticated workload management system for HPC. It runs several daemons on each node to manage nodes and jobs. Docker can be used to simplify the deployment and management of LSF cluster services. This makes it easier for third-party container management infrastructures, such as IBM Cloud Private (“ICP”), to manage LSF as a containerized service. For more information, see IBM Cloud Private.
By default, ICP creates a Docker container to run the LSF daemons on each node. These containers communicate with each other through the container network. Each container is treated as an LSF host, and jobs run in the same container as the daemons; see Figure 1(a). This article introduces a different usage: the LSF daemons still run in a container on each host, but users’ jobs run in independent containers on that host; see Figure 1(b).
The new usage presented in Figure 1(b) has the following benefits:
- LSF daemons and jobs are isolated from each other
- LSF can use the full resources of the host for high performance computing
- Applications packaged as Docker images can be run as Docker jobs, which simplifies deployment
However, Docker’s isolation exposes only limited resources inside a container by default, so LSF daemons can detect only the resources within their own container. To manage the resources of the entire host, the LSF daemons need a complete view of the node’s resources. This article describes the steps that make the LSF daemons work properly in this setup.
Run LSF in a Docker Container
LSF can be installed in a shared directory for all hosts in the cluster. Docker starts the LSF daemons in a container by mounting the installation directory, and the daemon binaries are started from the mounted path. This setup makes it easy to run LSF daemons in the container and to manage the LSF configuration in a central place.
Configure the Container for LSF Daemons
1. Account Mapping
The LSF administrator account must exist in the container. You can create the “lsfadmin” account and user group in the Docker image, or mount your own passwd and group files into the container with the “docker run” options “-v `pwd`/passwd:/etc/passwd -v /etc/group:/etc/group”. The host path must be absolute, because Docker treats a bare name such as “passwd” as a named volume rather than a file. The following is an example of the passwd file.
$ cat ./passwd
root:x:0:0:root:/root:/bin/bash
lsfadmin:x:100001:100001:::
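One way to generate such a file is sketched below. This is a minimal example under stated assumptions, not part of LSF or Docker: the helper name `make_lsf_passwd` is hypothetical, and the default UID and GID of 100001 are simply the values from the listing above. The GID you choose must also exist in the group file that you mount.

```shell
#!/bin/sh
# Sketch: write a minimal passwd file for the LSF daemon container.
# make_lsf_passwd is a hypothetical helper, not an LSF or Docker command.
make_lsf_passwd() {
    out=$1            # path of the passwd file to create
    uid=${2:-100001}  # lsfadmin UID (default taken from the example above)
    gid=${3:-$uid}    # lsfadmin GID (defaults to the UID)
    {
        printf 'root:x:0:0:root:/root:/bin/bash\n'
        printf 'lsfadmin:x:%s:%s:::\n' "$uid" "$gid"
    } > "$out"
}
```

For example, `make_lsf_passwd ./passwd` would recreate the file shown above in the current directory, ready to be mounted at /etc/passwd.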
2. Network Communication
LSF daemons on a host communicate with other hosts through the network. To improve performance, use the Docker host network by specifying the “docker run” option “--network=host” when you start the LSF daemons on each node.
3. Other mappings between the container and the host OS
- Job PIDs
- Communication socket
- Job cgroups
LSF collects job PIDs for accounting. By default, Docker gives each container a private PID namespace, but LSF needs to see the job PIDs of the host OS. To make the host PIDs visible to the LSF daemons, use the “docker run” option “--pid=host”.
LSF starts a Docker container for each job. To do so, the LSF daemons need to communicate with the dockerd daemon on the host through its IPC socket, which is usually located at /var/run/docker.sock. To mount the socket into the LSF daemon container, use the “docker run” option “-v /var/run/docker.sock:/var/run/docker.sock”.
LSF manages jobs with cgroups. Each job has one job cgroup for process tracking, accounting, and resource enforcement. LSF assumes that the cgroup hierarchy is located at /sys/fs/cgroup. To make the host cgroup hierarchy available at that location in the container, use the “docker run” option “-v /sys/fs/cgroup:/sys/fs/cgroup”.
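The options from steps 1 to 3 can be collected into a single command line. The sketch below only prints the command instead of running it, so it does not require Docker. The helper name `lsf_daemon_run_cmd` is hypothetical, and the image name and passwd path are parameters that you supply; paths that contain spaces are not handled.

```shell
#!/bin/sh
# Sketch: print (do not run) a "docker run" command for the LSF daemon
# container, using the options described in this section.
# lsf_daemon_run_cmd is a hypothetical helper, not an LSF or Docker command.
lsf_daemon_run_cmd() {
    image=$1        # image that can run the LSF daemons, for example lsfubt
    passwd_file=$2  # absolute host path of the passwd file to mount
    printf 'docker run -it'
    printf ' --network=host'                    # host network for daemon traffic
    printf ' --pid=host'                        # host PIDs for job accounting
    printf ' -v %s:/etc/passwd' "$passwd_file"  # account mapping
    printf ' -v /etc/group:/etc/group'
    printf ' -v /var/run/docker.sock:/var/run/docker.sock'  # reach host dockerd
    printf ' -v /sys/fs/cgroup:/sys/fs/cgroup'              # host cgroup hierarchy
    printf ' %s bash\n' "$image"
}

lsf_daemon_run_cmd lsfubt "$(pwd)/passwd"
```

Printing the command first makes it easy to review the mounts before starting the container. Any additional mounts, such as the job spool directory, would be added in the same way.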
Configure LSF to run jobs on a host container
LSF starts a job by running its job file script, which it locates in the job spool directory. Mount the spool directory into the LSF daemon container so that the daemons can read and write job files there and can start job containers with the job file in that directory. The example in the next section uses /scratch/job as the spool directory.
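Because the mount must match the configured spool directory, it can help to derive the “-v” option from the JOB_SPOOL_DIR setting itself. The following is a minimal sketch, assuming the setting appears as a “JOB_SPOOL_DIR = path” line, which is the form that “bparams -l” prints. The helper name `spool_mount_opt` is hypothetical.

```shell
#!/bin/sh
# Sketch: derive the bind-mount option for the job spool directory.
# Reads "NAME = value" lines on stdin and prints "-v DIR:DIR" for the
# first JOB_SPOOL_DIR setting. spool_mount_opt is a hypothetical helper.
spool_mount_opt() {
    awk -F= '
        /^[ \t]*#/ { next }                      # skip comment lines
        $1 ~ /^[ \t]*JOB_SPOOL_DIR[ \t]*$/ {
            dir = $2
            gsub(/^[ \t]+|[ \t]+$/, "", dir)     # trim surrounding whitespace
            print "-v " dir ":" dir              # same path on both sides
            exit
        }'
}

# Example input in the form printed by "bparams -l".
printf 'JOB_SPOOL_DIR = /scratch/job\n' | spool_mount_opt
```

With the sample input, the function prints `-v /scratch/job:/scratch/job`. If no JOB_SPOOL_DIR line is present, it prints nothing.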
Run LSF jobs
- Start an LSF container node manually
$ docker run -it --hostname=lsf -v `pwd`/passwd:/etc/passwd -v /etc/group:/etc/group -v /sys/fs/cgroup/:/sys/fs/cgroup/ -v /var/run/docker.sock:/var/run/docker.sock -v /scratch/job:/scratch/job --pid=host lsfubt bash
root@lsf $ cd /lsf/conf
root@lsf $ source profile.lsf
root@lsf $ lsadmin limstartup
Starting up LIM on ...... done
root@lsf $ lsadmin resstartup
Starting up RES on ...... done
root@lsf $ badmin hstartup
Starting up slave batch daemon on ...... done
root@lsf $ lsid
IBM Spectrum LSF Advanced 10.1.0.3, Jul 31 2017
Copyright International Business Machines Corp. 1992, 2016.
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

My cluster name is lsf10-1
My master name is lsf
root@lsf $ lshosts
HOST_NAME   type     model    cpuf    ncpus   maxmem   maxswp   server   RESOURCES
lsf         X86_64   PC6000   116.1   2       -        1.4G     Yes      (mg docker)
root@lsf $ bparams -l | grep SPOOL
JOB_SPOOL_DIR = /scratch/job
root@lsf $ bapp -l
APPLICATION NAME: docker
 -- docker job

STATISTICS:
NJOBS   PEND   RUN   SSUSP   USUSP   RSV
0       0      0     0       0       0

PARAMETERS:
CONTAINER: docker[image(ubuntu) options(--rm)]

lsfadmin@lsf $ bsub -app docker sleep 100
Job <513> is submitted to default queue.
lsfadmin@lsf $ bsub -app docker sleep 100
Job <514> is submitted to default queue.
lsfadmin@lsf $ bjobs
JOBID   USER       STAT   QUEUE    FROM_HOST   EXEC_HOST   JOB_NAME    SUBMIT_TIME
513     lsfadmin   RUN    normal   lsf         lsf         sleep 100   Sep 29 05:16
514     lsfadmin   RUN    normal   lsf         lsf         sleep 100   Sep 29 05:16

# another terminal
lsfadmin@host $ docker ps
CONTAINER ID   IMAGE    COMMAND                  CREATED         STATUS         PORTS   NAMES
4a594455b7f5   ubuntu   "/scratch/job/1506..."   4 seconds ago   Up 3 seconds           job.513
b92679a80adf   ubuntu   "/scratch/job/1506..."   4 seconds ago   Up 3 seconds           job.514
c4058673f003   lsfubt   "bash"                   6 minutes ago   Up 6 minutes           nervous_shirley
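To check from the host that each running job has its own container, the container names can be matched against job IDs. The sketch below assumes that job containers are named “job.” followed by the job ID, which is what the “docker ps” output above shows; that naming is an observation from this example, not a documented guarantee. The helper name `job_ids_from_names` is hypothetical.

```shell
#!/bin/sh
# Sketch: list LSF job IDs that currently have a job container.
# Assumes job containers are named "job.<jobid>", as in the "docker ps"
# output above. job_ids_from_names is a hypothetical helper.
job_ids_from_names() {
    # Keep only names of the form job.<digits> and strip the "job." prefix.
    sed -n 's/^job\.\([0-9][0-9]*\)$/\1/p'
}

# On a Docker host, feed it the real container names:
#   docker ps --format '{{.Names}}' | job_ids_from_names
# Here the names from the sample output are used instead.
printf '%s\n' job.513 job.514 nervous_shirley | job_ids_from_names
```

With the sample names, this prints 513 and 514 on separate lines and ignores the LSF daemon container, so the result can be compared with the JOBID column of “bjobs”.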
Docker containers offer a flexible way to deploy LSF daemons. This article showed how to run the LSF daemons and users’ jobs in independent containers on a host. In this mode, LSF can manage the full resources of the host. The added isolation makes the system more stable and easier to recover if the LSF container or a job container fails. It also gives third-party container management software a flexible way to manage LSF clusters.
For More Information
IBM Spectrum LSF – https://www.ibm.com/spectrum-computing
IBM Cloud Private – https://www.ibm.com/cloud-computing/products/ibm-cloud-private/
Turbocharging Kubernetes Batch Job Management with IBM Spectrum LSF – https://www.ibm.com/developerworks/community/blogs/fe25b4ef-ea6a-4d86-a629-6f87ccf4649e/entry/September_5_2017_at_10_06_12_AM?lang=en