In this blog we will start the process of constructing a Hybrid IBM Spectrum LSF cluster. We will examine different ways of configuring LSF, as well as looking at many of the architectural considerations. We will provide sample Ansible playbooks which you can take and customize for your own needs. We’ll use LSF Suite as our starting point for our on premises LSF cluster and Amazon Web Services (AWS) as our cloud provider for this example.
A Little Background
IBM Spectrum LSF Suite is a comprehensive suite of offerings supporting traditional high-performance computing (HPC) and high-throughput computing (HTC) workloads, as well as big data, cognitive, GPU machine learning, and containerized workloads. The suite has the following components:
IBM Spectrum LSF – Is the core of the Suite. Load Sharing Facility (LSF) is the powerful workload management platform for demanding, distributed HPC environments. It provides a comprehensive set of intelligent, policy-driven scheduling features so that you can use all of your compute infrastructure resources and help ensure optimal application performance. LSF is the HPC workload management standard, with the most complete set of capabilities from license scheduling and session scheduling to advanced analytics.
IBM Spectrum LSF Application Center – A WebSphere-based bootstrap job submission and control portal that allows job submission and control through an easy-to-use interface and also supports additional features included in the Suite.
IBM Spectrum LSF Process Manager – Brings enterprise scheduling to Spectrum LSF. Users can create workflows from their web browser of choice, and then schedule those workflows based upon either a calendar event, a file event, or from the Application Center GUI.
IBM Spectrum LSF Explorer – An Elasticsearch-based reporting GUI that provides workload-specific reporting from inside Application Center. LSF Explorer allows searching LSF events for matching information, in addition to system logs provided through Elastic's Logstash and Filebeat tools. Most metric data is provided by Elastic's Metricbeat toolset. All of this information can be graphed from the Application Center interface for user consumption. Our implementation of Elastic supports one-node and three-node configurations for fault tolerance.
IBM Spectrum LSF License Scheduler – This tool allows policy-driven management of high-priced licenses served by FLEXlm and Reprise license managers. It is the state-of-the-art solution for managing such licenses within an enterprise, including across multiple clusters.
IBM Spectrum LSF Data Manager – This tool provides on Cloud data movement and caching services for LSF customers wishing to run jobs on multiple clusters or in the cloud.
IBM Spectrum LSF Resource Connector – This tool allows policy-driven cloud bursting with LSF to all major cloud providers, including IBM, Amazon Web Services, Google, and Azure.
As you can see, there are a lot of components that are integrated together. The LSF Suite takes care of installing and configuring them so you don’t have to. There are three editions of IBM Spectrum LSF Suite as shown below.
The edition to use will depend on what type of LSF Cluster you wish to create. We’ll discuss more about that later, but for LSF Multi Clusters you will need IBM Spectrum LSF Suite for HPC, or higher. Any edition will work for LSF Stretch clusters.
There is no one correct way to construct an LSF Hybrid cluster. Every site is different, and users are running different workloads that have very different resource consumption patterns. Some Deep Learning (DL) training consumes huge amounts of data to produce a small model. Other workloads are relatively CPU intensive or GPU intensive. This means that the solution you construct will depend on the workloads you want to run. Understanding those workloads and experimenting with different configurations will be needed to build the best solution for your site.
In this blog we will look at two installation options for LSF. There are others as seen below, but we will be constructing a Hybrid Cloud configuration.
We will focus on the LSF Stretch Cluster configuration, and the LSF Multi Cluster configuration.
LSF Stretch Cluster
This architecture assumes that you have a cluster in another location – either on premises or even running in another cloud or cloud location. The “stretched cluster” architecture is defined as a single cluster stretched over a WAN so that compute nodes in the cloud communicate with a master scheduling host at the originating location.
Although this is much simpler in concept than the Multi Cluster approach, it means that all LSF daemon communication with the master scheduler happens over the WAN, which can add cost or reduce reliability.
LSF Multi Cluster
This is a more complex architecture which adds a master scheduler running in the cloud. By adding a master scheduler in the cloud, the architecture eliminates all the communication from cloud compute nodes to the on-premises master.
The two master schedulers instead exchange task metadata in a “job forwarding” model. In this model, users on premises submit workload to a queue on premises, which in turn forwards that workload to the cloud for execution. When a task completes, the master in the cloud reports the completion and status to the on-premises master, and the user is notified.
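To make the job forwarding flow concrete, here is a minimal conceptual sketch in Python. It is not LSF code and does not use any LSF API: the class names `OnPremCluster` and `CloudCluster`, the `forward` method, and the status strings are all hypothetical and chosen only for illustration. The point it shows is that only job metadata and the final status pass between the two masters.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    command: str
    status: str = "PEND"  # illustrative status label, pending until executed


@dataclass
class CloudCluster:
    """Hypothetical stand-in for the cluster whose master runs in the cloud."""
    completed: list = field(default_factory=list)

    def execute(self, metadata: dict) -> str:
        # Cloud compute nodes report to the cloud master only,
        # never to the on-premises master.
        self.completed.append(metadata["job_id"])
        return "DONE"


@dataclass
class OnPremCluster:
    """Hypothetical stand-in for the on-premises cluster and its forwarding queue."""
    remote: CloudCluster
    jobs: dict = field(default_factory=dict)

    def submit(self, command: str) -> Job:
        # The user submits to an on-premises queue as usual.
        job = Job(job_id=len(self.jobs) + 1, command=command)
        self.jobs[job.job_id] = job
        return job

    def forward(self, job: Job) -> None:
        # Only metadata goes out over the WAN, and only a status comes back.
        metadata = {"job_id": job.job_id, "command": job.command}
        job.status = self.remote.execute(metadata)


on_prem = OnPremCluster(remote=CloudCluster())
job = on_prem.submit("sleep 60")
on_prem.forward(job)
print(job.job_id, job.status)
```

In a real deployment the forwarding is handled by LSF queue configuration rather than application code, so treat this purely as a mental model.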
Our approach will be to assemble our solution in layers. That way we can explore the issues and suggest other ways of doing things that may work better for you. We’ll provide sample Ansible playbooks that collectively create a functioning solution, but that you can take and freely customize to suit your environment. The layers will cover:
- Installing Prerequisites
- Making an Amazon Virtual Private Cloud (Amazon VPC)
- Network Connection
- User / Group and host resolution on the cloud
- Creating Cloud machines
We’ve put together a series of videos that walk through the process.
These videos look at how to extend your on premises LSF clusters to the Cloud. In them we look at various topics you need to consider in constructing your Hybrid cloud solution. We show two different LSF configurations suitable for small and large clusters and discuss the benefits of each. We provide sample Ansible playbooks which you can take and customise for your site. Each video covers a different topic, and a different Ansible playbook. They are best viewed in order.
LSF Cloud Video 1 – Introduction
This is the first of the video series on creating a hybrid LSF cluster. This video covers what LSF is, why users want to go to the cloud, and how we can help in that journey. We outline two different ways LSF can be configured. The first extends the on premises cluster by adding cloud servers to the cluster. The second constructs a second cluster on the cloud, and dynamically sizes that cluster based on the amount of workload. The subsequent videos provide additional details and live demonstrations on how to build them.
LSF Cloud Video 2 – What Type of Cluster
This video provides details on the different ways LSF can be configured to use Cloud machines. We start from the simplest case, the LSF Stretch Cluster, which adds Cloud machines into an existing on premises cluster. We then show an LSF Multi Cluster, which creates a separate LSF cluster on the cloud that accepts workload from the on premises cluster and dynamically resizes based on policies. The use cases for each are outlined along with the benefits and issues.
LSF Cloud Video 3 – Installing Prerequisites
In this video we start the process of building an LSF hybrid cluster. We start from an existing on premises LSF Suite cluster, and use that, along with some sample Ansible playbooks, to deploy the LSF Stretch and LSF Multi clusters onto Amazon Elastic Compute Cloud (Amazon EC2) instances. This video discusses the prerequisites for the sample playbooks. It shows how to set up your AWS account and get the needed AWS keys and certificate that will be used later. It shows the git repository that hosts the code:
It shows how to add the AWS keys to the playbooks and run the first playbook to set up your LSF Master to build the rest of the solution.
LSF Cloud Video 4 – Amazon VPC Configuration
This video focuses specifically on Amazon Web Services and their Cloud environment. In it we show a playbook that will construct an Amazon VPC, along with associated subnets, routes, security groups, network ACLs, and internet gateways. We also show how to use an existing Amazon VPC with the playbooks. The LSF cluster will use this Amazon VPC to access the cloud instances.
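Before creating the VPC it helps to plan its address range, because traffic routed between the two sites over a VPN needs non-overlapping networks. The sketch below uses Python's standard `ipaddress` module to check a candidate VPC range against the on premises network and carve out subnets. The CIDR values and the helper name `plan_subnets` are assumptions for illustration only, not values taken from the playbooks.

```python
import ipaddress
from itertools import islice

# Assumed example ranges; substitute the ranges used at your site.
on_prem_network = ipaddress.ip_network("10.0.0.0/16")
vpc_network = ipaddress.ip_network("10.1.0.0/16")


def plan_subnets(vpc, on_prem, new_prefix=24, count=2):
    """Return the first `count` subnets of `vpc`, refusing overlapping ranges."""
    if vpc.overlaps(on_prem):
        raise ValueError(f"{vpc} overlaps the on premises network {on_prem}")
    # subnets() yields the subnets lazily, so take only what is needed.
    return list(islice(vpc.subnets(new_prefix=new_prefix), count))


subnets = plan_subnets(vpc_network, on_prem_network)
for subnet in subnets:
    print(subnet)
```

If you reuse an existing Amazon VPC, the same overlap check applies to its current range.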
LSF Cloud Video 5 – Network Connection
The connection between the on premises cluster and the cloud instances is a critical part of the infrastructure. This video looks at different options available with AWS. It shows a sample playbook that will construct a VPN using OpenVPN. We also test the connection to verify it can work with LSF.
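As a rough illustration of what testing the connection can involve, the following Python sketch checks whether the TCP ports used by the LSF daemons are reachable from one side of the VPN. The port numbers are the commonly used LSF defaults and are an assumption here; confirm them against `LSF_LIM_PORT`, `LSF_RES_PORT`, `LSB_MBD_PORT`, and `LSB_SBD_PORT` in your own `lsf.conf`. The function names are hypothetical, and the check covers TCP only, so any UDP traffic between daemons is not verified.

```python
import socket

# Assumed default ports; verify against your lsf.conf before relying on them.
LSF_PORTS = {"lim": 7869, "res": 6878, "mbatchd": 6881, "sbatchd": 6882}


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def check_host(host, ports=None):
    """Return {daemon_name: reachable} for each port in `ports`."""
    ports = LSF_PORTS if ports is None else ports
    return {name: tcp_port_open(host, port) for name, port in ports.items()}
```

For example, calling `check_host("cloud-host-1")` from the on premises master, where `cloud-host-1` is a placeholder hostname, returns a dictionary showing which daemon ports answered. A port reports as closed until the corresponding daemon is running on the target, so this is most useful after LSF has been started there.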
LSF Cloud Video 6 – Users and Groups
In this video we discuss ways to provide a consistent user experience across a hybrid cloud. We look at possible solutions for synchronising user, group and host configurations between the on premises and cloud machines. We show a playbook that synchronises the users, groups and hosts between the on premises LSF master and the cloud instances.
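One way to see why this matters: if a user has a different UID or GID on the cloud machines, file ownership and job execution will not behave consistently. The Python sketch below compares two sets of passwd-format entries and reports users that are missing or mismatched. The sample entries and the function names `parse_passwd` and `find_mismatches` are made up for illustration and are not part of the playbooks.

```python
def parse_passwd(text):
    """Return {username: (uid, gid)} parsed from passwd-format text."""
    users = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) < 4:
            continue  # skip malformed lines
        users[fields[0]] = (int(fields[2]), int(fields[3]))
    return users


def find_mismatches(on_prem, cloud):
    """Return {username: problem} for users missing or different on the cloud side."""
    problems = {}
    for name, ids in on_prem.items():
        if name not in cloud:
            problems[name] = "missing on cloud"
        elif cloud[name] != ids:
            problems[name] = f"on premises {ids} != cloud {cloud[name]}"
    return problems


# Made-up sample data standing in for /etc/passwd on each side.
on_prem_text = """alice:x:1001:1001::/home/alice:/bin/bash
bob:x:1002:1002::/home/bob:/bin/bash
carol:x:1003:1003::/home/carol:/bin/bash"""
cloud_text = """alice:x:1001:1001::/home/alice:/bin/bash
bob:x:2002:1002::/home/bob:/bin/bash"""

problems = find_mismatches(parse_passwd(on_prem_text), parse_passwd(cloud_text))
for name, problem in problems.items():
    print(name, "->", problem)
```

The same comparison can be applied to group files, and a directory service shared by both sites avoids the problem altogether.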
LSF Cloud Video 7 – Bringup LSF Cloud Instances
This video uses a playbook to bring up additional cloud instances. The machines are configured so that they can be reached from the on premises LSF master, and the users, groups, and host resolution are configured.
LSF Cloud Video 8 – Storage
In this video we cover one of the more difficult issues to address in constructing an LSF hybrid cluster. The architecture of the storage will have a large impact on how the on cloud cluster performs. This video will cover some options, but it is strongly recommended that users perform their own experiments to see which storage configuration works best for their workloads. We demonstrate a simple storage configuration.
LSF Cloud Video 9 – LSF Stretch Cluster deployment
This video demonstrates the deployment of the LSF Stretch cluster. We take the machine(s) deployed in the previous videos and extend the existing on premises cluster to include additional cloud machines. We show how the LSF Master is reconfigured, and demonstrate jobs running on the cloud instances.
LSF Cloud Video 10 – LSF Multi Cluster deployment
Here we demonstrate the deployment of the LSF Multi cluster. We take the machine(s) deployed in the previous videos and extend the existing on premises cluster to include additional cloud machines. We show how the LSF Masters on premises and on cloud are reconfigured. We submit work to the cluster and see it dynamically create new machines on the cloud, and see it terminate those machines when the load drops.
LSF Cloud Video 11 – Decommissioning the Cluster
This video demonstrates how to take down the on cloud cluster. It also shows what must be done to remove any hosts that were dynamically created by the resource connector in the LSF Multi cluster. It is VERY important to clean up fully, so a thorough review of this video is recommended.
Extending the Code
The Ansible playbooks used in these videos are hosted on GitHub here. They are public and freely available for you to take and customize. If you add a new feature you’d like to share with everyone, please post it.