Introducing the basic concepts of High Performance Computing with Linux cluster technology
Linux clustering is popular in many industries these days. With the advent of clustering technology and the growing acceptance of open source software, supercomputers can now be created for a fraction of the cost of traditional high-performance machines.
This two-part article introduces the concepts of High Performance Computing (HPC) with Linux cluster technology and shows you how to set up clusters and write parallel programs. This part introduces the different types of clusters, uses of clusters, some fundamentals of HPC, the role of Linux, and the reasons for the growth of clustering technology. Part 2 covers parallel algorithms, how to write parallel programs, how to set up clusters, and cluster benchmarking.
Types of HPC architectures
Most HPC systems use the concept of parallelism. Many software platforms are oriented toward HPC, but first let’s look at the hardware aspects.
HPC hardware falls into three categories:
- Symmetric multiprocessors (SMP)
- Vector processors
- Clusters
Symmetric multiprocessors (SMP)
SMP is a type of HPC architecture in which multiple processors share the same memory. (In clusters, also known as massively parallel processors (MPPs), the processors do not share memory; more on this below.) SMPs are generally more expensive and less scalable than MPPs.
Vector processors
In a vector processor, the CPU is optimized to perform well with arrays or vectors; hence the name. Vector processor systems deliver high performance and were the dominant HPC architecture in the 1980s and early 1990s, but clusters have become far more popular in recent years.
Clusters
Clusters are the predominant type of HPC hardware these days; a cluster is a set of MPPs. A processor in a cluster is commonly referred to as a node; it has its own CPU, memory, operating system, and I/O subsystem, and it is capable of communicating with other nodes. These days it is common to use a commodity workstation running Linux and other open source software as a node in a cluster.
You’ll see how these types of HPC hardware differ, but let’s start with clustering.
The term “cluster” can take different meanings in different contexts. This article focuses on three types of clusters:
- Fail-over clusters
- Load-balancing clusters
- High-performance clusters
Fail-over clusters
The simplest fail-over cluster has two nodes: one stays active and the other stays on standby, constantly monitoring the active one. If the active node goes down, the standby node takes over, allowing a mission-critical system to continue functioning.
Load-balancing clusters
Load-balancing clusters are commonly used for busy Web sites: several nodes host the same site, and each new request for a Web page is dynamically routed to a node with a lower load.
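The routing decision at the heart of a load-balancing cluster can be sketched in a few lines. This is a toy least-load policy (the node names and the load measure are hypothetical); real load balancers also weigh node capacity, session affinity, and health checks:

```python
def route_request(node_loads):
    """Pick the node with the lowest current load.

    node_loads maps a node name to its current number of
    active requests (a hypothetical measure of load).
    """
    return min(node_loads, key=node_loads.get)
```

For example, `route_request({"node1": 12, "node2": 3, "node3": 7})` routes the new request to `node2`.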
High-performance clusters
High-performance clusters are used to run parallel programs for time-intensive computations and are of special interest to the scientific community. They commonly run simulations and other CPU-intensive programs that would take an inordinate amount of time on regular hardware.
Figure 1 illustrates a basic cluster. Part 2 of this series shows you how to create such a cluster and write programs for it.
Figure 1. The basic cluster
Grid computing
Grid computing is a broad term that typically refers to a service-oriented architecture (SOA) with collaboration among loosely coupled systems. Cluster-based HPC is a special case of grid computing in which the nodes are tightly coupled. A successful, well-known project in grid computing is SETI@home, the Search for Extraterrestrial Intelligence program, which used the idle CPU cycles of a million home PCs via screen savers to analyze radio telescope data. A similar successful project is Folding@Home, which runs protein-folding calculations.
Common uses of high-performance clusters
Almost every industry needs fast processing power. With the increasing availability of cheaper and faster computers, more companies are interested in reaping the technological benefits. There is no upper bound on the need for computing power; even with rapid increases in capability, demand consistently outstrips what’s available.
Life sciences research
Protein molecules are long flexible chains that can take on a virtually infinite number of 3D shapes. In nature, when put in a solvent, they quickly “fold” to their native states. Incorrect folding is believed to lead to a variety of diseases like Alzheimer’s; therefore, the study of protein folding is of fundamental importance.
One way scientists try to understand protein folding is by simulating it on computers. In nature, folding occurs quickly (in about one millionth of a second), but it is so complex that its simulation on a regular computer could take decades. Protein folding is just one focus area among many in this industry, but one that needs serious computational power.
Other focus areas from this industry include pharmaceutical modeling, virtual surgery training, condition and diagnosis visualizations, total-record medical databases, and the Human Genome Project.
Oil and gas exploration
Seismograms convey detailed information about the characteristics of the interior of the Earth and seafloors, and an analysis of the data helps in detecting oil and other resources. Terabytes of data are needed to reconstruct even small areas; this analysis obviously needs a lot of computational power. The demand for computational power in this field is so great that supercomputing resources are often leased for this work.
Other geological efforts require similar computing power, such as designing systems to predict earthquakes and designing multispectral satellite imaging systems for security work.
Graphics rendering
Manipulating high-resolution interactive graphics in engineering, such as in aircraft engine design, has always been a challenge in terms of performance and scalability because of the sheer volume of data involved. Cluster-based techniques have been helpful here: the task of painting the display screen is split among the nodes of the cluster, each of which uses its graphics hardware to render its portion of the screen and transmits the pixel information to a master node, which assembles the consolidated image.
These industry examples are only the tip of the iceberg; many more applications including astrophysical simulation, weather simulation, engineering design, financial modeling, portfolio simulation, and special effects in movies demand extensive computational resources. The demand for increasingly more power is here to stay.
How Linux and clusters have changed HPC
Before cluster-based computing, the typical supercomputer was a vector processor that could cost over a million dollars, owing to its specialized hardware and software.
With Linux and other freely available open source software components for clustering and improvements in commodity hardware, the situation now is quite different. You can build powerful clusters with a very small budget and keep adding extra nodes based on need.
The GNU/Linux operating system (Linux) has spurred the adoption of clusters on a large scale. Linux runs on a wide variety of hardware, and high-quality compilers and other software like parallel filesystems and MPI implementations are freely available for Linux. Also with Linux, users have the ability to customize the kernel for their workload. Linux is a recognized favorite platform for building HPC clusters.
Understanding hardware: Vector vs. cluster
To understand HPC hardware, it is useful to contrast the vector computing paradigm with the cluster paradigm. These are competing technologies (Earth Simulator, a vector supercomputer, is still among the top 10 fastest supercomputers).
Fundamentally, both vector and scalar processors execute instructions based on a clock; what sets them apart is the ability of vector processors to parallelize computations involving vectors (such as matrix multiplication), which are commonly used in High Performance Computing. To illustrate this, suppose you have two double arrays a and b and want to create a third array x such that x[i]=a[i]+b[i].
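As a concrete sketch, the elementwise addition just described looks like this in Python. A scalar processor runs the loop one element at a time; a vector processor performs the same loop but overlaps the internal steps of successive additions in hardware:

```python
def vector_add(a, b):
    """Compute x such that x[i] = a[i] + b[i]."""
    if len(a) != len(b):
        raise ValueError("arrays must be the same length")
    # On a scalar CPU, each addition completes before the next begins;
    # a vector CPU pipelines the internal stages of these additions.
    return [a[i] + b[i] for i in range(len(a))]
```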
Any floating-point operation like addition or multiplication is achieved in a few discrete steps:
- Exponents are adjusted
- Significands are added
- The result is checked for rounding errors and so on
A vector processor parallelizes these internal steps by using a pipelining technique. Suppose there are six steps (as in IEEE arithmetic hardware) in a floating-point addition as shown in Figure 2.
Figure 2. A six-step pipeline with IEEE arithmetic hardware
A vector processor performs these six steps in parallel: if the i-th pair of array elements is in its 4th stage, the processor executes the 3rd stage for the (i+1)-th pair, the 2nd stage for the (i+2)-th, and so on. As you can see, for a six-step floating-point addition the speedup factor will be very close to six (at start-up and wind-down, not all six stages are active), because all stages are busy at any given instant (shown in red in Figure 2). A big benefit is that this parallelization happens behind the scenes; you need not explicitly code it in your programs.
For the most part, all six steps can be executed in parallel, leading to an almost six-fold improvement in performance. The arrows in Figure 2 indicate the operations on the i-th array index.
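The near-six-fold speedup can be checked with a little cycle counting. A minimal sketch, assuming one clock cycle per stage: an n-stage pipeline produces its first result after n cycles and one result per cycle thereafter, while unpipelined hardware needs n cycles for every element:

```python
def unpipelined_cycles(n_elements, n_stages=6):
    # Without pipelining, each element occupies the unit for all stages.
    return n_elements * n_stages

def pipelined_cycles(n_elements, n_stages=6):
    # First result after n_stages cycles, then one result per cycle.
    return n_stages + (n_elements - 1)
```

For 1,000 additions the ratio is 6000/1005, about 5.97 — close to the six-fold ideal, with the gap coming from filling and draining the pipeline.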
Compared to vector processing, cluster-based computing takes a fundamentally different approach. Instead of using specially optimized vector hardware, it uses standard scalar processors, but in large numbers to perform several computations in parallel.
Some features of clusters are as follows:
- Clusters are built using commodity hardware and cost a fraction of the vector processors. In many cases, the price is lower by more than an order of magnitude.
- Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.
- With clusters, you can add more nodes to the cluster based on need.
- Open source software components and Linux lead to lower software costs.
- Clusters have a much lower maintenance cost (they take up less space, take less power, and need less cooling).
Parallel programming and Amdahl’s Law
Software and hardware go hand in hand when it comes to achieving high performance on a cluster. Programs must be written to explicitly take advantage of the underlying hardware, and existing non-parallel programs must be re-written if they are to perform well on a cluster.
A parallel program does many things at once. Just how many depends on the problem at hand. Suppose 1/N of the total time taken by a program is spent in a part that cannot be parallelized, and the rest (1 - 1/N) is spent in the parallelizable part (see Figure 3).
Figure 3. Illustrating Amdahl’s Law
In theory you could apply an infinite amount of hardware to do the parallel part in zero time, but the sequential part will see no improvements. As a result, the best you can achieve is to execute the program in 1/N of the original time, but no faster. In parallel programming, this fact is commonly referred to as Amdahl’s Law.
Amdahl’s Law governs the speedup of using parallel processors on a problem versus using only one serial processor. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):
S = T(1) / T(j)

where T(j) is the time it takes to execute the program when using j processors.
In Figure 3, with enough nodes for parallel processing, Tpar can be made close to zero, but Tseq does not change. At best, the parallel version cannot be faster than the original by more than a factor of 1 + Tpar/Tseq.
The real hard work in writing a parallel program is to make N as large as possible. But there is an interesting twist to it. You normally attempt bigger problems on more powerful computers, and usually the proportion of time spent in the sequential parts of the code decreases with increasing problem size (as you tend to modify the program and increase the parallelizable portion to optimize the available resources). Therefore, the value of N automatically becomes large. (See the re-evaluation of Amdahl’s Law in the resources section later in this article.)
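Putting the definitions above together: if f = 1/N is the serial fraction, then T(j) = f·T(1) + (1 - f)·T(1)/j, and the speedup S = T(1)/T(j) follows directly. A minimal Python sketch:

```python
def amdahl_speedup(serial_fraction, processors):
    """S = T(1)/T(j) for a program whose serial fraction is 1/N."""
    f = serial_fraction
    # T(j) = f*T(1) + (1 - f)*T(1)/j, so T(1) cancels out of the ratio.
    return 1.0 / (f + (1.0 - f) / processors)
```

With a 10% serial fraction, even a billion processors cannot push the speedup past 10 — the 1/(1/N) = N bound described above.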
Approaches to parallel programming
Let’s look at two parallel programming approaches: the distributed memory approach and the shared memory approach.
Distributed memory approach
It is useful to think of a master-slave model here:
- The master node divides the work among several slave nodes.
- Slave nodes work on their respective tasks.
- Slave nodes communicate among themselves if they need to.
- Slave nodes return results to the master.
- The master node assembles the results, distributes further work, and so on.
Obvious practical problems in this approach stem from the distributed-memory organization. Because each node has access to only its own memory, data structures must be duplicated and sent over the network if other nodes want to access them, leading to network overhead. Keep these shortcomings and the master-slave model in mind in order to write effective distributed-memory programs.
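The steps above can be sketched in plain Python. This runs sequentially for clarity; on a real cluster each slave_task call would execute on a different node, with its chunk shipped over the network by a message-passing library such as MPI:

```python
def slave_task(chunk):
    # Each slave sees only its own copy of the data; in a
    # distributed-memory cluster, nothing is shared.
    return sum(x * x for x in chunk)

def master(data, n_slaves=4):
    # 1. The master divides the work among the slaves.
    chunks = [data[i::n_slaves] for i in range(n_slaves)]
    # 2. Each slave works on its own chunk.
    partial_sums = [slave_task(chunk) for chunk in chunks]
    # 3. The master assembles the results.
    return sum(partial_sums)
```

Note that every chunk is a separate list: the duplication-and-send overhead described above is exactly the cost of building and transmitting those chunks.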
Shared memory approach
In the shared-memory approach, memory is common to all processors (as in SMP systems). This approach does not suffer from the problems of the distributed-memory approach, and programming for such systems is easier since all data is available to all processors; it is not much different from sequential programming. The big issue with these systems is scalability: it is not easy to add extra processors.
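For contrast, here is a shared-memory sketch using Python threads: every worker reads the same list in place, so nothing is duplicated or sent anywhere. (Python’s global interpreter lock prevents true parallel execution of this code; the point is only the memory model.)

```python
import threading

def shared_sum(data, n_threads=4):
    results = [0] * n_threads  # one slot per thread, visible to all

    def worker(tid):
        # Every thread indexes the same 'data' object directly --
        # no copies cross a network, unlike the distributed case.
        results[tid] = sum(data[tid::n_threads])

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results)
```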
Parallel programming (like all programming) is as much art as science, always leaving room for major design improvements and performance enhancements. Parallel programming has its own special place in computing; Part 2 of this series examines parallel programming platforms and examples.
When file I/O becomes a bottleneck
Some applications frequently need to read and write large amounts of data to the disk, which is often the slowest step in a computation. Faster hard drives help, but there are times when they are not enough.
The problem becomes especially pronounced if a physical disk partition is shared between all nodes (using NFS, for example) as is popular in Linux clusters. This is where parallel filesystems come in handy.
Parallel filesystems spread the data in a file over several disks attached to multiple nodes of the cluster, known as I/O nodes. When a program tries to read a file, small portions of that file are read from several disks in parallel. This reduces the load on any given disk controller and allows it to handle more requests. (PVFS is a good example of an open source parallel filesystem; disk performance of better than 1 GB/sec has been achieved on Linux clusters using standard IDE hard disks.)
PVFS is available as a Linux kernel module and can also be built into the Linux kernel. The underlying concept is simple (see Figure 4):
- A meta-data server stores information on where the different parts of the file are located.
- Multiple I/O nodes store pieces of the file (any regular underlying filesystem, such as ext3, can be used by PVFS).
Figure 4. How PVFS works
When a compute node in a cluster wants to access a file in this parallel filesystem, it goes through the following steps:
- It requests a file as usual, and the request goes to the underlying PVFS filesystem.
- PVFS sends a request to the meta-data server (steps 1 and 2 in Figure 4), which informs the requesting node about the location of the file among the various I/O nodes.
- Using this information, the compute node communicates directly with all the relevant I/O nodes to get all the pieces of the file (step 3).
All this is transparent to the calling application; the underlying complexity of making requests to all the I/O nodes, ordering of the file’s contents, etc., are taken care of by PVFS.
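The bookkeeping the meta-data server enables can be illustrated with a simple round-robin striping scheme. This is an illustrative layout only, not PVFS’s actual on-disk format; the stripe size and placement policy are assumptions:

```python
def stripe_location(offset, stripe_size, n_io_nodes):
    """Map a byte offset in a striped file to (I/O node, local offset)."""
    stripe_index = offset // stripe_size        # which stripe overall
    node = stripe_index % n_io_nodes            # round-robin placement
    local_stripe = stripe_index // n_io_nodes   # stripe number on that node
    return node, local_stripe * stripe_size + offset % stripe_size
```

With four I/O nodes and 64 KB stripes, byte 0 lives on node 0, byte 65536 on node 1, and byte 262144 wraps back to node 0 at local offset 65536 — so a large sequential read touches all four nodes at once.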
A good thing about PVFS is that it allows binaries meant for regular filesystems to run on it without modification; this is somewhat of an exception in the parallel programming arena. (Some other parallel filesystems are mentioned in the resources section.)