NOTE: This content was originally published on the IBM developerWorks site. Because that site is being taken offline, the content has been copied here so it remains accessible.
I receive many questions on how to configure GPFS for reliability: “what is quorum, and why do I need it?” or “what are failure groups and how do I use them?” This article brings all of these topics into one place and discusses the options you have when configuring GPFS for high availability.
Every application has different reliability requirements, from scientific scratch data to mission-critical fraud detection systems. GPFS supports a variety of reliability levels depending on the needs of the application. When designing a GPFS cluster, consider what types of events your system needs to survive and how automatic you want the recovery to be. Any discussion of reliability in a GPFS cluster starts with quorum.
The worst type of failure in a cluster is called split brain. Split brain happens when multiple nodes in a cluster continue operations independently, with no way to communicate with each other. This situation cannot be allowed in a cluster file system, because without coordination the file system could become corrupted. Coordination between the nodes is essential to maintain data integrity: a lone node cannot be permitted to continue writing to the file system without coordinating with the other nodes in the cluster. When a network failure occurs, some node has to stop writing. Who continues and who stops is determined in GPFS using a mechanism called quorum.
Maintaining quorum in a GPFS cluster means that a majority of the nodes designated as quorum nodes are able to communicate successfully. In a three-quorum-node configuration, two of those nodes have to be communicating for cluster operations to continue. When one node is isolated by a network failure, it stops all file system operations until communications are restored, so no data is corrupted by a lack of coordination.
So how many quorum nodes do you need? There is no exact answer. Choose the number of quorum nodes based on your cluster design and your reliability requirements. If you have three nodes in your cluster, all of them should be quorum nodes. If you have a 3,000-node cluster, you do not want all nodes to be quorum nodes. True, you cannot configure all of them as quorum nodes since the maximum is 128, but even that is too many. When a node failure occurs, the quorum nodes have to do some work to decide what happens next: Can cluster operations continue? Who is the leader? So think of the quorum nodes like any other committee: the more members it has, the longer it takes to make a decision.
You can change node designations dynamically, so if a rack of nodes fails and is going to be down for a while, you can designate another node as quorum to maintain your desired level of reliability. Choose the smallest number of quorum nodes that makes sense for your cluster configuration. Even in the largest clusters this is typically 5 to 7 quorum nodes; one quorum node per GPFS building block is common in large clusters. Yes, 5 and 7 are odd numbers, and the general recommendation is to choose an odd number of quorum nodes. This is more a matter of style than a requirement, but it makes sense when you consider how many nodes can fail. If you have 4 quorum nodes you need 3 available (4/2+1=3, or one more than half) to continue cluster operations, the same as if you had 5 (5/2+1=3)* quorum nodes. That is why an odd number is typically recommended. In a single-node cluster (yes, there are single-node production “clusters”, typically for HSM) there is no one else to communicate with, so a single quorum node is all that is required.
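The arithmetic above is just integer division: the majority is one more than half, with any fraction dropped. A quick shell sketch (the `majority` function is mine, not a GPFS command):

```shell
# Quorum majority: one more than half, using integer (truncating) division.
majority() { echo $(( $1 / 2 + 1 )); }

majority 4   # prints 3 -- four quorum nodes tolerate only one failure
majority 5   # prints 3 -- five quorum nodes tolerate two failures
majority 7   # prints 4
```

This is why an even count buys you little: going from 4 to 5 quorum nodes leaves the required majority at 3 but lets you survive one more failure. (The dynamic designation change mentioned above is done with the `mmchnode` command.)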
*Yes, I know that (5/2)+1 is 3.5, but you cannot have half a quorum node; with integer arithmetic the majority of 5 quorum nodes is 3.
Node quorum with tiebreaker disks
Use tiebreaker disks when you have a two-node cluster, or a cluster where all of the nodes are SAN-attached to a common set of LUNs and you want to continue serving data with a single surviving node. Typically tiebreaker disks are only used in two-node clusters.
Tiebreaker disks are not special NSDs; you can use any NSD as a tiebreaker disk, up to a maximum of three. Choosing them from different file systems, or from different storage controllers, adds availability. In most cases using tiebreaker disks adds to the duration of a failover event, because there is an extra lease timeout that has to occur. In a two-node cluster you do not have a choice if you want reliability, which is why it is commonly recommended that clusters with more than two nodes use node quorum with no tiebreaker disks.
Using tiebreaker disks can improve failover performance only if you use SCSI-3 persistent reserve.
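As a sketch, configuring tiebreaker disks might look like the following; the NSD names are placeholders, and note that some GPFS releases require the cluster to be shut down (mmshutdown -a) before changing tiebreakerDisks:

```shell
# Designate up to three existing NSDs as tiebreaker disks
# (nsd_tb1, nsd_tb2, nsd_tb3 are placeholder NSD names)
mmchconfig tiebreakerDisks="nsd_tb1;nsd_tb2;nsd_tb3"

# Enable SCSI-3 persistent reserve so tiebreaker failover is faster
mmchconfig usePersistentReserve=yes

# Revert to plain node quorum
mmchconfig tiebreakerDisks=no
```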
File system descriptor quorum
File system descriptor quorum is one type of quorum in GPFS that is often overlooked. In a GPFS file system every disk has a header that contains information about the file system. This information is maintained on every disk in the file system, but when there are more than three NSDs in a file system only 3 copies of the file system descriptor are guaranteed to have the latest information; all of the others are updated asynchronously when the file system configuration is modified. Why not keep all of them up to date? Consider a file system with 1,000 disk drives. Each file system command would have to guarantee that every copy is up to date, which is difficult with that many copies, so three are maintained as official copies. For a file system to remain accessible, two of the three official copies of the file system descriptor need to be available. We will discuss this more after looking at replication.
Replication
In GPFS you can replicate (mirror) a single file, a set of files, or the entire file system, and you can change the replication status of a file at any time using a policy or command. You can replicate metadata (file inode information), file data, or both. In reality, though, if you do any replication you need to replicate metadata: without replicated metadata, if there is a failure you cannot mount the file system to access the replicated data anyway.
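As an illustrative sketch (fs1 and the file path are placeholders), replication is controlled per file system and per file; note that the maximum replication factors (set at file system creation) must allow the values you choose:

```shell
# Set the default replication for new files: two copies of metadata (-m)
# and two copies of data (-r)
mmchfs fs1 -m 2 -r 2

# Replicate one existing file
mmchattr -m 2 -r 2 /gpfs/fs1/important.dat

# Re-replicate existing files to match the new defaults
mmrestripefs fs1 -R
```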
A replication factor of two in GPFS means that each block of a replicated file is in at least two failure groups. A failure group is defined by the administrator, contains one or more NSDs, and can be changed at any time. Each storage pool in a GPFS file system contains one or more failure groups. So when a file system is fully replicated, any single failure group can fail and the data remains online.
How file system descriptor quorum affects replication
So far we have discussed a replication factor of two and two failure groups. There is one more aspect of replicating data in GPFS that is important to consider, and that is file system descriptor quorum. Remember that for a file system to remain accessible, two of the three official copies of the file system descriptor need to be available. How can you achieve that in a replicated file system with two failure groups? You can’t. When there are more than three NSDs in a file system, GPFS creates three official copies of the file system descriptor. With two failure groups, GPFS places one descriptor in one failure group and the other two in the other failure group (assuming there are at least 3 NSDs). In this configuration, if you lose the failure group which contains the two official copies of the file system descriptor, the file system unmounts. Therefore, for the file system to remain accessible, you need to create one more failure group that contains at least a single NSD.
Typically this third failure group contains a single small NSD that is defined with a usage type of descriptor only (descOnly). The descOnly designation means that this disk does not contain any file metadata or data; it is only there to hold one of the official copies of the file system descriptor. The descOnly disk does not need to be high performance and only needs to be 20MB or more in size, so this is one case where a local partition on a node is often used for this NSD. To create a descOnly NSD on a node, you can use a partition from a local LUN and define that node as the NSD server for that LUN so other nodes in the file system can see it.
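A descOnly NSD is defined with an ordinary NSD stanza. A sketch, with placeholder device, node, and file system names:

```shell
# Stanza for a small third-failure-group disk that holds only a
# file system descriptor copy (no data, no metadata)
cat > desc.stanza <<'EOF'
%nsd: device=/dev/sdx1
  nsd=desc_nsd1
  servers=sitec_node1
  usage=descOnly
  failureGroup=3
EOF

mmcrnsd -F desc.stanza        # create the NSD
mmadddisk fs1 -F desc.stanza  # add it to the file system
```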
IO patterns in a replicated system
When replicating a file system, all writes go to all failure groups in a storage pool. With replicated data, however, since you have two copies of the information there are some optimizations GPFS can make when your application is reading data. By default, when a file system is replicated GPFS spreads the reads over all of the available failure groups. This provides the best read performance when the nodes running GPFS have equal access to both copies of the data, for example when GPFS replication is used in a single data center to replicate over two separate storage servers that are all SAN-attached to all of the GPFS nodes.
The readReplicaPolicy configuration parameter allows you to change the read IO behavior in the file system. If you change this parameter from default to a value of local, GPFS changes the read behavior with replicated data: instead of simply reading from both failure groups, GPFS reads data from the failure group that is either on a local block device or on a local NSD server.
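A sketch of changing and checking the setting:

```shell
# Prefer the local replica for reads instead of spreading reads
# across both failure groups
mmchconfig readReplicaPolicy=local

# Confirm the current value
mmlsconfig readReplicaPolicy
```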
A local block device means that the path to the disk is through a block special device; on Linux, for example, that would be a /dev/sd* device, and on AIX a /dev/hdisk device. GPFS does not do any further determination, so if disks at two separate sites are connected using a long-distance SAN connection, GPFS cannot distinguish which copy is local. To use this option, connect the sites using the NSD protocol over TCP/IP or InfiniBand Verbs (Linux only).
A local NSD server is determined by GPFS looking at the subnet used in the daemon node name to determine which NSD servers are “local” to an NSD client. For NSD clients to benefit from “local” read access, the NSD servers supporting the local disk need to be on the same subnet as the NSD clients accessing the data. This parameter is useful when GPFS replication is used to mirror data across sites and there are NSD clients in the cluster, because it keeps read access requests from being sent over the WAN.
Figure 1 is an example of a multisite configuration that can benefit from a readReplicaPolicy of local. In this example, Location 1 and Location 2 both have a copy of the file system data and metadata. Each site is on a different subnet (for the daemon node name) and the servers on the two subnets can communicate using TCP/IP routing. The readReplicaPolicy parameter is set to local, so the compute cluster at Location 1 reads from the NSD servers in Location 1.
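A sketch of the site-local configuration described above, with hypothetical site subnets 10.1.1.0 and 10.1.2.0:

```shell
# List the subnets GPFS should treat as "local" when choosing an
# NSD server (10.1.1.0 and 10.1.2.0 are hypothetical site subnets)
mmchconfig subnets="10.1.1.0 10.1.2.0"
```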
GPFS can be configured for anything from basic availability using RAID-protected data all the way to multi-site configurations with GPFS-replicated metadata and data. Which configuration you choose depends on your requirements and budget.