IBM Spectrum Scale Erasure Code Edition (ECE) uses storage-rich servers to provide a software-based file system service protected by network erasure coding. It delivers high performance and high data reliability by distributing data strips across all disks on these servers over the network. The disks that store the strips of the same block are selected by disk location to achieve the best fault tolerance, which allows ECE to keep serving data through disk failures and even node failures.
For more information on IBM Spectrum Scale Erasure Code Edition see the Spectrum Scale Knowledge Center here:
2. Recovery Group
ECE places the disks of a set of storage-rich servers into one group called a “Recovery Group” (RG). Within the RG, disks of matching capacity and throughput are further divided into smaller groups called “Declustered Arrays” (DAs). ECE requires that each DA contain disks of the same type, evenly balanced across the servers.
Virtual disks (vdisks) are created from the physical disks (pdisks) in a Declustered Array. When an application writes to a file system built on ECE vdisks, each vdisk handles its data blocks and distributes them to the pdisks in the Declustered Array. For example, a vdisk may use an 8+2p erasure code to store data. When the file system writes an 8 MiB data block to the vdisk, the vdisk splits it into eight 1 MiB data strips, computes two additional 1 MiB parity strips, and writes these 10 strips to 10 different disks on different servers.
A Declustered Array reserves spare space for data rebuild. Users can specify how much space to reserve; the reserved size is an integral multiple of the physical disk capacity. Like data blocks, the reserved spare space is distributed across all disks and servers. When a disk fails, the data on that disk is reconstructed in the spare space.
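The arithmetic in the 8+2p example can be sketched in a few lines of Python. This is only an illustration of the numbers in the paragraph above; the function name `strip_layout` and its parameters are hypothetical and are not part of any ECE interface.

```python
def strip_layout(block_size_mib, data_strips, parity_strips):
    """Return (strip size in MiB, total strips, usable fraction) for one block."""
    strip_size_mib = block_size_mib / data_strips      # each data strip holds an equal share
    total_strips = data_strips + parity_strips         # parity strips are the same size
    usable_fraction = data_strips / total_strips       # share of raw capacity holding data
    return strip_size_mib, total_strips, usable_fraction


size, total, usable = strip_layout(block_size_mib=8, data_strips=8, parity_strips=2)
print(f"{total} strips of {size:g} MiB each, {usable:.0%} of raw capacity holds data")
```

For the 8 MiB block this gives 10 strips of 1 MiB, with 80% of the written capacity holding data and the rest holding parity.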
3. Fault tolerance
ECE knows the location of each pdisk and tries to distribute the data and parity strips of each data block across disks in different locations. Figure 1 shows an example of how one data block (also called a stripe) is placed on disks of different nodes.
In Figure 1, ECE has 5 storage-rich servers acting as I/O servers, each server has 12 disks, and all disks in the RG belong to the same DA. In this example, the virtual disk created in this DA uses an 8+2p erasure code, and the DA reserves 6 disks of spare space for data rebuild.
When the file system writes data to the virtual disk, each data block is split into 8 data strips plus 2 parity strips, so a total of 10 strips must be written to disks. Disks are grouped into failure domains according to the node they are attached to, and ECE tries to place the strips of a block on disks in different failure domains. In this case, each node holds only 2 strips of any data block, so the DA can tolerate one node failure, with all 12 disks on that node missing. (In Figure 1, the failure of node 5 loses only 2 of the 10 strips in one stripe.) The remaining 8 strips are enough to rebuild the data.
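The reasoning above can be checked with a small model. The sketch below assumes strips are spread over nodes as evenly as possible (a simple round-robin, not ECE's actual placement algorithm) and that the nodes holding the most strips are the ones that fail. The function names are hypothetical.

```python
def strips_per_node(width, nodes):
    """Spread `width` strips of one stripe over `nodes` as evenly as possible."""
    counts = [0] * nodes
    for strip in range(width):
        counts[strip % nodes] += 1      # simple round-robin placement
    return counts


def node_fault_tolerance(counts, parity_strips):
    """Number of node failures survivable when the fullest nodes fail first."""
    lost = 0
    tolerated = 0
    for count in sorted(counts, reverse=True):
        if lost + count > parity_strips:
            break                       # one more failure would lose too many strips
        lost += count
        tolerated += 1
    return tolerated


counts = strips_per_node(width=10, nodes=5)          # Figure 1: 8+2p on 5 nodes
print(counts, node_fault_tolerance(counts, parity_strips=2))
```

With 5 nodes and 10 strips every node holds 2 strips, so losing one node costs exactly the 2 strips that the parity can cover, and the model reports a tolerance of one node.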
4. How disk failures affect fault tolerance
In Figure 1, we described how ECE can tolerate one node failure in this configuration. If a disk failure happens on this system before a node failure, can it still tolerate one node failure?
Figure 2 has the same configuration as Figure 1. This time, node 1 has one physical disk failure before node 5 fails.
When we create the Declustered Array in the Recovery Group, we reserve 6 disks of spare space for rebuilding data. The spare space is evenly scattered across all disks, so with 5 nodes each node holds one and one-fifth disks (6/5) of spare space. When one disk fails on node 1, ECE starts rebuilding the data of the failed disk to restore the full 8+2p stripes. It first tries to place the rebuilt data on node 1 itself, which preserves the original node fault tolerance. After the data of one failed disk has been rebuilt, one-fifth of a disk of spare space remains on node 1. As Figure 2 shows, the lost strips are reconstructed on disks of node 1, and all data that needs rebuilding fits in the reserved spare space on node 1. The system keeps its one-node fault tolerance because each block still has 2 strips on each node.
Figure 3 adds a second disk failure on node 1. When ECE tries to rebuild the data of the second failed disk, it cannot find enough spare space on node 1, since only one-fifth of a disk is left, so it starts using spare space on other nodes for part of the data. As Figure 3 shows, one data strip is reconstructed on node 5. Some data blocks now have 3 strips on the same node, so with the 8+2p erasure code the system can no longer tolerate a node failure.
Figure 4 uses a different configuration from Figures 1, 2 and 3. The system still uses the 8+2p erasure code and 5 nodes, but each node has 24 disks, and 10 disks of spare space are reserved, since a larger number of disks allows more spare space to be configured.
Each node has 2 disks of spare space reserved for rebuild. When one node has 2 failed disks, all of their data can be rebuilt on the same node. As Figure 4 shows, if the number of reserved spare disks is twice the number of nodes, the system keeps its one-node fault tolerance even when each node has 2 disk failures.
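The spare-space bookkeeping for Figures 2 and 3 can be followed with exact fractions. The sketch below is a simplified model, not ECE's rebuild logic: it assumes spare space is spread evenly over the nodes and that each failed disk needs one full disk of spare space. The function name `rebuild_on_node` is hypothetical.

```python
from fractions import Fraction


def rebuild_on_node(total_spare_disks, nodes, failed_disks_on_node):
    """Return (spare left on the node, disks of data rebuilt on other nodes).

    Assumes an even spread of spare space and that every failed disk
    consumes one full disk of spare space (a worst-case simplification).
    """
    spare_on_node = Fraction(total_spare_disks, nodes)
    remaining = spare_on_node - failed_disks_on_node
    if remaining >= 0:
        return remaining, Fraction(0)   # everything fits on the same node
    return Fraction(0), -remaining      # the shortfall spills to other nodes


for failures in (1, 2):
    left, spilled = rebuild_on_node(6, 5, failures)
    print(f"{failures} failed disk(s): {left} disk of spare left on node 1, "
          f"{spilled} disk rebuilt on other nodes")
```

Under these assumptions one failure leaves 1/5 of a disk of spare space on node 1, while a second failure exhausts it and pushes 4/5 of a disk of data to other nodes, which is the point at which some stripes end up with 3 strips on one node.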
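Read as a planning rule, the relationship between spare disks, nodes and disk failures that can be absorbed locally is plain multiplication and integer division. The helper names below are hypothetical, and the model again assumes spare space is spread evenly over the nodes.

```python
def spares_for_local_rebuild(nodes, disk_failures_per_node):
    """Spare disks to reserve so every node can rebuild that many failed
    disks locally, assuming spare space is spread evenly over the nodes."""
    return nodes * disk_failures_per_node


def local_failures_covered(total_spare_disks, nodes):
    """Whole failed disks per node that fit in that node's share of spare space."""
    return total_spare_disks // nodes


print(spares_for_local_rebuild(nodes=5, disk_failures_per_node=2))  # Figure 4 setup
print(local_failures_covered(total_spare_disks=6, nodes=5))         # Figures 1 to 3 setup
```

The first call gives the 10 spare disks used in Figure 4; the second shows that 6 spare disks on 5 nodes cover only one failed disk per node before rebuild data has to move to other nodes.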
5. Two node fault tolerance
When the number of nodes equals or exceeds the erasure code width (data strips plus parity strips), ECE can tolerate more than one node failure.
Figure 5 shows a system with 10 nodes that uses the 8+2p erasure code and has only 6 disks of spare space (fewer than the number of nodes).
In the system shown in Figure 5, each node holds only one strip of each data block. With 2 parity strips of redundancy, this ECE system has two-node fault tolerance. When a disk fails on a node, ECE uses spare space on other nodes to rebuild the data, because the spare space on the same node is not enough. The advantage is that even if one node has multiple (fewer than 6) disk failures, ECE still has one-node fault tolerance: the rebuild process distributes the data among different nodes and keeps each node at a maximum of 2 strips from each data block.
When planning the fault tolerance of an IBM Spectrum Scale Erasure Code Edition system, we recommend more than one node of fault tolerance, for example 1 node plus 1 pdisk (see the section ‘Planning for erasure code selection’ in the ECE Knowledge Center for more details). The reserved rebuild spare space also needs careful consideration, because it affects the fault tolerance in different situations. As a planning exercise, consider the following example: 6 nodes, an 8+3p erasure code, and 4 disks of spare space. A healthy system has 1 node plus 1 pdisk fault tolerance. What fault tolerance is left after 2 pdisk failures on one node? (The answer is given below.)
(Answer: With 8+3p, there are 11 strips distributed across 6 nodes, so each node has at most 2 strips from any data block. With 4 disks of spare space, each node has 2/3 of a disk of spare space. With 2 pdisk failures on one node, that node does not have enough spare space to hold the data from the failed disks, so some data must be rebuilt on other nodes. As a result, a node may hold up to 3 strips of a data block. Because we are using 8+3p, if a node with 3 strips fails, 8 strips still remain to rebuild the lost data, so the system keeps its one-node fault tolerance.)
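The drop from two-node to one-node tolerance can be confirmed by brute force: try every combination of failed nodes and check that no stripe loses more strips than it has parity. The sketch below does this for a single stripe. The per-node strip counts and the choice of node 1 as the rebuild target are illustrative assumptions, not the placement ECE would necessarily pick.

```python
from itertools import combinations


def survives(counts, parity_strips, failed_nodes):
    """True if the stripe loses no more strips than it has parity strips."""
    return sum(counts[n] for n in failed_nodes) <= parity_strips


def node_fault_tolerance(counts, parity_strips):
    """Largest k such that every combination of k failed nodes is survivable."""
    k = 0
    while k < len(counts) and all(
        survives(counts, parity_strips, failed)
        for failed in combinations(range(len(counts)), k + 1)
    ):
        k += 1
    return k


healthy = [1] * 10              # Figure 5: one strip of the stripe on each node
after_rebuild = healthy.copy()
after_rebuild[0] -= 1           # strip lost with the failed disk on node 0
after_rebuild[1] += 1           # rebuilt in spare space on node 1 (assumed target)

print(node_fault_tolerance(healthy, parity_strips=2))
print(node_fault_tolerance(after_rebuild, parity_strips=2))
```

The healthy layout survives any two node failures. After the rebuild, one node holds 2 strips, so a failure of that node together with any other strip-holding node would lose 3 strips, and the tolerance drops to one node.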
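The steps of the answer can be written out as a short calculation. This is a sketch under the same simplifications as the answer itself: spare space is spread evenly over the nodes, and when a node's spare space runs out, at most one extra strip of any stripe lands on another node. The function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def tolerates_node_failure_after_disk_failures(
    nodes, data_strips, parity_strips, spare_disks, failed_disks_on_node
):
    """True if one node can still fail after the given disk failures.

    Assumes even spare distribution and that a spill adds at most one
    extra strip of a stripe to any node (as in the answer's reasoning).
    """
    width = data_strips + parity_strips
    healthy_max = ceil(width / nodes)                  # most strips on one node when healthy
    spare_per_node = Fraction(spare_disks, nodes)
    spills = failed_disks_on_node > spare_per_node     # rebuild must use other nodes
    worst_max = healthy_max + 1 if spills else healthy_max
    strips_left = width - worst_max                    # after the fullest node fails
    return strips_left >= data_strips


# The planning exercise: 6 nodes, 8+3p, 4 spare disks, 2 pdisk failures on one node.
print(tolerates_node_failure_after_disk_failures(6, 8, 3, 4, 2))
# The Figure 3 configuration for comparison: 5 nodes, 8+2p, 6 spare disks.
print(tolerates_node_failure_after_disk_failures(5, 8, 2, 6, 2))
```

For the exercise the worst node holds 3 of 11 strips, leaving 8, which is exactly enough to rebuild. For the Figure 3 configuration the worst node holds 3 of 10 strips, leaving 7, which is one short.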