1. Introduction

Elastic Storage Server (ESS) uses Reed-Solomon codes to protect data when physical disks fail. ESS can also detect the physical location of each disk, and it tries to place the strips of a data block on disks in different locations. This gives a building block higher fault tolerance. These locations are known as disk failure domains.

2. ESS Building Block

An ESS building block has two I/O servers that provide virtual disk (vdisk) service to the GPFS file system, with a set of physical disks attached to both servers. Each I/O node serves one recovery group (RG), and each RG has vdisks built in declustered arrays (DAs). The picture below shows the basic structure of one building block providing service in a cluster:

This picture maps onto a typical ESS configuration with two servers and six disk enclosures, as the picture below shows. Each of the six disk enclosures has two paths connected to each server. The disks in the left part of the enclosures belong to RG01, and the disks in the right part belong to RG02. DA1 contains only the spinning disks in the disk enclosures; the SSDs belong to a separate DA used for logging. User vdisks are created in DA1, and each vdisk spreads its strips across the disks in DA1.

3. RAID Code Tolerance

ESS supports the 3-way mirror, 4-way mirror, 8+2p, and 8+3p RAID codes. Taking 8+3p as an example, each user data block is stored as 11 strips (8 data strips and 3 parity strips) placed on 11 different disks. When one disk fails, the blocks that have a strip on that disk still have two strips of redundancy on other available disks. When three disks fail at the same time, some user data blocks have no redundancy left.
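The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not ESS code: the dictionary labels follow this article's wording, the helper names are hypothetical, and the worst case assumes that every failed disk holds a strip of the same data block.

```python
# Strip layout per RAID code: (data strips, redundant strips).
# Labels follow this article's wording; helper names are hypothetical.
RAID_CODES = {
    "3-way mirror": (1, 2),
    "4-way mirror": (1, 3),
    "8+2p": (8, 2),
    "8+3p": (8, 3),
}


def strips_per_block(code):
    """Total number of strips written for one user data block."""
    data, redundant = RAID_CODES[code]
    return data + redundant


def remaining_redundancy(code, failed_disks):
    """Worst-case redundancy left for a block after some disks fail.

    Assumes every failed disk held a strip of the same block.
    A negative result means the block can no longer be reconstructed.
    """
    _, redundant = RAID_CODES[code]
    return redundant - failed_disks


for code in RAID_CODES:
    left = [remaining_redundancy(code, n) for n in range(4)]
    print(code, "strips:", strips_per_block(code), "redundancy after 0-3 failures:", left)
```

For 8+3p this gives 11 strips per block, two strips of redundancy after one disk failure, and none after three.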

4. Disk Failure Domains

ESS knows the physical location of each disk, so when it places the strips of a data block it can optimize the placement by choosing disks that belong to different enclosures and disk drawers. This gives ESS a higher fault tolerance.
The picture below shows an example of an ESS building block built from DCS3700 disk enclosures, and how disks are selected to hold one data block with an 8+3p stripe:

A DCS3700 disk enclosure has 5 disk drawers, and each drawer holds 12 disks. The picture shows an ESS GL6 with 6 DCS3700 disk enclosures. When ESS needs to write an 8+3p stripe, it tries to put the 11 strips on different enclosures and drawers; the disks shown in red are the 11 disks chosen for these strips. The picture shows that each enclosure holds at most 2 strips, so when an entire disk enclosure is lost, only 2 strips are lost and the stripe still has one strip of redundancy. Because the strips are also placed on different disk drawers, ESS can then still lose the disks of one more drawer: the remaining redundancy is one drawer.
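The counting in this example can be checked with a small Python sketch. The round-robin placement below is a hypothetical stand-in used only to reproduce the numbers; it is not the placement algorithm that ESS actually uses, and all names are illustrative.

```python
from collections import Counter

ENCLOSURES = 6             # ESS GL6: six DCS3700 enclosures
DRAWERS_PER_ENCLOSURE = 5  # each DCS3700 has five drawers
STRIPS = 11                # 8+3p: 8 data strips + 3 parity strips
REDUNDANCY = 3


def place_strips(strips, enclosures, drawers_per_enclosure):
    """Hypothetical round-robin placement.

    Returns one (enclosure, drawer) pair per strip, cycling through the
    enclosures first and moving to the next drawer on each new pass.
    """
    return [(i % enclosures, (i // enclosures) % drawers_per_enclosure)
            for i in range(strips)]


placement = place_strips(STRIPS, ENCLOSURES, DRAWERS_PER_ENCLOSURE)
per_enclosure = Counter(enclosure for enclosure, _ in placement)
per_drawer = Counter(placement)

worst_enclosure = max(per_enclosure.values())  # strips lost with one enclosure
worst_drawer = max(per_drawer.values())        # strips lost with one drawer

left_after_enclosure = REDUNDANCY - worst_enclosure
left_after_drawer = left_after_enclosure - worst_drawer

print("max strips per enclosure:", worst_enclosure)
print("max strips per drawer:", worst_drawer)
print("redundancy left after losing one enclosure:", left_after_enclosure)
print("redundancy left after also losing one drawer:", left_after_drawer)
```

With 11 strips over 6 enclosures, no enclosure holds more than 2 strips and no drawer holds more than 1, so the stripe survives the loss of one enclosure plus one drawer with no redundancy to spare.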

5. Example of an ESS GS6

Now we use an ESS GS6 as an example to show how a real environment reports its fault tolerance. A GS6 has 6 disk enclosures without drawers, and each disk enclosure holds 24 disks.

The mmlsrecoverygroup command with the -L option shows the fault tolerance for the current RG state:

This recovery group has user vdisks with the 4-way mirror and 8+3p RAID codes. With no disk failures in the 6 disk enclosures, the fault tolerance for the 8+3p vdisks is “1 enclosure + 1 pdisk”. This means that 1 disk enclosure plus 1 pdisk (one that does not belong to that enclosure) can be missing at the same time, and ESS can still keep data integrity and provide service. (This type of disk enclosure does not have drawers; with DCS3700 enclosures, the output would show “1 enclosure + 1 drawer”.)

In theory, 4-way mirror vdisks could have 3 enclosures missing at the same time, but the RG's metadata limits the tolerance: this ESS cannot lose 2 disk enclosures at the same time.
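As a rough check of these figures, the theoretical tolerance of a vdisk can be estimated from the strip count alone. The sketch below assumes that strips are spread as evenly as possible over the enclosures and ignores drawers. The function name is hypothetical, and the calculation does not model the RG metadata limit described above, so it is not what mmlsrecoverygroup computes.

```python
import math


def theoretical_tolerance(total_strips, redundancy, enclosures):
    """Estimate (enclosures, extra pdisks) that can be lost in the worst case.

    Assumes an even spread, so no enclosure holds more than
    ceil(total_strips / enclosures) strips of one block.
    """
    per_enclosure = math.ceil(total_strips / enclosures)
    lost_enclosures = min(redundancy // per_enclosure, enclosures)
    extra_pdisks = redundancy - lost_enclosures * per_enclosure
    return lost_enclosures, extra_pdisks


GS6_ENCLOSURES = 6

# 8+3p: 11 strips, 3 of them redundant
print("8+3p:", theoretical_tolerance(11, 3, GS6_ENCLOSURES))

# 4-way mirror: 4 strips, 3 of them redundant
print("4-way mirror:", theoretical_tolerance(4, 3, GS6_ENCLOSURES))
```

Under these assumptions the estimate is 1 enclosure plus 1 pdisk for 8+3p and 3 enclosures for the 4-way mirror, which matches the theoretical values in the text.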

Next, one disk enclosure goes missing in the other RG of this ESS, which has the same fault tolerance. After the disk hospital detects that all disks in the enclosure are missing, ESS starts to rebuild the data:

After some time the rebuild process finishes, and the mmlsrecoverygroup command shows the new fault tolerance as “1 pdisk”:

At this point, ESS still has a fault tolerance of 1 pdisk. After the missing disks of the enclosure come back, ESS starts the rebuild and rebalance processes to restore the original fault tolerance.
