IBM Spectrum Scale offers flexible and scalable software-defined file storage for analytics workloads. Enterprises around the globe have deployed IBM Spectrum Scale to build large data lakes for running data intensive workloads, including AI, high-performance computing (HPC) and analytics. IBM Elastic Storage Server (ESS) is a pre-integrated storage solution that is powered by IBM Spectrum Scale software on IBM Power Systems and disk enclosures. IBM ESS models are built to offer almost linear scalability in performance and capacity with very high reliability. Hortonworks Data Platform (HDP) is a leading Hadoop and Spark distribution.
There are two major storage deployment models for Hadoop.
– One is the shared-nothing storage model, in which both compute and storage resources come from storage-rich servers. This is referred to as the traditional Hadoop architecture, with native HDFS (Hadoop Distributed File System) as the distributed file system running on the Hadoop cluster nodes.
– The other is the shared storage model, in which a shared storage system, such as ESS, provides storage services to the Hadoop cluster. In this model, the shared storage system implements the HDFS protocol interface so that HDFS clients running on Hadoop nodes can access the centralized data.
The shared storage deployment model is becoming very popular, primarily because it disaggregates storage from compute in a Hadoop environment, allowing compute and storage to grow independently as business requirements dictate. This helps significantly in controlling cluster sprawl and data center footprint. Most commercial shared storage offerings allow the same data to be accessed through multiple protocols in addition to the native HDFS APIs. Access through industry-standard protocols (e.g. Windows SMB, NFS, S3) enables organizations to build a single common data lake for Hadoop and non-Hadoop applications. Adoption of containerized workloads is another reason why shared storage deployments are being considered. While these reasons attract enterprises to shared storage for Hadoop deployments, they sometimes have concerns about how the performance of the shared storage model compares with that of a traditional native HDFS based deployment.
IBM Spectrum Scale has been recognized for its performance and flexibility. It is the high-performance file system behind the two fastest, smartest supercomputers, Summit and Sierra, and is used broadly by IBM, our partners, and our clients for industry-standard file system benchmarks like SPEC-SFS or IO-500. In this article, we focus only on the HDFS interface implementation for IBM Spectrum Scale / ESS that is used for running Hadoop workloads. We share some real customer stories from PoCs and evaluations in which our customers saw that Hadoop implementations based on Hortonworks Data Platform (HDP) or community Hadoop with ESS shared storage can provide performance comparable to or better than a native HDFS based shared-nothing architecture.
Customer Story 1: Comparing “traditional HDP deployment on Power servers” with “HDP on Power + ESS shared storage”
We ran this test as part of a PoC for one of our customers who was considering HDP on Power servers. The goal was to prove that HDP on Power with ESS shared storage delivers performance comparable to or better than HDP on Power with the traditional native HDFS shared-nothing model.
For this comparison, we ran a 1 TB MapReduce Terasort benchmark (built into Hadoop version 3.0) using the following hardware in two configurations. The first configuration is an HDP setup with native HDFS and the shared-nothing storage deployment model. Refer to Figure 1. In this configuration, all benchmark data is stored in HDFS on the internal disks of the data nodes.
The second configuration is an HDP on Power setup with the IBM ESS shared storage deployment model. Refer to Figure 2. Compute resources are the same as in the first configuration. However, these nodes run only the compute, and all data resides on the ESS. Each client uses the IBM Spectrum Scale client service with the HDFS connector, so no change to the benchmark code was required to move from shared-nothing to centralized storage.
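To make the size of the benchmark input concrete, here is a minimal sketch of how a 1 TB Terasort dataset translates into a row count. It relies on the fact that TeraGen, the data generator that ships with the Hadoop examples, writes fixed 100-byte rows. The helper name `teragen_rows` is hypothetical, and treating 1 TB as 10^12 bytes is an assumption; the exact dataset size used in the PoC is not stated beyond "1 TB".

```python
# TeraGen writes fixed-size 100-byte rows, so the row count determines the dataset size.
TERAGEN_ROW_BYTES = 100


def teragen_rows(target_bytes: int) -> int:
    """Return the number of 100-byte rows needed to reach target_bytes (rounded up)."""
    return -(-target_bytes // TERAGEN_ROW_BYTES)  # ceiling division on integers


ONE_TB = 10**12  # assumption: decimal terabyte, not 2**40 bytes

rows = teragen_rows(ONE_TB)
print(rows)  # 10000000000
```

Because both configurations expose the same HDFS interface, the same row count and the same benchmark job apply to either setup.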
The benchmark results, measured as time taken and shown in Figure 3, demonstrated that the ESS-based shared storage setup performed 15% faster than native HDFS with internal disks.
In addition to looking at performance, our clients also compare the cost of a shared-nothing configuration with that of a shared storage configuration. On pure capital expense, we have typically seen ESS-based shared storage configurations demonstrate significant cost benefits as Hadoop storage requirements grow beyond a few hundred terabytes. The total cost of ownership can be significantly lower because of eliminated data copies, reduced management overhead, and simpler data governance and protection requirements. In our experience, most of our clients choose ESS storage when they have any requirement to share data between Hadoop and non-Hadoop applications.
Customer Story 2: Comparing “traditional HDP deployment on x86 servers” with “HDP on Power + ESS shared storage”
Here is another interesting client story, in which the client already had an in-house traditional HDP environment on x86 servers. The client asked the IBM team to set up a comparable environment in an IBM lab for HDP running on Power with ESS shared storage. The requirement for this PoC was that the client be able to run tests on both environments (their in-house environment and the environment in the IBM lab) without any intervention from the IBM team.
And here are the benchmark results.
The client team ran the HiBench workload, taking ESS disk utilization to its maximum, and confirmed that ESS performance was better than that of their x86 server-based storage environment. Since the IBM team did not have visibility into the exact tests run by the client, we are not able to provide more statistics here. The above table is based on the descriptive feedback received from the client, who eventually adopted ESS as the shared storage behind their HDP environment.
Customer Story 3: Comparing “traditional Hadoop deployment on x86 servers” with “Hadoop on Power + ESS shared storage” for real-world genome workload
IBM assisted Louisiana State University (LSU) in evaluating a Hadoop on Power + ESS environment against a traditional x86-based Hadoop environment. The workload used for this evaluation was de novo genome assembly, provided by the LSU team. It is a Hadoop-based workload for de Bruijn graph construction, which is both data intensive and compute intensive.
Here is a summary of the environments used for the comparison.
You can learn more about this experiment from this published report. Here is the result of one of the experiments mentioned in this report.
For the dataset of 3.2 TB of metagenome data, the process completed in 6 hours and 22 minutes on the IBM POWER8 cluster using only 40 nodes. The same process took more than 20 hours to complete on the 120 Intel nodes available at LSU. In summary, the IBM cluster delivered more than a 3x performance improvement while using one-third as many nodes.
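The arithmetic behind this summary can be checked with a short sketch. It uses only the figures quoted above. Because the x86 run took "more than 20 hours", 20 hours is used as a lower bound, so the computed ratios are minimums. The node-hours comparison is an extra illustration derived from those same numbers, not a figure taken from the report, and the helper name `to_minutes` is hypothetical.

```python
def to_minutes(hours: int, minutes: int = 0) -> int:
    """Convert an hours/minutes duration to whole minutes."""
    return hours * 60 + minutes


# Figures quoted in the text.
power_nodes, power_minutes = 40, to_minutes(6, 22)  # 382 minutes
x86_nodes, x86_minutes = 120, to_minutes(20)        # lower bound: "more than 20 hours"

# Wall-clock speedup of the POWER8 + ESS cluster over the x86 cluster.
speedup = x86_minutes / power_minutes

# How many times more nodes the x86 cluster used.
node_ratio = x86_nodes / power_nodes

# Ratio of total node-time consumed (nodes x minutes) by each cluster.
node_hours_ratio = (x86_nodes * x86_minutes) / (power_nodes * power_minutes)

print(f"speedup >= {speedup:.2f}x")                    # speedup >= 3.14x
print(f"node ratio = {node_ratio:.0f}x")               # node ratio = 3x
print(f"node-hours ratio >= {node_hours_ratio:.2f}x")  # node-hours ratio >= 9.42x
```

Under these assumptions, the wall-clock speedup is at least 3.14x on one-third of the nodes, which corresponds to at least 9.42x less total node-time for the same job.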
In these and other tests we have seen that shared data storage is an excellent option for Hadoop workloads. The highly reliable, high-performance ESS server delivered excellent performance across these three comparisons. When considering building, expanding or upgrading any Hadoop infrastructure, IBM Spectrum Scale on ESS can be a better choice with the power to support demanding Hortonworks applications and the flexibility of support for additional applications on the same shared storage.