If you ask a data scientist or a data analyst whether they know, or care, what infrastructure their data platform is running on, the vast majority will have no idea and would rather not know. However, if you ask them whether it would improve their life and help their business to get results in half the time, or to double their data set size for greater accuracy, they pay close attention.
Those architecting data platforms therefore need to look closely at how to select optimized servers and storage early in the planning process, so they can meet the growing demands of their data consumers as projects expand. This starts with selecting a Linux server for the cluster environment that not only falls within the commodity price range for Hadoop servers but also provides throughput advantages to deliver results faster. The IBM Power Systems family of scale-out Linux servers, developed within the OpenPOWER community, has more threads, larger and faster caches, and memory bandwidth advantages that guarantee 3x query throughput for your infrastructure dollars.
Now, fast processors are only useful if you can ensure your data is available in the platform when it is required. Waiting for data to be loaded and copied from one silo to another can cause a massive productivity hit for your data scientists and analysts. Also, if your Hadoop cluster sprawl has added tens or hundreds of data-rich servers just for data storage capacity, you can end up with an imbalanced cluster with over-provisioned and under-utilized compute capacity. As one of my distinguished colleagues wisely says: “all computers wait at the same speed.”
To unlock the advantages of an optimized POWER8 processor, the IBM Elastic Storage Server (ESS) addresses the two key inhibitors I just mentioned: it ensures the data is ready when you need it, and it prevents wasted compute resources. Firstly, with the included Spectrum Scale file system, you have a single global file system that can serve mixed analytics workloads on a single version of the data through multiple access methods, including POSIX, NFS, SMB, Object and, of course, Hadoop HDFS. In a nutshell, this means your data engineers won’t waste time copying data between different application data silos, in and out of the data lake, which ultimately means the data scientists have faster access to the data they need.
Secondly, ESS provides a storage-dense appliance in a wide range of capacities and drive types, which decouples HDFS storage from the compute nodes. This has two benefits. It allows a significant reduction in the number of Hadoop worker nodes, as they can be optimized for the compute function. It also cuts the data storage overhead from the 3X replication of standard HDFS to a mere 30% with ESS software RAID, while delivering greater resiliency, faster disk rebuild times and near-unlimited scalability.
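To put the storage numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes that "30% overhead" means raw capacity is about 1.3 times the usable data, and it uses a hypothetical 1,000 TB data lake; the function and constant names are illustrative only and are not part of any IBM or Hadoop tooling.

```python
def raw_capacity_tb(usable_tb: float, raw_to_usable_ratio: float) -> float:
    """Raw capacity needed to store usable_tb at a given raw-to-usable ratio."""
    if usable_tb < 0 or raw_to_usable_ratio < 1:
        raise ValueError("usable_tb must be >= 0 and the ratio must be >= 1")
    return usable_tb * raw_to_usable_ratio


# Standard HDFS keeps three full copies of every block.
HDFS_REPLICATION_RATIO = 3.0
# Assumption: a 30% overhead is read as 1.3 TB of raw disk per usable TB.
ESS_RAID_RATIO = 1.3

usable_tb = 1000.0  # hypothetical data lake size

hdfs_raw = raw_capacity_tb(usable_tb, HDFS_REPLICATION_RATIO)
ess_raw = raw_capacity_tb(usable_tb, ESS_RAID_RATIO)
savings = 1 - ess_raw / hdfs_raw  # fraction of raw capacity avoided

print(f"HDFS 3x replication: {hdfs_raw:.0f} TB raw")
print(f"ESS software RAID:   {ess_raw:.0f} TB raw")
print(f"Raw capacity saved:  {savings:.0%}")
```

Under those assumptions, the same usable data needs well under half the raw disk, which is where the reduction in storage-driven worker nodes comes from. Actual figures depend on the RAID code and configuration chosen.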
Today I’m happy to announce that the combination of IBM Power Systems servers with IBM Elastic Storage Server is now available as a superior optimized infrastructure combination for your Hortonworks Data Platform data lake or machine and deep learning environments, such as IBM PowerAI and Data Science Experience.
You can learn more about Hortonworks Data Platform on IBM Power Systems with Elastic Storage Server by registering for our upcoming webinar, and by visiting the IBM Booth at the Strata Data Conference, October 27-28 in New York City.