In October 2014, IBM published the world’s first audited SQL-on-Hadoop benchmark, comparing Hive, Impala, and Big SQL using TPC-DS-inspired data and queries. A year has passed since that benchmark, and SQL engines on Hadoop have improved significantly. An update was overdue, so we ran a similar test and shared the results at the recent Strata + Hadoop World conference in New York City.
If you have not heard of Big SQL before, here is the one-line summary:
Big SQL makes access to Hive data faster and more secure.
At the conference, we also hooked up IBM Cognos so that we could run live, side-by-side query demos for visitors at our booth. This video captures what we demonstrated there.
Hive is the de facto standard for SQL on Hadoop, as it is included in every commercial Hadoop distribution. It is also a commonly used baseline for comparing the performance of other SQL engines. It was the first SQL engine for Hadoop and has improved significantly since our last benchmark.
When we talk about the relationship between Hive and Big SQL, we have to look a level deeper and recognize that Hive is really three things:
- Hive is a SQL execution engine that converts SQL into a series of MapReduce jobs.
- Hive defines a storage model for how warehouse data should be organized in Hadoop.
- Hive has a metastore that is used not only by Hive, but also by other applications that integrate with Hadoop.
Big SQL provides an alternative execution engine only; it preserves the Hive storage model and the Hive metastore. In this way, users can benefit from the performance and security capabilities of Big SQL while using only open data formats. In fact, Big SQL and Hive both exist on our customers’ Hadoop clusters and can run concurrently. Tables created in Hive are visible to Big SQL and vice versa.
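To picture that separation, here is a toy model in Python. It is only a sketch of the idea and not how Hive or Big SQL are implemented: the `Metastore`, `TableDef`, and `Engine` classes, the warehouse paths, and the file formats are all made up for illustration. It shows two engines that own nothing but query execution and share one catalog, so a table registered through either engine is visible to the other.

```python
from dataclasses import dataclass, field


@dataclass
class TableDef:
    """Hypothetical catalog entry: where the data lives and how it is stored."""
    name: str
    location: str      # illustrative path in the shared warehouse
    file_format: str   # an open file format, e.g. "PARQUET"


@dataclass
class Metastore:
    """Toy stand-in for a shared metastore; not the real Hive metastore API."""
    tables: dict = field(default_factory=dict)

    def register(self, table: TableDef) -> None:
        self.tables[table.name] = table

    def lookup(self, name: str) -> TableDef:
        return self.tables[name]


class Engine:
    """An execution engine that holds a reference to the catalog but owns no data."""

    def __init__(self, label: str, metastore: Metastore):
        self.label = label
        self.metastore = metastore

    def create_table(self, name: str, location: str, file_format: str) -> None:
        # The definition goes into the shared catalog, not into the engine.
        self.metastore.register(TableDef(name, location, file_format))

    def describe(self, name: str) -> str:
        t = self.metastore.lookup(name)
        return f"{self.label} reads {t.name} ({t.file_format}) at {t.location}"


# One catalog, two engines pointing at it.
shared = Metastore()
hive = Engine("Hive", shared)
big_sql = Engine("Big SQL", shared)

# Each engine registers one table (names and paths are illustrative).
hive.create_table("store_sales", "/warehouse/store_sales", "PARQUET")
big_sql.create_table("item", "/warehouse/item", "ORC")

# Each engine can see the table the other one created.
print(big_sql.describe("store_sales"))
print(hive.describe("item"))
```

Because neither engine keeps its own copy of the table definitions or the data, swapping the execution engine changes how a query runs, not where the data lives or what format it is in.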
For this test, we compared the latest version of Hive (1.2.1) with Big SQL V4.1, both running on the IBM Open Platform. Two equivalently configured 20-node clusters were set up on SoftLayer, using bare metal servers configured according to IBM’s reference architecture for enterprise Hadoop. Once again, we aimed to run the Hadoop-DS benchmark (based on TPC-DS).
Please note that, unlike last time, this is neither an audited benchmark nor an official TPC-DS benchmark. Audited benchmarks are expensive and take a long time to complete. As noted earlier, our last published result was a year ago. We hope to run these performance tests 2-3 times per year, given how rapidly technologies are advancing in this space, and publishing each one as an audited result would be impractical.