Big SQL vs Spark SQL at 100TB: How do they stack up? - Hadoop Dev

Introduction:

Early adopters in the SQL over Hadoop space are now creating large Data Lakes to store their most precious commodity: data. IBM predicted this trend when it published the first (and only) independently audited Hadoop-DS benchmark in October 2014. Hadoop-DS is a derivative of the industry-standard TPC-DS benchmark, customized to better match the capabilities of the SQL over Hadoop space in a Data Lake environment. At that time, IBM used a scale factor of 10TB to compare Big SQL with other leading SQL over Hadoop solutions, namely Hive 0.13 and Impala 1.4.1. Since then, all products have made significant improvements in performance and scalability. Hortonworks and Cloudera have also invested heavily to enrich their support of the SQL language (IBM was already there).

Spark has since emerged as a favorite for analytics among the open source community, and Spark SQL allows users to formulate their questions to Spark using the familiar language of SQL. So, what better way to assess the capabilities of Spark than to put it through its paces and use the Hadoop-DS benchmark to compare performance, throughput, and SQL compatibility against Big SQL? Since this is 2017, and Data Lakes are getting larger by the day, we chose a scale factor of 100TB.

This study once again highlights the need for a SQL over Hadoop engine that is not only fast and efficient but, more importantly, can successfully execute complex queries on big data. Performance is not the only factor when choosing a SQL over Hadoop solution; the solution also needs to be backed by a mature and robust runtime environment. Cost continues to be one of the major attractions for a business adopting SQL over Hadoop solutions (especially when compared to traditional RDBMSs), but failing queries, for whatever reason, severely impact productivity and reduce the cost effectiveness of those SQL over Hadoop solutions which fail to deliver.

Summary
During the course of the project we discovered that Big SQL is the only solution capable of executing all 99 queries unmodified at 100TB, and that it can do so 3x faster than Spark SQL while using far fewer resources. These facts lead us to conclude that Data Professionals using Big SQL are 3x more productive than those using Spark SQL.

Test Environment:

Both the Big SQL and Spark SQL tests were executed on the same cluster. The cluster had been built with Spark in mind; that is, it had plenty of memory, and SSDs replaced the traditional rotational disks.
A 28-node cluster was used, consisting of:

  • 2 management nodes (Lenovo x3650 M5), collocated with data nodes
  • 28 data nodes (Lenovo x3650 M5)
  • 2 racks, 20x 2U servers per rack (42U racks)
  • 1 switch, 100GbE, 32 ports, 1U, (Mellanox SN2700)

Each data node consisted of:

  • CPU: 2x E5-2697 v4 @ 2.3GHz (Broadwell) (18c) Passmark: 23,054 / 46,108
  • Memory: 1.5TB per server, 24x 64GB DDR4 2400MHz
  • Flash Storage: 8x 2TB SSD PCIe NVMe (Intel DC P3700), 16TB per server
  • Network: 100GbE adapter (Mellanox ConnectX-5 EN)
  • IO Bandwidth per server: 16GB/s, Network bandwidth 12.5GB/s
  • IO Bandwidth per cluster: 480GB/s, Network bandwidth 375GB/s

Configuration and Tuning:

Configuration and tuning of the products were undertaken by two different teams: the Big SQL performance team and the Spark SQL performance team, both here at IBM.

The Spark team used the newly released version 2.1 of Spark SQL. Tuning for Spark was in line with the rules and regulations set out in the Hadoop-DS specification, with one key exception: the number-of-executors property passed to the Spark SQL shell was tuned differently for the single-stream and 4-stream runs. Each of the streams in the 4-stream run was passed an executor value which was one quarter of that used for the single-stream run. This prevented any one stream in the 4-stream run from grabbing all the execution slots on the cluster, which would essentially result in each query being executed serially. Although this is a simple, and some might say logical, piece of tuning, it is practically impossible to do this efficiently on a production cluster in which the number of concurrently executing queries will not be known. However, the team decided that without this piece of tuning, a fair comparison against Big SQL would not be possible.

Figure 1: Logical Big SQL Workers
The Big SQL team used a Technology Preview of Big SQL version 4.3. The Technology Preview contains support for a new feature in which each physical node can host multiple Big SQL worker processes (previously, only a single Big SQL worker process was allowed on each physical node). Multiple Big SQL workers on a single physical node provide greater parallelization of operations in a Big SQL environment, and hence improved performance. Given the large amount of memory and CPU resources on the machines in the test cluster, the team configured each physical node to contain 12 Big SQL workers, as depicted in Figure 1.

Unlike Spark SQL, Big SQL does not require fine tuning based on the number of concurrent query streams, so the same configuration was used for both the single-stream and 4-stream runs. This is possible because Big SQL has mature autonomics, including:

  1. A Self-Tuning Memory Manager (STMM), which dynamically manages memory allocation between consumers based on query concurrency
  2. A sophisticated Workload Manager (WLM), which provides control over workload priorities, helps maintain consistent response times, and enforces SLAs (a sketch of this kind of WLM definition follows this list)
  3. Automatic adjustment of the level of parallelism at query execution time to account for system activity
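
As an illustration of point 2, the following is a minimal, hypothetical sketch of the Db2-style workload management DDL that Big SQL inherits. The object names, the application name, and the 10-hour limit are invented for this example (they are not the benchmark settings), and the exact clauses can vary by release.

    -- Hypothetical sketch only: Db2-style WLM objects as surfaced in Big SQL.
    -- Names (ADHOC_SC, ADHOC_WL, RUNAWAY_GUARD) and limits are illustrative.

    -- A service class to group the benchmark query streams
    CREATE SERVICE CLASS adhoc_sc;

    -- Route connections from a given application into that service class
    CREATE WORKLOAD adhoc_wl
           APPLNAME('hadoop-ds-stream')
           SERVICE CLASS adhoc_sc;

    -- Stop any activity in that service class which runs longer than 10 hours
    CREATE THRESHOLD runaway_guard
           FOR SERVICE CLASS adhoc_sc ACTIVITIES
           ENFORCEMENT DATABASE
           WHEN ACTIVITYTOTALTIME > 10 HOURS
           STOP EXECUTION;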

These are all key features which take the pain out of tuning Big SQL – either in a benchmark or a real-life environment – and for many years now, organizations have deemed these features essential in relational databases in order to support production environments. So why are these features found in Big SQL and not in most other SQL over Hadoop solutions? Put simply, it’s because IBM built the core of Big SQL around its existing relational database technology (which has been running in production systems for 25+ years). As such, Big SQL has a hardened, mature, and very efficient SQL optimizer and runtime, along with a set of features to support production-level deployments – these are the key to Big SQL’s world-class scalability and performance.

For this benchmark, the majority of the Big SQL tuning effort centered around multiple Big SQL workers co-existing on a single node (this was the first time this feature had been road-tested at this scale). The feature is available in the Big SQL v4.3 Technology Preview, and logical Big SQL workers can be added manually after install through Ambari. The Big SQL engineering team is using the experience gained from this exercise to improve the autonomics and set meaningful out-of-the-box defaults for the Big SQL v4.3 release, so Big SQL customers will not have to do any of this tuning themselves. The Spark SQL team’s experiences are being used to create a set of best practices. Until Spark SQL has a mature set of self-tuning and workload management features, these best practices will have to be applied manually.

Query Execution:

Figure 2: Big SQL Queries Across Different Scale Factors
Using 100TB of data, all 99 queries originating from the TPC-DS workload executed successfully in 4 concurrent query streams using Big SQL v4.3 (making a total of 396 queries in all). On the first run-through, three Big SQL queries had longer-than-expected execution times. These queries were tuned using Statistical Views and Column Group Statistics. These unique capabilities are invaluable to Big SQL customers, allowing them to collect detailed information about the complex relationships which often exist in real-life data, which in turn results in improved performance.
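
To make this concrete, the following is a hedged sketch of what a statistical view and a column group statistic look like in Big SQL. The view definition and the column pairing are illustrative (they use TPC-DS table and column names, but they are not the exact statements used to tune the three queries), and ANALYZE options can vary by release.

    -- Sketch only: a statistical view over the store_sales/date_dim join,
    -- letting the optimizer see the real correlation between the two tables.
    CREATE VIEW tpcds.sv_store_sales_date AS
        SELECT d.d_year, d.d_moy
        FROM   tpcds.store_sales s,
               tpcds.date_dim d
        WHERE  s.ss_sold_date_sk = d.d_date_sk;

    -- Make the view available to the optimizer, then collect statistics on it
    ALTER VIEW tpcds.sv_store_sales_date ENABLE QUERY OPTIMIZATION;
    ANALYZE TABLE tpcds.sv_store_sales_date COMPUTE STATISTICS FOR ALL COLUMNS;

    -- Column group statistics: record that these columns are correlated
    -- rather than statistically independent.
    ANALYZE TABLE tpcds.date_dim
        COMPUTE STATISTICS FOR COLUMNS (d_year, d_moy);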

Figure 3: Spark SQL Queries Across Different Scale Factors
Figure 4: Classification of Spark SQL Query Failures
Although Spark SQL v2.1 can execute all 99 queries successfully at 1GB and 1TB (and has been able to do so since v2.0), two queries failed at 10TB, and there were significantly more failures at 100TB. After a reasonable amount of effort spent tuning Spark (by Spark engineers, not Big SQL engineers), a total of 83 queries could be successfully executed across the 4 streams at 100TB. To ensure an apples-to-apples comparison, this set of 83 queries was used as the basis for comparing Big SQL and Spark SQL performance.

Figure 5: Big SQL and Spark SQL Query Breakdown at 100TB
The Spark failures can be categorized into two main groups: 1) queries not completing in a reasonable amount of time (less than 10 hours), and 2) runtime failures. In addition, almost half (7) of the Spark SQL queries which fail at 100TB are complex in nature. This combination of facts indicates that Spark SQL struggles with more complex queries at larger scale factors, which should come as no real surprise since Spark is a relatively new technology which lacks the maturity of a modern RDBMS. These findings are also in line with the original Hadoop-DS work in 2014, where both Hive and Impala struggled with complex queries (albeit at 10TB), but Big SQL was able to successfully execute all 99 queries, in 4 streams, at 30TB.

A major inhibitor for Data Professionals, or any user for that matter, is wasting valuable time re-writing a valid query into a form which the SQL interpreter can understand, and one which the execution engine can successfully complete. The figures from this study indicate that a Data Professional using Spark SQL will spend valuable time re-writing, or tuning, 1 or 2 in every 10 queries (16% to be exact). Unfortunately, this is often the case with immature SQL engines, and particularly with those in the SQL over Hadoop space. The problem was first highlighted with Hive and Impala in the original Hadoop-DS evaluation at 10TB, and remains true today with Spark SQL. It’s also worth noting that with many SQL over Hadoop engines, successfully executing queries at lower scale factors is no guarantee of success with real big data.

Building a 100TB database

Make no mistake, at 100TB this is a BIG DATA workload – the final database contains more than half a trillion rows:

Figure 6: 100TB Hadoop-DS Table Row Counts
The large fact tables were partitioned to take advantage of the new ‘partition elimination through join keys’ feature available from Big SQL v4.2 onwards. All sales tables were partitioned on their respective ‘*_SOLD_DATE_SK’ column and returns tables on their respective ‘*_RETURN_DATE_SK’ column. Partitioning in this manner is compliant with the original TPC-DS specification and allows partition elimination during query execution – which can greatly improve the speed and efficiency of queries.
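
For reference, the Big SQL definition of one of the partitioned fact tables looks roughly like the sketch below. The column list is abbreviated to a handful of the TPC-DS columns, and the statement is illustrative rather than the exact benchmark DDL.

    -- Sketch of a partitioned, Parquet-backed fact table in Big SQL
    -- (column list abbreviated; the real table has many more columns).
    CREATE HADOOP TABLE tpcds.store_sales
    (
        ss_sold_time_sk     INT,
        ss_item_sk          INT,
        ss_customer_sk      INT,
        ss_quantity         INT,
        ss_sales_price      DECIMAL(7,2),
        ss_net_paid         DECIMAL(7,2)
    )
    PARTITIONED BY (ss_sold_date_sk INT)
    STORED AS PARQUET;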

Figure 7: 100TB Database Build Times (secs)
The database build consists of two distinct operations: a load phase and a statistics-gathering phase. The load phase takes the data from the raw text generated by the data generator and converts it to the Parquet storage format. Parquet is the best-performing storage format for querying in both Big SQL and Spark SQL, and was an ideal choice for these tests. The load phase is common to both Big SQL and Spark SQL and took just over 39 hours to complete.
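
As a hedged sketch of what the load phase involves in Big SQL (the staging table, schema names, and HDFS path below are illustrative, not the benchmark’s actual objects; Spark SQL follows an equivalent pattern with its own DDL), the raw delimited files can be exposed as a text table and rewritten into the partitioned Parquet table with an INSERT ... SELECT:

    -- Illustrative only: expose the generated flat files as a delimited table ...
    CREATE HADOOP TABLE stage.store_sales_txt
    (
        ss_sold_time_sk     INT,
        ss_item_sk          INT,
        ss_customer_sk      INT,
        ss_quantity         INT,
        ss_sales_price      DECIMAL(7,2),
        ss_net_paid         DECIMAL(7,2),
        ss_sold_date_sk     INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
    STORED AS TEXTFILE
    LOCATION '/benchmark/raw/store_sales';          -- hypothetical HDFS path

    -- ... then rewrite it into the partitioned Parquet table in one pass.
    INSERT INTO tpcds.store_sales
    SELECT ss_sold_time_sk, ss_item_sk, ss_customer_sk,
           ss_quantity, ss_sales_price, ss_net_paid,
           ss_sold_date_sk                          -- partitioning column last
    FROM   stage.store_sales_txt;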

The majority of the 39 hours was spent loading the NULL partition (SS_SOLD_DATE_SK=NULL) of the STORE_SALES table. It turns out the STORE_SALES table is heavily skewed on the null partition (which is approximately 570GB in size, compared to other partitions which are approximately 20GB). Since queries are executed repeatedly, and the database build is a one-time operation, neither team dedicated resources to improving performance of the load. Heavy skew in real-life data sets is common, though, and the heavy skew on the chosen partitioning column in this data set presented challenges when executing some of the more complex queries. The engineering team worked to improve the Big SQL Scheduler to address these challenges, and these improvements will be available in Big SQL v4.3.

Gathering statistics allows the Big SQL cost-based optimizer to make informed decisions about the most efficient access strategy, and is the key reason Big SQL can execute all 99 TPC-DS queries with class-leading performance. Currently, Spark SQL does not require the same level of statistics to be gathered because its optimizer is not sufficiently mature to make use of such detailed statistics. However, as the Spark optimizer matures, its reliance on detailed statistics will increase, as will the time to gather those statistics, but query elapsed times should also improve as a consequence.
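
For context, the statements below are a hedged sketch (TPC-DS table names; exact options may differ by release) of how the two engines gather statistics: Big SQL’s ANALYZE feeds detailed table and column statistics to its cost-based optimizer, while Spark SQL 2.1 collects only coarse table-level statistics.

    -- Big SQL: table plus per-column statistics for the cost-based optimizer
    ANALYZE TABLE tpcds.store_sales COMPUTE STATISTICS FOR ALL COLUMNS;

    -- Spark SQL 2.1: coarse table-level statistics (size and row count);
    -- richer column-level statistics only pay off as its optimizer matures.
    ANALYZE TABLE store_sales COMPUTE STATISTICS;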

Comparing Big SQL and Spark SQL Performance:

In real life, a single user does not have sole use of an organization’s Hadoop cluster. Consequently, we focused on comparing the performance of Big SQL and Spark SQL across 4 concurrent query streams. To make an apples-to-apples comparison, the resulting workload included only those 83 queries which Spark SQL completed successfully in a reasonable amount of time (less than 10 hours). Removing the queries which Spark SQL could not complete in 10 hours biases the results in favor of Spark, since keeping the four queries which timed out would almost double the Spark SQL execution time.

Figure 8 compares the elapsed time to complete the 83 queries across all 4 streams (a total of 332 queries) in both Spark SQL and Big SQL, as well as the elapsed time for Big SQL to complete all 99 queries across 4 streams (a total of 396 queries). Spark SQL took just over 43 hours to complete the test, whereas Big SQL completed the same number of queries in just over 13.5 hours – making Big SQL 3.2x faster than Spark SQL.

Figure 8: Workload Elapsed Times at 100TB for 4 Query Streams

In fact, Big SQL was able to complete a 4-stream workload, with each stream executing the full complement of 99 queries, almost 2x faster than Spark SQL could execute 83 queries per stream. This is all the more impressive when you consider that 7 of the 16 queries Spark SQL is unable to execute are among the most complex queries in the workload.

Although these results are impressive for Big SQL, the performance differential would be even larger if the query set included those queries from the original set of 99 which timed out in Spark SQL after 10 hours.

Putting these results into a real-life context, Data Professionals using Big SQL can be three times more productive than those using Spark SQL.

Figure 9: Average CPU Consumption for 4-streams at 100TB
Analysis of CPU consumption across the cluster shows Big SQL averaged 76.4% and Spark SQL 88.2% for the duration of the workload. Although the percentage of user CPU consumed was approximately the same for Big SQL and Spark SQL, system CPU for Spark SQL was almost 3x higher than for Big SQL. System CPU cycles are undesirable cycles which do no useful work on behalf of your application. Note: I/O wait CPU times for both products were negligible.

Of course, since the Spark SQL tests take 3 times longer to complete, over the course of executing all 332 queries in the workload Spark SQL actually consumed in excess of 3x more CPU cycles and 9x more system CPU cycles than Big SQL – highlighting the greater efficiency which comes with the maturity of the Big SQL optimizer and runtime. Big SQL is not only 3.2x faster than Spark SQL; it also achieves this using far fewer CPU resources.
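
Spelling out that extrapolation with the rounded figures above (a 3.2x longer run at a similar or higher utilization):

\[
\frac{\text{Spark SQL CPU}}{\text{Big SQL CPU}} \approx \frac{88.2}{76.4} \times 3.2 \approx 3.7,
\qquad
\frac{\text{Spark SQL system CPU}}{\text{Big SQL system CPU}} \approx 3 \times 3.2 \approx 9.6
\]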

Figure 10: Average I/O Rates for 4-streams at 100TB (per node)
Big SQL’s efficiency is also highlighted when examining the volume of I/O undertaken during the test. The average read rate for Spark SQL is 3.6x greater than for Big SQL, whilst its average write rate is 9.4x greater. Extrapolating the average I/O rates across the duration of the tests (Big SQL is 3.2x faster than Spark SQL), Spark SQL actually reads almost 12x more data than Big SQL, and writes 30x more data.
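
The same arithmetic applied to the I/O ratios quoted above:

\[
\text{reads: } 3.6 \times 3.2 \approx 11.5 \ (\text{almost } 12\times),
\qquad
\text{writes: } 9.4 \times 3.2 \approx 30\times
\]
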
Figure 11: Max I/O Rates for 4-streams at 100TB (per node)
Interestingly, Big SQL has a higher maximum I/O throughput, indicating that when the need arises, Big SQL can drive the I/O subsystem harder than Spark SQL.

The large difference in volume of data read/written, along with the fact that Big SQL can drive the I/O subsystem harder is a simple, but effective indicator of the greater efficiency within Big SQL.

Mapping this back to our real-life scenario, a Data Professional is not only three times more productive using Big SQL, they also use fewer resources to get the job done, which in turn allows other Data Professionals to run more analysis on the cluster at the same time.

What makes this report any different?

There’s already a proliferation of benchmark reports for SQL over Hadoop solutions; some show solution x is many times faster than y, whilst others show that y is faster than x – but how can both be true? I’m a veteran of many competitive benchmarks and I can honestly say that benchmarks are won and lost by defining the scope of the tests. Each vendor does its utmost to influence the benchmark specification to highlight the strengths of its own product and the weaknesses of its competitors’. This is exactly what is happening in the SQL over Hadoop space today, and the reason organizations are faced with a plethora of conflicting performance information when it comes to these solutions. So, what makes this report different?

  1. The scale factor – after all, we are talking about Big Data! It’s the only report to be published at 100TB. Many other reports published at smaller scale factors don’t even use the whole data set.
  2. It’s comprehensive. It does not cherry pick the queries to highlight the benefit of one solution over the other. In line with the original Hadoop-DS benchmark, IBM compared apples with apples, comparing all queries which work on both products.
  3. The products were tuned by two competing (but friendly) IBM teams, both striving to highlight the benefits of their product. In the end, both teams’ experiences will benefit their respective products.
  4. It sticks to the rules of the Hadoop-DS specification (which is based on the industry-standard TPC-DS benchmark).

Summary:

There is no doubt that Spark SQL has come a long way in a very short period of time. The fact Spark SQL can complete 83 of the 99 queries at 100TB is a testament to the open source community driving this project. However, there is still no substitute for product maturity, and the strength of Big SQL comes from the fact it is built on top of IBM’s database technology. It’s this lineage which raises Big SQL above other vendors in the SQL over Hadoop space.

Figure 12: Author’s view of SQL on Hadoop Marketplace
The author’s view of this space, albeit rather unscientific, is depicted in Figure 12. Big SQL is ahead of the pack of open source SQL over Hadoop solutions chiefly because Big SQL inherited much of the rich functionality (and performance) that comes from IBM’s database heritage. Assuming Spark SQL maintains its current momentum, it may eventually overtake Big SQL for speed, scalability and capability. However, as technologies mature, their rate of improvement slows down – after all, it’s much easier to make a 10x improvement in a product that is one year old versus one that is ten years old, simply because at the early stages of a product’s life cycle the bar is much lower.
To summarize, this report has shown that Big SQL:

  • Can execute all 99 TPC-DS queries at 100TB, in 4 concurrent streams, and
  • Can do so at least 3x faster than Spark SQL, and
  • Uses fewer cluster resources to achieve this class leading performance.

But what does this really mean for organizations considering purchasing a SQL over Hadoop solution? Put simply, it means Big SQL enables their Data Professionals and Business Analysts to be (at least) 3 times more productive. Deep analytics which may take 3 hours to complete with Big SQL will take 9 hours (over a working day) with Spark SQL. During a single day, the Big SQL user could run the analysis 3 times over, with different combinations of input, to model different scenarios. It would take at least 3 working days to complete the same analysis with Spark SQL; maybe even longer, given the higher chance that a query will not complete and the generally longer time required to tune Spark SQL queries.

A YouTube video is also available detailing our experiences.

Additional Information

To get started, download the IBM Open Platform with Apache Hadoop.

Visit the Knowledge Center for installation and configuration information.

Hadoop-DS Benchmark Report for IBM Big SQL 3.0 at 30TB.

SQL over Hadoop: Performance isn’t everything

Top 6 Big SQL Performance Tips
