Spark 1.6.0 was released on Jan 4th, and we took it for a “test drive”. In our performance labs, we tested four workloads with varying data volumes on a bare-metal Hadoop cluster and compared the results to earlier Spark versions using the same workloads. Here is a preview.
- We ran all workloads in `yarn-client` mode. We found that Spark 1.6.0 requires more driver memory than previous versions. For example, the TPCDS SQL workload ran with 8GB of driver memory on 1.5.1 but needed 12GB on Spark 1.6.0 to run without failed stages. The memory footprint of Spark 1.6.0 has increased noticeably, and depending on your application, you may also need to increase executor memory.
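The driver-memory bump described above would be applied on the `spark-submit` command line. This is a hypothetical sketch, not the actual test harness: the jar, class name, and executor-memory value are placeholders.

```shell
# Hypothetical spark-submit invocation illustrating the memory settings
# discussed above. The jar, main class, and executor-memory value are
# placeholders, not the real TPCDS test kit configuration.
# On 1.5.1, 8g of driver memory was enough; 1.6.0 needed 12g here
# to avoid failed stages.
spark-submit \
  --master yarn-client \
  --driver-memory 12g \
  --executor-memory 24g \
  --class com.example.TpcdsRunner \
  tpcds-workload.jar
```

The same settings can also be supplied via `spark.driver.memory` and `spark.executor.memory` in the application's Spark configuration.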
- The TPCDS SQL workload improved in performance by almost 10% compared to 1.4.1. Almost all queries in the workload ran faster, making this a really solid release for Spark SQL. You can evaluate Spark SQL using this test kit on GitHub.
- The K-Means workload degraded by roughly 20%, from 561 seconds on Spark 1.5.1 to 680 seconds on 1.6.0. We noticed higher CPU usage on 1.6.0. K-Means library updates may also affect this workload (i.e., a code change may be required on our side). We are looking into the cause.
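As a quick sanity check, the "roughly 20%" figure follows directly from the two timings above:

```shell
# (680 - 561) / 561 is about 21%, i.e. "roughly 20%".
# Integer shell arithmetic, so the result is truncated to whole percent.
echo $(( (680 - 561) * 100 / 561 ))   # prints 21
```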
More to come
You may ask: where is Spark Streaming? We are currently evaluating how streaming applications and the Parquet data format are affected by Spark 1.6.0. Please stay tuned for more performance results.