IBM has helped integrate all 99 queries derived from the TPC-DS Benchmark (v2) into the existing spark-sql-perf performance test kit developed by Databricks. The 99 queries were generated using the TPC-DS query generator at the 100-GB scale factor. Although not all 99 queries are runnable yet, you can now modify the run lists to include or exclude queries of interest. See this Scala version of the query file for more details.
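As an illustration of trimming a run list down to queries of interest, here is a minimal Scala sketch. The `Query` case class and the query names are hypothetical stand-ins, not the actual spark-sql-perf types; the point is simply the filter-by-name pattern.

```scala
// Hypothetical stand-in for a query entry; spark-sql-perf's own
// representation differs.
case class Query(name: String, sql: String)

// A small sample standing in for the full list of 99 generated queries.
val allQueries = Seq(
  Query("q1", "SELECT ..."),
  Query("q4", "SELECT ..."),
  Query("q19", "SELECT ...")
)

// Exclude queries known not to run; the remainder becomes the run list.
val excluded = Set("q4")
val runList = allQueries.filterNot(q => excluded.contains(q.name))

runList.foreach(q => println(q.name))
```

The same idea applies in reverse: keep an explicit include set and filter with `filter` instead of `filterNot`.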
With the latest spark-sql-perf test kit, you can evaluate and compare the performance of your Spark SQL infrastructure. It supports generating TPC-DS data sets with the TPC-DS data generator, running EXPLAIN, capturing execution times, and writing queries in either the Spark SQL dialect or HiveQL.
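To make the execution-time-capturing feature concrete, here is a minimal Scala sketch of wall-clock timing around a query run. The `timed` helper and the placeholder body are illustrative assumptions, not spark-sql-perf's actual measurement code; in a real run the body would be something like `spark.sql(queryText).collect()`.

```scala
// Measure wall-clock time of an arbitrary block, returning its result
// together with the elapsed milliseconds.
def timed[T](body: => T): (T, Long) = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1000000
  (result, elapsedMs)
}

val (rows, ms) = timed {
  // Placeholder for executing a TPC-DS query, e.g. via Spark SQL.
  Seq(1, 2, 3)
}
println(s"query returned ${rows.size} rows in ${ms} ms")
```

spark-sql-perf records this kind of per-query timing across a run list so results can be compared between configurations.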
Using the Parquet file format, our lab tests successfully ran at least 50 of the 99 queries on Spark 1.5. The failed queries were mainly caused by parsing errors, out-of-memory conditions, and hangs during execution in the Spark engine. These are areas the community is diligently working to fix, and you are welcome to contribute.
Thanks to Kenneth Chen (IBM) for developing Spark SQL and HiveQL variants of the TPC-DS queries. Happy testing!
Disclaimer: The spark-sql-perf workload is derived from the TPC-DS Benchmark and as such is not comparable to published TPC-DS Benchmark results.