IBM has helped integrate all 99 queries, derived from the TPC-DS Benchmark (v2), into the existing spark-sql-perf performance test kit developed by Databricks. The 99 queries were generated using the TPC-DS query generator and are based on the 100-GB scale factor. Although not all 99 queries are “runnable”, you can now edit the run lists to include or exclude queries of interest. See this Scala version of the query file for more details.
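
To make the include/exclude idea concrete, here is a minimal sketch in plain Python. It is not the kit's Scala API: the helper `build_run_list` and the query names `q1` through `q99` are hypothetical placeholders, and the two excluded queries are chosen arbitrarily for illustration.

```python
def build_run_list(all_queries, include=None, exclude=()):
    """Return the queries to run, preserving the order of all_queries.

    Hypothetical helper, not part of spark-sql-perf.
    include: if given, only these queries are kept.
    exclude: queries to drop (for example, ones known to fail or hang).
    """
    include_set = None if include is None else set(include)
    exclude_set = set(exclude)
    return [
        q for q in all_queries
        if (include_set is None or q in include_set) and q not in exclude_set
    ]


# Placeholder names; the kit's actual query names may differ.
ALL_QUERIES = [f"q{i}" for i in range(1, 100)]

# Skip two queries (arbitrary choice for illustration).
run_list = build_run_list(ALL_QUERIES, exclude=["q14", "q97"])
print(len(run_list))  # 97
```

Keeping the full list in one place and deriving each run list from it makes it easy to re-enable a query once the underlying issue is fixed.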

With the latest spark-sql-perf test kit, you can evaluate and compare the performance of your Spark SQL infrastructure. The kit supports generating TPC-DS data sets with the TPC-DS data generator, running EXPLAIN, and capturing execution times. It accepts both the Spark SQL dialect and HiveQL, though HiveQL is recommended for most use cases.

Using the Parquet file format, our lab tests ran at least 50 of the 99 queries successfully on Spark 1.5. The failures were caused mainly by parsing errors, out-of-memory errors, and hangs during execution in the Spark engine. The community is working diligently to fix these issues, and you are welcome to contribute.

Thanks to Kenneth Chen (IBM) for developing Spark SQL and HiveQL variants of the TPC-DS queries. Happy testing!

Disclaimer: The spark-sql-perf workload is derived from the TPC-DS Benchmark and as such is not comparable to published TPC-DS Benchmark results.

1 comment on "99 TPC-DS Queries Integrated Into spark-sql-perf"

  1. Hi Jesse,

    Thanks for sharing the information, very helpful indeed.
    I have a quick question regarding “at least 50 out of the 99 queries successfully on Spark 1.5”.

    Do you happen to have information about which 50 queries ran successfully on Spark 1.5?

    We’re using the TPC-DS benchmark for some engineering work (one task is to compare performance on Power8 between Spark 1.5 and Spark 1.6). For query97, I can see via the Spark UI that the query runs to completion (i.e. with no retry or error/exception) on both Spark versions, but the stage shuffle amounts are very different. Moreover, the query results differ: on Spark 1.5 it returns null, null, null, while on Spark 1.6 it returns 1619860476 857297755 253306. Both runs were on 3 TB.

    I'm not sure whether the 50 queries mentioned above are good in terms of both successful runs and correct results.

    Would be a great help for the project if we can find the information on that.

    Thanks for any hint …

