Explore Spark SQL and its performance using the TPC-DS workload
Learn how to set up and run the TPC-DS benchmark to evaluate and measure the performance of your Spark SQL system
This developer pattern demonstrates how to evaluate and test your Apache Spark cluster using TPC Benchmark DS (TPC-DS) workloads. Two modes of execution are described: using an interactive command-line shell script and using a Jupyter Notebook running in IBM Watson Studio.
Apache Spark is a popular distributed data processing engine built for speed, ease of use, and sophisticated analytics, with APIs in the Java™ programming language, Scala, Python, R, and SQL. Like other data processing engines, Spark has a unified optimization engine that computes the optimal way to execute a workload, with the primary goal of reducing disk I/O and CPU usage.
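You can see that optimization engine at work by asking Spark to print the plans it derives for a query. The following is a minimal PySpark sketch; the tiny table and its columns are made up purely for illustration:

```python
from pyspark.sql import SparkSession

# Start a local Spark session.
spark = SparkSession.builder.appName("explain-demo").master("local[*]").getOrCreate()

# A tiny made-up table, registered as a temporary view for SQL access.
spark.createDataFrame(
    [(1, "a", 10.0), (2, "b", 20.0)],
    ["id", "category", "amount"],
).createOrReplaceTempView("sales")

# explain(True) prints the parsed, analyzed, optimized, and physical plans,
# showing how Spark's optimizer rewrites the query before execution.
spark.sql("SELECT category, SUM(amount) AS total FROM sales GROUP BY category").explain(True)
```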
We can evaluate and measure the performance of Spark SQL using the TPC-DS benchmark, a widely used, industry-standard decision-support benchmark for data processing engines. Because TPC-DS exercises key data warehouse features, running it successfully is a good indicator of how ready Spark is to support a data warehouse application. Apache Spark 2.0 supports all 99 decision-support queries that make up the TPC-DS benchmark.
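For a concrete sense of what running one of those 99 queries looks like, here is a minimal PySpark sketch. It assumes the TPC-DS tables already exist in a database named `tpcds` and that the generated, single-statement query text sits at a hypothetical local path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-query").getOrCreate()

# Assumes the TPC-DS tables were already created in a database named "tpcds".
spark.sql("USE tpcds")

# Hypothetical path to one generated, single-statement TPC-DS query.
with open("queries/query96.sql") as f:
    query = f.read().strip().rstrip(";")

spark.sql(query).show()
```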
There are several other good reasons to run TPC-DS on your Spark installation:
- As a sanity check, to make sure there are no configuration or installation issues.
- To compare Spark against other candidate engines.
- To run before/after tests that verify performance gains when upgrading.
This pattern is aimed at helping Spark developers quickly set up and run the TPC-DS benchmark in their own development setups.
When you have completed this pattern, you will understand:
- How to set up the TPC-DS toolkit.
- How to generate TPC-DS datasets at different scale factors.
- How to create Spark database artifacts.
- How to run TPC-DS benchmark queries on Spark in local mode and see the results (a combined sketch follows this list).
- Considerations when increasing the data scale and running against a Spark cluster.
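The following is a combined sketch of those basic steps, assuming the toolkit's `dsdgen` binary has already been compiled (`-SCALE` and `-DIR` are standard `dsdgen` options; all paths and the database name are placeholders):

```python
import subprocess
from pyspark.sql import SparkSession

# 1. Generate a scale-factor-1 (~1 GB) dataset. dsdgen writes one
#    pipe-delimited .dat file per table; -SCALE and -DIR are standard
#    dsdgen options. The toolkit location below is a placeholder.
subprocess.run(
    ["./dsdgen", "-SCALE", "1", "-DIR", "/tmp/tpcds-data"],
    cwd="/opt/tpcds-kit/tools",
    check=True,
)

# 2. Create a database and expose one generated file as a Spark table.
#    (The real pattern defines explicit, typed schemas for all 24 tables.)
spark = SparkSession.builder.appName("tpcds-setup").master("local[*]").getOrCreate()
spark.sql("CREATE DATABASE IF NOT EXISTS tpcds")
spark.sql("USE tpcds")
spark.sql("""
    CREATE TABLE IF NOT EXISTS store_sales
    USING csv
    OPTIONS (path '/tmp/tpcds-data/store_sales.dat', sep '|')
""")

# 3. Sanity-check the load before running any benchmark queries.
print(spark.table("store_sales").count())
```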
When you use the interactive command-line shell script, the flow is:

1. Compile the TPC-DS toolkit and use it to generate the dataset.
2. Create the Spark tables and generate the TPC-DS queries.
3. Run the entire query set or a subset of queries and monitor the results (a timing sketch follows this list).
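For step 3, one simple way to monitor results outside of the provided scripts is to time each query yourself. A sketch, with hypothetical query-file paths:

```python
import glob
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-run").getOrCreate()
spark.sql("USE tpcds")

# Time every generated query file (hypothetical directory layout).
timings = {}
for path in sorted(glob.glob("queries/query*.sql")):
    text = open(path).read().strip().rstrip(";")
    start = time.time()
    spark.sql(text).collect()  # collect() forces the query to run to completion
    timings[path] = time.time() - start

# Print a simple summary, slowest queries first.
for path, elapsed in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{path}: {elapsed:.1f} s")
```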
When you use the Jupyter Notebook in IBM Watson Studio, the flow is:

1. Create the Spark tables with the pre-generated dataset.
2. Run the entire query set or an individual query.
3. View the query results or performance summary.
4. View the performance graph (a plotting sketch follows this list).
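A performance graph like the notebook's can be reproduced from such timings with ordinary plotting code. A sketch using matplotlib, with clearly illustrative (not measured) values:

```python
import matplotlib.pyplot as plt

# Illustrative placeholder timings (seconds), NOT measured results;
# in practice, reuse the timings dictionary from the previous sketch.
timings = {"query01": 4.2, "query02": 7.9, "query03": 2.1}

plt.figure(figsize=(8, 3))
plt.bar(list(timings.keys()), list(timings.values()))
plt.ylabel("Elapsed time (s)")
plt.title("TPC-DS query run times")
plt.tight_layout()
plt.show()
```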
Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.