by Rich Hagarty, Dilip Biswal | Updated March 28, 2019 - Published September 30, 2017
Tags: API Management, Data science, Databases, Java, Microservices
Archived date: 2019-06-04
This developer pattern demonstrates how to evaluate and test your Apache Spark cluster using TPC Benchmark DS (TPC-DS) workloads. Two modes of execution are described: an interactive command-line shell script and a Jupyter Notebook running in IBM Watson Studio.
Apache Spark is a popular distributed data processing engine built for speed, ease of use, and sophisticated analytics, with APIs in the Java™ programming language, Scala, Python, R, and SQL. Like other data processing engines, Spark has a unified query optimizer that computes an efficient way to execute a workload, with the main purpose of reducing disk I/O and CPU usage.
We can evaluate and measure the performance of Spark SQL using the TPC-DS benchmark. TPC-DS is a widely used, industry-standard decision support benchmark for data processing engines. Because TPC-DS exercises key data warehouse features, running it successfully reflects Spark's readiness to serve as the engine for a data warehouse application. Apache Spark 2.0 supports all 99 decision support queries that make up the TPC-DS benchmark.
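At its core, a TPC-DS run against Spark SQL amounts to looping over the 99 query files, submitting each to the engine, and recording elapsed time. The sketch below illustrates that driver loop using only the Python standard library; the function names (`run_tpcds_queries`, `execute_sql`) are hypothetical, and the actual Spark call is stubbed out (in PySpark it would be something like `spark.sql(sql).collect()`).

```python
import time
from pathlib import Path

def run_tpcds_queries(query_dir, execute_sql):
    """Run every .sql file in query_dir through execute_sql and
    return a list of (query_name, elapsed_seconds) tuples.

    execute_sql is any callable that takes a SQL string; with PySpark
    it could be: lambda sql: spark.sql(sql).collect()
    """
    results = []
    for query_file in sorted(Path(query_dir).glob("*.sql")):
        sql = query_file.read_text()
        start = time.perf_counter()
        execute_sql(sql)  # submit the query to the engine
        elapsed = time.perf_counter() - start
        results.append((query_file.stem, elapsed))
    return results
```

The pattern's own scripts and notebook handle data generation, table creation, and reporting on top of a loop like this; this sketch only shows the timing skeleton.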
There are many other good reasons to run TPC-DS on your Spark installation.
This pattern is aimed at helping Spark developers quickly set up and run the TPC-DS benchmark in their own development setups.
When you have completed this pattern, you will understand how to set up and run TPC-DS workloads against your own Spark cluster.
Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.