
Data Science

Explore Spark SQL and its performance using TPC-DS workload


Summary

This developer pattern demonstrates how to evaluate and test your Apache Spark cluster using TPC Benchmark DS (TPC-DS) workloads. Two modes of execution are described: using an interactive command-line shell script and using a Jupyter Notebook running in IBM Watson Studio.

Description

Apache Spark is a popular distributed data processing engine built for speed, ease of use, and sophisticated analytics, with APIs in the Java™ programming language, Scala, Python, R, and SQL. Like other data processing engines, Spark has a unified query optimizer that determines an efficient execution plan for a workload, with the main goal of reducing disk I/O and CPU usage.
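
For example, you can inspect the plan that Spark's optimizer produces for a SQL query with `explain()`. The following is a minimal PySpark sketch using a toy table; the table and column names are illustrative and are not part of the pattern:

```python
from pyspark.sql import SparkSession

# Start a local Spark session (any Spark 2.x or later session works)
spark = SparkSession.builder.appName("optimizer-demo").master("local[*]").getOrCreate()

# A small illustrative DataFrame standing in for a warehouse fact table
sales = spark.createDataFrame(
    [(1, "2023-01-01", 10.0), (2, "2023-01-02", 25.0)],
    ["item_id", "sale_date", "amount"],
)
sales.createOrReplaceTempView("store_sales")

# Print the parsed, analyzed, optimized logical plans and the physical plan
spark.sql(
    "SELECT item_id, SUM(amount) AS total FROM store_sales GROUP BY item_id"
).explain(True)
```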

We can evaluate and measure the performance of Spark SQL using the TPC-DS benchmark. TPC-DS is a widely used, industry-standard decision-support benchmark for evaluating the performance of data processing engines. Because TPC-DS exercises key data warehouse features, running it successfully is a good indicator of how ready Spark is to serve a data warehouse application. Apache Spark 2.0 supports all 99 decision-support queries that are part of the TPC-DS benchmark.
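
To make this concrete, once the TPC-DS tables are registered in Spark, any of the 99 queries can be submitted through the SparkSession SQL interface. The sketch below is only an illustration: the `tpcds` database name and the `queries/query01.sql` path are assumptions about a local setup, not files or names defined by the pattern:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tpcds-query").enableHiveSupport().getOrCreate()

# Assumes the TPC-DS tables (store_sales, date_dim, item, and so on) have
# already been created in a database named "tpcds" (adjust to your setup)
spark.sql("USE tpcds")

# Read one of the 99 generated query files; the path is illustrative
with open("queries/query01.sql") as f:
    query_text = f.read()

# A TPC-DS query file may contain several statements separated by semicolons
for stmt in (s.strip() for s in query_text.split(";")):
    if stmt:
        spark.sql(stmt).show(20)
```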

There are several other good reasons to run TPC-DS on your Spark installation:

  • Perform a sanity check to rule out configuration or installation issues.
  • Compare Spark against other candidate engine solutions.
  • Run before/after tests to verify performance gains when upgrading.

This pattern is aimed at helping Spark developers quickly set up and run the TPC-DS benchmark in their own development setups.

When you have completed this pattern, you will understand:

  • How to set up the TPC-DS toolkit.
  • How to generate TPC-DS datasets at different scale factors.
  • How to create Spark database artifacts (see the loading sketch after this list).
  • How to run TPC-DS benchmark queries on Spark in local mode and see the results.
  • Considerations when increasing the data scale and running against a Spark cluster.
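
As an illustration of the data-generation and table-creation steps, after the toolkit's data generator (`dsdgen`) has produced pipe-delimited `.dat` files, a table can be registered in Spark roughly as follows. This is a sketch only: the file path, database name, and the deliberately truncated schema are assumptions, and the pattern's scripts automate this for all of the TPC-DS tables:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("tpcds-load").enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS tpcds")

# Illustrative (partial) schema for one TPC-DS dimension table
date_dim_schema = StructType([
    StructField("d_date_sk", IntegerType()),
    StructField("d_date_id", StringType()),
    StructField("d_date", StringType()),
    # ... the remaining date_dim columns would be listed here
])

# dsdgen writes pipe-delimited text files; the path is an assumption
(spark.read
     .option("delimiter", "|")
     .schema(date_dim_schema)
     .csv("/tmp/tpcds-data/date_dim.dat")
     .write
     .mode("overwrite")
     .saveAsTable("tpcds.date_dim"))
```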

Flow


  1. Compile the toolkit and use it to generate the TPC-DS dataset.
  2. Create the Spark tables and generate the TPC-DS queries.
  3. Run the entire query set or a subset of queries and monitor the results.
  4. Create the Spark tables with the pre-generated dataset.
  5. Run the entire query set or an individual query.
  6. View the query results or performance summary (see the timing sketch after this list).
  7. View the performance graph.

Instructions

Ready to put this code pattern to use? Complete details on how to get started running and using this application are in the README.