Apache Spark is a popular distributed data processing engine built around speed, ease of use, and sophisticated analytics, with Java™, Scala, Python, R, and SQL APIs. Like other data processing engines, Spark includes a unified optimizer that computes an efficient way to execute a workload, aiming to reduce disk I/O and CPU usage.
TPC-DS is a widely used industry-standard decision support benchmark for evaluating the performance of data processing engines. Because TPC-DS exercises key data warehouse features, running it successfully reflects Spark's readiness to address the needs of a data warehouse application. Apache Spark 2.0+ supports all 99 decision support queries that make up the TPC-DS benchmark.
The new developer pattern, Explore Spark SQL and its performance using TPC-DS workload, demonstrates the steps required to quickly set up and run TPC-DS workloads against either a local Spark environment or Spark as a service through the IBM Data Science Experience (DSX). Additionally, the pattern provides demos, code, and instructions for running the same workload at larger data scales. The pattern is available in two forms: a command-line script and a Jupyter Notebook in DSX.