Apache Spark is a popular distributed data processing engine built for speed, ease of use, and sophisticated analytics, with APIs in Java, Scala, Python, R, and SQL. Like other modern data processing engines, Spark SQL includes an optimizer (Catalyst) that computes an efficient execution plan for a workload, with the aim of reducing disk I/O and CPU usage.

TPC-DS is a widely used industry-standard decision support benchmark for evaluating the performance of data processing engines. Because TPC-DS exercises key data warehouse features, running it successfully reflects Spark's readiness to address the needs of a data warehouse application. Apache Spark 2.0 and later supports all 99 decision support queries that make up the TPC-DS benchmark.

The new developer pattern, Explore Spark SQL and its performance using TPC-DS workload, demonstrates the steps required to quickly set up and run TPC-DS workloads against either a local Spark environment or Spark as a service through the IBM Data Science Experience (DSX). Additionally, the pattern provides demos, code, and instructions for running the same workload at larger data scales. The pattern is available in two forms: a command-line script and a Jupyter Notebook in DSX.
