What is Apache Spark?

You’ve heard of Apache Spark, but can you really explain it? What problems does it solve? How does it solve them? If you want to know the answers to these questions, then this post is for you.

The Problems

A tool is only as useful as the problems it solves, right? So let's talk about the problems Spark solves.

We need answers (and quickly)

With traditional batch processing, waiting for long-running jobs is expected, but in today's enterprise, answers are needed quickly (in "near real time"). Meanwhile, the defining attributes of Big Data (velocity, volume, and variety) keep making it harder to get answers to business questions fast.

So.much.data.

Data sources are numerous and getting more so. From IoT devices, real-time trading, clickstreams, apps, and social media (and on and on), the sources of data continue to grow. All of that data has to be processed so analysts can make sense of it and derive business value from it. It's important to be able to handle vast quantities of data arriving at ever-increasing speeds from ever more sources, and transform it into something you can actually use.

How does A go with B (and C, and D, and…)?

You have all this great data: customer transactions, social media interactions, and geospatial data, to name just a few. Now you need to see how all those dimensions relate to one another. It's important to be able to analyze this graph of data thoroughly so you can figure out which dimensions matter and which are just noise.

We need to know what’s going to happen (and when)

You have all this great historical data. Awesome! Now you need to go through it and see what happened, and why, so you can anticipate what's coming down the pike. It's important to be able to analyze all this data so you can predict what's ahead for your business.

What Apache Spark is Not

It’s common (and often easy) to conflate two or more related technologies that solve a similar set of problems, and use them interchangeably when you shouldn’t. To avoid that mistake with Spark, let’s talk about what it is NOT.

Hadoop

Hadoop is a Big Data framework for distributed file storage and data processing. It uses a technique called MapReduce, which reads data from disk across a massive cluster, transforms it, and writes it back to disk. Spark, on the other hand, uses a Directed Acyclic Graph (DAG) to process data in memory through a set of steps with dependencies among them (Gradle also uses a DAG), and it does not handle file storage itself the way Hadoop does (via the Hadoop Distributed File System, HDFS).
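
To make the DAG idea concrete, here's a minimal Spark sketch in Scala. The input file name is hypothetical, and this is an illustration rather than a recipe. Notice that the transformations only describe the steps; nothing runs until an action like count() triggers the whole graph:

```scala
import org.apache.spark.sql.SparkSession

object DagExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dag-example")
      .master("local[*]") // local mode, just for illustration
      .getOrCreate()

    // Transformations only *describe* steps; Spark records them as a DAG.
    val lines  = spark.sparkContext.textFile("events.log") // hypothetical input
    val errors = lines.filter(_.contains("ERROR"))
    val fields = errors.map(_.split("\t"))

    // Nothing has executed yet. This action triggers the whole DAG,
    // and Spark plans the stages before touching the data.
    println(s"error count: ${fields.count()}")

    spark.stop()
  }
}
```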

MapReduce

It’s easy to confuse Spark Core with MapReduce because both are engines for processing Big Data. MapReduce is fundamentally a single-pass engine: the data is read in, transformed, and written back out to disk. If another transformation is required, the whole cycle is repeated. Spark, on the other hand, performs all of its processing in memory (in multiple iterations if necessary) and uses its DAG to figure out the optimal order of the steps to execute.
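
Here's a hedged illustration of that difference, written as a spark-shell snippet (where the SparkContext sc is predefined); ratings.csv and its layout are made up for the example. In MapReduce, each aggregation below would be its own read-transform-write job; in Spark, both passes reuse the same cached, in-memory dataset:

```scala
// "ratings.csv" is a hypothetical file whose third column is a numeric rating.
val ratings = sc.textFile("ratings.csv")
  .map(_.split(",")(2).toDouble)
  .cache() // keep the parsed values in memory across passes

// Two separate passes over the same in-memory data: no re-reading from disk.
val mean = ratings.sum() / ratings.count()
val max  = ratings.max()
println(s"mean rating: $mean, max rating: $max")
```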

Mutually exclusive with Hadoop

Spark was designed to work with Hadoop, so Hadoop and Spark work very well together. In fact, the Spark download includes Hadoop client libraries for using HDFS (for storage management), and YARN (for resource management and scheduling).
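
As a sketch of what that cooperation looks like in practice: the same Spark code can read straight from HDFS, and the cluster manager is just a launch-time setting. The path and launch command below are placeholders, not prescriptions:

```scala
import org.apache.spark.sql.SparkSession

// The master is chosen at launch time, e.g.: spark-submit --master yarn MyApp.jar
val spark = SparkSession.builder()
  .appName("hdfs-on-yarn-example")
  .getOrCreate()

// Reading directly from HDFS; the path is hypothetical.
val logs = spark.read.textFile("hdfs:///data/clickstream/*.log")
println(s"lines read: ${logs.count()}")
```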

The Solutions

In the opening section, I walked through some of the problems Spark solves. Now I’ll show you how Spark solves them.

We need answers (and quickly)

Near real-time analytics demands performance, period. Because Spark processes data in memory, it’s fast: for some workloads, up to 100 times faster than MapReduce. And Spark’s core library makes it easy to write code that the engine can optimize for the fastest possible results.
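
For a taste of what that looks like, here's a small sketch using Spark's DataFrame API in spark-shell (where the spark session is predefined); the file and column names are invented. The code is declarative, and Spark's Catalyst optimizer rearranges the plan (for example, pushing the filter down ahead of the aggregation) before anything executes:

```scala
import org.apache.spark.sql.functions.col

// "orders.csv" and its columns are hypothetical.
val orders = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("orders.csv")

// A declarative query: Spark optimizes the plan before executing it.
orders.filter(col("amount") > 100)
  .groupBy(col("region"))
  .count()
  .show()
```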

So.much.data.

Possibly the greatest benefit of using Spark is its ability to process live streaming data. Data from trading floors, social media click-streams, and IoT devices must be transformed quickly, before it ever lands on disk. With Hadoop HDFS, the data would need to be written to disk, read back in for MapReduce processing, then written back out to disk before it could be put into an analyst’s hands.

Spark Streaming allows the incoming data to be processed in memory, then written to disk for later augmentation (if necessary) and further analytics.
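
Here's a minimal Spark Streaming application sketch along those lines: word counts over ten-second micro-batches arriving on a local socket, processed in memory and then saved to disk. The host, port, and output path are all placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Count words arriving on a socket in 10-second micro-batches, in memory.
val conf = new SparkConf().setAppName("stream-example").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

val lines  = ssc.socketTextStream("localhost", 9999) // placeholder source
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

// Persist each processed batch to disk for later augmentation and analytics.
counts.saveAsTextFiles("hdfs:///tmp/wordcounts") // hypothetical output path
ssc.start()
ssc.awaitTermination()
```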

How does A go with B (and C, and D, and…)?

Data from multiple sources (transaction data, social media data, and click-streams, to name just a few) often hides correlations that can be teased out: dimensions that appear disparate on the surface actually weave together into new views and insights into your data. But doing that in a meaningful way requires the flexibility to transform the data (and speed doesn’t hurt, right?) to find the right edges.

Spark GraphX has both: the flexibility of multiple algorithms, and the speed to transform and join the data in a number of different ways.
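
To make that concrete, here's a toy GraphX example (a spark-shell snippet, with sc predefined); the users, relationships, and tolerance value are all invented. PageRank is one of the bundled algorithms for surfacing which vertices matter most in a web of relationships:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Vertices are users; edges are (made-up) interactions between them.
val users = sc.parallelize(Seq(
  (1L, "alice"), (2L, "bob"), (3L, "carol")
))
val interactions = sc.parallelize(Seq(
  Edge(1L, 2L, "follows"),
  Edge(2L, 3L, "retweets"),
  Edge(3L, 1L, "mentions")
))

val graph = Graph(users, interactions)

// PageRank scores each vertex by its importance in the graph.
val ranks = graph.pageRank(0.001).vertices
ranks.join(users)
  .map { case (_, (rank, name)) => (name, rank) }
  .collect()
  .foreach(println)
```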

We need to know what’s going to happen (and when)

Having a trove of historical data is an extremely valuable asset when it comes to predicting the future. But predictive analytics requires serious software (and hardware, of course).

Spark’s MLlib is a high-performance (surprised?) machine learning (ML) library that provides a number of tried-and-true algorithms (like classification, regression, and clustering), featurization techniques (like feature transformation and dimensionality reduction), and utilities (like linear algebra and statistics).
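
As a small, hedged example of the spark.ml side of MLlib, here's a spark-shell snippet that clusters customers with k-means on two invented numeric features (the file and column names are hypothetical):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// "customers.csv" and its columns ("recency", "spend") are hypothetical.
val customers = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("customers.csv")

// Assemble the raw numeric columns into a single feature vector.
val features = new VectorAssembler()
  .setInputCols(Array("recency", "spend"))
  .setOutputCol("features")
  .transform(customers)

// Fit a 3-cluster k-means model and inspect the learned cluster centers.
val model = new KMeans().setK(3).setFeaturesCol("features").fit(features)
model.clusterCenters.foreach(println)
```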

Conclusion

In this post, we looked at some of the Big Data problems that arise, what Spark is NOT, and how Spark addresses those problems. Put Spark to work on your next Big Data project!

References

I’ve sprinkled links throughout this post to help you learn more about Apache Spark, but here are a few that are more overview-level. Enjoy!

  • http://www.datamation.com/data-center/hadoop-vs.-spark-the-new-age-of-big-data.html
  • https://mapr.com/blog/5-minute-guide-understanding-significance-apache-spark/
  • https://www.qubole.com/blog/apache-spark-use-cases/
  • http://www.infoworld.com/article/2611326/big-data/graph-analysis-will-make-big-data-even-bigger.html
