In this article, we will focus on the Spark Kernel’s architecture: how we achieve fault tolerance and scalability using Akka, why we chose ZeroMQ with the IPython/Jupyter message protocol, what the layers of functionality are in the kernel, and how the Comm API, an interactive API from IPython, fits in.
Apache Toree acts as a gateway between an application and a Spark cluster.
NOTE: This project, originally called Spark Kernel, is now Apache Toree, an Apache incubator project. For additional details, and to contribute to the project, see the Apache Toree incubator home page.
WEBCAST REPLAY: Watch the replay of the Apache Toree Tech Talk, recorded on September 14, 2016.
Apache Spark is a fast cluster computing framework for large-scale data processing, and it is quickly gaining momentum. Spark runs standalone or in a cluster, in the cloud or on on-premises hardware. If you have data-intensive applications, you’re no doubt already looking into Spark, just as the IBM Emerging Internet Technologies team did when it was investigating applications to analyze streaming and static data.
However, when investigating how to move to Spark, they ran into a problem: how to enable an interactive application against Apache Spark? There were several options for communicating with a Spark cluster, but none provided the necessary flexibility combined with an API that would work for them. The solution? They rolled their own communication backbone. Apache Toree (previously Spark Kernel) acts as the middleman between the application and a Spark cluster.
Check these links for more background:
- Spark Dev: A community for practitioners and designers innovating in Spark
- Video: Spark Kernel meetup
- How-to: Enable interactive applications against Apache Spark
Why should I contribute?
The Spark Kernel team is not soliciting code contributions just yet, but stay tuned for more on this. In addition to gaining experience with Apache Spark, you’ll learn how the project implements the IPython kernel protocol in Scala, and how it uses the Akka JVM concurrency framework and ZeroMQ for messaging.
What technology problem will I help solve?
Apache Toree avoids the friction of repackaging and shipping jars, as required by Spark Submit and current RESTful services. It removes the requirement to store results in an external datastore, acting instead as a proxy between applications and a Spark cluster. Lastly, Spark clusters behind firewalls can expose only the kernel’s ports, allowing applications to communicate with clusters through the kernel alone.
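Because the kernel speaks the IPython/Jupyter message protocol over ZeroMQ, a client talks to it by sending protocol messages rather than shipping jars. As a rough sketch (field values such as the username and the Scala snippet are illustrative, not taken from Toree’s source), an `execute_request` message that would carry a line of Spark code to the kernel’s shell socket looks like this:

```python
import json
import uuid
from datetime import datetime, timezone

def execute_request(code: str, session: str) -> dict:
    """Build a Jupyter-protocol execute_request message body.

    The layout (header / parent_header / metadata / content) follows
    the Jupyter messaging spec; a kernel such as Apache Toree receives
    it on its shell ZeroMQ socket.
    """
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,
            "session": session,
            "username": "demo",               # illustrative value
            "msg_type": "execute_request",
            "version": "5.0",                 # protocol version
            "date": datetime.now(timezone.utc).isoformat(),
        },
        "parent_header": {},                  # empty: not a reply
        "metadata": {},
        "content": {
            "code": code,                     # Scala source for the kernel
            "silent": False,
            "store_history": True,
            "user_expressions": {},
            "allow_stdin": False,
        },
    }

# A request asking the kernel to run a line of Spark code:
msg = execute_request("sc.parallelize(1 to 10).sum()", uuid.uuid4().hex)
print(json.dumps(msg["content"], indent=2))
```

In practice a client library (for example, jupyter_client) handles the ZeroMQ framing and signing; the point here is that the application exchanges small protocol messages with the kernel instead of repackaging and submitting jars.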
How will Apache Toree help my business?
Performance of data-intensive applications can be greatly improved by clustering technology, but not if your application can’t access it. Apache Toree provides a reliable and powerful API into Apache Spark, which allows companies to build interactive applications that take advantage of high-performance Spark clusters. Apache Toree is also designed to be fault tolerant and scalable, maintaining uptime and reliable performance.
Apache Toree blog posts
Spark Kernel was designed to get around limitations that prevented interactive, remote applications from working with Apache Spark.