Apache Toree: Graduated IBM open source project

NOTE: This project, originally called Spark Kernel, is now Apache Toree, an Apache Incubator project. For additional details, and to contribute to the project, see the Apache Toree incubator home page.

Apache Spark is a fast cluster-computing engine for large-scale data processing, and it is quickly gaining momentum. Spark runs standalone or in a cluster, in the cloud or on on-premises hardware. If you have data-intensive applications, you’re no doubt already looking into Spark, just as the IBM Emerging Internet Technologies team did when it was investigating applications to analyze streaming and static data.

However, when investigating how to move to Spark, the team ran into a problem: how do you enable an interactive application to work against Apache Spark? There were several options for communicating with a Spark cluster, but none combined the necessary flexibility with an API that would work for them. The solution? They rolled their own communication backbone. Apache Toree (previously Spark Kernel) acts as the middleman between the application and a Spark cluster.

Why should I contribute?

The Apache Toree (formerly Spark Kernel) team is not soliciting code contributions just yet, but stay tuned for more on this. In addition to gaining experience with Apache Spark, you’ll learn how the project implements the IPython kernel protocol in Scala, and how it uses the Akka JVM concurrency framework and ZeroMQ for messaging.

What technology problem will I help solve?

Apache Toree avoids the friction of repackaging and shipping jars, as is required with Spark Submit and current RESTful services. It removes the requirement to store results in an external datastore. It acts as a proxy between applications and a Spark cluster. Lastly, a Spark cluster behind a firewall needs to expose only the kernel's ports, so applications can communicate with the cluster through the kernel.
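
To make the proxy idea more concrete, here is a minimal sketch of what an application might hand to the kernel instead of a packaged jar: a signed execute_request message carrying a snippet of code. It assumes the kernel speaks version 5 of the IPython/Jupyter messaging protocol with HMAC-SHA256 signing; the session ID, signing key, client name, and the Scala snippet are all hypothetical values chosen for illustration. Actually sending the frames requires a ZeroMQ socket connected to the kernel's shell port, which is left out here so the sketch needs only the Python standard library.

```python
import hashlib
import hmac
import json
import uuid
from datetime import datetime, timezone

# Frame that separates ZeroMQ routing identities from the message itself.
DELIMITER = b"<IDS|MSG>"


def sign(parts, key):
    """Return the hex HMAC-SHA256 digest over the serialized message parts."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    for part in parts:
        mac.update(part)
    return mac.hexdigest().encode("ascii")


def build_execute_request(code, session_id, key):
    """Build the byte frames of a signed execute_request message.

    Frame order: delimiter, signature, header, parent_header,
    metadata, content.
    """
    header = {
        "msg_id": uuid.uuid4().hex,
        "session": session_id,
        "username": "example-app",  # hypothetical client name
        "date": datetime.now(timezone.utc).isoformat(),
        "msg_type": "execute_request",
        "version": "5.0",
    }
    parent_header = {}  # empty: this message is not a reply to anything
    metadata = {}
    content = {
        "code": code,  # source text the kernel should evaluate
        "silent": False,
        "store_history": True,
        "user_expressions": {},
        "allow_stdin": False,
    }
    parts = [
        json.dumps(section).encode("utf-8")
        for section in (header, parent_header, metadata, content)
    ]
    return [DELIMITER, sign(parts, key)] + parts


# Hypothetical usage: the key would normally come from the kernel's
# connection file, and the code string is an illustrative Spark snippet.
frames = build_execute_request(
    "sc.parallelize(1 to 10).sum()", "demo-session", b"demo-key"
)
```

With a ZeroMQ client library, the resulting frames could be sent as a single multipart message to the kernel, which would run the snippet on the cluster and publish the results back to the application. Because the code travels as plain text in the content frame, nothing has to be compiled or shipped ahead of time.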

How will Apache Toree help my business?

Clustering technology can greatly improve the performance of data-intensive applications, but not if your application can’t access it. Apache Toree provides a reliable and powerful API into Apache Spark, which allows companies to build interactive applications that can take advantage of high-performance Spark clusters. Apache Toree is also designed to be fault tolerant and scalable, maintaining uptime and reliable performance.