Prashant Sharma, Apache Spark committer, has just joined the Spark Technology Center team. STC Engineer and Apache Spark contributor Holden Karau and Prashant talk about Spark Streaming, datasets, and Prashant’s best advice to newcomers to Apache Spark.
By its nature, open-source development involves thousands of simultaneous code contributions and JIRA efforts across dozens of interlocking code bases. In that fast and turbulent world, Apache project committers are the avatars of harmony and prudence. And nowhere is the need for cool heads more crucial than with Spark, currently the most active project in the Apache Hadoop ecosystem — if not in the ASF at large.
It’s the committers who monitor commits from the community, and who make the final choices about which patches to apply. But more than that, they’re asked to decide on release plans, and they’re expected to be an easy resource for users—fielding questions, giving guidance, and steering users out of blind alleys.
A big job, and it’s all volunteer.
What sort of person takes on those tasks, forfeiting huge chunks of free time in the process? Prashant first started working on Spark in October 2012, and has since worked on over 70 JIRAs and offered up 166 commits, making him 18th in the world for total Spark commits. His focus has been pluggable streaming receivers, receivers for Akka Actors and ZeroMQ, support for Scala (including a REPL port), Java 8, MiMa checks, and more. Check out Prashant’s GitHub here: https://github.com/ScrapCodes – Steve Moore
Holden Karau: What planned solutions are there for making accumulators work with streaming high availability?
Prashant: A lot of algorithms, including those in MLlib, depend on accumulators working correctly. As far as I know, there is nothing stopping accumulators from working well with streaming high availability.
Holden: There are cool integrations done with Kafka. Can we see integration with other sources that support back pressure nicely, and that have parity with Kafka performance?
Prashant: It should be possible to have back pressure support on message queues that support a high water mark with swap to disk. ZeroMQ is one such message queue. Also, I think message queues with reliable delivery and persistence support for mailboxes can be configured to support back pressure. Scalability and performance parity with Kafka are really questionable, since Kafka is designed to support high throughput and focuses on optimizing the flush to disk. Typical AMQP-compliant message queues “may” not have this as their primary goal. Before making comments regarding performance parity, things will have to be well tested. At the moment, I’m not aware of any such performance comparison.
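Editor's sketch: the high water mark idea Prashant mentions can be illustrated with a toy in-memory queue. This is only a minimal sketch under stated assumptions, not the ZeroMQ or Spark API. The class `HighWaterMarkQueue` and the function `run` are hypothetical names invented for this example, and the swap-to-disk part is left out. Once the queue holds `high_water_mark` items, it refuses new ones, so the producer has to hold its message and retry instead of overwhelming the consumer.

```python
from collections import deque


class HighWaterMarkQueue:
    """Toy bounded queue (hypothetical; not a real message-queue API)."""

    def __init__(self, high_water_mark):
        self.high_water_mark = high_water_mark
        self._items = deque()

    def __len__(self):
        return len(self._items)

    def offer(self, item):
        """Try to enqueue; return False (back pressure) at the high water mark."""
        if len(self._items) >= self.high_water_mark:
            return False
        self._items.append(item)
        return True

    def poll(self):
        """Dequeue the oldest item, or return None when the queue is empty."""
        return self._items.popleft() if self._items else None


def run(messages, high_water_mark, consume_every):
    """Simulate a fast producer and a slower consumer.

    The producer attempts one send per tick; the consumer drains one item
    every `consume_every` ticks. Returns (delivered, stalls), where `stalls`
    counts the ticks on which the producer was pushed back.
    """
    queue = HighWaterMarkQueue(high_water_mark)
    pending = deque(messages)
    delivered, stalls, tick = [], 0, 0
    while pending or len(queue) > 0:
        tick += 1
        if pending:
            if queue.offer(pending[0]):
                pending.popleft()
            else:
                stalls += 1  # producer keeps the message and retries next tick
        if tick % consume_every == 0:
            item = queue.poll()
            if item is not None:
                delivered.append(item)
    return delivered, stalls


delivered, stalls = run(range(10), high_water_mark=3, consume_every=2)
```

In this simulation the producer is stalled several times, yet every message is delivered in order and the queue never grows past the high water mark. That trade, slowing the sender rather than dropping data or exhausting memory, is the behavior a streaming receiver would rely on.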
Holden: What other plans do you have for Spark 2.x?
Prashant: For Spark 2.x, I want to work on optimizing the latency of Spark scheduling, which will have a great advantage for streaming jobs. It’s highly desirable for streaming to have low latency and it’s also a requirement at IBM. Currently, streaming is limited by this latency. By design, Spark Streaming is different from other streaming systems; it has advantages and limitations of its own compared with them. I want to take a stab at alternative scheduling techniques and would like to see if it’s possible to optimize latency and improve scalability of the system overall. There’s a lot of research being done in this area, like Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica’s Sparrow, or Pamela Delgado’s work.
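Editor's sketch: Spark Streaming processes data in micro-batches, so a record waits for its batch to close and then for the batch's job to be scheduled. The function below is a deliberately simplified model of that wait, not a measurement and not Spark code. The name `micro_batch_latencies`, the millisecond units, and the single fixed `scheduling_delay_ms` are assumptions made for illustration, and processing time is ignored.

```python
def micro_batch_latencies(arrival_times_ms, batch_interval_ms, scheduling_delay_ms):
    """Estimate per-record latency under a simplified micro-batch model.

    Each record waits until its batch closes, then for a fixed scheduling
    delay before processing starts. Processing time itself is ignored.
    All values are integers in milliseconds (an assumption of this sketch).
    """
    latencies = []
    for arrival in arrival_times_ms:
        # The batch containing `arrival` closes at the next interval boundary.
        batch_close = (arrival // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_close + scheduling_delay_ms - arrival)
    return latencies


# Same arrivals and 1-second batches, with two hypothetical scheduling delays.
slow = micro_batch_latencies([0, 250, 999], 1000, 50)  # [1050, 800, 51]
fast = micro_batch_latencies([0, 250, 999], 1000, 5)   # [1005, 755, 6]
```

Under this model the scheduling delay is added to every record's latency, and it also sets a floor on how short a batch interval can usefully be. That is one way to read why lower scheduling latency would benefit streaming jobs in particular.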
Holden: What are your thoughts on the integration of Spark and datasets?
Prashant: The Dataset API is certainly an improvement over the DataFrame API. It gives the convenience of working with RDDs along with the performance and optimization of DataFrames. With the help of encoders, you get a very small memory footprint. For a long time, memory footprint has been a bottleneck in Spark. In the future, I hope to see the Dataset API becoming mainstream.
Holden: Where do you feel the sweet spot for Spark Streaming is today and in the near future? For example, it’s not a great fit now for a real-time trading system, but it could be a great fit for an inventory control system. This will change over time, but I’d be interested to know what kinds of things you think it will be a fit for.
Prashant: Spark Streaming is ideal wherever integration with Spark is essential. Plus, it benefits from all the features of Spark, be it MLlib, the convenience and reliability of the RDD and DataFrame APIs, or the built-in receivers. Those are the most compelling reasons for choosing Spark Streaming. Spark Streaming is always going to be a near real-time system, though its latency can be improved in the future. In the future, I would like to see Spark Streaming as an all-rounder streaming solution, with support for CEP, and so on.
Holden: What would your advice be to someone interested in contributing to Spark Streaming? It’s not always enough to look at starter JIRAs.
Prashant: My advice to newcomers is, “It’s the personal interest in the project that brings value to it”. Starter JIRAs help in building confidence initially and in understanding the review process of the project. An in-depth understanding of how the system works (which can also be gained while doing starter JIRAs) and participation in discussions are very helpful. So, my advice is to use Spark Streaming and share your experience, especially your opinions of what can be improved. That will open new avenues for what can be done in the future.