There has been an explosion of interest in stream computing over the last year as real-time operational insight and immediate engagement are becoming commonplace business requirements. Because stream computing platforms need to keep up with “now” and data volumes are growing every day, overall throughput, scalability, and efficiency are critical characteristics when choosing a system.
With the popularity of stream computing expanding, there are many new platforms using different techniques ranging from in-memory to disk-based and mini-batch to continuous processing systems. Qualitatively, we expect mini-batch and disk-based approaches to have higher latency and lower throughput than in-memory and continuous processing systems, but to truly understand the differences we need meaningful measurements and benchmarks. We’ve previously published a Log Analytics benchmark with results for Streams, but up until now there have been few published apple-to-apples system comparisons.
With a lack of published data, we are often asked by clients to help with performance evaluations for scenarios related to one of their business problems. Recently our teams at the IBM Research Dublin Lab and IBM Software Group Europe conducted one such evaluation with an email processing scenario for a spam detection service. In this analysis, they built and tuned a version of the application on Apache Storm (an Apache Incubator project for stream computing) and a version of the application on IBM InfoSphere Streams. They then measured the throughput and CPU utilization of each platform for these applications and the Enron email dataset.
What they found is that for the benchmark application and scenarios tested, Streams outperforms Apache Storm by 2.6 to 12.3 times in terms of throughput while simultaneously consuming 5.5 to 14.2 times less CPU time. Furthermore, the throughput and CPU time gaps widen as data volume, degree of parallelism, and/or number of processing nodes grows — the tests in this study only went up to 4 nodes and 8-way parallelism.
The team consulted with experienced Storm developers and spent more than a month trying and measuring different tuning options. For Streams, built-in fusion was used, but the gap was already wide enough that more elaborate performance tuning was not done. To understand whether application logic or the base infrastructure had the biggest impact, they also tested trivial applications (one with a read followed by a simple count followed by a write, and one with a read followed by a write). In both of these restricted benchmarks, the gap was similar to the full benchmark case, meaning that for these workloads, Streams as a platform is far more efficient than Apache Storm.
We have also worked with clients to compare Streams and Apache Storm for applications including monitoring of health care sensor data, real-time marketing, and call detail record processing in telecommunications. For each of these patterns we have seen similar gaps in throughput and CPU utilization between Streams and Apache Storm.
Achieving 10x the throughput on 1/10th the hardware is a good start. When you add the development efficiency of a higher level composition language and developer tools, a rapidly growing community, a vibrant open collaboration for analytics and connectors on GitHub, and the confidence that comes from working with an enterprise partner, we think Streams is an incredibly compelling choice.
For the detailed analysis, you can read the full Streams and Apache Storm comparison written by Zubair Nabi and Eric Bouillet of IBM Research Dublin Lab and Andrew Bainbridge and Chris Thomas of IBM Software Group Europe.