by Vahid Hashemian | Published June 5, 2018
In the IT world, Apache Kafka (Kafka hereafter) is currently the most popular platform for distributed messaging and streaming data. Any application that works with data of any type (logs, events, and more), and that needs that data transferred, and perhaps transformed, as it moves among its components, can benefit from Kafka. Kafka started as a project at LinkedIn and was later open sourced to facilitate its adoption. Over the past few years it has continued as an open source project and matured a great deal. Some big names in IT use it in their production environments.
A few basic components in Kafka are:
Kafka stores data in topics. Producers send data to specific Kafka topics, and consumers read data from specific topics. Each topic has one or more partitions. Data sent to a topic is ultimately stored in one, and only one, of its partitions. Each partition is hosted by one broker and cannot span multiple brokers.
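The rule that each record lands in exactly one partition of its topic can be illustrated with a small in-memory sketch. It does not use any Kafka client library. The function names `choose_partition` and `route` are hypothetical, and CRC32 is only a stand-in hash chosen for this example; real Kafka clients apply their own partitioning logic.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a record key to exactly one partition index.

    CRC32 is an assumption made for this sketch, picked because it is
    deterministic across runs. It is not the hash Kafka clients use.
    """
    return zlib.crc32(key) % num_partitions


def route(records, num_partitions):
    """Place each (key, value) record into one, and only one, partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return partitions
```

Because the mapping depends only on the key and the partition count, records that share a key always end up in the same partition, in the order they were sent.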
There are a few reasons for the continued popularity and adoption of Kafka in the industry:
Figure 1 shows a simple Kafka cluster that contains four brokers. Three topics, t1, t2, and t3, are stored in this cluster. t1 has a single partition and is replicated three times; t2 and t3 each have two partitions and are replicated twice. It is clear from this image that the cluster can survive any single broker failure without losing data. It can survive a double-broker failure without data loss only if the failed pair is brokers 1 and 4 or brokers 3 and 4. Any other failed pair means some data will be lost.
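The reasoning behind these failure cases can be checked mechanically: a partition's data is lost only when every broker holding one of its replicas has failed. The sketch below assumes a replica placement, since the figure itself is not reproduced here. The `REPLICAS` mapping is hypothetical, chosen only to be consistent with the description above (one partition replicated three times, four partitions replicated twice), and may differ from the actual layout in Figure 1.

```python
from itertools import combinations

BROKERS = [1, 2, 3, 4]

# Hypothetical placement: (topic, partition) -> brokers holding a replica.
# This is an assumption consistent with the text, not a copy of Figure 1.
REPLICAS = {
    ("t1", 0): {1, 3, 4},
    ("t2", 0): {1, 2},
    ("t2", 1): {2, 3},
    ("t3", 0): {1, 3},
    ("t3", 1): {2, 4},
}


def lost_partitions(failed):
    """Return the partitions whose replicas all sit on failed brokers."""
    failed = set(failed)
    return sorted(p for p, brokers in REPLICAS.items() if brokers <= failed)


def lossless_failures(count):
    """Return every group of `count` brokers that can fail without data loss."""
    return [
        combo
        for combo in combinations(BROKERS, count)
        if not lost_partitions(combo)
    ]
```

With this assumed placement, `lossless_failures(1)` returns all four single brokers, and `lossless_failures(2)` returns `[(1, 4), (3, 4)]`, matching the two safe pairs named above.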
A variety of producer and consumer configurations can work with this cluster. For example:
In some use cases, real-time, continuous streams of data flow into some of these topics. For example, suppose topic 1 contains temperature readings from various sensors in a factory, while topic 2 holds detailed information about those sensors. Client 3 in the above configuration would then continuously receive temperature readings, cross-check them against the most recent sensor specs, detect anomalies, and report them in topic 3. In this scenario, Client 3 is a simple streams application: it reads data from one or more Kafka topics, performs some processing, and writes its output to another Kafka topic, all in real time.
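The processing step in this scenario can be sketched as a plain in-memory simulation. It is not the Kafka Streams API and uses no Kafka client; lists stand in for the temperature-reading and sensor-spec topics, and the yielded records stand in for what would be written to the anomaly topic. The field names (`sensor_id`, `temp_c`, `min_c`, `max_c`) and both function names are assumptions made for illustration.

```python
def latest_specs(spec_records):
    """Keep only the most recent spec per sensor (later records win)."""
    specs = {}
    for record in spec_records:
        specs[record["sensor_id"]] = record
    return specs


def detect_anomalies(readings, specs):
    """Yield one anomaly record per reading outside its sensor's range."""
    for reading in readings:
        spec = specs.get(reading["sensor_id"])
        if spec is None:
            # No spec is known for this sensor; this sketch skips the reading.
            continue
        if not spec["min_c"] <= reading["temp_c"] <= spec["max_c"]:
            yield {
                "sensor_id": reading["sensor_id"],
                "temp_c": reading["temp_c"],
                "min_c": spec["min_c"],
                "max_c": spec["max_c"],
            }
```

A real streams application would run this logic continuously as records arrive, rather than over finite lists, and would also have to decide how to treat readings from sensors that have no spec yet.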
Real-time analysis of data coming from IoT devices and of user actions on a website are two basic examples that Kafka Streams can easily handle. Other use cases are listed in the Kafka Streams documentation referenced at the end of this article.
Because of the features described above, Kafka is a popular choice for streaming data and ETL scenarios. In fact, the Kafka Streams API is part of Kafka and facilitates writing streams applications that process data in motion. It would be fair to say that Kafka emerged as a batch-processing messaging platform and has now become a favorite stream-processing platform. Kafka Streams is even augmented by another open source project, KSQL, which greatly simplifies writing Kafka Streams applications through SQL-like declarations.
Kafka and Kafka Streams have much more to offer than this short article describes. The referenced materials below describe Kafka and Kafka Streams in more detail and provide coding examples. They are highly recommended for anyone who wants a better understanding of the internals of Kafka and Kafka Streams and of how to use them in practice.