An introduction to Kafka
Learn all about the most popular platforms for distributed messaging or streaming data
In the IT world, Apache Kafka (Kafka hereafter), is currently the most popular platform for distributed messaging or streaming data. Any application that works with any type of data (logs, events, and more) and requires that data to be transferred, and perhaps also transformed as it moves among its components can benefit from Kafka. Kafka started as a project in LinkedIn and was later open sourced to facilitate its adoption. During the past few years, it has continued as an open source project and matured a great deal. Some big names in IT use it in their production environment.
A few basic components in Kafka are:
- Broker: A Kafka broker is where the data sent to Kafka is stored. Brokers are responsible for receiving and storing the data when it arrives. The broker also provides the data when requested. Many Kafka brokers can work together to form a Kafka cluster. Kafka uses Apache ZooKeeper to store metadata about the cluster. Brokers use this metadata to detect failures (for example, broker failures) and recover from them.
- Producer: A producer is an entity that sends data to the broker. There are different types of producers. Kafka comes with its own producer written in Java, but there are many other Kafka client libraries that support C/C++, Go, Python, REST, and more.
- Consumer: A consumer is an entity that requests data from the broker. Similar to producer, other than the built-in Java consumer, there are other open source consumers for developers who are interested in non-Java APIs.
Kafka stores data in topics. Producers send data to specific Kafka topics, and consumers read data also from specific topics. Each topic has one or more partitions. Data sent to a topic is ultimately stored in one, and only one, of its partitions. Each partition is hosted by one broker and cannot expand across multiple brokers.
There are a few reasons for the continued popularity and adoption of Kafka in the industry:
- Scalability: Two important features of Kafka contribute to its scalability. A Kafka cluster can easily expand or shrink (brokers can be added or removed) while in operation and without an outage. A Kafka topic can be expanded to contain more partitions. Because a partition cannot expand across multiple brokers, its capacity is bounded by broker disk space. Being able to increase the number of partitions and the number of brokers means there is no limit to how much data a single topic can store.
- Fault tolerance and reliability: Kafka is designed in a way that a broker failure is detectable by other brokers in the cluster. Because each topic can be replicated on multiple brokers, the cluster can recover from such failures and continue to work without any disruption of service.
- Throughput: Brokers can store and retrieve data efficiently and at a super-fast speed.
Figure 1 shows a simple Kafka cluster that contains four brokers. Three topics t1, t2, and t3, are stored in this cluster. t1 has a single partition and is replicated three times, t2 and t3 each have two partitions and are replicated twice. It is clear from this image that this cluster can survive a single broker failure without losing any data. It can survive a lossless double-broker failure only if brokers 1 and 4 or brokers 3 and 4 are the failed pairs. Any other failed pair means some data will be lost.
Figure 1. A simple Kafka cluster
A variety of producer and consumer configurations can work with this cluster. For example:
- Client 1 can produce to topic 1 (acting as a producer)
- Client 2 can produce to topic 2 (acting as a producer)
- Client 3 can read from topics 1 and 2, and write to topic 3 (acting as both consumer and producer)
- Client 4 can read from topic 3
In some use cases, we could have real-time and continuous streams of data go into some of these topics. For example, topic 1 contains temperature readings from various sensors in a factory, while topic 2 has detailed information about those sensors. Then Client 3 in the above configuration would be continuously receiving temperature readings, cross-checking them with the most recent sensor specs, detecting anomalies and reporting them in topic 3. In this scenario, Client 3 is a simple streams application that reads data from one or more Kafka topics, performs some processing, and writes output to another Kafka topics, all in real-time.
Real-time analysis of data coming from IoT devices or user actions on a website are a couple of basic examples that Kafka Streams can easily handle. Some other use cases are listed in Kafka Streams documentation referenced at the end of this article.
Because of the features described above, Kafka is a popular choice for streaming data and ETL scenarios. In fact, Kafka Streams API is part of Kafka and facilitates writing streams applications that process data in motion. It would be fair to say that Kafka emerged as a batch processing messaging platform and has now become a favorite streams processing platform. Kafka Streams is even augmented with another open source project, called KSQL, that hugely simplified writing Kafka Streams applications using SQL-like declarations.
Kafka and Kafka Streams have much more to offer than described in this short article. The referenced materials below describe Kafka and Kafka Streams in additional details and provide coding examples. They are highly recommended for anyone who wants to get a better understanding of the internals of Kafka and Kafka Streams and how to use them in practice.