What is Apache Kafka?

Apache Kafka is an event-streaming platform for handling real-time data feeds. It is based on a publish-subscribe messaging model, and is designed to be fault-tolerant, scalable, high-throughput, and low-latency. Kafka started as a project at LinkedIn and was later open-sourced to facilitate its adoption. It is written in Scala and Java, and it is now an open-source project of the Apache Software Foundation.

Any application that works with data of any kind (logs, events, and more) and needs to move that data between systems can benefit from Kafka.

Kafka is a good choice if you have data being generated by several systems and you need to get that data quickly to other systems. It is an especially capable solution whenever you are dealing with large volumes of data and require real-time processing to make that data available to others.

Publish-subscribe messaging

Publish-subscribe messaging means that the system generating a piece of data (a producer or publisher) does not send the data directly to a receiver (a consumer or subscriber). Instead, it sends the data to a Kafka server (called a broker), and consumers read it from there.

Decoupling the publishing and consuming steps allows a publisher to make its data available without needing to know how it will be used (consumed) by others. For example, a publisher could publish a stream of data that represents changes to a booking system. Immediately, a consumer could subscribe and use that information to update records in a second booking system. Later, another consumer could subscribe to the same data and use that information to provide alerts to customers without needing to have any further interaction with the publisher.
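The booking example above can be sketched in a few lines of Python. This is a minimal in-memory illustration of the decoupling idea, not the Kafka client API: the ToyBroker class, the "booking-changes" topic name, and the record fields are all hypothetical and exist only for this sketch.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker (hypothetical, not the Kafka API)."""

    def __init__(self):
        # topic name -> ordered list of published records
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        # The publisher only names a topic; it never addresses a consumer.
        self.topics[topic].append(record)

    def read(self, topic, start=0):
        # Each consumer reads independently; reading does not remove records.
        return self.topics[topic][start:]


broker = ToyBroker()

# The publisher emits booking changes without knowing who will read them.
broker.publish("booking-changes", {"booking": 17, "status": "confirmed"})
broker.publish("booking-changes", {"booking": 18, "status": "cancelled"})

# First consumer: mirrors the changes into a second booking system.
second_system = {
    r["booking"]: r["status"] for r in broker.read("booking-changes")
}

# Later consumer: reads the same records to build customer alerts.
alerts = [
    f"Booking {r['booking']} was cancelled"
    for r in broker.read("booking-changes")
    if r["status"] == "cancelled"
]

print(second_system)
print(alerts)
```

Note that the second consumer is added without touching the publisher's code, which is the point of the decoupling described above.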

Kafka architecture

Kafka runs as a cluster of one or more Kafka brokers. Data written to Kafka is appended to a partition of a topic and stored there in order, so that it is ready to be read by a consumer.
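To make the idea of ordered storage within a partition more concrete, here is a small sketch that models each partition as an append-only list. It is a toy model under stated assumptions, not how Kafka is implemented: the ToyTopic class is hypothetical, and zlib.crc32 is used only as a simple stable hash standing in for Kafka's own partitioning logic.

```python
import zlib


class ToyTopic:
    """Toy model of a topic split into append-only partitions (hypothetical)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash of the key picks the partition, so records with the
        # same key always land in the same partition (crc32 is a stand-in).
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        # The offset is the record's position within that partition.
        offset = len(self.partitions[partition]) - 1
        return partition, offset


topic = ToyTopic(num_partitions=3)
first = topic.append("sensor-a", "reading 1")
second = topic.append("sensor-a", "reading 2")
other = topic.append("sensor-b", "reading 1")

print(first, second, other)
```

In this sketch, order is only meaningful within a single partition: the two "sensor-a" records share a partition and receive consecutive offsets, while nothing is promised about how they interleave with records in other partitions.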

[Figure: High-level architectural overview of Apache Kafka]

On top of the publish-subscribe messaging model, Kafka provides built-in resilience through the replication of data across brokers, so that it is highly available and fault tolerant even within a single Kafka cluster. In addition, you can configure your Kafka setup to be distributed, which provides further protection against failures.
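The effect of replication can be illustrated with a deliberately simplified sketch. The ToyReplicatedPartition class and the broker names are hypothetical, and the model copies every record to all replicas synchronously, which is a simplification of this sketch rather than a description of Kafka's actual replication protocol.

```python
class ToyReplicatedPartition:
    """Toy model: one partition copied across several brokers (hypothetical)."""

    def __init__(self, brokers):
        # broker name -> that broker's copy of the partition
        self.replicas = {broker: [] for broker in brokers}
        # One replica acts as leader and serves reads in this sketch.
        self.leader = brokers[0]

    def append(self, record):
        # Simplification: every live replica receives the record immediately.
        for log in self.replicas.values():
            log.append(record)

    def fail(self, broker):
        # Losing a broker removes its copy; if it was the leader,
        # a surviving replica takes over.
        del self.replicas[broker]
        if broker == self.leader:
            self.leader = next(iter(self.replicas))

    def read(self):
        return list(self.replicas[self.leader])


partition = ToyReplicatedPartition(["broker-1", "broker-2", "broker-3"])
partition.append("event a")
partition.append("event b")

partition.fail("broker-1")  # the leader goes down
print(partition.leader, partition.read())
```

Because the other brokers already hold a copy, the data remains readable after the failure, which is the property the paragraph above describes.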

Kafka maintains a history of events, allowing consumers to go back to a given record or point in time and replay events up to the most recent. This history is often used by analytics systems, which can replay the data and build a new view of it on their own schedule, decoupled from when the data was created.
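Replay can be sketched as reading a retained log from a chosen position. The example below is a minimal illustration that assumes a single in-memory log of (timestamp, event) pairs; the function names and the sample events are hypothetical and are not part of any Kafka API.

```python
from bisect import bisect_left

# A retained history of (timestamp, event) pairs, in the order written.
log = [
    (100, "order created"),
    (105, "order paid"),
    (110, "order shipped"),
    (120, "order delivered"),
]


def offset_for_time(log, timestamp):
    """Return the first offset whose timestamp is >= the given timestamp."""
    timestamps = [ts for ts, _ in log]
    return bisect_left(timestamps, timestamp)


def replay(log, from_offset):
    """Return every event from the given offset up to the most recent."""
    return [event for _, event in log[from_offset:]]


# Go back to a point in time and replay from there.
start = offset_for_time(log, 106)
replayed = replay(log, start)

# An analytics job can also rebuild its view from the very beginning.
full_rebuild = replay(log, 0)

print(start, replayed)
```

Reading never removes anything from the log in this sketch, so a consumer can replay as often as it needs and at whatever pace suits it.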

Scalability is also an important feature, giving you the option to dynamically scale Kafka for performance. For example, you can expand or shrink your Kafka clusters by adding or removing brokers as needed and as part of normal operation without requiring any outages.

Next steps

With this basic understanding of Apache Kafka, you are ready to learn more about how it works and the many ways you can use it.