With the exploding interest in and development of artificial intelligence (AI) and machine learning, this year’s Spark Summit emphasized machine learning experiences and many emerging techniques. Because of my exploration of deep learning with TensorFlow, I am starting a series of articles to summarize the lessons, tricks, and tips that I learned. This first article covers Apache Kafka.
Apache Kafka is a distributed streaming platform that generally can be used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems and applications
- Building real-time streaming applications that transform or react to the stream of data
To try out Apache Kafka with downstream frameworks such as Apache Spark or TensorFlow, you usually have to set up a cluster using a few (virtual) machines or a single node in local mode. Instead of installing those components to run Apache Kafka, this article explains how to get an embedded Apache Kafka cluster running on your local machine so that you can focus on developing the downstream applications.
The Apache Kafka cluster usually includes a few components:
- Zookeeper: A centralized service that maintains state across the nodes of the cluster. Here, I use Zookeeper running in embedded mode.
- Kafka brokers: Kafka brokers form the heart of the system and act as the pipelines where the data is stored and distributed.
- Producers: Producers publish data to the topics of their choice.
- Consumers: Consumers read data from the cluster.
Zookeeper is a service that stores key-value pairs to maintain server state. Kafka relies on Zookeeper to run, so the first task is to start a Zookeeper instance.
By using another Apache project, Apache Curator, you can start a TestingServer that Curator provides. As its Javadoc notes, TestingServer is FOR TESTING PURPOSES ONLY, but that is sufficient for this use case. By creating an instance of TestingServer, you can easily run Zookeeper in embedded mode and get connection information from that server. Because certain versions of Apache Curator only work with certain versions of Zookeeper, make sure that you are using compatible versions; take a look at the version compatibility matrix before you begin.
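A minimal sketch of starting the embedded Zookeeper, assuming the org.apache.curator:curator-test dependency is on the classpath; the port number 2181 is an arbitrary choice:

```java
import org.apache.curator.test.TestingServer;

public class EmbeddedZookeeper {
    public static void main(String[] args) throws Exception {
        // Start an in-process Zookeeper on port 2181; 'true' starts it immediately.
        TestingServer zkServer = new TestingServer(2181, true);

        // The connection string (e.g. "127.0.0.1:2181") is what the Kafka broker
        // will need for its zookeeper.connect setting.
        System.out.println("Zookeeper running at " + zkServer.getConnectString());

        // Shut down the embedded server when you are done.
        zkServer.close();
    }
}
```

TestingServer also offers a no-argument constructor that picks a free port automatically, which avoids port conflicts in test suites.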
Before starting the Kafka server, you must set up a minimum number of configuration properties to run it, including the host name and port.
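The configuration below is a sketch of such a minimal property set; the property names follow Kafka's standard broker configuration keys, while the specific port and log directory values are assumptions you should adapt:

```java
import java.util.Properties;

public class KafkaServerConfig {
    // Build the minimal set of broker properties needed to start a Kafka server.
    public static Properties minimalBrokerProps(String zkConnect) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zkConnect);   // address of the embedded Zookeeper
        props.put("broker.id", "0");                 // unique id within the cluster
        props.put("host.name", "127.0.0.1");         // host the broker binds to
        props.put("port", "9092");                   // port the broker listens on
        props.put("log.dir",                          // where the broker stores its logs
                System.getProperty("java.io.tmpdir") + "/kafka-logs");
        return props;
    }

    public static void main(String[] args) {
        Properties props = minimalBrokerProps("127.0.0.1:2181");
        System.out.println(props.getProperty("host.name") + ":" + props.getProperty("port"));
    }
}
```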
Kafka Test provides a very useful TestUtils class that you can use to create a Kafka server. Note that you usually won't find Kafka Test in your release package, but for testing purposes you can add it by specifying the <classifier>test</classifier> element in your pom.xml file.
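The dependency declaration might look like the sketch below; the artifact id and version shown are placeholders, so match them to the Scala and Kafka versions of your project:

```xml
<!-- Test-jar dependency that makes kafka.utils.TestUtils available.
     Replace the version with the one matching your Kafka release. -->
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.11</artifactId>
    <version>0.10.2.1</version>
    <classifier>test</classifier>
    <scope>test</scope>
</dependency>
```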
With the Kafka server running on your local machine in embedded mode, you can start writing code to create topics and put some data into them. Later on, you can stop the server and do some cleanup.
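The steps above can be sketched end to end as follows. This assumes curator-test, the kafka test jar, and kafka-clients are on the classpath; the topic name, ports, and the exact TestUtils.createServer signature are assumptions that vary across Kafka versions:

```java
import java.util.Properties;

import kafka.server.KafkaConfig;
import kafka.server.KafkaServer;
import kafka.utils.TestUtils;
import org.apache.curator.test.TestingServer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.utils.Time;

public class EmbeddedKafkaExample {
    public static void main(String[] args) throws Exception {
        // 1. Embedded Zookeeper.
        TestingServer zk = new TestingServer(2181, true);

        // 2. Embedded Kafka broker, built from a minimal configuration.
        Properties brokerProps = new Properties();
        brokerProps.put("zookeeper.connect", zk.getConnectString());
        brokerProps.put("broker.id", "0");
        brokerProps.put("listeners", "PLAINTEXT://127.0.0.1:9092");
        brokerProps.put("log.dir",
                System.getProperty("java.io.tmpdir") + "/kafka-logs");
        KafkaServer broker =
                TestUtils.createServer(new KafkaConfig(brokerProps), Time.SYSTEM);

        // 3. Produce a record; with auto topic creation enabled (the broker
        //    default), the first send creates "test-topic" on the fly.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "127.0.0.1:9092");
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("test-topic", "key", "hello")).get();
        }

        // 4. Cleanup: stop the broker first, then the embedded Zookeeper.
        broker.shutdown();
        broker.awaitShutdown();
        zk.close();
    }
}
```

Shutting the broker down before Zookeeper matters: the broker deregisters itself through Zookeeper, so closing Zookeeper first can leave the shutdown hanging.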
Developers usually must set up a cluster to test Apache Kafka and program the downstream applications. But by using some other open source projects and test utilities, you can avoid downloading and installing those components, speeding up development and letting you focus on the applications. For more information and code, take a look at my GitHub repo.