You’ve probably heard of Apache Kafka. It streams data and records between applications and data pipelines, and because it’s fault tolerant, it can also act as an enterprise messaging system. In this article I’m going to dive in a bit on what makes Apache Kafka so cool and how it can help your applications. I’ll describe the kind of architecture that makes a perfect candidate for Kafka (and assume it looks a lot like yours), where you want that architecture to be, and what’s blocking your path to getting there!

Here’s what you have in your big data environment

Lots of data sources

You have data sources. Lots of them. These sources are your applications: from payment processing systems, to customer-facing websites, to social media platforms, and so on. Not to mention all of the internal applications that provide administrative functions, management reports, and on and on.

These applications produce a lot of data, consume a lot of data, or both. If your system architecture looks anything like LinkedIn’s did before they started working on Kafka, then Apache Kafka may be for you.

Data pressure (Big Data’s 3Vs)

In addition to lots of data sources, you also have the “Big Data Problem” of volume, velocity, and variety: the (many and varied) sources of data in your enterprise are producing data at an ever-increasing rate, resulting in more and more (and more) data to deal with.

In other words, the data is pressuring you to come up with a solution (spoiler alert: there already is one).

And if that wasn’t bad enough, groups within each organization in your enterprise have (and vigorously defend) their own applications and data silos.

This naturally results in duplicate (and – due to lack of integration – effectively orphaned) and stale data, as well as inaccurate metrics.

User pressure

Despite the data pressure, internal and external users want faster response time, (more) complete views of the data, fresher data, and (oh yeah!) accurate metrics.

And so it goes.

Here’s what you want to happen with your big data

Systems that understand each other

What you want are systems that don’t just talk to each other, but understand each other, kind of like what they built at LinkedIn.

After all, systems that understand each other are truly integrated. Those that aren’t, well, you know.

Data in real time

You need to provide real-time data to your applications. Users have come to expect it, and management wants it.

So when something happens in an application, that event (and the data that accompanies it) needs to be available to every other application in your enterprise for possible consumption.

Easy to maintain

As you integrate existing applications to understand one another’s data, and add new applications, the resulting code needs to be easy to maintain.

What is standing in your way

Lots of APIs

Every data producer has its own API. Fair enough. Let’s say you integrate application B with application A (i.e., B uses A’s API to get the data), and all is fine.

Now suppose you need to integrate application C with A. You’ve done this once already (with B, remember?), so maybe there’s some design or code reuse (maybe not), but in any case you get it done. Nice.

Then C needs to integrate with B (let’s assume B has a nice API for this, though maybe not). Whew. Now along come applications D, E, and F, each needing to integrate with all of the others. This can quickly become an integration nightmare.
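
To put a rough number on it: with point-to-point integration, n applications can require up to n(n − 1)/2 pairwise integrations. For the six applications above (A through F), that’s up to 6 × 5 / 2 = 15 separate integrations to build, test, and maintain, each against a different API.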

Consumers sometimes need to be producers

Some of your internal applications exist merely to consume data from one source, transform it, and write it to another.

This is not bad, but it’s another point of integration in your architecture.

Maybe you’re thinking an ETL solution is a perfect fit here, but traditional ETL is a batch solution, and batch means your data is already stale by the time it lands.

Data in real time is a complex problem

You need data in real time. And to achieve the kind of horizontal scalability required to make real-time data a reality, you know you have to decouple all those data sources (using best practices is just how you roll).

So maybe you’re thinking of using a traditional message broker like ActiveMQ to provide the decoupling and horizontal scalability. The thing is, traditional message brokers don’t always scale well: the broker itself tracks delivery state for every message and every consumer, and a message is typically gone once it’s been consumed, so adding consumers adds bookkeeping on the broker instead of spreading the load out.

How Kafka can help your big data environment

Kafka is a stream-based publish/subscribe system that maintains a time-ordered sequence of log records in real time.

Wait, what now? Hm, let’s break it down.

Stream-based

Think of a stream as a sequence of events, each of which results in a change to your application’s state in the database. The database – as the system of record for your application – always contains the current state, but not the events that led to that state.

These (change-inducing) events are like a story. The story of your application.

A major benefit of “big data” is the ability to capture these events so they can be analyzed. The story told by these events gives you insight into how your application performs, and provides management with fodder for making decisions.
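
To make that concrete, here’s what one such event might look like as a small, immutable Java class. The class and field names are hypothetical, made up for this example; the point is that the event records what happened and when, while the database row it updates keeps only the result:

```java
import java.time.Instant;

// A hypothetical change-inducing event: a customer changes their email address.
// The database ends up holding only the new address (the current state);
// the event stream keeps the fact that the change happened, and when.
public final class EmailChangedEvent {
    private final String customerId;
    private final String oldEmail;
    private final String newEmail;
    private final Instant occurredAt;

    public EmailChangedEvent(String customerId, String oldEmail,
                             String newEmail, Instant occurredAt) {
        this.customerId = customerId;
        this.oldEmail = oldEmail;
        this.newEmail = newEmail;
        this.occurredAt = occurredAt;
    }

    public String customerId()  { return customerId; }
    public String oldEmail()    { return oldEmail; }
    public String newEmail()    { return newEmail; }
    public Instant occurredAt() { return occurredAt; }
}
```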

Publish/Subscribe

Kafka works by using a publish/subscribe architecture. Data producers publish records to a topic. Consumers, subscribed to that topic, receive the latest records and do their processing (which may involve some transformation, followed by publishing to another Kafka topic).
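
Here’s a minimal producer sketch using the standard kafka-clients Java library. The broker address, topic name, and payload are assumptions for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed address; point this at your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish an event to a (hypothetical) topic. Records with the
            // same key land on the same partition, so events for one
            // customer stay in order.
            producer.send(new ProducerRecord<>("customer-events",
                                               "customer-42",
                                               "email-changed"));
        }
    }
}
```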

Using publish/subscribe like this results in a highly decoupled architecture. And thanks to Kafka’s super-fast append-only partitioned log, it provides incredible horizontal scalability.
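
On the other side, a subscriber needs to know only the topic, not who produced to it; that’s the decoupling. And if you run several copies of a sketch like this one with the same (assumed) group.id, Kafka spreads the topic’s partitions across them, which is the horizontal scalability in action:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EventSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers that share a group.id divide the topic's partitions
        // among themselves; add instances to scale out.
        props.put("group.id", "email-change-processors");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process (and perhaps transform and republish) the event.
                    System.out.printf("key=%s value=%s%n",
                                      record.key(), record.value());
                }
            }
        }
    }
}
```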

Sequence of time-ordered log records

When I say “log” in the context of Kafka, I don’t mean “that file where log messages are stored.” That’s the application log.

In the context of Kafka, the log is the append-only, time-ordered system of record.

The log contains the stream of events output by your applications: something happens in your application (the event), so you tell Kafka about it (publish), and other interested parties (subscribers, i.e., your applications) can take action.

This provides a number of benefits:

  • Makes the system as a whole more fault tolerant
  • Provides the ability to replay a sequence of events (undo, last-known-good reversion, etc.), as sketched below
  • Stores (in principle) every event ever published, though in practice retention is bounded by configured time and size limits
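
Here’s a minimal sketch of that replay ability, again using the kafka-clients Java library and the same assumed topic name. Manually assigning a partition and seeking to its beginning lets a consumer re-read whatever the log still retains (how much that is depends on topic settings such as retention.ms and retention.bytes):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class EventReplayer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign partition 0 of the (assumed) topic and rewind
            // to the earliest offset the log still retains.
            TopicPartition partition = new TopicPartition("customer-events", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));

            for (ConsumerRecord<String, String> record :
                    consumer.poll(Duration.ofSeconds(1))) {
                // Re-process the historical events, e.g. to rebuild state.
                System.out.printf("offset=%d value=%s%n",
                                  record.offset(), record.value());
            }
        }
    }
}
```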

What’s next?

Please leave a comment! I’d love to hear what you think about Kafka.
