Implement a distributed messaging system
To transition from a traditional ETL to an event-driven ETL, you need a distributed messaging system such as Apache Kafka or Apache Pulsar. Apache Kafka is a fast, scalable, and durable real-time messaging engine from the Apache Software Foundation. Apache Pulsar is another open-source distributed messaging system, originally created at Yahoo and now part of the Apache Software Foundation. Both support the publish-subscribe (pub-sub) pattern. Alternatively, you can use a fully managed, cloud-based messaging service such as IBM Message Hub or Amazon Kinesis Data Streams. IBM Message Hub is based on Apache Kafka and is offered as Kafka-as-a-Service. Amazon Kinesis Data Streams, although inspired by Apache Kafka, is not a drop-in replacement for it.
More importantly, a fully managed, cloud-based messaging service is a great way to get started with an event-driven ETL architecture. Managed services free you from installing, configuring, and maintaining the messaging system, letting you focus on your transformation initiative. All you need are the APIs to interact with the messaging service, particularly the Producer and Consumer APIs.
Identify and capture entity life-cycle events
To implement event-driven ETL, we start by working with concepts such as events, topics, event producers, and event consumers. According to Wikipedia, an event can be defined as “a significant change in state”. Alternatively, you can think of events as the state changes that define the life-cycle of an entity. Events can be raw or derived. Producers are responsible for publishing state changes, i.e., creating events. Consumers read these events, perform the necessary transformations, and load them into a data store. Both producers and consumers perform their tasks in real time. A producer can also assign one or more topics to an event, and a consumer can then read from one or more given topics.
At this stage, it is important to understand the anatomy of an event. Every event consists of an event header and an event body. The event header typically includes the event timestamp, i.e., the time at which the event occurred in the real world, while the event body carries the details of the state change.
Obviously, we need to identify and capture all important state changes associated with entities. For instance, a typical car dealer's system will have customer and car entities. At a high level, we can identify the following events for our car entity:
- A car was entered as a new inventory (state: “for sale”)
- The car was purchased by a customer (state: “sold”)
- The car was delivered to a customer (state: “delivered”)
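To make the anatomy concrete, the life-cycle events above can be sketched as plain Python dictionaries with a header (the event timestamp) and a body (the state change). The field names here are illustrative assumptions, not a fixed schema:

```python
import time

def make_event(car_id, old_state, new_state):
    """Build an event with a header (timestamp) and a body (state change).

    Field names are illustrative; real schemas vary by system.
    """
    return {
        "header": {"timestamp": time.time()},  # when the change occurred
        "body": {"car_id": car_id, "from": old_state, "to": new_state},
    }

# The three life-cycle events for one car, matching the list above.
events = [
    make_event("car-42", None, "for sale"),
    make_event("car-42", "for sale", "sold"),
    make_event("car-42", "sold", "delivered"),
]
```

Derived events (e.g., "car sold within 30 days of listing") could then be computed from sequences of raw events like these.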
Refactor applications to act as event producers
In event-driven architecture, a producer is a process that publishes events to one or more topics of a messaging system for further processing. Generally speaking, producers are existing applications refactored to support event publication. For instance, when a customer purchases a car, the car dealer's system acts as an event producer and logs this state change as an event, recording the car's transition from “for sale” to “sold” in the event body. To achieve this, you have to change your existing applications:
- to integrate with the messaging system using Producer APIs or interfaces, and
- to update the application logic to publish any or all state changes as events to the messaging system
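The two refactoring steps above can be sketched as follows. To keep the example self-contained, a tiny in-memory `Broker` class stands in for the messaging system and its Producer API (the class and its `publish` method are illustrative assumptions); a real application would call a Kafka or Pulsar producer client instead:

```python
from collections import defaultdict

class Broker:
    """In-memory stand-in for a messaging system (illustrative only)."""
    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, event):
        """Append an event to a topic, as a Producer API would."""
        self.topics[topic].append(event)

class DealerApp:
    """Existing application refactored to publish state changes as events."""
    def __init__(self, broker):
        self.broker = broker  # step 1: integrate with the messaging system
        self.cars = {}

    def sell_car(self, car_id):
        # Original business logic: update the car's state in place.
        self.cars[car_id] = "sold"
        # Step 2: new logic, publish the state change as an event.
        self.broker.publish(
            "car-events",
            {"car_id": car_id, "from": "for sale", "to": "sold"},
        )

broker = Broker()
app = DealerApp(broker)
app.sell_car("car-42")
```

The key point is that the existing business logic is untouched; event publication is added alongside it.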
In addition, there is a special type of producer called a source connector. A source connector imports data from an existing data store (e.g., a relational database) into the messaging system. It essentially enables you to capture database changes as a stream, in many cases by extending the existing change data capture (CDC) process supported by your data store.
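A source connector can be approximated by polling the data store for rows changed since the last captured offset. The sketch below uses SQLite from the standard library, with a monotonically increasing `version` column as a stand-in for a real CDC stream (the table and column names are assumptions for illustration):

```python
import sqlite3

# A toy source database with a version column tracking row changes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id TEXT PRIMARY KEY, state TEXT, version INTEGER)")
conn.execute("INSERT INTO cars VALUES ('car-42', 'for sale', 1)")
conn.execute("UPDATE cars SET state = 'sold', version = 2 WHERE id = 'car-42'")

def capture_changes(conn, last_version):
    """Emit one event per row changed since last_version (poor man's CDC)."""
    rows = conn.execute(
        "SELECT id, state, version FROM cars WHERE version > ? ORDER BY version",
        (last_version,),
    ).fetchall()
    events = [{"car_id": r[0], "state": r[1]} for r in rows]
    new_offset = rows[-1][2] if rows else last_version  # resume point
    return events, new_offset

events, offset = capture_changes(conn, last_version=0)
```

A real connector (e.g., one built on Kafka Connect) would publish each captured change to a topic and persist the offset so the stream survives restarts.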
As noted above, producer applications already exist in the ecosystem; they just need to be refactored.
Create event consumers to process data
A consumer is an application that subscribes to one or more topics and processes the stream of events published to them. Unlike producer applications, consumer applications are new to this ecosystem. A consumer application integrates with the messaging system using Consumer APIs or interfaces. After reading events, a consumer can either perform data transformation and load the results into the corresponding data store (ETL), or load the raw data directly into a data store and skip the transformation step (ELT). In addition, rather than loading data into a data store, some consumers write their output back to the messaging system as a new topic.
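A consumer can be sketched as a loop that reads events from a topic, applies a transformation, and loads the result into a data store. Here the topic is a plain list and the store a dictionary, both illustrative stand-ins for real Consumer APIs and databases:

```python
# Illustrative stand-ins: a "topic" as a list of events, a "store" as a dict.
topic = [
    {"car_id": "car-42", "from": "for sale", "to": "sold"},
    {"car_id": "car-7", "from": "for sale", "to": "sold"},
]
store = {}

def transform(event):
    """ETL-style transformation: keep only the fields the store needs."""
    return {"state": event["to"].upper()}

def consume(topic, store):
    for event in topic:  # a real consumer would poll the broker here
        store[event["car_id"]] = transform(event)  # load into the store

consume(topic, store)
```

For the ELT variant, `transform` would simply return the raw event unchanged, deferring all transformation to query time.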
The role of the data store is quite important in the ETL vs. ELT discussion. A consumer can load data into:
- a data store associated with an application (relational databases such as MySQL and Postgres, or non-relational databases such as MongoDB and Cassandra)
- a data lake backed by a cloud object storage service (Amazon S3, Google Cloud Storage) or the Hadoop Distributed File System (part of the Hadoop distribution)
- a traditional data warehouse (cloud-based such as Amazon Redshift or self-hosted such as SQL Server Data Warehouse)
A data lake inherently follows the ELT model, where transformation occurs at run time. A consumer should generally avoid any data transformation and keep the data as raw as possible when loading into a data lake. If necessary, a consumer can perform standard data transformations, such as flattening or semi-flattening nested structures (denormalization), to improve query response time. Transformation is generally performed at run time, when a query is executed against the data lake using tools such as Hive, Presto, Apache Spark, or Apache Drill running on virtual machines in the cloud.
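The flattening transformation mentioned above can be sketched as a small recursive helper that turns a nested event into a flat record with dotted keys, a common shape for data-lake queries (the dotted key naming is an assumption; conventions vary):

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single level using dotted keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested structures, extending the key prefix.
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

event = {"header": {"timestamp": 1700000000},
         "body": {"car_id": "car-42", "to": "sold"}}
flat = flatten(event)
# flat → {"header.timestamp": 1700000000, "body.car_id": "car-42", "body.to": "sold"}
```

Semi-flattening would stop the recursion at a chosen depth, keeping selected sub-structures (e.g., arrays) intact for the query engine to expand.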
Figure-2: An example of event-driven ETL architecture in an organization.
Buy vs. Build
Last but not least, you have complete freedom to choose your own data integration strategy. You can build a home-grown event-driven data architecture by following the high-level steps covered in this recipe, or you can buy a data-management-as-a-service offering that provides a fully managed, end-to-end, click-and-integrate solution. You may want to consider a targeted SaaS offering such as Panoply, or a broader-scope integration platform as a service (iPaaS) solution.