Overview

Skill Level: Advanced

Extract, transform, and load (ETL) as we know it is quickly disappearing. The new breed of big data tools and streaming platforms provide the means to move beyond traditional ETL.

Ingredients

Extract, transform, and load (ETL) as we know it is quickly disappearing. The new breed of big data tools and streaming platforms provides the means to move beyond traditional ETL (see, for example, this discussion of ETL vs. ELT for alternatives). One of these modern approaches is event-driven ETL architecture, which we cover in this recipe.

Just for a second, imagine you are working with a complex ecosystem of applications in a large organization that is tightly coupled to a traditional ETL model. These applications are either external, customer-facing systems with a direct impact on customer experience, or internal business-process enablers with a direct impact on operational efficiency. There are a large number of databases in this ecosystem, and the traditional ETL-based approach is too complex and difficult to maintain because of the point-to-point data integration between these databases. How can you reduce this complexity while future-proofing the organization's data architecture?

 


Figure-1: An example of traditional ETL in an organization.

 

To implement an event-driven ETL architecture, you have to start thinking of events as first-class citizens. We will then discuss the role of a messaging system, the key building block that enables organizations to transition from traditional ETL to event-driven ETL. An event-driven ETL architecture lets us avoid point-to-point data integrations while ensuring that events are transformed and/or loaded in real time.

Step-by-step

  1. Implement a distributed messaging system

    To transition from traditional ETL to event-driven ETL, you need a distributed messaging system such as Apache Kafka or Apache Pulsar. Apache Kafka is a fast, scalable, and durable real-time messaging engine from the Apache Software Foundation. Apache Pulsar is another open-source distributed messaging system, originally created at Yahoo and now also part of the Apache Software Foundation. Both Apache Kafka and Apache Pulsar support the publish-subscribe (pub-sub) pattern. Alternatively, you can use a fully managed, cloud-based messaging service such as IBM Message Hub or Amazon Kinesis Data Streams. IBM Message Hub is based on Apache Kafka and is offered as Apache Kafka as a service. Although Amazon Kinesis Data Streams is inspired by Apache Kafka, it is not a drop-in replacement for it.

    More importantly, a fully managed cloud-based messaging service is a great way to get started with an event-driven ETL architecture. Managed services free you from installing, configuring, and maintaining the messaging system, letting you focus on your transformation initiative. All you need are the APIs to interact with the messaging service, particularly the Producer and Consumer APIs.
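
    If you choose Apache Kafka and want to experiment locally, the topic that will carry your events can be created programmatically. The sketch below uses the open-source kafka-python client; the broker address and the "car-events" topic name are assumptions made for illustration, not values prescribed by this recipe.

        # Minimal topic-creation sketch using kafka-python.
        # Broker address and topic name are illustrative assumptions.
        from kafka.admin import KafkaAdminClient, NewTopic

        admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

        # Create the topic that producers will publish entity life-cycle events to.
        admin.create_topics([
            NewTopic(name="car-events", num_partitions=3, replication_factor=1)
        ])
        admin.close()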

  2. Identify and capture entity life-cycle events

    To implement event-driven ETL, we start working with concepts such as events, topics, event producers, and event consumers. According to Wikipedia, an event can be defined as "a significant change in state". Alternatively, you can think of events as the state changes that define the life-cycle of an entity. Events can be raw or derived. Producers are responsible for publishing state changes, that is, for creating events. Consumers read these events, perform the necessary transformations, and load them into a data store. Both consumers and producers perform their tasks in real time. A producer can also assign one or more topics to an event. A consumer can then read from one or more given topics.

    At this stage, it is important to understand the anatomy of an event. Every event consists of an event header and an event body. An event header typically includes the event timestamp, i.e., the time at which the event occurred in the real world.

    Obviously, we need to identify and capture all important state changes associated with entities. For instance, a typical car dealer's system will have customer and car entities. At a high level, we can identify the following events for our car entity (a sketch of one such event follows the list):

    1. A car was entered as new inventory (state: "for sale")
    2. The car was purchased by a customer (state: "sold")
    3. The car was delivered to the customer (state: "delivered")
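
    One possible shape for such an event, following the header/body anatomy described above, is sketched below. All field names and values are illustrative assumptions, not a prescribed schema.

        # An illustrative event for the car entity's "sold" transition.
        # The header/body split follows the anatomy described above;
        # all field names are assumptions made for this example.
        car_sold_event = {
            "header": {
                "event_type": "car.sold",
                "event_timestamp": "2019-05-20T14:32:00Z",  # when the sale happened in the real world
                "producer": "dealer-sales-app",
            },
            "body": {
                "car_id": "VIN-1HGCM82633A004352",
                "previous_state": "for sale",
                "new_state": "sold",
                "customer_id": "C-1042",
            },
        }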

     

  3. Refactor applications to act as event producers

    In an event-driven architecture, a producer is a process that publishes events to one or more topics of a messaging system for further processing. Generally speaking, producers are nothing but existing applications refactored to support event publication. For instance, when a customer purchases a car, the car dealer's system acts as an event producer and logs this state change as an event, including the car's transition from "for sale" to "sold" in the event body. To achieve this, you have to change your existing applications (see the producer sketch after the list below):

    1. to integrate with the messaging system using Producer APIs or interfaces, and
    2. to update application logic to publish any or all state changes as events to the messaging system
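
    A minimal producer sketch, again using the kafka-python client, might look like the following. The broker address, topic name, and event fields are assumptions carried over from the earlier example; the exact Producer API will differ if you use another messaging system.

        # Minimal producer sketch using kafka-python; names and fields are illustrative.
        import json
        from datetime import datetime, timezone

        from kafka import KafkaProducer

        producer = KafkaProducer(
            bootstrap_servers="localhost:9092",
            value_serializer=lambda event: json.dumps(event).encode("utf-8"),
        )

        def publish_car_sold(car_id, customer_id):
            """Publish the 'for sale' -> 'sold' state change as an event."""
            event = {
                "header": {
                    "event_type": "car.sold",
                    "event_timestamp": datetime.now(timezone.utc).isoformat(),
                },
                "body": {
                    "car_id": car_id,
                    "previous_state": "for sale",
                    "new_state": "sold",
                    "customer_id": customer_id,
                },
            }
            producer.send("car-events", value=event)
            producer.flush()  # block until buffered events are actually sent to the broker

        # Example call from the existing sales code path:
        publish_car_sold("VIN-1HGCM82633A004352", "C-1042")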

     

    In addition, there is a special type of producer called a source connector. A source connector imports data from an existing data store (e.g., a relational database) into the messaging system. A source connector essentially enables you to capture database changes as a stream, in many cases by extending the existing change data capture (CDC) process supported by your data store.

    As noted above, producer applications already exist in the ecosystem and they just need to be refactored.

  4. Create event consumers to process data

    A consumer is an application that subscribes to one or more topics and processes the stream of events published to them. Unlike producer applications, consumer applications are new in this ecosystem. To achieve this, a consumer application integrates with the messaging system using Consumer APIs or interfaces. After reading events, a consumer can either perform data transformation and load the results into the corresponding data store (ETL) or load the raw data directly into a data store, skipping the transformation step (ELT). In addition, rather than loading data into a data store, some consumers write their results back to the messaging system as a new topic.
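
    A minimal consumer sketch, assuming kafka-python and using SQLite as a stand-in for the application's data store, might look like this; the topic name and table schema are illustrative assumptions.

        # Minimal consumer sketch using kafka-python; SQLite stands in for the data store.
        import json
        import sqlite3

        from kafka import KafkaConsumer

        consumer = KafkaConsumer(
            "car-events",
            bootstrap_servers="localhost:9092",
            group_id="car-events-loader",
            auto_offset_reset="earliest",
            value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
        )

        db = sqlite3.connect("car_states.db")
        db.execute(
            "CREATE TABLE IF NOT EXISTS car_state "
            "(car_id TEXT PRIMARY KEY, state TEXT, updated_at TEXT)"
        )

        for message in consumer:
            event = message.value
            # Transform: keep only the fields the downstream store needs (ETL);
            # an ELT consumer would instead write the raw event untouched.
            db.execute(
                "INSERT OR REPLACE INTO car_state (car_id, state, updated_at) VALUES (?, ?, ?)",
                (
                    event["body"]["car_id"],
                    event["body"]["new_state"],
                    event["header"]["event_timestamp"],
                ),
            )
            db.commit()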

    The role of the data store is quite important in the ETL vs. ELT discussion. A consumer can load data into:

    1. a data store associated with an application (relational databases such as MySQL and Postgres, or non-relational databases such as MongoDB and Cassandra)
    2. a data lake backed by a cloud object storage service (Amazon S3, Google Cloud Storage) or the Hadoop Distributed File System (part of a Hadoop distribution)
    3. a traditional data warehouse (cloud-based, such as Amazon Redshift, or self-hosted, such as SQL Server Data Warehouse)

     

    A data lake inherently follows the ELT model, where transformation occurs at run time. A consumer should generally avoid any data transformation and keep the data as raw as possible when loading it into a data lake. If necessary, a consumer can perform standard data transformations such as flattening or semi-flattening nested structures (denormalization) in order to improve query response time, as sketched below. Transformation is generally performed at run time, when a query is executed against the data lake using tools such as Hive, Presto, Apache Spark, or Apache Drill, typically running on virtual machines in the cloud.
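
    As a sketch of this kind of light denormalization, the following flattens the nested header/body event used earlier and appends it as JSON lines to a local file, which stands in here for an object-storage upload; all names are illustrative assumptions.

        # Sketch of flattening nested events before landing them in a data lake.
        # A local JSON-lines file stands in for an object-storage upload.
        import json

        def flatten_event(event):
            """Flatten the nested header/body structure into a single flat record."""
            flat = {}
            for section in ("header", "body"):
                for key, value in event.get(section, {}).items():
                    flat[f"{section}_{key}"] = value
            return flat

        def land_in_lake(events, path="car_events.jsonl"):
            # Append one flat JSON record per line, a layout query engines handle well.
            with open(path, "a") as out:
                for event in events:
                    out.write(json.dumps(flatten_event(event)) + "\n")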

     


    Figure-2: An example of event-driven ETL architecture in an organization.

  5. Buy vs. Build

    Last but not least, you have complete freedom to choose your own data integration strategy. You can either build a home-grown event-driven data architecture by following the high-level steps covered in this recipe, or buy a data-management-as-a-service offering that provides a fully managed, end-to-end, click-and-integrate solution. You may want to consider a narrowly targeted SaaS offering such as Panoply, or a broader-scope Integration Platform as a Service (iPaaS) solution.
