Many applications generate large volumes of streaming data that are ingested via Kafka and transferred to storage or a database. This article shows you how to achieve high-volume message upload with Streaming Analytics using the data historian pattern. Besides high throughput, we want to accomplish guaranteed tuple processing and uploads with exactly-once semantics.

Data historian – What’s it all about?

The data historian pattern is an efficient way to collect and store time series data from live data streams. The collected data can be the raw feed, only validated events, or more elaborate subsets created for partitioning or other purposes.
The stored data becomes the basis for things like periodic deep analysis and syndicating all or subsets of it to other systems, stores, or projects for reuse. It can also serve as a source for replaying events back into the live stream.
The data might come from production lines, transportation routes, network devices, satellites, and other devices. The data is stored with a time stamp and other identifying information such as device ID and location. You can build a dashboard to get information in real time, or you can store the data for offline analysis.
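As a minimal illustration of such a record, the Python sketch below uses hypothetical field names (device ID, location, time stamp, value); the actual schema depends entirely on your data source.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class SensorReading:
    # Hypothetical historian record: a time stamp plus identifying
    # information and the measured value itself.
    device_id: str
    location: str
    event_time: datetime
    value: float

reading = SensorReading(
    device_id="pump-0042",
    location="plant-7/line-3",
    event_time=datetime.now(timezone.utc),
    value=87.3,
)
```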

Examples of Data Historian

In addition to this example, we also provide a simple example of the data historian pattern in the graphical Streams Flows experience in IBM Watson Studio. You will be able to quickly develop an understanding of how to
  • tap into data sources
  • do simple to elaborate processing
  • sink the data into one or more stores.
The example we are covering here dives deeper into one scenario:
  • It digs into the aspects of the pattern that enable it to scale from small to extreme volumes.
  • It covers the methods you use to ensure that, at scale, each processed message is written exactly once, providing an accurate repository of data.
  • It also covers the use of IBM Cloud Object Storage (COS) as a cost-effective data store for very large data volumes that still provides accessible ways to query and extract data for all downstream consumers.

Pattern details

This use case describes a scenario in which input data is read from IBM Event Streams and written to IBM Cloud Object Storage (COS) with exactly-once semantics. The objects created on COS can be queried, for example, with the IBM SQL Query service.
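To make the query side concrete, the snippet below shows roughly what such an SQL Query statement could look like. The bucket names, object prefix, and column names are placeholders, and the exact cos:// URI depends on your region and bucket; treat this as a sketch rather than a ready-to-run query.

```python
# Hypothetical SQL Query statement over the Parquet objects written by the
# Streams application. Bucket names, prefixes, and columns are placeholders;
# run it from the IBM SQL Query console or a client of your choice.
HISTORIAN_QUERY = """
SELECT device_id,
       AVG(value) AS avg_value
FROM cos://us-geo/my-historian-bucket/events/ STORED AS PARQUET
WHERE event_time BETWEEN '2019-01-01' AND '2019-01-02'
GROUP BY device_id
INTO cos://us-geo/my-results-bucket/ STORED AS CSV
"""
```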

The following IBM Cloud services are involved:

  • Streaming Analytics
  • Event Streams
  • Cloud Object Storage (COS)
  • SQL Query

Within the Streaming Analytics service we want to achieve:

  • Guaranteed tuple processing with exactly-once semantics
  • Transfer of JSON messages stored in Event Streams to objects in Parquet format stored in Cloud Object Storage (COS), as sketched after this list
  • The ability to add custom business logic (e.g. filters, aggregates, lookups) to the streaming data before writing into COS
  • High throughput with a scalable application configuration
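Inside the Streams application this conversion is done by the toolkit operators; purely to make the second goal concrete, here is a minimal stand-alone Python sketch of the same JSON-to-Parquet transformation using pyarrow. The message fields and file name are hypothetical, and in the real application the resulting object is uploaded to COS rather than written to a local file.

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# A small batch of JSON messages as they might arrive from Event Streams;
# the fields are hypothetical.
raw_messages = [
    '{"device_id": "pump-0042", "event_time": "2019-05-01T12:00:00Z", "value": 87.3}',
    '{"device_id": "pump-0042", "event_time": "2019-05-01T12:00:10Z", "value": 88.1}',
]

# JSON -> rows ("tuples"), then rows -> an Arrow table -> a Parquet file.
rows = [json.loads(m) for m in raw_messages]
table = pa.Table.from_pylist(rows)
pq.write_table(table, "events-batch-000001.parquet")
```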

[Architecture diagram: Event Streams topic with six partitions feeding six ingest channels in Streaming Analytics, a central business logic operator, and four sink channels writing Parquet objects to Cloud Object Storage]

In the diagram above, the Event Streams service on the left contains one Kafka topic with six partitions, and we take advantage of Kafka consumer groups. In the Streaming Analytics service, six consumers are subscribed to this topic and feed the messages to downstream operators that convert the JSON messages into a tuple format. Each “ingest” channel consists of one consumer and two JSON-to-tuple converters. These six channels are connected to a single operator in the center, which receives the tuples from all channels and stands in for the custom business logic. After this “business logic” operator, the data is split across four sink channels that convert the tuples to Parquet format and upload the objects to Cloud Object Storage.
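The Kafka consumer operators in Streams handle the consumer-group mechanics; purely to illustrate that part of the picture outside of Streams, the sketch below shows one consumer joining a consumer group with the confluent-kafka Python client. Running six such consumers against a six-partition topic gives each of them roughly one partition. The broker address, topic name, and group id are placeholders, and the SASL_SSL credentials that Event Streams requires are omitted.

```python
import json
from confluent_kafka import Consumer

# Placeholder connection settings; a real Event Streams instance additionally
# requires SASL_SSL credentials, which are omitted here.
consumer = Consumer({
    "bootstrap.servers": "broker:9092",
    "group.id": "historian-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["sensor-events"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None or msg.error():
            continue
        # "JSON to tuple": parse the message payload into a Python dict.
        record = json.loads(msg.value())
        # ... hand the record to the business logic / sink stage ...
finally:
    consumer.close()
```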

Depending on the volume of messages, the throughput requirements, and the complexity of the business logic, the application setup can be adjusted, e.g. with more Kafka partitions and more parallel channels inside the Streams application.
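For example, growing an existing topic to more partitions can be done through the Kafka admin API, sketched below with the confluent-kafka client. Whether your Event Streams plan allows this, and which connection settings are needed, is something to verify for your instance; the broker address and topic name are placeholders.

```python
from confluent_kafka.admin import AdminClient, NewPartitions

# Placeholder connection settings; a real Event Streams cluster also needs
# SASL_SSL credentials.
admin = AdminClient({"bootstrap.servers": "broker:9092"})

# Grow the topic from 6 to 12 partitions so more ingest channels can run
# in parallel. Partition counts can only be increased, never decreased.
futures = admin.create_partitions([NewPartitions("sensor-events", 12)])
for topic, future in futures.items():
    future.result()  # raises if the request failed
    print(f"Partitions updated for {topic}")
```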

At-least-once processing and upload with exactly-once semantics

While the consistent region ensures that tuples are processed at least once, the COS sink operators create objects in object storage with exactly-once semantics (see the sketch after the following list):

  • The consistent region replays tuples in case of failures
  • An object is not visible before it is finally closed
  • An object with the same name is overwritten on object storage
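One way to see why replay plus overwrite adds up to exactly-once output: if the object name is derived deterministically from the batch it contains, for example from topic, partition, and offset range, then a replayed batch rewrites the same object instead of creating a duplicate. The naming scheme below is a hypothetical illustration, not the scheme used by the COS sink operators.

```python
def object_name(topic: str, partition: int, first_offset: int, last_offset: int) -> str:
    # Deterministic name: replaying the same offset range after a failure
    # produces the same key, so the rewritten object overwrites the old one
    # instead of creating a duplicate.
    return f"events/{topic}/part={partition}/{first_offset:012d}-{last_offset:012d}.parquet"

# The same batch always maps to the same object key, replay or not.
assert object_name("sensor-events", 3, 1000, 1999) == \
       "events/sensor-events/part=3/000000001000-000000001999.parquet"
```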

Learn more about the scenario setup by following the demo tutorial.
