Many applications generate large volumes of streaming data that is ingested via Kafka and transferred to storage or a database. This article shows how to achieve high-volume message upload with the Streaming Analytics service using the Data Historian pattern. Besides high throughput, we want to achieve guaranteed tuple processing and uploads with exactly-once semantics.
This use case describes a scenario in which input data is read from IBM Event Streams and written to IBM Cloud Object Storage (COS) with exactly-once semantics. The objects created on COS can then be queried, for example, with the IBM SQL Query service.
Within the Streaming Analytics service we want to achieve:
- Guaranteed tuple processing with exactly-once semantics
- Transfer JSON messages stored in Event Streams to objects in parquet format stored in Cloud Object Storage (COS)
- Ability to add custom business logic (e.g. filters, aggregates, lookups) on the streaming data before writing into COS
- High throughput with scalable application configuration
In the diagram above, the Event Streams service on the left side contains one Kafka topic with 6 partitions, and we take advantage of Kafka consumer groups. In the Streaming Analytics service, 6 consumers are subscribed to this topic and feed the messages to downstream operators that convert the JSON messages into a tuple format. One “ingest” channel consists of one consumer and 2 JSON-to-tuple converters. These 6 channels are connected to a single operator in the center, which receives all tuples from all channels and represents the place for the custom business logic. After this “business logic” operator, the data is split across 4 sink channels that convert the tuples to parquet format and upload the objects to Cloud Object Storage.
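This channel layout can be sketched with the IBM Streams Python API (the streamsx package). The sketch below is structural only and relies on assumptions: read_from_event_streams, json_to_tuple, business_logic and upload_as_parquet are hypothetical placeholder callables standing in for the Kafka consumer and Object Storage sink operators of the real application, and each ingest channel is simplified to a single converter.

```python
import json
from streamsx.topology.topology import Topology

# Hypothetical placeholders -- in the real application these roles are
# played by the Kafka and Object Storage toolkit operators.
def read_from_event_streams():
    # Would consume JSON messages from the Event Streams topic.
    return iter([])

def json_to_tuple(message):
    # Converts one JSON message into a structured record (here: a dict).
    return json.loads(message)

def business_logic(tup):
    # Place for filters, aggregates, lookups on the streaming data.
    return tup

def upload_as_parquet(tup):
    # Would buffer tuples, convert them to parquet and upload the object to COS.
    pass

topo = Topology("DataHistorian")

# Ingest region: 6 parallel channels, one consumer plus JSON-to-tuple
# conversion per channel, matching the 6 topic partitions.
ingest = (topo.source(read_from_event_streams)
              .set_parallel(6)
              .map(json_to_tuple)
              .end_parallel())

# Single operator hosting the custom business logic.
enriched = ingest.map(business_logic)

# Sink region: 4 parallel channels writing parquet objects to COS.
enriched.set_parallel(4).for_each(upload_as_parquet)
```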
Depending on the volume of messages, the throughput requirements and the complexity of the business logic, the application setup can be adjusted, e.g. with more Kafka partitions and more parallel channels inside the Streams application.
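Independent of the number of channels, the two conversion steps themselves are plain data transformations. As an illustration only (not the Streams operators themselves), the snippet below parses JSON messages into flat records and writes them as a single parquet file with pyarrow; the field names, values and local file name are made up for the example, and a recent pyarrow release (7.0 or later for Table.from_pylist) is assumed.

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq

# Example JSON messages as they might arrive from the Event Streams topic
# (the field names are illustrative only).
messages = [
    '{"deviceId": "d1", "timestamp": 1561939200, "value": 42.0}',
    '{"deviceId": "d2", "timestamp": 1561939201, "value": 17.5}',
]

# "JSON to tuple": parse each message into a flat record.
rows = [json.loads(m) for m in messages]

# "Tuple to parquet": build a columnar table and write one parquet object.
table = pa.Table.from_pylist(rows)
pq.write_table(table, "batch-000001.parquet")
# In the Streams application the resulting object is uploaded to COS
# instead of being written to a local file.
```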
At-least-once processing and upload with exactly-once semantics
- The consistent region replays tuples in case of failures
- An object is not visible on COS before it is finally closed
- An object with the same name is overwritten on object storage, making replayed uploads idempotent
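One way to see why the overwrite behavior gives exactly-once results: if the object name is derived deterministically, for example from the topic partition and the first Kafka offset of the batch, then a batch replayed after a failure produces the same name and simply replaces the earlier, possibly incomplete, object instead of creating a duplicate. The naming scheme below is illustrative only, not the one used by the Object Storage toolkit.

```python
def object_name(topic: str, partition: int, first_offset: int) -> str:
    """Deterministic object key: a replay of the same batch maps to the
    same key, so the re-upload overwrites the earlier object."""
    return f"{topic}/part={partition}/offset={first_offset:012d}.parquet"

# The first attempt and the replay after a failure name the same object:
assert object_name("events", 3, 1500) == object_name("events", 3, 1500)
print(object_name("events", 3, 1500))
# events/part=3/offset=000000001500.parquet
```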
Learn more about the scenario setup by following the demo tutorial.