Cloud computing offers modern businesses significant advantages. According to Gartner, cloud computing will be the centerpiece of new digital experiences, and by 2025 companies will deploy 95% of new digital workloads on cloud-native platforms. Meanwhile, mainframes are used by 71% of Fortune 500 companies, handle 68% of the world’s production IT workloads, and process 90% of all credit card transactions (see this Precisely blog post on mainframe statistics). Companies need to make data from different sources, whether on-premises or multi-cloud, accessible in their target cloud environment.
Data integration patterns include:
Batch data transfer (transfer batch data from source to cloud): A large volume of data is normally needed for data scientists to build AI models. Batch data processing and data transfer can be used.
Data synchronization (make data available to cloud in real time or near real time): If a system needs to react to data updates or events in real time or near real time, event-based data synchronization can be used.
Data virtualization (access data at source without moving data): To avoid maintaining multiple copies of the data and data synchronization issues, data virtualization can be used to access data in place.
Zero-trust data sharing (protect data security without leaking information): When sensitive data is involved, use techniques such as homomorphic encryption to further protect the data.
Bring compute closer to data (push compute closer to data on-premises): When data can’t be exposed to a cloud for regulatory, compliance, or security reasons, compute workloads can be brought closer to the data. For example, in a federated learning case, each party can train the model independently and the trained models can be aggregated together.
In this article, we will drill down into the batch data transfer pattern and demonstrate how to build a batch data pipeline.
IBM Cloud for Financial Services
Cloud security is vital for organizations to keep their applications and data protected from bad actors. According to Gartner, by 2025, 90% of the organizations that fail to control public cloud use will inappropriately share sensitive data.
There are many different factors to consider when designing a secure environment in cloud, including network security, identity and access control, application security, and data protection. IBM Cloud for Financial Services is designed to build trust and enable a transparent public cloud ecosystem with security, compliance, and resiliency features that financial institutions require. The Financial Services Cloud framework defines a comprehensive set of control requirements and provides automation and configuration of reference architectures. It not only addresses the needs of financial services institutions with regulatory compliance, security, and resiliency during the initial deployment phase but also efficiently and effectively monitors compliance, remediates issues, and generates evidence of compliance for ongoing operations. The framework is also informed by an industry council and Promontory Financial Group, an IBM subsidiary, to ensure that it’s current with new and updated regulations.
IBM Cloud for Financial Services is built on top of a network- and security-hardened IBM Cloud platform. It uses IBM Cloud Hyper Protect Crypto Services, an industry-leading key management and hardware security module service built with mainframe-level encryption and offered as a service in the cloud. The service is built on FIPS 140-2 Level 4 certified hardware and can be integrated with other IBM Cloud services to protect digital assets. Data in motion, data at rest, and data in use are all fully protected in the IBM Cloud for Financial Services environment.
Batch data pipeline pattern
Artificial intelligence (AI) enables computer applications to learn from experience through iterative processing and algorithmic training. AI and machine learning are changing our world dramatically. Training a model is more convenient in a cloud setting, where the on-premises transaction data and other relevant third-party data are brought into a secure environment that's compliant with financial industry regulations. Large amounts of historical data are normally needed to train AI models. Sample use cases include training a credit card fraud detection model or training a customer attrition model as part of a set of customer wealth management products.
This demo uses the sample data files from the Financial Markets Customer Attrition accelerator from IBM Cloud Pak for Data to demonstrate how to make data available using the batch data pipeline. You can also use other data files to follow along with the demo.
First, data is pre-processed on-premises and efficiently transferred to IBM Cloud using the Aspera service. It’s then securely stored in an IBM Cloud Object Storage (Object Storage) bucket protected by Hyper Protect Crypto Services keys.
Once the data is written to an Object Storage bucket, the subscribed IBM Cloud Code Engine job is triggered. Normally for batch data flow, data can be processed in batch mode and the results can be saved back directly to the Object Storage bucket or databases. However, in this demo IBM Event Streams is included in the data pipeline, and the data records from the batch file are processed and pushed to an Event Streams topic. Some ISVs have existing Kafka connectors, and by pushing data to an Event Streams topic, they can pick up the data using those connectors without writing any new code to read data from the Object Storage bucket.
This demo also shows a sample data handler that takes data records from an Event Streams topic, processes the data if needed, and then saves the data to an output Object Storage bucket. This simulates how a data provider picks up the raw data from an Event Streams topic, further processes it, and generates data products in Object Storage buckets, databases, or a processed data topic that a consumer can subscribe to.
Here is the data flow for the demo. Different roles are involved, either from the same organization or different organizations:
The data vendor extracts the batch data and pre-processes the data on-premises.
The data vendor pushes the batch data to an Object Storage bucket.
The data provider consumes the raw data and makes data products. The flow can be different for different data products. In this demo, the Code Engine job subscribes to Object Storage events and is triggered when a data file is written to the Object Storage bucket.
The Code Engine job processes the data, writes the processed data back to Object Storage or a database, or pushes data records to the event stream.
If the data records are written to the event stream, the event stream handler can further process the data. Data products can be made available in Object Storage buckets, databases, or Event Streams topics.
Data consumers (or sometimes data publishers) can build a data catalog. A data consumer can consume various data products based on their requirements.
Batch data pipeline demo
This batch data pipeline demo uses services in IBM Cloud and follows best practices for IBM Cloud for Financial Services.
The demo uses the following IBM Cloud services:
Aspera: Provides high-speed data transfer
Identity and Access Management (IAM): Authenticates and authorizes services
Cloud Code Engine: Provides a serverless platform to execute batch data processing jobs
Event Streams: Provides a platform to process information in data streams
Logging: Examines Code Engine logs for debugging and monitoring
Container Registry: Stores the Docker image used by the IBM Cloud Code Engine job; alternatively, you can pass your source code repository to IBM Cloud Code Engine and let it manage the build process
IBM Cloud setup
To run the demo, you first need to set up a few IBM Cloud services.
Aspera server on-premises
Aspera offers unrivaled performance for transferring large files and large collections of files across any distance to, from, or between clouds. IBM Aspera FASP (Fast, Adaptive, and Secure Protocol) is a bulk data transport technology implemented at the application layer that provides secure high-speed transfer while remaining compatible with existing networks and infrastructure. Refer to Finding a better big data transport solution for more details.
When transferring data from on-premises to cloud, you can install an Aspera server onto the on-premises server. In that case, you can use a direct link or VPN for added in-flight data security.
Source: High-Speed Data Migration to the Cloud on Asperasoft.com
Hyper Protect Crypto Services
Hyper Protect Crypto Services is a single-tenant hybrid cloud key management service. It is built on FIPS 140-2 Level 4 certified hardware. It supports Keep Your Own Key (KYOK) capability and enables key orchestration across multi-cloud environments with Unified Key Orchestrator.
Hyper Protect Crypto Services is optional for this demo, although we recommend that you set it up and create KYOK keys for Object Storage bucket encryption, especially for financial industry clients who handle sensitive data.
When you have set up a Hyper Protect Crypto Services instance, you can create a new key that can be used to encrypt the Object Storage bucket.
IBM Cloud Object Storage bucket
This demo uses an IBM Cloud Object Storage bucket write event to trigger the Code Engine job to further process the data. You first need to set up the Object Storage bucket. You can create one input bucket and one output bucket; if you choose to use one bucket for both the input and output data files, be careful with the cloud trigger configuration. You want to ensure that writing the output file to the same bucket does not trigger data processing again; for example, be sure to use a different file name pattern.
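If you do share one bucket, the data processing job itself can also guard against reprocessing its own output. The following minimal Java sketch assumes a hypothetical naming convention in which output files are written under an output/ prefix and input files are .csv files; adapt the check to whatever pattern you choose.

// Hypothetical guard used by the data processing job when input and output share a bucket.
// The "output/" prefix and ".csv" suffix are assumptions -- adjust them to your naming convention.
public class TriggerGuard {
    static boolean shouldProcess(String objectKey) {
        return objectKey.endsWith(".csv") && !objectKey.startsWith("output/");
    }
}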
If you don’t have an Object Storage instance, create one by searching for “Object Storage” in the catalog.
You can then create an Object Storage bucket.
If you want to use your Hyper Protect Crypto Services key, be sure to click on Hyper Protect Crypto Services, use the existing instance, and choose the key you want to use to protect the Object Storage bucket.
IBM Event Streams
IBM Event Streams is an event-streaming platform that is built on open source Apache Kafka.
The Code Engine job reads the data file from the input Object Storage bucket, processes the data, and pushes data records to an IBM Event Streams topic for further processing. An IBM Event Streams instance must be created.
Search for “Event Streams” in the catalog to create an instance.
Now create a topic.
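Once the topic exists, a Kafka client connects to Event Streams over SASL_SSL with the PLAIN mechanism, using the broker list and an API key from the instance's service credentials. Here is a minimal sketch of the producer configuration using the standard Apache Kafka Java client; the brokers and apiKey values are placeholders you supply from your own credentials.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventStreamsProducerConfig {
    // Broker list and API key come from the Event Streams service credentials.
    static KafkaProducer<String, String> create(String brokers, String apiKey) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Event Streams uses SASL_SSL with the PLAIN mechanism; the user name is literally "token".
        props.put("security.protocol", "SASL_SSL");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.plain.PlainLoginModule required "
            + "username=\"token\" password=\"" + apiKey + "\";");
        return new KafkaProducer<>(props);
    }
}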
Test data file
You can upload any data file to the Object Storage bucket to trigger the Code Engine job. This demo uses a sample .csv data file from the IBM Cloud Pak for Data Financial Markets Customer Attrition accelerator.
To look at the sample data file from the Cloud Pak for Data accelerator, either set up a Cloud Pak for Data instance in FS Secure Landing Zone or use Cloud Pak for Data as a Service. Once you set up the instance, you can log in to the Cloud Pak for Data platform.
Now create a project from a sample or file.
You can create your project from the Financial Markets Customer Attrition accelerator. You will then see the sample .csv file in your data assets.
IBM Cloud Code Engine
IBM Cloud Code Engine is a fully managed serverless platform. It can manage and secure the underlying infrastructure for your container images, batch jobs, or source code.
In this demo, the data processing logic is written in Java. We will build a Docker image from it, push the image to the IBM Cloud Container Registry, and deploy it as an IBM Cloud Code Engine job that subscribes to Object Storage bucket write events.
Code Engine: Source code
The demo-data-pipeline-w-code-engine GitHub repository contains the source code that reads data files from the Object Storage bucket and pushes the data records to an IBM Event Streams topic. The Java source code is in the src directory. For this demo, only the event producer is needed for the Code Engine job. The event consumer code is included in the same repo; it is not needed for the Code Engine job deployment, but it is used as a data handler to read data from the IBM Event Streams topic and save data to the output Object Storage bucket.
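The actual implementation is in the repo; conceptually, the producer side reduces to reading the CSV object from the input bucket as a stream and sending one record per row to the topic. The following is a simplified sketch, not the repo's exact code: csvStream stands for the object content retrieved from the input bucket (for example, with the IBM COS Java SDK), and the producer is assumed to be configured as shown in the IBM Event Streams section.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BatchFileProducer {
    // csvStream: the batch file read from the input bucket; topic: your Event Streams topic name.
    static void publish(InputStream csvStream, KafkaProducer<String, String> producer, String topic)
            throws Exception {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(csvStream, StandardCharsets.UTF_8))) {
            String header = reader.readLine(); // skip the CSV header row
            String line;
            while ((line = reader.readLine()) != null) {
                // One record per CSV row; downstream consumers parse the row as needed.
                producer.send(new ProducerRecord<>(topic, line));
            }
        } finally {
            producer.flush();
        }
    }
}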
In the bin directory, there are sample command-line scripts to run producer and consumer commands manually. Make sure to update the settings in the scripts before you run them.
A Dockerfile is included in the repo. It is used to build the Docker container image for the Code Engine job.
Code Engine: Build container image
When you create a Code Engine application or job, you have the option to specify either a container image or a source code repository. In this demo, we will build a container image and push it to IBM Cloud Container Registry.
The Dockerfile included in the repo can be used to build the container image:
# set PROJECT_DIR to the project directory
cd $PROJECT_DIR
# Build docker image, it is tagged yytest_datapipeline, replace it with your tag
docker build -t yytest_datapipeline .
# view images
docker images
Push the Docker image to IBM Cloud Container Registry:
# login to ibmcloud
ibmcloud login -a cloud.ibm.com
# set resource group, in this demo, it is yytest-rg, replace it with your resource group
ibmcloud target -g yytest-rg
# log in to IBM container registry
ibmcloud cr login
# list namespaces
ibmcloud cr namespace-list
# create namespace in the registry, replace yytest with your namespace
MY_NAMESPACE=yytest
ibmcloud cr namespace-add $MY_NAMESPACE
# Tag the image
docker tag yytest_datapipeline icr.io/$MY_NAMESPACE/datapipeline:latest
# Push the image to the registry
docker push icr.io/$MY_NAMESPACE/datapipeline:latest
# list images
ibmcloud cr images
Code Engine: Create the Code Engine project
Following are the commands to create the Code Engine project. You can also create the project on the IBM Cloud UI:
# -------------------------------------------------------------------------------------
# Create code engine project
# -------------------------------------------------------------------------------------
# set code engine project name, replace with your project name
CODE_ENGINE_PROJECT_NAME=yytest
# For help with the code-engine project command
ibmcloud code-engine project help
# List existing code engine project
ibmcloud code-engine project list
# To create a code engine project
ibmcloud code-engine project create --name $CODE_ENGINE_PROJECT_NAME
# To delete a code engine project, if needed
ibmcloud code-engine project delete --name $CODE_ENGINE_PROJECT_NAME
# select the code engine project as current project
ibmcloud code-engine project select --name $CODE_ENGINE_PROJECT_NAME
Code Engine: Set permissions
IBM Cloud Code Engine requires access to container registries to pull the container image for the job, or to build and store container images when a source code repository is provided. If your registry is public, you do not have to set up credentials to pull images. Note that although it is acceptable to pull images from a public registry while you are getting started with Code Engine, it is highly recommended to use a private registry for your enterprise workloads. We use IBM Cloud Container Registry in this demo; a service ID and API key are used to authorize Code Engine access to Container Registry.
Follow the steps in the IBM Cloud docs to authorize access to Container Registry with a service ID and create an API key. The docs also provide command line instructions if you prefer to use the CLI.
In this demo, the Code Engine job needs to access an Object Storage bucket and IBM Event Streams topic, so you must also grant permission to Object Storage and Event Streams to the service ID.
# Check the current ce project
ibmcloud ce project current
# Configuring a project with a custom service ID
ibmcloud ce project update --binding-service-id YOUR_SERVICE_ID
You can integrate an IBM Cloud service instance to resources in a Code Engine project by using service bindings. In this demo, the Code Engine job needs to access Object Storage and Event Streams instances, so we need to create service bindings. With the service binding, the credentials of the service will be injected as environment variables with the given prefix at runtime. See Binding a service instance to a Code Engine app or job in the IBM Cloud docs for more information about service binding.
# create or update configmap, env variable settings
ibmcloud ce configmap create --name datapipelineconfigmap --from-env-file config.txt
ibmcloud ce configmap update --name datapipelineconfigmap --from-env-file config.txt
# check the content of the configmap
ibmcloud ce configmap get -n datapipelineconfigmap
# Deploy or update code engine job
ibmcloud code-engine job create --name datapipeline-job --image icr.io/yytest/datapipeline:latest --registry-secret ibm-container-registry --env-from-configmap datapipelineconfigmap
ibmcloud code-engine job update --name datapipeline-job --image icr.io/yytest/datapipeline:latest --registry-secret ibm-container-registry --env-from-configmap datapipelineconfigmap
# Bind service instance to code engine app or job
# Refer to https://cloud.ibm.com/docs/codeengine?topic=codeengine-bind-services
# list existing services
ibmcloud resource service-instances | grep yytest
# Bind to Object Storage service with prefix COS
ibmcloud ce job bind --name datapipeline-job --service-instance yytest-cos --prefix COS --no-wait
# Bind to Event Streams with prefix KAFKA
ibmcloud ce job bind --name datapipeline-job --service-instance yytest-Event-Streams --prefix KAFKA --no-wait
# get details of the job
ibmcloud code-engine job get --name datapipeline-job
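At runtime, the bound service credentials are injected into the job as environment variables that start with the prefixes given above (COS and KAFKA). The exact variable names depend on each service's credential fields, so treat the names in this sketch as assumptions and confirm them by dumping the environment from a test job run.

// Sketch of reading the credentials injected by the service bindings above.
// All variable names below are assumptions -- verify them against your job's actual environment.
public class BindingConfig {
    public static void main(String[] args) {
        String cosApiKey = System.getenv("COS_APIKEY");                  // from the Object Storage binding
        String cosEndpoints = System.getenv("COS_ENDPOINTS");
        String kafkaApiKey = System.getenv("KAFKA_APIKEY");              // from the Event Streams binding
        String kafkaBrokers = System.getenv("KAFKA_KAFKA_BROKERS_SASL");
        System.out.println("Object Storage credentials present: " + (cosApiKey != null));
        System.out.println("Kafka brokers present: " + (kafkaBrokers != null));
    }
}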
Code Engine: Job invocation
To test the Code Engine job manually, click the Submit job button on the Code Engine job UI, or submit the job using the command line:
# submit job
ibmcloud code-engine jobrun submit --name datapipeline-jobrun --job datapipeline-job
# check jobrun status
ibmcloud code-engine jobrun get --name datapipeline-jobrun
# get logs
ibmcloud code-engine jobrun logs --instance <JOBRUN_INSTANCE_NAME>
Code Engine: Logging and monitoring
View the logs for the job using the command line, or click Launch logging and Launch monitoring to open the logging and monitoring windows, respectively. The logs are helpful for debugging issues with the job.
Code Engine: Job automation
For the Code Engine job to subscribe to an Object Storage event, the Notifications Manager role must be assigned to Code Engine. See the instructions on assigning roles in Working with the IBM Cloud Object Storage event producer in the IBM Cloud docs.
We can now create an event subscription in the Code Engine project.
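When the subscription fires, Code Engine passes the event information to the job run as environment variables. The variable names used in this sketch (CE_SUBJECT for the object key and CE_DATA for the event payload) are assumptions based on the CloudEvents-style delivery described in the Code Engine docs; confirm them by printing the environment from a triggered run.

// Sketch of how the triggered job can identify the object that was written.
// CE_SUBJECT and CE_DATA are assumed variable names -- check a real triggered run to confirm.
public class EventInfo {
    public static void main(String[] args) {
        String subject = System.getenv("CE_SUBJECT"); // typically the key of the written object
        String data = System.getenv("CE_DATA");       // event payload with bucket and operation details
        System.out.println("Triggering object: " + subject);
        System.out.println("Event payload: " + data);
    }
}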
The Code Engine job reads the data file from the Object Storage bucket and pushes data records to an Event Streams topic. A data handler is also included in the demo, which consumes events from the Event Streams topic, further processes the data if needed, and writes the data back to the output Object Storage bucket. You can simply run the sample script at bin/script_consumer.sh to start the event consumer. The event consumer receives no messages until the event producer or Code Engine job runs.
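Conceptually, the data handler is a standard Kafka consumer loop. The sketch below is simplified and not the repo's exact code: the consumer is assumed to be configured with the same SASL_SSL/PLAIN settings as the producer plus a group.id and string deserializers, and writeToOutputBucket is a hypothetical placeholder for persisting a batch of records to the output Object Storage bucket.

import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DataHandlerSketch {
    static void run(KafkaConsumer<String, String> consumer, String topic) {
        consumer.subscribe(Collections.singletonList(topic));
        StringBuilder batch = new StringBuilder();
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                batch.append(record.value()).append('\n'); // further processing could happen here
            }
            if (batch.length() > 0) {
                writeToOutputBucket(batch.toString()); // hypothetical helper
                batch.setLength(0);
            }
        }
    }

    static void writeToOutputBucket(String content) {
        // Placeholder: in the demo this would persist the processed records to the output bucket.
    }
}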
# Build project
cd $PROJECT_DIR
mvn clean compile assembly:single
# start consumer locally and listen for messages (update the settings first)
$PROJECT_DIR/bin/script_consumer.sh
Here is the sample output. Be careful not to print or log the messages if they contain sensitive information.
Sample data usage
The batch data file has now gone through the entire journey: from on-premises, to being saved in the input Object Storage bucket in the cloud, processed by the Code Engine job, pushed to the Event Streams topic, consumed by the data handler, and eventually persisted in an output Object Storage bucket. It is ready to be used by various applications, whether to build AI models, generate reports, combine with data from other sources, or provide new data or trained models back to on-premises systems.
In this demo, we use the sample data from the IBM Cloud Pak for Data Financial Markets Customer Attrition accelerator. We can continue with the accelerator now to create a data catalog or build models with the data.
Summary
A hybrid cloud strategy can be adopted to retain the core strengths and attributes of the IBM mainframe platform while leveraging the extensive cloud services, security, and regulatory compliance programs of IBM Cloud.
Depending on the use cases, concrete requirements such as data format, frequency, volume, replication factor, and security can vary. Data from on-premises can be made available to the cloud using different data integration patterns.