Create a hybrid cloud data architecture using IBM Cloud Pak for Data and AWS

Four tutorials cover different features of the IBM services in the AWS environment

By Sharath Kumar RK and Arpit Nanavati

In this section, you learn how to create a hybrid cloud data architecture using IBM Cloud Pak for Data and AWS. This section includes four tutorials covering different features of the IBM services in the AWS environment.

Data fabric is a highly scalable, distributed data architecture comprising shared data assets and streamlined data integration and governance capabilities that can be used to tackle modern data challenges. A typical data fabric solution consists of multiple components, such as Data Catalog, Data Integration, Data Governance, and Data Visualization.

In the tutorials that follow, you learn how to solve the challenges faced by different personas in data and AI:

  • Data scientists are often said to spend as much as 80% of their time discovering, curating, and cleansing data. How can you provide them with quality data for building AI-based solutions?
  • Data engineers face many challenges when integrating data from multiple data sources. How can they quickly and efficiently collect and integrate data?
  • Data stewards deal with data privacy and protection challenges. How can you ensure that the data is being governed and no sensitive information is being shared with data consumers?

These challenges can be addressed with a governed data fabric architecture built on IBM Cloud Pak for Data.

After completing this section, you will understand how to:

  • Create a connection between external data sources and IBM Cloud Pak for Data
  • Ingest data from multiple data sources
  • Clean, filter, and reshape data
  • Query data from multiple data sources without copying or moving the data
  • Create a data integration pipeline to transform and integrate data from heterogeneous data sources
  • Protect sensitive data (such as PII) before it is shared with data consumers
  • Schedule a job to periodically run a data integration pipeline
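To make two of these goals concrete (cleaning and filtering data, and protecting PII before it is shared), here is a minimal sketch in plain Python. It is only an illustration of the concepts, not the Data Refinery or Watson Knowledge Catalog interface. The records, the column names, and the helpers `clean`, `mask_email`, and `prepare_for_consumers` are all hypothetical.

```python
import hashlib

# Hypothetical raw records, standing in for data ingested from a source.
RAW_ROWS = [
    {"customer_id": "1", "email": " Ann@Example.com ", "balance": "120.50"},
    {"customer_id": "2", "email": "bob@example.com", "balance": ""},
    {"customer_id": "3", "email": "cy@example.com", "balance": "-5"},
]


def clean(row):
    """Trim whitespace, normalize case, and convert strings to typed values."""
    balance = row["balance"].strip()
    return {
        "customer_id": int(row["customer_id"]),
        "email": row["email"].strip().lower(),
        "balance": float(balance) if balance else None,
    }


def mask_email(email):
    """Replace an email address with a pseudonymous token (assumed masking rule)."""
    digest = hashlib.sha256(email.encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def prepare_for_consumers(rows):
    """Clean every row, drop rows with no balance, and mask the PII column."""
    cleaned = (clean(row) for row in rows)
    kept = (row for row in cleaned if row["balance"] is not None)
    return [{**row, "email": mask_email(row["email"])} for row in kept]


SHARED_ROWS = prepare_for_consumers(RAW_ROWS)
for row in SHARED_ROWS:
    print(row)
```

In the tutorials, these steps are configured in the platform's tools rather than hand-coded. A plain unsalted hash is used here only to keep the sketch short.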

Architecture

Architecture flow diagram

Flow

  • Create a connection between external data sources (such as Amazon S3 or Amazon Aurora PostgreSQL) and IBM Cloud Pak for Data.
  • Use IBM Data Virtualization to query data from multiple data sources without creating a data replica.
  • Use IBM DataStage to create an ETL pipeline.
  • Use an IBM Data Refinery flow to clean and filter the data.
  • Use IBM Watson Knowledge Catalog to profile and govern the data.
  • Supply the data to a machine learning environment, such as Amazon SageMaker or a Jupyter notebook, to create machine learning models.
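The idea behind the second step, querying several sources in place with one statement instead of first copying them into a single store, can be sketched with Python's built-in `sqlite3` module. This is only an analogy and assumes nothing about the Data Virtualization interface: two separate in-memory SQLite databases stand in for two external sources, and the table names, column names, and the `total_per_customer` helper are hypothetical.

```python
import sqlite3

# The main in-memory database stands in for one source (for example, a
# relational database). Attach the second source before inserting any data,
# because SQLite cannot attach a database inside an open transaction.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS orders_src")

# Source 1: customer records.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])

# Source 2: order records, held in the separate attached database.
conn.execute("CREATE TABLE orders_src.orders (customer_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders_src.orders VALUES (?, ?)",
    [(1, 20.0), (1, 5.0), (2, 7.5)],
)
conn.commit()


def total_per_customer(connection):
    """Join both sources in a single query; no rows are copied between them."""
    cursor = connection.execute(
        """
        SELECT c.name, SUM(o.amount) AS total
        FROM customers AS c
        JOIN orders_src.orders AS o ON o.customer_id = c.id
        GROUP BY c.name
        ORDER BY c.name
        """
    )
    return cursor.fetchall()


TOTALS = total_per_customer(conn)
print(TOTALS)
```

The join runs across both databases while each table stays where it was created, which is the property the flow relies on when it avoids creating a data replica.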

Video demo

For a quick introduction, start with our video introducing data access and governance using IBM Cloud Pak for Data on AWS.

Included components

Here are the components and services that are included in this section:

  • IBM Cloud Pak for Data: A data and AI platform with a data fabric that makes all data available for AI and analytics on any cloud.
  • IBM DataStage: An integration tool that helps you design, develop, and run jobs that move and transform data.
  • IBM Data Refinery: A cloud service that provides a self-service data preparation client to transform raw data into data that's ready for analytics.
  • IBM Watson Knowledge Catalog: A data catalog that activates business-ready data for AI and analytics with intelligent cataloging, backed by active metadata and policy management.
  • Amazon Simple Storage Service (Amazon S3): An object storage service that offers industry-leading scalability, data availability, security, and performance.
  • Amazon Redshift: A cloud data warehousing service designed to accelerate time to insights with fast, easy, and secure analytics at scale.
  • Amazon Aurora: A relational database service designed for high performance and availability at global scale, with full MySQL and PostgreSQL compatibility.
  • Data Fabric: An architectural approach to simplifying data access in an organization to facilitate self-service data consumption.
  • Analytics: The collection, organization, and analysis of data to uncover insights.
  • Data Management: The organization and maintenance of data processes throughout the information lifecycle.
  • Data Privacy: Practices that ensure that user data is used responsibly.

Data access and governance use cases

This section covers two use cases, as illustrated in the following diagram:

Data access and governance use cases

In the next tutorial, you learn how to solve data silo challenges without copying or moving data by using the Data Virtualization service offered by IBM Cloud Pak for Data.