Data integration tutorial

Establish a connection between an external data source and IBM Cloud Pak for Data, and create an ETL pipeline using IBM DataStage

By

Sharath Kumar RK,

Arpit Nanavati

The Extract, Transform, Load (ETL) process breaks down data silos and empowers enterprises to integrate data from multiple data sources into a single repository of well-formatted data. IBM DataStage on IBM Cloud Pak for Data enables you to create, manage, load, and run DataStage jobs to perform seamless data integration from disparate data sources to yield insightful information. DataStage utilizes common ETL concepts such as flows, jobs, projects, connections, and environments.

This tutorial shows you how to:

  • Make a connection between an external data source and IBM Cloud Pak for Data
  • Create an ETL pipeline using IBM DataStage
  • Create a scheduling rule for the data integration pipeline

Prerequisites

To complete this tutorial, you need the following installed or provisioned:

  • IBM Cloud Pak for Data
  • IBM DataStage
  • External data sources (Amazon S3, Amazon Redshift, Amazon Aurora PostgreSQL)
  • IBM Watson Knowledge Catalog

Estimated time

It should take you about 30 minutes to complete this tutorial.

Steps

This tutorial consists of the following steps:

Step 1. Log in to IBM Cloud Pak for Data

First, you need to log in to IBM Cloud Pak for Data with valid credentials (username and password).

Login screen

Step 2. Create a new project

Click the navigation menu, expand Projects, and click All projects.

Click "All projects"

Then click New project + and create a new project as shown in the image below.

Select "Analytics project"

Click Create an empty project.

Create an empty project

Give the project a name (such as Data_Fabric_Project).

New project - define details

After the project has been created, you should be redirected to the project page as shown below.

Data-Fabric-Project created

Step 3. Create new connections with external data sources

  1. Select Add to project + and choose Connection as the asset type.

    Choose asset type

  2. Choose Amazon S3 as the connection type.

    Add connection - Amazon S3

  3. Provide the connection details to create the connection between Amazon S3 and IBM Cloud Pak for Data.

    Amazon S3 connection details

    Credentials - enter provided key

  4. Click Test connection to validate the connection.

    Test connection

    If validation is successful, click Create to create the S3 connection.

    Create Amazon S3 connection
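    If you want to sanity-check the same credentials outside of the Cloud Pak for Data UI, the short boto3 sketch below performs an equivalent validation. The access key, secret key, region, and bucket name are placeholders for the values you entered in the connection form.

    ```python
    # Minimal sketch of what "Test connection" verifies, using boto3.
    # All credentials and names below are placeholders.
    import boto3
    from botocore.exceptions import ClientError

    s3 = boto3.client(
        "s3",
        aws_access_key_id="YOUR_ACCESS_KEY_ID",          # placeholder
        aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",  # placeholder
        region_name="us-east-1",                         # placeholder region
    )

    try:
        # Listing a few objects confirms that the credentials and bucket are valid.
        response = s3.list_objects_v2(Bucket="your-bucket-name", MaxKeys=5)
        for obj in response.get("Contents", []):
            print(obj["Key"])
        print("Connection OK")
    except ClientError as err:
        print(f"Connection failed: {err}")
    ```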

  5. Repeat the same steps to create connections for Amazon Redshift and Amazon RDS for PostgreSQL. (A short connection-test sketch in Python follows this step.)

    For Amazon Redshift, choose the Connection asset type.

    Choose asset type

    Then select the Amazon Redshift connector.

    Add connection - Amazon Redshift

    Give a name to the connection (such as "Amazon Redshift Connection").

    Amazon Redshift connection details

    And similarly, for Amazon RDS for PostgreSQL, go to your project page, click Add to project + and choose Connection.

    Choose asset type

    Select the Amazon RDS for PostgreSQL connector.

    Add connection - Amazon RDS for PostgreSQL

    Provide the connection details.

    Amazon RDS for PostgreSQL connection details
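    To confirm the same Redshift and RDS for PostgreSQL details outside of the UI, a small test script can open a connection to each service; both speak the PostgreSQL wire protocol, so psycopg2 works for either. All hosts, ports, database names, and credentials below are placeholders for the values in your connection forms.

    ```python
    # Minimal connection-test sketch for the two database connections.
    import psycopg2

    def test_connection(name, **params):
        """Open a connection, run a trivial query, and report the result."""
        conn = psycopg2.connect(**params)
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # succeeds only if the connection is usable
                cur.fetchone()
            print(f"{name}: connection OK")
        finally:
            conn.close()

    # Placeholder endpoints and credentials.
    test_connection(
        "Amazon Redshift",
        host="your-cluster.example.us-east-1.redshift.amazonaws.com",
        port=5439,
        dbname="dev",
        user="awsuser",
        password="YOUR_PASSWORD",
    )
    test_connection(
        "Amazon RDS for PostgreSQL",
        host="your-instance.example.us-east-1.rds.amazonaws.com",
        port=5432,
        dbname="postgres",
        user="postgres",
        password="YOUR_PASSWORD",
    )
    ```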

  6. Now that you have created the data source connections, you can import data from them. Click Add to project + and then click Connected data.

    Choose asset type - Connected data

    Click Select source.

    Select data source

    Select apotheca_healthcare_personnel_data.csv.

    Select connection source - Amazon S3

    Give the asset a unique name.

    Specify asset name - Amazon S3

    In the Data_Fabric_Project page, you should see the recently added asset.

    View data asset in project - Amazon S3
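    As a rough programmatic equivalent of this connected-data import, the sketch below reads the same CSV object from S3 into a pandas DataFrame. The bucket name is a placeholder, and the credentials are assumed to be configured in your environment (for example, via AWS environment variables).

    ```python
    # Read the CSV selected above directly from S3 into pandas.
    import io

    import boto3
    import pandas as pd

    s3 = boto3.client("s3")  # assumes AWS credentials are configured in the environment

    obj = s3.get_object(
        Bucket="your-bucket-name",                     # placeholder bucket
        Key="apotheca_healthcare_personnel_data.csv",  # file selected in the UI
    )
    df_s3 = pd.read_csv(io.BytesIO(obj["Body"].read()))
    print(df_s3.shape)
    print(df_s3.head())
    ```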

  7. Similarly, collect data from the Redshift data source.

    Choose asset type - Connected data

    Find and select actavis_pharma_healthcare_personnel_table.

    Select connection source - Redshift

    Specify a unique name for the asset.

    New connected data asset - Redshift

    Verify that the asset has been added to the project.

    View data asset in project - Redshift

  8. Similarly, import the data from the Amazon Aurora PostgreSQL database.

    Choose asset type - Connected data

    Find and select mylan_specialty_personnel_data_table.

    Select connection source - Aurora PostgreSQL

    Specify a unique name for the asset.

    New connected data asset - Aurora PostgreSQL

    Verify that the asset has been added to the project.

    View data asset in project - Aurora PostgreSQL
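    The two database tables added in the last two steps can be pulled into pandas in a similar way, which is roughly what the connected data assets expose to downstream tools. The connection URLs are placeholders; the table names match the assets above.

    ```python
    # Load the Redshift and Aurora/RDS PostgreSQL tables into DataFrames.
    import pandas as pd
    from sqlalchemy import create_engine

    # Redshift is reachable through the PostgreSQL dialect (placeholder URL).
    redshift_engine = create_engine(
        "postgresql+psycopg2://awsuser:PASSWORD@your-cluster.example.redshift.amazonaws.com:5439/dev"
    )
    df_redshift = pd.read_sql_table("actavis_pharma_healthcare_personnel_table", redshift_engine)

    # Aurora PostgreSQL / RDS for PostgreSQL (placeholder URL).
    aurora_engine = create_engine(
        "postgresql+psycopg2://postgres:PASSWORD@your-instance.example.rds.amazonaws.com:5432/postgres"
    )
    df_aurora = pd.read_sql_table("mylan_specialty_personnel_data_table", aurora_engine)

    print(df_redshift.shape, df_aurora.shape)
    ```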

  9. To create the integration pipeline, click Add to project + and then click DataStage flow.

    Choose asset type - DataStage flow

  10. Enter the DataStage flow name and click Create to create a new DataStage flow.

    Create new DataStage flow

  11. On the DataStage homepage, you should see three options:

    • Connectors, to read data from or write data to data sources
    • Stages, to perform ETL operations
    • Quality, to improve the quality of the data

    DataStage homepage

  12. Click Connectors to expand the menu, and then drag and drop Asset browser to the DataStage canvas:

    Asset browser

  13. Now you need to select your data assets to create the integration pipeline. Select Data asset, select all three data assets that you imported in the previous steps, and then click Add.

    Select data assets

  14. You should then see all three data assets as shown below:

    View assets

  15. Next, use the search box to search on "funnel," and then drag and drop the Funnel stage to the canvas:

    Drag and drop Funnel stage to canvas

  16. Create links from all of the data assets to the Funnel stage, as shown in this image:

    Link assets to Funnel stage
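    Conceptually, the Funnel stage appends the records from all of its input links into a single output link. The pandas sketch below shows the same idea with three tiny stand-in DataFrames; in practice these would be the data assets loaded in the earlier sketches.

    ```python
    # Funnel = append records from several inputs into one output.
    import pandas as pd

    # Tiny stand-in DataFrames with a hypothetical schema (id, name).
    df_s3 = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Bob"]})
    df_redshift = pd.DataFrame({"id": [2, 3], "name": ["Bob", "Cal"]})
    df_aurora = pd.DataFrame({"id": [4], "name": ["Dee"]})

    combined = pd.concat([df_s3, df_redshift, df_aurora], ignore_index=True)
    print(combined)
    ```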

  17. Double-click each of the three data assets in turn, click the Output tab, and then click Edit to verify the data type, length, and nullability of all columns (a programmatic equivalent is sketched after this step).

    Verify data type, length, and nullability of all columns

    Verify S3 information

    Verify Redshift information

    Verify Aurora information
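    The equivalent check can also be done programmatically. The sketch below, continuing from the `combined` DataFrame in the previous sketch, prints the data type, maximum value length, and nullability of each column.

    ```python
    # Inspect column metadata: type, maximum length, and whether nulls are present.
    for column in combined.columns:
        dtype = combined[column].dtype
        max_len = combined[column].astype(str).str.len().max()
        has_nulls = combined[column].isna().any()
        print(f"{column}: dtype={dtype}, max_length={max_len}, has_nulls={has_nulls}")
    ```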

  18. Search for the Remove Duplicates stage and drag and drop it to the canvas:

    Drag and drop Remove Duplicates stage to canvas

  19. Double-click the Remove Duplicates stage to select the deduplication criteria:

    Remove Duplicates stage - select criteria

    Adding a Remove Duplicates key
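    The Remove Duplicates stage keeps one record per key. Continuing the pandas sketch, drop_duplicates does the same thing; the key column "id" is hypothetical and stands in for whichever column you selected in the stage.

    ```python
    # Keep the first record for each value of the (hypothetical) "id" key.
    deduplicated = combined.drop_duplicates(subset=["id"], keep="first")
    print(len(combined), "->", len(deduplicated), "rows after removing duplicates")
    ```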

  20. Search on "sort" and drag and drop the Sort stage to the canvas. Double-click on the Sort stage and select sort criteria by ID.

    Drag and drop Sort stage to canvas

    Sort criteria
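    The Sort stage orders the records by the chosen key. In the running pandas sketch, that corresponds to a sort_values call on the same hypothetical "id" column.

    ```python
    # Sort the deduplicated records by the (hypothetical) "id" key.
    sorted_data = deduplicated.sort_values(by="id", ascending=True).reset_index(drop=True)
    print(sorted_data.head())
    ```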

  21. Search on "rds" and then drag and drop the Amazon RDS for PostgreSQL connector to the canvas:

    Drag and drop Amazon RDS for PostgreSQL stage to canvas

  22. Double-click the RDS connector to specify the target data source and table name.

    RDS connector - data source and table name
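    What the target connector does at run time is, in effect, write the processed records into a PostgreSQL table. A hedged pandas equivalent is sketched below; the connection URL is a placeholder, `sorted_data` is the DataFrame from the previous sketches, and the table name mirrors the integrated table referenced in step 24.

    ```python
    # Write the processed records to a table in Amazon RDS for PostgreSQL.
    from sqlalchemy import create_engine

    target_engine = create_engine(
        "postgresql+psycopg2://postgres:PASSWORD@your-instance.example.rds.amazonaws.com:5432/postgres"
    )
    sorted_data.to_sql(
        "healthcare_personnel_integrated_data_table_v1",  # example table name from step 24
        target_engine,
        if_exists="replace",  # create the table, or overwrite it if it already exists
        index=False,
    )
    ```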

  23. Compile the DataStage pipeline and, if compilation succeeds, run it.

    Compile DataStage pipeline

    DataStage pipeline - compile successful

    Run DataStage pipeline

  24. Now that the integrated data is available in Amazon RDS for PostgreSQL, import it into the project as a connected data asset. Click Add to project + and then click Connected data.

    Choose asset type - connected data

    Find and select the integrated table (for example, healthcare_personnel_integrated_data_table_v1).

    Select connection source

    Specify a name for the asset.

    Data asset name

    And finally, verify that the data asset is present in the project.

    View data asset in project - Amazon RDS for PostgreSQL
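    As an optional check outside the UI, the integrated table can be read back to confirm that the combined, deduplicated records landed in PostgreSQL. This reuses the placeholder `target_engine` from the write sketch above.

    ```python
    # Read the integrated table back and inspect it.
    import pandas as pd

    check = pd.read_sql_table("healthcare_personnel_integrated_data_table_v1", target_engine)
    print(check.shape)
    print(check.head())
    ```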

Summary

In this tutorial, you learned how to create connections to Amazon S3, Amazon RDS for PostgreSQL, and Amazon Redshift, and how to collect data from these data sources. You also learned how to create an ETL pipeline using IBM DataStage.

The next tutorial in this series shows you how to clean and reshape data using IBM Data Refinery flow.