The Extract, Transform, Load (ETL) process breaks down data silos and empowers enterprises to integrate data from multiple data sources into a single repository of well-formatted data. IBM DataStage on IBM Cloud Pak for Data enables you to create, manage, load, and run DataStage jobs to perform seamless data integration from disparate data sources to yield insightful information. DataStage utilizes common ETL concepts such as flows, jobs, projects, connections, and environments.
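Before diving into the product, it helps to see the pattern in miniature. The following sketch (plain Python with made-up file, column, and table names) extracts records from a CSV file, removes duplicates, sorts them, and loads them into a target table; the rest of this tutorial builds the same kind of pipeline visually in DataStage.

```python
import csv
import sqlite3

# Extract: read raw records from a CSV export (hypothetical file and columns).
with open("personnel_export.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Transform: drop duplicate IDs and keep the rows sorted by ID.
seen, cleaned = set(), []
for row in sorted(rows, key=lambda r: r["id"]):
    if row["id"] not in seen:
        seen.add(row["id"])
        cleaned.append((row["id"], row["name"]))

# Load: write the cleaned records into a target table.
con = sqlite3.connect("warehouse.db")
con.execute("CREATE TABLE IF NOT EXISTS personnel (id TEXT PRIMARY KEY, name TEXT)")
con.executemany("INSERT OR REPLACE INTO personnel (id, name) VALUES (?, ?)", cleaned)
con.commit()
con.close()
```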
This tutorial shows you how to:
- Make a connection between an external data source and IBM Cloud Pak for Data
- Create an ETL pipeline using IBM DataStage
- Create a scheduling rule for the data integration pipeline
Prerequisites
To complete this tutorial, you need to have the following installed on your system:
- IBM Cloud Pak for Data
- IBM DataStage
- External data sources (Amazon S3, Amazon Redshift, Amazon RDS for PostgreSQL)
- IBM Watson Knowledge Catalog
Estimated time
It should take you about 30 minutes to complete this tutorial.
Steps
This tutorial consists of the following steps:
Step 1. Log in to IBM Cloud Pak for Data
First, you need to log in to IBM Cloud Pak for Data with valid credentials (username and password).
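The steps in this tutorial use the web console, but if you prefer to script your work, Cloud Pak for Data also exposes a REST authorization endpoint that exchanges the same credentials for a bearer token. The endpoint path and response field shown below (/icp4d-api/v1/authorize and token) are the commonly used ones; confirm them against the API reference for your release.

```python
import requests

CPD_URL = "https://<your-cpd-host>"  # replace with your Cloud Pak for Data URL

# Exchange username/password for a bearer token (endpoint path may vary by release).
resp = requests.post(
    f"{CPD_URL}/icp4d-api/v1/authorize",
    json={"username": "<username>", "password": "<password>"},
    verify=False,  # many clusters use self-signed certificates; use a CA bundle in production
)
resp.raise_for_status()
token = resp.json()["token"]

# Use the token on subsequent API calls.
headers = {"Authorization": f"Bearer {token}"}
```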
Step 2. Create a new project
Click the navigation menu, expand Projects, and click All projects.
Then click New project + and create a new project as shown in the image below.
Click Create an empty project.
Give the project a name (such as Data_Fabric_Project).
After the project has been created, you should be redirected to the project page as shown below.
Step 3. Create new connections with external data sources
Select Add to project + and choose Connection as the asset type.
Choose Amazon S3 as the connection type.
Provide the connection details to create the connection between Amazon S3 and IBM Cloud Pak for Data.
Click Test connection to validate the connection.
If validation is successful, click Create to create the S3 connection.
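If the connection test fails, it can help to confirm the credentials outside of Cloud Pak for Data. The following sketch uses the boto3 library with the same access key and a placeholder bucket name to list a few objects.

```python
import boto3

# Use the same credentials you entered in the connection form; the bucket name is a placeholder.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<access-key-id>",
    aws_secret_access_key="<secret-access-key>",
    region_name="us-east-1",
)

# List a few objects to confirm the credentials and bucket are reachable.
response = s3.list_objects_v2(Bucket="<your-bucket>", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```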
Perform the same steps to create connections for Amazon Redshift and Amazon RDS for PostgreSQL.
For Amazon Redshift, choose the Connection asset type.
Then select the Amazon Redshift connector.
Give a name to the connection (such as "Amazon Redshift Connection").
Similarly, for Amazon RDS for PostgreSQL, go to your project page, click Add to project +, and choose Connection.
Select the Amazon RDS for PostgreSQL connector.
Provide the connection details.
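As with S3, you can sanity-check the Redshift and RDS credentials outside the platform. Both accept PostgreSQL-protocol connections (Redshift typically on port 5439), so a single psycopg2 script can test them; all hosts, databases, and credentials below are placeholders.

```python
import psycopg2

# Connection details are placeholders; use the same values you entered in the connection forms.
targets = {
    "Amazon Redshift": dict(host="<redshift-endpoint>", port=5439, dbname="<database>",
                            user="<user>", password="<password>"),
    "Amazon RDS for PostgreSQL": dict(host="<rds-endpoint>", port=5432, dbname="<database>",
                                      user="<user>", password="<password>"),
}

for name, params in targets.items():
    with psycopg2.connect(**params) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            print(name, "->", cur.fetchone()[0])
```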
Step 4. Import data from the connected data sources
Now that you have created the data source connections, you can import data from them. Click Add to project + and then click Connected data.
Click Select source.
Select apotheca_healthcare_personnel_data.csv.
Give the asset a unique name.
In the Data_Fabric_Project page, you should see the recently added asset.
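If you want to confirm that the imported asset matches the source file, you can preview the CSV directly from S3. The sketch below uses pandas (with the s3fs package installed); the bucket path and credentials are placeholders.

```python
import pandas as pd

# Requires the s3fs package; the bucket name and credentials are placeholders.
df = pd.read_csv(
    "s3://<your-bucket>/apotheca_healthcare_personnel_data.csv",
    storage_options={"key": "<access-key-id>", "secret": "<secret-access-key>"},
)

print(df.shape)   # rows x columns
print(df.dtypes)  # column types the DataStage import should mirror
print(df.head())
```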
Similarly, collect data from the Redshift data source.
Find and select actavis_pharma_healthcare_personnel_table.
Specify a unique name for the asset.
Verify that the asset has been added to the project.
Similarly, import the data from the Amazon RDS for PostgreSQL database.
Find and select mylan_specialty_personnel_data_table.
Specify a unique name for the asset.
Verify that the asset has been added to the project.
Step 5. Create an ETL pipeline using IBM DataStage
To create the integration pipeline, click Add to project + and then DataStage flow.
Enter the DataStage flow name and click Create to create a new DataStage flow.
On the DataStage homepage, you should see 3 options:
- Connectors to import or output to the data source
- Stages to perform ETL operations
- Quality to enhance the quality of the data
Click Connectors to expand the menu, and then drag and drop Asset browser to the DataStage canvas:
Now you need to select your data assets to create the integration pipeline. Select Data asset, choose all three of the data assets that you imported in the previous step, and then click Add.
You should then see all three data assets as shown below:
Next, use the search box to search on "funnel," and then drag and drop the Funnel stage to the canvas:
Create links from all of the data assets to the Funnel stage, as shown in this image:
Double-click each of the three data assets, click the Output tab, and then click Edit to verify the data type, length, and nullability of each column.
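The reason for this metadata check is that the Funnel stage appends the records from its input links, so the column names, types, and nullability must line up across all three inputs. The following pandas sketch shows the same kind of comparison outside DataStage, using small placeholder frames that stand in for the three imported assets.

```python
import pandas as pd

# Small placeholder frames standing in for the three imported assets; in practice,
# load them with pd.read_csv / pd.read_sql as sketched earlier in this tutorial.
df_s3 = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df_redshift = pd.DataFrame({"id": [2, 3], "name": ["b", "c"]})
df_rds = pd.DataFrame({"id": [3, 4], "name": [None, "d"]})

sources = {"Amazon S3": df_s3, "Amazon Redshift": df_redshift, "Amazon RDS": df_rds}

# One row per column; each cell shows "<dtype> / has_nulls=<bool>" for that source.
schema = pd.DataFrame({
    name: df.dtypes.astype(str) + " / has_nulls=" + df.isna().any().astype(str)
    for name, df in sources.items()
})
print(schema)
```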
Search for the Remove Duplicates stage and drag and drop it to the canvas:
Double-click on the Remove Duplicates stage to select the criteria:
Search on "sort" and drag and drop the Sort stage to the canvas. Double-click on the Sort stage and select sort criteria by ID.
Search on "rds" and then drag and drop the Amazon RDS for PostgreSQL connector to the canvas:
Double-click on the RDS connector to specify the data source and table name.
Compile the DataStage pipeline and, if the compilation succeeds, run it.
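When the compiled job runs, the flow appends the three inputs, removes duplicates, sorts by ID, and writes the result to the table you specified on the RDS connector. A rough scripted equivalent of that final load, using pandas and SQLAlchemy with placeholder connection details, looks like this:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; match it to the values you set on the
# Amazon RDS for PostgreSQL connector above.
engine = create_engine(
    "postgresql+psycopg2://<user>:<password>@<rds-endpoint>:5432/<database>"
)

# Stand-in for the integrated, deduplicated, sorted rows produced by the flow.
result = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})

# Load the integrated data into the target table (replace it if it already exists).
result.to_sql(
    "healthcare_personnel_integrated_data_table_v1",
    engine,
    if_exists="replace",
    index=False,
)
```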
Step 6. Import the integrated data into the project
Now that the integrated data is available in Amazon RDS for PostgreSQL, let's import it back into the project from that data source.
Find and select the integrated table (for example, healthcare_personnel_integrated_data_table_v1).
Specify a name for the asset.
And finally, verify that the data asset is present in the project.
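If you want an extra check outside the UI, a quick row count against the target table confirms that the job loaded data. The connection details below are placeholders, and the table name matches the example used above.

```python
import psycopg2

# Connection details are placeholders; the table name matches the example above.
conn = psycopg2.connect(host="<rds-endpoint>", port=5432, dbname="<database>",
                        user="<user>", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM healthcare_personnel_integrated_data_table_v1;")
    print("integrated rows:", cur.fetchone()[0])
conn.close()
```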
Summary
In this tutorial, you learned how to create connections with Amazon S3, Amazon RDS for PostgreSQL, and Amazon Redshift, and how to collect data from these data sources. You also learned how to create an ETL pipeline using IBM DataStage.
The next tutorial in this series shows you how to clean and reshape data using an IBM Data Refinery flow.