Collect, cleanse, and enhance your data

In this tutorial, we perform data engineering operations on multiple datasets using IBM Watson Data Refinery on IBM Cloud Pak for Data or in Watson Studio on IBM Cloud.

A data scientist cannot build a model directly from raw data. Collecting and analyzing the data is essential before building a model. In this tutorial, we demonstrate how you can easily collect data from databases, analyze it, and enhance it to fit your requirements with the help of IBM Watson Data Refinery on IBM Cloud Pak for Data or Watson Studio on IBM Cloud.

Learning objectives

When you have completed this tutorial, you will understand how to:

  • Create a set of ordered steps to cleanse, shape, and enhance data
  • Create a connection between any database and Watson Data Refinery
  • Prepare datasets specific to your machine learning model
  • Save the datasets in any database of your choice

Prerequisites

  1. Any SQL database. In this tutorial, we demonstrate with Db2 on IBM Cloud Pak for Data and Db2 on Cloud.

  2. An IBM Cloud account, if you prefer to deploy on IBM Cloud.

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

You can follow this tutorial to deploy on either IBM Cloud Pak for Data or IBM Cloud.

Steps to deploy on IBM Cloud Pak for Data

Step 1: Download the data

In this tutorial, we use the Brazilian E-Commerce Public Dataset by Olist from Kaggle. Download the dataset from the Kaggle page.

After downloading, extract the brazilian-ecommerce.zip file.

We’ll be using the following files:

  1. brazilian-ecommerce/olist_orders_dataset.csv: This is the core dataset. Each order links to all of the other information.

  2. brazilian-ecommerce/olist_order_items_dataset.csv: This dataset includes data about the items purchased within each order.

  3. brazilian-ecommerce/olist_products_dataset.csv: This dataset includes data about the products sold by Olist.

  4. brazilian-ecommerce/olist_sellers_dataset.csv: This dataset includes data about the sellers that fulfilled orders made at Olist.
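
The four files link together through shared key columns, which is what makes the joins later in this tutorial possible. The following sketch records only the join keys used here (the full column lists are documented on the Kaggle dataset page):

```python
# Join keys linking the four Olist CSV files (per the Kaggle dataset docs).
# Only the key columns used in this tutorial are listed.
JOIN_KEYS = {
    "olist_orders_dataset.csv":      {"order_id"},
    "olist_order_items_dataset.csv": {"order_id", "product_id", "seller_id"},
    "olist_products_dataset.csv":    {"product_id"},
    "olist_sellers_dataset.csv":     {"seller_id"},
}

def shared_keys(file_a: str, file_b: str) -> set:
    """Columns on which two files can be joined."""
    return JOIN_KEYS[file_a] & JOIN_KEYS[file_b]
```

Note that the order items file is the hub: it is the only file that shares a key with each of the other three.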

Step 2: Load the data into tables in Db2

NOTE: We assume you have already provisioned a Db2 instance in your IBM Cloud Pak for Data. If you do not have a Db2 instance provisioned, you can use another on-premises, public, or private database of your choice and load the datasets there.

  • Open the Db2 instance and click Load Data.

click load data image

  • Select the olist_orders_dataset.csv file and select next.

Select the olist_orders_dataset image

  • Choose your namespace, create a table named ORDERS, and select Next.

Note: Make sure you have selected the default schema of your database. In the case of Db2, your default schema is your username.

Create a new table image

  • You can preview the metadata of the table and select next.

Preview metadata image

  • Click on Begin Load to import the downloaded .csv file into your Db2.

Click Begin load image

  • Wait for the upload to finish.

Wait for upload to finish image

  • Once the table is created, click on Load More Data to add the other three datasets.

Load more data image

  • Repeat the steps above to load olist_order_items_dataset.csv into a table named ORDERITEMS, olist_products_dataset.csv into PRODUCTS, and olist_sellers_dataset.csv into SELLERS.
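
Conceptually, the console's load wizard creates a table from the CSV header row and bulk-inserts the data rows. A minimal stand-in sketch using Python's built-in sqlite3 and csv modules (Db2 uses its own DDL, and the console does all of this for you; the sample rows are made up):

```python
import csv
import io
import sqlite3

# Tiny in-memory stand-in for the Db2 load wizard: create a table
# from the CSV header row, then bulk-insert the data rows.
sample_csv = io.StringIO(
    "order_id,order_status\n"
    "o1,delivered\n"
    "o2,shipped\n"
)

conn = sqlite3.connect(":memory:")
reader = csv.reader(sample_csv)
header = next(reader)

# Build CREATE TABLE / INSERT statements from the header columns.
cols = ", ".join(header)
placeholders = ", ".join("?" for _ in header)
conn.execute(f"CREATE TABLE ORDERS ({cols})")
conn.executemany(f"INSERT INTO ORDERS VALUES ({placeholders})", reader)

count = conn.execute("SELECT COUNT(*) FROM ORDERS").fetchone()[0]
```

The same create-then-bulk-insert pattern repeats for the ORDERITEMS, PRODUCTS, and SELLERS tables.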

Step 3: Create a Project in IBM Cloud Pak for Data

Once the database is ready, we will start using it in IBM Cloud Pak for Data.

  • Create a project in IBM Cloud Pak for Data and choose Empty project.

Choose an empty project image

  • Once the project is created, you will see the following page.

Project is created image

Step 4: Add Db2 connection to the project

Now that we have created a project, we will start adding components to it, beginning with the Db2 connection.

Click Add to Project and select Connection. If you followed step 2 above, select Db2 from the list and add the credentials of your provisioned Db2 instance. If you have a different database, select it and fill in its credentials.

Create connection image

  • After filling in the credentials, click Test Connection to make sure you entered them correctly, then select Create.

Test connection image

NOTE: The database credentials will be provided by your database administrator. If you have provisioned a Db2 instance on Cloud Pak for Data, you can follow the steps here to get the credentials.

Step 5: Add Data Refinery to the project and perform Data Engineering Operations

5.1 Add Data Refinery to the project

We will add a Data Refinery flow in a similar way.

  • Click on Add to Project and select Data Refinery Flow.

Data refinery flow image

Under Assets, click Connections, and then click the connection that you created in step 4. Click the schema of your database, select the ORDERS table, and finally click ADD.

Select orders image

  • You will now see the Data Refinery Dashboard.

Data refinery dashboard image

5.2 Perform Data Engineering Operations

5.2.1 We will perform joins in this tutorial. Click Operation on the top left and click Join.

Join operation image

5.2.2 Select the Inner join and add the second dataset from our Db2 database by clicking the button shown.

Add dataset image

5.2.3 We will first join the ORDERS table with the ORDERITEMS table from Db2. Under Assets, click Connections, and then click the connection that you created in step 4. Click the schema of your database, select the ORDERITEMS table, and finally click APPLY.

Select order items image

5.2.4 Select the JOIN KEYS for ORDERS and ORDERITEMS as order_id and click NEXT.

Key order image

5.2.5 Click on APPLY to apply the Join Operation.

Apply operation image

  • Repeat steps 5.2.1 through 5.2.5 to join the remaining tables to the result by product_id and seller_id.

  • Select the JOIN KEYS for ORDERS and PRODUCTS as product_id.

Key product image

  • Select the JOIN KEYS for ORDERS and SELLERS as seller_id.

Key seller ID image
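
The three join operations applied above correspond to chained inner joins in SQL. A minimal sketch using sqlite3 with tiny made-up rows (the table names and join keys follow this tutorial; the data is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny made-up rows mirroring the four tables and their join keys.
cur.execute("CREATE TABLE ORDERS (order_id TEXT, order_status TEXT)")
cur.execute("CREATE TABLE ORDERITEMS (order_id TEXT, product_id TEXT, seller_id TEXT, price REAL)")
cur.execute("CREATE TABLE PRODUCTS (product_id TEXT, product_category_name TEXT)")
cur.execute("CREATE TABLE SELLERS (seller_id TEXT, seller_city TEXT)")

cur.execute("INSERT INTO ORDERS VALUES ('o1', 'delivered')")
cur.execute("INSERT INTO ORDERITEMS VALUES ('o1', 'p1', 's1', 59.9)")
cur.execute("INSERT INTO PRODUCTS VALUES ('p1', 'toys')")
cur.execute("INSERT INTO SELLERS VALUES ('s1', 'sao paulo')")

# The same three inner joins Data Refinery applies in steps 5.2.1-5.2.5:
# ORDERS -> ORDERITEMS on order_id, then PRODUCTS on product_id,
# then SELLERS on seller_id.
rows = cur.execute("""
    SELECT o.order_id, o.order_status, i.price,
           p.product_category_name, s.seller_city
    FROM ORDERS o
    JOIN ORDERITEMS i ON o.order_id   = i.order_id
    JOIN PRODUCTS   p ON i.product_id = p.product_id
    JOIN SELLERS    s ON i.seller_id  = s.seller_id
""").fetchall()
```

Because every join is an inner join, orders without matching items, products, or sellers drop out of the result, which is the same behavior you get from the Inner join option in Data Refinery.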

Step 6: Save the Enhanced Dataset to a table in Db2 and Run the Job

  • Once the operations are performed, it's time to save the result in a table. By default, the result is saved as a .csv file in the project, but we will change the output location to the Db2 database.

6.1 Save the Enhanced Dataset to a table in Db2

  • Click on the Edit button on the top right as shown.

Edit button image

Then click on the Pencil button as shown.

Edit pencil image

  • Click Change Location. Under Assets, click Connections, and then click the connection that you created in step 4. Click the schema of your database, and finally click SAVE LOCATION.

select DB location image

  • Name the dataset DERIVEDDATA and click Done.

NOTE: Use uppercase names only, because Db2 stores unquoted identifiers in uppercase.

Tick button image

6.2 Run the Data Refinery Job

  • Click Save and create a job as shown.

create a job image

  • Give the job a name and click Create and Run.

create run image

The job will start running and takes approximately 4-5 minutes to complete.

running job image

Once the job status becomes Completed, you can check your database to see the new DERIVEDDATA table containing the result.

Steps to deploy on IBM Cloud

Step 1: Download the data

In this tutorial, we use the Brazilian E-Commerce Public Dataset by Olist from Kaggle. Download the dataset from the Kaggle page.

After downloading, extract the brazilian-ecommerce.zip file.

We’ll be using the following files:

  1. brazilian-ecommerce/olist_orders_dataset.csv: This is the core dataset. Each order links to all of the other information.

  2. brazilian-ecommerce/olist_order_items_dataset.csv: This dataset includes data about the items purchased within each order.

  3. brazilian-ecommerce/olist_products_dataset.csv: This dataset includes data about the products sold by Olist.

  4. brazilian-ecommerce/olist_sellers_dataset.csv: This dataset includes data about the sellers that fulfilled orders made at Olist.

Step 2: Load the data into tables in Db2

NOTE: You can skip this step if you do not want to use a Db2 instance; you can use another on-premises, public, or private database of your choice and load the datasets there.

  • Create a Db2 service instance from the IBM Cloud catalog and wait for it to be provisioned.

db2 resource image

  • Once the resource is ready, click Service Credentials on the left panel and then click View credentials.

View credentials image

NOTE: Copy these credentials; they will be used in step 4 below.

  • Now click on Manage on the left panel and then click on Open Console to open the Db2 Console.

Open console image

  • Once the Db2 Console is opened, click on load data.

Load data image

Select the olist_orders_dataset.csv file and select next.

Browse files image

  • Choose your namespace, create a table named ORDERS, and select Next.

Note: Make sure you have selected the default schema of your database. In the case of Db2, your default schema is your username.

Create table image

  • You can preview the metadata of the table and select next.

View table meta image

  • Click on Begin Load to import the downloaded .csv file into your Db2.

Begin load image

  • Wait for the upload to finish.

Wait for upload image

  • Once the table is created, click on Load More Data to add the other three datasets.

Load more data image

  • Repeat the steps above to load olist_order_items_dataset.csv into a table named ORDERITEMS, olist_products_dataset.csv into PRODUCTS, and olist_sellers_dataset.csv into SELLERS.

Step 3: Create a Watson Studio Service

Once the database is ready, we will start using it in Watson Studio on IBM Cloud. Create a Watson Studio service from the IBM Cloud catalog.

Create Watson Studio service image

  • Then click on Get Started.

  • In Watson Studio click Create a project > Create an empty project and name it Retail.

Create Watson Studio project image

Step 4: Add Db2 connection to the project

Now that we have created a project, we will start adding components to it, beginning with the Db2 connection.

  • Click Add to Project and select Connection. If you followed step 2 above, select Db2 from the list and add the credentials of your provisioned Db2 instance. If you have a different database, select it and fill in its credentials.

Create connection image

After filling in the credentials, click Create.

Click on connection image

NOTE: The database credentials were generated in step 2.
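
The service credentials JSON from step 2 supplies the values that the connection form asks for. A hedged sketch of picking them out (the field names shown are typical for Db2 on Cloud but can vary between plans, and all values here are placeholders, so check your own Service Credentials JSON):

```python
# Illustrative Db2 on Cloud service credentials. Field names are typical
# but can vary by plan; every value below is a made-up placeholder.
credentials = {
    "db": "BLUDB",
    "hostname": "example.databases.appdomain.cloud",
    "port": 32733,
    "username": "myuser",
    "password": "mypassword",  # never hard-code real passwords
}

# Map the credential fields onto the values the connection form needs.
connection_form = {
    "Database": credentials["db"],
    "Hostname or IP address": credentials["hostname"],
    "Port": credentials["port"],
    "Username": credentials["username"],
    "Password": credentials["password"],
}
```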

Step 5: Add Data Refinery to the project and perform Data Engineering Operations

5.1 Add Data Refinery to the project

We will add a Data Refinery flow in a similar way.

  • Click on Add to Project and select Data Refinery Flow.

Select data refinery flow image

  • Under Assets, click Connections, and then click the connection that you created in step 4. Click the schema of your database, select the ORDERS table, and finally click ADD.

select orders image

  • You will now see the Data Refinery Dashboard.

data refinery image

5.2 Perform Data Engineering Operations

5.2.1 We will perform joins in this tutorial. Click Operation on the top left and click Join.

Join operation image

5.2.2 Select the Inner join and add the second dataset from our Db2 database by clicking the button shown.

Add dataset image

5.2.3 We will first join the ORDERS table with the ORDERITEMS table from Db2. Under Assets, click Connections, and then click the connection that you created in step 4. Click the schema of your database, select the ORDERITEMS table, and finally click APPLY.

select order items image

5.2.4 Select the JOIN KEYS for ORDERS and ORDERITEMS as order_id and click NEXT.

Order ID image

5.2.5 Click on APPLY to apply the Join Operation.

Apply operation image

  • Repeat steps 5.2.1 through 5.2.5 to join the remaining tables to the result by product_id and seller_id.

  • Select the JOIN KEYS for ORDERS and PRODUCTS as product_id.

Key product ID image

  • Select the JOIN KEYS for ORDERS and SELLERS as seller_id.

Key seller ID image

Step 6: Save the Enhanced Dataset to a table in Db2 and Run the Job

Once the operations are performed, it's time to save the result in a table. By default, the result is saved as a .csv file in the project, but we will change the output location to the Db2 database.

6.1 Save the Enhanced Dataset to a table in Db2

  • Click on the Edit button on the top right as shown.

Edit button image

  • Then click on the Pencil button as shown.

Click pencil button image

Click Change Location. Under Assets, click Connections, and then click the connection that you created in step 4. Click the schema of your database, and finally click SAVE LOCATION.

Save location image

  • Name the dataset DERIVEDDATA and click Done.

NOTE: Use uppercase names only, because Db2 stores unquoted identifiers in uppercase.

Name dataset image

6.2 Run the Data Refinery Job

  • Click Save and create a job as shown.

Create job image

Give the job a name and click Create and Run.

Create and run image

  • The job will start running and takes approximately 4-5 minutes to complete.

Create and run image

Once the job status becomes Completed, you can check your database to see the new DERIVEDDATA table containing the result.

Summary

A data scientist cannot build a model directly from raw data; collecting and analyzing the data is essential first. This tutorial shows data scientists how to perform data engineering operations on any data, reducing the time spent on data engineering and letting them focus on building the model. The main advantage of the Data Refinery capability of IBM Cloud Pak for Data is the ability to create a set of ordered steps to cleanse, shape, and enhance data.

Manoj Jahgirdar
Smruthi Raj Mohan
sri kanth
Manjula G Hosurmath