In the life-cycle of data science, data preparation is one of the most important stages. Data scientists spend 80% of their time cleansing, shaping and formatting data before doing any analysis. IBM Data Refinery, an intuitive cloud-based data preparation service, helps you quickly source, shape and share your data sets. This tutorial is a short introduction for data wrangling and will introduce you to IBM Data Refinery’s capabilities and how can you utilize it to prepare your data.

The use-case of this tutorial is Titanic data set. It has 12 columns of type integer, double and string. Some columns need shaping or cleaning operations to fully make use of the data. Mostly, we will fill the missing values with different approaches.

Learning objectives

After completing this tutorial, you’ll understand how to:

  • Acquire the skills of data engineering and cleaning
  • Format and shape your data according to project requirements

Prerequisites

In order to complete this tutorial, you will need the following:

  • IBM Cloud account: An account must exist to use the platform.
  • Watson Studio service instance: A service instance must exist to be able to use IBM Data Refinery.

Estimated time

The tutorial will take approximately 10 minutes to complete.

Steps

Step 1. Create an IBM cloud account

If you do not have an IBM Cloud account, create an account here.

  • A Lite account is a free of charge. Make sure to set the region to U.S. South.

Step 2. Create a Watson Studio service instance

If you don’t have a watson Studio instance, do the following:

  1. Select Catalog found at the top right of the page.
  2. Click on Watson from the menu on the left, which you can find under Platform services.
  3. Select Browse Services under Watson Services.
  4. Choose watson Studio instance

Create Watson Studio service instance

  • Once the main page of the service appears, click on Get Started. This will redirect your browser to the Watson Studio platform. You may be asked to confirm IBM Cloud organization and space information.

Step 3. Create a standard project in Watson Studio

From the Get Started page, select Create a project

Create a project in Watson Studio

Then choose a Standard plan

Choose a standard plan

  • Make sure a cloud storage instance exists, or add a new IBM Cloud Object Storage instance by clicking on Add under Select storage service.

Make sure cloud storage instance exist

Step 4. Upload data set

The operations will be done using Titanic dataset which can be downloaded here. Save the csv file to apply the following steps.

Under the Asset tab in the project, choose this icon choose this icon on the right to upload the dataset to the platform.

  • Click browse to navigate your folders where the dataset set can be found, and select file train.csv.
  • After it’s uploaded, it will be listed in the Data assets.

Upload data set

Step 5. Open data refinery

To start the process, press the Action Menu (triple dot) in the right side of the train.csv bar to open Refine.

Open Data Refinery

Step 6. Convert column type

All the columns initially are of type string, for better shaping, convert those integer values columns from string to integer. From the Action menu that appears in the right side of each column, select Convert Column type and choose the type. In Titanic’s use case, the columns that are converted to integer are Survived, PClass, Sibsp and Parch. The columns are converted to decimal are Age and Fare.

Convert Column Type

Step 6. Fill missing values

The columns that have missing values in Titanic dataset are Age, Cabin and Embarked. The methods to fulfill the missing values are different for each attribute depending on the purpose of the attribute.

So, to fill the missing values in the Embarked attribute, we only fill it with ‘S’ knowing that the passengers actually embarked at Southampton.

For the Cabin attribute, we’ll create an additional column that has 1 for a passenger who’s cabin exists and 0 if it does not exist. Relating to the accident, known passenger’s cabin indicates they survived. To do that, follow the steps below:

  • Press the Actions menu in the Cabin column.
  • Select Conditional Replace under the Organize Category

Fill missing values for Cabin attribute

  • Add two conditions in the Cabin’s column: if the value is empty replace it with 0, if it’s not empty replace with 1.

Add Conditions to cabin column

  • Choose Cabin column, in the option of Replace put 1

Choose cabin column

For age attribute, calculate the mean of the column values and place it in the null values. To replace missing values by the mean of the column, do the following:

  1. From the Actions menu choose Filter condition.

Choose filter condition

  1. Select column Age and Is not empty under operator for the Filteration condition.

Select column age

  1. From the Operation Bar, select Summarize operator.

Select summarize operator

  1. Fill in the operation command the required variables like this Summarize(newVarName=operator(column))

    summarize(newAge= mean(``Age``))
    

    Fill in operation command 1 Fill in operation command 2

  2. Copy the generated value to use after, and undo the last two actions from the backward arrow above, since the filteration and the new summarized value is now useless.

Copy the generated value

  1. Select Age column again, from the Action menu choose Replace missing values.

Select age

  1. Insert the mean value to be replaced with and press Apply button.

Insert the mean value

Step 7. Remove duplicates

The Titanic data set does not have sensitive information that should be unique except for the passenger ID. Simply select the Action menu in Passenger Id column, and choose Remove duplicates.

Remove duplicates

Summary

In this tutorial, you learned the first stage of data science. The outcome of this stage determines the success of futher stages. You’ve also learned how IBM Data Refinery can help you gain a fast approach for rough data cleaning — with no coding requirements.