In the life-cycle of data science, data preparation is one of the most important stages. Data scientists spend 80% of their time cleansing, shaping and formatting data before doing any analysis. IBM Data Refinery, an intuitive cloud-based data preparation service, helps you quickly source, shape and share your data sets. This tutorial is a short introduction for data wrangling and will introduce you to IBM Data Refinery’s capabilities and how can you utilize it to prepare your data.
The use-case of this tutorial is Titanic data set. It has 12 columns of type integer, double and string. Some columns need shaping or cleaning operations to fully make use of the data. Mostly, we will fill the missing values with different approaches.
After completing this tutorial, you’ll understand how to:
- Acquire the skills of data engineering and cleaning
- Format and shape your data according to project requirements
In order to complete this tutorial, you will need the following:
- IBM Cloud account: An account must exist to use the platform.
- Watson Studio service instance: A service instance must exist to be able to use IBM Data Refinery.
The tutorial will take approximately 10 minutes to complete.
Step 1. Create an IBM cloud account
If you do not have an IBM Cloud account, create an account here.
- A Lite account is a free of charge. Make sure to set the region to U.S. South.
Step 2. Create a Watson Studio service instance
If you don’t have a watson Studio instance, do the following:
- Select Catalog found at the top right of the page.
- Click on Watson from the menu on the left, which you can find under Platform services.
- Select Browse Services under Watson Services.
- Choose watson Studio instance
- Once the main page of the service appears, click on Get Started. This will redirect your browser to the Watson Studio platform. You may be asked to confirm IBM Cloud organization and space information.
Step 3. Create a standard project in Watson Studio
From the Get Started page, select Create a project
Then choose a Standard plan
- Make sure a cloud storage instance exists, or add a new IBM Cloud Object Storage instance by clicking on Add under Select storage service.
Step 4. Upload data set
The operations will be done using Titanic dataset which can be downloaded here. Save the csv file to apply the following steps.
Under the Asset tab in the project, choose this icon on the right to upload the dataset to the platform.
- Click browse to navigate your folders where the dataset set can be found, and select file train.csv.
- After it’s uploaded, it will be listed in the Data assets.
Step 5. Open data refinery
To start the process, press the Action Menu (triple dot) in the right side of the train.csv bar to open Refine.
Step 6. Convert column type
All the columns initially are of type string, for better shaping, convert those integer values columns from string to integer. From the Action menu that appears in the right side of each column, select Convert Column type and choose the type. In Titanic’s use case, the columns that are converted to integer are Survived, PClass, Sibsp and Parch. The columns are converted to decimal are Age and Fare.
Step 6. Fill missing values
The columns that have missing values in Titanic dataset are Age, Cabin and Embarked. The methods to fulfill the missing values are different for each attribute depending on the purpose of the attribute.
So, to fill the missing values in the Embarked attribute, we only fill it with ‘S’ knowing that the passengers actually embarked at Southampton.
For the Cabin attribute, we’ll create an additional column that has 1 for a passenger who’s cabin exists and 0 if it does not exist. Relating to the accident, known passenger’s cabin indicates they survived. To do that, follow the steps below:
- Press the Actions menu in the Cabin column.
- Select Conditional Replace under the Organize Category
- Add two conditions in the Cabin’s column: if the value is empty replace it with 0, if it’s not empty replace with 1.
- Choose Cabin column, in the option of Replace put 1
For age attribute, calculate the mean of the column values and place it in the null values. To replace missing values by the mean of the column, do the following:
- From the Actions menu choose Filter condition.
- Select column Age and Is not empty under operator for the Filteration condition.
- From the Operation Bar, select Summarize operator.
Fill in the operation command the required variables like this Summarize(newVarName=operator(column))
Copy the generated value to use after, and undo the last two actions from the backward arrow above, since the filteration and the new summarized value is now useless.
- Select Age column again, from the Action menu choose Replace missing values.
- Insert the mean value to be replaced with and press Apply button.
Step 7. Remove duplicates
The Titanic data set does not have sensitive information that should be unique except for the passenger ID. Simply select the Action menu in Passenger Id column, and choose Remove duplicates.
In this tutorial, you learned the first stage of data science. The outcome of this stage determines the success of futher stages. You’ve also learned how IBM Data Refinery can help you gain a fast approach for rough data cleaning — with no coding requirements.