Meaad ALRHSOUD | Published January 22, 2019
Data science, Object Storage
In the data science life cycle, data preparation is one of the most important stages. Data scientists are often said to spend around 80% of their time cleansing, shaping, and formatting data before doing any analysis. IBM Data Refinery, an intuitive cloud-based data preparation service, helps you quickly source, shape, and share your data sets. This tutorial is a short introduction to data wrangling. It introduces IBM Data Refinery's capabilities and shows how you can use them to prepare your data.
This tutorial uses the Titanic data set as its use case. It has 12 columns of type integer, double, and string. Some columns need shaping or cleaning operations before the data can be fully used. Most of the work consists of filling in missing values, using a different approach for each column.
After completing this tutorial, you’ll understand how to:
In order to complete this tutorial, you will need the following:
The tutorial will take approximately 10 minutes to complete.
If you do not have an IBM Cloud account, create an account here.
If you don't have a Watson Studio instance, do the following:
From the Get Started page, select Create a project
Then choose a Standard plan
The operations use the Titanic data set, which can be downloaded here.
Save the CSV file so that you can apply the following steps.
Under the Asset tab in the project, choose this icon on the right to upload the dataset to the platform.
To start the process, click the Action menu (three dots) on the right side of the train.csv row and open Refine.
All columns are initially of type string. For better shaping, convert the columns that hold integer values from string to integer. From the Action menu on the right side of each column, select Convert Column type and choose the type. In the Titanic use case, the columns converted to integer are Survived, Pclass, SibSp, and Parch. The columns converted to decimal are Age and Fare.
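For readers who think in code, the conversion above can be sketched in plain Python. This is only an illustration, not what Data Refinery runs internally. The sample rows and the `convert_types` helper are hypothetical, and treating an empty string as a missing value (`None`) is an assumption.

```python
# Hypothetical sample rows, shaped like csv.DictReader output: every value is a string.
rows = [
    {"Survived": "0", "Pclass": "3", "SibSp": "1", "Parch": "0", "Age": "22", "Fare": "7.25"},
    {"Survived": "1", "Pclass": "1", "SibSp": "1", "Parch": "0", "Age": "", "Fare": "71.2833"},
]

INT_COLUMNS = ("Survived", "Pclass", "SibSp", "Parch")
DECIMAL_COLUMNS = ("Age", "Fare")


def convert_types(row):
    """Return a copy of row with numeric columns parsed.

    Assumption: an empty string marks a missing value and becomes None.
    """
    converted = dict(row)
    for name in INT_COLUMNS:
        converted[name] = int(row[name]) if row[name] != "" else None
    for name in DECIMAL_COLUMNS:
        converted[name] = float(row[name]) if row[name] != "" else None
    return converted


typed_rows = [convert_types(row) for row in rows]
```

Keeping missing values as `None` rather than `0` matters for the later steps, because a placeholder zero would distort any mean computed over the column.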
The columns that have missing values in the Titanic data set are Age, Cabin, and Embarked. The method used to fill the missing values differs for each attribute, depending on its purpose.
For the Embarked attribute, we simply fill the missing values with 'S', because the passengers concerned are known to have embarked at Southampton.
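The same constant fill can be expressed as a minimal Python sketch. The sample rows and the `fill_embarked` helper are hypothetical and only illustrate the idea. The code assumes a missing value shows up as an empty string or `None`.

```python
# Hypothetical sample rows; the second one has no Embarked value.
rows = [
    {"PassengerId": 1, "Embarked": "S"},
    {"PassengerId": 2, "Embarked": ""},
    {"PassengerId": 3, "Embarked": "C"},
]


def fill_embarked(rows, default="S"):
    """Return new rows where an empty or None Embarked value is replaced by default."""
    return [{**row, "Embarked": row["Embarked"] or default} for row in rows]


filled_rows = fill_embarked(rows)
```

A constant fill like this is only reasonable when the true value is known or the number of missing entries is very small, as is the case for Embarked here.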
For the Cabin attribute, we'll create an additional column that holds 1 for a passenger whose cabin is recorded and 0 if it is not. In the context of the accident, a known cabin tends to indicate that the passenger survived. To do that, follow the steps below:
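As a rough code equivalent of the indicator column described above, the following plain Python sketch may help. The column name `HasCabin`, the sample rows, and the helper are all hypothetical. The tutorial does not name the new column.

```python
# Hypothetical sample rows: a recorded cabin, an empty string, and a null value.
rows = [{"Cabin": "C85"}, {"Cabin": ""}, {"Cabin": None}]


def add_cabin_flag(rows, flag_name="HasCabin"):
    """Return new rows with a 1/0 column telling whether Cabin is recorded."""
    return [{**row, flag_name: 1 if row["Cabin"] else 0} for row in rows]


flagged_rows = add_cabin_flag(rows)
```

Turning a sparse text column into a binary flag keeps the information that matters (whether a cabin was recorded) without having to guess the missing cabin numbers.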
For the Age attribute, calculate the mean of the column values and use it in place of the null values. To replace missing values with the mean of the column, do the following:
Fill in the required variables in the operation command, like this: Summarize(newVarName=operator(column))
Copy the generated value for later use, then undo the last two actions with the back arrow above, since the filter and the summarized value are no longer needed.
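The mean-replacement idea behind these steps can be sketched in plain Python as follows. The ages are made-up sample values, not figures from the Titanic data set, and the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical Age column; None marks a missing value.
ages = [22.0, None, 38.0, 26.0, None]

# Step 1: compute the mean over the known values only (the "filter" then "summarize" steps).
known_ages = [age for age in ages if age is not None]
mean_age = mean(known_ages)

# Step 2: substitute the mean wherever the value is missing.
filled_ages = [age if age is not None else mean_age for age in ages]
```

Computing the mean over the non-missing values first and substituting it afterwards mirrors the order of the steps above: derive the value, then use it to replace the nulls.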
The only column in the Titanic data set whose values should be unique is the passenger ID. Select the Action menu in the PassengerId column and choose Remove duplicates.
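A minimal Python sketch of duplicate removal on the passenger ID is shown below. It keeps the first row seen for each ID, which is an assumption: the tutorial does not say which duplicate Data Refinery keeps. The sample rows and the helper name are hypothetical.

```python
# Hypothetical sample rows; PassengerId 2 appears twice.
rows = [
    {"PassengerId": 1, "Name": "A"},
    {"PassengerId": 2, "Name": "B"},
    {"PassengerId": 2, "Name": "B (duplicate)"},
    {"PassengerId": 3, "Name": "C"},
]


def remove_duplicates(rows, key="PassengerId"):
    """Return rows with duplicate key values dropped, keeping the first occurrence."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique_rows.append(row)
    return unique_rows


unique_rows = remove_duplicates(rows)
```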
In this tutorial, you learned about the first stage of data science. The outcome of this stage determines the success of further stages. You've also learned how IBM Data Refinery gives you a fast way to do rough data cleaning, with no coding required.