by Sana Mushtaq | Published June 14, 2019
Today's data is more likely to contain anomalies because of its enormous size and its origin in heterogeneous sources. Given that high-quality data leads to better models and predictions, data preprocessing has become vital, and it is the fundamental step in the data science/machine learning/AI pipeline. In this article, we'll talk about the need to preprocess data and discuss different approaches to each step in the process.
While gathering data, one might come across three main factors that contribute to the quality of data:

Accuracy: erroneous values that deviate from what is expected. Inaccurate data can have many causes, such as human or computer error at data entry, faulty collection instruments, and errors in transmission.

Completeness: missing attribute/feature values or values of interest. A dataset might be incomplete because, for example, values were unavailable when the data was collected or relevant attributes were never recorded.

Consistency: aggregated data that disagrees across sources or records, such as the same entity appearing with conflicting values.
Other characteristics that affect data quality include timeliness (the data is incomplete until all relevant information has been submitted, so it may lag behind reality for a period), believability (how much the data is trusted by its users), and interpretability (how easily the data is understood by all stakeholders).
To ensure high-quality data, it's crucial to preprocess it. To make the process easier, data preprocessing is divided into four stages: data cleaning, data integration, data reduction, and data transformation.
Data cleaning refers to techniques for 'cleaning' data by removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data. Many techniques exist for each of these tasks, and the right choice depends on the user's preference and the problem at hand. Below, each task is explained in terms of the techniques used to address it.
Multiple approaches can be used to deal with missing data: the affected tuples can be ignored (dropped), values can be filled in manually, or missing entries can be substituted with a global constant or with a measure of central tendency such as the attribute's mean or median.
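As a minimal sketch, the snippet below applies three of these strategies with pandas; the DataFrame and its column names are hypothetical stand-ins for a real dataset.

```python
# Hypothetical data: common strategies for handling missing values in pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, np.nan, 47, 51, np.nan],
    "income": [50000, 64000, np.nan, 120000, 58000],
})

# Option 1: ignore (drop) any tuple that contains a missing value
dropped = df.dropna()

# Option 2: substitute a global constant such as -1
constant_filled = df.fillna(-1)

# Option 3: impute each attribute's mean
mean_filled = df.fillna(df.mean())

print(mean_filled)
```

Which option is appropriate depends on how much data is missing and on whether the missingness itself carries information.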
Noise is defined as random variance in a measured variable. For numeric values, boxplots and scatter plots can be used to identify outliers. To deal with these anomalous values, data-smoothing techniques are applied; common choices include binning, regression, and outlier analysis.
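As one illustration, here is a small sketch of smoothing by bin means using pandas; the numeric values are an illustrative sample of the kind typically used to demonstrate binning.

```python
# Smooth noisy numeric values by equal-frequency binning:
# each value is replaced by the mean of its bin.
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])

bins = pd.qcut(values, q=3, labels=False)          # 3 equal-frequency bins
smoothed = values.groupby(bins).transform("mean")  # replace with bin means

print(smoothed.tolist())
```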
Since data is collected from multiple sources, data integration has become a vital part of the process. Integration may introduce redundant and inconsistent data, which in turn degrades the accuracy and speed of the resulting model. To deal with these issues and maintain data integrity, approaches such as tuple-duplication detection and data-conflict detection are used; a common redundancy check is illustrated below.
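For instance, redundancy between numeric attributes can be flagged with a simple correlation analysis. The sketch below uses pandas on a hypothetical merged table whose column names are made up for illustration.

```python
# Hypothetical merged table: flag redundant attributes via Pearson correlation.
import pandas as pd

merged = pd.DataFrame({
    "height_cm": [160, 172, 181, 168, 190],
    "height_in": [63.0, 67.7, 71.3, 66.1, 74.8],
    "weight_kg": [55, 70, 82, 64, 95],
})

# Exact duplicate tuples can be detected directly
print(merged.duplicated().sum())

# Coefficients near +/-1 suggest two attributes encode the same information,
# so one of the pair can usually be dropped (height_cm vs. height_in here).
print(merged.corr())
```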
The purpose of data reduction is to produce a condensed representation of the dataset that is smaller in volume while maintaining the integrity of the original, yielding similar results more efficiently. Methods to reduce the volume of data include dimensionality reduction, numerosity reduction, and data compression.
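As a sketch of dimensionality reduction, the example below applies scikit-learn's PCA to synthetic data built from three underlying factors; the data and the variance threshold are illustrative only.

```python
# Reduce 10 observed features to the components explaining 95% of variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))       # 3 hidden factors
X = latent @ rng.normal(size=(3, 10))    # embedded in 10 observed features

pca = PCA(n_components=0.95)             # keep 95% of the variance
X_reduced = pca.fit_transform(X)

# Far fewer than 10 components survive, since the data has rank 3
print(X.shape, "->", X_reduced.shape)
```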
The final step of data preprocessing is transforming the data into a form appropriate for data modeling. Strategies that enable data transformation include smoothing, attribute construction, aggregation, normalization, and discretization.
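As a minimal sketch, two widely used normalization transforms are shown below with NumPy; the feature values are made up for illustration.

```python
# Min-max and z-score normalization of a single numeric feature.
import numpy as np

x = np.array([12.0, 45.0, 7.0, 32.0, 26.0])

# Min-max: rescale values to the [0, 1] range
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score: center to zero mean and scale to unit standard deviation
z_score = (x - x.mean()) / x.std()

print(min_max)
print(z_score)
```

Min-max normalization preserves the shape of the distribution within a fixed range, while z-score normalization is less sensitive to the sample's exact minimum and maximum.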
Despite the many approaches available for preprocessing data, it remains an actively researched field because of the amount of incoherent data being generated each day. To help, IBM Cloud provides a platform for data scientists called IBM Watson Studio, which includes services that let data scientists preprocess data through drag-and-drop tools in addition to the conventional method of programming. To explore Watson Studio and how it can help with the data science lifecycle, visit the IBM Watson Studio page.