The probability of encountering anomalous data has increased due to the enormous size of today's datasets and their origin in heterogeneous sources. Given that high-quality data leads to better models and predictions, data preprocessing has become vital, and it is the fundamental step in the data science/machine learning/AI pipeline. In this article, we'll talk about the need to preprocess data and discuss the different approaches to each step in the process.
While gathering data, one might come across three main factors that would contribute to the quality of data:
Accuracy: Erroneous values that deviate from the expected. The causes of inaccurate data are varied, and include:
- Human/computer errors during data entry and transmission
- Users deliberately submitting incorrect values (called disguised missing data)
- Incorrect formats for input fields
- Duplication of training examples
Completeness: Lacking attribute/feature values or values of interest. The dataset might be incomplete due to:
- Unavailability of data
- Deletion of inconsistent data
- Deletion of data deemed irrelevant initially
Consistency: Aggregated data is inconsistent, e.g., the same attribute is recorded in different formats or units across sources.
Some other features that also affect data quality include timeliness (the data is incomplete until all relevant information is submitted after certain time periods), believability (how much the data is trusted by the user), and interpretability (how easily the data is understood by all stakeholders).
To ensure high quality data, it’s crucial to preprocess it. To make the process easier, data preprocessing is divided into four stages: data cleaning, data integration, data reduction, and data transformation.
Data cleaning refers to techniques for 'cleaning' data by removing outliers, replacing missing values, smoothing noisy data, and correcting inconsistent data. Many techniques are used to perform each of these tasks, and each technique is specific to the user's preference or problem set. Below, each task is explained in terms of the techniques used to accomplish it.
In order to deal with missing data, multiple approaches can be used. Let’s look at each of them.
- Removing the training example: You can ignore the training example if the output label is missing (in the case of a classification problem). This approach is usually discouraged because it leads to loss of data: the removed example's other attribute values could have added value to the data set as well.
- Filling in the missing value manually: This approach is time-consuming and not recommended for huge data sets.
- Using a standard value to replace the missing value: The missing value can be replaced by a global constant such as ‘N/A’ or ‘Unknown’. This is a simple approach, but not foolproof.
- Using central tendency (mean, median, mode) for the attribute to replace the missing value: Based on the data distribution, the mean (for a normal distribution) or the median (for a non-normal distribution) can be used to fill in the missing value.
- Using central tendency (mean, median, mode) for the attribute within the same class to replace the missing value: This is the same as the previous method, except that the measures of central tendency are computed separately for each class.
- Using the most probable value to fill in the missing value: Using algorithms like regression and decision tree, the missing values can be predicted and replaced.
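As a minimal sketch, several of these strategies map directly onto pandas operations; the data frame, column names, and values below are invented for illustration:

```python
# Sketch of common missing-value strategies using pandas on a made-up dataset.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class": ["a", "a", "b", "b", "b"],
    "height": [1.6, np.nan, 1.8, 1.7, np.nan],
})

# Standard value: replace missing values with a global constant
const_fill = df["height"].fillna(-1)

# Central tendency: fill with the mean of the whole attribute
mean_fill = df["height"].fillna(df["height"].mean())

# Class-specific central tendency: fill with the mean within each class
class_fill = df.groupby("class")["height"].transform(lambda s: s.fillna(s.mean()))

print(mean_fill.tolist())
print(class_fill.tolist())
```

Note that the class-specific fill gives different replacements for the two missing heights, since each is imputed from its own class.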
Noise is defined as a random variance in a measured variable. For numeric values, boxplots and scatter plots can be used to identify outliers. To deal with these anomalous values, data smoothing techniques are applied, which are described below.
- Binning: In binning, the sorted values are first divided into 'bins', and each value is then smoothed using the values around it. There are various approaches to binning. Two of them are smoothing by bin means, where each value in a bin is replaced by the mean of the bin's values, and smoothing by bin medians, where each value in a bin is replaced by the median of the bin's values.
- Regression: Linear regression and multiple linear regression can be used to smooth the data, where the values are conformed to a function.
- Outlier analysis: Approaches such as clustering can be used to detect outliers and deal with them.
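Smoothing by bin means and bin medians can be sketched with NumPy; the values and bin count below are illustrative, and the bins are equal-frequency:

```python
# Sketch of smoothing a sorted attribute by bin means and bin medians.
import numpy as np

values = np.array([4, 8, 9, 15, 21, 21, 24, 25, 28])  # already sorted
n_bins = 3

# Equal-frequency bins: three values per bin
bins = values.reshape(n_bins, -1)

# Smoothing by bin means: replace each value with its bin's mean
smoothed_means = np.repeat(bins.mean(axis=1), bins.shape[1])

# Smoothing by bin medians: replace each value with its bin's median
smoothed_medians = np.repeat(np.median(bins, axis=1), bins.shape[1])

print(smoothed_means)
print(smoothed_medians)
```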
Since data is collected from multiple sources, data integration has become a vital part of the process. Integration may introduce redundant and inconsistent data, which can reduce the accuracy and speed of the data model. To deal with these issues and maintain data integrity, approaches such as tuple duplication detection and data conflict detection are used. The most common approaches to integrating data are explained below.
- Data consolidation: The data is physically brought together into one data store. This usually involves data warehousing.
- Data propagation: Copying data from one location to another using applications is called data propagation. It can be synchronous or asynchronous and is event-driven.
- Data virtualization: An interface is used to provide a real-time and unified view of data from multiple sources. The data can be viewed from a single point of access.
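A toy sketch of consolidation with duplicate-tuple detection, using pandas; the two source tables and their columns are made up for illustration:

```python
# Sketch of consolidating two sources into one table and dropping duplicate tuples.
import pandas as pd

source_a = pd.DataFrame({"id": [1, 2], "city": ["Paris", "Lahore"]})
source_b = pd.DataFrame({"id": [2, 3], "city": ["Lahore", "Tokyo"]})

# Consolidate into a single table
combined = pd.concat([source_a, source_b], ignore_index=True)

# Tuple duplication detection: remove rows that appear in both sources
deduped = combined.drop_duplicates().reset_index(drop=True)

print(len(combined), len(deduped))  # 4 rows before deduplication, 3 after
```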
The purpose of data reduction is to obtain a condensed representation of the data set that is smaller in volume while maintaining the integrity of the original. This makes processing more efficient while producing nearly identical analytical results. A few methods to reduce the volume of data are:
- Missing values ratio: Attributes that have more missing values than a threshold are removed.
- Low variance filter: Normalized attributes whose variance falls below a threshold are also removed, since little change in the data means little information.
- High correlation filter: Normalized attributes whose correlation coefficient with another attribute exceeds a threshold are also removed, since similar trends mean similar information is carried. Correlation is usually measured using statistics such as Pearson's correlation coefficient (for numeric attributes) or the chi-square test (for nominal attributes).
- Principal component analysis: Principal component analysis, or PCA, is a statistical method that reduces the number of attributes by lumping highly correlated attributes together. The initial features are transformed into principal components, ordered by decreasing variance, where each component is uncorrelated with the preceding ones. This method, however, only works for features with numeric values.
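The low variance filter, high correlation filter, and PCA can be sketched together with NumPy on a synthetic dataset. The thresholds and data below are arbitrary, and PCA is implemented here via eigendecomposition of the covariance matrix for illustration:

```python
# Sketch of three data-reduction methods on a synthetic numeric dataset.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 4] = 0.001 * rng.normal(size=100)           # near-constant attribute
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=100)  # highly correlated pair

# Low variance filter: drop attributes with variance below a threshold
keep = X.var(axis=0) > 0.01
X_var = X[:, keep]

# High correlation filter: flag attribute pairs with |r| above a threshold
corr = np.corrcoef(X_var, rowvar=False)
high_corr = np.argwhere(np.triu(np.abs(corr) > 0.9, k=1))

# PCA: keep enough components to explain 95% of the variance
Xc = X_var - X_var.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]                # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = np.cumsum(eigvals) / eigvals.sum()
k = int(np.searchsorted(explained, 0.95)) + 1
X_pca = Xc @ eigvecs[:, :k]

print(X_var.shape, high_corr.tolist(), X_pca.shape)
```

Here the near-constant fifth column is dropped by the variance filter, the correlated pair is flagged by the correlation filter, and PCA folds that pair into a single component.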
The final step of data preprocessing is transforming the data into a form appropriate for data modeling. Strategies that enable data transformation include:
- Attribute/feature construction: New attributes are constructed from the given set of attributes.
- Aggregation: Summary and aggregation operations are applied to the given set of attributes to come up with new attributes.
- Normalization: The data in each attribute is scaled to a smaller range, e.g., 0 to 1 or -1 to 1.
- Discretization: Raw values of the numeric attributes are replaced by discrete or conceptual intervals, which can in turn be further organized into higher-level intervals.
- Concept hierarchy generation for nominal data: Values for nominal data are generalized to higher order concepts.
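Normalization and discretization can be sketched with pandas; the age values, bin edges, and interval labels below are invented for illustration:

```python
# Sketch of min-max normalization and discretization of a numeric attribute.
import pandas as pd

ages = pd.Series([3, 17, 25, 40, 68])

# Normalization: scale the attribute to the range 0 to 1
normalized = (ages - ages.min()) / (ages.max() - ages.min())

# Discretization: replace raw ages with conceptual intervals
labels = pd.cut(ages, bins=[0, 12, 19, 65, 120],
                labels=["child", "teen", "adult", "senior"])

print(normalized.tolist())
print(labels.tolist())
```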
Despite the many approaches available for preprocessing data, it remains an actively researched field due to the amount of incoherent data being generated each day. To help, IBM Cloud provides a platform for data scientists called IBM Watson Studio, which includes services that enable data scientists to preprocess data using drag-and-drop tools, in addition to the conventional method of programming. To explore more about Watson Studio and how it can help with the data science lifecycle, please visit.