Data, structure, and the data science pipeline

Data is a commodity, but without ways to process it, its value is questionable. Data science is a multidisciplinary field whose goal is to extract value from data in all its forms. This article explores the field of data science through data and its structure as well as the high-level process that you can use to transform data into value.

Data science is a process. That’s not to say it’s mechanical and void of creativity. But, when you dig into the stages of processing data, from munging data sources and data cleansing to machine learning and eventually visualization, you see that unique steps are involved in transforming raw data into insight.

The steps that you use can also vary (see Figure 1). In exploratory data analysis, you might have a cleansed data set that’s ready to import into R, and you visualize your result but don’t deploy the model in a production environment. In another environment, you might be dealing with real-world data and require a process of data merging and cleansing in addition to data scaling and preparation before you can train your machine learning model.

Figure 1. The data science pipeline
Flow from raw to data visualization

Let’s start by digging into the elements of the data science pipeline to understand the process.

Data and its structure

Data comes in many forms, but at a high level, it falls into three categories: structured, semi-structured, and unstructured (see Figure 2). Structured data is highly organized data that exists within a repository such as a database or a comma-separated values (CSV) file. The data is easily accessible, and its format makes it appropriate for queries and computation (by using languages such as Structured Query Language (SQL) or Apache™ Hive™). Unstructured data lacks any content structure at all (for example, an audio stream or natural language text). In the middle is semi-structured data, which can include metadata or semantic tagging that makes it easier to process than unstructured data. This data is not fully structured because the lowest-level contents might still require some processing to be useful.
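To make the distinction concrete, here's a minimal Python sketch (the records are invented for illustration) contrasting a structured CSV source with a semi-structured JSON record, whose tagged fields are easy to extract but whose free-text value is not:

```python
import csv
import io
import json

# Structured: a CSV file has a fixed schema, so every field is directly queryable.
csv_text = "id,name,amount\n1,alpha,10.5\n2,beta,20.0\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total = sum(float(r["amount"]) for r in rows)

# Semi-structured: JSON keys give partial structure, but the "review"
# value is unstructured natural-language text.
record = json.loads('{"product": "P1", "rating": 4, "review": "Works well, ships fast."}')

print(total)              # 30.5
print(record["rating"])   # the tagged metadata is trivial to extract
print(record["review"])   # the free text still needs further processing to be useful
```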

Figure 2. Models of data
Examples of structured, semi-structured, and unstructured data

Structured data is the most useful form of data because it can be immediately manipulated. The rule of thumb is that structured data represents only 20% of total data. Most of the data in the world (80% of available data) is unstructured or semi-structured.

Note that much of what is defined as unstructured data actually has structure (such as a document that has metadata and tags for the content), but the content itself lacks structure and is not immediately usable. Therefore, it is considered unstructured.

Data engineering

A survey in 2016 found that data scientists spend 80% of their time collecting, cleaning, and preparing data for use in machine learning. The remaining 20% they spend mining or modeling data by using machine learning algorithms. Although it’s the least enjoyable part of the process, this data engineering is important and has ramifications for the quality of the results from the machine learning phase.

I split data engineering into three parts: wrangling, cleansing, and preparation. Given the drudgery that is involved in this phase, some call this process data munging.

Data wrangling

Data wrangling, simply defined, is the process of manipulating raw data to make it useful for data analytics or to train a machine learning model. This part of data engineering can include sourcing the data from one or more data sets (in addition to reducing the set to the required data), normalizing the data so that data merged from multiple data sets is consistent, and parsing data into some structure or storage for further use. Consider a public data set from a federal open data website. This data might exist as a spreadsheet file that you would need to export into a format more acceptable to data science languages (CSV or JavaScript Object Notation). The data source might also be a website from which an automated tool scraped the data. Finally, the data could come from multiple sources, which requires that you choose a common format for the resulting data set.

This resulting data set would likely require post-processing to support its import into an analytics application (such as the R Project for Statistical Computing, the GNU Data Language, or Apache Hadoop). Data wrangling, then, is the process by which you identify, collect, merge, and preprocess one or more data sets in preparation for data cleansing.
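As a sketch of this merging step, the following Python example (the city-population sources and column names are hypothetical) normalizes two inconsistently formatted CSV sources into one common record format:

```python
import csv
import io

# Two hypothetical sources with inconsistent column names and delimiters.
source_a = "city,population\nSpringfield,30000\nShelbyville,25000\n"
source_b = "CityName;Pop\nOgdenville;12000\n"

def wrangle(text, delimiter, name_col, pop_col):
    """Normalize one raw source into a common record format."""
    rows = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{"city": r[name_col], "population": int(r[pop_col])} for r in rows]

# Merge the normalized sources into a single data set for cleansing.
merged = (wrangle(source_a, ",", "city", "population")
          + wrangle(source_b, ";", "CityName", "Pop"))

print(len(merged))  # 3
```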

Data cleansing

After you have collected and merged your data set, the next step is cleansing. Data sets in the wild are typically messy and infected with any number of common issues, including missing values (or too many values), bad or incorrect delimiters (the characters that separate fields), inconsistent records, or insufficient parameters. In some cases, the data cannot be repaired and must be removed; in other cases, it can be corrected manually or automatically.

When your data set is syntactically correct, the next step is to ensure that it is semantically correct. In a data set that contains numerical data, you’ll have outliers that require closer inspection. You can discover these outliers through statistical analysis, looking at measures such as the mean and the standard deviation. Searching for outliers is a secondary method of cleansing to ensure that the data is uniform and accurate.
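A minimal Python sketch of both steps, using invented values: drop records with missing values, then flag any value that falls more than two standard deviations from the mean as an outlier:

```python
import statistics

# Hypothetical numeric column with a missing value and an obvious outlier.
raw = [12.1, 11.8, None, 12.4, 250.0, 11.9, 12.0]

# Syntactic cleansing: remove records that cannot be repaired (missing values).
values = [v for v in raw if v is not None]

# Semantic cleansing: flag values more than 2 standard deviations from the mean.
mean = statistics.mean(values)
stdev = statistics.stdev(values)
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
cleaned = [v for v in values if v not in outliers]

print(outliers)  # [250.0]
```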

Data preparation

The final step in data engineering is data preparation (or preprocessing). This step assumes that you have a cleansed data set that might not be ready for processing by a machine learning algorithm. Here are a couple of examples where this preparation could apply.

In some cases, normalization of data can be useful. Using normalization, you transform an input feature to distribute the data evenly into an acceptable range for the machine learning algorithm. This task can be as simple as linear scaling (mapping an arbitrary range, given its minimum and maximum, into a range such as -1.0 to 1.0). You can also apply more complicated statistical approaches. In the context of neural networks, data normalization can help you avoid getting stuck in a local optimum during the training process.
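A simple linear-scaling sketch in Python (the feature values are illustrative):

```python
def linear_scale(values, lo=-1.0, hi=1.0):
    """Linearly rescale values into [lo, hi] using the observed min and max."""
    vmin, vmax = min(values), max(values)
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]

feature = [10.0, 20.0, 30.0, 50.0]
scaled = linear_scale(feature)
print(scaled)  # [-1.0, -0.5, 0.0, 1.0]
```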

Another useful technique in data preparation is the conversion of categorical data into numerical values. Consider a data set that includes a set of symbols that represent a feature (such as {T0..T5}). As a string, this isn’t useful as an input to a neural network, but you can transform it by using a one-of-K scheme (also known as one-hot encoding).

In this scheme (illustrated in Figure 3), you identify the number of symbols for the feature — in this case, six — and then create six features to represent the original field. For each symbol, you set just one feature, which allows a proper representation of the distinct elements of the symbol. You pay the price in increased dimensionality, but in doing so, you provide a feature vector that works better for machine learning algorithms.

Figure 3. Transforming a string into a one-hot vector
Image showing conversion of a string into a vector

An alternative is integer encoding (where T0 could be value 0, T1 value 1, and so on), but this approach can introduce problems in representation. For example, in a real-valued output, what does 0.5 represent?
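The one-of-K transformation itself is only a few lines of Python; this sketch assumes the {T0..T5} symbol set from the example above:

```python
SYMBOLS = ["T0", "T1", "T2", "T3", "T4", "T5"]

def one_hot(symbol, symbols=SYMBOLS):
    """Encode a categorical symbol as a one-of-K (one-hot) vector."""
    vec = [0] * len(symbols)          # one feature per distinct symbol
    vec[symbols.index(symbol)] = 1    # set exactly one feature
    return vec

print(one_hot("T2"))  # [0, 0, 1, 0, 0, 0]
```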

Machine learning

In this phase, you create and validate a machine learning model. Sometimes, the machine learning model is the product, which is deployed in the context of an application to provide some capability (such as classification or prediction). In other cases, the machine learning algorithm is just a means to an end. In these cases, the product isn’t the trained machine learning algorithm but rather the data that it produces.

Model learning

The meat of the data science pipeline is the data processing step. In one model, the algorithm can process the data, with a new data product as the result. But, in a production sense, the machine learning model is the product itself, deployed to provide insight or add value (such as the deployment of a neural network to provide prediction capabilities for an insurance market).

Machine learning approaches are vast and varied, as shown in Figure 4. This small list of machine learning algorithms (segregated by learning model) illustrates the richness of the capabilities that are provided through machine learning.

Figure 4. Machine learning approaches
Machine learning approaches based on data structure

Supervised learning, as the name suggests, is driven by a critic that provides the means to alter the model based on its result. Given a data set with a class (that is, a dependent variable), the algorithm is trained to produce the correct class and alter the model when it fails to do so. The model is trained until it reaches some level of accuracy, at which point you could deploy it to provide prediction for unseen data.
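As an illustrative sketch (not any particular library's API), the classic perceptron shows this critic-driven loop in a few lines of Python, here learning the logical OR function from labeled examples:

```python
import random

# Toy supervised task: each example pairs inputs with a known class (OR).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

random.seed(0)
weights = [random.uniform(-0.5, 0.5) for _ in range(2)]
bias = 0.0
rate = 0.1

def predict(x):
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

# The "critic": compare the output with the known class and alter the model
# whenever the prediction is wrong.
for epoch in range(20):
    for x, target in data:
        error = target - predict(x)
        weights = [w + rate * error * xi for w, xi in zip(weights, x)]
        bias += rate * error

print(all(predict(x) == t for x, t in data))  # True once training converges
```

Because OR is linearly separable, the perceptron's update rule is guaranteed to converge on this data.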

In contrast, unsupervised learning has no class; instead, it inspects the data and groups it based on some structure that is hidden within the data. You could apply these types of algorithms in recommendation systems by grouping customers based on their viewing or purchasing history.

Finally, reinforcement learning occupies a middle ground between these approaches: instead of a label for every example, the model receives a reward after making some number of decisions that lead to a satisfactory result. This type of model is used to create agents that act rationally in some state/action space (such as a poker-playing agent).

Model validation

After a model is trained, how will it behave in production? One way to understand its behavior is through model validation. A common approach to model validation is to reserve a small amount of the available training data to be tested against the final model (called test data). You use the training data to train the machine learning model, and the test data is used when the model is complete to validate how well it generalizes to unseen data (see Figure 5).

Figure 5. Training versus test data for model validation
Two boxes showing the difference between model learning and validation

The construction of a test data set from a training data set can be complicated. A random sampling can work, but it can also be problematic. For example, did the random sample over-sample a given class, or does it provide good coverage over all potential classes of the data or its features? Random sampling with a distribution over the data classes can help you avoid overfitting (that is, training too closely to the training data) or underfitting (that is, failing to model the training data, which also limits the ability to generalize).
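One way to address class coverage is a stratified split: sample the test set from each class separately so that the class distribution is preserved. A sketch in plain Python, with a made-up two-class data set:

```python
import random
from collections import defaultdict

# Hypothetical labeled data set: (features, class) pairs, 10 of each class.
data = [((i, i * 2), "A") for i in range(10)] + [((i, -i), "B") for i in range(10)]

def stratified_split(records, test_fraction=0.2, seed=42):
    """Reserve test data while preserving the class distribution."""
    by_class = defaultdict(list)
    for features, label in records:
        by_class[label].append((features, label))
    rng = random.Random(seed)
    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)                        # random sample within the class
        cut = int(len(group) * test_fraction)     # same fraction from every class
        test.extend(group[:cut])
        train.extend(group[cut:])
    return train, test

train, test = stratified_split(data)
print(len(train), len(test))  # 16 4
```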


Operations

Operations refers to the end goal of the data science pipeline. This goal can be as simple as creating a visualization for your data product to tell a story to some audience or answer some question created before the data set was used to train a model. Or, it could be as complex as deploying the machine learning model in a production environment to operate on unseen data to provide prediction or classification. This section explores both scenarios.

Model deployment

When the product of the machine learning phase is a model that you’ll use against future data, you’re deploying the model into some production environment to apply to new data. This model could be a prediction system that takes as input historical financial data (such as monthly sales and revenue) and provides a classification of whether a company is a reasonable acquisition target.

In scenarios like these, the deployed model is typically no longer learning and is simply applied to new data to make a prediction. There are good reasons to avoid learning in production. In the context of deep learning (neural networks with deep layers), adversarial attacks have been identified that can alter the results of a network. In an image-processing deep learning network, for example, applying a small perturbation to an image can alter the network’s prediction such that instead of “seeing” a tank, it sees a car. Adversarial attacks have grown with the application of deep learning, and new vectors of attack are part of active research.

Model visualization

In smaller-scale data science, the product sought is data and not necessarily the model produced in the machine learning phase. This scenario is the most common form of operations in the data science pipeline, where the model provides the means to produce a data product that answers some question about the original data set. Options for visualization are vast; you can produce them with the R programming language, gnuplot, or D3.js (which can produce highly engaging interactive plots).

You can learn more about visualization in the next article in this series.

Going further

This article explored a generic data pipeline for machine learning that covered data engineering, model learning, and operations. The next article in this series will explore two machine learning models for prediction using public data sets.