Pay attention to the data to get the most out of artificial intelligence, machine learning, and cognitive computing

When you think about artificial intelligence (AI), what comes to mind? You probably think of broad categories of techniques, such as rule-based expert systems, machine learning, and natural language processing. Perhaps you have enough of a background to go a bit deeper, in which case you probably think of forward or backward chaining, neural networks, Bayesian logic, clustering, and classifier systems.

If the emphasis of your thinking about AI is along the lines I’ve stated, then you’re far from alone. Most people, from complete computer amateurs to skilled developers, think of the algorithms and mechanisms when they think of AI. These are of course very important, but they leave out a crucial factor that separates successful from unsuccessful AI projects: the data. AI requires data, usually a great deal of data, and ideally a great deal of high-quality data. One of the reasons AI has started to gain more mainstream success, and that we’re pushing into the age of cognitive computing, is that the web makes it much easier to find large bodies of data.

This should not be surprising when you consider that AI techniques originated as ways to translate modes of human perception into computer processes. Think of how much input data a child is exposed to as he or she develops such perceptions. How many objects they see before they can safely navigate the world’s environments by sight. How many utterances they have heard before they can speak. How much text they have seen before they can read. How much education with how many lessons, examples, and reinforcing tests they have to work through before they are an expert in any topic. This vast body of information that humans absorb over time is key to our own intelligence, and the same is true of artificial intelligence.

Almost everything we admire about human intelligence comes down to pattern matching: a search through possible perceptions to determine which best corresponds to the stimulus at hand. It all comes down to our neuron-based ability to perform staggeringly complicated statistical analyses in very little time. It’s good to keep in perspective that there really isn’t any special dividing line between statistical methods and AI. AI is just a way to apply what we have figured out about the clever optimizations that human evolution produced for applying statistical methods to pattern matching.

In the early days of computing the ideas for how computers should do things were paramount, and the pioneers of AI such as Alan Turing, John von Neumann, and John McCarthy focused on how to create the programs that could lead to AI. They came up with great ideas, many of which are still the basis of AI today, but they quickly found that bold predictions that powerful AI lay right around the corner were overstated. This caused serious problems for the credibility of their research and an oft-repeated pattern was set, where someone would come up with a new set of promising techniques and when these didn’t work as well as expected, an “AI winter” of skepticism would follow. To avoid this problem in your own work, it’s important to be well versed in the role of data in AI. This tutorial puts together the picture for you.

Historical limitations

In retrospect, we understand that one reason for the difficulties encountered by the early AI pioneers was the difficulty of funneling large quantities of data into their clever algorithms. Computer random access memory (RAM) was called “core” for a long time, and still is sometimes, because it originated as tiny magnetized rings, or cores, connected through circuitry. This required elaborate engineering, and even after the revolution of semiconductor chips, RAM remained a luxury. It’s only been about 30 years since the infamous proclamation, widely attributed to Bill Gates though he has denied making it, that “640 K of memory ought to be enough for anyone.” It must seem quaint now that many of us expect our computers to come with no less than 8 GB, more than 13 thousand times that figure.

Modest availability of RAM originally meant that AI had to do most of its work on narrow slices of information that were recombined into broader answers. Such approaches hampered efficiency. The next problem was in storing large quantities of data representing things such as training corpora for machine learning. AI pioneers had to deal with very limited systems such as punched cards, and even after the advent of magnetic drives, storage remained at a premium until very recently. The following images show the trend of costs for memory and hard storage over the decades.

Trend of cost reduction for memory over decades

Trend of cost reduction for hard storage over decades

The success and vast scale of the web have come as the costs of RAM and durable storage hit previously unimaginable lows. This provides a great deal of incentive for people and institutions to digitize vast quantities of data representing human activity. Thanks to movements toward open availability of data, such resources are available for developers to tap to a degree that makes AI increasingly practical.

Starting with perceptrons

One of the first machine learning algorithms was the perceptron, a system for implementing binary classifiers. A binary classifier is a function that takes an input and determines whether it is or is not a member of some class. Examples include a function that takes a set of characteristics of an object and determines whether it is a boat or that takes a spectral analysis of an audio file and determines whether it is the sound of a guitar. These answers (“boat,” “non-boat,” “guitar,” and “non-guitar”) from the classification are called labels.

The first perceptron was developed for recognizing images and implemented entirely in hardware. It was trained by exposure to a training set of images, each of which had an expected output, the label. You could feed it some images and their labels, in which case your process is what is known as supervised learning.
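To make the idea of supervised training concrete, here is a minimal software sketch of a perceptron. It is not a reconstruction of the original hardware. The two features (`floats_on_water`, `has_wheels`), the four training samples, and the function names are all hypothetical, chosen only so that the example is small enough to follow by hand.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Train one perceptron with the classic error-driven update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for features, label in zip(samples, labels):
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            predicted = 1 if activation > 0 else 0
            delta = label - predicted  # 0 when the prediction is already right
            if delta != 0:
                errors += 1
                weights = [w + learning_rate * delta * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * delta
        if errors == 0:  # a full pass with no mistakes: training is done
            break
    return weights, bias


def predict(weights, bias, features):
    """Return 1 ("boat") or 0 ("non-boat") for one feature vector."""
    activation = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if activation > 0 else 0


# Hypothetical training set: (floats_on_water, has_wheels) -> label
samples = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 0, 0, 0]  # only the floating, wheel-less object is a boat

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)
```

The labels drive every weight update, which is what makes this supervised learning: without the expected output for each sample, the update rule would have nothing to correct against.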

There are also machine learning techniques where the labels are not provided in the training set and the algorithm has to use sophisticated statistical techniques to identify clusters that would indicate the resultant labels. This is known as unsupervised learning, and is a more advanced topic. Considerations of data are similar in either case, so for simplicity most of the discussion in this tutorial uses supervised machine learning as its example. The broad considerations are similar for many other AI algorithms as well.

Here is the tricky bit about perceptrons: all those images in the training set are data. You might not immediately think of them as such because they were originally managed in analog form. For example, the images might be a stack of photo prints of things to be classified as boats or not-boats, carefully annotated with the expected perceptron labels. These would not have taken up RAM or disk storage at the time, but they are data all the same. That they were not stored digitally was an accident of the situation’s practicalities.

Modern machine learning and training data

These days, RAM and storage have become cheap, and what’s more, there are many billions of images on the web that could be used as a training set for a visually perceptive neural network. The lowered cost of high-volume data has led to an increased availability of high-volume data, and in most modern AI apps, you will be using data obtained in some form or another from the web.

The trick is understanding that if you just acquire a billion JPGs of boats and assorted non-boat items, you won’t end up with a proper training set. Control and quality are important. Do you need to clean up noise in the images? Do you need to change the zooming or boost resolution? If there are many pictures that are not of boats, but are fed to the neural network annotated as if they were of boats, the resulting neural network will be a poorer classifier than it could be.

Having the right algorithms is important for neural networks, but preparing a training set is even more important, and in many ways harder. You could get a panel of trusted and highly trained experts to pore over your billion JPGs and label them for training, but this would obviously be expensive and time-consuming. You could crowd-source the effort by having many less sophisticated people classify them in some way, and this might get you a higher volume of training data at lower cost, but you might be less certain of the quality. You might have a higher proportion of non-boats that are annotated as boats.

After you have one AI algorithm you trust above a certain degree of confidence, you could use it to prepare the training corpus for another program. For example, if you are confident that neural network A is good at classifying boats, you could run your raw collection of images through it, and use its classifications as annotations to turn that collection into a proper training set for neural network B. This seems like it could give you an attractive combination of volume and quality. The main problems here are the bootstrapping problem and the error feed-forward problem.

The bootstrapping problem comes from the fact that neural networks tend to be pretty rigidly adapted for a narrow purpose. You couldn’t use a good airplane classifier to develop a training set for a good boat classifier, so you’re always needing to prepare new training corpora from scratch.

The feed-forward problem is that neural network A might not be as good as you think it is at classifying boats. Perhaps its weaknesses are hidden in the high volume of work it’s able to do. If so, it will tend to transmit its weaknesses to neural network B. This error feed-forward is actually one of the biggest problems in the application of AI. In the more generalized sense, it can manifest whenever multiple automated systems work in any sort of coordination, and perceptive weaknesses can often be amplified. As I’ll discuss later, such weaknesses can have serious social and economic costs.
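Error feed-forward is easy to see in a deliberately tiny model. In the sketch below, everything is hypothetical: each "image" is reduced to a single number, the true rule is that anything measuring 50 or more is a boat, and classifier A has a flawed cut-off of 60. Classifier B is a stand-in for a second network, "trained" only on A's labels.

```python
def true_label(length):
    """Ground truth in this toy world: 50 or longer is a boat."""
    return length >= 50


def classifier_a(length):
    """Flawed classifier A: its cut-off is too high, so it misses 50-59."""
    return length >= 60


def train_threshold(values, labels):
    """Stand-in for training B: put the cut-off midway between the
    largest example labeled non-boat and the smallest labeled boat."""
    largest_negative = max(v for v, lab in zip(values, labels) if not lab)
    smallest_positive = min(v for v, lab in zip(values, labels) if lab)
    return (largest_negative + smallest_positive) / 2


raw_data = list(range(100))

# A annotates the raw data; B never sees the true labels.
labels_from_a = [classifier_a(v) for v in raw_data]
threshold_b = train_threshold(raw_data, labels_from_a)


def classifier_b(length):
    return length >= threshold_b


errors_a = sum(classifier_a(v) != true_label(v) for v in raw_data)
errors_b = sum(classifier_b(v) != true_label(v) for v in raw_data)
print(threshold_b, errors_a, errors_b)
```

B fits A's annotations perfectly and still gets the same ten items wrong that A did. Checking B against A's labels would never reveal the problem; only an independent, trusted set of labels can.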

The iris flower data set

Recognition directly from images is one sort of machine learning, but it’s more common to find practical uses of machine learning where the inputs are numerical descriptions of features. Rather than have an algorithm work directly on the pixels of an image of a face to identify a person, it could work on a set or vector of numbers representing features such as the size and color of the eyes, nose, mouth, and face outline.

In 1935, botanist Edgar Anderson undertook a field study and published a paper, “The irises of the Gaspé Peninsula.” The paper contained 150 measured observations of features of iris flower samples from three different species (Iris setosa, Iris virginica, and Iris versicolor), specifically the length and the width of the sepals and petals. In 1936, statistician and biologist Ronald Fisher used Anderson’s data set to illustrate statistical classifier techniques that would later feature in computer machine learning.

Iris setosa, Iris virginica, and Iris versicolor

The iris flower data set has become iconic in AI circles and is frequently used as a training set in introducing novice developers to machine learning. It supports supervised learning because Anderson included the iris species (the label) with the feature measurements (the samples).

The iris data set is famous not just because of its early, historical role in machine learning, but also because of the difficulty of the problems it starts to illustrate. For example, Iris setosa is linearly separable from the other two species, but the other two overlap, so no straight line or flat plane cleanly separates them from each other. This is a technical detail, but it led to a very important understanding about the limitations of different types of classifiers. Perceptrons are strictly linear classifiers, so they could work for some aspects of the iris data set but not others.

iris data plot
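The limits of a linear classifier can be shown without any real measurements. The sketch below does not use the iris data. As a stand-in, it takes four hypothetical points and two labelings: one that a straight line can separate (the logical OR pattern) and one that it cannot (the XOR pattern, the textbook smallest case of classes that are not linearly separable). It then tries every line on a coarse grid of weights, which is a brute-force illustration rather than a training algorithm.

```python
from itertools import product


def best_linear_accuracy(samples, labels, grid):
    """Try every line w1*x1 + w2*x2 + bias = 0 on the grid and return
    the largest number of samples any single line classifies correctly."""
    best = 0
    for w1, w2, bias in product(grid, repeat=3):
        correct = sum(
            (1 if w1 * x1 + w2 * x2 + bias > 0 else 0) == label
            for (x1, x2), label in zip(samples, labels)
        )
        best = max(best, correct)
    return best


grid = [i / 2 for i in range(-6, 7)]  # -3.0 to 3.0 in steps of 0.5
points = [(0, 0), (0, 1), (1, 0), (1, 1)]

separable_labels = [0, 1, 1, 1]      # OR pattern: one line is enough
not_separable_labels = [0, 1, 1, 0]  # XOR pattern: no line works

best_separable = best_linear_accuracy(points, separable_labels, grid)
best_not_separable = best_linear_accuracy(points, not_separable_labels, grid)
print(best_separable, "of 4;", best_not_separable, "of 4")
```

No choice of weights gets all four XOR points right, and a finer grid would not change that, because the limitation is in the straight-line decision boundary, not in the search.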

Many improvements in machine learning technology were made possible by the availability of such scientifically sound and statistically well-understood sample data. For example, it was discovered that combining multiple layers of neurons into a network was a way around many of the limitations encountered in applying perceptrons, even with training sets of such high quality as the iris data. Very few computing algorithms are as sensitive to care and feeding as machine learning, and this remains true: even today, experts need data of similarly high quality to get the best out of their machines.

From irises to eyeballs

Neural networks and the data sets that train them have certainly become more sophisticated since the perceptron age, but one of the big reasons for the increasing dominance of AI technology is that more people are channeling the activity of their real-life neurons onto the Internet.

Each time you do a web search, each time you make a social media post or interaction, each time you even click one part of a site rather than another, you are contributing live data about your habits, interests, and preferences to algorithms. Most of these algorithms are not at the top echelons of AI sophistication (straightforward statistical models such as Bayesian methods are very popular among those mining online interactions for marketplaces), but they don’t need to be. Experience shows that the greater the quantity of reliable data that algorithms can access, the less sophistication they require.

Establishing the context for actions is a key part of establishing the value of data that feeds such algorithms. The fact that one person was on a fashion site when they clicked a certain ad while another person was on a sports site becomes a signal incorporated into the corpus. Such application of context is important, though in different ways, when considering other sorts of data that feed AI. For example, if you were able to add information about the geographical location of an iris flower to its petal and sepal measurements, you could probably train an even more accurate classifier.

However, there are limits to this. If you provide too many details per data item, each of which is technically called a dimension, you might fall prey to what’s been called the curse of dimensionality. This is a general term that covers a lot of problems, but the common idea is that when you have too many dimensions, examples you expect to be similar could become so far separated in the space the algorithm is searching that it never organizes them effectively into clusters. One consequence of this is summed up by the thought that as you add dimensions, the number of training examples that are required grows exponentially.
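One simple way to see the exponential growth is to count cells. Suppose, purely as an illustrative assumption, that you split each dimension into 10 ranges and want at least one training example in every combination of ranges. The function name `cells_to_cover` and the choice of 10 ranges are mine, not a standard measure.

```python
def cells_to_cover(bins_per_dimension, dimensions):
    """Number of distinct cells when every dimension is split into the
    same number of ranges: bins ** dimensions."""
    return bins_per_dimension ** dimensions


# One example per cell is a very modest goal, yet the count explodes.
for dimensions in (1, 2, 4, 8):
    print(dimensions, "dimensions:", cells_to_cover(10, dimensions), "cells")

four_features = cells_to_cover(10, 4)    # e.g. four measurements per flower
eight_features = cells_to_cover(10, 8)   # doubling the number of features
```

Going from four features to eight does not double the number of cells; it multiplies it by ten thousand. A fixed-size training set spread over that space leaves almost every cell empty, which is why similar examples end up looking isolated.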

A common step in data preparation for AI is to reduce the number of dimensions in the search space by seeking a manifold, which is a lower-dimensional structure within the full space along which the data actually varies and where clustering occurs more readily. One of the most common techniques for doing so is called Principal Component Analysis (PCA).

Reduction of dimensionality from 1024 (image pixels) to 2 (rotation and scale)

On the left side are images of the letter A for visual recognition. If each image is 32 by 32 pixels, and is strictly black and white, it would be naively represented as a vector of 1024 pixels, which is mathematically a 1024-dimensional space. Of course, as a person, when you look at that, you realize instantly that these are really the same image rotated and scaled. You could take the degree of rotation and scale as two separate dimensions, and end up with a reduction of dimensionality from 1024 to 2, as illustrated on the right side. What you are able to do in your mind is a reduction of dimensionality from the full pixel-by-pixel detail to two intrinsic variables of rotation and scale.

Of course, this is a very simplified example, and in practice reduction of dimensionality can get quite tricky, but it comes down as much to a deep understanding of the data itself as to mathematical sophistication.
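For a feel of what PCA does, here is a minimal pure-Python sketch that reduces two dimensions to one. The five data points are hypothetical and chosen to lie close to the line y = 2x. The leading component is found by power iteration on the covariance matrix, which is one simple way to do it; libraries typically use more robust decompositions. Note that PCA only finds straight-line directions, so this toy does not reproduce the rotation-and-scale example above.

```python
import math


def principal_component(points, iterations=100):
    """Return the leading principal direction, the share of total variance
    it explains, and each point's one-dimensional coordinate along it."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[i] for p in points) / n for i in range(dims)]
    centered = [[p[i] - means[i] for i in range(dims)] for p in points]

    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dims)] for i in range(dims)]

    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize; the vector converges to the dominant eigenvector.
    vector = [1.0] * dims
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dims))
                   for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in product))
        vector = [v / norm for v in product]

    # Variance along the component, as a fraction of the total variance.
    variance_along = sum(
        vector[i] * sum(cov[i][j] * vector[j] for j in range(dims))
        for i in range(dims))
    total_variance = sum(cov[i][i] for i in range(dims))

    projected = [sum(r * v for r, v in zip(row, vector)) for row in centered]
    return vector, variance_along / total_variance, projected


# Hypothetical 2-D measurements lying close to the line y = 2x.
points = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
component, explained, projected = principal_component(points)
print(component, explained)
print(projected)
```

Each point is now described by one number instead of two, and almost none of the variation in the data is lost. Whether that discarded remainder is noise or signal is exactly the kind of question that requires understanding the data.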

Monkeys at the typewriter

One of the places from which we best understand context, intuitively or in science and technology, is in natural language. Having a computer understand language, generate language, and even perform language translation has long been one of the main goals of AI. In fact, the first and best known test for AI success, the Turing Test, is passed when the computer can have a convincing conversation (that is, understand language and generate language in response) with a person.

For most of the history of natural language processing (NLP), the relevant branch of AI, the approach focused on algorithms: programming systems of grammar and vocabulary management, all with very little success. More recently, as oceans of data on human language and even audio speech have become available online, the techniques have changed to absorb this data in bulk, and success has followed. We’ve all seen how rapidly chatbots have improved, as have machine translation and the mobile agents on our phones. These are trained using vast corpora of text and speech data, annotated with as much context as possible.

Just to give you an idea of this, one of the oldest curiosities in NLP has been to use matrices of statistical letter correlation to generate what seems like natural text. The idea derives from the old philosophical idea that enough monkeys banging at a typewriter could eventually generate the works of Shakespeare. Of course, the likelihood of this exact scenario is so small that we would not expect it to happen in the entire known life of the universe. However, if we stack the monkey’s typewriter with statistical models observed from language, even though the monkey’s actions are still considered random, it becomes feasible for the monkey to produce engaging communication.
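The loaded typewriter can be sketched in a few lines. The model below records, for every pair of adjacent letters in a training text, which letters followed that pair, and then types by picking randomly among those recorded followers. The function names are hypothetical, the training text is a single line from Hamlet with the punctuation removed, and the random seed is fixed only so that runs are repeatable.

```python
import random
from collections import defaultdict


def build_model(text, order=2):
    """Map each run of `order` characters to the list of characters that
    followed it in the text. Repeats are kept, so frequent followers are
    proportionally more likely to be picked later."""
    model = defaultdict(list)
    for i in range(len(text) - order):
        model[text[i:i + order]].append(text[i + order])
    return model


def generate(model, seed_text, length, rng, order=2):
    """Type up to `length` more characters, each chosen at random from
    the followers observed after the last `order` characters."""
    output = seed_text
    for _ in range(length):
        followers = model.get(output[-order:])
        if not followers:  # context never seen mid-text: stop typing
            break
        output += rng.choice(followers)
    return output


training_text = ("to be or not to be that is the question "
                 "whether tis nobler in the mind to suffer")
model = build_model(training_text)
rng = random.Random(42)  # fixed seed, for repeatable output only
sample = generate(model, "to", 40, rng)
print(sample)
```

The monkey is still choosing at random, but every three-letter stretch it types is one that occurs in the training text. With so little training data the output mostly parrots the source; the variety of what it can say grows with the size of the corpus.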

As you might have anticipated, the point is that building these statistical models requires data, for example, how often certain letters or words occur relative to each other. A common model for this is called the n-gram, where you have a frequency matrix that shows how often in the training data each sequence of letters occurs. For example, a 3-gram from a typical English text would show a very high value for the sequence “ING” and a low one for the sequence “HMF.”

The following image is a chart of the top 20 bigrams (n-grams with n = 2, that is, pairs of adjacent letters) in Shakespeare’s Hamlet, ignoring punctuation and spaces, and treating all letters as lowercase.

Chart of top 20 bigrams in Hamlet
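Counts like those in the chart take very little code to produce. The sketch below applies the same preparation (lowercase everything, drop punctuation and spaces) and counts adjacent letter pairs. The function name `top_ngrams` is hypothetical, and the sample is one line of Hamlet rather than the full play, so its counts will not match the chart.

```python
from collections import Counter


def top_ngrams(text, n=2, count=20):
    """Return the `count` most frequent n-letter sequences in the text,
    ignoring case and everything that is not a letter."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    ngrams = ("".join(letters[i:i + n])
              for i in range(len(letters) - n + 1))
    return Counter(ngrams).most_common(count)


sample_text = "To be, or not to be, that is the question"
bigram_counts = top_ngrams(sample_text)
for bigram, frequency in bigram_counts:
    print(bigram, frequency)
```

Passing `n=3` gives the 3-gram counts discussed earlier. To reproduce the chart itself, you would read the full text of the play from a file and pass it in place of `sample_text`.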

The source and nature of the training data matters. If you take statistical models from the original works of Dante, you would expect the resulting monkey to sound very Italian, with a much smaller chance of producing the works of Shakespeare in a certain amount of time. You would have similar effects if you took the statistics from the Hindu epics and holy books, or from the works of Shakespeare themselves.

If you made the mistake of creating statistical models from all three of these literary traditions, you would probably end up with an unintelligible monkey (or robot, as it would actually be implemented) that takes even longer to produce anything worth communicating. Thus, for this case it’s not enough to just feed the algorithm more and more blind data. You must pay attention to its source and understand how the input affects the output.

The perils of bias

The problem with trying to gather such data with clear context at scale is that it is very hard to gauge the unintended consequences. There have been famous cases where online advertisement targeting based on algorithms has displayed trends that could be interpreted in terms of sexism, racism, or other proscribed biases, landing their owners in hot water in a public relations sense.

There are multiple reasons why this can happen. In some cases, the bias could be inherent in the data that trains the algorithms. This is a common problem more broadly, even in science and policy-making. It is important to guard against such biases and to question all assumptions of the people or systems that gathered the data, but this is also incredibly difficult. Constant vigilance is the only solution.

An even trickier problem emerges when the data misses some human aspect of its context, which creates entirely unintended biases. Something about an advertisement’s presentation on a site frequented by one demographic group could lead to its greater success, which then becomes data that’s codified in the algorithm in a way that seems discriminatory. Again, there is no solution except constant vigilance and review of the patterns that have emerged over time, because humans are still better detectors of problems in social context than algorithms are.

There are again interesting parallels in NLP. Microsoft famously unveiled an AI bot on Twitter, and though its algorithms were not revealed, it became clear that it used the corpus of users’ interactions with it to develop its models of language. After a barrage of unsavory comments by users, it soon adapted to making virulently anti-social posts of its own, and before long the creators had to take it offline.

You do not want to be the developer who is lauded on day one for releasing a powerful new AI application, only to be excoriated on day six, or even lose your job, after that application turns out to have absorbed and then exhibited enough biases to cause your employers a PR calamity.


You have learned how important data is to the creation of artificial intelligence and cognitive applications, and how this importance has been constant throughout the history of the discipline and connected to its historical successes and failures. You have also learned how the vast amount of digitized data available online has revived AI from its many winter periods and pushed its achievements to the mainstream in ways that affect many lives and much of business. The current period of success and excitement comes with its own perils, as anomalies and unforeseen bias in the data that feeds AI can lead to terrible effects, and to social and even regulatory backlash.

This might all be interesting and useful background, but how is the software development lifecycle (SDLC) applied to AI data in practice? I’ll answer this question in the next tutorial, going into detail about the SDLC and illustrating how you can most effectively apply it to address the issues I’ve discussed so far.