by Joshua Allen, Andrew Freed, Swami Chandrasekaran | Updated December 12, 2017 - Published November 15, 2017
Cognitive systems present exciting opportunities for building new kinds of applications with powerful intelligence behind them. These new applications require a new way of thinking about the development process. Traditional application development has been enhanced by the idea of DevOps, which forces operational considerations into development time, execution, and process. In this tutorial, we outline a “cognitive DevOps” process that refines and adapts the best parts of DevOps for new cognitive applications. Specifically, we cover applying DevOps to the training process of cognitive systems including training data, modeling, and performance evaluation.
A cognitive or artificial intelligence (AI) system fundamentally exhibits capabilities such as understanding, reasoning, and learning from data. At a deeper level, the system is built upon a combination of various types of cognitive tasks, which, when combined, make up a part of the overall cognitive application. These tasks include:
The science upon which a cognitive system is built includes, but is not limited to, machine learning (ML) including deep learning and natural language processing. These are individual components that can demonstrate one or more capabilities of a cognitive system (such as understanding, reasoning, learning, and interacting). These cognitive systems leverage both structured and unstructured data from internal, third-party, purchased and open sources, to unearth actionable insights and knowledge.
Unlike structured data, which is easy to organize and sift through in databases, unstructured data has traditionally required humans to understand. Examples of unstructured data include documents written in natural language, audio recordings, images, and social media posts, among many others. Enterprises deal with these types of unstructured data on a daily basis in the form of research reports, loan documents, memos, call center recordings, and product reviews.
These cognitive or AI systems are trained using supervised learning techniques with labeled ground truth created by one or many subject matter experts (SMEs). The ground truth represents the “gold standard” data to which the respective learning algorithms fit or adapt. Creating the ground truth is crucial for both training and testing the cognitive system. Feature engineering also occurs in parallel with ground truth creation. When using a deep learning-based approach or when using and training the platform APIs from IBM Watson, the features are auto-selected for you based on the ground truth.
After the models and systems have been trained and deployed, it does not necessarily mean the job is complete. The cognitive systems must be kept up to date and learning from the new data observations and interactions. In addition to adding more training data, you are likely to modify the code and model your AI system uses. You will create new machine learning features as hypotheses. Some of these hypotheses will work out and some will not. This will be an iterative process requiring some trial and error.
Traditionally, computers have been programmed by explicitly encoding a set of steps for the computer to follow: for example, “If A > 0, then do X.” We call a logical set of steps that perform a specific task an algorithm. Most software has been created in this way.
Encoding this set of steps is similar to teaching humans the procedure to follow to perform a task. For example, you can teach a human to deal cards by telling them to deal one card at a time from the top of the deck, starting at the left and continuing clockwise until all the cards have been dealt. While humans can be taught such steps through verbal commands, computers require their steps to be encoded using a programming language. Although programming languages have become higher level and easier to use over time, they still require humans skilled in software development to implement them.
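As a minimal sketch of what “explicitly encoding a set of steps” looks like, the card-dealing procedure above could be written as follows. The function name `deal_cards`, the 12-card stand-in deck, and the four-player table are illustrative assumptions, not part of the original tutorial.

```python
def deal_cards(deck, num_players):
    """Deal one card at a time, in seat order, until the deck is empty."""
    hands = [[] for _ in range(num_players)]
    for position, card in enumerate(deck):
        # Seat 0 is the player on the dealer's left; later seats follow clockwise.
        hands[position % num_players].append(card)
    return hands


# Hypothetical 12-card deck, listed from the top card down.
deck = list(range(1, 13))
hands = deal_cards(deck, num_players=4)
print(hands[0])  # cards received by the first player
```

Every decision the program makes is spelled out by the developer in advance, which is the contrast with the example-driven approach described next.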
Another approach to teaching computers to perform tasks is to use machine learning. Rather than using a set of steps, ML approaches problem solving by using a set of examples (the ground truth) to train the computer to perform a task. This is similar to the way a human learns to recognize an animal. We can show a young child several pictures of dogs, and he will quickly learn to recognize a dog. In the same way, we can train computers to recognize a dog through a set of examples.
Like traditional programming, machine learning has historically been the realm of computer and data scientists. While some parts of ML are still best left to computer scientists, recent improvements in user interfaces in specific areas of machine learning allow SMEs to train a system instead. An example of such a user interface is the Watson Knowledge Studio, which was designed specifically for use by subject matter experts and linguists (those familiar with language constructs). By using ML, these individuals can collaborate with software engineers to build cognitive systems.
The most important aspect of developing a cognitive application is the availability of training data. The amount of training data that is required to train a cognitive system depends on several factors. Two of the most important factors are the variability of the data and the accuracy level wanted. SMEs are the people who are best suited to create training data because they are the ones most familiar with the subject.
To understand the lifecycle of training AI systems, we will consider the Cross-Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM provides a standardized methodology that we can adopt to create the various types of models that power and make up cognitive systems. The lifecycle model (figure below) consists of six phases with arrows indicating the most important and frequent dependencies between phases. The sequence of the phases is not necessarily strict. The training nuances and steps differ according to the type of AI task or workload, but the fundamentals and the overall phases remain the same.
Another general view of the process looks like the following figure. In this process, we include the additional steps of monitoring and capturing feedback in the cycle. These important steps help us to assess the system and improve it over time.
Machine learning can be categorized into supervised learning and unsupervised learning. The difference is whether the model is given the answers that it must learn to predict. In supervised ML, the training data contains answers (called “labeled” data). This allows the algorithm to learn which combinations of inputs produce which answers. Some of the most widely used supervised learning algorithms include Support Vector Machines, Random Forest, Linear Regression, Logistic Regression, Naive Bayes, and Neural Networks (multilayer perceptron).
In unsupervised ML, the training data is not labeled, and the algorithm is limited to determining the groups of data that are most similar. Even with deep learning, labeled data sets are needed to train the models, but the feature engineering step for the most part is automated.
Reinforcement learning is also becoming a very popular approach, in which the model or the algorithm learns through a feedback system. Reinforcement learning is most widely used in self-driving cars, drones, and other robotics applications.
Let’s consider a system that predicts the sale price of a house from a number of factors. When ML needs to predict an output from a series of input values, a series of inputs and labeled outputs is needed.
A good rule of thumb would be to take the number of input columns, multiply by 50, and provide that many rows as labeled training data. Our example has four inputs (Size, Bedrooms, Bathrooms, Acreage), so for a robust model you would want 200 rows of training data.
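The arithmetic behind this rule of thumb can be sketched in a few lines. The helper name `recommended_rows` is an assumption for illustration. The multiplier of 50 and the four column names come from the text above.

```python
ROWS_PER_INPUT = 50  # rule-of-thumb multiplier quoted in the text


def recommended_rows(input_columns):
    """Return the suggested number of labeled training rows."""
    return ROWS_PER_INPUT * len(input_columns)


house_inputs = ["Size", "Bedrooms", "Bathrooms", "Acreage"]
print(recommended_rows(house_inputs))  # 200
```

Treat the result as a starting estimate only. Highly variable data or a demanding accuracy target may call for more rows.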
Now let’s look at a natural language processing exercise, where you extract certain types of data from plain text. Your training data includes several text snippets, the type of data to be extracted, and the location of that data.
A good rule of thumb here would be to provide 50 positive examples and 50 negative examples for each type as the ground truth. You want enough variability in your training data that your model is able to learn all the patterns you want to extract. Also, the collected and prepared ground truth is usually split in an 80:20 ratio of train to test data (other ratios, including 60:40 and a three-way 70:20:10 split, are not uncommon either). Holding out a greater proportion of the data for testing gives a more reliable validation of model performance, but it leaves less data for the model to learn from. Too little training data can result in underfitting (the algorithm shows low variance but high bias). Too little test data, on the other hand, makes it hard to detect overfitting (the algorithm shows low bias but high variance), because a model that has effectively memorized its training set is checked against only a few unseen examples. Both overfitting and underfitting lead to poor model predictions and performance on new data sets. Hence, the selection of representative ground truth is absolutely critical to training cognitive systems.
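A minimal sketch of an 80:20 split of the ground truth, using only the Python standard library, might look like this. The function name `split_ground_truth`, the fixed seed, and the placeholder examples are assumptions for illustration.

```python
import random


def split_ground_truth(examples, train_fraction=0.8, seed=42):
    """Shuffle the ground truth and split it into train and test sets."""
    shuffled = list(examples)              # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder ground truth: 100 labeled examples.
examples = [f"example-{i}" for i in range(100)]
train, test = split_ground_truth(examples)
print(len(train), len(test))  # 80 20
```

Shuffling before splitting reduces the chance that the test set contains only one kind of example. Fixing the seed means the same rows land in the test set on every run, which keeps measurements comparable across iterations.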
A useful analogy is to think of your AI system as a college student. College students learn subject matter from homework where the answers are in the back of the book. Students work the problems and look up the answers as they go along, while improving their mental models of the subject material. Before midterms, students take practice exams and grade their performance on a separate set of questions, though these questions are generally similar to their homework. Finally, the students take their midterms, on questions they have never seen before. The midterm performance gives the best indication of how well the student can apply his or her knowledge. In this analogy, homework problems are the training set, practice exams are the test set, and the midterm is the blind set.
For a discussion about how much data to get, with examples, see the article “Why does machine learning require so much training data?” For some thoughts on why training data curation takes so long, see “Machine learning is just the tip of the iceberg—5 dangers lurking below the surface.”
The initial process for training an NLP-based system includes the steps depicted in the following figure and listed below. More specifically, the figure depicts the steps involved when training entity extraction models that represent one aspect of a cognitive or AI system, which we discussed earlier.
After analysis, the cycle continues. The output of the analysis is used to determine the next steps, such as revising the type system and adding more data to the training corpus.
A wider view of training the system over time incorporates feedback from users as they use a pilot or production system that typically includes a custom user interface. These steps are shown in the following figure and outlined below.
You know that AI systems get better over time, and you want to be explicit about how much better the system is performing. You need to select the appropriate measure for the AI system. The two primary measurement techniques are accuracy and F-measure (usually F1).
Accuracy is a very simple measurement. Simply stated, it is the number of right answers divided by the number of opportunities. If you ask a system ten questions and it gets nine right, it has 90% accuracy. The benefit of using accuracy is that it is a simple measure that everyone understands. However, it lacks sophistication, which might be needed to fully understand system performance.
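F-measure includes classifications of why the system made the right (or wrong) prediction. Right and wrong predictions are each split into two categories: true positives (TP) and true negatives (TN) for right predictions, and false positives (FP) and false negatives (FN) for wrong ones.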
To compute F-measure, you first compute precision (the accuracy of predictions the system makes as defined by the incidence of false positives) and recall (how many predictable values the system predicts as determined by the incidence of false negatives). F1 is the most common F-measure and is a harmonic mean of precision and recall. Precision and recall have a natural tension: Liberalizing the model usually increases recall at the expense of precision, and tightening the model increases precision at the expense of recall.
Incidentally, accuracy can be rewritten as (TP+TN)/(TP+TN+FP+FN). An additional benefit of F-measure is that it does not include true negatives, which are the easiest and least interesting cases. You should use an F-measure when you want to classify your error types, or when your data has a lot of true negatives. Consider a rare attribute that you want to extract from source text, one that appears in only 1 of 10,000 sentences. By not writing an annotator, you effectively predict that the attribute is absent from every sentence you see, which is a 99.99% accurate prediction. By using F-measure, you see this “do-nothing” annotator has 0% recall and, therefore, has no value.
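The measures discussed above can be sketched directly from the four counts. The function names are assumptions for illustration, as is the convention of returning 0.0 when a denominator is zero. The counts for the “do-nothing” annotator follow from the 1-in-10,000 example in the text.

```python
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    # Assumed convention: no positive predictions means precision 0.0.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # Assumed convention: no actual positives means recall 0.0.
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    # Harmonic mean of precision and recall; true negatives play no part.
    return 2 * p * r / (p + r) if (p + r) else 0.0


# "Do-nothing" annotator: 10,000 sentences, one real occurrence, nothing marked.
tp, fp, fn, tn = 0, 0, 1, 9999
print(accuracy(tp, tn, fp, fn))  # 0.9999
print(recall(tp, fn))            # 0.0
print(f1(tp, fp, fn))            # 0.0
```

The high accuracy comes entirely from the 9,999 true negatives, which is why F-measure is the more informative choice for rare attributes.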
Let’s measure the accuracy and F1 of an annotator that marks the instances of dogs in natural language.
In this example, precision is 33% (rows 2, 3, and 6), recall is 50% (rows 2 and 4), F1 is 40%, and accuracy is 50% (all rows). The detailed breakdown afforded by F1 gives a better indication of where you need to improve the model.
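The calculation can be reproduced with a short sketch. The six-row layout below is an assumption, reconstructed only from the row numbers and percentages quoted above: the annotator marks rows 2, 3, and 6, and dogs are actually mentioned in rows 2 and 4. It is not a copy of the original table.

```python
# Each row is (actual, predicted); True means a dog mention is present / was marked.
rows = [
    (False, False),  # row 1: true negative
    (True, True),    # row 2: true positive
    (False, True),   # row 3: false positive
    (True, False),   # row 4: false negative
    (False, False),  # row 5: true negative
    (False, True),   # row 6: false positive
]

tp = sum(1 for actual, predicted in rows if actual and predicted)
fp = sum(1 for actual, predicted in rows if not actual and predicted)
fn = sum(1 for actual, predicted in rows if actual and not predicted)
tn = sum(1 for actual, predicted in rows if not actual and not predicted)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(rows)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

Computed from unrounded precision, F1 works out to 0.40. Rounding precision to 33% before combining gives roughly 39.8%.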
Now let’s measure the performance of a classification system. Our hypothetical system classifies images as being about dogs, cats, or birds. A traditional way to measure classification performance is a confusion matrix.
In our example, there were 30 images, 10 of each animal. However, the system classified the images as 11 dogs, 18 cats, and 1 unknown.
By reading across the rows, you can measure the system’s precision for each class: Precision(dog) = 7/(7+4+0). Reading across columns yields recall per class: Recall(dog) = 7/(7+3+0).
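A minimal sketch of reading per-class precision and recall from a confusion matrix follows. Only the dog row (7, 4, 0) and the dog column (7, 3, 0) come from the text. The remaining cells are placeholder values chosen so that the row and column totals match the counts quoted above (10 images per animal; 11, 18, and 1 predictions). Treating the single remaining prediction as “bird” is also an assumption.

```python
classes = ["dog", "cat", "bird"]

# confusion[predicted][actual]: rows are predictions, columns are actual animals.
confusion = [
    [7, 4, 0],  # predicted dog (from the text)
    [3, 6, 9],  # predicted cat (first cell from the text, rest are placeholders)
    [0, 0, 1],  # third predicted class (first cell from the text, rest are placeholders)
]


def precision_for(index):
    """Correct predictions for a class divided by all predictions of that class."""
    row = confusion[index]
    return row[index] / sum(row)


def recall_for(index):
    """Correct predictions for a class divided by all actual members of that class."""
    column = [row[index] for row in confusion]
    return column[index] / sum(column)


dog = classes.index("dog")
print(f"Precision(dog) = {precision_for(dog):.2f}")  # 7 / 11
print(f"Recall(dog)    = {recall_for(dog):.2f}")     # 7 / 10
```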
A confusion matrix shows you what kinds of mistakes the system makes and what new training data might be required. Perhaps not surprisingly, the system often confuses cats and dogs for each other, but never for birds, thus more cat and dog pictures are required for training. For a detailed example on measuring accuracy with NLP focus, see “Cognitive system testing: Overall system accuracy testing.”
AI systems learn as they encounter more training data. It is important to test your model early and often to verify that it is extracting insights from your data. You should measure the performance of your system regularly as you add relevant training data. These regular measurements help you determine when model performance is no longer improving, which indicates that you either need to refine your model or have reached a logical stopping point. It is also advisable to go beyond the traditional F1, precision, and recall metrics and map model performance to other key criteria to get a deeper understanding of why the model is performing in a certain way. These could include:
In addition to adding more training data, you are likely to modify the code and model your AI system uses. You will create new ML features as hypotheses. Some of these hypotheses will work out and some will not. By tracking every iteration of your system, you are quickly able to see which revisions work and which ones do not.
Every artifact related to your AI system should be tracked in source control. This includes code, models, and the training data itself. Software developers are very familiar with the virtues of version control, which allows you to know exactly what changes have been introduced into a system and when. These benefits extend to your models and training data. Models will change as new hypotheses are developed, and training data might change as mistakes are discovered and corrected. (That’s right: you are likely to discover that your AI system is suffering because the original data you gave it contains too many errors!)
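One lightweight complement to source control is to record a fingerprint of the exact training data alongside each experiment, so that a result can always be traced back to the data that produced it. The sketch below assumes the training data can be serialized to JSON. The names `fingerprint` and `experiment_log`, the model labels, and the sample records are all hypothetical.

```python
import hashlib
import json


def fingerprint(records):
    """Return a SHA-256 hex digest of the training records."""
    # sort_keys gives the same bytes regardless of dictionary key order.
    canonical = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical training sets: version 2 adds one record to version 1.
training_v1 = [{"text": "The dog barked.", "label": "dog"}]
training_v2 = training_v1 + [{"text": "A cat slept.", "label": "cat"}]

# Hypothetical experiment log tying each model revision to its data.
experiment_log = [
    {"model": "annotator-v1", "data_sha256": fingerprint(training_v1)},
    {"model": "annotator-v2", "data_sha256": fingerprint(training_v2)},
]

for entry in experiment_log:
    print(entry["model"], entry["data_sha256"][:12])
```

Any correction to the training data changes the digest, so two runs that report the same fingerprint are known to have used identical data.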
Here is the accuracy of a Watson Knowledge Studio model, tracked over time as training data was added. When accuracy plateaued from 12,000 – 20,000 words, we knew that the model had learned as much of the variability as it could from the training set and that adding new training data was unlikely to improve the model.
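A plateau like this can also be flagged automatically. The following sketch assumes you record a score after each batch of training data is added. The function name `has_plateaued`, the window size, the minimum-gain threshold, and the history values are illustrative assumptions and are not the measurements from the figure.

```python
def has_plateaued(history, window=3, min_gain=0.01):
    """Return True if the last `window` measurements improved by less than `min_gain`.

    `history` is a list of (training_words, score) pairs ordered by training_words.
    """
    if len(history) <= window:
        return False  # not enough measurements to judge
    recent = [score for _, score in history[-(window + 1):]]
    # Compare the best recent score with the score just before the window.
    return max(recent) - recent[0] < min_gain


# Made-up measurements for illustration only.
history = [
    (4000, 0.55),
    (8000, 0.68),
    (12000, 0.74),
    (14000, 0.742),
    (16000, 0.745),
    (20000, 0.744),
]

print(has_plateaued(history[:4]))  # False: the score is still climbing
print(has_plateaued(history))      # True: recent gains are below the threshold
```

A flagged plateau is a prompt to decide between refining the model and stopping, not a decision in itself.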
In a full-fledged deployment and integration of modeling results, your data modeling work might be ongoing and continuous. For example, if a model is trained and deployed to increase customer retention among high-value customers, it will likely need to be tweaked after a particular level of retention is reached. The model might then be modified and re-used to retain customers at a lower but still profitable level on the value pyramid.
When people encounter new scenarios and situations, they adapt and learn. Cognitive systems must likewise continuously learn and evolve as they are put to use and encounter new observations. If a system does not learn and adapt, then the trained models, and therefore the cognitive system itself, will start degrading soon after they are deployed to production. To support continuous learning, there should be a set of tools, processes, instrumentation, and governance mechanisms in place. In the following figure, we expand on the Deploy through Capture Feedback phases of the overall method to depict the activities that support continuous learning from the resulting feedback, including a governance process that carefully decides what the system must learn to achieve business objectives.
This tutorial has outlined several ways to apply DevOps-style thinking to cognitive or artificial intelligence systems. It can serve as a starting point for your process development and will need to be tweaked for your specific cognitive application. The important takeaway is to remember to continuously measure your cognitive system and to treat all of its components as first-class artifacts in your development process. DevOps thinking has been a wonderful boon to traditional application development, and your cognitive systems deserve the same boost!