Cognitive systems present exciting opportunities for building new kinds of applications with powerful intelligence behind them. These new applications require a new way of thinking about the development process. Traditional application development has been enhanced by the idea of DevOps, which forces operational considerations into development time, execution, and process. In this tutorial, we outline a “cognitive DevOps” process that refines and adapts the best parts of DevOps for new cognitive applications. Specifically, we cover applying DevOps to the training process of cognitive systems including training data, modeling, and performance evaluation.

Types of cognitive tasks

A cognitive or artificial intelligence (AI) system fundamentally exhibits capabilities such as understanding, reasoning, and learning from data. At a deeper level, the system is built from various types of cognitive tasks that together make up part of the overall cognitive application. These tasks include:

  • Entity extraction
  • Passage retrieval
  • Text classification
  • Tone and emotion detection
  • Extraction of knowledge
  • Language translation
  • Speech transcription
  • Computer vision

    Figure 1. Cognitive tasks
    Components & ingredients of an AI system

The science upon which a cognitive system is built includes, but is not limited to, machine learning (ML) including deep learning and natural language processing. These are individual components that can demonstrate one or more capabilities of a cognitive system (such as understanding, reasoning, learning, and interacting). These cognitive systems leverage both structured and unstructured data from internal, third-party, purchased and open sources, to unearth actionable insights and knowledge.

Unlike structured data, which is easy to organize and sift through in databases, unstructured data has traditionally required humans to understand. Examples of unstructured data include documents written in natural language, audio recordings, images, and social media posts, among many others. Enterprises deal with these types of unstructured data daily, in the form of research reports, loan documents, memos, call center recordings, and product reviews.

Figure 2. Various types of unstructured data within and outside an enterprise
Different types of unstructured data used by AI systems

These cognitive or AI systems are trained using supervised learning techniques with labeled ground truth created by one or more subject matter experts (SMEs). The ground truth represents the “gold standard” data to which the respective learning algorithms fit or adapt. Creating the ground truth is crucial for both training and testing the cognitive system. Feature engineering occurs in parallel with ground truth creation. When using a deep learning-based approach or when using and training the platform APIs from IBM Watson, the features are auto-selected for you based on the ground truth.

Training and deploying the models and systems does not mean that the job is complete. The cognitive systems must be kept up to date and must keep learning from new observations and interactions. In addition to adding more training data, you are likely to modify the code and model your AI system uses. You will create new machine learning features as hypotheses. Some of these hypotheses will work out and some will not. This will be an iterative process requiring some trial and error.

Machine learning

Traditionally, computers have been programmed by explicitly encoding a set of steps for the computer to follow: for example, “If A > 0, then do X.” We call a logical set of steps that perform a specific task an algorithm. Most software has been created in this way.

Encoding this set of steps is similar to teaching humans the procedure to follow to perform a task. For example, you can teach a human to deal cards, telling them to deal one card at a time from the top of the deck starting at the left and following clockwise until all the cards have been dealt. While humans can be taught such steps through verbal commands, computers require their steps to be encoded using a programming language. Although programming languages have become higher level and easier to use over time, they still require humans skilled in software development to implement them.

Another approach to teaching computers to perform tasks is to use machine learning. Rather than using a set of steps, ML approaches problem solving by using a set of examples (the ground truth) to train the computer to perform a task. This is similar to the way a human learns to recognize an animal. We can show a young child several pictures of dogs, and he will quickly learn to recognize a dog. In the same way, we can train computers to recognize a dog through a set of examples.

Like traditional programming, machine learning has historically been the realm of computer and data scientists. While some parts of ML are still best left to computer scientists, recent improvements in user interfaces in specific areas of machine learning allow SMEs to train a system instead. An example of such a user interface is Watson Knowledge Studio, which was designed specifically for use by subject matter experts and linguists (those familiar with language constructs). By using ML, these individuals can collaborate with software engineers to build cognitive systems.

The most important aspect of developing a cognitive application is the availability of training data. The amount of training data that is required to train a cognitive system depends on several factors. Two of the most important factors are the variability of the data and the accuracy level wanted. SMEs are the people who are best suited to create training data because they are the ones most familiar with the subject.

The training lifecycle

To understand the lifecycle of training AI systems, we will consider the Cross-Industry Standard Process for Data Mining (CRISP-DM). CRISP-DM provides a standardized methodology that we can adopt for creating the various types of models that power and make up cognitive systems. The lifecycle model (figure below) consists of six phases with arrows indicating the most important and frequent dependencies between phases. The sequence of the phases is not necessarily strict. The training nuances and steps differ according to the type of AI task or workload, but the fundamentals and the overall phases remain the same.

Figure 3. The CRISP-DM lifecycle model
Graphic showing the CRISP-DM lifecycle model

Another general view of the process looks like the following figure. In this process, we include the additional steps of monitoring and capturing feedback in the cycle. These important steps help us to assess the system and improve it over time.

Figure 4. The CRISP-DM lifecycle model showing additional steps
Graphic showing the CRISP-DM lifecycle model with additional steps

Training data

Machine learning can be categorized into supervised learning and unsupervised learning. The difference is whether the model is given the answers that it must learn to predict. In supervised ML, the training data contains answers (called “labeled” data). This allows the algorithm to predict the combinations of inputs that produce certain answers. Some of the most widely used supervised learning algorithms include Support Vector Machines, Random Forest, Linear Regression, Logistic Regression, Naive Bayes, and Neural Networks (multilayer perceptron).

In unsupervised ML, the training data is not labeled, and the algorithm is limited to determining the groups of data that are most similar. Even with deep learning, labeled data sets are needed to train the models, but the feature engineering step for the most part is automated.

Reinforcement learning is also becoming a very popular approach, in which the model or algorithm learns through a feedback system. Reinforcement learning is most widely used in self-driving cars, drones, and other robotics applications.

Examples of training data

Let’s consider a system that predicts the sale price of a house from a number of factors. When ML needs to predict an output from a series of input values, a series of inputs and labeled outputs is needed.

Size (sq ft)   Bedrooms   Bathrooms   Acreage   Sale Price
2000           3          2           0.3       250,000
1500           2          2           0.2       200,000
1600           2          1.5         1.2       280,000

A good rule of thumb would be to take the number of input columns, multiply by 50, and provide that many rows as labeled training data. Our example has four inputs (Size, Bedrooms, Bathrooms, Acreage), so for a robust model you would want 200 rows of training data.
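The rule of thumb above is simple arithmetic, and a minimal sketch makes it easy to apply to other data sets. The function name `rows_needed` and the column identifiers are hypothetical, chosen only for illustration; the factor of 50 is the heuristic from the text, not a guarantee of model quality.

```python
# Heuristic from the text: about 50 labeled rows per input column.
ROWS_PER_INPUT = 50


def rows_needed(input_columns, rows_per_input=ROWS_PER_INPUT):
    """Return the suggested number of labeled training rows."""
    return len(input_columns) * rows_per_input


# Input columns of the house price example (the sale price is the label).
house_inputs = ["size_sq_ft", "bedrooms", "bathrooms", "acreage"]

# The three labeled rows shown in the table: (inputs, sale price).
labeled_rows = [
    ((2000, 3, 2, 0.3), 250_000),
    ((1500, 2, 2, 0.2), 200_000),
    ((1600, 2, 1.5, 1.2), 280_000),
]

suggested_rows = rows_needed(house_inputs)
shortfall = suggested_rows - len(labeled_rows)
print(f"suggested: {suggested_rows}, still to collect: {shortfall}")
```

With four inputs the sketch suggests 200 rows, so the three rows in the table are only a starting point.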

Now let’s look at a natural language processing exercise, where you extract certain types of data from plain text. Your training data includes several text snippets, the type of data to be extracted, and the location of that data.

Text                                             Annotations
The quick brown fox jumped over the lazy dog.    Animal: fox
The cow jumped over the moon.                    Animal: cow
The dish ran away with the spoon.                (none)
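One way to hold this kind of training data in code is to store each annotation as an entity type plus character offsets into the text. The sketch below is an assumption about representation, not the format of any particular tool; the `Annotation` class and the `annotate` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    entity_type: str
    start: int  # offset of the first character of the mention
    end: int    # offset one past the last character (exclusive)


def annotate(text, entity_type, mention):
    """Locate the first occurrence of a mention and return its annotation."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"{mention!r} does not occur in the text")
    return Annotation(entity_type, start, start + len(mention))


fox_text = "The quick brown fox jumped over the lazy dog."
cow_text = "The cow jumped over the moon."
dish_text = "The dish ran away with the spoon."

# Each item pairs a text snippet with its annotations; an empty list
# marks a snippet in which nothing was annotated.
training_data = [
    (fox_text, [annotate(fox_text, "Animal", "fox")]),
    (cow_text, [annotate(cow_text, "Animal", "cow")]),
    (dish_text, []),
]

for text, annotations in training_data:
    for a in annotations:
        print(a.entity_type, a.start, a.end, text[a.start:a.end])
```

Storing offsets rather than only the mention text keeps the location unambiguous when the same word appears more than once in a snippet.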

A good rule of thumb here would be to provide 50 positive examples and 50 negative examples for each type as the ground truth. You want enough variability in your training data that your model is able to learn all the patterns you want to extract. The collected and prepared ground truth is usually split into training and test data at an 80:20 ratio (other ratios, such as 60:40 or a three-way 70:20:10 split, are not uncommon). A larger share of test data gives a more reliable estimate of model performance, but leaves less data for the model to learn from. Underfitting (the algorithm shows low variance but high bias) means the model has not captured the patterns in the data, which can happen when the training data contains too few examples of those patterns or the model is too simple. Overfitting (the algorithm shows low bias but high variance) means the model has fit the quirks of its training examples too closely, which is more likely when the training data is small or unrepresentative relative to the model's complexity; a test set that is too small can fail to reveal it. Both overfitting and underfitting lead to poor model predictions and performance on new data sets. Hence, the selection of representative ground truth is absolutely critical to training cognitive systems.
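As a minimal sketch of the split described above, the following function shuffles the ground truth and divides it at a chosen ratio. The function name `split_ground_truth`, the fixed seed, and the placeholder examples are assumptions made for illustration.

```python
import random


def split_ground_truth(examples, train_fraction=0.8, seed=42):
    """Shuffle the examples and split them into training and test sets."""
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder ground truth: 100 numbered examples.
examples = list(range(100))
train_set, test_set = split_ground_truth(examples)
print(len(train_set), len(test_set))
```

Shuffling before the cut matters when the ground truth was collected in some order (for example, by document type), because an unshuffled split could leave the test set unrepresentative.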

Training methodology at a high level

A useful analogy is to think of your AI system as a college student. College students learn subject matter from homework where the answers are in the back of the book. Students work the problems and look up the answers as they go along, while improving their mental models of the subject material. Before midterms, students take practice exams and grade their performance on a separate set of questions, though these questions are generally similar to their homework. Finally, the students take their midterms, on questions they have never seen before. The midterm performance gives the best indication of how well the student can apply his or her knowledge. In this analogy, homework problems are the training set, practice exams are the test set, and the midterm is the blind set.

For a discussion about how much data to get, with examples, see the article “Why does machine learning require so much training data?” For some thoughts on why training data curation takes so long, see “Machine learning is just the tip of the iceberg—5 dangers lurking below the surface.”

Initial process for training a natural language processing system

The initial process for training an NLP-based system includes the steps depicted in the following figure and listed below. More specifically, the figure depicts the steps involved when training entity extraction models that represent one aspect of a cognitive or AI system, which we discussed earlier.

Figure 5. NLP training process for entity extraction
Graphic showing the NLP Training Process
  1. Type System Design. The entities and relationships that need to be extracted are defined and organized. These are based on the business objectives and might use an industry standard or organization-based ontology as the basis.
  2. Corpus Import including pre-processing. This step describes collecting and importing representative samples of the natural language text that needs to be processed using NLP to extract information. This step also includes tasks involved in pre-processing the documents to be used as ground truth including format conversion and chunking.
  3. Dictionary Creation. Dictionaries of similar terms are defined. This is similar to a thesaurus. For example, if you are looking for the concept of money, you might define a money dictionary. In it, you would place terms that are related to money, such as “dollars,” “cents,” and “USD.”
  4. Pre-Annotation. You apply the pre-defined dictionaries and any other rules to the corpus. This creates a baseline of training data.
  5. Human Annotation. Humans review the documents from the corpus. Because the documents have been pre-annotated according to dictionaries and rules, they will already contain annotations. The reviewers need to correct anything that was incorrectly annotated and add any annotations that the system missed. This allows the system to have accurate training data for the next step and provides a way for the humans to teach the system about when to mark an entity based on context. You might also have to resolve training data conflicts where different human annotators annotate overlapping documents inconsistently. Someone has to play the role of reviewer, look at the inter-annotator agreement (IAA) scores, and adjudicate conflicts in annotated documents.
  6. Train Machine Learning Model. In this step, you actually train the machine learning-based annotation model that can extract the entities, relationships, and attributes. The step might involve identification of the right set of features. If you are using a tool like IBM Watson Knowledge Studio, the feature selection is handled for you automatically. In this step, you select the document sets that you want to use to train it. You also specify the percentage of documents that are to be used as training data, test data, and blind data. Only documents that became ground truth through approval or adjudication should be used to train the machine-learning annotator.
  7. Identify and Create Rules Model. Deterministic rules are defined to annotate entities that appear in the corpus. These rules should be accurate at least most of the time. The rules do not have to be one hundred percent accurate for a couple of reasons: you never reach complete accuracy for NLP, and you have a chance to adjust your training data in a later step when the rule does not apply.
  8. Model Analysis. In this step, you review the trained model performance to determine whether any adjustments must be made to the annotator to improve its ability to find valid entity mentions, relation mentions, and co-references in the documents. Metrics are reviewed to determine the accuracy of the system. Two important metrics are the F-measure and accuracy, which are discussed below. One would typically analyze statistics that are presented in a confusion matrix including the recall, precision, and F1 scores. Then based on the results you can take steps to improve the machine learning annotator performance.
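
Steps 3 and 4 above (dictionary creation and pre-annotation) can be sketched in a few lines. This is a simplified illustration using a regular expression, not the mechanism of any particular tool; the `pre_annotate` function and the sample sentence are hypothetical, while the dictionary terms come from the money example in step 3.

```python
import re

# Dictionary from step 3: terms related to the concept of money.
money_dictionary = ["dollars", "cents", "USD"]


def pre_annotate(text, entity_type, terms):
    """Mark every whole-word dictionary term found in the text.

    Returns (entity_type, start, end, mention) tuples with character offsets.
    """
    alternatives = "|".join(re.escape(term) for term in terms)
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return [
        (entity_type, match.start(), match.end(), match.group(0))
        for match in pattern.finditer(text)
    ]


sample = "The invoice lists 40 dollars and 15 cents, payable in USD."
annotations = pre_annotate(sample, "Money", money_dictionary)
for annotation in annotations:
    print(annotation)
```

The result is only a baseline: human annotators still need to correct wrong matches and add mentions that the dictionary missed, as described in step 5.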

After analysis, the cycle continues. The output of the analysis is used to determine the next steps, such as revising the type system and adding more data to the training corpus.
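
The human annotation step mentions inter-annotator agreement (IAA) scores without naming a formula. One common choice is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below assumes that metric and uses made-up token labels; a given annotation tool may compute IAA differently.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label at random.
    expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten tokens ("O" means no entity).
annotator_1 = ["Animal", "O", "O", "Animal", "O", "O", "O", "Animal", "O", "O"]
annotator_2 = ["Animal", "O", "O", "O", "O", "O", "O", "Animal", "O", "O"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on nine of ten tokens, but because most tokens are "O" the chance-corrected score is noticeably lower than 0.9, which is why a reviewer looks at IAA rather than raw agreement.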

Continuous integration process for a cognitive system

A wider view of training the system over time incorporates feedback from users as they use a pilot or production system that typically includes a custom user interface. These steps are shown in the following figure and outlined below.

Figure 6. Training continuous integration
Graphic showing the Training continuous integration
  1. The initial corpus is uploaded.
  2. SMEs establish the type system.
  3. The corpus is annotated by SMEs (supervised learning).
  4. The training occurs and model version 1 is created.
  5. The model is used by the application and results are shown in the application UI.
  6. End users view the results in the application UI and provide feedback.
  7. Feedback is collected and stored from many users across many interactions with the system.
  8. At a given threshold, the feedback is incorporated back into the system.
     a. Optionally, the feedback can be reviewed by SMEs and corrected or annotated.
     b. The batch set of feedback is used, along with the initial annotated corpus, to retrain.
     c. The model v-next is produced.
  9. Repeat steps 5-8 until the wanted accuracy levels are achieved or whenever new variability is introduced that the system needs to be trained against.
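
Steps 7 and 8 above describe collecting feedback and retraining once a threshold is reached. A minimal sketch of that control flow follows. The `FeedbackLoop` class is hypothetical, the threshold value is arbitrary, and the `retrain` method only stands in for a real training job and the optional SME review.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLoop:
    threshold: int                 # feedback items needed to trigger retraining
    corpus: list                   # annotated corpus used for training
    pending: list = field(default_factory=list)
    model_version: int = 1         # version 1 comes from the initial training

    def record(self, feedback_item):
        """Store one piece of user feedback (step 7) and retrain if due."""
        self.pending.append(feedback_item)
        if len(self.pending) >= self.threshold:
            self.retrain()

    def retrain(self):
        """Fold the feedback batch into the corpus and bump the version (step 8)."""
        # A real system would run SME review and a training job here.
        self.corpus.extend(self.pending)
        self.pending.clear()
        self.model_version += 1


loop = FeedbackLoop(threshold=3, corpus=["doc-1", "doc-2"])
for i in range(7):
    loop.record(f"feedback-{i}")

print(loop.model_version, len(loop.corpus), len(loop.pending))
```

Batching feedback this way keeps retraining runs infrequent and gives SMEs a natural checkpoint at which to review what the system is about to learn.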

Evaluating models

You know that AI systems get better over time, and you want to be explicit about how much better the system is performing. You need to select the appropriate measure for the AI system. The two primary measurement techniques are accuracy and F-measure (usually F1).

Accuracy is a very simple measurement. Simply stated, it is the number of right answers divided by the number of opportunities. If you ask a system ten questions and it gets nine right, it has 90% accuracy. The benefit of using accuracy is that it is a simple measure that everyone understands. However, it lacks sophistication, which might be needed to fully understand system performance.

F-measure includes classifications of why the system made the right (or wrong) prediction. Right and wrong predictions are each split into two categories:

  • True positive: System makes a correct prediction.
  • True negative: System correctly does not make a prediction.
  • False positive: System makes a prediction, but should not have (Type I error).
  • False negative: System fails to make a prediction, but should have (Type II error).

To compute F-measure, you first compute precision (the accuracy of predictions the system makes as defined by the incidence of false positives) and recall (how many predictable values the system predicts as determined by the incidence of false negatives). F1 is the most common F-measure and is a harmonic mean of precision and recall. Precision and recall have a natural tension: Liberalizing the model usually increases recall at the expense of precision, and tightening the model increases precision at the expense of recall.

Incidentally, accuracy can be rewritten as (TP+TN)/(TP+TN+FP+FN). An additional benefit of F-measure is that it does not include true negatives, which are the easiest and least interesting cases. You should use an F-measure when you want to classify your error types, or when your data has a lot of true negatives. Consider a rare attribute that you want to extract from source text, one that appears in only 1 of 10,000 sentences. If you write no annotator at all, you are in effect predicting that the attribute is absent from every sentence you see, a prediction that is 99.99% accurate. By using F-measure, you see that this “do-nothing” annotator has 0% recall and, therefore, has no value.
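
The rare-attribute case can be checked with the standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN), and F1 = 2 × precision × recall / (precision + recall). The sketch below is illustrative; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, since the ratio is undefined in that case.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from error counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def accuracy(tp, tn, fp, fn):
    """Right answers divided by opportunities: (TP+TN)/(TP+TN+FP+FN)."""
    return (tp + tn) / (tp + tn + fp + fn)


# "Do-nothing" annotator on 10,000 sentences, one of which contains the
# attribute: it never predicts, so it has no positives and one miss.
tp, fp, fn, tn = 0, 0, 1, 9999

do_nothing_accuracy = accuracy(tp, tn, fp, fn)
do_nothing_precision, do_nothing_recall, do_nothing_f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={do_nothing_accuracy:.2%} recall={do_nothing_recall:.0%} "
      f"F1={do_nothing_f1:.0%}")
```

The accuracy looks excellent only because true negatives dominate the count; recall and F1 ignore them and expose the annotator as useless.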

Example 1

Let’s measure the accuracy and F1 of an annotator that marks the instances of dogs in natural language.

#   Sentence                       Extraction   Result
1   It was a bright sunny day.     (none)       True negative
2   The young dog found a bone.    dog          True positive
3   A young cat lay in the sun.    cat          False positive
4   The puppy chased the cat.      (none)       False negative
5   There was chaos everywhere.    (none)       True negative
6   A boy took the bone away.      boy          False positive

In this example, precision is 33% (rows 2, 3, and 6), recall is 50% (rows 2 and 4), F1 is 40%, and accuracy is 50% (all rows). The detailed breakdown afforded by F1 gives a better indication of where you need to improve the model.
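
The figures for this example can be reproduced by tallying the Result column. The sketch below is only a check of the arithmetic; the variable names are illustrative. F1 is computed from the unrounded precision and recall, which gives 0.40.

```python
from collections import Counter

# Result column of the example, in row order.
results = ["TN", "TP", "FP", "FN", "TN", "FP"]
counts = Counter(results)

tp, tn = counts["TP"], counts["TN"]
fp, fn = counts["FP"], counts["FN"]

precision = tp / (tp + fp)             # rows 2, 3, and 6
recall = tp / (tp + fn)                # rows 2 and 4
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(results)    # all rows

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"F1={f1:.2f} accuracy={accuracy:.2f}")
```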

Example 2

Now let’s measure the performance of a classification system. Our hypothetical system classifies images as being about dogs, cats, or birds. A traditional way to measure classification performance is a confusion matrix.

                    Dog (actual)   Cat (actual)   Birds (actual)
Dog (predicted)     7              4              0
Cat (predicted)     3              6              0
Birds (predicted)   0              0              10

In our example, there were 30 images, 10 of each animal. However, the system classified the images as 11 dogs, 9 cats, and 10 birds.

By reading across the rows, you can measure the system’s precision for each class: Precision(dog) = 7/(7+4+0). Reading down the columns yields recall per class: Recall(dog) = 7/(7+3+0).
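
The same row and column reading can be applied to every class at once. This minimal sketch stores the matrix with predicted classes as rows and actual classes as columns, matching the table; the function names are hypothetical, and returning 0.0 for an empty row or column is an assumed convention.

```python
labels = ["dog", "cat", "bird"]

# Rows are predicted classes, columns are actual classes.
matrix = [
    [7, 4, 0],   # predicted dog
    [3, 6, 0],   # predicted cat
    [0, 0, 10],  # predicted bird
]


def class_precision(index):
    """Correct predictions divided by the row total (all predictions of the class)."""
    row_total = sum(matrix[index])
    return matrix[index][index] / row_total if row_total else 0.0


def class_recall(index):
    """Correct predictions divided by the column total (all actual members)."""
    column_total = sum(row[index] for row in matrix)
    return matrix[index][index] / column_total if column_total else 0.0


for i, label in enumerate(labels):
    print(f"{label}: precision={class_precision(i):.2f} recall={class_recall(i):.2f}")
```

The off-diagonal cells show where the errors are: all seven misclassifications are confusions between dogs and cats.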

A confusion matrix shows you what kinds of mistakes the system makes and what new training data might be required. Perhaps not surprisingly, the system often confuses cats and dogs with each other but never with birds, so more cat and dog pictures are required for training. For a detailed example of measuring accuracy with an NLP focus, see “Cognitive system testing: Overall system accuracy testing.”

Publish, monitor, and operate

AI systems learn as they encounter more training data. It is important to test your model early and often to verify that it is extracting insights from your data. You should measure the performance of your system regularly as you add relevant training data. These regular measurements help you determine when model performance is not improving, which indicates either that you need to refine your model or that you have reached a logical stopping point. It is also advisable to go beyond the traditional “F1, Precision and Recall” metrics and map model performance to other key criteria to get a deeper understanding of why the model is performing in a certain way. These could include:

  • F1 scores by entity type
  • Number of ground truth (training and test) examples used per entity and relationship
  • Number of records in the ground truth by document type to understand the stratified distribution of the training and test data used to train and evaluate the models
  • Time to extract key entities, relationships, and attributes
  • Size of population you can sample, process, and test

In addition to adding more training data, you are likely to modify the code and model your AI system uses. You will create new ML features as hypotheses. Some of these hypotheses will work out and some will not. By tracking every iteration of your system, you are quickly able to see which revisions work and which ones do not.

Every artifact related to your AI system should be tracked in source control. This includes code, models, and the training data itself. Software developers are very familiar with the virtues of version control, which allows you to know exactly what changes have been introduced into a system and when. These benefits extend to your models and training data. Models will change as new hypotheses are developed, and training data might change as mistakes are discovered and corrected. (That’s right: you are likely to discover that your AI system is suffering because the original data you gave it contains too many errors!)
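
One lightweight way to make training data changes visible in source control is to commit a small manifest that records a content hash of the data next to the model version. The sketch below is an assumption about how that could be done, not a prescribed practice; the `fingerprint` function, the manifest keys, and the sample records are hypothetical.

```python
import hashlib
import json


def fingerprint(artifact):
    """Return a SHA-256 hex digest of a JSON-serializable artifact."""
    # Sorting keys makes the digest independent of dictionary ordering.
    canonical = json.dumps(artifact, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


training_data_v1 = [
    {"text": "The cow jumped over the moon.", "annotations": [["Animal", 4, 7]]},
]
# A later revision adds one more labeled example.
training_data_v2 = training_data_v1 + [
    {"text": "The dish ran away with the spoon.", "annotations": []},
]

# Hypothetical manifest to commit alongside the code and the model.
manifest = {
    "model_version": 2,
    "training_data_sha256": fingerprint(training_data_v2),
}
print(json.dumps(manifest, indent=2))
```

Any correction to the training data changes the digest, so a diff of the manifest shows exactly which model versions were trained on which data.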


Here is the accuracy of a Watson Knowledge Studio model, tracked over time as training data was added. When accuracy plateaued from 12,000 – 20,000 words, we knew that the model had learned as much of the variability as it could from the training set and that adding new training data was unlikely to improve the model.

Figure 7. Accuracy of a Watson Knowledge Studio model
Graphic showing the Accuracy of a Watson Knowledge Studio model
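
A plateau like the one described above can also be detected automatically from tracked measurements. In this sketch the function `has_plateaued`, its window and gain thresholds, and every number in `history` are hypothetical values invented for illustration; they are not the measurements behind the figure.

```python
def has_plateaued(scores, window=2, min_gain=0.01):
    """Return True if the score gained less than min_gain over the last
    `window` measurements."""
    if len(scores) < window + 1:
        return False  # not enough history to judge
    return scores[-1] - scores[-1 - window] < min_gain


# Hypothetical (training words, accuracy) pairs recorded after each iteration.
history = [
    (2_000, 0.52),
    (4_000, 0.61),
    (8_000, 0.70),
    (12_000, 0.74),
    (16_000, 0.742),
    (20_000, 0.743),
]
scores = [score for _, score in history]

print(has_plateaued(scores[:4]))  # still improving at 12,000 words
print(has_plateaued(scores))      # flat between 12,000 and 20,000 words
```

A check like this is only a prompt to investigate: a plateau can mean the model has learned what the current data can teach it, or that the new data lacks the variability the model still needs.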

Continuous learning

In a full-fledged deployment and integration of modeling results, your data modeling work might be ongoing and continuous. For example, if a model is trained and deployed to increase customer retention among high-value customers, it will likely need to be tweaked after a particular level of retention is reached. The model might then be modified and re-used to retain customers at a lower but still profitable level on the value pyramid.

When people encounter new scenarios and situations, they adapt and learn. Cognitive systems must likewise learn and evolve continuously as they are put to use and encounter new observations. If a system does not learn and adapt, then the trained models, and therefore the cognitive system itself, will start degrading soon after they are deployed to production. Continuous learning requires a set of tools, processes, instrumentation, and governance mechanisms. In the following figure, we expand on the Deploy through Capture Feedback phases of the overall method to depict the activities that enable the system to learn continuously from the resulting feedback, including a governance process that carefully decides what the system must learn to achieve business objectives.

Figure 8. Continuous learning loop
Graphic showing how to continuously capture feedback and keep the models up to date


This tutorial has outlined several ways to apply DevOps-style thinking to cognitive or artificial intelligence systems. It can serve as a starting point for your process development and will need to be tweaked for your specific cognitive application. The important takeaway is to remember to continuously measure your cognitive system and to treat all of its components as first-class artifacts in your development process. DevOps thinking has been a wonderful boon to traditional application development, and your cognitive systems need the same boost from DevOps for cognitive systems!