
As members of the Watson™ Applied Research team, we are often consulted to create fresh, innovative solutions to historically difficult problems that hurt our collective productivity. Last summer, one of our clients came to us with an all-too-familiar problem: their chatbot was not living up to their expectations. They wanted to know whether there was a way to improve the solution using the IBM Watson Assistant API. We accepted their challenge—and their chatbot logs—and built a new system with Watson. In this tutorial, we discuss our approach, our metrics, and how much “smarter” the new bot became.

Show me the data

The first thing any data scientist asks for is, well, the data. By definition, supervised machine learning models need labeled data, and the absolute best place to get this data for your models is from the deployed solution. At the beginning of a project, this causes a chicken-and-egg problem. Some people use synthetic training data; we prefer to deploy the solution with a more deterministic version in place of the actual machine learning (ML) algorithm, then swap in the ML model after enough people have interacted with the system.
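To make that concrete, here is a minimal sketch of the "deterministic first, ML later" pattern in Python. All of the names and rules are ours for illustration, not from the actual project; the point is that the stand-in bot logs every exchange, so the eventual ML model has real labeled traffic to learn from.

```python
# A minimal sketch of the "deterministic first, ML later" pattern.
# KeywordBot and its rules are illustrative, not from the actual project.

class KeywordBot:
    """Deterministic stand-in: match hand-written keywords to canned replies."""

    def __init__(self, rules):
        self.rules = rules  # e.g., {"hours": "We are open 9-5 on weekdays."}

    def respond(self, utterance):
        for keyword, reply in self.rules.items():
            if keyword in utterance.lower():
                return reply
        return "Sorry, I can't help with that yet."


def log_exchange(utterance, response, logfile="chat.log"):
    """Record every exchange; these logs become the ML training data later."""
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(f"{utterance}\t{response}\n")
```

Once enough real users have talked to the stand-in, the `respond` method can be swapped for a trained model without changing anything upstream.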

Thankfully, with this client we had tens of thousands of human interactions with their old chatbot system, nicknamed “Chat Bot Zero.” A data science gold mine.

Measure it

Performance

Measuring system performance is easy for us all to agree on in principle, but what does it mean exactly? By the very nature of natural language, there isn’t just one way to express a concept. And more than one concept could theoretically apply to a given situation.

Our approach is to always look at the problem from the user’s perspective and to ask our human judges, “Is this a good response?” when shown an exchange between the customer and the chatbot. We used our annotation army* to answer “yes” or “no.”

Annotation Army*

An annotation army is a group of people educated enough in the domain of the chatbot’s purview that their opinion is of high value and is considered the ultimate arbiter of the chatbot’s correctness. Note that, in practice, there is always inter-annotator disagreement. For example, if you show 10 experts the same information, it is rare for all 10 to agree on the correct outcome (how rare depends on how complicated the data set is). So, for our purposes, we ask 5 humans per judgment and use the label that at least 3 of the 5 experts agree on.
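In code, that aggregation rule is just a majority vote. A minimal sketch, assuming binary "yes"/"no" labels (with five binary votes, one side always has at least three):

```python
from collections import Counter

def majority_judgment(labels):
    """Collapse five yes/no judgments into the label 3+ annotators agree on."""
    assert len(labels) == 5, "we ask five annotators per exchange"
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 3 else None  # None only if labels aren't binary

# Three of five experts say the chatbot's response was good:
print(majority_judgment(["yes", "yes", "no", "yes", "no"]))  # -> "yes"
```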

Using humans to measure the system helps push our solutions closer to a human-acceptable bot. Relying on results against “golden data” that has a known acceptable solution puts you at risk of inheriting the bias of the person who created the data set. Machine learning in the field is optimized for how it’s used, not how it was designed.

Answer frequency

When analyzing the output of a bot that was created by someone else, you might naturally ask yourself, “What did these people think this bot should know?” Having been in the business, we realize that every engineering team that produces a running system must do a few things right, and it is just as important to learn from those things as from the things that went wrong. But how many things are we talking about here? Our favorite way to visualize this is to sort all of the answers by how often they were returned in the logs. A cumulative sum histogram then makes it easy to read off statements like, “The top 100 answers accounted for 70% of the log data,” as you can see in the log coverage plot in Figure 1.

Figure 1. Log coverage plot: Cumulative sum histogram of the most popular responses. 97% of the logs contain one of the top 500 responses from Chat Bot Zero.
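The plot itself takes only a few lines of pandas. A sketch, assuming the logs have been flattened into a CSV with one chatbot response per row (the file and column names are ours):

```python
import pandas as pd

logs = pd.read_csv("logs.csv")  # assumed: one "response" column per exchange

# Sort responses by frequency, then accumulate their share of all log lines.
counts = logs["response"].value_counts()   # most frequent first
coverage = counts.cumsum() / counts.sum()  # cumulative fraction of the logs

# Read off statements like the one above:
print(f"The top 100 answers account for {coverage.iloc[99]:.0%} of the log data")

# coverage.reset_index(drop=True).plot() draws the curve shown in Figure 1.
```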

Rebuilding with Watson

Version 1.0

After studying the log coverage plot, and considering our own human limitations, we decided that the first version of our Watson Assistant would have 60 intents. Intents are classifications of human utterances that our Watson system will attempt to recognize. We realized that Chat Bot Zero had over 500 types of responses, so we didn’t expect much from our 60-intent system, but we were starting small to get a feel for the system and familiarize ourselves with the domain.

The process to create the data set for the Watson Assistant machine learning training is shown in Figure 2. We begin with the large cache of log data and take a random sample. We send that sample to humans to judge whether Chat Bot Zero gave a good response. If yes, and the response belongs to the top 60 responses, the utterance is added as a training example for the intent matching that response. If no, a human judges whether the utterance is an example of one of the top 60 intents and, if so, it is added to the training data.

Figure 2. Processing the data for Watson Assistant intent training
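The routing logic in Figure 2 can be expressed as a small triage function. A sketch, where `judge_response` and `judge_intent` stand in for the annotation army's human judgments (all names here are hypothetical):

```python
def triage(utterance, zero_response, top_responses, judge_response, judge_intent):
    """Route one judged log line into Watson Assistant training data.

    top_responses maps each of the top-60 responses to its intent name.
    judge_response(utterance, response) -> True if humans called it good.
    judge_intent(utterance) -> one of the top-60 intents, or None.
    """
    if judge_response(utterance, zero_response):
        if zero_response in top_responses:
            return top_responses[zero_response], utterance  # (intent, example)
    else:
        intent = judge_intent(utterance)
        if intent is not None:
            return intent, utterance
    return None  # not useful for this 60-intent model; discard
```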

Note that we are equating intents and responses, although in a real system this isn’t necessarily true. Responses can change with context: for example, a user asking for the address of your closest store can receive many, many correct responses, but all share the same intent of finding the closest store. We applied some natural language processing techniques to the data set to de-duplicate context-specific responses into a normalized response. For example, stripping the address from “your closest retail outlet is […]” lets a unique string count group all of those responses together.
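A sketch of that normalization step, with a single illustrative pattern (the real system used NLP techniques tuned to the data):

```python
import re

def normalize_response(response):
    """Collapse context-specific responses into one canonical template."""
    text = response.lower().strip()
    # Replace whatever follows the template phrase with a placeholder token.
    text = re.sub(r"(your closest retail outlet is).*", r"\1 [ADDRESS]", text)
    return text

# Both variants now count as one response in the frequency analysis:
print(normalize_response("Your closest retail outlet is 123 Main St."))
print(normalize_response("Your closest retail outlet is 9 Elm Ave, Suite 4."))
# both -> "your closest retail outlet is [ADDRESS]"
```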

Blind testing: Using the secret stash

Remember when we told you that we had loads of log data? We’re going to pull the standard trick: even though we received four months of data, we’ll pretend that we never got the last month and hold it out as our test set. Now we can mimic deploying and testing using the last month of data. Sure, users might type something that also appears in the training set, but we’re testing the user experience on a model trained only with older exchanges.
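The holdout itself is a simple time-based split. A sketch, again assuming the logs carry a timestamp column:

```python
import pandas as pd

logs = pd.read_csv("logs.csv", parse_dates=["timestamp"])  # assumed column name

# Hold out the final month of traffic as the "secret stash" blind test set.
cutoff = logs["timestamp"].max() - pd.DateOffset(months=1)
train = logs[logs["timestamp"] <= cutoff]
test = logs[logs["timestamp"] > cutoff]

print(f"{len(train)} training exchanges, {len(test)} held-out test exchanges")
```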

Testing is as simple as feeding chatbot exchanges to the annotation army without any indication of which system they came from. Watson Assistant is now trained, and the responses from Watson can replace the responses from Chat Bot Zero. See Figure 3.

Figure 3. Create a secret stash from the last month of log data, and use it to test the performance of a Watson Assistant model trained only with older conversations.
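Replaying the held-out utterances against the trained model is a single API call per exchange. A sketch using the Watson Assistant V1 Python SDK; the API key, service URL, version date, and workspace ID are placeholders for your own instance:

```python
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")  # placeholder credentials
assistant = AssistantV1(version="2021-06-14", authenticator=authenticator)
assistant.set_service_url("https://api.us-south.assistant.watson.cloud.ibm.com")

def watson_response(utterance, workspace_id="YOUR_WORKSPACE_ID"):
    """Send one held-out utterance to the trained skill and return its reply."""
    result = assistant.message(
        workspace_id=workspace_id,
        input={"text": utterance},
    ).get_result()
    return " ".join(result["output"]["text"])
```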

Version 1.1

This is an exciting time in the project. The results are in, and now we know officially that the Watson Assistant bot is doing worse than Chat Bot Zero. However, Version 1.0 was only meant to be a primer coat to build upon. From here, there are two main types of data we can add to the model: examples for existing intents and new intents. Each time we augment the data, we take a second look at every intent/example pair to verify that it is where it should be.

As a minor version, we first decided to focus on expanding the number of intents. We used the top intents from the correct Chat Bot Zero responses, plus some human intuition from sorting through all the data. The new data was added to the training set, and the testing was repeated.

Version 2.0

Using all the intents from before, we added more training examples to each intent class. We ran Chat Bot Zero and the new Watson Assistant bot in parallel and asked the annotation army to judge each bot’s response to the human, as sketched below.
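To keep the judging blind, the two systems' replies can be shuffled before they reach the annotators. A sketch (it reuses the hypothetical `watson_response` helper from the previous listing):

```python
import random

def blind_pairs(exchanges, watson_response):
    """Pair each utterance with both bots' replies, in a random, unlabeled order.

    exchanges: (utterance, zero_reply) tuples from the secret stash.
    """
    tasks = []
    for utterance, zero_reply in exchanges:
        candidates = [("zero", zero_reply), ("watson", watson_response(utterance))]
        random.shuffle(candidates)  # judges never learn which system is which
        tasks.append((utterance, candidates))
    return tasks
```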

Rinse and repeat

Now that we had the major/minor pattern, we iterated a couple of times. On the last iteration, Version 3.0, we did a bit of both (adding intents and examples) because it was the last hurrah for the project.

Improved++

Human/bot interactions

Figure 4. Relative chatbot improvement from Chat Bot Zero to Watson Assistant. The Version 3.0 Watson Assistant was just over 15% more accurate than the baseline system.

In Figure 4, we express our results as a delta of performance over the baseline system, Chat Bot Zero. The Version 1.0 Watson Assistant did 15% worse than the original system, but from there it quickly got better. With just the next iteration of training, Version 1.1, Watson Assistant’s performance was nearly equal to Chat Bot Zero’s. With the last iteration in this report, Version 3.0, we achieved slightly over a 15% improvement.
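For clarity, one plausible reading of that delta: the change in the annotation army's "good response" rate, relative to the baseline. The rates below are invented placeholders, not the project's actual scores:

```python
def relative_improvement(watson_rate, baseline_rate):
    """Express Watson's good-response rate as a delta over Chat Bot Zero."""
    return (watson_rate - baseline_rate) / baseline_rate

# Placeholder numbers chosen only to show the arithmetic:
print(f"{relative_improvement(0.69, 0.60):+.0%}")  # -> +15%
```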

Maintenance

Besides the raw accuracy improvement, you might have noticed that the Version 3.0 Watson Assistant has only about 130 intents. That’s more accuracy with less variation than Chat Bot Zero, making the chatbot responses easier for the copy editors to maintain.

Addendum: Nothing is ever done

Because Watson Assistant is an example of a technology where there is only ever a probability of a correct answer rather than a perfect system, there will always be room to move closer to the ideal. Watson Assistant offers all kinds of great tools, like entity recognition, conversational slots to collect a group of information at once, and dialog flows. All of these tools add dimensions to your chatbot that can be tailored to your use case.

Happy chatbotting!