
IBM Developer Blog


Learn about computing and evaluating performance metrics for the trained machine learning system powering the chatbot.


Despite the exponential growth in chatbots across various applications and messaging platforms, there are still several challenges to overcome when delivering a successful chatbot. One of the key challenges is the ability of the chatbot to understand the wide variety of inputs from the users. In an earlier blog on training a chatbot, we describe a methodology for training chatbots and evaluating their performance. In the rest of this article, we focus on computing and evaluating performance metrics for the trained machine learning system powering the chatbot.

Machine Learning Performance Metrics

Training a cognitive solution such as a chatbot is an iterative process, and developers need guidance on when they can release their cognitive application to end users. We recommend measuring performance metrics against defined targets; reaching those targets signals that the cognitive application is ready for release.

A variety of metrics such as accuracy, precision, recall, and AUC are commonly used for measuring the performance of a machine learning system. In the rest of this blog, we’ll describe how to compute accuracy, precision, and recall of a classification solution. Furthermore, we will discuss what a confusion matrix is, how to generate it for a classification solution, and how to use it in better diagnosing the performance of your machine-learning based cognitive solution.

To help define these metrics, we refer to Table 1, which shows the confusion matrix comparing the number of actual positive and negative intent utterances to the number of positive and negative intent utterances predicted by a binary classifier. Given a set of N total utterances, let NAP be the number of actually positive labeled utterances and NAN be the number of actually negative labeled utterances. These constitute the "ground truth" for the system. On the other hand, NPP is the number of utterances the trained classifier predicts as positive, and NPN is the number it predicts as negative.

| Total Number of Samples (N) | Predicted Positive (NPP) | Predicted Negative (NPN) |
|-----------------------------|--------------------------|--------------------------|
| Actual Positive (NAP)       | True Positive (NTP)      | False Negative (NFN)     |
| Actual Negative (NAN)       | False Positive (NFP)     | True Negative (NTN)      |

Table 1: Comparison of Actual Labels to Predicted Labels by a Binary Classifier

With these parameters, we define accuracy, precision, and recall:

  • Accuracy: Of all the predicted utterances, how many are correct? Accuracy is defined as the total number of utterances predicted correctly by the cognitive system divided by the total number of utterances (N). This includes all the utterances which are actually positive and the system predicted as positive (NTP) as well as all the utterances which are actually negative and the system predicted as negative (NTN).

Accuracy = (NTP + NTN) / N

  • Precision: Of all the utterances predicted to be of a certain class, how many are actually labeled with that class? Precision is computed by considering all the utterances predicted as positive and checking which of those are actually positive. Effectively, precision is defined as the total number of true positive utterances divided by the total number of utterances predicted as positive.

Precision = NTP / (NTP + NFP) = NTP / NPP

  • Recall: Of all the utterances that are actually labeled with a certain class, how many did the system correctly predict to be of that class? Recall is computed by considering all the utterances that are labeled as positive and checking how many of those the system predicted correctly. Effectively, recall is defined as the total number of true positive utterances divided by the total number of utterances which are actually labeled as positive.

Recall = NTP / (NTP + NFN) = NTP / NAP
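The three definitions above translate directly into code. The following is a minimal sketch in plain Python (the function names are our own, not part of any Watson API); each function takes the four confusion-matrix counts it needs:

```python
def accuracy(n_tp, n_fn, n_fp, n_tn):
    """Fraction of all utterances the classifier labeled correctly."""
    total = n_tp + n_fn + n_fp + n_tn
    return (n_tp + n_tn) / total

def precision(n_tp, n_fp):
    """Of the utterances predicted positive, the fraction that are actually positive."""
    return n_tp / (n_tp + n_fp)

def recall(n_tp, n_fn):
    """Of the actually positive utterances, the fraction the classifier predicted positive."""
    return n_tp / (n_tp + n_fn)
```

Note that precision's denominator (NTP + NFP) is exactly NPP, and recall's denominator (NTP + NFN) is exactly NAP, matching the formulas above.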

For example, consider a sentiment classifier that is trying to decide if an utterance indicates positive sentiment. For a test set of 1000 utterances (N=1000), assume the number of actual and predicted positive and negative sentiment labels are as shown in Table 2.

| Total Number of Samples (N=1000) | Predicted Positive (NPP) | Predicted Negative (NPN) |
|----------------------------------|--------------------------|--------------------------|
| Actual Positive                  | True Positive (NTP=50)   | False Negative (NFN=100) |
| Actual Negative                  | False Positive (NFP=150) | True Negative (NTN=700)  |

Table 2: Example Confusion Matrix for Sentiment Classification

With the values given in Table 2, we can compute accuracy, precision, and recall as follows:

  • Accuracy = (50 + 700)/1000 = 750/1000 = 75%
  • Precision = 50/(50 + 150) = 50/200 = 25%
  • Recall = 50/(50 + 100) = 50/150 ≈ 33%
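These three results can be checked with a few lines of arithmetic (plain Python, no Watson-specific code; the variable names are ours):

```python
# Counts from Table 2
n_tp, n_fn, n_fp, n_tn = 50, 100, 150, 700
n = n_tp + n_fn + n_fp + n_tn        # 1000 test utterances

accuracy = (n_tp + n_tn) / n         # (50 + 700) / 1000 = 0.75
precision = n_tp / (n_tp + n_fp)     # 50 / (50 + 150)   = 0.25
recall = n_tp / (n_tp + n_fn)        # 50 / (50 + 100)  ~= 0.33

print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%}")
```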

As you can see from this example, when measuring the performance of a cognitive solution, it is important to consider other metrics such as precision and recall in addition to accuracy.

Putting knowledge into practice – how it works

To extend our description to cognitive classification solutions with multiple classes, we will use real examples from the Watson Business Coach application, which consists of a chat component powered by the Watson Conversation service. One of the requirements for that application was to identify which business need intent the user referenced. The identified business need intent would trigger a corresponding dialog with the user to capture other parameters, which would then serve the most relevant client references.

For simplicity, we will focus on a handful of business need intents. Specifically, we will build an intent classifier (as a component of the Watson Conversation service) that takes a user's utterance (short text) as input and classifies it into one of six intents:

  • improve_customer_service
  • improve_decision_making
  • innovate
  • knowledge_sharing
  • improve_productivity
  • personalize_user_experience

After collecting real end-user utterances, we mapped those utterances to these six intents, randomly split the labeled utterances (70% for training and 30% for testing), and trained the Watson Conversation service. Once the service completed training, we wrote confusionMatrix code to compute accuracy, precision, and recall and to generate the confusion matrix for our classifier.
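The blog does not publish the confusionMatrix code itself, but its core tallying logic can be sketched in plain Python. In this sketch (our own, not the actual implementation), the hypothetical `classify` callable stands in for a call to the trained service, and the matrix is a dict of dicts with rows as actual intents and columns as predicted intents:

```python
from collections import Counter

INTENTS = [
    "improve_customer_service",
    "improve_decision_making",
    "innovate",
    "knowledge_sharing",
    "improve_productivity",
    "personalize_user_experience",
]

def build_confusion_matrix(labeled_utterances, classify):
    """Tally (actual intent, predicted intent) pairs over a test set.

    labeled_utterances: iterable of (utterance_text, actual_intent) pairs
    classify: callable mapping utterance_text -> predicted intent
              (in the real application, a call to the trained service)
    """
    counts = Counter()
    for text, actual in labeled_utterances:
        counts[(actual, classify(text))] += 1
    # Rows = actual intents, columns = predicted intents
    return {a: {p: counts[(a, p)] for p in INTENTS} for a in INTENTS}

def overall_accuracy(matrix):
    """Diagonal (correct predictions) divided by the total count."""
    correct = sum(matrix[i][i] for i in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total
```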

Figure 2 shows the initial confusion matrix for our classifier with the six different intents. The confusion matrix is extremely helpful in better diagnosing the performance of the system. While the overall accuracy of the solution is 58.5%, it is clear from the confusion matrix that the system does a better job classifying improve_customer_service and innovate intents than it does classifying personalize_user_experience or improve_decision_making intents.

Watson Business Coach Initial Confusion Matrix

Figure 2: Watson Business Coach Initial Confusion Matrix

Also, reviewing the confusion matrix, you can identify which intents may be most confusing for the system to classify. In Figure 2, you can see that improve_decision_making and innovate intents may be confusing for the system to classify correctly.

Once you’ve computed the accuracy, precision, and recall metrics, next you can focus on improving the performance of the system by following these guidelines:

  • To improve precision of an intent classification, review the training data to make sure there is consistent mapping of utterances to that intent.
  • To improve recall of an intent classification, add more training utterances that map to that intent.
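To decide which of these two guidelines applies to a given intent, per-intent precision and recall can be read directly off the confusion matrix: the column sum gives everything predicted as that intent, and the row sum gives everything actually labeled with it. A sketch (assuming the matrix is a dict of dicts keyed by intent, rows = actual, columns = predicted):

```python
def per_intent_metrics(matrix):
    """Per-intent precision and recall from a multi-class confusion matrix.

    matrix[actual][predicted] = count of utterances with that
    (actual, predicted) intent pair.
    """
    intents = list(matrix)
    metrics = {}
    for intent in intents:
        tp = matrix[intent][intent]
        # Column sum: every utterance predicted as this intent
        predicted = sum(matrix[a][intent] for a in intents)
        # Row sum: every utterance actually labeled with this intent
        actual = sum(matrix[intent].values())
        metrics[intent] = {
            "precision": tp / predicted if predicted else 0.0,
            "recall": tp / actual if actual else 0.0,
        }
    return metrics
```

An intent with low precision but high recall suggests inconsistent labeling pulling other intents' utterances in; low recall suggests too few training utterances for that intent.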

When judging the performance of a cognitive solution, it is critical to rely on well-defined machine learning performance metrics (accuracy, precision, recall, AUC, etc.) and not get stuck trying to explain one-off inconsistencies. Relying on measurable metrics helps guide the delivery and updates of your cognitive chatbot solution.