Despite the exponential growth of chatbots across applications and messaging platforms, several challenges remain in delivering a successful chatbot. One of the key challenges is the chatbot’s ability to understand the wide variety of user inputs. In an earlier blog on training a chatbot, we described a methodology for training chatbots and evaluating their performance. In the rest of this article, we focus on computing and evaluating performance metrics for the trained machine learning system powering the chatbot.

Machine Learning Performance Metrics

Training a cognitive solution such as a chatbot is an iterative process, but developers need guidance on when their cognitive application is ready to release to end users. The recommended approach is to measure performance metrics and to treat reaching certain target values as the signal that the cognitive application is ready for release.

A variety of metrics such as accuracy, precision, recall, and AUC are commonly used for measuring the performance of a machine learning system. In the rest of this blog, we’ll describe how to compute accuracy, precision, and recall of a classification solution. Furthermore, we will discuss what a confusion matrix is, how to generate it for a classification solution, and how to use it in better diagnosing the performance of your machine-learning based cognitive solution.

To help with the definition of these metrics, we refer to Table 1, which shows the confusion matrix comparing the number of actual positive and negative intent utterances to the number of positive and negative intent utterances predicted by a binary classifier. Given a set of N total utterances, let NAP be the number of actual positively labeled utterances and NAN be the number of actual negatively labeled utterances. These constitute the “ground truth” for the system. Correspondingly, NPP is the number of utterances the trained classifier predicts as positive and NPN is the number it predicts as negative.

Table 1: Comparison of Actual Labels to Predicted Labels by a Binary Classifier

Total Number of Samples (N)  | Predicted Positive (NPP) | Predicted Negative (NPN)
Actual Positive (NAP)        | True Positive (NTP)      | False Negative (NFN)
Actual Negative (NAN)        | False Positive (NFP)     | True Negative (NTN)


With these parameters, we define accuracy, precision, and recall:

  • Accuracy: Of all the predicted utterances, how many are correct? Accuracy is defined as the total number of utterances predicted correctly by the cognitive system divided by the total number of utterances (N). This includes all the utterances which are actually positive and the system predicted as positive (NTP) as well as all the utterances which are actually negative and the system predicted as negative (NTN). That is, Accuracy = (NTP + NTN) / N.


  • Precision: Of all the utterances predicted to be of a certain class, how many are actually labeled with that class? Precision is computed by considering all the utterances predicted as positive and checking which of those are actually positive. Effectively, precision is defined as the total number of true positive utterances divided by the total number of utterances predicted as positive: Precision = NTP / NPP = NTP / (NTP + NFP).


  • Recall: Of all the utterances that are actually labeled with a certain class, how many did the system correctly predict to be of that class? Recall is computed by considering all the utterances that are labeled as positive and checking how many of those the system predicted correctly. Effectively, recall is defined as the total number of true positive utterances divided by the total number of utterances which are actually labeled as positive: Recall = NTP / NAP = NTP / (NTP + NFN).
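The three definitions above reduce to one-line formulas. Here is a minimal Python sketch (the function names are ours, not from the blog’s companion code):

```python
# Computing the three metrics directly from the four
# confusion-matrix cells defined above (NTP, NFN, NFP, NTN).

def accuracy(ntp, nfn, nfp, ntn):
    """Correct predictions (NTP + NTN) over all N utterances."""
    return (ntp + ntn) / (ntp + nfn + nfp + ntn)

def precision(ntp, nfp):
    """True positives over everything predicted positive (NPP = NTP + NFP)."""
    return ntp / (ntp + nfp)

def recall(ntp, nfn):
    """True positives over everything actually positive (NAP = NTP + NFN)."""
    return ntp / (ntp + nfn)
```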



For example, consider a sentiment classifier that is trying to decide if an utterance indicates positive sentiment. For a test set of 1000 utterances (N=1000), assume the number of actual and predicted positive and negative sentiment labels are as shown in Table 2.

Table 2: Example Confusion Matrix for Sentiment Classification

Total Number of Samples (N=1000) | Predicted Positive (NPP) | Predicted Negative (NPN)
Actual Positive (NAP)            | True Positive (NTP=50)   | False Negative (NFN=100)
Actual Negative (NAN)            | False Positive (NFP=150) | True Negative (NTN=700)


With the values given in Table 2, we can compute accuracy, precision, and recall as follows:

Accuracy = (NTP + NTN) / N = (50 + 700) / 1000 = 75%

Precision = NTP / (NTP + NFP) = 50 / (50 + 150) = 25%

Recall = NTP / (NTP + NFN) = 50 / (50 + 100) = 33%

As this example shows, a seemingly reasonable 75% accuracy can hide very poor precision (25%) and recall (33%) on the positive class. When measuring the performance of a cognitive solution, it is important to consider precision and recall in addition to accuracy.
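We can check the Table 2 arithmetic by reconstructing a label set with exactly those cell counts and tallying it in plain Python (the "pos"/"neg" label names are ours):

```python
# Rebuild a label set matching Table 2: 50 TP, 100 FN, 150 FP, 700 TN.
from collections import Counter

actual    = ["pos"] * 150 + ["neg"] * 850
predicted = (["pos"] * 50 + ["neg"] * 100      # actual positives: 50 hit, 100 missed
             + ["pos"] * 150 + ["neg"] * 700)  # actual negatives: 150 false alarms

cells = Counter(zip(actual, predicted))
ntp, nfn = cells[("pos", "pos")], cells[("pos", "neg")]
nfp, ntn = cells[("neg", "pos")], cells[("neg", "neg")]

accuracy  = (ntp + ntn) / len(actual)   # 750/1000 = 0.75
precision = ntp / (ntp + nfp)           # 50/200   = 0.25
recall    = ntp / (ntp + nfn)           # 50/150   ≈ 0.33
```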

Putting knowledge into practice – how it works

To extend the discussion to cognitive classification solutions with multiple classes, we will use real examples from the Watson Business Coach application, which included a chat component powered by the Watson Conversation service. One requirement for that application was to understand which business need intent the user referenced. The identified business need intent would trigger a corresponding dialog with the user to capture other parameters, which would then be used to serve the most relevant client references.

For simplicity, we will focus on a handful of business need intents. Specifically, we will build an intent classifier (as a component of the Watson Conversation service) that takes a user’s utterance (short text) as input and classifies it into one of six intents:

  • improve_customer_service
  • improve_decision_making
  • innovate
  • knowledge_sharing
  • improve_productivity
  • personalize_user_experience

After collecting real end-user utterances, we mapped them to these six intents, randomly split the labeled utterances (70% for training and 30% for testing), and trained the Watson Conversation service. Once the service completed training, we wrote the confusionMatrix code (available on GitHub) to compute accuracy, precision, and recall and to generate the confusion matrix for our classifier.
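The evaluation loop itself is simple. The published confusionMatrix code is Java, but the same steps can be sketched in a few lines of Python; here, classify() stands in for a call to the trained classifier and is hypothetical:

```python
# Sketch of the evaluation loop: 70/30 split, then tally a
# confusion matrix on the held-out 30% of labeled utterances.
import random

def evaluate(labeled_utterances, classify, intents):
    """labeled_utterances: list of (text, intent) pairs.
    classify: function mapping text -> predicted intent (the trained model)."""
    random.shuffle(labeled_utterances)
    split = int(0.7 * len(labeled_utterances))
    test_set = labeled_utterances[split:]      # training set would be [:split]

    # matrix[actual][predicted] counts test utterances in each cell.
    matrix = {a: {p: 0 for p in intents} for a in intents}
    correct = 0
    for text, actual in test_set:
        predicted = classify(text)
        matrix[actual][predicted] += 1
        correct += (predicted == actual)
    return matrix, correct / len(test_set)     # matrix + overall accuracy
```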

Figure 2 shows the initial confusion matrix for our classifier with the six different intents. The confusion matrix is extremely helpful in better diagnosing the performance of the system. While the overall accuracy of the solution is 58.5%, it is clear from the confusion matrix that the system does a better job classifying improve_customer_service and innovate intents than it does classifying personalize_user_experience or improve_decision_making intents.

Figure 2: Watson Business Coach Initial Confusion Matrix


Reviewing the confusion matrix also shows which intents the system finds hardest to distinguish. In Figure 2, you can see that the improve_decision_making and innovate intents are frequently confused with each other.

Once you’ve computed the accuracy, precision, and recall metrics, next you can focus on improving the performance of the system by following these guidelines:

  • To improve precision of an intent classification, review the training data to make sure there is consistent mapping of utterances to that intent.
  • To improve recall of an intent classification, add more training utterances that map to that intent.
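To tell which guideline applies to which intent, compute per-intent precision and recall straight off the confusion matrix (a sketch, assuming rows are actual intents and columns are predicted intents, as in Table 1):

```python
# Per-intent precision and recall from a nested-dict confusion matrix
# of the form matrix[actual_intent][predicted_intent] = count.

def per_intent_metrics(matrix):
    intents = list(matrix)
    metrics = {}
    for intent in intents:
        tp = matrix[intent][intent]
        predicted_as = sum(matrix[a][intent] for a in intents)  # column sum
        actually = sum(matrix[intent].values())                 # row sum
        metrics[intent] = {
            # Low precision -> review training data for inconsistent mapping.
            "precision": tp / predicted_as if predicted_as else 0.0,
            # Low recall -> add more training utterances for this intent.
            "recall": tp / actually if actually else 0.0,
        }
    return metrics
```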

When judging the performance of a cognitive solution, it is critical to rely on well-defined machine learning performance metrics (accuracy, precision, recall, AUC, etc.) and not get stuck trying to explain one-off inconsistencies. Relying on measurable metrics helps guide the delivery and updates of your cognitive chatbot solution.


Learn more about creating your own chatbot

Ready to go deeper? Access code patterns and learn how to hook it all together. Take the next step by learning how to build a configurable, retail-ready chatbot. You can also access the IBM Bot Asset Exchange, a community-driven chatbot development hub, to see bots that others have created and use them as a foundation for your own chatbot.






11 comments on “Is your chatbot ready for prime-time?”

  1. what is the current market standard for a bot’s response accuracy?

  2. Hi Neha, I am not aware of a well-accepted market standard for a bot’s response accuracy. However, greater than 70% would be typical for an initial beta release, with the plan to increase that through iterative training to over 90%.

  3. Malcolm Robbins October 03, 2017

    Interesting blog. Am I right to understand that in order to check the performance metrics I need to use or adapt the code referred to in the blog? Just wondering why the service doesn’t have this in the UI. I’m new to this so I may be ignorant of the UI capabilities but it seems to me that unless it is part of the UI it would not be obvious to many people that checking these metrics is important – they’d just assume the conversation service is doing the classification “perfectly”. To me it seems that to use this in a use case other than a “toy” measuring the performance metrics is pretty important/crucial.

  4. Joe Kozhaya October 05, 2017

    Hi Malcolm, whenever you train a machine learning solution, you need to understand its performance as these statistical solutions don’t give 100% accurate results, and the blog/code we present here show how you can do that and the relevant metrics to look for. As to why it is not in the UI for the service, I can’t really comment on that but it is probably because there is a need to maintain a balance between complexity and usability in the service offering.

  5. Thank you! I loved it! The post is very didactic, congratulations!

  6. Appreciate any help. For the last step, “Provide as argument a Java properties file and press Run”, we get:
    Apr 18, 2018 6:13:52 PM confusionMatrix.ConvTest main
    SEVERE: no arguments specified, properties file name needed
    Any ideas what might be causing this? Thanks!

  7. Joe Kozhaya April 20, 2018

    Hi CJ,
    Did you provide a Java properties file that specifies the parameters as described in this file https://github.com/joekozhaya/confusionmatrix/blob/master/sample.properties:

    numIntents=NUM_INTENTS (for example 10)
    test_csv_filename=PATH_TO_CSV_FILE_WITH_UTTERANCES_AND_INTENTS (need the header to be “text” and “class”)
    confmatrix_filename=CONV MATRIX FILE NAME TO WRITE TO

    Also, for reference, we have published a simpler Python based set of notebooks for evaluating performance:
    The notebook specific for Conversation (Watson Assistant) : https://github.com/joe4k/wdcutils/blob/master/notebooks/ConversationPerformanceEval.ipynb

    If this doesn’t work for you, feel free to email me directly and I can help you.

  8. Is there a sample csv file for “csv file with your text utterances and corresponding intent labels”

    • Joe Kozhaya April 30, 2018

      Hi Tiphanie,
      You can create a spreadsheet in Excel which consists of two columns, the first column is the sample text users may say and the second column would be the label that best matches that text. For example, if you’re trying to train a classifier for sentiment analysis, here would be a simple start-of-the-csv-file:
      text,class
      It is a great day today.,Positive
      It was a horrible experience.,Negative
      We had lunch in the park today.,Neutral
      You will need more examples for each of the labels. This is the same format as you’d upload for training the classifier.
      Hope this helps.
