Typical Machine Learning Performance Use Case
One common use case pattern involves understanding consumer sentiment towards a brand or product. With the increased popularity of social media platforms, consumers have been leveraging them to share their opinions, thoughts, and sentiment towards brands, products, and services. As our team works closely with clients developing such applications using the Watson Natural Language Understanding (NLU) sentiment analysis feature, we hear some recurring questions and comments, such as:
“Our sentiment analysis results are not good.” “Sentiment analysis is not returning expected results.” “How can we improve our sentiment results?”

Typically, clients provide one or two examples (or maybe a few) to explain why they think the sentiment analysis results were unexpected or inaccurate. While this is understandable, since we’ve been trained to expect consistent outcomes for all examples, it is not the recommended approach for evaluating the performance of AI services. The machine learning (ML) models that power these AI services are statistical in nature, so it is important to evaluate them with well-defined ML metrics such as accuracy, precision, recall, F1 score, AUC, and the confusion matrix. Once you understand the rationale behind these metrics, you can evaluate your sentiment analysis solution accurately and mathematically rather than going on “gut feel”.
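To make these metrics concrete, here is a minimal sketch of computing them with scikit-learn. The labels and predictions below are made-up illustrations, not output from any Watson service:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

# Hypothetical human labels and model predictions for a small test set
true_labels = ["positive", "negative", "neutral", "positive", "negative", "positive"]
predicted   = ["positive", "negative", "positive", "positive", "neutral", "positive"]

# Overall fraction of examples the model got right
accuracy = accuracy_score(true_labels, predicted)

# Macro-averaged precision, recall, and F1 across the three classes
precision, recall, f1, _ = precision_recall_fscore_support(
    true_labels, predicted, average="macro", zero_division=0)

# Confusion matrix: rows are true labels, columns are predicted labels
cm = confusion_matrix(true_labels, predicted,
                      labels=["positive", "negative", "neutral"])

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(cm)
```

Looking at the confusion matrix alongside the aggregate scores is especially useful here, since it shows which sentiment classes the model confuses with which.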
Methodology to Evaluate Your Machine Learning Solution

In the rest of this blog, we offer a methodology that can help developers better understand the performance of the Watson AI services. Furthermore, the methodology guides developers through the decision process of determining whether a pre-trained machine learning model is sufficient for their needs or whether they need to train their own model that works better for their domain. The methodology references two of the Watson AI services, namely Natural Language Understanding (NLU) and Natural Language Classifier (NLC); however, it applies in general to other AI services. As a quick reminder, NLU offers pre-trained ML models that extract several useful features from text, including sentiment, emotion, keywords, entities, relations, concepts, categories, semantic roles, and metadata. NLC, on the other hand, offers a powerful capability for easily training a custom ML model for short-text classification. Our recommended methodology is described in Figure 1 and can be outlined as follows:
- First and foremost, collect a set of at least 50 representative examples (the more examples the better). We will refer to this set as the test set. In the context of a sentiment use case, the examples would consist of text utterances shared either via social media platforms or directly with enterprises via chat messages or email.
- Label the collected examples in the test set with the correct sentiment label. Typically, sentiment labels are positive, negative, and neutral. For this step, human experts are needed to associate a label with each text utterance. An interesting challenge sometimes arises when humans don’t agree on the correct label. This is less likely to happen with sentiment labels but is possible in general when associating labels with text utterances. When the human experts themselves disagree on the correct label, it is understandable that the AI service may be confused as well.
- Run the collected text utterances in the test set through the NLU service for sentiment analysis. Record the results that come back and compare them with the labels defined in step 2. Given the predicted labels returned by NLU and the true labels specified by humans, you can compute accuracy, precision, recall, F1 score, and AUC. We provide a sample NLU-sentiment Python notebook that runs through these steps and returns the machine learning metrics of interest.
- Given the ML performance metrics computed in step 3, we now have a quantitative understanding of the performance of the pre-trained ML models in NLU. Reviewing these results typically leads to one of two scenarios:
- Good performance results: In this scenario, we find that while NLU doesn’t return the correct sentiment for every example utterance, the ML metrics indicate good performance for the given application; for example, accuracy, precision, recall, and F1 scores higher than 70%.
- Poor performance results: If the computed metrics show poor performance, proceed to step 5 for options.
- Review the specified labels on the text utterances in the test set. If the initial results from the service are poor, the first step is to review the human specified labels. More often than not, we find there are some incorrectly labeled examples, either because of typos, or because there is no consensus among the human experts on the correct labeling.
- Train a custom sentiment analysis model for your domain. At this point, you’ve identified that the resulting performance metrics do not meet your application requirements, and you’ve verified that the labels assigned by human experts are correct. This means that the pre-trained model in NLU is not adequate for the domain of your use case. NLU sentiment analysis works well on text utterances from the general domain, as that is what it is trained on. For certain domains, like medical or legal, it may not deliver the required accuracy. For example, testing the phrase “My doctor was great” on the NLU demo app returns neutral sentiment, while most humans would agree it should be positive. Most likely, the sentiment came back as neutral because, in the general domain, a reference to “doctor” signals negative sentiment, which offsets the positive “great” and nets out to neutral for the overall phrase. In this case, our methodology recommends training a custom sentiment analysis model using a service like Natural Language Classifier (NLC), which involves the following steps:
- Collect training data: NLC requires training data in the form of text utterances and the corresponding correct label for each utterance. For sentiment analysis, the labels would be positive, negative, and neutral. The format of the training data is similar to that of the test set, but the actual text utterances should be different. Also, you typically need more data for training: at least 30-50 examples for each label.
- Train an NLC classifier using the training data set (this is a simple REST API call).
- Test the performance of your custom trained ML model by repeating the steps above. We’ve published a Python notebook for computing the performance metrics for an NLC trained classifier.
- Update your training data and repeat to further improve your results as outlined in the following blog.
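As an illustration of the training-data collection step, here is a minimal sketch of assembling labeled utterances, checking that every label meets a minimum example count, and serializing them to the text,label CSV rows that NLC accepts as a training file. The utterances and the helper functions (`check_label_coverage`, `to_nlc_csv`) are hypothetical; the actual training step is a REST API call that uploads this CSV:

```python
import csv
import io
from collections import Counter

# Hypothetical labeled utterances for a custom (e.g., medical-domain) model
training_examples = [
    ("My doctor was great", "positive"),
    ("The wait time was unacceptable", "negative"),
    ("I have an appointment next week", "neutral"),
    # ... in practice, at least 30-50 examples per label
]

MIN_PER_LABEL = 30  # lower bound suggested above

def check_label_coverage(examples, minimum=MIN_PER_LABEL):
    """Return the labels that fall short of the minimum example count."""
    counts = Counter(label for _, label in examples)
    return {label: n for label, n in counts.items() if n < minimum}

def to_nlc_csv(examples):
    """Serialize examples as CSV rows of text,label — the NLC training format."""
    buf = io.StringIO()
    csv.writer(buf).writerows(examples)
    return buf.getvalue()

print("labels needing more data:", check_label_coverage(training_examples))
print(to_nlc_csv(training_examples))
```

Running a coverage check like this before training is a cheap way to catch class imbalance early; a classifier trained with too few examples of one label will tend to under-predict that label regardless of how the other steps go.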
Conclusion – Rely on Established ML Performance Metrics

In this blog, we’ve outlined some common challenges developers struggle with as they adopt AI services in their applications, and offered a methodology for evaluating and improving the performance of your application. As you incorporate AI services into your applications, it is important to adopt well-established techniques for evaluating the performance of the machine learning models that power these AI services. This will help guide you toward the most relevant AI services for your application and deliver the best performance for your AI solutions.