Advice on training and evaluating the performance of your custom machine learning model on Watson Developer Cloud

 

IBM Watson Developer Cloud (WDC) services put the power of machine learning technology in the hands of developers to extract insights from unstructured data (text, speech, and images). To serve developers and enable them to tackle a wide spectrum of applications ranging from general consumer applications to various enterprise-specific applications, the IBM Watson team offers several pre-trained services as well as a rich set of customization capabilities.

Watson Developer Cloud Customization Capabilities

For the pre-trained services, the IBM Watson team has taken on the responsibility of acquiring the right data to train these services, generating trained machine learning (ML) models, and providing out-of-the-box functionality for developers. Natural Language Understanding (NLU), Personality Insights (PI), Tone Analyzer (TA), Speech to Text (STT), Language Translator (LT), and Visual Recognition (VR) are some of the pre-trained WDC services. Developers like these services because they are intuitive, easy to use, and require no extra ML training effort, and they work well for applications that tackle a general domain, such as enriching web URLs, tagging images, or analyzing the sentiment of social media posts.

However, for applications that involve specific domains (such as legal or medical) or private enterprise data, developers need to train and deploy custom ML models. To address that need, the WDC services offer several customization capabilities. The Natural Language Classifier (NLC), Watson Conversation, and Visual Recognition services allow developers to train custom ML models by providing example text utterances (NLC and Conversation) or example images (VR) for a defined set of classes (or intents). The Watson Speech to Text (STT) service has a beta offering for training custom language models. Furthermore, for custom entity and relation extraction from text, IBM Watson offers Watson Knowledge Studio (WKS), a SaaS solution designed to enable Subject Matter Experts (SMEs) to train custom statistical machine learning models for extracting domain-specific entities and relations from text. Once a custom model is trained with WKS, it can be deployed to NLU or the Watson Discovery Service, which developers can call at runtime to extract relevant entities and relations.

Performance Evaluation of Trained ML Models

When dealing with custom models, two common questions from developers are "How much training should I do?" and "When is my model trained well enough to release?" To help address these questions and enable our partners and clients to exercise the full power of WDC customization capabilities, we've published WDC Jupyter notebooks that report commonly used machine learning performance metrics for judging the quality of a trained model. Specifically, the WDC Jupyter notebooks report machine learning metrics that include accuracy, precision, recall, f1-score, and the confusion matrix. If you're interested in more details on these various metrics, please consult the "Is your chatbot ready for primetime?" blog.

These WDC Jupyter notebooks for NLC, Conversation and Visual Recognition help developers evaluate how well their trained models are performing before releasing their application updates to production.

Data Requirements

To leverage these notebooks, you need to have a custom trained model as well as a test dataset. The general recommended approach is to start with the groundtruth, a dataset that consists of example input data and the corresponding correct label. For NLC and Conversation, the input data consists of example text utterances and the label is the intent (or class) that best represents that utterance. For Visual Recognition, the input data consists of example images and the label is the class that best represents that image.

This groundtruth is then randomly split into two sets, one for training and the other for test, with a typical split being 70% of the data for training and 30% for test. The training set is used to build the custom models by uploading the example data and associated labels to the WDC services via REST APIs or via the interactive tooling for NLC, Conversation, and Visual Recognition. Once a trained model is ready, you can then leverage the WDC notebooks to evaluate its performance. To do so, you need to provide the information to authenticate to your WDC service instance and the associated trained model, as well as the test dataset to evaluate. The notebooks are designed to capture this information via a JSON file; service-specific examples follow the sketch below.
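As a concrete illustration of the split step, here is a minimal sketch using pandas and scikit-learn. The file names and the "text"/"class" column layout mirror the test CSV format described below and are otherwise illustrative:

# Minimal sketch of the 70/30 groundtruth split using pandas and scikit-learn.
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the groundtruth: one row per example, with the input and its correct label.
groundtruth = pd.read_csv("groundtruth.csv")  # columns: "text", "class"

# Split 70/30, stratifying on the label so every class is represented
# proportionally in both sets.
train_set, test_set = train_test_split(
    groundtruth,
    test_size=0.3,
    stratify=groundtruth["class"],
    random_state=42,  # fixed seed for a reproducible split
)

train_set.to_csv("train.csv", index=False)
test_set.to_csv("test.csv", index=False)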

NLC example JSON file:
{
  "url": "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers",
  "user": "YOUR NLC instance username",
  "password": "YOUR NLC instance password",
  "nlc_id": "YOUR NLC instance classifier id",
  "test_csv_file": "YOUR test csv file path",
  "results_csv_file": "YOUR output results csv file",
  "confmatrix_csv_file": "YOUR confusion matrix csv file"
}

url, user and password parameters: you get these parameters when you provision a Natural Language Classifier (NLC) service on Bluemix.
nlc_id: you get the nlc_id parameter when you train an NLC classifier by uploading the training data.
test_csv_file: the CSV file that contains the test dataset for evaluation. Note that this file must include a header row with the columns "text" (the text utterances) and "class" (the correct label for each utterance).
results_csv_file and confmatrix_csv_file: for these parameters, specify file paths to write the results to. Once you run through the notebook, the results_csv_file captures the text utterance (from test_csv_file), the correct label (as specified in test_csv_file), the label predicted by NLC, and the confidence of that prediction as reported by NLC. The confmatrix_csv_file captures the confusion matrix.
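The notebook handles the API calls for you, but for context, here is a minimal sketch of the evaluation loop it performs, using the requests library against the NLC /classify endpoint. The config file name and the result-CSV column names are illustrative assumptions:

# Minimal sketch: classify each test utterance with a trained NLC classifier.
import json
import pandas as pd
import requests

with open("nlc_config.json") as f:  # hypothetical config file name
    config = json.load(f)

test_set = pd.read_csv(config["test_csv_file"])  # columns: "text", "class"

rows = []
for _, example in test_set.iterrows():
    # GET {url}/{classifier_id}/classify returns the top class and
    # per-class confidences for one utterance.
    response = requests.get(
        "{0}/{1}/classify".format(config["url"], config["nlc_id"]),
        auth=(config["user"], config["password"]),
        params={"text": example["text"]},
    )
    result = response.json()
    rows.append({
        "text": example["text"],
        "class": example["class"],                         # correct label
        "predicted": result["top_class"],                  # NLC's prediction
        "confidence": result["classes"][0]["confidence"],  # top confidence
    })

pd.DataFrame(rows).to_csv(config["results_csv_file"], index=False)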

Conversation example JSON file:
{
  "url": "https://gateway.watsonplatform.net/conversation/api",
  "user": "YOUR CONVERSATION instance username",
  "password": "YOUR CONVERSATION instance password",
  "workspace_id": "YOUR WORKSPACE id",
  "test_csv_file": "YOUR test csv file path",
  "results_csv_file": "YOUR output results csv file",
  "confmatrix_csv_file": "YOUR confusion matrix csv file"
}

url, user and password parameters: you get these parameters when you provision a Watson Conversation service on Bluemix.
workspace_id: the id of your Watson Conversation workspace, which consists of defined intents, entities, and dialog flow. You upload your training data for intents to this workspace.
test_csv_file: the CSV file that contains the test dataset for evaluation. Note that this file must include a header row with the columns "text" (the text utterances) and "class" (the correct label for each utterance).
results_csv_file and confmatrix_csv_file: for these parameters, specify file paths to write the results to. Once you run through the notebook, the results_csv_file captures the text utterance (from test_csv_file), the correct label (as specified in test_csv_file), the label predicted by Conversation, and the confidence of that prediction as reported by Conversation. The confmatrix_csv_file captures the confusion matrix.
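For Conversation, intents are detected via the workspace message endpoint. Here is a minimal sketch of a single call; the config file name, the sample utterance, and the API version date are illustrative assumptions (use the version date appropriate for your service instance):

# Minimal sketch: get the top intent and confidence for one utterance.
import json
import requests

with open("conversation_config.json") as f:  # hypothetical config file name
    config = json.load(f)

# POST /v1/workspaces/{workspace_id}/message returns the detected intents,
# ranked by confidence, for a single utterance.
response = requests.post(
    "{0}/v1/workspaces/{1}/message".format(config["url"], config["workspace_id"]),
    auth=(config["user"], config["password"]),
    params={"version": "2017-05-26"},  # API version date; adjust as needed
    json={"input": {"text": "I want to check my account balance"}},
)
top_intent = response.json()["intents"][0]
print(top_intent["intent"], top_intent["confidence"])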

Visual Recognition example JSON file:
{
  "url": "https://gateway-a.watsonplatform.net/visual-recognition/api",
  "apikey": "YOUR apikey for visual recognition",
  "vr_id": "YOUR Visual Recognition custom classifier id",
  "test_csv_file": "YOUR test csv file path",
  "results_csv_file": "YOUR output results csv file",
  "confmatrix_csv_file": "YOUR output confusion matrix csv file"
}

url and apikey parameters: you get those when you provision a Watson Visual Recognition service on Bluemix.
vr_id: the id of your Watson Visual Recognition custom classifier, which you've trained by uploading your training data.
test_csv_file: the CSV file that contains the test dataset for evaluation. Note that this file must include a header row with the columns "image" (the full path to the image to be classified) and "class" (the correct label for each image).
results_csv_file and confmatrix_csv_file: for these parameters, specify file paths to write the results to. Once you run through the notebook, the results_csv_file captures the image (file path from test_csv_file), the correct label (as specified in test_csv_file), the label predicted by the Visual Recognition custom classifier, and the confidence of that prediction as reported by Visual Recognition. The confmatrix_csv_file captures the confusion matrix.
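For reference, here is a minimal sketch of classifying one image against a custom classifier with the Visual Recognition /v3/classify endpoint. The image file name and version date are illustrative, and how classifier_ids is passed varies across v3 API revisions, so check the API reference for your version:

# Minimal sketch: score one image against a custom Visual Recognition classifier.
import json
import requests

with open("vr_config.json") as f:  # hypothetical config file name
    config = json.load(f)

with open("example.jpg", "rb") as image_file:
    response = requests.post(
        config["url"] + "/v3/classify",
        params={
            "api_key": config["apikey"],
            "version": "2016-05-20",          # API version date; adjust as needed
            "classifier_ids": config["vr_id"],  # restrict scoring to your custom model
        },
        files={"images_file": image_file},
    )

classes = response.json()["images"][0]["classifiers"][0]["classes"]
for c in classes:
    print(c["class"], c["score"])  # Visual Recognition reports confidence as "score"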

Interpreting Results

After you’ve run the notebooks on your data, here are some tips to interpret the results and plan how to improve the performance of your trained custom models.

First, some definitions of important metrics for judging the performance of machine learning models:

Accuracy: Of all the samples in the test set, what fraction did the trained model label correctly?

Precision: Of all the samples predicted to be of a certain class, how many are actually labeled with that class?

Recall: Of all the samples that are actually labeled with a certain class, how many did the trained model correctly predict to be of that class?

Confusion matrix: The confusion matrix represents, in matrix structure, the distribution of the predicted vs. actual labels. The confusion matrix consists of N rows and N columns where N is the number of classes (or intents). Entries on the diagonal indicate correct predictions where the trained model predicted the same label as specified in the groundtruth. Entries off the diagonal indicate incorrect predictions.
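All of these metrics can be computed from the per-example results that the notebooks write out. Here is a minimal sketch using scikit-learn; the results file name and the "class"/"predicted" column names are assumptions that should match your results_csv_file:

# Minimal sketch: compute accuracy, precision, recall, f1, and confusion matrix.
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

results = pd.read_csv("results.csv")  # per-example output of the notebook
y_true = results["class"]      # correct labels from the groundtruth
y_pred = results["predicted"]  # labels predicted by the trained model

print("accuracy:", accuracy_score(y_true, y_pred))

# Weighted averaging accounts for any class imbalance in the test set.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
print("precision: {0:.3f}  recall: {1:.3f}  f1: {2:.3f}".format(
    precision, recall, f1))

print(confusion_matrix(y_true, y_pred))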

Reviewing the confusion matrix carefully highlights which classes (or intents) may be most confusing for the trained model to classify correctly. For example, assume we train the custom model to classify among 5 classes (class A, class B, class C, class D, and class E) and that there are 100 samples of each class (as labeled in test_csv_file). If the confusion matrix results are as shown in Table 1, then you can see that Class A and Class C have significant overlap. This prompts a more careful review of the training and test data for classes A and C. To improve the performance of your model, you can either merge class A and class C into one class AC or you can provide better training data to help the model distinguish between the two classes.

Table 1: Example Confusion Matrix (illustrative values)

                 Predicted A   Predicted B   Predicted C   Predicted D   Predicted E
Actual A              70             2            25             2             1
Actual B               3            92             2             2             1
Actual C              28             1            68             2             1
Actual D               2             3             1            93             1
Actual E               1             2             2             1            94
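Rather than scanning a large confusion matrix by eye, you can rank the most confused class pairs programmatically. Here is a minimal sketch that reads the confmatrix_csv_file, assuming the first column holds the actual-class labels and the header row holds the predicted classes:

# Minimal sketch: rank the largest off-diagonal confusion matrix entries.
import pandas as pd

cm = pd.read_csv("confmatrix.csv", index_col=0)  # notebook's confusion matrix output

# Collect off-diagonal entries: actual class != predicted class.
confusions = [
    (actual, predicted, cm.loc[actual, predicted])
    for actual in cm.index
    for predicted in cm.columns
    if actual != predicted and cm.loc[actual, predicted] > 0
]

# The largest off-diagonal counts point at the most confusable class pairs.
for actual, predicted, count in sorted(confusions, key=lambda c: -c[2])[:5]:
    print("{0} misclassified as {1}: {2} times".format(actual, predicted, count))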

In addition to computing accuracy, precision, recall, f1-score, and the confusion matrix, the WDC notebooks also report the confidence of predicted classes. This is important to review because your runtime application will decide what action to take next based on both the predicted label and the confidence of that prediction. We've typically found that when the trained model predicts an incorrect label, it usually does so with significantly lower confidence than when it predicts a correct label.
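You can check whether that confidence gap holds for your own model with a short sketch like the following; again, the results file name and column names are assumptions that should match your results_csv_file:

# Minimal sketch: compare confidence for correct vs. incorrect predictions.
import pandas as pd

results = pd.read_csv("results.csv")
correct = results["class"] == results["predicted"]

print("mean confidence when correct:  ", results.loc[correct, "confidence"].mean())
print("mean confidence when incorrect:", results.loc[~correct, "confidence"].mean())

# A clear gap between the two suggests a confidence threshold below which your
# application should ask for clarification instead of acting on the prediction.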

It is also important to validate that the groundtruth itself is correct. As you review the performance results, you may come across example utterances (or images) that were assigned the wrong label in the groundtruth (in test_csv_file). In that case, you need to fix those labels and re-run your tests to update the performance metrics.

Once you've computed the accuracy, precision, and recall metrics, you can focus on improving the performance of the system by following these guidelines:

  • To improve precision of an intent (or image) classification, review the training data to make sure there is consistent mapping of utterances to that intent.
  • To improve recall of an intent (or image) classification, add more training utterances that map to that intent.

Conclusion

As more applications leverage cognitive services to solve a wide range of business problems, including custom domains and enterprise applications, there is an increased need to train custom machine learning models. The Watson Developer Cloud (WDC) services offer a rich set of customization capabilities. In this blog, we've outlined how to leverage the WDC Jupyter notebooks to evaluate the performance of custom models built with the NLC, Watson Conversation, and Visual Recognition Watson services.

 
