How DXC uses dictionaries to customize natural language processing solutions faster
Learn how one company created a chatbot to read a client's data in a specific domain, leading to more relevant answers for the users of…
At DXC, a global IT and consulting firm, we are always looking to leverage the best possible tools to serve our clients. On a recent project, our team was tasked with building an AI-powered chatbot in the insurance domain for a client, relying on government legislation content for answers. To do this, we turned to Watson Assistant, Watson Discovery, and Watson Knowledge Studio.
Users of this system need to be able to ask the Watson Assistant chatbot a question and be given a relevant, accurate answer. While we have built in answers to many common questions and intents, the real challenge for the system is surfacing a relevant answer when an unfamiliar topic comes up. This is where Watson Discovery, an AI-powered insight engine, can search for content in the client’s data that might answer the question.
We hit a roadblock with this step, as Watson Discovery wasn’t always pulling the correct information, largely because the text was so specialized to the insurance industry. Enter Watson Knowledge Studio, which lets users create custom natural language processing models, thereby letting Watson Discovery search data in the context of insurance. Users teach by example, annotating sample documents with the types of entities and relationships that they want Watson to recognize on its own. We pulled in subject matter experts from our client to determine what in the text was most important.
To speed up the annotation process, we used a feature of Watson Knowledge Studio called pre-annotation. Users can apply dictionaries, Watson Natural Language Understanding, or a previously built machine learning model to annotate new training documents automatically, which greatly reduces the time spent on manual annotation.
A dictionary is a list of words that always fall under a single entity type. For example, one of our entity types was occupation, so we wanted words like doctor and teacher to always be recognized as this type. Watson Knowledge Studio lets users upload a CSV file of these terms, with three columns:
- Lemma, or the main word
- Poscode, a numerical value indicating the part of speech (POS)
- Surface forms of the word, or synonyms
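As an illustration of the three-column layout described above, here is a minimal Python sketch that builds a dictionary file for the occupation entity type. The header names, the poscode value 3 for nouns, and the convention of repeating the lemma among the surface forms are assumptions to check against the Watson Knowledge Studio documentation; `build_dictionary_csv` is a hypothetical helper, not part of any IBM tooling.

```python
import csv
import io

# Assumption: 3 is the part-of-speech code for a noun.
NOUN_POSCODE = 3


def build_dictionary_csv(entries):
    """Return CSV text with one row per lemma.

    entries: list of (lemma, poscode, surface_forms) tuples.
    Each row is written as: lemma, poscode, surface form 1, surface form 2, ...
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Assumed header row naming the three columns.
    writer.writerow(["lemma", "poscode", "surface"])
    for lemma, poscode, surface_forms in entries:
        writer.writerow([lemma, poscode, *surface_forms])
    return buffer.getvalue()


# Example terms taken from the occupation entity type mentioned in the text.
occupation_entries = [
    ("doctor", NOUN_POSCODE, ["doctor", "doctors", "physician"]),
    ("teacher", NOUN_POSCODE, ["teacher", "teachers"]),
]

occupation_csv = build_dictionary_csv(occupation_entries)
print(occupation_csv)
```

Building the text in memory first makes it easy to inspect a dictionary before saving it to a file for upload.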
Our team built 30 dictionaries, one for each of the 30 entity types in our custom model. Each of these dictionaries contains between 10 and 150 examples of that entity type. We then ran the dictionary pre-annotator to label all of these terms with the correct entity type in our training documents. This process was far easier and faster than having humans annotate each of these mentions. We then manually added further annotations, especially for words with multiple meanings, like “pay,” which a dictionary cannot reliably assign to a single entity type.
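To make the idea of dictionary pre-annotation concrete, here is a simplified Python sketch of the general technique: every occurrence of a dictionary term in a document is labeled with that dictionary's entity type. This is not Watson Knowledge Studio's actual implementation, which is not described here; the function name `pre_annotate`, the `OCCUPATION` label, and the sample sentence are all hypothetical.

```python
import re


def pre_annotate(text, dictionaries):
    """Label every dictionary term found in text.

    dictionaries: {entity_type: [surface forms]}
    Returns a list of (start, end, matched_text, entity_type) tuples.
    """
    # Map each lowercased surface form to its entity type.
    surface_to_type = {}
    for entity_type, surface_forms in dictionaries.items():
        for surface in surface_forms:
            surface_to_type[surface.lower()] = entity_type
    if not surface_to_type:
        return []

    # Try longer terms first so multiword terms win over their substrings.
    alternatives = sorted(surface_to_type, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(s) for s in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), surface_to_type[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


dictionaries = {"OCCUPATION": ["doctor", "teacher"]}
sentence = "The teacher asked whether a doctor must pay the premium."

annotations = pre_annotate(sentence, dictionaries)
for start, end, matched, entity_type in annotations:
    print(start, end, matched, entity_type)
```

In this sketch "pay" is deliberately absent from every dictionary, so it stays unlabeled and is left for a human annotator, mirroring the manual step described above.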
We trained the model based on these annotations and were thrilled with how well it performed. Watson Knowledge Studio provides built-in statistics for model evaluation, along with recommendations on how to improve. We achieved a precision score of 0.96 (the fraction of the model’s annotations that were correct) and a recall score of 0.94 (the fraction of the correct annotations that the model found). This was after just four weeks, with two people working on the model.
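The two scores can be illustrated with a short Python sketch that compares a model's annotations against a human-annotated reference set. The annotations below are invented purely to show the arithmetic and are not data from our project; `precision_recall` is a hypothetical helper, and Watson Knowledge Studio computes its own statistics.

```python
def precision_recall(predicted, gold):
    """Compute precision and recall for two collections of annotations.

    Each annotation is a hashable item such as (start, end, entity_type).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of the model's annotations that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of the correct annotations that the model found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative annotations only: (start offset, end offset, entity type).
gold = [
    (4, 11, "OCCUPATION"),
    (28, 34, "OCCUPATION"),
    (40, 43, "PAYMENT"),
    (48, 55, "PAYMENT"),
    (60, 66, "OCCUPATION"),
]
predicted = [
    (4, 11, "OCCUPATION"),
    (28, 34, "OCCUPATION"),
    (40, 43, "PAYMENT"),
    (70, 75, "OCCUPATION"),  # not in the reference set: a false positive
]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here three of the four predicted annotations are correct (precision 0.75), and three of the five reference annotations were found (recall 0.60).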
If your team is planning to leverage pre-annotation, we recommend the following best practices:
While collecting the examples for each entity, centralize all of the data in a single Excel spreadsheet. Features like macros and duplicate highlighting can greatly reduce the effort. We created macros to automatically separate and sort the final cut of data into the CSV files required for Watson Knowledge Studio ingestion.
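We did this step with Excel macros, but the same separate-and-sort logic can be sketched in Python for teams that prefer a script. The version below is a hypothetical equivalent, not our macros: it assumes a master list of (entity type, lemma, poscode, surface forms) rows, flags any lemma that appears more than once, and writes one sorted CSV file per entity type. The `BENEFIT` entity type, the file naming, and the header row are assumptions for illustration.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_master_list(rows, out_dir):
    """Write one dictionary CSV per entity type and report duplicate lemmas.

    rows: iterable of (entity_type, lemma, poscode, surface_forms).
    Returns ({entity_type: path}, [(lemma, first_type, repeated_type)]).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    by_type = defaultdict(list)
    first_seen = {}  # lowercased lemma -> entity type where it first appeared
    duplicates = []
    for entity_type, lemma, poscode, surface_forms in rows:
        key = lemma.strip().lower()
        if key in first_seen:
            # Keep the first occurrence and flag the repeat for human review.
            duplicates.append((lemma, first_seen[key], entity_type))
            continue
        first_seen[key] = entity_type
        by_type[entity_type].append((lemma, poscode, surface_forms))

    paths = {}
    for entity_type, entries in by_type.items():
        path = out_dir / f"{entity_type.lower()}.csv"
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["lemma", "poscode", "surface"])  # assumed header
            for lemma, poscode, surface_forms in sorted(entries, key=lambda e: e[0].lower()):
                writer.writerow([lemma, poscode, *surface_forms])
        paths[entity_type] = path
    return paths, duplicates


master_rows = [
    ("OCCUPATION", "teacher", 3, ["teacher", "teachers"]),
    ("OCCUPATION", "doctor", 3, ["doctor", "doctors"]),
    ("BENEFIT", "pension", 3, ["pension", "pensions"]),
    ("BENEFIT", "Doctor", 3, ["Doctor"]),  # repeats a lemma already listed
]

with tempfile.TemporaryDirectory() as tmp:
    paths, duplicates = split_master_list(master_rows, tmp)
    occupation_lines = paths["OCCUPATION"].read_text(encoding="utf-8").splitlines()

print(occupation_lines)
print(duplicates)
```

Flagged duplicates are reported rather than silently dropped, so a subject matter expert can decide which entity type the term belongs to.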
Find people well-versed in the relevant domain who can help you maximize the accuracy of the results for the following reasons:
- They are best-placed to provide you with content with the highest entity ‘mention’ density.
- They understand which entities are most relevant to the domain.
- They can flag entity examples that require manual annotation.
Ultimately, the Watson Knowledge Studio pre-annotation abilities were key to customizing our AI solution. With a domain-specific lens provided by our custom model, Watson Discovery is able to read the client’s data in the context of insurance, leading to more relevant answers for the end users of our solution.