Improve your natural language query results from Watson Discovery
Explore the Watson Discovery features that can help tune and improve relevance.
IBM Watson Discovery uses multiple artificial intelligence (AI) techniques to provide great out-of-the-box results for natural language queries. However, sometimes you might find that the results you’re getting aren’t as relevant as you’d like them to be and you need to improve them. Watson Discovery has a number of features that can help you tune and improve relevance, and here we present some ideas for how they can be used.
How does Watson Discovery optimize relevance?
One thing to keep in mind when optimizing Watson Discovery is that it is built for “long-tail” use cases, that is, use cases where you have many varied questions and results that you can’t easily anticipate and optimize for.
Using the following methods helps Watson Discovery perform better across future, unseen queries rather than optimizing a specific result for a specific query. If instead you have a few frequently asked questions with very specific answers, you might be better served by training intents in Watson Assistant to recognize these important, foreseeable questions (the “big head” of the information needs distribution) and return well-defined responses. Often, real-world use cases involve a combination of foreseeable “big head” information needs and unforeseeable “long-tail” information needs. Those use cases are best served by a combination of Watson Assistant and Watson Discovery (that is, using the Watson Assistant search skill or other common integrations).
For almost any type of improvement, it’s important to have a “test set” to validate that any changes made had the intended effect. This “test set” should be a set of queries that you are not using for training or adjustments, so you can effectively measure the impact of changes. The queries should also be representative of what you expect your users to ask. Ideally, these queries should come from usage logs of an existing or pilot application.
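To make “measure the impact of changes” concrete, here is a minimal sketch of scoring a held-out test set with mean reciprocal rank (MRR). The query texts and document IDs are hypothetical placeholders; in practice, the ranked lists would come from Discovery query responses before and after a change.

```python
# Sketch: evaluate a held-out test set of queries with mean reciprocal rank.
# test_set maps each query to its expected document ID; results_by_query
# maps each query to the ranked list of document IDs actually returned.
def mean_reciprocal_rank(test_set, results_by_query):
    total = 0.0
    for query, expected_id in test_set.items():
        ranked_ids = results_by_query.get(query, [])
        if expected_id in ranked_ids:
            # Reciprocal of the 1-based rank of the expected document.
            total += 1.0 / (ranked_ids.index(expected_id) + 1)
    return total / len(test_set)

# Hypothetical data: run once before and once after a change (for example,
# adding query expansions) and compare the two scores.
test_set = {"how do I log in": "doc_17", "reset my password": "doc_42"}
before = {"how do I log in": ["doc_03", "doc_17"], "reset my password": ["doc_42"]}
print(mean_reciprocal_rank(test_set, before))  # 0.75
```

Because the same fixed test set is scored each time, any movement in the metric can be attributed to the change you made rather than to a shift in the queries.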
Real-world challenges and potential solutions
The document results I’m getting don’t have anything to do with my question
First things first, make sure that you have content in Discovery that actually can answer the question. If you know the documents that it should be finding, run a more specific keyword query to try to find it. If the content that answers your query isn’t in your collection, you can add it either manually or by connecting to a data source.
If the document does appear in your collection but not in the query result, it’s possible that the document doesn’t contain any terms that match the question. For example, “how do I log in to my account” won’t be able to match a document that says something like, “use your user name and password to see your profile” because there are no overlapping terms. There are a couple of approaches to dealing with this problem:
Use Query Expansions. Define a set of synonyms that can help match the correct terms. For example, you might make account/profile synonyms or car/automobile synonyms, and that way you’ll find more matches.
Note that expansions can often have unintended side effects. For example, adding a synonym like car/auto might make sense for a specific question such as “how do I get auto insurance?” However, it might actually hurt accuracy for a question like, “how do I set up auto payment for my bill.” This is where having the separate test set can really help to determine whether the synonyms provided have a globally positive effect.
Add metadata tags to your documents. This is a way to augment the documents themselves to include terms that overlap with what words the users actually provide in their queries. This is usually done by updating a document and adding a new field to the JSON with the tagged terms.
Note that as with expansions, this can have unintended effects if applied selectively. If tags are added only to documents that match specific queries, you might not get a more general benefit and might actually throw off the training. Ideally, you want an approach to tagging that is applied across all documents independent of the queries.
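A minimal sketch of such a corpus-wide approach: add a tags field to every document's JSON from a fixed vocabulary before updating the documents in Discovery. The vocabulary and field name here are hypothetical; the point is that tagging is driven by the documents, not by individual failing queries.

```python
# Sketch: augment a document's JSON with a hypothetical 'tags' field
# before updating it in Discovery. The vocabulary is applied uniformly
# to every document, independent of any specific query.
def add_tags(document, vocabulary):
    """Add a 'tags' field listing vocabulary terms found in the text."""
    text = document.get("text", "").lower()
    document["tags"] = [term for term in vocabulary if term in text]
    return document

doc = {
    "document_id": "doc_17",
    "text": "Use your user name and password to see your profile.",
}
tagged = add_tags(doc, vocabulary=["account", "login", "password"])
print(tagged["tags"])  # ['password']
```

The new field then becomes additional query-matchable text, so a query mentioning “password” can reach documents that never use the user's other words.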
Another cause might be that Watson Discovery is returning results that you don’t want because of specific parts of your documents, such as footers or tables of contents. One approach is to use Smart Document Understanding to identify this content and exclude it from the index. Smart Document Understanding can be an effective tool for understanding the visual structure of documents and creating clean documents for natural language queries.
Watson Discovery also has a data metrics dashboard that uses feedback to make it easier to spot queries that didn’t return any results, so that you can apply any of the options discussed previously.
The relevant documents I’m expecting are not near the top of the results returned
There might be a couple of causes for this result:
The corpus of documents that you have is relatively small (<1000 docs) so Discovery doesn’t have enough information to effectively separate what terms might be important in the corpus. Watson Discovery can work with corpora this small, but it’s generally less effective because it has less information about the relative frequency of terms in your domain.
The important terms in the user query are being outweighed by other matches in the documents.
Here are a few options to address these issues:
Update the Stopword list. Stopwords are very common terms that don’t contribute much to the actual relevance of a query. For example, in “how do I log in to my account,” the words how, I, and my are low-importance terms. With enough data, Watson Discovery can distinguish these as very common automatically, and some of these words are excluded by default. However, if you have a smaller number of documents, you might want to explicitly tell Watson Discovery not to consider certain terms. To do this, you can provide a custom stopword list. An extended list of English stopwords is available for download through NLTK. Such lists provide a starting point, and you can augment them with words specific to your use case. For example, if all of your documents are about a single product, then the name of that product is not useful for distinguishing among documents and is a good candidate for a stopword.
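As a sketch, the following combines a general English stopword list with a domain-specific addition and previews which query terms would remain. The base list is a small excerpt, and the product name is a hypothetical domain term; Discovery itself expects the final list uploaded as a plain text file, one word per line.

```python
# Sketch: build a custom stopword list from a general English base list
# (excerpted here; NLTK's full list is one possible source) plus
# domain-specific terms such as a product name shared by every document.
base_stopwords = {"how", "do", "i", "to", "my", "the", "a", "an"}
domain_stopwords = {"acmewidget"}  # hypothetical product name
stopwords = base_stopwords | domain_stopwords

def content_terms(query):
    """Preview which terms of a query would survive stopword removal."""
    return [t for t in query.lower().split() if t not in stopwords]

print(content_terms("How do I log in to my AcmeWidget account"))
# ['log', 'in', 'account']
```

With the common words removed, matching weight concentrates on the terms that actually discriminate among documents.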
Use Relevancy Training. Relevancy training can help learn how best to weigh matches between the query terms and document terms across fields. If given good representative training data, it can help weigh matches on things like multi-word terms that should be given more weight when matched together.
Pre-process your query by using Watson Assistant or Watson Natural Language Understanding. If the queries your users are running have specific entities of interest (for example, product names, companies, or brands) then you might want to consider pre-processing the query before sending it to Watson Discovery to identify these and provide them as structured data to Discovery.
For example, for a query like “how do I change the oil on my Ford,” you might want to ensure that only results for the car make “Ford” match. You can run this query through Watson Natural Language Understanding first and use the entities that are found as filters over the full text of the documents. Similarly, if you have a Watson Assistant application as a front end, you can use Watson Assistant entity detection to identify entities in the user query and send them as filters to Discovery.
Note that because filtering is narrowing down the set of results, these filters might exclude documents that are relevant based on the other terms in the query. This approach is most effective when you are confident that user queries need to contain a particular entity like a product name, and the relevant documents contain that information as well.
If you are using Watson Natural Language Understanding over the query, you can go a step further and filter the specific entity and entity type on the entity enrichments that are performed against the documents in Watson Discovery. The enrichments in Watson Discovery are performed by using Watson Natural Language Understanding as well so you can have a strong match to Watson Natural Language Understanding on the query, and when using Watson Knowledge Studio for domain customization of entities, this can be especially powerful.
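A minimal sketch of this pattern: turn the entities detected in the user query into a Discovery filter string. The field path `enriched_text.entities.text` follows Discovery's default Natural Language Understanding enrichment output, but verify it against your collection's schema; the detected entity here is mocked rather than coming from a live NLU call.

```python
# Sketch: build a Discovery filter string from entities detected in the
# user query (for example, by Watson Natural Language Understanding or
# Watson Assistant). The field path follows Discovery's default NLU
# enrichment output; verify it for your collection.
def build_entity_filter(entities):
    clauses = [
        'enriched_text.entities.text:"{}"'.format(e["text"]) for e in entities
    ]
    # In the Discovery query language, a comma joins clauses with AND.
    return ",".join(clauses)

# Mocked detection result for "how do I change the oil on my Ford".
detected = [{"text": "Ford", "type": "Company"}]
print(build_entity_filter(detected))
# enriched_text.entities.text:"Ford"
```

The resulting string is passed as the query's filter parameter alongside the natural language query, narrowing candidates before relevance ranking runs.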
I want the results to weigh certain fields of my docs more heavily (for example, title)
Consider using Relevancy Training. This is one of the significant advantages of a machine learning approach like Watson Discovery: it can use examples you’ve provided to determine the best way to weigh results so that you don’t have to do that tuning manually. Also, in cases where it might seem important to emphasize a document title, doing so might have negative implications for other queries, so letting the training handle this weighting can be valuable. The important thing is to provide representative training data to the service: real user queries and the documents (or a subset of them) that they are searching against.
One caveat here is that for training to take the title into account, it needs to be a top-level text field. That means that it needs to be at the same level as document_id when you look at the JSON output of a document. This can be done by reingesting your content with a normalization step that copies a field like extracted_metadata.title to a new top-level field called title. This happens by default when using Smart Document Understanding.
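As a sketch, such a normalization step in a Discovery v1 ingestion configuration looks like the following fragment. The operation and field names follow the v1 configuration API; verify them against your service version.

```python
# Sketch: a normalization step for a Discovery v1 ingestion configuration
# that copies the extracted title to a new top-level field, making it
# available to Relevancy Training.
normalizations = [
    {
        "operation": "copy",
        "source_field": "extracted_metadata.title",
        "destination_field": "title",
    }
]
print(normalizations[0]["destination_field"])  # title
```

After reingesting with this configuration, the title appears at the same level as document_id in each document's JSON.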
If your documents don’t have many useful top-level fields, you can use Smart Document Understanding to identify parts of your documents that could be extracted as new top-level fields that could then be used to train on or to split your documents.
Relevancy Training isn’t helping
An important thing to remember about Relevancy Training is that it learns a general model of how terms in the query tend to match terms in relevant documents; it does not learn about the specific training questions or documents themselves. So the training data should be representative of what your users will actually ask, rather than made-up questions. Often, assumptions about the types of questions that users will ask turn out to be incorrect, which can cause the training data to not be representative of real usage. When possible, use historical logs or other real user data to supply the training data.
Part of the rationale for using real user data is the fact that a set of realistic queries does not necessarily constitute a statistically representative sample of queries. Domain experts are often very good at crafting sample queries such that any one of them could be a real user query, but collectively, they do not share the same statistical characteristics as a real user query log would. If you are considering crafting your own sample queries to train Watson Discovery, ask yourself questions like “on average, how many words will there be in each query,” “what percentages of those words will be nouns, verbs, or adjectives,” “what percentage of those words will be abbreviations,” “what will the most frequently asked query be and how frequently will it be asked,” or “what will be the most frequent terms used in the query and how frequent will each of them be?” Because Watson Discovery’s relevancy training is learning statistical trends about term matches, it’s important to have data-driven answers to some of these questions so that you can create representative training sets that lead to more effective models.
Also, as previously noted, it’s important to assess the benefits of training in aggregate across a test set of queries versus against a few specific queries.
One reason that training might not be helping is that the relevant documents are not found by the initial, untrained search. Relevancy Training works in two passes: first, it runs an untrained search to find a set of 100 documents, then it uses the trained model to rerank those 100 results, bringing the most relevant to the top. If the desired answer doesn’t appear in the top 100, training has no way to bring it up even if it has a perfect model. In this case, review the issues and solutions from the previous two questions to try to improve your matches, then retrain.
Note that if there are just a few cases where the results are not appearing in the top 100, it might be a better use of time to simply ignore those training examples and focus on collecting more data. Often, simply having more training examples provides more benefit than having fewer fully covered training examples. If there are specific questions that you want to make sure you get a particular answer for, consider incorporating Watson Assistant as mentioned in the general notes above.
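The two-pass behavior can be illustrated with a toy rerank: even a perfect trained model can only permute the candidate window returned by the untrained search, so a document outside the top 100 can never be surfaced. All document IDs and scores here are hypothetical.

```python
# Toy illustration: relevancy training reranks only the candidate window
# returned by the untrained search (top 100 in Discovery). A document
# outside that window cannot be surfaced, however good the model is.
WINDOW = 100

def rerank(untrained_ranking, trained_score):
    candidates = untrained_ranking[:WINDOW]  # pass 1: untrained retrieval
    return sorted(candidates, key=trained_score, reverse=True)  # pass 2

# Hypothetical: doc_150 is the ideal answer but ranks 150th untrained.
untrained = ["doc_{}".format(i) for i in range(1, 201)]
perfect_model = lambda doc: 1.0 if doc == "doc_150" else 0.0
reranked = rerank(untrained, perfect_model)
print("doc_150" in reranked)  # False: it never entered the candidate set
```

This is why improving the untrained match (expansions, stopwords, metadata) comes before training: it widens the funnel that training can then reorder.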
Out of the box, Watson Discovery uses pre-trained, general models to improve the quality of the results. When training data is added, these models are replaced by your collection-specific training data. With a small amount of training data (for example, close to the minimum 50 queries), the out-of-the-box models might actually still perform better than with training. You might need to collect a larger amount of training data (a few hundred examples) to see a more meaningful boost from the relevancy training.
Discovery can help you get good training data with usage monitoring. It tracks the queries that are being run, and if you instrument your application, the clicks against those results. This way you can use real user queries as the basis for training. And as you roll out your application, the service can start to perform continuous relevancy training on its own using those user clicks.
There might not be enough signals in your training data for the Relevancy Training to have an effect. If you are training on single word queries with just a single top-level field, the training might not have enough factors to consider. In this case, try getting more representative queries or try introducing new top-level fields either through the enrichments in Watson Discovery or by using Smart Document Understanding.
I’m not getting relevant passages
Similar to document search, passage retrieval first selects a set of candidate passages from each document based on matching terms and then reranks them to surface the most relevant passages. This process can run into a few issues:
Documents are too large. If a document has a lot of text (for example, a 50-page PDF as a single document), passage retrieval will have too many candidate passages and might not be able to score the one that is most relevant.
Documents have special characters. If documents have a lot of special characters, they can be recognized as separators for sentences and passages can end up looking choppy and less relevant.
To address these issues, you can use some of the passage retrieval features:
Change the fields over which passage retrieval searches, and try to use the cleanest field in your data. For example, if you have an HTML and a text field, focus on the text field so that you can get better sentence boundaries and more relevant results.
Change the passage length. Adjusting the length can affect how the passages are scored, so if you are seeing lower-ranked passages that are close but not capturing enough content, try expanding the number of characters per passage.
Passage retrieval does not have training, but it can be associated with documents to take advantage of Relevancy Training. Each passage result indicates the document ID from which it came. In your application, you can choose to display results in document order and then choose the highest-scoring passage for that document_id. Note that in some cases, there might be documents that do not have a matching passage. In those cases, you can fall back to a highlight or skip that document in the result set.
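A minimal sketch of that pairing logic, assuming simplified result shapes (the field names document_id, passage_score, and passage_text follow Discovery's passage results, but the records here are mocked):

```python
# Sketch: display results in document order, pairing each document with
# its highest-scoring passage and falling back to None when a document
# has no matching passage. Result records are simplified mocks.
def best_passage_per_document(documents, passages):
    best = {}
    for p in passages:
        doc_id = p["document_id"]
        if doc_id not in best or p["passage_score"] > best[doc_id]["passage_score"]:
            best[doc_id] = p
    # Preserve the (relevancy-trained) document order from the query result.
    return [(d["document_id"], best.get(d["document_id"])) for d in documents]

docs = [{"document_id": "a"}, {"document_id": "b"}]
passages = [
    {"document_id": "a", "passage_score": 4.2, "passage_text": "..."},
    {"document_id": "a", "passage_score": 7.1, "passage_text": "..."},
]
paired = best_passage_per_document(docs, passages)
print([(doc_id, p and p["passage_score"]) for doc_id, p in paired])
# [('a', 7.1), ('b', None)]
```

This way the document ordering benefits from Relevancy Training while each entry still shows its most relevant snippet.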
Use Smart Document Understanding to define where to split your documents into smaller segments.