Relevancy Training is a powerful capability in Watson Discovery Service that can improve search accuracy when approached correctly. Used or trained improperly, however, it will deliver results that fall short of what you hope for. Follow the guidelines in this post to get the most out of relevancy training for your system.
Make sure you have the right use case
Start by making sure you are using Watson Discovery Service and Relevancy training for the right purpose. There are two primary search use cases in this space that you should be aware of: a small set of search topics for which there are lots of frequently asked questions (i.e. the short head) and a large set of search topics where each topic might only come up once or twice in questions (i.e. the long tail).
The short head is exemplified by queries that users will ask frequently (like “where is ____?” or “what is ____?”) and are good for conversation bots because trained intents can be used to match the query to pre-defined answers. Another option for dealing with these types of queries is to create spotlights. Spotlights are manually created documents that are surfaced to the top when a specific query has been entered (think Google Sponsored links).
Long tail queries are harder to anticipate, since the information needs are more varied. These are the typical kinds of questions that experts will ask because they know that they need to increase the complexity of their question to get a more precise answer. A good example of this type is the query “employee vacation schedule for Pittsburgh office”. The person has clearly indicated that they are looking for a particular representation of what vacation is available to employees, and they also know enough to specify that they want this information for the Pittsburgh office only.
The diagram below compares short head and long tail searches. On the left is a representation of short head queries, which arrive in high volume. Moving to the right, the graph shows the long tail: as questions get more complex and more specific, the volume of each individual query drops, yet a large share of all queries resides in this space. These tend to be the really hard questions to answer.
Relevancy Training in the Watson Discovery Service is primarily intended to solve long tail problems. This means it is best suited for varied, hard-to-anticipate questions whose answers may come from many possible documents.
The fallback
You may find that you have both kinds of questions, short head and long tail. This is actually a very common case, and you can solve it by using something like Watson Conversation for the short head questions, then falling back to Discovery when a question cannot be answered there (i.e. it fits into the long tail category). The decision of when to fall back to Discovery can be based on the confidence score produced by Watson Conversation: if the score is below a certain threshold (meaning the system is not very confident), that is a good indicator that the query doesn’t fit into the short head and it is best to attempt long tail search strategies like Discovery. See our other post for guidance on how to choose optimal confidence thresholds.
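The fallback decision described above can be sketched in a few lines. This is an illustrative sketch only: `conversation_client` and `discovery_client` are hypothetical wrappers around your own Watson Conversation and Discovery calls, not part of any SDK, and the threshold value is a placeholder you would tune as discussed.

```python
# Sketch of a short-head/long-tail fallback. The client objects are
# hypothetical stand-ins for your own Watson Conversation and Discovery
# wrappers; the threshold is a placeholder to be tuned per the guidance above.

CONFIDENCE_THRESHOLD = 0.5

def answer(query, conversation_client, discovery_client):
    """Try the short-head path first; fall back to Discovery for the long tail."""
    confidence, intent_answer = conversation_client.classify(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return intent_answer               # short head: pre-defined answer
    return discovery_client.query(query)   # long tail: full-text search
```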
Use the right approach to prepare your training data
There are two steps in preparing training data: (1) collecting sample queries and (2) rating “relevant” answers to these queries. Preparing this data does take some effort but the primary principle is that you need to mimic your end user’s interaction with Watson as much as possible.
In the first step, you want to collect a sample of the queries that actual users will submit to Watson via your application. The best approach is to use real query logs. Perhaps you have a system you are replacing whose query logs you can mine. Another option is to roll out your Watson Discovery Service solution without Relevancy training and start collecting query logs that you can then use for relevancy training.
Avoid making up questions, even if they are created by SMEs. This can introduce an unnatural bias into the training and lead to less than desirable outcomes. Having SMEs invent queries on behalf of their users is even more dangerous, and generating questions from a document also introduces bias.
The second step is to identify which search results are “relevant” to each query. Whether a search result is “relevant” is subjective and sometimes ambiguous to determine, so it’s generally convenient to use a simple rating scheme (0 = not relevant, 10 = relevant) and assign these ratings to the top documents returned in the search results before you have undertaken relevancy training.
It is ok if only one or two of the search results per query are marked with a rating of 10. However, if you found that no search results are relevant, then you may need to either:
• Look further down the list of search results for relevant documents
• Augment the documents in your collection with keywords and text so that they are more likely to show up somewhere in the search result for this query (even if it’s really far down the list)
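A single labelled training example built with the 0/10 rating scheme above might look like the sketch below. The field names (`natural_language_query`, `document_id`, `relevance`) follow the shape of the Discovery training-data API at the time of writing, but treat them as an assumption and check against your service version before uploading.

```python
# Minimal sketch of one labelled training example. Field names are assumed
# to match the Discovery training-data API; verify against your version.

def make_training_example(query, rated_documents):
    """rated_documents: list of (document_id, relevance) pairs, relevance 0 or 10."""
    return {
        "natural_language_query": query,
        "examples": [
            {"document_id": doc_id, "relevance": relevance}
            for doc_id, relevance in rated_documents
        ],
    }
```

It is fine, as noted above, if only one or two documents per query carry a rating of 10.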
Once you have uploaded these training examples, the system will check that a sufficient number of “relevant” documents appear in the initial search results for the sample queries. If the number is insufficient, the system won’t train, and you will see that the requirements are not met in the collection training details. You can also check the notices API to see if there are any problems with the examples you provided. You may have to drop some of these “no hit” queries from your training examples or work harder to augment the documents in your corpus.
When it comes to evaluating your system, make sure you aren’t using the same data for training as for evaluation. Remember, your goal with Relevancy Training is to prepare the system for the long tail, where search topics are too varied to anticipate ahead of time; we don’t want the system to cheat the evaluation by having previously seen all the queries during training (though our training algorithms do protect somewhat against this by preventing the model from simply memorizing examples).
The best way to ensure that you do not inadvertently bias your evaluation data is to simply take a random sample of your labelled queries and set them aside before you upload training data.
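Setting aside that random sample is a one-liner worth getting right. The sketch below uses only the standard library; the 20% holdout fraction and fixed seed are illustrative choices, not requirements.

```python
import random

def split_queries(labelled_queries, holdout_fraction=0.2, seed=42):
    """Randomly set aside a held-out evaluation set before uploading training data."""
    rng = random.Random(seed)        # fixed seed so the split is reproducible
    shuffled = labelled_queries[:]   # copy; don't mutate the caller's list
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]  # (train, eval)
```

Only the first list is uploaded as training data; the second is kept aside purely for measuring accuracy.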
Feeding and caring over time
Over time, you are going to add and remove documents from your collection. This should not have any direct impact on the performance of your system. Again, remember that Discovery’s goal is not to memorize your examples, but rather learn general trends from them – so simply adding/removing documents to the underlying collection shouldn’t dramatically change system behavior.
However, if the types of documents in your collection change dramatically (e.g. if you add a lot of summary articles or change how the documents are formatted or organized), then it’s probably a good idea to upload some new training data and thereby re-trigger system training. When doing so, remember to revisit the previously uploaded ratings to make sure recent changes to the collection have not invalidated which document ids are relevant versus not relevant.
Making these changes to your training data is usually simple, so add it to your list of “house-keeping” tasks. In addition, if your application can track the end user experience through metrics like click-through rate, you may want to monitor these statistics over time and treat significant deviations as an indicator that it’s time for some training data house-keeping.
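One lightweight way to act on such metrics is to compare recent click-through rate against a historical baseline. In this sketch the 20% relative-drop threshold is an illustrative assumption, not a Watson default; pick a value suited to your traffic.

```python
# Flag when recent click-through rate drifts well below the historical
# baseline, as a trigger for training-data house-keeping. The 20%
# relative-drop threshold is an illustrative choice.

def needs_housekeeping(baseline_ctr, recent_clicks, recent_queries,
                       max_relative_drop=0.2):
    if recent_queries == 0:
        return False                      # no traffic, nothing to conclude
    recent_ctr = recent_clicks / recent_queries
    return recent_ctr < baseline_ctr * (1 - max_relative_drop)
```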
If you remove all training data the model remains (no re-training would occur as requirements are not met). If you then add new training data (that meet the requirements) the model will be retrained from scratch using the new training data.
Any changes to the training data going forward will kick off another retraining (checks are made about once per hour). If you delete some of the data, the model will be retrained as long as the training data requirements are still met. If you add more data, the model will be retrained.
You should take an incremental approach to training. This means you should add some training data, evaluate, and then add more. Eventually you will begin to reach a plateau of accuracy improvements. This incremental approach allows you to optimize the amount of effort you invest in training.
For more information, you can watch a webinar on Relevancy Training or see what other webinars are offered in the Building with Watson webinar series. There is also a great tutorial on Using Relevancy training to train your private search collection.
Many people provided helpful input on drafts of this document. I would particularly like to thank Bill Murdock, Rishav Chakravarti, Anish Mathur, and Michael Keeling for their contributions.