Recently, IBM Watson Discovery Service introduced a new capability called Relevancy Training. Relevancy Training lets you teach Watson what results should be surfaced higher than others so that your users can get the right answer to their question faster. You can train your private search collections by using either a tooling-only approach, or by using the Discovery APIs. In this tutorial, I explain how to use the tooling to train your private search collection.
Relevancy training is a process that lets you take a query, look at the search results returned from that query, and tell Watson what the ordering should be. This way, you are training Watson by using example queries that are representative of the queries that your users enter, and with explicit ratings of the search results.
After Watson has enough information from you, it starts to learn about the patterns and structure of the search collection and the queries your users enter. Watson uses machine learning techniques to find specific signals in queries that can be applied against the corpus, and it identifies how those learnings relate to new queries that users enter. Using these signals and patterns, it can differentiate between “good” and “bad” documents. Watson then reorders the search results based on the training it received.
Of course, Watson is only as good as its teacher, so it is important to ensure that any training it receives is performed by someone who knows the data. The training questions should also be representative of what your users will enter. I recommend that you select the queries randomly from your records of actual user queries. Do not handpick examples that look like “good” queries to you. Handpicking is likely to introduce a bias into your training data toward the queries that you would like users to ask rather than the queries that users actually do ask.
So, how do you approach getting these queries?
If you are replacing an existing search system, then there is a good chance that you have logs of queries that actual users are asking the system. Source your queries from these logs to seed the relevancy training for your Watson Discovery-based solution. Enter the queries from your logs, view the results from Watson, and then tell Watson which results are good and which are bad.
If you are starting with a new implementation, I advise you to initially deploy the system without Relevancy Training, but ensure that you are logging queries. Then, use the queries that you have logged to train Watson using Relevancy Training.
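The random-sampling advice above can be sketched in a few lines. This assumes a simple log format with one user query per line; the function name and defaults are illustrative, not part of any Watson tooling.

```python
import random


def sample_training_queries(log_path, n=50, seed=42):
    """Randomly sample unique user queries from a query log.

    Sampling randomly (rather than handpicking) avoids biasing the
    training set toward queries you would *like* users to ask.
    Assumes one query per line; blank lines are skipped.
    """
    with open(log_path) as f:
        queries = list({line.strip() for line in f if line.strip()})
    random.seed(seed)  # fixed seed only so runs are repeatable
    return random.sample(queries, min(n, len(queries)))
```

The default of 50 mirrors the roughly 50 unique queries the tooling needs before training can begin.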
As an example, you might be looking for information within content-rich, detailed publications like these from the Public Library of Science.
To search these documents, I’ll upload them to a Watson Discovery Service instance. After I upload the documents, I’ll search against them. Then, I’ll start the Relevancy Training process by uploading queries. For each query, I’ll review the results that are returned and rate them as being Relevant or Not Relevant. After I’ve satisfied the learning requirements for Watson, I’ll let Watson learn from the information that I provided. Finally, I’ll try some of those searches again with a newly trained Watson.
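The tutorial below uses the tooling, but the same searches can be made against the Discovery v1 REST API. As a rough sketch, a natural language query is an HTTP request to the collection’s `query` endpoint; the base URL, IDs, and version date here are placeholders, so check your service credentials and the current API reference before using them.

```python
from urllib.parse import urlencode


def build_query_url(base_url, environment_id, collection_id, query,
                    version="2018-12-03"):
    """Build a Discovery v1 natural language query URL.

    Sketch only: `base_url`, the IDs, and the `version` date are
    placeholders to replace with your own service values.
    `passages=true` asks the service to return passages alongside
    the document results.
    """
    params = urlencode({
        "version": version,
        "natural_language_query": query,
        "passages": "true",
    })
    return (f"{base_url}/v1/environments/{environment_id}"
            f"/collections/{collection_id}/query?{params}")
```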
This tutorial assumes that you have some familiarity with IBM Cloud and Watson Discovery Service. You’ll need an IBM Cloud account to begin. If you don’t have an IBM Cloud account, you can request a free trial here. If you already have an IBM Cloud account and Discovery instance, you can skip to step 5. If you already have a collection, you can skip to step 9.
- Log in to your IBM Cloud account.
- Click Catalog.
- Click Discovery under the Watson services.
- Click Create to create a Discovery instance.
- From the Discovery instance details, launch the Discovery tool.
- Click Create a data collection to create a new data collection within your instance. If you previously created an environment, you should see the following window for naming the collection. If not, you’ll see a prompt to create one. You can also find more information on environments in the Discovery docs.
- Name the collection. You can continue to use the default configuration, which provides all the necessary settings for Relevancy Training.
- After the collection is created, the collection details page opens. On this page, you can upload the documents that you want to perform relevancy training on. The Discovery Service supports documents in PDF, Microsoft Word, HTML, or JSON formats. You can easily drag documents from your local file system to the Discovery Service.
- After you upload a few files, you should see the available documents count update. Now you’re ready to query.
- Select Query this collection. This opens the data insights page, which provides an overview of the data in your collection based on the natural language enrichments that are applied to the content. To see your search results, select Build your own query.
- In the query builder interface, select Use natural language by default on the query and filter inputs.
- Select Run query to get results that contain passages as well as document results. The summary shows the top passages and the top results for the query. The JSON response contains a separate top-level portion of the response that contains the passages that were retrieved, plus the document results.
- The results are ordered based on their relevance to the user’s search query. Switch to the JSON view to see the document results in their raw form. You can collapse the passages section to get a better view of the results section (the documents returned in the search results).
- If the results are not great for your query or other queries you are testing with, click Train Watson to improve results. Watson needs you to tell it what documents are the best results for your queries. After you train Watson with enough queries (a good representation of your users’ queries) and associated answers, Watson starts reordering your results based on that learning. Adding queries provides representative queries for Watson. Rating the results lets Watson understand what makes good results based on the query. Adding more variety to your results helps Watson to differentiate between good results and great results.
- Click Add a natural language query.
- Type a natural language query into the box and click Add. Keep in mind that it needs to be representative of what users will enter.
- Click Rate more results for the query that you entered. Now you can view the search results returned for that query. You can see the document title and text passages, and you can also view the full text of the document by clicking View document. If you don’t see the best document in the first page of results, click the next page by using the page navigation at the bottom of the screen.
- Click Relevant if a document is relevant to your query. Click Not relevant if the document is not relevant to your query. After you’ve rated some documents that are relevant and some that are not, click Back to queries to return to the list of queries. You’ll notice that some aspects of the screen have changed after you rated the results. First, for each query you rated, you’ll see how many documents you rated as relevant and how many you rated as not relevant.
- Repeat the process to add and rate more queries. If you need to delete a query, click the trash can on the right side of the query. As you rate more queries, you will notice that the training requirements at the top also change. With each requirement met, Watson crosses it off to let you know it has enough information. You’ll need about 50 unique queries to satisfy the requirement for training. When Watson has crossed off all the requirements, it begins the preparations to start learning. After Watson is ready, it starts the learning process. This process can take some time (typically no more than 30 minutes, depending on how much data is present). Voilà! Watson has learned by using the training information that you provided. You should continue to add representative queries and ratings over time to help Watson learn. Also, you should retrain Watson whenever you add or delete documents that could change the ratings of the queries you already rated. You should also revisit training whenever the types of documents or queries change dramatically. Now you should retry some of the queries you previously entered to see the adjustments that Watson makes.
- Click the magnifying glass on the left side of the screen, and select the collection that you just trained.
- Select Build your own query to go to the query page.
- Enter a natural language query and view the results. They should be improved based on the feedback that you used to teach Watson.
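The add-and-rate steps above can also be performed through the Discovery v1 API by posting training data to a collection’s `training_data` endpoint. The sketch below builds that request body; the relevance values 10 and 0 are my assumption for the tooling’s Relevant and Not relevant ratings, so verify them against the current API reference.

```python
import json


def build_training_query(query, rated_docs):
    """Build the JSON body for one training query in the Discovery v1
    training_data endpoint -- the API equivalent of adding a query and
    rating its results in the tooling.

    `rated_docs` maps document IDs to True (Relevant) or False
    (Not relevant). Relevance values 10/0 for those two ratings are an
    assumption to check against the API reference.
    """
    return json.dumps({
        "natural_language_query": query,
        "examples": [
            {"document_id": doc_id, "relevance": 10 if relevant else 0}
            for doc_id, relevant in sorted(rated_docs.items())
        ],
    })
```

Posting one such body per representative query, then letting the service retrain, mirrors the tooling workflow described above.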
This tutorial demonstrated how you can use Relevancy Training in the Watson Discovery Service to teach Watson to make better judgments when ordering search results. This in turn gives your users the answers to their questions faster. Now that you know how to use Relevancy Training, you can begin applying this technique to other business applications.
Use Relevancy Training in any application that is providing documents as search results. You can use this capability in product support cases to help agents find answers to customer questions quickly, research scenarios to scan the latest publications, training applications to help knowledge workers get up to speed, enterprise applications to surface the most relevant answers to FAQs, and many other potential use cases.
See the documentation for details on how to try out Relevancy Training in IBM Watson Discovery Service.
For more information, you can watch a webinar on Relevancy Training or see what other webinars are offered in the Building with Watson webinar series. There is also a good blog on How to get the most out of Relevancy Training.