IBM Watson Explorer has transformed business processes across industries where semi-structured data, consisting of structural metadata and textual descriptions, is of great value. Examples include Voice of Customer (VOC) feedback sent to a call center, public infrastructure incident reports, insurance payment claims, and many more. Watson Explorer helps users find insights in these data sources through both advanced document search and statistical analysis of the distribution of words in the data that interests them.

Watson Explorer Deep Analytics Edition focuses on enhancing these capabilities through machine learning techniques. Two such features were added to this version, released in February 2018: Similarity Search and Document Classification. This article introduces the Similarity Search function. Please read this blog for a description of Document Classification.

The Similarity Search function improves the ranking of a search result, that is, the order of documents in the search result, so that a search user experiences better search relevancy. Search relevancy is an important factor in the success of many business processes because it eases the work of finding documents similar to a given document that is used as the query. In this blog, I will explain a use case of similarity search and how Watson Explorer uses machine learning to address the issue of search relevancy. I will also explain how Ranker, the Similarity Search implementation in Watson Explorer, is designed to reduce the user's workload in preparing training data for machine learning.

How it works in an Insurance Underwriting Process

A key component of insurance underwriting is evaluating the risk of potential subscribers and determining the appropriate level of coverage and payment. Let us assume you want to subscribe to a health insurance product. First, you will need to submit documents about your current health status and medical history to the insurance company. The insurance company will then review the documents to determine your eligibility and the coverage of the insurance. How do they make that decision? Usually each insurance company has its own underwriting guidelines for human investigators. An investigator reads the submitted documents and checks the facts against the company’s guidelines to determine your insurance eligibility.

Figure 1. Investigator needs to find relevant references.

A critical issue here is that the investigators may need to look for relevant references among the large number of documents processed in the past, so that their decision is consistent with previously accepted insurance policies (Figure 1). It is easy to imagine that this search task could be daunting and increase the workload of the investigators, because they have to come up with good search criteria to find relevant documents. Remember that, since the document may include your health status and medical history, the number of metadata fields and/or keywords (let us call these “words”) in the document may run into the hundreds. Which of these candidate search criteria matter for retrieving the documents most likely to be a good reference? Note that the ideal search criteria may be a combination of multiple words, and the investigators do not have the time to try all of them.

The Ranker function of Watson Explorer addresses this issue by searching for documents similar to a given document and returning a ranking of the retrieved documents. In this ranking, the documents are sorted in descending order of relevancy to the given document. In short, Ranker can process a document as a query to find similar documents. For example, the insurance underwriting investigators can select a single document of interest as the query in Ranker, without spending time working out search criteria (Figure 2). From a business process optimization perspective, the benefit is clear: it reduces the workload of the investigators and, as a result, shortens the time needed to process each document.

Figure 2. Ranker searches for references using a given document.

How Machine Learning works in Watson Explorer’s Ranker

Watson Explorer uses machine learning to implement Ranker. You can grasp the essence of Ranker with some understanding of basic machine learning.  I will try to explain it without depending on mathematics or algorithmic details as much as possible. If you are familiar with machine learning, you can easily translate my explanations in this section into algorithmic details, although there may be many choices of concrete algorithms.

First, let us understand what Ranker does: it receives a document as an input (query) and returns a ranking of documents as the result of a search. A document consists of metadata and its textual segment. Metadata is typically a set of structured fields such as date, disease name, diagnosis type, and so on. A textual segment contains sentences written in natural language.

A ranking is an ordered list of documents. The order should reflect the information need represented by the query. If a ranking includes 100 documents, then the first-ranked document should be the most relevant to the query among the 100 documents, the second-ranked document should be the second most relevant, and so on. See Figure 3 for an example of a document and a ranking.

Figure 3. Example of Documents and Ranking.

The biggest question here is how to determine the ranking when a query is made. To understand what is essential, let us consider very simple and abstract examples of a query and a document. We assume that each piece of information in a document is denoted by an uppercase character, A, B, C, etc., whether it is metadata or a word in the text, and we call it a term. For example, “A” may be “Date:2018/05/06” in metadata, “B” may be “operation” as a keyword in the text, and so on. A query is then represented as a set of terms such as Q={A, B, C, F}. Each document to be retrieved is also represented as a set of terms.

Now, consider two documents D1={A, B, E, F} and D2={C, E, G, H, K}. When we think of the ranking of documents for the query Q={A, B, C, F}, which of D1 and D2 should be ranked higher? This seems obvious: Q and D1 have three terms in common, that is, A, B, and F, while Q and D2 have only one term in common, C. Apart from a subtle problem of document size, few would object that D1 is better than D2 in terms of relevancy to Q.

In the above discussion, we implicitly assume that each term has the same significance for calculating a ranking.

Whatever the exact definition of “significance” is, when all the terms have equal significance in a ranking, it is natural to regard the number of terms common to both a query Q and a document D as the relevancy of D to Q. In this example, we can say that rel(D1, Q)=3 and rel(D2, Q)=1, where rel(D, Q) means the relevance score of D to Q. The ranking is then naturally defined such that D1 is ranked higher than D2 when rel(D1, Q) > rel(D2, Q).
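The equal-significance case can be sketched in a few lines of Python. This is only an illustration of the counting idea described above, not Ranker's implementation; the function name `rel` mirrors the notation in the text, and the sets are the abstract examples Q, D1, and D2.

```python
def rel(document: set, query: set) -> int:
    """Relevance as the number of terms shared by the document and the query."""
    return len(document & query)

# The abstract query and documents from the text.
Q = {"A", "B", "C", "F"}
D1 = {"A", "B", "E", "F"}
D2 = {"C", "E", "G", "H", "K"}

documents = {"D1": D1, "D2": D2}

# Sort document names in descending order of relevance to Q.
ranking = sorted(documents, key=lambda name: rel(documents[name], Q), reverse=True)
print(ranking)  # ['D1', 'D2']
```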

However, it is very likely that terms differ in significance in general. For example, when we consider medical documents, a term that appears frequently in text, such as “have” or “when,” is probably less significant than a term that represents a topic, such as “diabetes” or “pneumonia.” Note that the significance of a term depends on the context. When we consider a collection of documents about diabetes, the term “diabetes” would not be significant at all because it cannot distinguish documents in the collection. In fact, standard, traditional search algorithms can determine the highest-ranked documents without machine learning, by estimating the significance of each term that appears in both the query and a document from statistics over the collection of documents and other information.
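One widely used collection statistic of this kind is inverse document frequency (IDF). The sketch below is my own illustration of the idea and is not taken from Watson Explorer; the three-document collection is invented. It shows why “diabetes” carries no significance in a collection where every document mentions it.

```python
import math

def idf_significance(collection):
    """Estimate term significance as log(N / document frequency)."""
    total = len(collection)
    document_frequency = {}
    for document in collection:
        for term in document:
            document_frequency[term] = document_frequency.get(term, 0) + 1
    return {term: math.log(total / count)
            for term, count in document_frequency.items()}

# Invented collection in which every document is about diabetes.
collection = [
    {"diabetes", "insulin", "have"},
    {"diabetes", "retinopathy", "have"},
    {"diabetes", "neuropathy", "when"},
]

sig = idf_significance(collection)
print(sig["diabetes"])                # 0.0, the term appears in every document
print(sig["insulin"] > sig["have"])   # True, rarer terms score higher
```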

Thank you for waiting. Now we are ready to use machine learning for the ranking problem. In a nutshell, the machine learning used in Ranker uses training data to estimate the significance of each term that appears in both the query and a document, and determines the relevance score rel(D, Q) for each document D to define a ranking. Let us look at this with our usual example.

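Whatever the significance of a term T to the query Q is, we treat it as a numerical value denoted by sig(T). It is then natural to say that the relevancy of D to Q is the sum of the significance values of the terms that D and Q have in common:

rel(D, Q) = sum of sig(T) over all terms T that appear in both D and Q

In our example, rel(D1, Q) = sig(A) + sig(B) + sig(F) and rel(D2, Q) = sig(C).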

What machine learning in Ranker learns is sig(T) for each possible term T, so as to give a better ranking. Again, when all of the terms have the same significance (sig(A) = sig(B) = sig(C) = sig(F)), rel(D1, Q) is three times larger than rel(D2, Q), and we can say that D1 is more relevant to Q than D2. However, the term C may be highly related to Q, resulting in rel(D1, Q) < rel(D2, Q)[1]. Machine learning learns this from training data. This means that the quality of a ranking is determined by the training data.

[1] Strictly speaking, we assume here that the significance score of each term is always positive.
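To make the effect of term significance concrete, here is a small Python sketch of the weighted relevance score. The numbers in `learned` are invented for illustration and are not values produced by Ranker; they only show how a single highly significant term can reverse the ranking.

```python
def rel(document, query, sig):
    """Relevance as the sum of sig(T) over terms shared by document and query."""
    return sum(sig.get(term, 0.0) for term in document & query)

Q = {"A", "B", "C", "F"}
D1 = {"A", "B", "E", "F"}
D2 = {"C", "E", "G", "H", "K"}

# Equal significance: D1 wins because it shares three terms with Q.
equal = {term: 1.0 for term in Q}
print(rel(D1, Q, equal), rel(D2, Q, equal))      # 3.0 1.0

# Hypothetical learned significance in which C dominates: D2 now wins.
learned = {"A": 0.5, "B": 0.5, "C": 4.0, "F": 0.5}
print(rel(D1, Q, learned), rel(D2, Q, learned))  # 1.5 4.0
```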

Now it’s time to look into training data. Training data is a collection of triples (D, Q, r), where r is “true” when a document D is considered relevant to a query Q and “false” otherwise. A triple is usually called a training example, or simply an example. An example is called a positive example when r is true and a negative example when r is false. Also, in training data, a single query Q may be associated with multiple documents, and vice versa.

In our usual example of query and document, a machine learning algorithm in Ranker will give a larger value to sig(C) if it finds a positive example (D2, Q, true) in the training data or, more generally, an example (D, Q, true) in which D and Q have the term C in common. On the other hand, sig(A) may be very small if the training data does not contain any positive example for A. In this way, you can steer Ranker toward the rankings you expect by preparing appropriate training data.

Of course, it is not so straightforward to determine the significance of each term, because a term may appear in both positive and negative examples, and there is typically a large number of terms to consider. To cope with this, machine learning in Ranker reduces the problem to an optimization problem in which, roughly speaking, the majority of positive examples in the training data contain terms of high significance. Please read here for more technical details. You will find that the discussion in this article is an introduction to “learning-to-rank,” a family of learning algorithms for rankings.
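As a rough illustration of learning sig(T) from examples, the sketch below trains a pointwise logistic regression by gradient steps over the terms shared by D and Q. This is a deliberately simple stand-in that I chose for readability, not the algorithm Ranker uses. Apart from D1, D2, and Q, the training examples are invented, and unlike the simplifying assumption in footnote [1], the learned values here may become negative.

```python
import math

Q = {"A", "B", "C", "F"}

# Hypothetical training examples (D, Q, r); only D1, D2, and Q come from the text.
examples = [
    ({"C", "E", "G", "H", "K"}, Q, True),   # D2, judged relevant
    ({"A", "B", "E", "F"}, Q, False),       # D1, judged not relevant
    ({"A", "C", "X"}, Q, True),
    ({"A", "F", "Y"}, Q, False),
]

def train_significance(examples, epochs=200, learning_rate=0.5):
    """Learn one weight per term with pointwise logistic regression."""
    sig = {}
    bias = 0.0
    for _ in range(epochs):
        for document, query, relevant in examples:
            shared = document & query
            score = bias + sum(sig.get(term, 0.0) for term in shared)
            predicted = 1.0 / (1.0 + math.exp(-score))
            error = (1.0 if relevant else 0.0) - predicted
            # Move the bias and each shared term toward the observed label.
            bias += learning_rate * error
            for term in shared:
                sig[term] = sig.get(term, 0.0) + learning_rate * error
    return sig

def rel(document, query, sig):
    """Relevance as the sum of learned sig(T) over shared terms."""
    return sum(sig.get(term, 0.0) for term in document & query)

sig = train_significance(examples)
d1_score = rel({"A", "B", "E", "F"}, Q, sig)
d2_score = rel({"C", "E", "G", "H", "K"}, Q, sig)
print(d2_score > d1_score)  # True: C appears only in positive examples
```

Because C occurs only in positive examples while A, B, and F occur mostly in negative ones, the learned sig(C) ends up the largest, and D2 is ranked above D1 for this training data.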

How to train Ranker

Watson Explorer supports users who want to refine document rankings in accordance with past search results or with criteria for when two documents should be regarded as similar. Please read this article for in-depth technical information or to try it yourself. Now, let us see how to use Watson Explorer to train an instance of Ranker.

As we have already learned, training data consists of a collection of training examples. However, preparing such training examples is time-consuming. Instead of providing information of the form “This document is relevant to this query (document),” a user can tell Ranker which attribute in a document indicates similarity among documents, in the sense that if two documents share the same term for that attribute, they should be considered relevant to each other. Since Ranker takes a document as a query and evaluates the relevancy of the query document to each document, it can learn the appropriate significance for each term from this information.

Figure 4 shows some documents in a dataset for Watson Explorer. There are 5 documents, shown as rows in the table. Each document is a fictional record of a patient, including metadata (PATIENT_ID, AGE_RANGE, SEX, EXAMINATION, DIAGNOSIS) and text (MAIN_COMPLAINT). Let us assume that a user wants to train an instance of Ranker so that it can suggest patient records similar to that of a new patient. But wait: isn't this simply a matter of comparing the diagnosis terms in the query and the documents? The answer is “no,” because newly incoming records may not be associated with any diagnosis yet, and hence a query may not contain a diagnosis value. Ranker can learn the significance of each term based on the diagnosis terms in the training data, and it will return documents relevant to a query even if the query does not include a diagnosis term.
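The way an attribute can stand in for hand-made relevance judgments can be sketched as follows. This is my own illustration of the idea, not Watson Explorer code: the records and the helper `examples_from_attribute` are hypothetical, and only the field names PATIENT_ID, AGE_RANGE, SEX, and DIAGNOSIS come from the dataset described above. Each ordered pair of records becomes a triple (D, Q, r), with r true when the two records share the same diagnosis.

```python
from itertools import permutations

# Invented records that reuse some of the field names of the example dataset.
records = [
    {"PATIENT_ID": "P1", "AGE_RANGE": "40-49", "SEX": "F", "DIAGNOSIS": "diabetes"},
    {"PATIENT_ID": "P2", "AGE_RANGE": "50-59", "SEX": "M", "DIAGNOSIS": "diabetes"},
    {"PATIENT_ID": "P3", "AGE_RANGE": "40-49", "SEX": "M", "DIAGNOSIS": "pneumonia"},
]

def examples_from_attribute(records, id_field, answer_field):
    """Build (document_id, query_id, relevant) triples from a shared attribute."""
    examples = []
    for query, document in permutations(records, 2):
        relevant = query[answer_field] == document[answer_field]
        examples.append((document[id_field], query[id_field], relevant))
    return examples

examples = examples_from_attribute(records, "PATIENT_ID", "DIAGNOSIS")
positives = [(d, q) for d, q, relevant in examples if relevant]
print(len(examples))  # 6
print(positives)      # [('P2', 'P1'), ('P1', 'P2')]
```

In this sketch the diagnosis is used only to derive the label r. The terms whose significance is learned would come from the remaining fields and the text, which is why a query without a diagnosis can still be ranked.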

Figure 4. Example of a dataset.

Figure 5 shows the configuration in which a user can specify the attribute to be used in training. The user can select the Answer Field from the list of available attribute names in the dataset. In this case, the user should select “Attribute Type” as the Answer Field Type to use the diagnosis terms as described above. Training data that directly states “This document is relevant to that document” can also be used, by selecting “ID Type” as the Answer Field Type. The corresponding answer field in the training data should then contain the IDs of the documents similar to each document.

Figure 5. Answer Field and Type in Training Configuration.

Watson Explorer provides intuitive operations for training Ranker. Also, it takes advantage of the powerful capability of natural language processing to extract terms from text.

Give Watson Explorer a try to experience the power of Ranker

5 comments on "Do you still rely on keyword search? Find similar documents easily with Watson Explorer’s machine learning powered Ranker"

  1. Rajneesh Kumar May 30, 2018

    can we retrieve the filename and specific page number ?

  2. Issei Yoshida June 03, 2018

    Hi Rajneesh, assuming that you have a file that contains multiple pages, such as a PDF or another office document format, you can access the filename of each file if it is retrieved as a document in a search result when the file is crawled by the filesystem crawler. As for page numbers, you would need to write a custom converter to extract the page number from each page, depending on the file format. This is a general specification of document ingestion in Watson Explorer, regardless of whether Ranker is used.

  3. Hi Issei,
    Thanks for the article. Could you please help me to understand how we can use the built Ranker using API’s ?

  4. Issei Yoshida October 09, 2018

    Hi DSiddhesh, for usage of the API, please refer to the documentation. API documentation is available only on each installation for now.
