By Seth Bachman, IBM Watson Offering Management

We have all found ourselves lost and frustrated looking for an answer buried somewhere in vast amounts of information. It costs time, wears patience down to a frazzle, and is expensive for businesses. This is one area where IBM Watson Retrieve and Rank, an API that became generally available last week, can help. IBM Watson Document Conversion, which is often used with Retrieve and Rank, was also released in experimental mode to help convert content in commonly used document formats (e.g., PDF, Word) into formats that can be used by other Watson services.

Retrieve and Rank is designed to help developers build cloud-based cognitive apps that improve knowledge workers’ ability to find a needle in a haystack. We consider it to be a core part of our platform and a service around which many future capabilities will be built. This service applies a machine learning technique known as Learning to Rank to help users get better results from Apache Solr, a popular open source search platform. It does this by finding “signals” in the data that are relevant to a user’s query.
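To make the idea of learning from "signals" concrete, here is a minimal Python sketch of re-ranking search candidates with a learned model. It is an illustration only, not the model Retrieve and Rank uses: the two signals (query-term overlap with the title and with the body), the pairwise perceptron update, and every name in the code are assumptions made for this example.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, str]

def signals(query: str, doc: Doc) -> List[float]:
    """Hypothetical signals: share of query terms found in the title and in the body."""
    terms = query.lower().split()
    n = len(terms) or 1
    title = set(doc["title"].lower().split())
    body = set(doc["body"].lower().split())
    return [sum(t in title for t in terms) / n,
            sum(t in body for t in terms) / n]

def score(weights: List[float], feats: List[float]) -> float:
    """Linear relevance score: weighted sum of the signals."""
    return sum(w * f for w, f in zip(weights, feats))

def train(pairs: List[Tuple[str, Doc, Doc]], epochs: int = 10, lr: float = 1.0) -> List[float]:
    """Pairwise perceptron: nudge weights whenever the better doc does not outscore the worse one."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for query, better, worse in pairs:
            fb, fw = signals(query, better), signals(query, worse)
            if score(weights, fb) <= score(weights, fw):
                weights = [w + lr * (b - x) for w, b, x in zip(weights, fb, fw)]
    return weights

def rerank(query: str, docs: List[Doc], weights: List[float]) -> List[Doc]:
    """Reorder candidates (e.g. the top results from a search engine) by learned score."""
    return sorted(docs, key=lambda d: score(weights, signals(query, d)), reverse=True)

# Made-up example data: d1 is the document a human judged more relevant.
d1 = {"id": "d1", "title": "Reset your router",
      "body": "Hold the reset button for ten seconds"}
d2 = {"id": "d2", "title": "Billing FAQ",
      "body": "How to reset your billing password and router settings"}

weights = train([("reset router", d1, d2)])
top = rerank("reset router", [d2, d1], weights)
```

In this toy setup the model learns that a match in the title matters more than a match in the body, which is the kind of relationship that would otherwise have to be tuned by hand.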

For example, one early validation partner in a contact center used Retrieve and Rank to improve the ability of their agents to find highly relevant content to address incoming customer queries. Typical approaches to achieving this require painful and time-consuming manual tuning to get it right. Using Retrieve and Rank, and leveraging a combination of relevant information (for example, product name, short description, detailed problem description, etc.), the partner was able to train a machine learning model to improve the results returned for contact center agents' queries. This resulted in a significant improvement in findability compared to conventional information retrieval techniques.

As we continue to expand our portfolio of Watson Developer Cloud services, you’ll notice how they start to fit together nicely. Document Conversion is a natural fit with Retrieve and Rank, and with other services too. Document Conversion is designed to take documents (currently PDF, Word, and HTML), break them into bite-sized chunks, and convert them to an output format that can be used by another service. In conjunction with Retrieve and Rank, Document Conversion is a perfect fit for an organization that must wade through a stack of manuals or other types of unstructured documents to find critical information. You could use the Document Conversion service to produce textual units of the appropriate size, and feed those as documents into Retrieve and Rank. Then, by using our enhanced relevancy techniques, users will be able to quickly find the information they need.
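As a rough picture of what "textual units of the appropriate size" can mean, the Python sketch below splits already-extracted plain text into word-limited chunks that could each be indexed as a separate document. It does not call or reproduce the Document Conversion service; the function name, the paragraph-based splitting rule, the `max_words` parameter, and the `id`/`body` field names are all assumptions for illustration.

```python
from typing import Dict, List

def split_into_units(doc_id: str, text: str, max_words: int = 50) -> List[Dict[str, str]]:
    """Group paragraphs into chunks of at most max_words words (hypothetical rule)."""
    units: List[List[str]] = []
    current: List[str] = []
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        if not words:
            continue
        # Start a new unit if this paragraph would overflow the current one.
        if current and len(current) + len(words) > max_words:
            units.append(current)
            current = []
        current.extend(words)
        # A single oversized paragraph is cut into max_words-sized slices.
        while len(current) > max_words:
            units.append(current[:max_words])
            current = current[max_words:]
    if current:
        units.append(current)
    # Each unit gets its own id so it can be indexed and retrieved separately.
    return [{"id": f"{doc_id}-{i}", "body": " ".join(words)}
            for i, words in enumerate(units)]

# Made-up manual text with three short paragraphs.
manual = "a b c\n\nd e\n\nf g h i"
chunks = split_into_units("manual", manual, max_words=5)
```

Smaller units tend to make each search hit more specific, at the cost of losing some surrounding context, so the right size depends on the content.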

Our research scientists continue to innovate in these areas. We are looking at ways to make the process of getting up and running easier, while striking a balance between making machine learning easy to apply and giving sufficient control and flexibility to developers who want to tweak the knobs. We are also investigating ways to provide developers with additional Watson automated learning capabilities. This can include, for example, applying what can be learned from user clicks and other activity to continuously improve the models for a service in order to provide better results.

Please let us know the interesting things you do with these services by participating in our Watson Developer Community Forum.


IBM is placing the power of Watson in the hands of developers and an ecosystem of partners, entrepreneurs, tech enthusiasts and students with a growing platform of Watson services (APIs) to create an entirely new class of apps and businesses that make cognitive computing systems the new computing standard.


3 comments on "Retrieve and Rank finds highly relevant content buried deep in complex information"

  1. I am trying to use Retrieve and Rank for a very similar use case (a call centre answering incoming customer questions). I can load up Solr with a bunch of information for it to search. The next step involves creating a huge training CSV file where I basically have to come up with as many potential customer questions as possible, and then tell the R&R service what the answer is. Creating that CSV file would be an absolutely massive job, as I have loaded a significant amount of data (i.e., the proverbial haystack) and manually sifting out the right answer would take me a long time. Doing that for every question I can come up with makes the task seem insurmountable. What am I gaining by using this service if I have to actually tell it the right answer? That doesn’t seem like ‘machine learning’. Am I missing something fundamental here?

    Next question, is there some sort of UI available for leverage that can aid in the creation of this CSV file?

    Thanks!

    • Emmanuel Vigno October 27, 2015

      The key point is that you train your model on a limited scope of questions. Not all the possible questions. Then, if in real life, the system gets a question that is not part of the training set, it will rely on the trained model to come up with an answer.

      To give you more concrete metrics, let’s say you have 1,000 Q&A pairs. There is more than one way to ask any given question, which is why you will have to train your model on, let’s say, 5,000 question-and-answer pairs. In real life, the system will then be able to answer 50k, 500k, or 5,000k questions based on the trained model.

      Hope this helps.

    • Seth Bachman October 29, 2015

      Emmanuel is correct about the principle of training on a subset of your data using a “ground truth” made up of known relevant answers. Some guidance on how much training data is needed is provided here: http://www.ibm.com/smarterplanet/us/en/ibmwatson/developercloud/doc/retrieve-rank/training_data.shtml. Note that these types of use cases are less about an exact answer and more about getting the best result from a broad set of possibilities (i.e., less about Q&A pairs per se, and more about defining the documents most relevant to a particular query). This relevancy can be defined on a scale that makes sense based on your data, meaning it could be a range like 0 = not relevant, 4 = very good, 5 = perfect, etc., and the model will adjust accordingly. This process helps the machine learning side of the service find the “signal” coming from your data and rerank the results appropriately.
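The graded "ground truth" discussed in the comments above can be pictured with a short Python sketch that writes one CSV row per question, followed by document ids and their relevance grades. The row layout (question, then alternating document id and grade), the 0 to 5 range, and all names here are assumptions for illustration; the training data documentation linked above is the authority on the format the service expects.

```python
import csv
import io
from typing import List, Tuple

Judgment = Tuple[str, int]  # (document id, relevance grade)

def ground_truth_row(question: str, judgments: List[Judgment]) -> List[object]:
    """Build one row: the question, then alternating doc id and grade (assumed layout)."""
    row: List[object] = [question]
    for doc_id, relevance in judgments:
        # Assumed scale for this sketch: 0 = not relevant ... 5 = perfect.
        if not 0 <= relevance <= 5:
            raise ValueError(f"relevance out of range for {doc_id}: {relevance}")
        row.extend([doc_id, relevance])
    return row

def write_ground_truth(examples: List[Tuple[str, List[Judgment]]]) -> str:
    """Serialize all examples to CSV text; the csv module handles quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for question, judgments in examples:
        writer.writerow(ground_truth_row(question, judgments))
    return buffer.getvalue()

# Made-up example: one question with a perfect and an irrelevant document.
csv_text = write_ground_truth([
    ("How do I reset my router?", [("doc-12", 5), ("doc-40", 0)]),
])
```

Including a few clearly irrelevant documents alongside the good ones gives the model contrast to learn from, rather than only examples of what a right answer looks like.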
