by Janki Vora, Mathews Thomas, Tomyo Maeshiro, Ronald Rutkowski, Joshua Purcell, Juel Raju | Published January 23, 2017
Watson Conversation is now Watson Assistant. Although some illustrations in this tutorial may show the service as Watson Conversation, the steps and processes will still work.
In earlier tutorials in this series, we gave an overview of cognitive computing and provided examples of how cognitive computing is being used to create industry solutions. We also provided successful cognitive use cases to help you understand what can be accomplished by building a cognitive platform. In this third tutorial in the series, we look at design patterns to make cognitive data searchable and understandable.
The amount of data that is generated by both man and machine is increasing at an exponential rate. In recent years, technology has advanced to glean insights from large data sets by using big data analytics platforms. The IBM cognitive platform gives organizations the ability to understand dark data and use it in their decision making. It does this by using cognitive search patterns. Each form this dark data takes — text, video, trouble tickets, or social media data — requires a unique pattern. In this tutorial, we define the composition and purpose of cognitive search patterns as a whole and explore the process, tools, and technologies that are used to work with specific dark data types — specifically those in the Telecommunications and Media & Entertainment industries. (Refer to the previous tutorial for more information on our scenarios.)
Cognitive data is information that humans can process effortlessly. It’s easy for us to see a video and make sense of it by putting the audio and visual data points together. The IBM cognitive platform enables machines to make sense of and learn from all varieties of unstructured dark data such as emails, manuals, 3D videos, and audio data. To do this, the platform uses a cognitive search pattern — a specific set of tools, technologies, and processes — to extract dark data from the source and make it searchable and, ultimately, usable.
Cognitive search patterns are as varied as the data types they draw on and the goals you have for them. Because of this, we limit the discussion in this tutorial to the four prevalent patterns we applied to the use cases discussed in our previous tutorial:
By applying the optimal cognitive search pattern to each of these data types, you can move beyond simple keyword search results to findings that have already been interpreted for you through machine learning. While we were motivated to apply our results to help support technicians tackle their most challenging networking problems (Watson for Network Operations), help wireless phone customers answer questions about their devices and wireless plans (Device Doctor), and provide personalized recommendations for media content (Personalized TV), the possibilities will continue to expand as the field matures and your own expertise grows with it.
In the Watson for Network Operations use case, we wanted to glean insights from various network device manuals, trouble tickets, and log data. In the Device Doctor use case, we used troubleshooting suggestions from community forums to enrich the existing knowledge base. In both cases, the resulting knowledge bases proved to be unstructured, large, and complex. However, in both cases we were able to apply the following cognitive search pattern to quickly pinpoint the most relevant sections of documents and forum posts, all in the context of the search question in natural language.
As shown, the natural language query is parsed and context-aware learning algorithms are applied, with the most relevant search results appearing at the top. This system is able to learn from its successes and failures and improves its results based on feedback. We applied this pattern in three different ways: context-based search, topic similarity, and ranked search.
Crawlers represent the first process in the pattern that is shown above. In some cases such as Watson for Network Operations, the knowledge base existed, meaning there was no need to configure a crawler. But in the Self-service agent use case, we needed to collect data from external web forums to build the knowledge corpus, and to do that we needed crawlers.
Crawlers are components within IBM Watson Explorer that manage the connection between the remote data source and the rest of Watson Explorer. Crawlers federate raw data sources into a single queue for further processing such as performing document conversion. Document conversion represents the next step in our search pattern and is discussed in the next section. Using crawlers can enable the following business benefits:
After verifying that we had the proper access and usage rights to those sources, we simply selected the data sources we wanted to use to create the crawler. (We will discuss crawling best practices later in this series.) Next, we created seeds for each of the data sources to add to our crawler. After that, we configured our crawler by using specific global settings and conditional statements.
By default, Watson Explorer supports over 40 different types of data sources. We had files and URLs as the primary sources for our projects. To elaborate, we used traditional Windows file/folder hierarchies that contained various documents such as Word, PDF, Excel, or rich text file formats by configuring file type seeds. Our URL type seeds represented online public web forums in HTML format. Each HTML page represented one thread on a forum, which was composed of one technical question and one or more answers in the form of user comments. We created a URL type seed for each individual web domain we chose to include. At this point, no further work was required other than adding seeds by selecting the corresponding seed type. In the following figure, notice that we added two URL type seeds, one for each individual web domain, to the “DeveloperWorks” corpus crawler. We also configured settings to allow Website2.com a maximum of 99 hops and Website1.com a maximum of 3 hops, with hops referring to how many links deep a crawler will follow from a starting page.
After we added seeds to our crawler, we could configure two panels of settings. The first panel, the Global Settings, represented a wide variety of performance control measures that could be specified for our crawler. For example, a user could specify that a crawler should discard all duplicate webpages, download data at a maximum of 5 Mbps, or automatically recrawl every 7 days. In the following figure, notice that for crawling limits the maximum number of URLs was set to 25,000 webpages.
The second panel of settings we configured was the Conditional Settings panel, which provided more flexibility in terms of the crawler’s scope. Watson Explorer supports roughly 12 types of conditional statements, but for our use we only needed to create URL Filters. URL Filters let the user specify which HTML webpages to crawl based on the string patterns of the URL. In the following figure, the conditional rules state that all URLs must not contain the substrings /tagged/ or /users/ but must contain the substring /questions. Additionally, all URLs must include a substring with one or more consecutive digits such as 12345 as defined by the regular expression.
Shortly after completing the crawler configuration, we created a custom converter.
Watson Explorer makes it possible to convert unstructured data to semi-structured or even structured file formats by using converters. Converters are components within Watson Explorer that allow the manipulation of data across differing file types. Converters can be customized to execute based on traditional condition statements and are configurable by using various programming languages. Overall, converters are used after data sources have been crawled but before document indexing and exporting takes place. By default, Watson Explorer has over 50 different converters available with support for over 100 file types to be used out of the box. However, Watson Explorer lets businesses create their own custom converters as well, which enhances the types of conversions Watson Explorer can perform.
Creating custom converters allows you to:
To create a converter, we first selected the input file type and the desired output file type. Next, we specified the conditions under which our converter should execute. Finally, based on the file types specified earlier, we selected the programming language used to write the custom instructions executed during the conversion process. The custom instructions include code that specifies which text snippets we wanted to extract from the original input, as well as which text snippets we wanted to discard. We also organized the extracted data by specifying an output file type.
In the Self-service agent example, we essentially created one custom XML file for each HTML file that was collected during the crawling process. The primary data source used was online public forums. Each HTML file represented one thread on a forum and was composed of one technical question with one or many answers in the form of user comments. Therefore, we selected HTML as the input file type for the converter.
Because our desired output file type was custom XML, we selected Watson Explorer XML (or VXML) as the converter output file type. The difference between traditional XML and VXML is that VXML is embedded with more functions specific to the Watson Explorer product suite. With both an input and output file type identified, we wrote conditional statements to control when the converter should execute. For conditional statements, Watson Explorer supports a wide variety of languages including wildcards, XPaths, and Perl regular expressions. We configured our conditional statements to detect wildcard expressions within the various URLs we had been crawling, and we executed our converter only when a URL containing the string questions/ was detected. Every URL that did not match this rule was ignored.
The last step in creating the converter was writing instructions in XSL, which we chose because it supports the use of XPath statements to extract text from an HTML file and inject that text into an XML file. Usually, this last step represents most of the work that is required to create a converter because identifying patterns within HTML files and writing those patterns in the form of XSL/XPath statements is not always straightforward.
The converting process and code resulted in a converter capable of creating a high-quality corpus of XML files, with each file containing only relevant snippets of text. In the following example, the code instructs the converter to create an element within our VXML file called answer1_text. The XPath specifies to look for a div tag within the HTML file with the ID of answers. Within that answers div tag, skip to the second nested div tag. Within that second nested div tag, find another div tag that matches the class post-text. Finally, copy the value of the text and make it the value of the answer1_text element.
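Outside of Watson Explorer, the same element walk can be sketched with Python's standard library. The forum markup below is a hypothetical stand-in for the crawled HTML; the element name answer1_text matches the one described above:

```python
import xml.etree.ElementTree as ET

# Hypothetical forum markup mirroring the structure described above.
html = """<html><body>
  <div id="answers">
    <div class="post"><div class="post-text">First answer.</div></div>
    <div class="post"><div class="post-text">Second answer.</div></div>
  </div>
</body></html>"""

tree = ET.fromstring(html)
answers = tree.find(".//div[@id='answers']")      # the div with ID "answers"
second = list(answers)[1]                         # skip to the second nested div
node = second.find(".//div[@class='post-text']")  # the div matching class post-text

# Inject the extracted text into a new XML element named answer1_text.
root = ET.Element("document")
ET.SubElement(root, "answer1_text").text = node.text
print(ET.tostring(root, encoding="unicode"))
```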
Returning to the crawling process described in the Crawlers section: after both the crawler and the converter were completed, the Watson Explorer engine was started through the overview pane, crawling and converting documents one by one and creating a corpus of XML documents from scratch. The following figure shows that 25,000 webpages were crawled, out of which 855 URLs were ignored and 24,145 XML documents were created.
Watson Explorer has file export capabilities in which the entire corpus of data can easily be exported for use with other products and services regardless of vendor or brand. For the Self-service agent example in particular, the corpus of new XML files was eventually exported to be used with the Watson Retrieve and Rank service.
In some of our projects, we also used IBM Document Conversion as a tool to transform input data. Document Conversion is not a crawler; rather, it receives a single HTML file and splits it into smaller segments, called answer units. It looks at the header tags (h1, h2, and so on) and creates an answer unit for each one. We used Document Conversion in the Watson for Network Operations example, ingesting the HTML of several websites. It provides fairly good results and can be easily integrated with the Retrieve and Rank service. In addition, it can also ingest DOC and PDF files, which are converted into HTML and processed in the same manner. Having an instance of Document Conversion is required when you launch the new Retrieve and Rank UI.
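As a rough sketch of that header-based splitting (not the Document Conversion service itself, which runs as a Bluemix API), the following Python breaks an HTML fragment into answer units at each heading tag:

```python
import re

def split_into_answer_units(html: str):
    """Split HTML into answer units at each h1-h6 heading,
    roughly mimicking Document Conversion's behavior."""
    # re.split with a capturing group keeps each heading's text,
    # so parts alternates: [preamble, title1, body1, title2, body2, ...]
    parts = re.split(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html, flags=re.S)
    units = []
    for title, body in zip(parts[1::2], parts[2::2]):
        text = re.sub(r"<[^>]+>", " ", body).strip()   # strip remaining tags
        units.append({"title": title.strip(), "text": text})
    return units

doc = ("<h1>Reset the router</h1><p>Hold the button.</p>"
       "<h2>Check lights</h2><p>Green means OK.</p>")
for unit in split_into_answer_units(doc):
    print(unit["title"], "->", unit["text"])
```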
Text tagging and annotation, also called entity extraction, forms an important component of language-processing tasks, including text mining, information extraction, and information retrieval. The annotated text is also used for making search contextual. For the Self-service agent, the documents were annotated with the device manufacturer, operating system, and model. Device troubleshooting issues were also annotated. This process let us define a more intelligent and contextual search. We discuss the details of the annotation process later in this tutorial. The following section discusses how we annotated trouble ticket data and used it for text mining and information extraction. In the next tutorial in the series, we will dive deeper into using more tools and technologies to perform annotation because this is an important step for improving the accuracy and relevancy of search patterns.
Retrieve and Rank, as its name indicates, is a combination of two components — using Apache Solr for the “retrieve” and sophisticated machine learning algorithms to refine or “rank” the searching experience. The Retrieve and Rank service is offered on Bluemix as a cloud-based service, and a set of API endpoints are available in the service documentation. With these APIs, it is easy to integrate Retrieve and Rank into virtually any RESTful application.
Apache Solr is a widely known open source search platform. Solr provides a spectrum of query parsers and search features, based on a combination of vector space and Boolean models, to determine how relevant a document is to the query. The ranker is a component that works on top of the Solr search to rank the results by using trained machine-learning components. The procedure for creating a ranker is defined and documented on the Watson Developer Cloud. Solr with a ranker provides a robust natural language search.
There are different ways to connect to a retrieve and rank engine on Bluemix:
The first two methods require four main steps to have an instance ready for ingesting documents:
In contrast, using the Retrieve and Rank UI requires fewer steps and can be done by using a friendly web interface. In our project, we experimented with all three alternatives. The first two procedures are more powerful because they let you create a custom configuration of Solr. The Retrieve and Rank UI method is simpler, but comes with a default Solr configuration that cannot be changed. The following figure shows a custom Solr configuration.
You can compare that to the default configuration.
After you create the configuration, the collection is easy to create. You can upload as many configurations and create as many collections as your cluster and resources in Bluemix allow. In the Watson for Network Operations example, we created the service and ingested the corpus directly using the API. For the Device doctor example, we used the Java SDK to automate this configuration and ingestion and open the possibility of connecting to other data sources such as crawlers, databases, or other repositories. We had the rankers for both projects carefully built and validated by subject matter experts.
The following figure shows an example of how the ground truth was created. The CSV file shows a question, the document’s ID in Retrieve and Rank, and its relevancy to the question. Note that the scale runs from 0 – 4, where 0 is irrelevant to the question and 4 is highly relevant. The same approach works with a thumbs-up/thumbs-down feedback mechanism, using a binary range of 0 – 1, where 0 is irrelevant and 1 is relevant.
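A ground-truth file of this shape can be assembled in a few lines of Python. The question, document IDs, and labels below are invented for illustration; the last step shows how a 0 – 4 scale collapses to a binary thumbs-up/down label:

```python
import csv, io

# Invented ground-truth rows in the shape described above:
# question, document ID in Retrieve and Rank, relevance label (0-4).
rows = [
    ("how do I reset my router", "doc_101", 4),
    ("how do I reset my router", "doc_202", 1),
    ("how do I reset my router", "doc_303", 0),
]

buf = io.StringIO()
csv.writer(buf).writerows(rows)          # the CSV used to train the ranker

# Collapsing to a binary thumbs-up/down scale: anything above 0 is relevant.
binary = [(q, doc, 1 if rel > 0 else 0) for q, doc, rel in rows]
print(binary)
```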
When the corpus is ingested and the ranker is trained, we are ready to ask questions of Retrieve and Rank. The search query is sent to a Solr instance, which retrieves a set of relevant documents. This set is calculated mathematically by computing the cosine similarity between the terms of the query and the ingested documents; using this method, Solr retrieves the documents most similar to the query. Then, the ranker sorts the response by using the trained machine-learning model to provide a refined set of answer units for the query. To enhance the search even more, we appended information about the context to the query. In the Device doctor scenario, we have Watson Assistant keep track of the context, such as device type, model, and OS version. Similarly, the Watson for Network Operations example gets its context from the alert that is generated on the external network monitoring system. This context is appended to the query and boosted to have more weight in the search than the other terms.
Multimedia, encompassing audio, images, and video, is a huge source of data. These formats are increasingly popular as sources of information, but many overlook them as too cumbersome to use effectively, likely because of the inflexibility of searching inside these media types. Content is usually searchable only by its metadata such as title, tags, and duration; the audiovisual content itself is difficult for a traditional search engine to interpret. With that in mind, we used Watson services to extract features from the content itself by generating text from the audio track. Additionally, we extracted video frames as still images and ran diverse algorithms to annotate these images by recognizing the objects, people, and text inside them.
The audio channel usually provides many elements to analyze: spoken language, music tracks, and general sounds. We used Watson Speech to Text to extract a time-stamped transcript of the audio. The service provides accurate text recognition, but is limited in that it doesn’t recognize different entities that are speaking. To achieve better results, we used videos with a single speaking entity. We indexed the text that is extracted in the search engine to find fragments of the video where these specific topics were mentioned. To enrich the content even further, we used that text to extract language features (by using Alchemy Language) such as concepts, entities, and relationships. By annotating the transcript with this extra information we can get richer search capabilities by not only indexing the content of the video, but also creating better relationships within our data with text that is not explicitly stated in the video.
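The payoff of a time-stamped transcript is deep-linking into the video. This small sketch, with an invented transcript in a simplified shape (not Speech to Text's actual response format), shows the idea:

```python
# Invented time-stamped transcript fragments, simplified to phrases.
transcript = [
    {"start": 0.0,  "end": 4.2,  "text": "welcome to the network operations overview"},
    {"start": 4.2,  "end": 9.8,  "text": "first we look at router configuration"},
    {"start": 9.8,  "end": 15.1, "text": "then we cover firewall rules"},
]

def find_mentions(fragments, topic):
    """Return the time ranges of fragments that mention a topic,
    so search results can jump to the exact point in the video."""
    return [(f["start"], f["end"]) for f in fragments
            if topic.lower() in f["text"].lower()]

print(find_mentions(transcript, "router"))   # [(4.2, 9.8)]
```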
We extracted one video frame every second using some custom tooling that we created (this extraction rate is configurable). We then used the Watson Visual Recognition service on the video frames to get another dimension of information: the annotation of the images. This service provides three main extraction features: image tagging, face recognition, and text-on-image recognition. Image tagging, a trainable feature, generates information about the objects, places, and actions in the image. For example, it can recognize groups of people, objects like cars, or actions such as people playing sports. Although these services do not always provide completely accurate results, they are a great way to annotate video, a traditionally expensive, manual task. Face detection uses a machine-learning algorithm that detects the particular features of the human face, even detecting gender and estimating an age range. The text on images can be extracted, too. All of these features were also time stamped and correlated with the data extracted from the audio to enhance the quality of the data available.
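The custom frame-extraction tooling is not shown in this tutorial; one common way to sample one frame per second is ffmpeg's fps filter, sketched here as a command builder (the file paths and output pattern are hypothetical):

```python
import subprocess

def frame_extraction_cmd(video_path, out_pattern, fps=1.0):
    """Build an ffmpeg command that writes one still image per second
    (fps is configurable, mirroring the configurable extraction above)."""
    return ["ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",       # sampling rate in frames per second
            out_pattern]               # e.g. frames/frame_%04d.png

cmd = frame_extraction_cmd("lecture.mp4", "frames/frame_%04d.png")
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```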
In our Watson for Network Operations use case, we applied various cognitive patterns. Different tools are available to implement these patterns; we chose ours based either on whether they needed to run on premises versus off premises (in the cloud) or on the skills available. In this section, we discuss how to glean insights and trends from unstructured trouble ticket data. For this search pattern, we used the Watson Explorer Content Analytics platform.
Trouble tickets represent a wealth of information, encompassing any service request, customer inquiry, problem management record, or any other form of documentation that is developed by a service desk, help desk, or support organization while addressing a customer’s concerns. Trouble tickets contain detailed information that relates to specific customers, dates and times, categories of problems experienced, and most importantly, resolutions to those problems. Closed trouble tickets tell a rich story of what problems customers faced, how they were overcome, and how the problems can be fixed or prevented. Organizations that archive, manage, and use trouble tickets have access to all these insights but often struggle to derive meaningful information from this popular form of dark data.
Most businesses actively log, collect, and store vast amounts of trouble ticket data but perform limited analysis despite the potential to glean deeper, more compelling insights. One method of deeper analysis includes using cognitive technologies such as natural language processing along with text analytics. Examples of potential benefits to businesses that use cognitive technologies with text analytics include:
Watson Explorer Advanced Edition represents one of the primary tools we used with the Network Operations Agent and Self-service agent scenarios. Throughout this tutorial, we reference Watson Explorer as we discuss our scenarios, tooling, and methodology. Within our solution development projects, we position Watson Explorer as the solution to crawling, converting, and annotating data at a large scale. For the annotation of trouble tickets within the Network Operations Agent example, we include Watson Explorer Content Analytics, a tool within the Watson Explorer Advanced Edition product suite.
From a technical standpoint, Watson Explorer Content Analytics follows the IBM UIMA architecture, which provides a framework for performing text analytics. Ultimately, the UIMA architecture suggests that three major processes are followed in the text analytics pipeline: crawl and import, parse and index, search and content analytics.
The first process, crawling and importing, was discussed previously in this tutorial. The second process, parse and index, represents the bulk of the work that is performed by a custom annotator. The third process, search and content analytics, refers to the management of the results. Here, we focus on the second process and creating a custom annotator.
Watson Explorer Content Analytics Studio is a development environment that is used for training Watson Explorer Content Analytics for text annotation that is performed during the second process of the UIMA pipeline. Essentially, we used Content Analytics Studio to create custom dictionaries and parsing rules that were then uploaded to the Watson Explorer Content Analytics server where all the trouble tickets within our corpus were annotated. Dictionaries refer to a collection of mentions associated with an entity. For example, the dictionary for entity “IP Quartet” might contain mentions such as “192” or “168.” Parsing rules refer to the particular sequence in which one or more entities might occur. For example, the “IP Address” parsing rule might require the mention of an “IP Quartet” to be followed by the character “.” and then be followed by another mention of an “IP Quartet” and so on. The text “192.168.1.1” would be identified as an IP address in a well-trained text annotator.
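The dictionary-plus-parsing-rule idea maps naturally onto regular expressions. This sketch is plain Python, not Content Analytics Studio's rule syntax, but it mirrors the IP Quartet and IP Address example above:

```python
import re

# Dictionary analogue: an "IP Quartet" is a number from 0 to 255.
QUARTET = r"(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

# Parsing-rule analogue: four quartets joined by "." form an "IP Address".
IP_ADDRESS = re.compile(rf"\b{QUARTET}(?:\.{QUARTET}){{3}}\b")

# A hypothetical trouble-ticket sentence to annotate.
ticket = "Customer gateway 192.168.1.1 unreachable from 10.0.0.7 since 09:00."
print([m.group(0) for m in IP_ADDRESS.finditer(ticket)])
```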
The result of training and annotation is the identification of facets. By default, Watson Explorer Content Analytics is able to identify traditional parts of speech such as nouns, verbs, and adjectives. However, Content Analytics Studio can create custom parts of speech such as problem topics or people. These parts of speech are what we refer to as facets. The following figure shows an example of the Verb facet type.
An alternative tool recently made available by IBM is Watson Knowledge Studio. Instead of using a local development environment such as Watson Explorer Content Analytics, Knowledge Studio is cloud based and includes a streamlined user interface available through your web browser. Rather than training an annotator by defining custom dictionaries, facets, and parsing rules one by one and then performing sample annotations, Knowledge Studio uses a single list of entities (which are similar to facets), relationships, and a color-coded UI to allow quick creation of sample annotations in support of training your annotator. Both Knowledge Studio and Watson Explorer Content Analytics will always require a degree of manual human intervention to train an annotator, but Knowledge Studio aims to make the training process faster and less labor-intensive. In addition, Knowledge Studio is used to create a cognitive machine-learning model that processes documents and infers annotations based on your sample annotations, entities, and relationships.
Cognitive computing lets you use insights from nontraditional data sets, which have previously been considered too complex for machines to understand. In this section, we discuss by example the various data sets we used in our work. The first step is acquiring the data and curating it, with the various data staging and processing patterns described below.
In the personalized TV Agent scenario, we used various unconventional data like social data, data news feeds, and video data. We discussed video data above and will continue our detailed discussion on video annotation in the next tutorial.
Social media contains massive amounts of data that can yield valuable metadata. However, there are two barriers: first, acquiring useful data, and second, filtering and enhancing that data so that it is useful to your use case.
For the personalized TV example, we focused on gathering social media data of anonymized customers as well as a group of people who were tweeting about certain topics to understand what their opinions were of certain media content. We then compared the data to see whether any patterns emerged to determine what else they might enjoy watching.
We decided that the best way to gather and compare the data that we wanted was to use a source where people were most likely to post their opinions: Twitter. Using the Bluemix dashDB database service and IBM Insights for Twitter, we were able to pull data based on our query. To try this yourself:
We were doing a search in the Media and Entertainment industry, so our focus was on popular TV shows. One of the most popular shows at the time was The Walking Dead. After doing some research on Twitter, we found that some of the popular hashtags used when referring to The Walking Dead were #TWD and #TheWalkingDead. We also found that people used #Walkers. However, another popular show at the time, Game of Thrones, also called one of its antagonists the “White Walkers.” A hashtag that could have given us large amounts of data would therefore also have polluted it, so we decided to leave #Walkers out of our query.
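The hashtag decision can be sketched as a simple filter. The tweets below are invented, and the real pipeline pulled data through IBM Insights for Twitter rather than filtering locally:

```python
# Invented sample tweets; the real data came from IBM Insights for Twitter.
tweets = [
    "Can't wait for tonight's episode! #TWD",
    "That finale though #TheWalkingDead",
    "The White Walkers are coming #Walkers #GoT",
]

# #Walkers is deliberately excluded because it also matches
# Game of Thrones chatter, as discussed above.
INCLUDE = {"#twd", "#thewalkingdead"}

matches = [t for t in tweets
           if any(tag in t.lower() for tag in INCLUDE)]
print(len(matches))   # 2 of the 3 tweets pass the filter
```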
You can do a quick query count to see how much data you can pull and then select Next and choose the tables you would like your data loaded to. You will see all the tables and the data that will come from your query. After it’s complete, you can move to the next step, refining/filtering the data so that it can be enriched.
The AlchemyData News service is an API that is available with AlchemyAPI. AlchemyData has a large index of news and blog articles from the past 60 days that you can use to retrieve articles that have been enriched with Natural Language Processing. We used this API to pull concepts and entities and further enrich our search. This process gave us more search terms that we could use in our Twitter search. We used this service to understand the emerging topics.
In this tutorial, we provided a brief overview of various patterns that we applied to build our cognitive systems. As discussed, it is possible to use various on-premise and off-premise tools to achieve similar results. Watson APIs make cognitive application development very easy. In our next tutorial in the series, we will discuss patterns that we used for cognitive user experience. We will discuss how the cognitive user experience is defined by a machine’s ability to communicate and augment human understanding, and patterns that enable artificially intelligent systems to understand human behavior.