Note: This pattern is part of a composite pattern, a set of code patterns that can each be a stand-alone application or a continuation of another code pattern. This composite pattern consists of:
- Extend Watson Text Classification
- Correlate documents from different sources
- Build a knowledge graph from documents (this pattern)
- Query a knowledge base for documents
Microsoft Word documents are common in almost every business. They hold information as raw text, tables, and images, and they contain facts that matter to that business. This code pattern addresses the problem of extracting knowledge from the text and tables in domain-specific Word documents. A knowledge graph is built from the extracted knowledge, which makes that knowledge queryable. This gives you the best of both worlds: a training-based approach and a rules-based approach to extracting knowledge from documents.
One of the biggest challenges in the industry today is getting machines to understand the data in documents the way humans understand a document's context and intent by reading it. The first step toward this goal is to convert the unstructured information (free-floating text and table text) to a semi-structured format and then process it further. That is where graphs play a major role: they give shape and structure to the unstructured information in the documents. This code pattern extracts knowledge from the text and tables in domain-specific Word documents and builds a domain-specific knowledge graph from it, which makes the knowledge queryable. You can use this code pattern to shape your analysis and to feed the data into further processing for better insights.
The code pattern demonstrates a way to derive insights from a document that contains raw text and information in tables, using IBM Cloud, IBM Watson services, the Python package Mammoth, the Python Natural Language Toolkit (NLTK), and IBM Watson Studio.
With this code pattern, you get:
- The ability to process the tables in .docx files along with the free-floating text
- A strategy for combining the results of a real-time analysis by Watson NLU with the results of rules defined by a subject matter expert or domain expert
The pattern follows this flow:
- Custom Python code extracts the unstructured text data (HTML tables and free-floating text) that needs to be analyzed and correlated from the .docx files.
- The text is classified using Watson NLU and tagged using the Extend Watson Text Classification code pattern.
- The text is correlated with other text using the Correlate documents from different sources code pattern.
- The results are filtered using custom Python code.
- The knowledge graph is constructed.
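The extraction, filtering, and graph-construction steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the pattern's actual notebook code: the sample HTML is invented to resemble what a .docx-to-HTML conversion (for example with Mammoth) might produce, the Watson NLU call is left out entirely, and the names `DocHtmlParser`, `table_triples`, `rule_triples`, `ACQUIRED_RULE`, and `build_graph` are hypothetical. The single regular expression stands in for a rule that a domain expert might define.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Invented sample: HTML of the kind a .docx-to-HTML conversion might emit.
SAMPLE_HTML = """
<p>Acme Corp acquired Beta Ltd in 2019.</p>
<table>
<tr><td><p>Company</p></td><td><p>Headquarters</p></td></tr>
<tr><td><p>Acme Corp</p></td><td><p>Austin</p></td></tr>
<tr><td><p>Beta Ltd</p></td><td><p>Leeds</p></td></tr>
</table>
"""


class DocHtmlParser(HTMLParser):
    """Collects table rows and free-floating paragraphs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.paragraphs = []  # text of <p> elements outside any table
        self._table_depth = 0
        self._row = None      # cells of the row currently being read
        self._buf = None      # text chunks of the cell or paragraph being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._buf = []
        elif tag == "p" and self._table_depth == 0:
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "table":
            self._table_depth -= 1
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th") and self._buf is not None:
            if self._row is not None:
                self._row.append("".join(self._buf).strip())
            self._buf = None
        elif tag == "p" and self._table_depth == 0 and self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None


def table_triples(rows):
    """Treat row 0 as headers and column 0 as the subject of each row."""
    if len(rows) < 2:
        return []
    header = rows[0]
    triples = []
    for row in rows[1:]:
        for relation, value in zip(header[1:], row[1:]):
            triples.append((row[0], relation, value))
    return triples


# Stand-in for an expert-defined rule: "<X> acquired <Y> in <year>".
ACQUIRED_RULE = re.compile(r"([A-Z][\w ]*?) acquired ([A-Z][\w ]*?) in \d{4}")


def rule_triples(paragraphs):
    """Apply the rule to free-floating text and keep only what matches."""
    triples = []
    for text in paragraphs:
        for match in ACQUIRED_RULE.finditer(text):
            triples.append((match.group(1), "acquired", match.group(2)))
    return triples


def build_graph(triples):
    """Adjacency list: subject -> list of (relation, object) edges."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return dict(graph)


parser = DocHtmlParser()
parser.feed(SAMPLE_HTML)
parser.close()
graph = build_graph(table_triples(parser.rows) + rule_triples(parser.paragraphs))
print(graph["Acme Corp"])  # edges that start at the "Acme Corp" node
```

Looking up a node in `graph` returns every fact extracted about it, whether it came from a table cell or from the text rule, which is the sense in which the graph makes the extracted knowledge queryable. In a fuller version, entities and relations returned by Watson NLU would be merged into the same list of triples before the graph is built.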
Find the detailed steps for this pattern in the README. Those steps will show you how to:
- Create IBM Cloud services.
- Run using a Jupyter Notebook in IBM Watson Studio.
- Analyze the results.