Correlate documents from different sources
Correlate content across documents by using Python NLTK and IBM Watson Studio
Note: This pattern is part of a composite pattern. These are code patterns that can be stand-alone applications or might be a continuation of another code pattern. This composite pattern consists of:
Text analytics involves getting insights from text content in documents, books, social media, and various other sources. A common requirement is to correlate text content across sources to get a comprehensive picture. This code pattern uses Watson Natural Language Understanding, the Python Natural Language Toolkit (NLTK), and IBM Watson Studio to build a graph of entities with attributes, and uses each entity's relationships with other entities to correlate text content across various sources.
In this code pattern, you will use Jupyter notebooks in IBM Watson Studio (formerly IBM Data Science Experience) to correlate text content across documents with Python NLTK and IBM Watson Natural Language Understanding. The correlation algorithm is driven by an input configuration JSON file that contains the rules and grammar for building the relations. You can modify this configuration file to obtain better correlation results between text content across documents.
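As a rough illustration of what such a configuration might contain, the sketch below loads a hypothetical JSON document with one chunking grammar and one relation rule. The key names (`grammar`, `relations`, `type`, `subject`, `object`), the tag pattern, and the `load_config` helper are assumptions made for this example, not the schema used by the pattern's notebook. See the README for the actual configuration file.

```python
import json

# Hypothetical configuration; the key names are assumptions for illustration,
# not the schema shipped with this code pattern.
CONFIG_TEXT = """
{
  "grammar": "NP: {<DT>?<JJ>*<NN.*>+}",
  "relations": [
    {"type": "worksFor", "subject": "Person", "object": "Organization"}
  ]
}
"""


def load_config(text):
    """Parse the configuration and check that the expected keys are present."""
    config = json.loads(text)
    for key in ("grammar", "relations"):
        if key not in config:
            raise ValueError(f"missing configuration key: {key}")
    return config


config = load_config(CONFIG_TEXT)
# Collect the relation names that the rules define.
relation_types = [rule["type"] for rule in config["relations"]]
print(config["grammar"])
print(relation_types)
```

Keeping the grammar and rules in a file like this means the correlation behavior can be tuned by editing the JSON rather than the notebook code.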
When you have completed this pattern, you will understand how to:
- Create and run a Jupyter notebook in IBM Watson Studio.
- Use Watson Studio Object Storage to access data and configuration files.
- Use the IBM Watson Natural Language Understanding API to extract metadata from documents in Jupyter notebooks.
- Extract and format unstructured data by using simplified Python functions.
- Use a configuration file to specify the co-reference and relations grammar.
- Store the processed JSON output in Watson Studio Object Storage.
- The documents of interest are stored in IBM Cloud Object Storage.
- The stored document content is in text format and is retrieved by the Jupyter notebook for processing.
- The Jupyter notebook is hosted on IBM Watson Studio and has all the processing logic to correlate the content from the documents.
- The content from the documents is first sent to Watson Natural Language Understanding, which returns a response.
- Next, the input configuration JSON to drive the correlation is retrieved from Object Storage. The Python NLTK module generates keywords, POS tags, and chunks based on tag patterns specified in the configuration file.
- IBM Watson Studio is powered by Spark.
- A graph of entities (with attributes) and the relationships between them is built by using the combined output from Watson Natural Language Understanding and Python NLTK. The input configuration drives the correlation algorithm. The output graph, with its entities and relationships, is stored in Object Storage.
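The last step of the flow can be sketched in plain Python. This is a simplified stand-in, not the notebook's algorithm: it links entities that appear in the same document whenever a rule names their types, whereas the pattern derives relations from the configured grammar. The input shapes, the rule keys, and the `build_graph` function are all hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: entity mentions in a simplified shape, and relation
# rules as a configuration file might define them (assumed key names).
entities = [
    {"text": "Alice", "type": "Person", "doc": "doc1.txt"},
    {"text": "Acme", "type": "Organization", "doc": "doc1.txt"},
    {"text": "Alice", "type": "Person", "doc": "doc2.txt"},
    {"text": "Paris", "type": "Location", "doc": "doc2.txt"},
]
rules = [
    {"type": "worksFor", "subject": "Person", "object": "Organization"},
    {"type": "locatedIn", "subject": "Person", "object": "Location"},
]


def build_graph(entities, rules):
    """Merge mentions into nodes and link co-occurring entities by rule."""
    nodes = {}
    by_doc = defaultdict(list)
    for ent in entities:
        # One node per entity text; record every document that mentions it.
        node = nodes.setdefault(ent["text"], {"type": ent["type"], "docs": []})
        if ent["doc"] not in node["docs"]:
            node["docs"].append(ent["doc"])
        by_doc[ent["doc"]].append(ent)

    edges = []
    for doc, mentions in by_doc.items():
        for rule in rules:
            for subj in mentions:
                if subj["type"] != rule["subject"]:
                    continue
                for obj in mentions:
                    if obj["type"] == rule["object"]:
                        edges.append((subj["text"], rule["type"], obj["text"], doc))
    return nodes, edges


nodes, edges = build_graph(entities, rules)
for edge in edges:
    print(edge)
```

In this toy data, the node for "Alice" lists both documents, so the two relations attached to it are what correlate `doc1.txt` with `doc2.txt`.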
Find the detailed steps for this pattern in the README. The steps will show you how to:
- Sign up for Watson Studio.
- Create IBM Cloud services.
- Create the notebook.
- Add the data and configuration file.
- Update the notebook with service credentials.
- Run the notebook.
- Analyze the results.