Correlate documents from different sources


Summary

Text analytics involves deriving insights from text content in documents, books, social media, and other sources. A common requirement is to correlate text content across sources to get a comprehensive picture. This code pattern uses Watson Natural Language Understanding (NLU), the Python Natural Language Toolkit (NLTK), and IBM Watson Studio to build a graph of entities with attributes, and uses their relationships with other entities to correlate text content across sources.

Description

In this code pattern, you will use Jupyter notebooks in IBM Watson Studio (formerly Data Science Experience, or DSX) to correlate text content across documents with the Python NLTK toolkit and IBM Watson NLU. The correlation algorithm is driven by an input configuration JSON that contains the rules and grammar for building the relations. You can modify this JSON configuration to obtain better correlation results between text content across documents.
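The exact schema of the configuration file is defined in the pattern's repository. As a rough illustration only (the field names below are hypothetical, not the pattern's actual schema), such a configuration might pair NLTK tag patterns with relation rules:

```json
{
  "configuration": {
    "keywords": {
      "tag_patterns": ["NNP", "NNP NNP"]
    },
    "relations": {
      "rules": [
        {"subject": "Person", "relation": "works_for", "object": "Organization"}
      ]
    }
  }
}
```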

After completing this pattern, you’ll learn how to:

  • Create and run a Jupyter notebook in Watson Studio.
  • Use Watson Studio Object Storage to access data and configuration files.
  • Use the IBM Watson NLU API to extract metadata from documents in Jupyter notebooks.
  • Extract and format unstructured data by using simplified Python functions.
  • Use a configuration file to specify the co-reference and relations grammar.
  • Store the processed JSON output in Watson Studio Object Storage.
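For example, the "extract and format" step can be as simple as flattening the entities in an NLU response into rows. The sketch below runs against a hard-coded sample dictionary shaped like the JSON returned by NLU's `analyze` call; in the notebook you would pass the live API result instead:

```python
# Flatten the entities in a Watson NLU response into simple rows.
# `sample_response` is a hand-written stand-in for the JSON returned
# by NLU's analyze() call; the field names follow the NLU schema.

def extract_entities(nlu_response):
    """Return a list of (text, type, relevance) tuples, one per entity."""
    rows = []
    for entity in nlu_response.get("entities", []):
        rows.append((entity["text"], entity["type"], entity["relevance"]))
    return rows

sample_response = {
    "entities": [
        {"text": "IBM", "type": "Company", "relevance": 0.92},
        {"text": "Armonk", "type": "Location", "relevance": 0.61},
    ]
}

print(extract_entities(sample_response))
# [('IBM', 'Company', 0.92), ('Armonk', 'Location', 0.61)]
```

Simplified helpers like this keep the notebook readable: each NLU feature (entities, keywords, relations) gets its own small extraction function.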

Flow


  1. The documents of interest are stored in IBM Cloud Object Storage.
  2. The stored document content is in text format and is retrieved by the Jupyter notebook for processing.
  3. The Jupyter notebook is hosted on IBM Watson Studio and has all the processing logic to correlate the content from the documents.
  4. The content from the documents is first sent to Watson NLU, and a response is received.
  5. Next, the input configuration JSON to drive the correlation is retrieved from Object Storage. The Python NLTK module generates keywords, POS tags, and chunks based on tag patterns specified in the configuration file.
  6. IBM Watson Studio is powered by Spark.
  7. A graph of entities (with attributes) and the relationships between them is built using the combined output from Watson NLU and Python NLTK. The input configuration drives the correlation algorithm. The output graph with the entities and relationships is stored in Object Storage.
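The graph in the final step need not be anything more elaborate than nested dictionaries serialized to JSON. As a minimal sketch (the node/edge layout here is an assumption for illustration, not the pattern's actual schema), entities become nodes carrying attributes, relations become edges, and the result is dumped to a JSON string that can be written to Object Storage:

```python
import json

# Minimal entity graph: nodes keyed by entity name, edges as labeled pairs.
# The schema here is illustrative; the pattern's notebook defines its own.

def build_graph(entities, relations):
    """entities: (name, attrs) pairs; relations: (src, label, dst) triples."""
    graph = {"nodes": {}, "edges": []}
    for name, attrs in entities:
        graph["nodes"][name] = attrs
    for src, label, dst in relations:
        # Keep only edges whose endpoints are known entities.
        if src in graph["nodes"] and dst in graph["nodes"]:
            graph["edges"].append({"from": src, "label": label, "to": dst})
    return graph

entities = [("IBM", {"type": "Company"}), ("Watson", {"type": "Product"})]
relations = [("IBM", "develops", "Watson"), ("IBM", "employs", "Alice")]

graph = build_graph(entities, relations)
print(json.dumps(graph, indent=2))  # this JSON string is what gets stored
```

Dropping edges whose endpoints were never recognized as entities (as with `"Alice"` above) is one simple way the correlation step can filter noise from the combined NLU and NLTK output.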

Instructions

Find the detailed steps for this pattern in the README. The steps will show you how to:

  1. Sign up for Watson Studio.
  2. Create IBM Cloud services.
  3. Create the notebook.
  4. Add the data and configuration file.
  5. Update the notebook with service credentials.
  6. Run the notebook.
  7. Analyze the results.