Fingerprinting personal data from unstructured text

Summary

Have you ever tried to identify personal data in unstructured text? If you have, then you know it can be a daunting task. IBM Watson Natural Language Understanding, together with Watson Knowledge Studio, provides an effective way to identify the necessary information in unstructured documents. The results can be augmented with regular expressions, and the identified personal data is assigned a score that further processing or consuming applications can act on.

Description

In this developer pattern, we show you how to build a custom model using Watson Knowledge Studio and use it to identify personal data from unstructured documents. The pattern also augments the results with regex parsers.

When you have completed this pattern, you will understand how to:

  • Build a custom model using Watson Knowledge Studio and have Natural Language Understanding (NLU) use that model to identify personal data.
  • Use regular expressions to augment NLU for metadata identification.
  • Configure which personal data needs to be identified, and assign a weight to each kind of personal data so that a score can be computed.
  • View the score and personal data identified in a tree structure for better visualization.
  • Consume the output from other applications.
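
To make the weighting and tree-view ideas above concrete, here is a minimal sketch in Python. It is not the pattern's actual implementation: the category names, the weight values, the scoring formula (sum of the weights of the categories found, capped at 1.0), and the helper names `score_document` and `render_tree` are all assumptions chosen for illustration.

```python
# Hypothetical weights: category names and values are illustrative
# assumptions, not the pattern's real configuration.
WEIGHTS = {"Name": 0.2, "EmailId": 0.3, "MobileNumber": 0.4}


def score_document(entities, weights):
    """Sum the weights of the categories that were found, capped at 1.0.

    The formula is an assumption made for this sketch.
    """
    total = sum(
        weights.get(category, 0.0)
        for category, values in entities.items()
        if values  # ignore categories with no identified values
    )
    return min(1.0, total)


def render_tree(score, entities):
    """Render the score and the identified personal data as an indented tree."""
    lines = [f"Document (score: {score:.2f})"]
    for category, values in entities.items():
        lines.append(f"  {category}")
        for value in values:
            lines.append(f"    {value}")
    return "\n".join(lines)


# Stand-in for personal data that has already been identified.
entities = {"Name": ["Alice Smith"], "EmailId": ["alice@example.com"]}
document_score = score_document(entities, WEIGHTS)
print(render_tree(document_score, entities))
```

A consuming application could read the same score and category dictionary directly instead of the rendered text.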

Instructions

Find the detailed steps for this pattern in the readme file. The steps will show you how to:

  1. Install the prerequisites.
  2. Learn the concepts used.
  3. Deploy the application.
  4. Develop the Watson Knowledge Studio model.
  5. Deploy the model to Watson Natural Language Understanding.
  6. Verify that configuration parameters are correct.
  7. Analyze the results.
  8. Consume the output from other applications.

Flow

  1. The viewer passes input text to the personal data extractor.
  2. The personal data extractor passes text to NLU.
  3. NLU uses the custom model to extract personal data from the input text and returns the response.
  4. The personal data extractor passes the NLU output to the regex component.
  5. The regex component uses the regular expressions provided in the configuration to extract personal data, which is then added to the NLU output.
  6. The augmented personal data is passed to the scorer component.
  7. The scorer component uses the configuration to compute an overall document score and passes the result back to the personal data extractor component.
  8. This data is then passed to the viewer component.
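
The flow above can be sketched as a small orchestration function. This is a hedged illustration only: `fake_nlu` is a stub standing in for the call to Watson Natural Language Understanding with the custom model, and the function names, regex patterns, weights, and output shape are hypothetical rather than taken from the pattern's code.

```python
import re

# Hypothetical configuration; patterns and weights are illustrative assumptions.
CONFIG = {
    "regex": {
        "EmailId": r"[\w.+-]+@[\w-]+\.[\w.-]+",
        "MobileNumber": r"\b\d{10}\b",
    },
    "weights": {"Name": 0.2, "EmailId": 0.3, "MobileNumber": 0.4},
}


def fake_nlu(text):
    """Stub for steps 2-3: a real deployment would call NLU with the custom model."""
    return {"Name": ["Alice Smith"]} if "Alice Smith" in text else {}


def regex_component(text, nlu_output, patterns):
    """Steps 4-5: add regex matches to a copy of the NLU output."""
    augmented = {category: list(values) for category, values in nlu_output.items()}
    for category, pattern in patterns.items():
        for match in re.findall(pattern, text):
            bucket = augmented.setdefault(category, [])
            if match not in bucket:
                bucket.append(match)
    return augmented


def scorer_component(entities, weights):
    """Steps 6-7: assumed formula, sum of category weights capped at 1.0."""
    return min(1.0, sum(weights.get(category, 0.0) for category in entities))


def personal_data_extractor(text, nlu, config):
    """Steps 1 and 8: receive text from the viewer and return the scored result."""
    nlu_output = nlu(text)
    augmented = regex_component(text, nlu_output, config["regex"])
    return {
        "score": scorer_component(augmented, config["weights"]),
        "personal_data": augmented,
    }


result = personal_data_extractor(
    "Contact Alice Smith at alice@example.com or 9876543210.", fake_nlu, CONFIG
)
print(result)
```

Passing the NLU call in as a parameter keeps the sketch runnable without service credentials; swapping `fake_nlu` for a real client call would leave the rest of the flow unchanged.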