Taxonomy Icon

Data Science

Learning objectives

The objective of this tutorial is to extract personal data (or any keywords) from unstructured text using Watson™ Natural Language Understanding (NLU) with a custom model built in Watson Knowledge Studio. It is intended for developers who want to identify keywords in unstructured text (for example, developers building General Data Protection Regulation (GDPR) solutions). It was written specifically for the code pattern Fingerprinting personal data from unstructured documents, but it can be applied to other requirements as well.

Prerequisites

  • IBM Cloud account: If you do not have an IBM Cloud account, you can create one here.
  • Watson Knowledge Studio (WKS) account: If you do not have an account, you can create a free account here. Make a note of the login URL, since it is unique to every account.
  • Basic familiarity with building custom models in Watson Knowledge Studio.
  • Any REST client. This guide uses Postman.

Estimated time

The estimated time depends on the number of documents used for annotation. For the sake of simplicity, this tutorial uses about 20 documents, which should take about two hours to annotate. For more accurate results, it is recommended that you annotate a larger set.

Steps

Learn about GDPR here. One of the GDPR policies is the right to be forgotten, which means that user details must be thoroughly deleted from an organization in response to:

  • Customer requests
  • Stored personal data becoming invalid
  • Legal compliance obligations

To delete personal data, an organization first needs to identify that data across all of its sources. One source that organizations find challenging to process is unstructured text. In this tutorial guide, we address the need to identify personal data in unstructured text documents using IBM Watson services: Watson Natural Language Understanding (NLU) and Watson Knowledge Studio (WKS).

In this guide, we consider personal data such as name, email, address, and phone number, and try to identify these items in the text documents.

The goal here is to import type systems and documents, create an annotator, train and evaluate the machine learning model, and deploy it to the Natural Language Understanding service.

Create Artifacts

You will need a set of documents to train and evaluate the WKS model. These documents contain the unstructured text from which we will identify personal data. Refer to the sample documents in the References section at the end of this document. You will need many such chat transcripts; you can either create them or get them from other sources, and store them in a folder on your local file system. Training a WKS model requires a large and varied set of documents: the more training data, and the more varied it is, the more accurate the results will be. Given the time constraint, this exercise uses a smaller set; it is recommended to have at least twenty such documents. In real-world scenarios, WKS models are trained on thousands of documents. You can learn more about documents here.

Create Project

Log in to WKS using the URL noted in the Prerequisites step.

On WKS home page, click Create Project.

WKSCreateProject

In the Create New Project pop up window, enter the name of the new project. Click Create.

WKSCreateProjectOptions

Create type system

As discussed earlier, chat transcripts capture different attributes of individuals. We will define these attributes as entity types in WKS. If you are new to entity types, you can refer to this link.

You created a WKS project in the Create Project section. Navigate to that project and click Type Systems on the top navigation bar.

ProjectCreated

We will add these entity types:

  • Name
  • EmailId
  • Address
  • MobileNo
  • EmployeeId
  • Company
  • DOB
  • DOJ

Click Add Entity Type.

AddEntityType

Enter an entity type name and click Save.

EntityTypeSave

Similarly, add other entity types.

WKSCreatedEntityTypes

Import Documents

Click Documents on the top navigation bar.

WKSImportDocuments

Click Import Document Set.

WKSImportDocSet

Click the Import button in the pop-up window. Browse to the chat transcripts folder created in the Create Artifacts section, select all the files, and click Import. You may rename the document set to something meaningful.

Create and assign annotation sets

All the documents can be grouped into different sets for annotation purposes. These groups are called annotation sets. During annotation, a user goes through each document in a set and marks the keywords that represent personal data, so that WKS can learn to identify personal data in documents.

Click Annotation Sets on the top navigation bar to create annotation sets.

WKSAnnotationSet

Click Create Annotation Sets.

WKSCreateAnnotationSet

Type in a name for the annotation set and click Generate.

WKSAnnotationGenerate

The annotation set is created.

WKSAnnotationCreated

Human Annotation

Click Human Annotation on the top navigation bar. Click Add Task.

WKSAddTask

Enter a name for the task and click Create.

WKSCreateTask

In the pop-up window, select the Annotation Set that was created earlier. Click Create Task.

WKSCreateTask2

The task should now be created. Click on the task.

WKSTaskCreated

Next, we need to annotate, mapping document entries to the entity types defined in the type system. Click Annotate.

WKSAnnotate

Click OK on any alert message that pops up. The Ground Truth editor opens, where you can select the documents one by one to annotate them all. Click on any of the documents.

WKSGroundTruthFiles

From the document, select an entry that you want extracted as an entity, then click the entity type on the right-hand side of the screen. Repeat this for all the keywords in the document.

WKSEntityMapping

Once all the keywords are mapped to entity types, select Completed from the status drop-down.

WKSMappingComplete

Click Save to save the changes.

WKSMappingSaved

Repeat the above steps for all the documents, so that every document is annotated and completed. If the status shows IN PROGRESS, click the Refresh button.

WKSAnnotationStatusRefresh

The status should now change to SUBMITTED. Select the annotation set name and click the Accept button.

WKSAnnotationAccept

Click OK on the confirmation pop-up window. Task status now changes to COMPLETED.

WKSAnnotationCompleted

Click Annotator Component on the top navigation bar.

WKSAnnotatorComponentLink

We will create a Machine Learning annotator, so click Create this type of annotator under Machine Learning.

WKSMachineLearning

Under Document Set, select the annotation set that was completed in the previous steps. Click Next.

WKSCreateAnnotator

Click Train and Evaluate.

WKSTrainEvaluate

The Train and Evaluate process runs; it takes a few minutes to complete.

WKSAnnotatorProcessing

Create NLU service instance

The WKS model that was created needs to be deployed on an NLU instance.

Click here to create the NLU service. The following screen is displayed.

NLUCreateDefault

Edit the Service name field (for example, NLUGDPR) and leave the other settings at their defaults. Click Create.

NLUCreateEdit

The NLU service instance should now be created. In the IBM Cloud dashboard, you can see the NLUGDPR service that was just created.

DashboardNLUService

On the left navigation bar, click Service credentials.

NLUServiceCredentails

Click View Credentials.

NLUViewCreds

Make a note of the username and password.

NLUUserPwd

Deploy WKS model to Watson Natural Language Understanding

In WKS, navigate to Annotator Component and click NLU.

WKSCaptureModelId1

Click Details.

WKSAnnotatorCreated

Click Take Snapshot.

WKSSnapshot

Enter a meaningful description for the snapshot. Click OK.

WKSSnapshotOK

The snapshot is created.

WKSSnapshotCreated

Click Deploy to deploy the snapshot on the NLU instance created in the Create NLU service instance section.

WKSDeploy

Select Natural Language Understanding. Click Next.

WKSDeployModel

Select your IBM Cloud Region, Space, and NLU service instance. Click Deploy.

WKSDeployNLUIntsance

The WKS model should now be deployed on NLU. Make a note of the Model ID. Click OK.

WKSModelId

The model is deployed to NLU.

WKSDeployedSnapshot

Testing the model

Test the deployed model using Postman (or any REST client). Use the following details:

  • Method: POST
  • URL: https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27
  • Authorization: Basic Auth. Enter the NLU service username and password noted in the Create NLU service instance section.
  • Headers: Content-Type: application/json
  • Body

    {
      "features": {
          "entities": {
          "model": "<model_id as noted in 'Deploy WKS model to Watson Natural Language Understanding' section>"
          }
      },
      "text": "<the text from which personal data needs to be extracted>"
    }
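If you prefer a script to a REST client, the same call can be made from Python. The sketch below builds the request body shown above; the commented-out `requests.post` call illustrates how it would be sent, assuming the `requests` package is installed. The model ID, username, password, and sample text are placeholders you must replace with your own values.

```python
import json

# Endpoint from the URL field above.
NLU_URL = ("https://gateway.watsonplatform.net/"
           "natural-language-understanding/api/v1/analyze?version=2017-02-27")

def build_payload(model_id, text):
    """Build the JSON body for the NLU analyze call, requesting entity
    extraction with the custom WKS model."""
    return {
        "features": {"entities": {"model": model_id}},
        "text": text,
    }

# Example call (requires the third-party `requests` package and valid
# credentials from the Service credentials page):
# import requests
# resp = requests.post(
#     NLU_URL,
#     auth=("<username>", "<password>"),
#     headers={"Content-Type": "application/json"},
#     json=build_payload("<model_id>", "This is Alex. My email id is alex@gmail.com"),
# )
# print(resp.json())

payload = build_payload("<model_id>", "This is Alex. My email id is alex@gmail.com")
print(json.dumps(payload, indent=2))
```

The payload printed at the end matches the body used in Postman, so either path exercises the same deployed model.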
    

These input details are captured in the three images below.

Test1

Test2

Test3

Click Send. You should see the personal data extracted as in the screen below.

TestResult
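The response contains an entities array in which each entry carries the custom entity type and the matched text. As a minimal sketch of post-processing, the snippet below groups the extracted values by entity type; the sample response is illustrative, and a real response may include additional fields such as relevance.

```python
# Illustrative response shaped like the NLU entities output for the
# custom model; your actual values will differ.
sample_response = {
    "entities": [
        {"type": "Name", "text": "Alex", "count": 1},
        {"type": "EmailId", "text": "alex@gmail.com", "count": 1},
        {"type": "DOB", "text": "10-Aug-1979", "count": 1},
    ]
}

def personal_data_by_type(response):
    """Group the extracted entity texts under their entity type."""
    grouped = {}
    for entity in response.get("entities", []):
        grouped.setdefault(entity["type"], []).append(entity["text"])
    return grouped

print(personal_data_by_type(sample_response))
# {'Name': ['Alex'], 'EmailId': ['alex@gmail.com'], 'DOB': ['10-Aug-1979']}
```

A grouping like this is a convenient starting point for downstream GDPR handling, such as locating every value of a given personal-data type across documents.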

Summary

In this tutorial guide, we:

  • Defined a problem statement.
  • Imported a document set and Type Systems in WKS.
  • Annotated the document set.
  • Created, trained, and evaluated a custom machine learning model.
  • Deployed the WKS model to NLU.
  • Analyzed the results.

You should now know:

  • How to effectively use a WKS custom model to identify metadata in unstructured text.
  • How to define entity types, import documents, annotate documents, and create, train, and evaluate machine learning models.
  • How to deploy a WKS model to NLU.
  • How to query NLU to identify metadata in a given text document.

This concludes the tutorial guide. I hope you found it useful.

References

An example of a chat transcript:

Rep: This is Thomas. How can I help you?
Caller: This is Alex. I want to change my plan to corporate plan
Rep: Sure, I can help you. Do you want to change the plan for the number from which you are calling now?
Caller: yes
Rep: For verification purpose may I know your date of birth and email id
Caller: My data of birth is 10-Aug-1979 and my email id is alex@gmail.com
Rep: Which plan do you want to migrate to
Caller: Plan 450 unlimited
Rep: Can I have your company name and date of joining
Caller: I work for IBM and doj 01-Feb-99
Rep: Ok.. I have taken your request to migrate plan to 450 unlimited. You will get an update in 3 hours. Is there anything else that I can help you with
Caller: No
Rep: Thanks for calling Vodaphone. Have a good day.
Caller: you too

Another example of a chat transcript:

Rep: Thanks for calling Vodaphone. This is Monica. How can I help you
Caller: I want to migrate to a different service provider
Rep: Sorry to hear that sir but why do you want to migrate
Caller: Signal is bad and service is pathetic
Rep: Signal is bad at any specific location or everywhere sir?
Caller: It's bad at my home
Rep: Can a representative visit your place to check what is the issue?
Caller: Yes
Rep: For verification can you let me know your name, email id and date of birth
Caller: my name's Abdul email id is adbul@hotmail.com and data of birth is 9th march 1988
Rep: Thank you Can you confirm your home address please?
Caller: My home address is #35, 5th corss, 2nd main, Kalyan Nagar, Bengaluru - 560043
Rep: A representative will call you and visit your place to check signal issues and address your concern
Caller: Hope the issue will be resolved
Rep: We will do our best sir
Caller: Thanks
Rep: Is there anything that I can help you with
Caller: No
Rep: Thank you for calling and have a good day sir