By Muralidhar Chavan | Published February 12, 2018 - Updated August 10, 2018
Artificial Intelligence, Data Science, Natural Language Processing
The objective of this tutorial is to extract personal data (or any keywords) from
unstructured text using Watson™ Natural Language Understanding with a custom
model that is built using Watson Knowledge Studio. This tutorial guide can be
used by developers who want to identify keywords from unstructured text (e.g.,
developers building General Data Protection Regulation (GDPR) solutions). This
tutorial guide was written specifically for the code pattern Fingerprinting
personal data from unstructured text, but it can be used for other requirements as well.
The estimated time depends on the number of documents that we will use for
annotation. In this tutorial guide, for the sake of simplicity, let’s consider
about 20 documents. For more accurate results, it is recommended to
annotate a larger set. However, for the 20 documents it should take about two
Learn about GDPR here. One of the GDPR policies is
the right to be forgotten, which means that user details must be thoroughly
deleted from an organization in response to:
To delete personal data, an organization first needs to identify that data
across all of its sources. One source that is challenging to process is
unstructured text. In this tutorial guide, we will address the need to identify
personal data in unstructured text documents. We will use IBM Watson services
such as Watson Natural Language Understanding (NLU) and Watson Knowledge Studio (WKS).
In this guide, let us consider personal data such as name, email, address, and
phone number. We will try to identify these items in the text documents.
The goal here is to import type systems and documents, create an annotator, train
and evaluate the machine learning model, and deploy it to the Natural Language
Understanding service.
You will need a set of documents to train and evaluate the WKS model. These
documents contain the unstructured text from which we will identify personal
data. Refer to the sample documents in the References section at the end of
this document. You will need many such chat transcripts. You can either create
them or get them from other sources you may have, then store them in a folder
on your local file system. Training a WKS model calls for a large and varied
set of documents: the more training data and the more variety in it, the more
accurate the results will be. To complete this exercise within the time
constraint, let’s use a smaller set; at least twenty documents are
recommended. In real-world scenarios, WKS models are trained on thousands of
documents.
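If you are creating the transcripts yourself, a small script can write them into a local folder ready for import. The following is only a sketch: it assumes plain-text files are acceptable for import, and the function name `save_transcripts`, the folder name `chat_transcripts`, and the file naming scheme are hypothetical choices made for illustration.

```python
import tempfile
from pathlib import Path


def save_transcripts(transcripts, folder):
    """Write each transcript to its own UTF-8 .txt file and return the paths."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(transcripts, start=1):
        # Hypothetical naming scheme: transcript_001.txt, transcript_002.txt, ...
        path = folder / f"transcript_{index:03d}.txt"
        path.write_text(text.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Shortened versions of the sample transcripts from the References section.
SAMPLE_TRANSCRIPTS = [
    "Rep: This is Thomas. How can I help you?\n"
    "Caller: This is Alex. I want to change my plan to corporate plan",
    "Rep: Thanks for calling Vodaphone. This is Monica. How can I help you\n"
    "Caller: I want to migrate to a different service provider",
]

if __name__ == "__main__":
    # A temporary directory keeps the demo from touching real folders.
    target = Path(tempfile.mkdtemp()) / "chat_transcripts"
    for path in save_transcripts(SAMPLE_TRANSCRIPTS, target):
        print(path.name)
```

In practice you would point `folder` at a permanent location and pass in your full set of twenty or more transcripts.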
Log in to WKS using the URL noted down in the prerequisites step for WKS.
On the WKS home page, click Create Project.
In the Create New Project pop-up window, enter the name of the new project.
As discussed earlier, there will be different attributes of individuals
captured in chat transcripts. We will define these attributes as entity types
in WKS. If you are new to entity types, you can refer to this
You created a WKS project in the ‘Create Project’ section. Navigate to that
project. Click Type Systems on the top navigation bar.
We will add these entity types:
Click Add Entity Type.
Enter an entity type name and click Save.
Similarly, add other entity types.
Click Documents on the top navigation bar.
Click Import Document Set.
Click the import button in the pop-up window. Browse to the chat transcripts folder that
was created in the Create Artifacts section. Select all the files. Click
Import. You may rename the document set to something meaningful.
All the documents can be grouped into different sets for annotation purposes.
These groups are called annotation sets. In annotation, a user goes through a
document set and marks, in each document in the document set, keywords that
represent personal data so that WKS will learn how to identify personal data
Click Annotation Sets on the top navigation bar to create annotation sets.
Click Create Annotation Sets.
Type in a name for the annotation set and click Generate.
The annotation set is created.
Click Human Annotation on the top navigation bar. Click Add Task.
Enter a name for the task and click Create.
In the pop-up window, select the Annotation Set that was created earlier. Click
The task should now be created. Click the task.
Next we need to annotate, mapping document entries to the entity types defined in
the type system. Click Annotate.
Click OK for any alert message that pops up. The ground truth editor opens. Here
you can select the documents one by one to annotate all of them. Click
any of the documents.
In the document, select a piece of text that you want to be extracted as an
entity. Then click the matching entity type on the right-hand side of
the screen. Do the same for all the keywords in the document.
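Conceptually, each annotation made in this step ties a span of text to an entity type. The sketch below illustrates that idea only; it is not how WKS stores annotations internally. The `Mention` class, the `annotate` helper, and the entity type names `Name` and `EmailId` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    start: int        # offset of the first character of the span
    end: int          # offset one past the last character of the span
    entity_type: str  # the entity type the span is mapped to


def annotate(text, phrase, entity_type):
    """Return a Mention covering the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in document")
    return Mention(start, start + len(phrase), entity_type)


# One line taken from the second sample transcript.
DOCUMENT = "Caller: my name's Abdul email id is firstname.lastname@example.org"

MENTIONS = [
    annotate(DOCUMENT, "Abdul", "Name"),
    annotate(DOCUMENT, "firstname.lastname@example.org", "EmailId"),
]

if __name__ == "__main__":
    for mention in MENTIONS:
        print(mention.entity_type, repr(DOCUMENT[mention.start:mention.end]))
```

Storing offsets rather than the raw phrase keeps each annotation unambiguous when the same word appears more than once in a document.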
Once all the keywords are mapped to entity types, select Completed from the
Click Save to save the changes.
Repeat the above steps for all the documents. All the documents should be annotated
and completed. If the status shows IN PROGRESS, click the Refresh button.
The status should now change to SUBMITTED. Select the annotation set name and
click the Accept button.
Click OK on the confirmation pop-up window. Task status now changes to
Click Annotator Component on the top navigation bar.
We will create a machine learning annotator, so click Create this type of
annotator under Machine Learning.
Under Document Set, select the annotation set that was completed in the
previous steps. Click Next.
Click Train and Evaluate.
The Train and Evaluate process starts. This step takes a few minutes to
complete.
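To get a feel for what evaluating an entity model involves, the sketch below compares human-annotated mentions with predicted mentions and computes precision, recall, and F1. This is a generic illustration under stated assumptions, not the scoring that WKS itself performs: it assumes exact-match scoring on `(start, end, entity_type)` tuples, and the function name `score` and the sample offsets are hypothetical.

```python
def score(gold, predicted):
    """Return (precision, recall, f1) for two sets of (start, end, type) tuples."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical mentions: four annotated by a human, two found by the model.
GOLD = {(18, 23, "Name"), (36, 66, "EmailId"), (90, 101, "Phone"), (120, 160, "Address")}
PREDICTED = {(18, 23, "Name"), (36, 66, "EmailId")}

if __name__ == "__main__":
    precision, recall, f1 = score(GOLD, PREDICTED)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

A model that finds few mentions but gets them right scores high on precision and low on recall, which is a common pattern with a small training set such as the twenty documents used here.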
The WKS model created needs to be deployed on an NLU instance.
to create the NLU service. The following screen is displayed.
Edit the Service name field to say NLUGDPR and leave the other settings at their
defaults. Click Create.
The NLU service instance should now be created. In the IBM Cloud dashboard, click the NLUGDPR
service that was just created in the above steps.
On the left navigation bar, click Service credentials.
Click View Credentials.
Make a note of the username and password.
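Rather than pasting the username and password into scripts, you may prefer to read them from environment variables. The sketch below assumes the service accepts HTTP Basic authentication with these values; the variable names `NLU_USERNAME` and `NLU_PASSWORD` and both helper functions are hypothetical.

```python
import base64
import os


def load_credentials(env=os.environ):
    """Read the NLU username and password from environment variables."""
    try:
        return env["NLU_USERNAME"], env["NLU_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Set {missing.args[0]} before running") from None


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header from the credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


if __name__ == "__main__":
    # Demo values only; real credentials come from the Service credentials page.
    demo_env = {"NLU_USERNAME": "user", "NLU_PASSWORD": "pass"}
    username, password = load_credentials(demo_env)
    print(basic_auth_header(username, password))
```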
In WKS, navigate to Annotator Component and click NLU.
Click Take Snapshot.
Enter any meaningful description for the snapshot. Click OK.
Snapshot is created.
Click Deploy to deploy the model to the NLU instance created in the Create NLU
service instance section. Click Deploy.
Select Natural Language Understanding. Click Next.
Select your IBM Cloud Region, Space and NLU service instances. Click
WKS model should get deployed on the NLU. Make a note of the Model Id. Click
Model is deployed to NLU.
Test the deployed model using Postman (a REST client). Use the following details:
"model": "<model_id as noted in 'Deploy WKS model to Watson Natural Language Understanding' section>"
"text": "<the text from which personal data needs to be extracted>"
These input details are captured as shown in the 3 images below.
Click Send. You should see the personal data extracted as in the screen
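The same test can be scripted instead of using a REST client. The sketch below builds the request body from the "model" and "text" fields listed above and groups the entities in a response by type. Several things here are assumptions: that the model id is passed under `features.entities.model`, that the response carries an `entities` list with `type` and `text` fields, and the endpoint itself (`EXAMPLE_URL` is a made-up host). The function names are hypothetical. The request is only constructed, not sent.

```python
import json
import urllib.request


def build_analyze_request(url, model_id, text, headers=None):
    """Build (but do not send) a POST request asking for custom-model entities."""
    payload = {
        "text": text,
        # Assumption: the custom WKS model is selected via the entities feature.
        "features": {"entities": {"model": model_id}},
    }
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


def entities_by_type(response):
    """Group the extracted entity texts by their entity type."""
    grouped = {}
    for entity in response.get("entities", []):
        grouped.setdefault(entity["type"], []).append(entity["text"])
    return grouped


# Hypothetical endpoint, for illustration only; use the URL of your own service.
EXAMPLE_URL = "https://nlu.example.com/v1/analyze?version=2018-03-16"

if __name__ == "__main__":
    request = build_analyze_request(
        EXAMPLE_URL, "<model_id>", "This is Alex. My email id is email@example.com"
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To send it, add an Authorization header and call urllib.request.urlopen(request).

    # Hypothetical response shape, trimmed to the fields used here.
    sample_response = {
        "entities": [
            {"type": "Name", "text": "Alex"},
            {"type": "EmailId", "text": "email@example.com"},
        ]
    }
    print(entities_by_type(sample_response))
```

Grouping by type makes it easy to see at a glance which categories of personal data were found in a document.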
In this tutorial guide, we:
You should now know:
This concludes the tutorial guide. I hope you found it useful.
An example of a chat transcript:
Rep: This is Thomas. How can I help you?
Caller: This is Alex. I want to change my plan to corporate plan
Rep: Sure, I can help you. Do you want to change the plan for the number from which you are calling now?
Rep: For verification purpose may I know your date of birth and email id
Caller: My data of birth is 10-Aug-1979 and my email id is email@example.com
Rep: Which plan do you want to migrate to
Caller: Plan 450 unlimited
Rep: Can I have your company name and date of joining
Caller: I work for IBM and doj 01-Feb-99
Rep: Ok.. I have taken your request to migrate plan to 450 unlimited. You will get an update in 3 hours. Is there anything else that I can help you with
Rep: Thanks for calling Vodaphone. Have a good day.
Caller: you too
Another example of a chat transcript:
Rep: Thanks for calling Vodaphone. This is Monica. How can I help you
Caller: I want to migrate to a different service provider
Rep: Sorry to hear that sir but why do you want to migrate
Caller: Signal is bad and service is pathetic
Rep: Signal is bad at any specific location or everywhere sir?
Caller: It's bad at my home
Rep: Can a representative visit your place to check what is the issue?
Rep: For verification can you let me know your name, email id and date of birth
Caller: my name's Abdul email id is firstname.lastname@example.org and data of birth is 9th march 1988
Rep: Thank you Can you confirm your home address please?
Caller: My home address is #35, 5th corss, 2nd main, Kalyan Nagar, Bengaluru - 560043
Rep: A representative will call you and visit your place to check signal issues and address your concern
Caller: Hope the issue will be resolved
Rep: We will do our best sir
Rep: Is there anything that I can help you with
Rep: Than you for calling and have a good day sir