Muralidhar Chavan | Updated July 13, 2018 - Published February 12, 2018
The objective of this tutorial is to identify and group the personal data of individuals found in multi-person chat transcripts, a form of semi-structured text. It uses Watson Natural Language Understanding (NLU) with a custom model built using Watson Knowledge Studio (WKS). This tutorial is aimed at developers who want to identify keywords in unstructured text, such as those building General Data Protection Regulation (GDPR) solutions.
The time needed to complete this guide depends largely on the number of documents used for annotation. For simplicity, this tutorial uses about twenty documents; annotating a larger set is recommended for more accurate results. With the sample documents used here, it should take about 2 hours and 30 minutes.
One of the GDPR policies is the right to be forgotten, which means that a user's details must be thoroughly deleted from an organization's systems when the user requests erasure.
For organizations to delete personal data, they first need to identify it across all of their data sources. One source they find challenging to process is unstructured text. In this tutorial, we address the need to identify personal data in unstructured text documents. We look at documents containing the personal data of more than one person and map each data point to the individual it belongs to. We use IBM Watson services, namely Watson Natural Language Understanding (NLU) and Watson Knowledge Studio (WKS).
In this guide, we consider personal data such as name, email address, home address, and phone number. We will capture these personal data points and map them to the individuals they belong to.
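To make the goal concrete, the end result can be pictured as a per-person grouping of these data points. The structure and field names below are illustrative only (they are not a Watson output format); the values come from the sample transcripts in the References section:

```python
# Illustrative only: the kind of per-person grouping we want to end up
# with. Field names are our own, not a Watson output format; the values
# come from the first sample chat transcript.
extracted = {
    "Caller": {
        "Name": "Alex",
        "Email": "firstname.lastname@example.org",
        "DateOfBirth": "10-Aug-1979",
    },
    "Rep": {
        "Name": "Thomas",
    },
}

for person, fields in extracted.items():
    print(person, "->", fields)
```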
The goal here is to import type systems and documents, create an annotator, and train, evaluate, and deploy the machine learning model to the Natural Language Understanding (NLU) service.
You will need a set of documents to train and evaluate the WKS model. These documents contain the unstructured text from which we will identify personal data. Refer to the sample documents in the References section at the end of this document. You will need many such chat transcripts; you can either create them or gather them from other sources, and store them in a folder on your local file system. Training a WKS model requires a large and varied set of documents: the more training data, and the more variety in it, the more accurate the results will be. To complete this exercise in a reasonable time, we use a smaller set; it is recommended to have at least twenty such documents. In real-world scenarios, WKS models are trained on thousands of documents.
Log in to WKS using the URL noted in the prerequisites step for WKS. On the WKS home page, click Create Project.
The Create New Project pop-up window appears. Enter the name of the new project and click Create.
As mentioned earlier, different attributes of individuals are captured in the chat transcripts. We will define these attributes as entity types in WKS. One entity type requires a little more attention than the others: the Person entity type. In a multi-person chat, each person's turns are labeled with his or her role or name (for example, Agent, Caller, Customer, or Rep). The Person entity type is therefore divided into subtypes to distinguish between individuals. If you are new to entity types and subtypes, refer to the Watson Knowledge Studio documentation.
You created a WKS project in the Create Project section. Navigate to that project and click Type Systems on the top navigation bar.
Click Add Entity Type.
Enter entity type name and click Save.
Type Rep under Subtypes.
Click Add after the Subtype text field.
Similarly add another subtype, Caller.
Similarly, add the other entity types as shown below. Note that only the Person entity type has subtypes.
Click the Relation Types tab. Another concept we use while modeling is the relation, which helps us map attributes to persons. Each person has personal data that can be defined using a relation (for example, the Agent's name is "Harry"). We can define a relation called hasName and read it as "Agent hasName Harry."
Click Add Relation Type.
Enter the Relation, First Entity type, and Second Entity type as shown in the diagram below, then click Save.
Similarly, add all the other relations as shown in the image below.
Click Documents on the top navigation bar.
Click Import Document Set.
Click the Import button on the pop-up window.
Browse to the files created in the Create Artifacts section and select all of them.
Documents are now imported.
Rename the document set by clicking the Rename link.
Click Annotation Sets on top navigation bar to create annotation sets.
Click Create Annotation Sets. Type in a name for the annotation set and confirm.
The annotation set is created.
Each conversation in a chat transcript is represented by the role or name of the individual, and there can be numerous mentions of these roles and names. Annotating all these mentions by hand would be time consuming and strenuous. Creating dictionaries lets us pre-annotate these mentions using a Dictionary annotator. Follow the instructions below to create the dictionaries.
Click Dictionary on the top navigation bar.
Click the + icon to add a dictionary. Enter a dictionary name, PersonRepDictionary, and save it.
PersonRepDictionary is created.
Add surface forms as shown in the image below. Click Save.
The dictionary for Rep is created and saved.
Similarly, create a dictionary for Caller.
When created, it looks as in the image below.
Click Annotator Component in the top navigation bar. Click Create this
type of annotator in the Dictionary Annotator box.
Click Edit link under Actions corresponding to Person.
Select the two dictionaries that were created earlier. Click Save.
Under Create… dropdown, click Create & Run.
Select AnnotationSet1 and click Run.
Dictionary annotator is created.
Click Human Annotation on the top navigation bar.
Click Add Task.
Enter a name for the task and click Create.
In the pop-up window, select the annotation set that was created earlier and confirm.
The task will be created. Click the task.
Next, we need to annotate: map document entries to the entity types defined in the type system. Click Annotate.
Click OK on any alert message that pops up. The Ground Truth editor opens.
Here you can select each document one by one for annotation. Click on any of the documents.
Here you see that mentions of roles of individuals are already pre-annotated
because we have used Dictionary annotator. Click on any appearance of the word
Rep. On the right hand side of the screen under Entity tab > Type
column, you can see that Person is highlighted. Click Subtype column
and select Rep subtype.
Repeat the steps of selecting Rep and assigning Type and Subtype for all
occurrences of Rep. Now click on any occurrence of Caller and follow
the procedure of selecting Type and Subtype but for Caller ensure
you select Caller as Subtype.
Click on any other words now that you want identified with entity types. For
example, Natasha is a name and you want that word to be identified. So
click on the word Natasha. On the right hand side of the screen select the
corresponding entity type. In this case it is Name.
Follow the above steps for all the keywords that you want identified.
Two or more mentions can point to the same thing. For example, the entry Rep in multiple places refers to the same person; or the full name could be mentioned in one place and only the first name elsewhere in the document. To indicate that two or more occurrences refer to the same thing, we use a coreference. Click the Coreference section on the left-hand side of the screen.
Click on all the Rep occurrences, then double-click on the last occurrence. The model will understand that all the Rep occurrences refer to the same person. Repeat the above step for Caller and any other entities that refer to the same thing.
Click on the relation section on the left hand side of the screen.
All the entities that were marked are listed. Here you can mark relations between the entities. In each sentence, select (click) two entities that have a relation between them; if the relation is not listed, select the entities in the reverse order. The relationship is listed on the right-hand side of the screen under Relation Type. Select the applicable relation.
Similarly, repeat for the other entities in the same sentence and for all other sentences.
Once all the relations are marked, we can complete the annotation of this document.
Select status as Completed as shown in the screen below.
Click the Save icon next to the status dropdown. After it is saved, click
Close button to close the annotation of this document. Repeat annotation
for all the documents. Once done, the status of all files should be Completed.
Navigate to Human Annotation task. Click the task that was created. If the
status shows IN PROGRESS, click Refresh button.
The status should now change to SUBMITTED. Select the Annotation Set name and
click Accept button.
Click OK on the confirmation pop-up window. The task status now changes to COMPLETED.
Click Annotator Component on the top navigation bar. Click Create this type of annotator in the Machine Learning box.
Select AnnotationSet1 and click Next.
For the entity type Person, select the two dictionaries that were created earlier, namely PersonRepDictionary and PersonCallerDictionary. Click
Train & Evaluate.
The model training and evaluation process starts.
When completed, the status is shown as in the screen below.
The WKS model that was created needs to be deployed on an NLU instance. From the IBM Cloud catalog, create an NLU service. The screen below is displayed.
Edit the field Service name to type NLUGDPR and leave the other settings
default. Click Create.
The NLU service instance is created. In the IBM Cloud dashboard, click the
NLUGDPR service that was created in the previous step.
On the left navigation bar, click Service credentials.
Click View Credentials.
Make a note of username and password.
Log in to WKS using the link noted in the Prerequisites section above. Click the project name that was created for this tutorial. Click Annotator Component on the top navigation bar. Click Details under the Machine Learning annotator.
Click Take Snapshot. Optionally enter a description and click OK.
The snapshot is created. Click Deploy under Action. Select Natural
Language Understanding. Click Next.
Select your IBM Cloud region, space, and NLU service instance, then click Deploy.
The WKS model should get deployed on NLU. Make a note of the Model ID and click OK.
The model is now deployed to NLU.
Test the deployed model using Postman (or any other REST client). Use the following details in the request body; note that the model ID is passed to both the entities and relations features:
"model": "<model_id as noted in 'Deploy WKS model to Watson Natural Language Understanding' section>"
"text": "<the text from which personal data needs to be extracted>"
The above details are captured in the three images below.
Click Send. You should see the personal data extracted as shown in the response.
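If you prefer to script the test instead of using Postman, the request can be sketched in Python. This is a minimal sketch, assuming the 2018-era NLU endpoint, version date, and username/password credentials noted earlier; the URL and authentication may differ for your region and service instance:

```python
import base64
import json
import urllib.request

# Assumed endpoint and version date; check the URL shown in your
# service credentials, as it is region-specific.
NLU_URL = ("https://gateway.watsonplatform.net/"
           "natural-language-understanding/api/v1/analyze?version=2018-03-16")

def build_payload(model_id, text):
    # The custom WKS model ID is supplied to both the entities and
    # relations features so NLU uses it for both kinds of extraction.
    return {
        "text": text,
        "features": {
            "entities": {"model": model_id},
            "relations": {"model": model_id},
        },
    }

def analyze(username, password, model_id, text):
    """POST the analyze request using the username/password noted in
    the Service credentials step (HTTP Basic auth)."""
    body = json.dumps(build_payload(model_id, text)).encode("utf-8")
    creds = base64.b64encode(f"{username}:{password}".encode()).decode()
    req = urllib.request.Request(
        NLU_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {creds}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `analyze(username, password, model_id, chat_text)` should return the same JSON that Postman displays.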
Interpreting the result:
Similarly, expand the rest of the nodes to check the personal data identified using the relations between them.
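Once the response comes back, its relations array can be walked to group the personal data per person. The sketch below assumes each relation's first argument is the Person mention and the second is the attribute, as modeled in WKS, and uses a hand-made sample shaped like NLU's relations output:

```python
def group_personal_data(nlu_response):
    """Group extracted personal data per person from NLU relation
    results. Assumes the first relation argument is the Person mention
    and the second is the attribute, as modeled in WKS."""
    grouped = {}
    for rel in nlu_response.get("relations", []):
        args = rel.get("arguments", [])
        if len(args) < 2:
            continue
        person = args[0]["entities"][0]["text"]
        attr_text = args[1]["entities"][0]["text"]
        grouped.setdefault(person, {})[rel["type"]] = attr_text
    return grouped

# Hand-made sample shaped like an NLU relations response.
sample = {
    "relations": [
        {"type": "hasName",
         "arguments": [
             {"text": "Caller", "entities": [{"type": "Person", "text": "Caller"}]},
             {"text": "Alex", "entities": [{"type": "Name", "text": "Alex"}]},
         ]},
        {"type": "hasEmail",
         "arguments": [
             {"text": "Caller", "entities": [{"type": "Person", "text": "Caller"}]},
             {"text": "firstname.lastname@example.org",
              "entities": [{"type": "Email",
                            "text": "firstname.lastname@example.org"}]},
         ]},
    ]
}

print(group_personal_data(sample))
# {'Caller': {'hasName': 'Alex', 'hasEmail': 'firstname.lastname@example.org'}}
```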
In this tutorial, we imported a type system and documents into WKS, created dictionary and machine learning annotators, trained and evaluated a custom model, deployed it to Natural Language Understanding, and used it to extract personal data from chat transcripts. You should now know how to build a custom WKS model and use it with NLU to identify the personal data of individuals in unstructured text and map it to the people it belongs to.
This concludes the tutorial. I hope you found it useful.
An example of a chat transcript:
Rep: This is Thomas. How can I help you?
Caller: This is Alex. I want to change my plan to corporate plan
Rep: Sure, I can help you. Do you want to change the plan for the number from which you are calling now?
Rep: For verification purpose may I know your date of birth and email id
Caller: My date of birth is 10-Aug-1979 and my email id is firstname.lastname@example.org
Rep: Which plan do you want to migrate to
Caller: Plan 450 unlimited
Rep: Can I have your company name and date of joining
Caller: I work for IBM and doj 01-Feb-99
Rep: Ok.. I have taken your request to migrate plan to 450 unlimited. You will get an update in 3 hours. Is there anything else that I can help you with
Rep: Thanks for calling Vodaphone. Have a good day
Caller: you too
Another example of a chat transcript:
Rep: Thanks for calling Vodaphone. This is Monica. How can I help you
Caller: I want to migrate to a different service provider
Rep: Sorry to hear that sir but why do you want to migrate
Caller: Signal is bad and service is pathetic
Rep: Signal is bad at any specific location or everywhere sir?
Caller: It's bad at my home
Rep: Can a representative visit your place to check what is the issue?
Rep: For verification can you let me know your name, email id and date of birth
Caller: my name's Abdul email id is email@example.com and date of birth is 9th march 1988
Rep: Thank you Can you confirm your home address please?
Caller: My home address is #35, 5th cross, 2nd main, Kalyan Nagar, Bengaluru - 560043
Rep: A representative will call you and visit your place to check signal issues and address your concern
Caller: Hope the issue will be resolved
Rep: We will do our best sir
Rep: Is there anything that I can help you with
Rep: Thank you for calling and have a good day sir