
Map personal data in multi-person chat transcripts

Learning objectives

The objective of this tutorial is to group the personal data of individuals found in multi-person chat transcripts, which are semi-structured text. It uses Watson Natural Language Understanding (NLU) with a custom model built in Watson Knowledge Studio (WKS). This tutorial is aimed at developers who want to identify keywords in unstructured text, such as developers building General Data Protection Regulation (GDPR) solutions.


Prerequisites

  • IBM Cloud account: If you do not have an IBM Cloud account, you can create one here.
  • Watson Knowledge Studio account: If you do not have a WKS account, you can create a free account here. Make a note of the login URL, since it is unique to every login ID.
  • Basic familiarity with building custom models in Watson Knowledge Studio.
  • Any REST client. This guide uses Postman.

Estimated time

The estimated time to complete this guide depends partly on the number of documents used for annotation. For simplicity, this tutorial uses about twenty documents; for more accurate results, it is recommended to annotate a larger set. With the sample documents used here, the guide should take about 2 hours and 30 minutes.


Learn about GDPR here. One of the GDPR policies is the right to be forgotten, which means that user details must be thoroughly deleted from an organization in response to:

  • Customer requests
  • Stored personal data becoming invalid
  • Legal compliance obligations

Before an organization can delete personal data, it needs to identify that data across all of its sources. One source that organizations find challenging to process is unstructured text. In this tutorial, we address the need to identify personal data in unstructured text documents. We will look at documents containing the personal data of more than one person and map each data point to the individual it belongs to, using IBM Watson services: Watson Natural Language Understanding (NLU) and Watson Knowledge Studio (WKS).

In this guide, let us consider personal data such as name, email address, home address, and phone number. We will capture these personal data points and map them to the individuals they belong to.

The goal here is to import type systems and documents, create an annotator, and train, evaluate, and deploy the machine learning model to the Natural Language Understanding service.

Create Artifacts

You will need a set of documents to train and evaluate the WKS model. These documents contain the unstructured text from which we will identify personal data. Refer to the sample documents in the References section at the end of this document. You will need many such chat transcripts; you can either create them or gather them from other sources you may have, and store them in a folder on your local file system. To train a WKS model, a large and varied set of documents is needed: the more training data and the more variety in it, the more accurate the results will be. To complete this exercise within a reasonable time, we use a smaller set; it is recommended to have at least twenty such documents. In real-world scenarios, WKS models are trained on thousands of documents. You can learn more about documents here.
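If you script the creation of these files, a minimal sketch could look like the following. The folder name and the truncated transcript snippets here are just examples; use your own full transcripts.

```python
import os

# Illustrative only: write each chat transcript to its own UTF-8 .txt file
# so the folder can be imported into WKS as a document set.
transcripts = [
    "Rep: This is Thomas. How can I help you?\n"
    "Caller: This is Alex. I want to change my plan to corporate plan\n",
    "Rep: Thanks for calling Vodaphone. This is Monica. How can I help you\n"
    "Caller: I want to migrate to a different service provider\n",
]

def save_transcripts(transcripts, folder="chat_transcripts"):
    """Save each transcript as a separate text file and return the paths."""
    os.makedirs(folder, exist_ok=True)
    paths = []
    for i, text in enumerate(transcripts, start=1):
        path = os.path.join(folder, f"transcript_{i:02d}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths

paths = save_transcripts(transcripts)
print(paths)
```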

Create Project

Log in to WKS using the URL noted in the Prerequisites section. On the WKS home page, click Create Project.


The Create New Project pop-up window appears; enter the name of the new project and click Create.


Create type system

As mentioned earlier, the chat transcripts capture different attributes of individuals. We define these attributes as entity types in WKS. One entity type requires a little more attention than the others: the Person entity type. In a multi-person chat, each person's turn is prefixed by his or her role or name (for example, Agent, Caller, Customer, or Rep). The Person entity type is further divided into subtypes to distinguish between individuals. If you are new to entity types and subtypes, you can refer to this link.

You created a WKS project in the Create Project section. Navigate to that project and click Type Systems on the top navigation bar.


Click Add Entity Type.


Enter entity type name and click Save.


Type Rep under Subtypes.


Click Add after the Subtype text field.


Similarly add another subtype, Caller.


Click Save.


Similarly, add the other entity types as shown below. Note that only the Person entity type has subtypes.


Click the Relation Types tab. Another concept we use while modeling is the relation, which helps us map attributes to persons. Each attribute of a person can be defined using a relation (for example, the agent's name is "Harry"). We can define a relation hasName and read it as "Agent hasName Harry".
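Taken together, the entity types and relations form the type system. As a mental model only (the actual WKS export format is different, and the non-Person entity-type and relation names below are examples derived from the attributes listed earlier), the type system can be sketched in plain Python:

```python
# Illustrative sketch of the type system: entity types, Person subtypes,
# and relation types connecting a Person to each personal-data attribute.
entity_types = {
    "Person": {"subtypes": ["Rep", "Caller"]},
    "Name": {"subtypes": []},
    "EmailId": {"subtypes": []},
    "Address": {"subtypes": []},
    "PhoneNumber": {"subtypes": []},
}

relation_types = [
    # (relation, first entity type, second entity type)
    ("hasName", "Person", "Name"),
    ("hasEmailId", "Person", "EmailId"),
    ("hasAddress", "Person", "Address"),
    ("hasPhoneNumber", "Person", "PhoneNumber"),
]

# A relation reads like a sentence: "<first> <relation> <second>",
# e.g. "Agent hasName Harry" for a concrete mention pair.
rel, first, second = relation_types[0]
print(f"{first} {rel} {second}")
```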


Click Add Relation Type.


Enter Relation, First Entity type, Second Entity type as shown in the below diagram and click Save.


Similarly add all other relations as shown in the below image.


Import Documents

Click Documents on the top navigation bar.


Click Import Document Set.


Click import button on the popup window.


Browse to files created in Create Artifacts section. Select all the files. Click Import.


Documents are now imported.


Rename the document-set by clicking Rename link.


Create and assign annotation sets

Click Annotation Sets on top navigation bar to create annotation sets.


Click Create Annotation Sets. Type in a name for the annotation set and click Generate.


The annotation set is created.



Create dictionaries

Each conversation turn in a chat transcript is prefixed by the role or name of the individual, and there can be numerous mentions of these roles and names. Annotating all of them manually would be time-consuming and strenuous. Creating dictionaries lets us pre-annotate these mentions using the Dictionary annotator. Follow the instructions below to create dictionaries.
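To make the idea concrete, here is a rough sketch of what dictionary-based pre-annotation does: every occurrence of a known surface form is marked with its entity type and subtype. The surface forms below are examples, and WKS's own matching is more sophisticated than this.

```python
import re

# Illustrative only: map (entity type, subtype) to example surface forms.
dictionaries = {
    ("Person", "Rep"): ["Rep", "Representative", "Agent"],
    ("Person", "Caller"): ["Caller", "Customer"],
}

def pre_annotate(text, dictionaries):
    """Return (start, end, type, subtype, surface) for each dictionary hit."""
    mentions = []
    for (etype, subtype), forms in dictionaries.items():
        for form in forms:
            for m in re.finditer(r"\b" + re.escape(form) + r"\b", text):
                mentions.append((m.start(), m.end(), etype, subtype, m.group()))
    return sorted(mentions)

text = "Rep: This is Thomas. Caller: This is Alex."
print(pre_annotate(text, dictionaries))
```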

Click Dictionary on the top navigation bar.


Click the + icon to add a dictionary. Enter a dictionary name. Click Save.


PersonRepDictionary is created.


Add Surface Forms as shown in below image. Click Save.


Dictionary for Rep is created and saved.


Similarly create a dictionary for Caller.


When created it looks as in below image.


Dictionary Annotator

Click Annotator Component in the top navigation bar. Click Create this type of annotator in the Dictionary Annotator box.


Click Edit link under Actions corresponding to Person.


Select the two dictionaries that were created earlier. Click Save.


Under Create… dropdown, click Create & Run.


Select AnnotationSet1 and click Run.


Dictionary annotator is created.


Human Annotation

Click Human Annotation on the top navigation bar.

Click Add Task.


Enter a name for the task and click Create.


In the popup window, select the annotation set that was created earlier. Click Create Task.


The task will be created. Click on the Task.


Next, we need to annotate, mapping document entries to the entity types defined in the type system. Click Annotate.


Click OK on any alert message that pops up. The ground truth editor opens; here you can select the documents one by one for annotation. Click any of the documents.


Here you see that mentions of the individuals' roles are already pre-annotated, because we used the Dictionary annotator. Click any occurrence of the word Rep. On the right-hand side of the screen, under the Entity tab > Type column, you can see that Person is highlighted. Click the Subtype column and select the Rep subtype.


Repeat the steps of selecting Rep and assigning the Type and Subtype for all occurrences of Rep. Then click any occurrence of Caller and follow the same procedure, but ensure you select Caller as the Subtype.


Now click any other words that you want identified with entity types. For example, Natasha is a name and you want that word identified, so click the word Natasha and, on the right-hand side of the screen, select the corresponding entity type; in this case it is Name.


Follow the above steps for all the keywords that you want identified.


Two or more mentions can point to the same thing. For example, the entry Rep in multiple places refers to the same person, or a full name might be mentioned in one place and only the first name elsewhere in the document. To indicate that two or more occurrences refer to the same thing, we use a coreference. Click the Coreference section on the left-hand side of the screen.


Click all the Rep occurrences, then double-click the last occurrence. The model will understand that all the Rep occurrences are the same. Repeat this step for Caller and any other entities that refer to the same thing.


Click on the relation section on the left hand side of the screen.


All the entities that were marked are listed, and here you can mark relations between them. In each sentence, select (click) two entities that have a relation between them; if the relation is not listed, select the entities in the reverse order. The relation appears on the right-hand side of the screen under Relation Type; select the applicable relation.


Similarly repeat for other entities in the same sentence and also all other sentences.


Once all the relations are marked, we can complete the annotation of this document. Set the status to Completed as shown in the screen below.


Click the Save icon next to the status dropdown. After it is saved, click the Close button to close the annotation of this document. Repeat the annotation for all the documents; once done, the status of all files should be Completed.


Navigate to the Human Annotation task and click the task that was created. If the status shows IN PROGRESS, click the Refresh button.


The status should now change to SUBMITTED. Select the Annotation Set name and click Accept button.


Click OK on the confirmation popup window. Task status now changes to COMPLETED.


Click Annotator Component on the top navigation bar. Click Machine Learning annotator.


Select AnnotationSet1 and click Next.


For the entity type Person, select the two dictionaries that were created earlier, namely PersonRepDictionary and PersonCallerDictionary. Click Train & Evaluate.


The model training and evaluation process starts.


When Completed, the status is shown as in the screen below.


Create NLU service instance

The WKS model created needs to be deployed on an NLU instance.

Click here to create NLU service. The screen below is displayed.


Edit the field Service name to type NLUGDPR and leave the other settings default. Click Create.


NLU service instance should get created. In IBM Cloud dashboard, click NLUGDPR service that was created in the previous step.


On the left navigation bar, click Service credentials.


Click View Credentials.


Make a note of username and password.


Deploy WKS model to Watson Natural Language Understanding

Log in to WKS using the link noted in the Prerequisites section above. Click the project name that you created for this tutorial. Click Annotator Component on the top navigation bar, then click Details under the Machine Learning annotator.


Click Take Snapshot. Optionally enter a description and click OK.


The snapshot is created. Click Deploy under Action. Select Natural Language Understanding. Click Next.


Select your IBM Cloud Region, Space and NLU service instances. Click Deploy.


WKS model should get deployed on the NLU. Make a note of the Model ID. Click OK.


The model is deployed to NLU.


Testing the model

Test the deployed model using Postman (or any REST client) with the following details:

  • Method: POST
  • URL:
  • Authorization: Basic. Enter the NLU service username and password noted in the Create NLU service instance section.
  • Headers: Content-Type: application/json
  • Body:

    {
      "features": {
        "entities": {
          "model": "<model_id as noted in the 'Deploy WKS model to Watson Natural Language Understanding' section>"
        },
        "relations": {
          "model": "<model_id as noted in the 'Deploy WKS model to Watson Natural Language Understanding' section>"
        }
      },
      "text": "<the text from which personal data needs to be extracted>"
    }
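The same request can also be assembled programmatically. This is an illustrative Python sketch, not part of the Postman flow; the URL, credentials, and model ID below are placeholders to fill in with the values noted in the earlier steps.

```python
import base64
import json

# Placeholders -- substitute the values noted earlier in this tutorial.
NLU_URL = "<NLU service URL, including the analyze path and version parameter>"
USERNAME = "<username from service credentials>"
PASSWORD = "<password from service credentials>"
MODEL_ID = "<model_id from the deploy step>"

# Same body as the Postman request: entities and relations from the
# deployed custom model, applied to the given text.
payload = {
    "features": {
        "entities": {"model": MODEL_ID},
        "relations": {"model": MODEL_ID},
    },
    "text": "Rep: This is Thomas. How can I help you?",
}

# Basic auth header, equivalent to Postman's Basic Authorization setting.
auth = base64.b64encode(f"{USERNAME}:{PASSWORD}".encode()).decode()
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Basic {auth}",
}

body = json.dumps(payload)
# To send it, POST `body` with `headers` to NLU_URL, for example via
# urllib.request.Request(NLU_URL, data=body.encode(), headers=headers,
#                        method="POST")
print(body)
```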
These details are shown in the three images below.


Click Send. You should see the personal data extracted as shown in the screen below.


Interpreting the result:



Similarly, expand the rest of the nodes to check the personal data identified using the relations between them.
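To make the mapping concrete, here is a sketch of grouping the extracted attributes by person. The `response` dictionary below is hand-written to mimic a simplified shape of the NLU relations output; a real response also carries scores, sentences, and character locations, and its values depend on your trained model.

```python
# Hand-written, simplified stand-in for an NLU relations response.
response = {
    "relations": [
        {"type": "hasName",
         "arguments": [
             {"entities": [{"type": "Person", "text": "Rep"}]},
             {"entities": [{"type": "Name", "text": "Thomas"}]},
         ]},
        {"type": "hasName",
         "arguments": [
             {"entities": [{"type": "Person", "text": "Caller"}]},
             {"entities": [{"type": "Name", "text": "Alex"}]},
         ]},
        {"type": "hasEmailId",
         "arguments": [
             {"entities": [{"type": "Person", "text": "Caller"}]},
             {"entities": [{"type": "EmailId", "text": "alex@example.com"}]},
         ]},
    ]
}

def group_by_person(response):
    """Collect each relation's second argument under its Person argument."""
    people = {}
    for rel in response["relations"]:
        person_arg, value_arg = rel["arguments"]
        person = person_arg["entities"][0]["text"]
        value = value_arg["entities"][0]["text"]
        people.setdefault(person, {})[rel["type"]] = value
    return people

print(group_by_person(response))
```

The result is one record per person, which is the per-individual view of personal data that a GDPR deletion request needs.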


In this tutorial, we accomplished:

  • Defining a problem statement.
  • Importing a document set and Type Systems in WKS.
  • Annotating a document set.
  • Creating, training, and evaluating a custom machine learning model.
  • Deploying the WKS model to NLU.
  • Analyzing the results.

You should now know:

  • How to use a WKS custom model to identify personal data in unstructured text and link it to the individuals it belongs to.
  • How to define entity types, import documents, annotate documents, and create, train, and evaluate a machine learning model.
  • How to deploy a WKS model to NLU.
  • How to query NLU to identify personal data in a given text document.
  • How to analyze the result.

This concludes the tutorial. I hope you found it useful.

References

An example of a chat transcript:

Rep: This is Thomas. How can I help you?
Caller: This is Alex. I want to change my plan to corporate plan
Rep: Sure, I can help you. Do you want to change the plan for the number from which you are calling now?
Caller: yes
Rep: For verification purpose may I know your date of birth and email id
Caller: My data of birth is 10-Aug-1979 and my email id is
Rep: Which plan do you want to migrate to
Caller: Plan 450 unlimited
Rep: Can I have your company name and date of joining
Caller: I work for IBM and doj 01-Feb-99
Rep: Ok.. I have taken your request to migrate plan to 450 unlimited. You will get an update in 3 hours. Is there anything else that I can help you with
Caller: No
Rep: Thanks for calling Vodaphone. Have a good day
Caller: you too

Another example of a chat transcript:

Rep: Thanks for calling Vodaphone. This is Monica. How can I help you
Caller: I want to migrate to a different service provider
Rep: Sorry to hear that sir but why do you want to migrate
Caller: Signal is bad and service is pathetic
Rep: Signal is bad at any specific location or everywhere sir?
Caller: It's bad at my home
Rep: Can a representative visit your place to check what is the issue?
Caller: Yes
Rep: For verification can you let me know your name, email id and date of birth
Caller: my name's Abdul email id is and data of birth is 9th march 1988
Rep: Thank you Can you confirm your home address please?
Caller: My home address is #35, 5th corss, 2nd main, Kalyan Nagar, Bengaluru - 560043
Rep: A represetative will call you and visit your place to check signal issues and address your concern
Caller: Hope the issue will be resolved
Rep: We will do our best sir
Caller: Thanks
Rep: Is there anything that I can help you with
Caller: No
Rep: Than you for calling and have a good day sir