Extracting personal data from unstructured text using Watson Knowledge Studio

Learning objectives

This tutorial shows how to extract personal data (or any other keywords) from unstructured text using Watson™ Natural Language Understanding with a custom model built in Watson Knowledge Studio. It is intended for developers who need to identify keywords in unstructured text, for example developers building General Data Protection Regulation (GDPR) solutions. The tutorial was written specifically for the code pattern Fingerprinting personal data from unstructured documents, but it can be used for other requirements as well.


Prerequisites

  • IBM Cloud account: If you do not have an IBM Cloud account, you can create an account here.
  • Watson Knowledge Studio account: If you do not have an account, you can create one. Make a note of the login URL, since it is unique to every account.
  • Basic familiarity with building custom models in Watson Knowledge Studio.
  • Any REST client. This guide uses Postman.

Estimated time

The time required depends on the number of documents you annotate. For simplicity, this tutorial uses about 20 documents, which should take about two hours. For more accurate results, annotate a larger set.


Learn about GDPR. One of the GDPR policies is the right to be forgotten, which means that user details must be thoroughly deleted from an organization in response to:

  • Customer requests
  • Stored personal data becoming invalid
  • Legal compliance obligations

To delete personal data, an organization first needs to identify it in all of its sources. One source that organizations find challenging to process is unstructured text. This tutorial addresses the need to identify personal data in unstructured text documents, using the IBM Watson services Watson Natural Language Understanding (NLU) and Watson Knowledge Studio (WKS).

In this guide, personal data means attributes such as name, email, address, and phone number. We will try to identify these in the text documents.

The goal here is to import type systems and documents, create an annotator, train and evaluate the machine learning model, and deploy it to the Natural Language Understanding service.

Create Artifacts

You need a set of documents to train and evaluate the WKS model. These documents contain the unstructured text from which we will identify personal data. In this tutorial the documents are chat transcripts; two examples appear at the end of this guide. You can create the transcripts yourself or get them from other sources you have, and store them in a folder on your local file system. Training a WKS model calls for a large and varied set of documents: the more training data and the more variety it contains, the more accurate the results will be. Given the time constraint, this exercise uses a smaller set; at least twenty documents are recommended. In real-world scenarios, WKS models are trained on thousands of documents. You can learn more about Documents.
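Before importing, it can help to check the folder quickly. The following is a minimal sketch, assuming the transcripts are stored as UTF-8 plain-text files with a .txt extension; the function name check_training_folder and the default threshold are illustrative and not part of WKS.

```python
from pathlib import Path

# The tutorial recommends at least twenty documents; adjust as needed.
MIN_DOCUMENTS = 20


def check_training_folder(folder, min_documents=MIN_DOCUMENTS):
    """Return (ok, problems) for a folder of plain-text transcripts.

    Hypothetical helper: it only checks file count, encoding, and emptiness.
    """
    problems = []
    files = sorted(Path(folder).glob("*.txt"))
    for path in files:
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            problems.append(f"{path.name}: not valid UTF-8")
            continue
        if not text.strip():
            problems.append(f"{path.name}: empty file")
    if len(files) < min_documents:
        problems.append(
            f"only {len(files)} .txt files, expected at least {min_documents}"
        )
    return (not problems, problems)
```

Calling check_training_folder("transcripts") on a hypothetical transcripts folder returns a pass/fail flag and a list of human-readable problems to fix before the import step.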

Create Project

Log in to WKS using the login URL that you noted in the prerequisites.

On the WKS home page, click Create Project.


In the Create New Project pop-up window, enter the name of the new project. Click Create.


Create type system

The chat transcripts capture different attributes of individuals. We will define these attributes as entity types in WKS. If you are new to entity types, you can refer to this link.

Navigate to the WKS project that you created in the Create Project section. Click Type Systems on the top navigation bar.


We will add these entity types:

  • Name
  • EmailId
  • Address
  • MobileNo
  • EmployeeId
  • Company
  • DOB
  • DOJ

Click Add Entity Type.


Enter an entity type name and click Save.


Add the other entity types in the same way.


Import Documents

Click Documents on the top navigation bar.


Click Import Document Set.


Click the import button in the pop-up window. Browse to the chat transcripts folder that you created in the Create Artifacts section. Select all the files and click Import. You may rename the document set to something meaningful.

Create and assign annotation sets

The documents can be grouped into different sets for annotation purposes. These groups are called annotation sets. During annotation, a user goes through each document in a set and marks the keywords that represent personal data, so that WKS learns how to identify personal data in documents.

Click Annotation Sets on the top navigation bar to create annotation sets.
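To make the idea of annotation concrete, the sketch below represents each marked keyword as a character span with an entity type. This is only an illustration of the concept, not the format WKS stores internally; the Mention class and annotate function are hypothetical names, and the sample line is adapted from the example transcripts at the end of this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """One annotated keyword: where it sits in the text and its entity type."""
    start: int
    end: int
    entity_type: str
    text: str


def annotate(document, labels):
    """Turn (surface_text, entity_type) pairs into Mention spans.

    Uses the first occurrence of each surface text in the document.
    """
    mentions = []
    for surface, entity_type in labels:
        start = document.find(surface)
        if start == -1:
            raise ValueError(f"{surface!r} not found in document")
        mentions.append(
            Mention(start, start + len(surface), entity_type, surface)
        )
    return mentions


line = "Caller: My date of birth is 10-Aug-1979 and my email id is alex@gmail.com"
mentions = annotate(line, [("10-Aug-1979", "DOB"), ("alex@gmail.com", "EmailId")])
for mention in mentions:
    print(mention.entity_type, repr(line[mention.start:mention.end]))
```

Each span pairs a stretch of text with one of the entity types defined in the type system, which is the information a human annotator supplies in the ground truth editor.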


Click Create Annotation Sets.


Type a name for the annotation set and click Generate.


The annotation set is created.


Human Annotation

Click Human Annotation on the top navigation bar. Click Add Task.


Enter a name for the task and click Create.


In the pop-up window, select the annotation set that was created earlier. Click Create Task.


The task is created. Click the task.


Next, annotate the documents by mapping entries in each document to the entity types defined in the type system. Click Annotate.


Click OK for any alert message that pops up. The ground truth editor opens. Here you can select each document in turn and annotate it. Click any of the documents.


In the document, select a piece of text that you want extracted as an entity, then click the matching entity type on the right-hand side of the screen. Repeat this for all the keywords in the document.


Once all the keywords are mapped to entity types, select Completed from the status drop-down.


Click Save to save the changes.


Repeat the above steps for all the documents. Every document should be annotated and marked Completed. If the status shows IN PROGRESS, click the Refresh button.


The status should now change to SUBMITTED. Select the annotation set name and click the Accept button.


Click OK in the confirmation pop-up window. The task status now changes to COMPLETED.


Click Annotator Component on the top navigation bar.


We will create a machine learning annotator, so click Create this type of annotator under Machine Learning.


Under Document Set, select the annotation set that was completed in the previous steps. Click Next.


Click Train and Evaluate.


The train and evaluate process starts. It takes a few minutes to complete.


Create NLU service instance

The WKS model that you created needs to be deployed to an NLU instance.

Click here to create the NLU service. The following screen is displayed.


Edit the Service name field to say NLUGDPR and leave the other settings at their defaults. Click Create.


The NLU service instance is created. In the IBM Cloud dashboard, click the NLUGDPR service that you just created.


On the left navigation bar, click Service credentials.


Click View Credentials.


Make a note of the username and password.


Deploy WKS model to Watson Natural Language Understanding

In WKS, navigate to Annotator Component and click NLU.


Click Details.


Click Take Snapshot.


Enter any meaningful description for the snapshot. Click OK.


The snapshot is created.


Click Deploy to deploy the model to the NLU instance that you created in the Create NLU service instance section.


Select Natural Language Understanding. Click Next.


Select your IBM Cloud region, space, and NLU service instance. Click Deploy.


The WKS model is deployed to NLU. Make a note of the Model Id. Click OK.


The model is now deployed to NLU.


Testing the model

Test the deployed model using Postman (a REST client). Use the following details:

These input details are captured as shown in the three images below.
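For readers who prefer a script to Postman, the same kind of request can be sketched in Python. This is a minimal sketch under stated assumptions, not the exact request from this tutorial: it assumes the NLU v1 analyze endpoint with basic authentication using the username and password noted earlier, the version date 2017-02-27 is an assumed value, and the base URL must come from your own service credentials. The function names are illustrative.

```python
import base64
import json
import urllib.request

# Assumption: a version date accepted by the NLU v1 API; check your service docs.
DEFAULT_VERSION = "2017-02-27"


def build_analyze_request(text, model_id, username, password, base_url,
                          version=DEFAULT_VERSION):
    """Build a POST request asking NLU to extract entities with a custom model.

    base_url is the API URL shown in your NLU service credentials.
    model_id is the Model Id noted when the WKS model was deployed.
    """
    payload = {
        "text": text,
        "features": {"entities": {"model": model_id}},
    }
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/analyze?version={version}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token.decode("ascii"),
        },
        method="POST",
    )


def analyze(request):
    """Send the request and return the parsed JSON response (needs network)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL, headers, and body before any call is made, much like reviewing the tabs in a REST client.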




Click Send. You should see the personal data extracted as in the screen below.
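Once a response arrives, the extracted entities can be grouped by entity type for easier review. The sketch below assumes the response JSON has an entities list whose items carry type and text fields; the sample response is hand-written for illustration and is not output captured from a real service, and group_entities is a hypothetical helper name.

```python
from collections import defaultdict


def group_entities(response):
    """Group extracted entity texts by their entity type."""
    grouped = defaultdict(list)
    for entity in response.get("entities", []):
        grouped[entity["type"]].append(entity["text"])
    return dict(grouped)


# Hypothetical response, written by hand to show the assumed shape.
sample_response = {
    "language": "en",
    "entities": [
        {"type": "Name", "text": "Alex", "count": 1},
        {"type": "DOB", "text": "10-Aug-1979", "count": 1},
        {"type": "EmailId", "text": "alex@gmail.com", "count": 1},
    ],
}

for entity_type, texts in group_entities(sample_response).items():
    print(entity_type, texts)
```

Grouping by type gives one list per entity type defined in the type system, which is a convenient shape for checking what personal data a document contains.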



In this tutorial guide, we:

  • Defined a problem statement.
  • Imported a document set and type systems in WKS.
  • Annotated the document set.
  • Created, trained, and evaluated a custom machine learning model.
  • Deployed the WKS model to NLU.
  • Analyzed the results.

You should now know:

  • How to effectively use a WKS custom model to identify metadata in unstructured text.
  • How to define entity types, import documents, annotate documents, and create, train, and evaluate machine learning models.
  • How to deploy a WKS model to NLU.
  • How to query NLU to identify metadata in a given text document.

This concludes the tutorial guide. I hope you found it useful.

An example of a chat transcript:

Rep: This is Thomas. How can I help you?
Caller: This is Alex. I want to change my plan to corporate plan
Rep: Sure, I can help you. Do you want to change the plan for the number from which you are calling now?
Caller: yes
Rep: For verification purposes may I know your date of birth and email id
Caller: My date of birth is 10-Aug-1979 and my email id is alex@gmail.com
Rep: Which plan do you want to migrate to
Caller: Plan 450 unlimited
Rep: Can I have your company name and date of joining
Caller: I work for IBM and doj 01-Feb-99
Rep: Ok.. I have taken your request to migrate plan to 450 unlimited. You will get an update in 3 hours. Is there anything else that I can help you with
Caller: No
Rep: Thanks for calling Vodaphone. Have a good day.
Caller: you too

Another example of a chat transcript:

Rep: Thanks for calling Vodaphone. This is Monica. How can I help you
Caller: I want to migrate to a different service provider
Rep: Sorry to hear that sir but why do you want to migrate
Caller: Signal is bad and service is pathetic
Rep: Signal is bad at any specific location or everywhere sir?
Caller: It's bad at my home
Rep: Can a representative visit your place to check what is the issue?
Caller: Yes
Rep: For verification can you let me know your name, email id and date of birth
Caller: my name's Abdul email id is adbul@hotmail.com and date of birth is 9th march 1988
Rep: Thank you Can you confirm your home address please?
Caller: My home address is #35, 5th cross, 2nd main, Kalyan Nagar, Bengaluru - 560043
Rep: A representative will call you and visit your place to check signal issues and address your concern
Caller: Hope the issue will be resolved
Rep: We will do our best sir
Caller: Thanks
Rep: Is there anything else that I can help you with
Caller: No
Rep: Thank you for calling and have a good day sir