Overview and Creating a Watson Discovery instance
The goal of this recipe is to show you how to leverage data from the Watson Discovery Service as the basis for a machine learning annotator in Watson Knowledge Studio. Further details on the annotation itself, optimizing the model and deploying the annotator to WDS are out of the scope of this recipe and can be found in the additional links in step 7.
In this recipe you will follow these activities:
- Create a Watson Discovery instance – this service allows you to ingest, normalise, enrich and query private, licensed or public data (PDF, Word, HTML and JSON). In our scenario we will use the news collection shipped with the Discovery service as our data source and as a stand-in for your ingested data.
- Create a Watson Knowledge Studio instance – with this service you can create custom machine learning models (so-called annotators) that identify entities and relationships unique to your industry in unstructured text. These custom models can then be deployed into Watson Discovery for custom entity and relationship extraction.
- Use a custom Python script that leverages the Watson Discovery Query API to create a CSV file that can be used as the source for a document set in Watson Knowledge Studio. It uses the existing news collection shipped with the Discovery service as a data source and picks particular news articles from there.
- Import this CSV as a document set into Watson Knowledge Studio.
- Import a sample type system – in this scenario the KLUE type system – into Watson Knowledge Studio to take a glimpse at an existing type system optimized for news annotations.
- Create an annotation set and an annotation task to be able to take a look at the Ground Truth Editor in WKS.
Enough writing, let's get started with creating a Watson Discovery instance. Make sure you are logged in to your IBM Cloud Public account.
Navigate to Catalog -> Platform -> Watson and select the Discovery service.
Provide a name, region, organization and space and trigger the creation of the service. The lite plan is sufficient for our purpose.
Once the service has been provisioned, launch the Discovery tooling via Manage -> Launch Tool. If everything worked out fine you should see the Watson Discovery News collection that we will use in our recipe.
To be able to interact with this collection via the WDS APIs, we need to find out the Collection ID and the Environment ID.
Select the Watson Discovery News collection and then select the link “Use this collection in API”. Here you can find the Collection ID (“news-en”) and the Environment ID (“system”).
In addition you need to know the general credentials of your Watson Discovery instance. You can find this information in the service credentials section of the Discovery service (Service Credentials -> Actions -> View Credentials).
"username": "<your username>",
"password": "<your password>"
Be careful to double-check the URL as it depends on your region.
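As a quick sanity check, you can see how these values plug into the Discovery v1 query endpoint. The sketch below uses only the standard library; the base URL shown is the US-South default and the version date is an assumption, so substitute your own values (the credentials above are used as HTTP Basic Auth when you actually call this URL):

```python
from urllib.parse import urlencode

# Values from the Discovery tooling; the base URL depends on your region.
base_url = "https://gateway.watsonplatform.net/discovery/api"
environment_id = "system"
collection_id = "news-en"
version = "2018-03-05"  # assumed API version date; use a current one

query_url = (
    f"{base_url}/v1/environments/{environment_id}"
    f"/collections/{collection_id}/query?"
    + urlencode({"version": version, "count": 10})
)
print(query_url)
```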
Creating a Watson Knowledge Studio instance
In this step we will provision the Watson Knowledge Studio instance. The purpose of Watson Knowledge Studio is to build domain-specific models through combined supervised machine learning and rule-based annotations. It provides advanced macro and micro analysis tools to optimize the performance of your custom models.
Navigate to Catalog -> Platform -> Watson and select the Knowledge Studio service.
Once the service has been provisioned successfully, launch Knowledge Studio. To be able to work with Knowledge Studio you have to create a Workspace.
You have to provide a Workspace name and the language of the documents you will work with. Optionally you can provide a Workspace description, a component configuration and Project Manager(s) for this specific workspace.
The component configuration lets you choose between a Default tokenizer and a Dictionary-based tokenizer. If you have to cope with a lot of abbreviations, the Dictionary-based tokenizer will be helpful, as it allows you to influence e.g. the sentence segmentation.
After you have created your Workspace the first step will be to import the documents you want to annotate.
You can upload the WKS Documents Sets in Assets & Tools -> Documents -> Document Sets -> Upload Document Sets.
If you want to upload a Document set now you will find out that there are certain constraints you have to comply with.
- It must be a CSV file in UTF-8 format with two columns: 1) the document file name, 2) the document body
- Each document should not contain more than 2,000 words, and there is a hard limit of 40,000 bytes per document
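A small helper along these lines can keep each document body within those bounds before it is written to the CSV. The function name and defaults are hypothetical; the limits are the ones stated above:

```python
def fit_wks_limits(text, max_words=2000, max_bytes=40000):
    """Trim text to the assumed WKS document-set limits:
    at most max_words words and max_bytes UTF-8 bytes."""
    # Enforce the word limit first
    words = text.split()
    if len(words) > max_words:
        text = " ".join(words[:max_words])
    # Then enforce the byte limit on the UTF-8 encoding;
    # errors="ignore" drops a partially cut multi-byte character
    encoded = text.encode("utf-8")
    if len(encoded) > max_bytes:
        text = encoded[:max_bytes].decode("utf-8", errors="ignore")
    return text
```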
Wouldn't it be helpful to have a Python sample that you can modify to do that job, taking the data you have imported into WDS as a basis?
Python Script leveraging Watson Developer Cloud SDK to create the WKS Document Set
In this step we will work with the Watson Developer Cloud Python SDK, which gives you access to all the Watson services on IBM Cloud in one SDK.
Follow the instructions there to install the library; ideally the following command should be sufficient:
pip3 install --upgrade watson-developer-cloud
Verify that the "watson-developer-cloud" package has been installed:
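You can check this with `pip3 show watson-developer-cloud`, or programmatically with a small stdlib-only check like the sketch below (the helper name is ours, not part of the SDK):

```python
import importlib.util

def is_installed(package_name):
    """Return True if the given package can be imported."""
    return importlib.util.find_spec(package_name) is not None

# Prints True once the SDK is installed in your environment
print(is_installed("watson_developer_cloud"))
```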
The following public GitLab project hosted on IBM Cloud gives you access to the Python script that will give you a fast start for creating a WKS document set based on a WDS news collection. Make sure you use the master branch.
Clone the Git project and take a look at the file wks_create_doc_set.py. Make sure to replace wds_url, wds_username and wds_password with your specific values.
wds_url = "YOUR_WDS_URL"
wds_username = "YOUR_WDS_USERNAME"
wds_password = "YOUR_WDS_PASSWORD"
Now you are ready to run the script. If everything works you will see 10 documents returned from our news collection in the console. There should be a file "wks_document_set.csv" that represents your WKS document set and a folder "wds_json_docs" that contains the full JSON docs, which can be used for re-ingestion purposes.
$ python3 wks_create_doc_set.py
Matching results: 6617
Returned documents: 10
Oscars 2018: What to expect at Sunday’s Academy Awards – CNN
Malawi consent classes teach children no means no – CNN
Residents flee as Syrian regime takes control of villages in Eastern Ghouta – CNN
‘Security threat’ forces closing of US Embassy in Turkey – CNN
Ryan Seacrest has uneventful Oscars red carpet, despite misconduct accusation – CNN
Lacoste temporarily changes logo to raise awareness for endangered species – CNN
Jordan Peele is first black screenwriter to win best original screenplay – CNN
Stunning photos of Kyrgyzstan and Tajikistan | CNN Travel
Daniela Vega becomes Oscars’ first trans presenter – CNN
Key House races to watch in 2018 – CNN Video
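The core of such a script can be sketched as follows. This is a minimal stand-in, not the project's actual code: the field names `title`, `text` and `id` are assumed to mirror the News collection's schema, and `documents` stands for the `results` list a Discovery query returns:

```python
import csv
import json
import os

def write_wks_doc_set(documents, csv_path="wks_document_set.csv",
                      json_dir="wds_json_docs"):
    """Write the two-column WKS document set CSV (file name, body)
    and dump each full JSON document for later re-ingestion."""
    os.makedirs(json_dir, exist_ok=True)
    with open(csv_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for doc in documents:
            # Column 1: document name (we use the title attribute),
            # column 2: document body
            writer.writerow([doc["title"], doc["text"]])
            json_path = os.path.join(json_dir, doc["id"] + ".json")
            with open(json_path, "w", encoding="utf-8") as jf:
                json.dump(doc, jf)

# Hypothetical result of a Discovery query
docs = [{"id": "doc1", "title": "Sample headline - CNN",
         "text": "Body text."}]
write_wks_doc_set(docs)
```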
Importing the Document Set
Now that you have created a WKS-compliant document set it's time to import it. Switch back to WKS to upload the document set.
Once the upload has processed completely, you can select the uploaded document set to see the document names, for which we have used the title attribute of our News collection.
Import the KLUE type system in WKS
The goal here is not to fully explain the annotation process, but just to show you an existing (advanced) type system and give you a glimpse of what one could look like. Your first type systems will certainly look simpler.
To be able to create annotation sets and annotation tasks you need to have a type system in place. The KLUE type system is an excellent, but also very advanced sample in the news domain that has been used for pre-enriching the News collection. You can download it here:
Then drag and drop it in Assets & Tools -> Entity Types -> Upload
If you want to dive deeper into building entity type systems you can learn more here:
Create an Annotation Set and Annotation Task
After you have imported or created your type system you can create an annotation set and, based on this, create an annotation task.
Let's start with creating the annotation set based on the document set that you have created in the earlier steps. You can do this in Assets & Tools -> Documents -> Annotation Sets -> Create Annotation Sets.
In this dialog you define your base document set, an overlap percentage in case of multiple human annotators and the name of the annotation set to be created.
After that you create an annotation task in Assets & Tools -> Documents -> Tasks based on the annotation set you have created.
Now there is only one step left: select the Annotate action to get into the so-called Ground Truth Editor. Here you can annotate your document sets with mentions, relations and coreferences based on the KLUE type system you have imported.
Details on how this can be accomplished can be found in the links in step 7.
Where to go from here?
To get deeper into the next steps of training, optimizing and publishing a machine learning annotator, the IBM Cloud WKS docs are a good reference:
Your custom-trained machine learning annotator would then be published into a WDS instance. A new WDS collection would be created and configured to use the WKS machine learning annotator for entity and relationship extraction each time you ingest new documents into that collection.
Certainly a good place as well is the IBM Cloud Garage Architecture Center, in particular the reference architecture on Cognitive Discovery:
I hope this recipe has been useful to you in preparing a WKS corpus for document annotation, and that you learned something new along the way.