Overview

Skill Level: Intermediate

Skill requirements: IBM Cloud, Git and Python basics

Have you imported a large number of domain-specific documents into Watson Discovery Service (WDS) for data analysis, and are you now tasked with performing domain-specific information extraction? If that is your challenge, take a look at this recipe.

Ingredients

  • Skills needed: IBM Cloud, Git and Python basics
  • IBM Cloud account (created on https://bluemix.net)
  • Installed Python environment
  • Installed Git client

Step-by-step

  1. Overview and Creating a Watson Discovery instance

    The goal of this recipe is to show you how to use data from Watson Discovery Service (WDS) as the basis for a machine learning annotator in Watson Knowledge Studio (WKS). Further details on the annotation itself, optimizing the model and deploying the annotator to WDS are out of the scope of this recipe and can be found in the additional links in step 7.

    In this recipe you will follow these activities:

    1. Create a Watson Discovery instance: this service allows you to ingest, normalise, enrich and query private, licensed or public data (PDF, Word, HTML and JSON). In our scenario we will use the news collection shipped with the Discovery service as our data source and as a stand-in for your ingested data.
    2. Create a Watson Knowledge Studio instance: with this service you can create custom machine learning models (so-called annotators) that identify entities and relationships unique to your industry in unstructured text. These custom models can then be deployed into Watson Discovery for custom entity and relationship extraction.
    3. Use a custom Python script that leverages the Watson Discovery Query API to create a CSV file that can be used as the source for a document set in Watson Knowledge Studio. It uses the existing news collection shipped with the Discovery service as a data source and picks particular news articles from there.
    4. Import this CSV as a document set into Watson Knowledge Studio.
    5. Import a sample type system (in this scenario the KLUE type system) into Watson Knowledge Studio to get a glimpse of an existing type system optimized for news annotations.
    6. Create an annotation set and annotation task to be able to take a look at the Ground Truth Editor in WKS.

    Enough writing, let's get started with creating a Watson Discovery instance. Make sure you are logged into your IBM Cloud Public account.

    Navigate to Catalog -> Platform -> Watson and select the Discovery Service.

    watson_discover_service_catalog

     

    Provide a name, region, organization and space and trigger the creation of the service. The lite plan is sufficient for our purpose.

    Once the service has been provisioned, launch the Discovery tooling via Manage -> Launch Tool. If everything worked out fine, you should see the Watson Discovery News collection that we will use in our recipe.

    watson_discovery_service_news

     

    To interact with this collection via the WDS APIs, we need to find out its Collection Id and Environment Id.

    Select the Watson Discovery News collection and then select the link “Use this collection in API”. Here you can find the Collection Id: “news-en” and the Environment Id: “system”.

    collection_environment_id

     

    In addition you need to know the general credentials of your Watson Discovery instance. You can find this information in the service credentials section of the Discovery service (Service Credentials -> Actions -> View Credentials).

    {
      "url": "https://gateway-fra.watsonplatform.net/discovery/api",
      "username": "<your username>",
      "password": "<your password>"
    }

    Be careful to double-check the URL as it depends on your region.
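
    To make the role of these three values concrete, here is a minimal sketch of how the credentials, the Environment Id and the Collection Id could be combined into a query URL. It assumes the Discovery v1 REST path /v1/environments/{environment_id}/collections/{collection_id}/query; the helper name build_query_url and the version date are illustrative assumptions, not something taken from this recipe.

```python
import json
from urllib.parse import urlencode

# Credentials as shown above; the values are placeholders.
CREDENTIALS_JSON = """
{
  "url": "https://gateway-fra.watsonplatform.net/discovery/api",
  "username": "<your username>",
  "password": "<your password>"
}
"""

ENVIRONMENT_ID = "system"   # from "Use this collection in API"
COLLECTION_ID = "news-en"   # from "Use this collection in API"
API_VERSION = "2018-03-05"  # assumption: use a version date your instance supports


def build_query_url(credentials, environment_id, collection_id, **params):
    """Hypothetical helper: assemble the query URL for one collection."""
    base = credentials["url"].rstrip("/")
    path = f"/v1/environments/{environment_id}/collections/{collection_id}/query"
    query = urlencode({"version": API_VERSION, **params})
    return f"{base}{path}?{query}"


credentials = json.loads(CREDENTIALS_JSON)
query_url = build_query_url(credentials, ENVIRONMENT_ID, COLLECTION_ID, count=10)
print(query_url)
```

    Because the base URL comes straight from the credentials, a different region only changes the host part of the result.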

  2. Creating a Watson Knowledge Studio instance

    In this step we will provision the Watson Knowledge Studio instance. The purpose of Watson Knowledge Studio is to build domain-specific models through combined supervised machine learning and rule-based annotations. It provides advanced macro and micro analysis tools to optimize the performance of your custom models.

    Navigate to Catalog -> Platform -> Watson and select the Knowledge Studio Service.

    watson_knowledge_studio

    Once you have provisioned this service successfully, launch Knowledge Studio. To be able to work with Knowledge Studio you have to create a Workspace.

    You have to provide a Workspace name and the language of the documents you will work with. Optionally you can provide a Workspace description, a component configuration and Project Manager(s) for this specific workspace.

    The component configuration lets you choose between a Default tokenizer and a Dictionary-based tokenizer. If you have to cope with a lot of abbreviations, the Dictionary-based tokenizer is helpful because it lets you influence, for example, the sentence segmentation.

    wks_create_workspace

     

    After you have created your Workspace the first step will be to import the documents you want to annotate. 

    You can upload WKS Document Sets in Assets & Tools -> Documents -> Document Sets -> Upload Document Sets.

    wks_upload_documents

    If you try to upload a Document Set now, you will find that there are certain constraints you have to comply with.

    • It must be a CSV file in UTF-8 format with two columns: 1) the document file name, 2) the document body
    • A document should not contain more than 2,000 words, and there is a hard limit of 40,000 bytes
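
    The constraints above can be checked before you upload anything. The following is a minimal sketch, assuming the word and byte limits apply to each document body; the function names are illustrative and not part of any Watson SDK.

```python
import csv
import io

MAX_WORDS = 2000    # recommended upper bound (from the list above)
MAX_BYTES = 40000   # hard limit (from the list above)


def fits_wks_limits(body):
    """Check one document body against the word and byte limits."""
    word_count = len(body.split())
    byte_count = len(body.encode("utf-8"))
    return word_count <= MAX_WORDS and byte_count <= MAX_BYTES


def write_document_set(documents, out_file):
    """Write (name, body) pairs as a two-column CSV; return the names skipped."""
    writer = csv.writer(out_file, quoting=csv.QUOTE_ALL)
    skipped = []
    for name, body in documents:
        if fits_wks_limits(body):
            writer.writerow([name, body])
        else:
            skipped.append(name)
    return skipped


# Usage with an in-memory buffer; for a real file use
# open("wks_document_set.csv", "w", encoding="utf-8", newline="").
docs = [
    ("short.txt", 'A short body with a "quoted" word.'),
    ("long.txt", "word " * 2001),
]
buffer = io.StringIO()
skipped_names = write_document_set(docs, buffer)
print(skipped_names)
```

    Quoting every field keeps commas, quotes and line breaks inside article bodies from breaking the two-column layout.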

    Wouldn’t it be helpful to have a Python sample that you can modify, one that does this job for you and takes the data you have imported into WDS as its basis?

  3. Python Script leveraging Watson Developer Cloud SDK to create the WKS Document Set

    In this step we will work with the Watson Developer Cloud Python SDK, which gives you access to all the Watson services on IBM Cloud through a single SDK.
    https://github.com/watson-developer-cloud/python-sdk

    Follow the instructions there to install the library. Ideally, the following command should be sufficient:
    pip3 install --upgrade watson-developer-cloud

    Verify that the “watson-developer-cloud” package has been installed:
    pip3 list
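
    If you prefer to check from within Python, a small sketch along these lines reports the installed versions. It assumes Python 3.8 or later (for importlib.metadata); the helper name installed_version is illustrative. The requests package is included because the SDK imports it.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


for name in ("watson-developer-cloud", "requests"):
    print(name, installed_version(name) or "NOT INSTALLED")
```

    Run it with the same interpreter you will use for the script (python3), since each interpreter has its own set of installed packages.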

    The following public GitLab project hosted on IBM Cloud gives you access to a Python script that gives you a fast start in creating a WKS document set based on a WDS news collection. Make sure you use the master branch.
    https://git.eu-de.bluemix.net/watmann/wks-document-set-creator/

    Clone the Git project and take a look at the file wks_create_docset.py. Make sure to replace wds_url, wds_username and wds_password with your specific values.

    wds_url = "YOUR_WDS_URL"
    wds_username = "YOUR_WDS_USERNAME"
    wds_password = "YOUR_WDS_PASSWORD"

    Now you are ready to run the script. If everything works, the console will list 10 documents returned from our news collection. You should also find a file “wks_document_set.csv” that represents your WKS document set and a folder “wds_json_docs” that contains the full JSON documents, which can be used for re-ingestion purposes.

    $ python3 wks_create_docset.py
    Matching results: 6617
    Returned documents: 10
    Oscars 2018: What to expect at Sunday’s Academy Awards – CNN
    Malawi consent classes teach children no means no – CNN
    Residents flee as Syrian regime takes control of villages in Eastern Ghouta – CNN
    ‘Security threat’ forces closing of US Embassy in Turkey – CNN
    Ryan Seacrest has uneventful Oscars red carpet, despite misconduct accusation – CNN
    Lacoste temporarily changes logo to raise awareness for endangered species – CNN
    Jordan Peele is first black screenwriter to win best original screenplay – CNN
    Stunning photos of Kyrgyzstan and Tajikistan | CNN Travel
    Daniela Vega becomes Oscars’ first trans presenter – CNN
    Key House races to watch in 2018 – CNN Video
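
    For orientation, here is a rough sketch of the kind of processing that produces output like the above. It is not the script from the repository. It assumes that the query response is a dictionary with a matching_results count and a results list whose entries carry id, title and text fields, and it uses a stand-in client (FakeDiscovery) so that it runs offline.

```python
import csv
import io
import json


def fetch_documents(discovery, environment_id="system", collection_id="news-en",
                    query=None, count=10):
    """Query a collection through any client that exposes query()."""
    response = discovery.query(environment_id, collection_id,
                               query=query, count=count)
    # Newer SDK releases wrap the payload; unwrap it if get_result() exists.
    if hasattr(response, "get_result"):
        response = response.get_result()
    return response


def to_rows(response):
    """Turn query results into (document name, document body) pairs."""
    rows = []
    for doc in response.get("results", []):
        title, text = doc.get("title"), doc.get("text")
        if title and text:
            rows.append((title, text))
    return rows


class FakeDiscovery:
    """Stand-in client so the sketch runs offline; replace with a real client."""

    def query(self, environment_id, collection_id, **kwargs):
        return {
            "matching_results": 2,
            "results": [
                {"id": "a1", "title": "First headline", "text": "Body one."},
                {"id": "b2", "title": "Second headline", "text": "Body two."},
            ],
        }


response = fetch_documents(FakeDiscovery(), query="Oscars")
rows = to_rows(response)
print("Matching results:", response["matching_results"])
print("Returned documents:", len(rows))

# Two-column CSV: document name (the title) and document body.
csv_buffer = io.StringIO()
csv.writer(csv_buffer, quoting=csv.QUOTE_ALL).writerows(rows)

# One JSON string per result, keyed by document id, e.g. for re-ingestion.
json_docs = {doc["id"]: json.dumps(doc) for doc in response["results"]}
```

    With the SDK installed, the stand-in would be replaced by a DiscoveryV1 client created from your wds_url, wds_username and wds_password values; check the SDK documentation for the constructor arguments of the version you have installed.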

  4. Importing the Document Set

    Now that you have created a WKS-compliant document set, it's time to import it. Switch back to WKS to upload the document set.

    wks_doc_upload

     

    Once the upload has completed, you can select the uploaded document set to see the documents. As the Document name we have used the title attribute of our News collection.

     

    wks_documents

  5. Import the KLUE type system in WKS

    The goal here is not to fully explain the annotation process, but to show you an existing (advanced) type system so that you get a glimpse of what one could look like. Your first type systems will certainly look simpler.

    To be able to create annotation sets and annotation tasks you need to have a type system in place. The KLUE type system is an excellent, but also very advanced sample in the news domain that has been used for pre-enriching the News collection. You can download it here:

    https://watson-developer-cloud.github.io/doc-tutorial-downloads/knowledge-studio/en-klue2-types.json

    Then drag and drop it into Assets & Tools -> Entity Types -> Upload.

    wks_import_type_system

     

    If you want to dive deeper into building entity type systems you can learn more here:
    https://console.bluemix.net/docs/services/knowledge-studio/typesystem.html#typesystem

     

     

  6. Create an Annotation Set and Annotation Task

    After you have imported / created your type system you can create an annotation set and based on this create an annotation task.

    Let's start with creating the annotation set based on the document set that you have created in the earlier steps. You can do this in Assets & Tools -> Documents -> Annotation Sets -> Create Annotation Sets.

     

    wks_annotation_set

    In this dialog you define your base document set, an overlap percentage in case of multiple human annotators and the name of the annotation set to be created. 
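
    To illustrate what an overlap percentage means in practice, the following sketch shares a fraction of the documents with every annotator and deals out the rest. This is illustrative arithmetic only, under the assumption that the overlap is the share of documents that every annotator receives; it does not claim to reproduce how WKS actually picks the documents.

```python
def split_with_overlap(doc_names, annotators, overlap_percent):
    """Share overlap_percent of the documents with everyone; deal out the rest."""
    shared_count = round(len(doc_names) * overlap_percent / 100)
    shared, rest = doc_names[:shared_count], doc_names[shared_count:]
    sets = {name: list(shared) for name in annotators}
    for index, doc in enumerate(rest):
        # Round-robin assignment of the non-shared documents.
        sets[annotators[index % len(annotators)]].append(doc)
    return sets


docs = [f"doc{i}" for i in range(10)]
sets = split_with_overlap(docs, ["anna", "ben"], overlap_percent=20)
print({name: len(assigned) for name, assigned in sets.items()})
```

    The shared documents are the ones on which the work of several human annotators can later be compared.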

     

    After that you create an annotation task in Assets & Tools -> Documents -> Tasks based on the annotation set you have created.

    wks_annotation_task_created

     

    Now there is only one step left: select the Annotate action to open the so-called Ground Truth Editor. Here you can annotate your document sets with mentions, relations and coreferences based on the KLUE type system you have imported.

    wks_annotation_ground_truth_editor

     

     

    Details on how this can be accomplished can be found in the links in step 7. 

  7. Where to go from here?

    To dive deeper into the next steps of training, optimizing and publishing a machine learning annotator, the IBM Cloud WKS docs are a good reference:
    https://console.bluemix.net/docs/services/knowledge-studio/ml-annotator.html#ml_annotator

    Once trained, your custom machine learning annotator is published to a WDS instance. You then create a new WDS collection and configure it to use the WKS annotator, so that entity and relationship extraction runs each time you ingest documents into this new collection.

    Another good place to look is the IBM Cloud Garage Architecture Center, in particular the reference architecture on Cognitive Discovery:
    https://www.ibm.com/cloud/garage/architectures/cognitiveDiscoveryDomain/reference-architecture

     

    I hope this recipe has been useful to you to prepare a WKS corpus for document annotation and you learned something new along the way.

3 comments on "Create a Watson Knowledge Studio (WKS) corpus from a Watson Discovery Service (WDS) collection"

  1. GerdWatmann March 25, 2018

    If you have any feedback, input or questions for this recipe please let me know.

  2. Hi Gerd, I could not run the script to generate 10 documents as described, got the following error:

    Traceback (most recent call last):
      File "wks_create_docset.py", line 6, in <module>
        from watson_developer_cloud import DiscoveryV1, WatsonApiException
      File "build/bdist.macosx-10.13-intel/egg/watson_developer_cloud/__init__.py", line 16, in <module>
      File "build/bdist.macosx-10.13-intel/egg/watson_developer_cloud/watson_service.py", line 18, in <module>
    ImportError: No module named requests

    I have tried to run it from DSX, but it did not work either.

    Could you help? thanks!

    • GerdWatmann April 25, 2018

      Hi Mai – thanks for going through my article and using the Python script! I updated the article; please make sure you use “pip3 install --upgrade watson-developer-cloud” (to make sure you use the latest package) and “pip3 list” to verify whether the watson developer cloud package was installed successfully -> currently this is watson-developer-cloud (1.3.3). Then run “python3 wks_create_docset.py” for the execution. Let me know if that helps; if not, we will dive deeper.
