The Watson™ Discovery Service can analyze and enrich your data, and it lets you query that data using a cognitive search engine. The code pattern “Create a cognitive Node.js web app to import, enrich, and explore data using the Watson Discovery Service” demonstrates how you can build an application using the Watson Discovery Service with your own data set. You’ll learn how to prepare a data set of your own so you can import and enrich it using the Watson Discovery Service, and then build a search engine using that data set.

Example use case:

  • You have a large data set, with structured fields and free-form text fields.
  • You want to build an application to enable people to find patterns in that data set.
  • Users need to be able to filter the data by the existing fields.
  • They should also be able to filter on fields added by Watson Discovery Service enrichments.

Preparing the data set

For the code pattern demonstration, we used a public data set that lists cyber security breaches between 2004 and 2017. The data is available as a spreadsheet on Google Docs. You can download the data set as comma-separated values (CSV) by clicking File > Download as and then selecting the .csv file format.

[Screenshot: downloading the spreadsheet as a .csv file from the File menu]

The Watson Discovery Service can import documents in various formats, including PDFs, Word documents, HTML files, and JSON files. CSV files are not accepted by the service, so we created a script to convert the data from CSV to JSON format. The script creates one JSON file for each row in the CSV file. You can find the resulting JSON files in the data directory of the repository. Here’s an example JSON file:

{
    "title": "Netflix Twitter account",
    "alternative_name": "",
    "text": "Dec. 'OurMine' hacked Netflix's Twitter account & sent out mocking tweets.",
    "year": "2017",
    "organisation": "web",
    "method_of_leak": "hacked",
    "no_of_records_stolen": 1,
    "data_sensitivity": 1,
    "source_link": "http://www.reuters.com/article/us-netflix-cyber-idUSKBN14A1GR",
    "source_name": "Reuters"
}

The structure of the CSV file has been preserved, but some of the fields have been renamed. For example, the story field in the CSV file is called text in the JSON file. When you import a JSON document into the Watson Discovery Service, it automatically applies enrichments to the field called text. You can apply enrichments to other fields by creating a custom configuration for your data collection, and we’ll discuss how to do this later. For now, it makes sense to massage our data so that it can use the default configuration.
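The conversion script itself lives in the repository, so the following is only a minimal sketch of the idea, not the script the code pattern uses. It parses CSV text with a small hand-written parser, renames story to text, and returns one object per row. The three-column sample is a cut-down, hypothetical version of the real spreadsheet.

```javascript
// Sketch: turn CSV text into one JSON document per row.
// Only the story -> text rename comes from the article; the sample
// columns below are a hypothetical subset of the real spreadsheet.

// Parse CSV text into an array of rows, honouring double-quoted fields.
function parseCsv(csvText) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < csvText.length; i++) {
    const ch = csvText[i];
    if (inQuotes) {
      if (ch === '"' && csvText[i + 1] === '"') {
        field += '"'; // a doubled quote is an escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n') {
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else if (ch !== '\r') {
      field += ch;
    }
  }
  // Flush the last row when the text does not end with a newline.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}

// Columns to rename so the default configuration enriches the right field.
const RENAMES = { story: 'text' };

// Use the first row as the header and build one object per remaining row.
function csvToDocuments(csvText) {
  const [header, ...records] = parseCsv(csvText);
  return records.map((values) => {
    const doc = {};
    header.forEach((name, i) => {
      const key = RENAMES[name] || name;
      doc[key] = values[i] !== undefined ? values[i] : '';
    });
    return doc;
  });
}

const sampleCsv = [
  'title,story,year',
  '"Netflix Twitter account","Dec. \'OurMine\' hacked Netflix\'s Twitter account & sent out mocking tweets.",2017',
].join('\n');

const documents = csvToDocuments(sampleCsv);
console.log(JSON.stringify(documents[0], null, 4));
```

To produce one file per row, you could loop over the returned array and write each object with fs.writeFileSync. The file naming scheme is left open here because the article does not describe it.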

The README file for the code pattern includes detailed instructions on how to create a Watson Discovery Service of your own. After you’ve created your own collection, uploading the data into the collection couldn’t be easier. You just drag the files from your file system and drop them onto the uploader widget:

[Screenshot: dragging files onto the uploader widget]

It might take a few minutes for the files to upload, and for the Watson Discovery Service to perform its enrichments on the data set.

Querying the data set

Now that you’ve got your data into a collection, you can start having fun by running queries against your data set. The tooling for the Watson Discovery Service provides some utilities to help you with building queries. Click the Query this collection button and you’ll see an overview showing some insights into your data.

[Screenshot: overview of insights into the collection]

Now click the Build your own query button to bring up a query building form. This interface enables you to compose a query by specifying keywords, filters, and aggregations. To begin with, just leave all fields blank and click the Run query button. In the panel on the right, you should now see the results from running this query.

[Screenshot: query builder with the results in the right panel]

You can use this interface to explore the data in the collection. Click the disclosure icons to expand and contract fields. You should find that each result corresponds to one of the JSON files that you uploaded earlier. All of the original fields from the JSON file are there, and you’ll find some additional fields too: id, score, extracted_metadata, and enriched_text.

If you drill down into the enriched_text field, you’ll find fields such as entities, docSentiment, taxonomy, and so on. When your data was imported to the Watson Discovery Service, enrichments were applied to the text field, and these fields are the result of that process.
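If you later read results in code rather than in the tooling, the same drill-down might look like the sketch below. It assumes each entity carries a text and a type, and that docSentiment carries a type, as the filters later in this article suggest. The sample result is hand-written for illustration, and summarizeEnrichments is a hypothetical helper, not part of any Watson library.

```javascript
// Sketch: pull a few enrichment values out of a single query result.
// The sample result is hand-written; real results contain many more
// fields, and their exact shape may differ.
function summarizeEnrichments(result) {
  const enriched = result.enriched_text || {};
  const entities = (enriched.entities || []).map((entity) => ({
    text: entity.text,
    type: entity.type,
  }));
  // docSentiment may be missing if that enrichment was not applied.
  const sentiment = enriched.docSentiment ? enriched.docSentiment.type : null;
  return { title: result.title, entities, sentiment };
}

const sampleResult = {
  id: 'hypothetical-id',
  title: 'Netflix Twitter account',
  text: "Dec. 'OurMine' hacked Netflix's Twitter account & sent out mocking tweets.",
  enriched_text: {
    entities: [
      { text: 'Netflix', type: 'company' },
      { text: 'Twitter', type: 'company' },
    ],
    docSentiment: { type: 'negative' },
  },
};

const summary = summarizeEnrichments(sampleResult);
console.log(summary);
```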

Filtering by original fields

Take another look at the fields in the original JSON files that you uploaded to your collection. It might be useful to filter the data set on fields such as year, no_of_records_stolen, organisation, and method_of_leak.

For example, let’s filter the collection to show all of the records where method_of_leak is hacked. In the query builder, enter the following in the Narrow your results (filter) field:

method_of_leak:hacked

Then click the Run query button. In the right column, you should now see the result list filtered:

[Screenshot: results filtered by method_of_leak:hacked]

Try modifying that query to show all the records where the method_of_leak is 'accidentally published'. Now try modifying it to show results where year is 2017, or where organisation is 'healthcare'. You can apply more than one filter at a time by separating your terms with a comma. For example, to show all hacks that affected healthcare organisations, use this:

organisation:'healthcare',method_of_leak:'hacked'

The Watson Discovery Service makes it easy to query your data set by the fields that were present in the original documents.

Filtering by generated fields

Take another look at the fields that come under enriched_text. These fields weren’t there in the original data set. They were generated by the enrichment process. You can just as easily run queries against these generated fields.

For example, let’s filter the collection to show all of the records where 'bank account' is mentioned in the text field. In the query builder, enter the following in the Narrow your results (filter) field:

enriched_text.entities.text:"bank account"

Click the Run query button. In the right column, you should now see the result list filtered:

[Screenshot: results filtered by enriched_text.entities.text]

In the same way, you could adapt this query to filter on other fields within the enriched_text field. Try modifying that filter to only show results where the docSentiment has type:'positive'. As you can see, applying filters on generated fields is just as easy as filtering on fields from the original data set.
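When an application assembles these filters from user input, a small helper can keep the syntax consistent. The sketch below joins field:value pairs with commas, as in the examples above, and works the same way for original fields and for generated fields under enriched_text. The buildFilter name is hypothetical, values are wrapped in double quotes as in the "bank account" example, and the helper makes no attempt to escape quotes or other special characters inside a value.

```javascript
// Sketch: build a filter string from field/value pairs.
// Assumption: every value is a plain string without embedded quotes.
function buildFilter(conditions) {
  return Object.entries(conditions)
    .map(([field, value]) => `${field}:"${value}"`)
    .join(',');
}

// Filter on two original fields.
const originalFields = buildFilter({
  organisation: 'healthcare',
  method_of_leak: 'hacked',
});
console.log(originalFields);

// Combine an original field with a generated one.
const mixedFields = buildFilter({
  method_of_leak: 'hacked',
  'enriched_text.entities.text': 'bank account',
});
console.log(mixedFields);
```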

Using queries in your own application

The API documentation describes how to run a query against your own collection programmatically. For example, to show results where method_of_leak:hacked using curl, you could run:

curl -u "{username}":"{password}" \
  "https://gateway.watsonplatform.net/discovery/api/v1/environments/{environment_id}/collections/{collection_id}/query?version=2016-12-01&filter=method_of_leak:hacked&return=text"

If you replace the {username}, {password}, {environment_id}, and {collection_id} placeholders with the appropriate keys for your Watson Discovery Service, you should see the same results as when you ran that query through the tooling.
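The same request can be assembled in Node.js without any client library. This is a minimal sketch that mirrors the curl command above using only built-in modules. It is not how the code pattern app does it, and buildQueryRequest is a hypothetical helper name. The sketch only builds the URL and the Basic authentication header, so it runs without contacting the service.

```javascript
// Sketch: build the query URL and auth header used by the curl example.
// The endpoint, version, and parameter names are copied from that command.
function buildQueryRequest({ username, password, environmentId, collectionId, filter }) {
  const url = new URL(
    'https://gateway.watsonplatform.net/discovery/api/v1/environments/' +
      encodeURIComponent(environmentId) +
      '/collections/' +
      encodeURIComponent(collectionId) +
      '/query'
  );
  url.searchParams.set('version', '2016-12-01');
  url.searchParams.set('filter', filter); // percent-encoded automatically
  url.searchParams.set('return', 'text');

  // curl -u sends the credentials as an HTTP Basic Authorization header.
  const token = Buffer.from(`${username}:${password}`).toString('base64');
  return { url: url.toString(), headers: { Authorization: `Basic ${token}` } };
}

// Placeholder values; substitute the keys for your own service.
const request = buildQueryRequest({
  username: 'my-username',
  password: 'my-password',
  environmentId: 'my-environment-id',
  collectionId: 'my-collection-id',
  filter: 'method_of_leak:hacked',
});
console.log(request.url);
```

You could pass the returned url and headers to whichever HTTP client you prefer, for example the global fetch function available in recent Node.js releases.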

The code pattern app uses the Node client to connect to the Watson Discovery Service and run queries against it. You can use it as a reference to help you build an application that queries your own data set.

Applying enrichments to your own data

When you upload a JSON file to your Watson Discovery collection using the default configuration, it applies enrichments to the text field. In the resulting data set, you can find those enrichments under the enriched_text key. If your data has other fields that you’d like to apply enrichments to, you can create a custom configuration.

On the Your data page that summarizes the status and API information for your collection, you should see a Configuration section. Click the Switch link, and then Create a new configuration. Give your custom configuration a name and then click Create.

You can upload sample documents and use them to test your configuration. For example, if you upload the 001.json file and apply the default configuration to it, you should see something like this:

[Screenshot: preview of 001.json with the default configuration applied]

Notice how the preview in the right panel contains an enriched_text field with all of the specified enrichments. If you wanted to extract entities from the title field, you could set up your configuration like this:

[Screenshot: configuration that extracts entities from the title field]

This time in the preview panel there’s an additional enriched_title field. With the title of “Netflix Twitter account” in the sample document, the Netflix and Twitter entities have been extracted and labelled with type:'company'. You can tweak your configuration to apply whichever enrichments you need to each of the appropriate fields from your data.

Summary

I hope these steps will help you get started with your own data sets and the Watson Discovery Service. Leave a comment below and let me know how it goes!