
Find, prepare, and understand data with Watson Knowledge Catalog

This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.

This tutorial demonstrates how to solve the problems of enterprise data governance using the IBM Watson® Knowledge Catalog on the IBM Cloud Pak® for Data platform. We’ll explain how to use governance, data quality, and active policy management to help you protect and govern sensitive data, trace data lineage, and manage data lakes. This knowledge can help you quickly discover, curate, categorize, and share data assets, data sets, analytical models, and their relationships with others in your organization.

Learning objectives

In this tutorial, you will learn how to:

  1. Set up the catalog and data
  2. Add collaborators and control access
  3. Add categories
  4. Add data classes
  5. Add business terms
  6. Add rules for policies

Prerequisites

Estimated time

Completing this tutorial should take about 30-45 minutes.

Steps

NOTE: The default catalog is your enterprise catalog. It is created automatically after you install the Watson Knowledge Catalog service and is the only catalog to which advanced data curation tools apply. The default catalog is governed so that data protection rules are enforced. The information assets view shows additional properties of the assets in the default catalog to aid curation. Any subsequent catalogs that you create can be governed or ungoverned, do not have an information assets view, and supply basic data curation tools.

Step 1. Set up catalog and data

Create the catalog

If you haven’t yet started IBM Watson Knowledge Catalog, you’ll need to provision it. Open IBM Watson Knowledge Catalog by clicking the Services icon at the top right of the home page.


Under the Data Governance section, click the Watson Knowledge Catalog tile.


Follow the instructions to deploy IBM Watson Knowledge Catalog.

Open IBM Watson Knowledge Catalog

  1. Click Open in the top-right corner to launch.

  2. Go to the upper-left hamburger (☰) menu and choose Organize > All catalogs.

  3. From the Your catalogs page, click either Create catalog or New Catalog.

  4. Give your catalog a name (TelcoDataCatalog, for example) and an optional description, check Enforce data protection rules, and click Create.

  5. Click OK on the pop-up that appeared when you checked the box in the previous step.

Option 1: Add data assets

  1. Download the Telco-Customer-Churn.csv file. Under the Browse Assets tab, below the message “Now you can add assets,” click “here” to add your data.

  2. Alternatively, you can click Add to catalog + in the top right and, for example, choose Local files.

  3. Browse to the location where you downloaded the Telco-Customer-Churn.csv file and double-click it or click Open. Add an optional description and click Add.

NOTE: Stay in the catalog until loading is complete. If you leave the catalog, the incomplete asset will be deleted.

The newly added Telco-Customer-Churn.csv file will show up under the Browse Assets tab of your catalog.
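Before uploading, it can help to confirm locally that the file contains the columns used later in this tutorial. The following is a minimal sketch using Python's standard csv module. It is not part of the Watson Knowledge Catalog workflow. The function name and the sample row are made up for illustration; only the column names CustomerID, MonthlyCharges, and TotalCharges come from this tutorial.

```python
import csv
import io

# Columns this tutorial refers to later (data class, business terms, rules).
EXPECTED_COLUMNS = {"CustomerID", "MonthlyCharges", "TotalCharges"}


def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    return sorted(set(expected) - set(header))


# Hypothetical one-row sample standing in for Telco-Customer-Churn.csv.
sample = io.StringIO(
    "CustomerID,MonthlyCharges,TotalCharges\n"
    "0001-ABCDE,29.85,29.85\n"
)
print(missing_columns(sample))  # an empty list means nothing is missing
```

To check the real file, open it with open("Telco-Customer-Churn.csv", newline="") and pass the file object to missing_columns.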


Option 2: Add connection

  1. You can add a connection to a remote database (Db2 Warehouse in IBM Cloud, for example) by choosing Add to catalog + > Connection.

  2. Click the tile for your remote database.

  3. Enter the connection details and click Test. When it returns a Success message, click Create.

The connection now shows up in the catalog.


Option 3: Add virtualized data

NOTE: Virtualized data can be added to the default catalog by someone with admin or editor access to that catalog.

  1. Go to the upper-left hamburger (☰) menu and choose Organize > All catalogs, then click Add to Catalog + > Connected asset.

  2. Click Source > Select source. Browse under DV to your schema, choose the table you wish to add, and click Select.

A user can now add this to a project like any other asset from a catalog.

Step 2. Add collaborators and control access

  1. Under the Access Control tab, you can click Add Collaborator to give other users access to your catalog.

  2. Search for a user, click the name to select them, choose a role for the user (Admin, Editor, or Viewer), and click Add.

  3. To access data in the catalog, click on the name of the data.

  4. A preview of the data will open, with metadata and the first few rows.

  5. You can click the Review tab and rate the data, as well as comment on it, to provide feedback for your teammates.
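The role assignment above can be pictured as a simple lookup from user to role. The sketch below is only a mental model, not the product's access-control implementation: the permission set is an assumption based on the earlier note that admins and editors can add virtualized data, and the user names are hypothetical.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "Admin"
    EDITOR = "Editor"
    VIEWER = "Viewer"


# Assumption for illustration: only these roles may add assets to a catalog.
ROLES_THAT_CAN_ADD_ASSETS = {Role.ADMIN, Role.EDITOR}


def can_add_asset(collaborators, user):
    """Return True if the user is a collaborator whose role may add assets."""
    # Users who are not collaborators map to None, which is never in the set.
    return collaborators.get(user) in ROLES_THAT_CAN_ADD_ASSETS


# Hypothetical collaborators on TelcoDataCatalog.
collaborators = {"alice": Role.ADMIN, "bob": Role.VIEWER}
print(can_add_asset(collaborators, "alice"))
print(can_add_asset(collaborators, "bob"))
```

The actual permissions attached to each role are defined by the platform, so treat the set above as a placeholder.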

Step 3. Add categories

The fundamental abstraction in IBM Watson Knowledge Catalog is the category. A category is analogous to a folder.

Add a category for your assets by going to the upper-left hamburger (☰) menu and choosing Organize > Data and AI Governance > Categories.


You can import them in .csv format (option 1), or you can add categories manually (option 2).

Option 1: Import categories

Download the glossary-organize-categories.csv file. This file contains the categories data that we will be importing.

  1. Click Import.

  2. Click Add file and navigate to where you downloaded the glossary-organize-categories.csv file, select it, and click Next.

  3. Under the Select merge option, choose Replace all values and click Import.

You will see “The import completed successfully” when it is completed. Click Close.


In this way, you can import categories, business terms, classifications, policies, etc. to populate your governance catalogs.

Option 2: Add category manually

  1. Click Create category.

  2. Give your category a name, such as Billing, and an optional description, then click Save.

  3. Now, if you choose Create category again on the Billing category screen, you can create a subcategory, such as Total Charges.

  4. For the Billing category, you can select a type, such as Business term.

  5. You can create classifications for assets, such as Confidential, Personally Identifiable Information, or Sensitive Personal Information, in a similar way by going to the upper-left hamburger (☰) menu and choosing Organize > Data and AI Governance > Classifications.

  6. Click the New classification drop-down and select Create new classification. These classifications can then be added to your category as a type.
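Since a category is analogous to a folder, the Billing category and its Total Charges subcategory can be modeled as a small tree. This is a conceptual sketch only. The Category class and its methods are hypothetical and do not reflect how Watson Knowledge Catalog stores categories.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    name: str
    description: str = ""
    subcategories: list = field(default_factory=list)

    def add_subcategory(self, name, description=""):
        """Create a child category and return it."""
        child = Category(name, description)
        self.subcategories.append(child)
        return child

    def paths(self, prefix=""):
        """Yield the folder-like path of this category and all descendants."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.subcategories:
            yield from child.paths(path)


billing = Category("Billing")
billing.add_subcategory("Total Charges")
print(list(billing.paths()))  # ['Billing', 'Billing/Total Charges']
```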

Step 4. Add data classes

When you profile your assets, a data class will be inferred from the contents where possible. You can also add your own data classes.

  1. Add a data class for your assets by going to the upper-left hamburger (☰) menu, choosing Organize > Data and AI Governance > Data class, then clicking New data class > Create new data class.

  2. Give your new data class a name (such as alphanumeric) and an optional primary category, description, or both, then click Save as draft.

  3. Once the data class is created, you can add stewards for this class and associate classifications and business terms. When you are ready, click Publish.

Now let’s add that data class to a column in our Telco-Customer-Churn.csv asset.

  1. Go back to the catalog you created (the instructions suggested naming it TelcoDataCatalog) and open it to the column view by clicking on the hamburger (☰) menu, then Organize > All catalogs > TelcoDataCatalog.

  2. Under the Browse assets tab, click on the data set Telco-Customer-Churn.csv to get the column/row preview.

  3. Scroll right to get to the CustomerID column and click the down arrow next to Customer Number, then click View all.

  4. In the window that opens, search for your newly created data class (alphanumeric), click it when it returns in the search, then click Select.
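To give a feel for what inferring a data class from column contents could look like, here is a minimal sketch that tests values against a pattern. It is an assumption-laden illustration, not the profiling logic the platform uses: the regular expression, the 0.8 threshold, and the function name are all hypothetical, and the sample values are illustrative.

```python
import re

# Hypothetical pattern for a custom "alphanumeric" data class:
# letters, digits, and hyphens only, with at least one letter and one digit.
ALPHANUMERIC = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d-]+$")


def matches_data_class(values, pattern=ALPHANUMERIC, threshold=0.8):
    """Return True if at least `threshold` of the non-empty values match."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return False
    hits = sum(1 for v in non_empty if pattern.match(v))
    return hits / len(non_empty) >= threshold


# Illustrative values for an ID-like column and a numeric column.
print(matches_data_class(["7590-VHVEG", "5575-GNVDE"]))
print(matches_data_class(["29.85", "56.95"]))
```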

Step 5. Add business terms

You can use Business terms to standardize definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. You already saw how to create a category and make it a business term. You can also create the business term as its own entity.

  1. From the upper-left hamburger (☰) menu, choose Organize > Data and AI Governance > Business terms.

  2. Click on the upper-right New business term drop-down and click the Create new business term button.

  3. Give the new business term a name, such as Billing, add an optional description, and click Save as draft.

  4. A window will come up once the term is created. You can see a rich set of options for creating related terms and adding other metadata. Click Publish to make this term available to users of the platform.

  5. Add an optional comment and click Publish in the new window.

  6. Now go back to your catalog (the instructions suggested naming it TelcoDataCatalog) and open it to the column view: from the hamburger (☰) menu, choose Organize > All catalogs > TelcoDataCatalog. Under the Browse assets tab, click on the data set Telco-Customer-Churn.csv to get the column/row preview. Scroll right to get to the TotalCharges column and click the Column information icon (looks like an eye).

  7. In the window that opens, click the edit icon (looks like a pencil) next to Business terms.

  8. Enter Billing (the name you gave the business term) under Business terms to search for it. Click the Billing term that is found and click Apply.

  9. Close that window once the term has been applied.

  10. Now, do the same thing to add the Billing business term to the MonthlyCharges column. You will now be able to search for these terms from within the platform. For example, going back to your top-level TelcoDataCatalog, in the search bar with the comment “What assets are you searching for?” enter your unique Billing term.

The Telco-Customer-Churn.csv data set will show up, since it contains columns tagged with the Billing business term.
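The search behavior described here, where an asset is found because some of its columns carry a business term, can be sketched with a plain dictionary. This is a hypothetical in-memory model for illustration, not the catalog's search API. The structure and the function name are assumptions.

```python
# Hypothetical model: asset name -> column name -> set of business terms.
catalog = {
    "Telco-Customer-Churn.csv": {
        "CustomerID": set(),
        "MonthlyCharges": {"Billing"},
        "TotalCharges": {"Billing"},
    },
}


def assets_with_term(catalog, term):
    """Return sorted names of assets with at least one column tagged with term."""
    return sorted(
        asset
        for asset, columns in catalog.items()
        if any(term in terms for terms in columns.values())
    )


print(assets_with_term(catalog, "Billing"))  # ['Telco-Customer-Churn.csv']
```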

Step 6. Add rules for policies

We can now create rules to control how a user can access data. Create a business term called CustomerID and assign it to your CustomerID column in the data set using the instructions above. See below if you need details, but try it yourself first, and skip to Adding a rule below if you do not need a reminder.

How to create a business term review

  1. From the upper-left hamburger (☰) menu, choose Organize > Data and AI Governance > Business terms.
  2. Click on the upper-right New business term drop-down and click the Create new business term button.
  3. Give the new Business term the name CustomerID, add an optional description, then click Save as draft. In the next window, click Publish. Provide an optional comment in the pop-up and click Publish.
  4. Now go back to your TelcoDataCatalog and open it to the column view: from the hamburger (☰) menu, choose Organize > All catalogs, then TelcoDataCatalog. Under the Browse assets tab, click on the data set Telco-Customer-Churn.csv to get the column/row preview. Scroll right to get to the CustomerID column and click the Column information icon (looks like an eye).
  5. In the window that opens, click the edit icon (looks like a pencil) next to Business terms.
  6. Enter CustomerID under Business terms, and the term will be searched for. Click on the CustomerID term that is found and click Apply.

Adding a rule

  1. From the upper-left hamburger (☰) menu, choose Organize > Data and AI Governance > Rules.

  2. Click on the New rule drop-down and select Create new rule.

  3. Choose Data protection rule as the type of rule to create.

  4. Under Details, give your rule a name, type, access, and a business definition.

  5. Under Rule builder, fill out Condition1 as: if Business term contains any CustomerID. For the action, choose to mask data in columns containing alphanumeric. Choose the tile for Substitute, which will make a non-identifiable hash. This obscures the actual CustomerID, but allows actions like database joins to still work. Click Create.

Now if we go back to our Telco-Customer-Churn.csv asset in the catalog in the CustomerID column, it will look the same as before, but a non-admin user will see the lock icon and see that the CustomerID has now been substituted with a hash value.
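The reason a substituted value still supports joins is that the same input always maps to the same token. The sketch below illustrates that idea with a keyed hash from Python's standard library. It is an assumption about the general technique, not the algorithm Watson Knowledge Catalog uses. The key, the token length, the function names, and the sample row are all hypothetical.

```python
import hashlib
import hmac

# Assumption: a secret key held by the governance layer, never shown to users.
SECRET_KEY = b"example-key-for-illustration-only"


def substitute(value, key=SECRET_KEY, length=16):
    """Replace a value with a deterministic token that hides the original."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def apply_rule(row, column_terms, is_admin):
    """Mask columns tagged with the CustomerID business term for non-admins."""
    if is_admin:
        return dict(row)
    return {
        col: substitute(val) if "CustomerID" in column_terms.get(col, set()) else val
        for col, val in row.items()
    }


# Hypothetical row and column tagging.
row = {"CustomerID": "7590-VHVEG", "TotalCharges": "29.85"}
column_terms = {"CustomerID": {"CustomerID"}}
print(apply_rule(row, column_terms, is_admin=False))
```

Because substitute is deterministic for a given key, two tables masked the same way can still be joined on the masked column.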

To add a rule to obfuscate data, go to the Profile tab and scroll to the TotalCharges column. You can see that the column has been inferred to be classified as a Quantity.

Here is where you could change the classification if the inferred one was not what you wanted.

  1. You can build a rule to obfuscate this TotalCharges column.

  2. The data in that column is now replaced with similarly formatted data.
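One way to picture "similarly formatted data" is a replacement that keeps the shape of each value while changing its characters. The following is a minimal sketch of that idea under stated assumptions: the obfuscate function is hypothetical, the input value is illustrative, and the platform's actual obfuscation method may work differently.

```python
import random


def obfuscate(value, rng):
    """Replace each digit and letter with a random one of the same kind."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice("0123456789"))
        elif ch.isalpha():
            pool = (
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                if ch.isupper()
                else "abcdefghijklmnopqrstuvwxyz"
            )
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as "." or "-"
    return "".join(out)


rng = random.Random()
print(obfuscate("1889.50", rng))  # same length, decimal point in the same place
```

Unlike the substitution sketch, this replacement is random, so the obfuscated values are not suitable as join keys.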

Summary

In this tutorial, you have learned a few of the powerful tools available for working with data on the IBM Cloud Pak for Data platform. With IBM Watson Knowledge Catalog, team members can work together in their individual roles to bring data and AI to the enterprise.

This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path. To continue the series and learn more about IBM Cloud Pak for Data, take a look at the next pattern, Data analysis, model building, and deploying with Watson Machine Learning with notebook, or at one of the next tutorials: Automate model building with AutoAI, or Build a predictive machine learning model quickly and easily with IBM SPSS Modeler.