This tutorial is part of the Getting started with IBM Cloud Pak for Data learning path.
|Level|Topic|Type|
|---|---|---|
|100|Introduction to IBM Cloud Pak for Data|Article|
|101|Data Virtualization on IBM Cloud Pak for Data|Tutorial|
|201|Data visualization with Data Refinery|Tutorial|
|202|Find, prepare, and understand data with Watson Knowledge Catalog|Tutorial|
|301A|Data analysis, model building, and deploying with Watson Machine Learning with notebook|Pattern|
|301B|Automate model building with AutoAI|Tutorial|
|301C|Build a predictive machine learning model quickly and easily with IBM SPSS Modeler|Tutorial|
|401|Monitor the model with Watson OpenScale|Pattern|
This tutorial demonstrates how to solve the problems of enterprise data governance using the IBM Watson® Knowledge Catalog on the IBM Cloud Pak® for Data platform. We’ll explain how to use governance, data quality, and active policy management to help you protect and govern sensitive data, trace data lineage, and manage data lakes. This knowledge can help you quickly discover, curate, categorize, and share data assets, data sets, analytical models, and their relationships with others in your organization.
In this tutorial, you will learn how to:
- Set up the catalog and data
- Add collaborators and control access
- Add categories
- Add data classes
- Add business terms
- Add rules for policies
Completing this tutorial should take about 30-45 minutes.
NOTE: The default catalog is your enterprise catalog. It is created automatically after you install the Watson Knowledge Catalog service and is the only catalog to which advanced data curation tools apply. The default catalog is governed so that data protection rules are enforced. The information assets view shows additional properties of the assets in the default catalog to aid curation. Any subsequent catalogs that you create can be governed or ungoverned, do not have an information assets view, and supply basic data curation tools.
Step 1. Set up catalog and data
Create the catalog
If you haven’t yet started IBM Watson Knowledge Catalog, you’ll need to provision it. Open IBM Watson Knowledge Catalog by clicking the Services icon at the top right of the home page.
Under the Data Governance section, click the Watson Knowledge Catalog tile.
Follow the instructions to deploy IBM Watson Knowledge Catalog.
Open IBM Watson Knowledge Catalog
Click Open in the top-right corner to launch.
Go to the upper-left hamburger (☰) menu and choose Organize > All catalogs.
From the Your catalogs page, click either Create catalog or New Catalog.
Give your catalog a name (TelcoDataCatalog, for example) and an optional description, check Enforce data protection rules, and click Create.
Click OK in the pop-up that appears when you check the Enforce data protection rules checkbox.
Option 1: Add data assets
Download the Telco-Customer-Churn.csv file. Under the Browse Assets tab, below the text “Now you can add assets,” click the “here” link to add your data.
Alternatively, you can click Add to catalog + in the top right and, for example, choose Local files.
Browse to the location where you downloaded the Telco-Customer-Churn.csv file and double-click or click Open. Add an optional description and click Add.
NOTE: Stay in the catalog until loading is complete. If you leave the catalog, the incomplete asset will be deleted.
The newly added Telco-Customer-Churn.csv file will show up under the Browse Assets tab of your catalog.
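Before uploading, it can help to confirm locally that the file contains the columns this tutorial refers to in later steps (CustomerID, MonthlyCharges, and TotalCharges). The following is a minimal sketch using only the Python standard library. The `missing_columns` helper is a hypothetical name, the sample rows are made up for illustration, and the comparison ignores case on the assumption that header capitalization in the downloaded file may differ from the names used here.

```python
import csv
import io

# Columns this tutorial refers to in later steps.
REQUIRED_COLUMNS = ["CustomerID", "MonthlyCharges", "TotalCharges"]


def missing_columns(csv_file, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the CSV header.

    The comparison ignores case and surrounding whitespace.
    """
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip().lower() for name in header}
    return [name for name in required if name.lower() not in present]


# Made-up sample rows for illustration only; not taken from the real data set.
sample = io.StringIO(
    "customerID,MonthlyCharges,TotalCharges,Churn\n"
    "0001-AAAAA,29.85,29.85,No\n"
)
print(missing_columns(sample))  # an empty list means nothing is missing

# To check the downloaded file instead:
# with open("Telco-Customer-Churn.csv", newline="") as f:
#     print(missing_columns(f))
```

An empty result means every column used later in the tutorial is present in the file.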
Option 2: Add connection
You can add a connection to a remote database (Db2 Warehouse in IBM Cloud, for example) by choosing Add to catalog + > Connection.
Find your remote database in the list and click it.
Enter the connection details and click Test. When it returns a Success message, click Create.
The connection now shows up in the catalog.
Option 3: Add virtualized data
NOTE: Virtualized data can be added to the default catalog by someone with admin or editor access to that catalog.
Go to the upper-left hamburger (☰) menu and choose Organize > All catalogs, then click Add to Catalog + > Connected asset.
Click Source > Select source. Under DV, browse to your schema, choose the table you want to add, and click Select.
A user can now add this to a project like any other asset from a catalog.
Step 2. Add collaborators and control access
Under the Access Control tab, you can click Add Collaborator to give other users access to your catalog.
Search for a user, click the name to select them, choose a role for the user (Admin, Editor, or Viewer), and click Add.
To access data in the catalog, click on the name of the data.
A preview of the data will open, with metadata and the first few rows.
You can click the Review tab and rate the data, as well as comment on it, to provide feedback for your teammates.
Step 3. Add categories
The fundamental abstraction in IBM Watson Knowledge Catalog is the category. A category is analogous to a folder.
Add a category for your assets by going to the upper-left hamburger (☰) menu and choosing Organize > Data and AI Governance > Categories.
You can import categories from a .csv file (option 1) or add them manually (option 2).
Option 1: Import categories
Download the glossary-organize-categories.csv file. This file contains the categories data that we will be importing.
Click Add file and navigate to where you downloaded the glossary-organize-categories.csv file, select it, and click Next.
Under the Select merge option, choose Replace all values and click Import.
When the import finishes, you will see the message “The import completed successfully.” Click Close.
In this way, you can import categories, business terms, classifications, policies, etc. to populate your governance catalogs.
Option 2: Add category manually
Click Create category.
Give your category a name, such as Billing, and an optional description, then click Save.
Now, if you choose Create category again on the Billing category screen, you can create a subcategory.
For the Billing category, you can also select a type.
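The folder analogy can be pictured as a small tree in which each category holds its subcategories. The sketch below is only a mental model and not how Watson Knowledge Catalog stores categories. The `Category` class and the `Invoices` subcategory name are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A category that can hold subcategories, much like a folder."""
    name: str
    children: list = field(default_factory=list)

    def add(self, name):
        """Create a subcategory under this category and return it."""
        child = Category(name)
        self.children.append(child)
        return child

    def paths(self, prefix=""):
        """Yield the full path of this category and all of its descendants."""
        path = f"{prefix}/{self.name}" if prefix else self.name
        yield path
        for child in self.children:
            yield from child.paths(path)


billing = Category("Billing")
billing.add("Invoices")  # hypothetical subcategory name
print(list(billing.paths()))  # ['Billing', 'Billing/Invoices']
```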
In a similar way, we can create classifications for assets (similar to Confidential, Personally Identifiable Information, or Sensitive Personal Information) by going to the upper-left hamburger (☰) menu and choosing Organize > Data and AI Governance > Classifications.
Click the New classification drop-down and select Create new classification. These classifications can then be added to your category as a type.
Step 4. Add data classes
When you profile your assets, a data class will be inferred from the contents where possible. You can also add your own data classes.
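To make the idea of inferring a data class from a column's contents more concrete, here is a minimal sketch. It assumes that a class is assigned when a large enough share of the non-empty values match a regular expression. The patterns, the threshold, and the `infer_data_class` name are illustrative assumptions and do not describe how Watson Knowledge Catalog profiles data.

```python
import re

# Hypothetical data classes, each described by a regular expression.
# Order matters: the first class that matches well enough wins.
# For this sketch, "alphanumeric" also allows hyphens.
DATA_CLASS_PATTERNS = {
    "quantity": re.compile(r"\d+(\.\d+)?"),
    "alphanumeric": re.compile(r"[A-Za-z0-9-]+"),
}


def infer_data_class(values, threshold=0.8):
    """Return the first data class whose pattern fully matches at least
    `threshold` of the non-empty values, or None if no class qualifies."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for name, pattern in DATA_CLASS_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            return name
    return None


# Made-up values for illustration.
print(infer_data_class(["29.85", "56.95", "108.15"]))      # quantity
print(infer_data_class(["0001-AAAAA", "0002-BBBBB", ""]))  # alphanumeric
```

When no pattern fits well enough, the function returns `None`, which mirrors the "where possible" qualifier above: some columns simply do not get an inferred class.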
Add a data class for your assets by going to the upper-left hamburger (☰) menu, choosing Organize > Data and AI Governance > Data class, then clicking New data class > Create new data class.
Give your new data class a name (such as alphanumeric) and, optionally, a primary category and description, then click Save as draft.
Once the data class is created, we can add stewards for this class, and also associate classifications and business terms. When you are ready, click Publish.
Now let’s add that data class to a column in our Telco-Customer-Churn.csv asset.
Go back to the catalog you created (the instructions suggested naming it TelcoDataCatalog) and open it to the column view by clicking the hamburger (☰) menu, then Organize > All catalogs > TelcoDataCatalog.
Under the Browse assets tab, click on the data set Telco-Customer-Churn.csv to get the column/row preview.
Scroll right to the CustomerID column, click the down arrow next to its current data class (Customer Number), then click View all.
In the window that opens, search for your newly created data class (alphanumeric), click it when it returns in the search, then click Select.
Step 5. Add business terms
You can use Business terms to standardize definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. You already saw how to create a category and make it a business term. You can also create the business term as its own entity.
From the upper-left hamburger (☰) menu, choose Organize > Data and AI Governance > Business terms.
Click on the upper-right New business term drop-down and click the Create new business term button.
Give the new business term a name, such as Billing, add an optional description, and click Save as draft.
A window will come up once the term is created. You can see a rich set of options for creating related terms and adding other metadata. Click Publish to make this term available to users of the platform.
Add an optional comment and click Publish in the new window.
Now go back to your catalog (the instructions suggested naming it TelcoDataCatalog) and open it to the column view (hamburger (☰) menu, then Organize > All catalogs > TelcoDataCatalog). Under the Browse assets tab, click on the data set Telco-Customer-Churn.csv to get the column/row preview. Scroll right to get to the TotalCharges column and click the Column information icon (looks like an eye).
In the window that opens, click the edit icon (looks like a pencil) next to Business terms.
Enter Billing (the name you gave the business term) under Business terms, and the term will be searched for. Click the Billing term that is found, then click Apply.
Close that window once the term has been applied.
Now, do the same thing to add the Billing business term to the MonthlyCharges column. You will now be able to search for these terms from within the platform. For example, going back to your top-level TelcoDataCatalog, in the search bar with the comment “What assets are you searching for?” enter your unique Billing term.
The Telco-Customer-Churn.csv data set will show up, since it contains columns tagged with the Billing business term.
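Conceptually, this kind of search behaves like a lookup from a business term to the assets that have at least one column tagged with it. The sketch below models that with plain Python dictionaries. The `catalog` structure and both function names are hypothetical stand-ins for illustration and are not a Watson Knowledge Catalog API.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for a catalog:
# asset name -> column name -> set of business terms applied to that column.
catalog = {
    "Telco-Customer-Churn.csv": {
        "TotalCharges": {"Billing"},
        "MonthlyCharges": {"Billing"},
        "CustomerID": set(),
    },
}


def build_term_index(catalog):
    """Map each business term (lowercased) to the (asset, column) pairs tagged with it."""
    index = defaultdict(list)
    for asset, columns in catalog.items():
        for column, terms in columns.items():
            for term in terms:
                index[term.lower()].append((asset, column))
    return index


def search_assets(index, term):
    """Return the sorted names of assets with at least one column tagged with `term`."""
    return sorted({asset for asset, _ in index.get(term.lower(), [])})


index = build_term_index(catalog)
print(search_assets(index, "Billing"))  # ['Telco-Customer-Churn.csv']
```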
Step 6. Add rules for policies
We can now create rules to control how a user can access data. Create a business term called CustomerID and assign it to your CustomerID column in the data set using the instructions above. Try it yourself first. If you need a reminder, see the review below; otherwise, skip to Adding a rule.
Review: How to create a business term
- From the upper-left hamburger (☰) menu, choose Organize > Data and AI Governance > Business terms.
- Click on the upper-right New business term drop-down and click the Create new business term button.
- Give the new business term the name CustomerID, add an optional description, then click Save as draft. In the next window, click Publish. Provide an optional comment in the pop-up and click Publish.
- Now go back to your TelcoDataCatalog and open it to the column view: from the hamburger (☰) menu, choose Organize > All catalogs, then TelcoDataCatalog. Under the Browse assets tab, click on the data set Telco-Customer-Churn.csv to get the column/row preview. Scroll right to get to the CustomerID column and click the Column information icon (looks like an eye).
- In the window that opens, click the edit icon (looks like a pencil) next to Business terms. Enter CustomerID under Business terms, and the term will be searched for. Click the CustomerID term that is found, then click Apply.
Adding a rule
From the upper-left hamburger (☰) menu, choose Organize > Data and AI Governance > Rules.
Click on the New rule drop-down and select Create new rule.
Choose Data protection rule as the type of rule to create.
Under Details, give your rule a name, type, access, and a business definition.
Under Rule builder, fill out Condition1 as: if Business term contains any CustomerID. For the action, fill out: then mask data in columns containing alphanumeric. Choose the tile for Substitute, which replaces each value with a non-identifiable hash. This obscures the actual CustomerID but still allows operations such as database joins to work. Click Create.
Now, if we go back to the CustomerID column of our Telco-Customer-Churn.csv asset in the catalog, it will look the same as before. A non-admin user, however, will see the lock icon and will see that the CustomerID values have been substituted with hash values.
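The reason substitution keeps joins working is that the same input always produces the same replacement value. The sketch below illustrates that property with a keyed hash (HMAC-SHA256) from the Python standard library. This is an assumed stand-in for illustration, not the algorithm Watson Knowledge Catalog uses. The `substitute` function, the key, and the sample rows are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret for this sketch; a real system would manage keys securely.
SECRET_KEY = b"example-only-key"


def substitute(value, key=SECRET_KEY):
    """Replace a value with a keyed, deterministic hash (truncated for readability)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


# Made-up rows for illustration.
customers = [{"CustomerID": "0001-AAAAA", "Churn": "No"},
             {"CustomerID": "0002-BBBBB", "Churn": "Yes"}]
charges = [{"CustomerID": "0002-BBBBB", "TotalCharges": "108.15"}]

# Mask the key column in both tables with the same function and key.
masked_customers = [{**row, "CustomerID": substitute(row["CustomerID"])} for row in customers]
masked_charges = [{**row, "CustomerID": substitute(row["CustomerID"])} for row in charges]

# Join the two masked tables on the substituted key.
charges_by_id = {row["CustomerID"]: row for row in masked_charges}
joined = [{**c, **charges_by_id[c["CustomerID"]]}
          for c in masked_customers if c["CustomerID"] in charges_by_id]

print(joined)  # one joined row; its CustomerID is a hash, not the original value
```

Because both tables are masked with the same function and key, matching rows still line up even though the original identifiers are no longer visible.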
To add a rule to obfuscate data, go to the Profile tab and scroll to the TotalCharges column. You can see that the data in this column has been inferred to be a Quantity.
This is where you could change the classification if the inferred one is not what you want.
You can build a rule to obfuscate this TotalCharges column.
Once that rule is in place, the values in the column are replaced with similarly formatted data.
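One way to picture "similarly formatted data" is a replacement that keeps the shape of each value (its length, punctuation, and character types) while changing the characters themselves. The following sketch shows that idea under stated assumptions: the `obfuscate` function is hypothetical, it handles only ASCII letters and digits, and it is not the method Watson Knowledge Catalog applies.

```python
import random
import string


def obfuscate(value, rng=None):
    """Replace each ASCII digit with a random digit and each ASCII letter with a
    random letter of the same case; keep every other character (such as '.' or '-')."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


# Made-up value for illustration; the result keeps the ddd.dd shape.
print(obfuscate("108.15"))
```

Unlike the substitution sketch for CustomerID, this replacement is random rather than deterministic, so the same input does not map to the same output and the obfuscated values are not meant to be used as join keys.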
In this tutorial, you have learned a few of the powerful tools available for working with data on the IBM Cloud Pak for Data platform. With IBM Watson Knowledge Catalog, team members can work together in their individual roles to bring data and AI to the enterprise.
To continue the Getting started with IBM Cloud Pak for Data learning path and learn more about IBM Cloud Pak for Data, take a look at the next pattern, Data analysis, model building, and deploying with Watson Machine Learning with notebook, or at one of the next tutorials: Automate model building with AutoAI or Build a predictive machine learning model quickly and easily with IBM SPSS Modeler.