Data breaches are not the way you want your company to make headlines, and even one data breach can mean an average cost of 3.9 million USD. With data becoming more of a competitive advantage, and with the amount of data that is produced worldwide projected to double between 2010 and 2023, the process of organizing, managing, and securing data is more important than ever. It’s no surprise that the tools to take that data, organize it, govern it, and ensure quality and searchability are what you need.
As companies begin the process of using their data with AI, many already have the first step of collecting data complete. They store data in a database, and they use that data to inform customers of their previous transactions, conversations, and other information. The second step of organizing the data focuses on creating a foundation based on analytics. More specifically, it’s about enabling your data scientists and business analysts to do their job efficiently.
This step is where a secure metadata management platform like the IBM Watson Knowledge Catalog on the IBM Cloud Pak® for Data platform comes in. At its core, IBM Watson® Knowledge Catalog connects data and knowledge with the people who need to use it. Some typical use cases for Watson Knowledge Catalog include ensuring regulatory compliance, data quality management, and data delivery. In this tutorial, I focus on privacy and protection, with the goal of minimizing the chance for a data breach.
This hands-on tutorial shows you how to solve the problems of enterprise data governance on the IBM Cloud Pak for Data platform from the perspective of the data steward or data administrator persona. I explain how to use governance, data quality, and active policy management to help your organization protect and govern sensitive data, trace data lineage, and manage data lakes. This knowledge can help you quickly discover, curate, categorize, and share data assets, data sets, and analytical models with other members of your organization. Watson Knowledge Catalog serves as a single source of truth for data engineers, data stewards, data scientists, and business analysts to gain self-service access to data they can trust.
You need the admin role to create a catalog, and you begin the tutorial by creating a catalog and loading data.
Set up catalog and data
Note: The default catalog is your enterprise catalog. It is created automatically after you install the Watson Knowledge Catalog service and is the only catalog to which advanced data curation tools apply. The default catalog is governed so that data protection rules are enforced. The information assets view shows additional properties of the assets in the default catalog to aid curation. Any subsequent catalogs that you create can be governed or ungoverned, do not have an information assets view, and supply basic data curation tools.
Create the catalog
Provision Watson Knowledge Catalog
If you haven’t started Watson Knowledge Catalog yet, you’ll need to provision it.
Open Watson Knowledge Catalog by clicking Services at the upper right of the home page.
Click Watson Knowledge Catalog in the Data governance section.
Open Watson Knowledge Catalog
Click Open to launch Watson Knowledge Catalog.
Choose Organize, then All catalogs from the menu on the left.
Click either Create catalog or New Catalog from the Your catalogs page.
Name your catalog (the rest of this tutorial refers to it as CreditDataCatalog) and give it an optional description. Check Enforce data protection rules and click Create.
Click OK on the pop-up menu that opens, then click Create.
Option 1: Add data assets
- From the Browse Assets tab, click here to add your data.
You could also click Add to Catalog + in the upper-right, and choose Local files.
- Clone the following repository: https://github.com/horeaporutiu/wkc-tutorial-intelligent-loan. Browse to where you cloned the repository and go to /data/split/applicant_personal_data.csv. Click Open, add an optional description, and click Add.
Note: Stay in the catalog until loading is complete. If you leave the catalog, the incomplete asset is deleted.
The newly added applicant_personal_data.csv file appears under the Browse Assets tab of your catalog.
Option 2: Add connection
You can add a connection to any remote database, such as Db2® Warehouse on IBM Cloud®, Netezza® Performance Server, or MongoDB, by choosing Add to Catalog + > Connection.
Connect to Netezza Performance Server
- Before you create a connection to Netezza Performance Server, you should load the applicant data into the Netezza® Performance Server using the nzload CLI. To install the nzload CLI, follow the instructions.
Log in to your Netezza Performance Server console and run the create table SQL.
Then you can use the nzload CLI command to load the CSV data to your Netezza Performance Server database:
```bash
nzload -u <user> -pw <password> -host <host> -db <database> -t <table name> -delim ',' -df <csv file name>
```
If the nzload CLI is not supported (for example, on macOS), you can load the insert statements from applicant_personal_data.sql into a SQL editor in the IBM Netezza console and run them. This might take a little longer than the nzload command.
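If you prefer to script the load, the nzload invocation above can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the command above. The function name `build_nzload_command` and all of the connection values are hypothetical placeholders, not part of the tutorial repository.

```python
import shlex


def build_nzload_command(user, password, host, database, table, csv_path, delimiter=","):
    """Return the nzload argument list, using only the flags shown in this tutorial."""
    return [
        "nzload",
        "-u", user,
        "-pw", password,
        "-host", host,
        "-db", database,
        "-t", table,
        "-delim", delimiter,
        "-df", csv_path,
    ]


# All values below are placeholders; substitute your own connection details.
command = build_nzload_command(
    user="admin",
    password="example-password",
    host="nps.example.com",
    database="LOANS",
    table="APPLICANT_PERSONAL_DATA",
    csv_path="applicant_personal_data.csv",
)

# Print a shell-safe rendering of the command for review.
# To actually run it, pass the list to subprocess.run(command, check=True).
print(shlex.join(command))
```

Building the command as a list rather than a single string means the delimiter and file name do not need manual shell quoting when the list is handed to `subprocess.run`.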
Select global, then choose Netezza and CLI.
Enter the connection details and click Create.
The connection now shows up in the catalog.
From the Add to Catalog drop-down, select Connected asset and click select source. Choose the Netezza connection you created earlier and the table you want to use.
The data asset you want to use in the Watson Knowledge Catalog now shows up in the list of assets.
Connect to IBM Db2 on Cloud
Click your remote database.
Enter the connection details and click Create.
The connection now shows up in the catalog.
Option 3: Add virtualized data
Virtualized data can be added to the default catalog by someone with admin or editor access to that catalog.
Choose Organize, then All catalogs from the menu on the left. Click Add to Catalog +, then Connected asset.
Click Source, then Select source. Browse under DV to your schema (for example, UserXYZW), choose the joined table, and click Select.
A user can now add this to a project like any other asset from a catalog.
Add collaborators and control access
Click Add Collaborator under Access Control to give other users access to your catalog.
Assign a role (admin, editor, or viewer) to a user by searching for the user, clicking the name, choosing a role, and clicking Add.
Access data in the catalog by clicking the name of the data.
A preview of the data opens, with metadata and the first few rows.
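To get a feel for what such a preview contains, you can reproduce a rough equivalent locally before uploading a file. This is a minimal sketch, not how Watson Knowledge Catalog builds its preview. The helper name `preview_csv`, the column names, and the sample rows are all hypothetical and do not come from the tutorial data set.

```python
import csv
import io


def preview_csv(text, max_rows=5):
    """Return basic metadata plus the first few data rows of CSV text."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # zip stops after max_rows without reading the rest of the file.
    rows = [row for _, row in zip(range(max_rows), reader)]
    return {"columns": header, "column_count": len(header), "rows": rows}


# Hypothetical sample; the real file has its own columns and values.
sample = (
    "CUSTOMERID,EMAIL,TELEPHONE,AGE\n"
    "C1001,ana@example.com,555-0101,34\n"
    "C1002,raj@example.com,555-0102,51\n"
    "C1003,li@example.com,555-0103,27\n"
)

preview = preview_csv(sample, max_rows=2)
print(preview["columns"])
for row in preview["rows"]:
    print(row)
```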
You can click the Review tab and rate the data, as well as comment on it to provide feedback for your team.
Add categories
The fundamental abstraction in Watson Knowledge Catalog is the category, which is like a folder. You can add a category for your assets by choosing Organize > Data and AI Governance > Categories from the menu on the left.
You can then import the categories in a .csv format (option 1), or you can add categories manually (option 2).
Option 1: Import categories
Click Import.
Click Add file, navigate to where you cloned this repository, then choose data/wkc/glossary-organize-categories.csv.
Under Select merge option, choose Replace all values and click Import.
You see the “The import completed successfully” message when the import is complete.
Using this option, you can import categories, business terms, classifications, and policies to populate your governance catalogs.
Option 2: Add category manually
Click Create category.
Name your category, such as personal data, give it an optional description, then click Save.
If you click Create category again on the personal data category screen, you can create a subcategory, such as residence information.
For the residence information subcategory you can select a type, such as business term to filter for all business terms associated with this subcategory. Currently, it is blank.
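The folder analogy for categories and subcategories can be pictured as a small tree. The sketch below is only a mental model under that assumption; the `Category` class and its methods are hypothetical and are not a Watson Knowledge Catalog API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    """A folder-like node that can hold nested subcategories."""
    name: str
    description: str = ""
    subcategories: List["Category"] = field(default_factory=list)

    def add_subcategory(self, name, description=""):
        child = Category(name, description)
        self.subcategories.append(child)
        return child

    def paths(self, prefix=""):
        """Yield the folder-style path of this category and all descendants."""
        current = f"{prefix}/{self.name}" if prefix else self.name
        yield current
        for child in self.subcategories:
            yield from child.paths(current)


personal = Category("personal data")
personal.add_subcategory("residence information")

for path in personal.paths():
    print(path)
```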
You also can create classifications for assets — similar to Confidential, Personally Identifiable Information, or Sensitive Personal Information — in a similar way by choosing Organize > Data and AI Governance > Classifications from the left menu.
Click New classification from the drop-down menu, then select Create new classification. These classifications can then be added to your category as a type.
Add data classes
When you profile your assets, a data class is inferred from the contents where possible. You’ll see more on this later. You can also add your own data classes.
Add a data class for your assets by choosing Organize > Data and AI Governance > Data classes > New data class > Create new data class from the left menu.
Give your new data class a name — for example, alphanumeric — and an optional primary category or description, then click Save as draft.
After the data class is created, you can add stewards for this class, and associate classifications and business terms if they are available (you might not have any business terms yet). When you are ready, click Publish.
Now let’s add that data class to a column in your applicant_personal_data.csv asset.
Go back to your CreditDataCatalog catalog and open it. Select Organize > All catalogs > CreditDataCatalog. Under the Browse assets tab, click applicant_personal_data.csv to get the column/row preview. Scroll to the right to get to the customer ID column, then click the down arrow next to customer number and select View all.
Search for your newly created data class called alphanumeric in the window that opens, click it when it returns in the search, then click Select.
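To illustrate what inferring a data class from column contents can look like, here is a deliberately simple pattern-based sketch. It is not the algorithm that Watson Knowledge Catalog uses for profiling. The pattern list, the 80% threshold, and the function name `infer_data_class` are all assumptions made for illustration.

```python
import re

# Ordered from most specific to most general; the first sufficiently
# matching pattern wins.
PATTERNS = [
    ("email", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    ("integer", re.compile(r"\d+")),
    ("alphanumeric", re.compile(r"[A-Za-z0-9]+")),
]


def infer_data_class(values, threshold=0.8):
    """Return the first class whose pattern matches enough non-empty values."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    for name, pattern in PATTERNS:
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            return name
    return None


print(infer_data_class(["C1001", "C1002", "C1003"]))
print(infer_data_class(["ana@example.com", "raj@example.com"]))
```

When no pattern matches enough values, the function returns `None`, which corresponds to the "where possible" caveat: some columns need a data class assigned by hand, as in the steps above.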
Add business terms
You can use business terms to standardize definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. You already saw how to create a category and make it a business term. You also can create the business term as its own entity.
Choose Organize > Data and AI Governance > Business terms from the left menu.
Click the New business term drop-down menu at the upper right and choose Create new business term.
Give the new business term a name, such as Contact Information, and an optional description, then click Save as draft.
A window opens after the term is created. You can see a rich set of options for creating related terms and adding other metadata. For now, click Publish to make this term available to users of the platform.
Add an optional comment and click Publish in the new window.
Go back to your CreditDataCatalog catalog and open it to the column view. Select Organize > All catalogs > CreditDataCatalog.
Under the Browse assets tab, click applicant_personal_data.csv to get the column/row preview. Scroll right to get to the email column and click column information (it looks like an eye).
In the window that opens, click the edit icon (it looks like a pencil) next to business terms.
Enter Contact Information under Business terms, and the term is searched for. Click the Contact Information term that is found, then click Apply.
Close that window after the term has been applied and repeat the same steps to add the contact information business term to the telephone column.
You are now able to search for these terms from within the platform. For example, go back to the top-level CreditDataCatalog and enter your contact information term in the search bar that shows the prompt “What assets are you searching for?”
The applicant_personal_data.csv data set shows up because it contains columns that are tagged with the contact information business term.
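Conceptually, this search returns every asset that has at least one column tagged with the term. The sketch below models that idea with plain dictionaries. It is an illustration only, not the catalog's search implementation; the function name `find_assets_by_term` and the second asset, `loan_history.csv`, are hypothetical.

```python
def find_assets_by_term(catalog, term):
    """Return assets that have at least one column tagged with the term."""
    wanted = term.casefold()
    return [
        asset
        for asset, columns in catalog.items()
        if any(wanted in (t.casefold() for t in terms) for terms in columns.values())
    ]


# Maps asset name -> column name -> business terms applied to that column.
catalog = {
    "applicant_personal_data.csv": {
        "email": ["Contact Information"],
        "telephone": ["Contact Information"],
        "age": [],
    },
    # Hypothetical asset with no tagged columns, for contrast.
    "loan_history.csv": {"amount": []},
}

print(find_assets_by_term(catalog, "contact information"))
```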
Add rules for policies
You can now create rules to control how a user can access the data. Create a business term called CustomerID and assign it to your customer ID column in the data set using the instructions given previously. See the following if you need details, but you might like to try it yourself first. (Skip to the “Adding a rule” section if you do not need a reminder.)
Create a business term review
Choose Organize > Data and AI Governance > Business terms from the left menu.
Click + Create Business term.
Give the new business term the name CustomerID and add an optional description, then click Save as draft. In the next window, click Publish.
Go back to your credit data catalog and open it to the column view. Select Organize > All catalogs > CreditDataCatalog.
Under the Browse assets tab, click the applicant_personal_data.csv data set to get the column/row preview. Scroll right to get to the customer ID column, and click the column information icon (it looks like an eye).
In the window that opens, click the edit icon (it looks like a pencil) next to Business terms.
Enter CustomerID under Business terms, and the term is searched for. Click the CustomerID term that is found and click Apply.
Adding a rule
Choose Organize > Data and AI Governance > Rules from the left menu.
Click New rule > Create new rule.
Choose Data protection rule for the new rule, then click Select the type of rule to create.
Under Details, give your rule a name, type = access, and a business definition.
Under Rule builder, in Condition1, fill out the condition as: if Business term Contains any CustomerID. For the action, choose mask data in columns containing alphanumeric. Then choose the Substitute tile, which replaces each value with a non-identifiable hash. This obscures the actual customer ID but still allows operations such as database joins to work. Click Create.
If you go back to your applicant_personal_data.csv asset in the catalog at the CustomerID column, it looks the same as before. However, a non-admin user sees the lock icon and that the customer ID has now been substituted with a hash value.
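The reason a substituted value still supports joins is that the substitution is deterministic: the same input always maps to the same token. The sketch below demonstrates that property with a keyed SHA-256 hash. This is an assumption chosen for illustration, not the masking method that Watson Knowledge Catalog applies. The key, the sample IDs, and the `substitute` function are hypothetical.

```python
import hashlib
import hmac


def substitute(value, secret_key):
    """Map a value to a stable, non-identifying token (same input, same token)."""
    digest = hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


key = b"example-secret-key"  # placeholder; keep real keys out of source code

# Two hypothetical tables that share a customer ID column.
applicants = {"C1001": "ana@example.com", "C1002": "raj@example.com"}
loans = {"C1001": 25000, "C1003": 8000}

masked_applicants = {substitute(cid, key): email for cid, email in applicants.items()}
masked_loans = {substitute(cid, key): amount for cid, amount in loans.items()}

# The join on masked IDs finds the same single match as a join on raw IDs.
joined = {
    token: (masked_applicants[token], masked_loans[token])
    for token in masked_applicants.keys() & masked_loans.keys()
}
print(len(joined))
```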
Back in the CreditDataCatalog, under the applicant_personal_data.csv asset, go to the Overview tab and scroll to the age column. Click the down arrow, and you can see that the data class for this column has been inferred as a code.
Change the classifier by clicking View all, and start typing Age. When this appears in the search, click Use, then click Close.
You can build a rule to obfuscate the age column.
Now that column has data that is replaced with similarly formatted data.
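Replacing data with similarly formatted data means the shape of each value survives while its content does not. A minimal sketch of that idea follows, assuming a simple character-by-character replacement; the actual obfuscation method in Watson Knowledge Catalog may differ, and the `obfuscate` function and sample values are hypothetical.

```python
import random
import re
import string


def obfuscate(value, rng):
    """Swap each digit or letter for a random one of the same kind."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as "-" so the format is preserved
    return "".join(out)


rng = random.Random(42)  # seeded only so that repeated runs give the same result
masked_age = obfuscate("34", rng)
masked_phone = obfuscate("555-0142", rng)
print(masked_age, masked_phone)
```

Unlike the Substitute approach, this replacement is not tied to the input value, so it suits columns such as age that do not need to be joined on.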
Finally, click lineage to see the history of changes you’ve made to the data asset.
You can see the events associated with the applicant_personal_data.csv data asset — for example, when you updated the age column with the age data class.
Conclusion
In this tutorial, you have explored tools to organize and govern data to ensure quality and searchability. Along the way, you’ve learned how to:
- Set up the catalog and data
- Add collaborators and control access
- Add categories
- Add data classes
- Add business terms
- Add rules for policies
This tutorial is part of the Modernizing your bank loan department series, a solution that shows how to automate and enhance loan transaction processes with AI. While this tutorial focuses on data preprocessing and governance, the other tutorials and code patterns in the solution focus on creating a machine learning model, explaining the model outcomes, and retraining the model to increase accuracy.