Implement data governance to manage and secure clients’ data

Data breaches are not the way you want your company to make headlines, and even one data breach can mean an average cost of 3.9 million USD. With data becoming more of a competitive advantage, and with the amount of data that is produced worldwide projected to double between 2010 and 2023, the process of organizing, managing, and securing data is more important than ever. It’s no surprise, then, that you need tools that can take that data, organize it, govern it, and ensure its quality and searchability.

As companies begin the process of using their data with AI, many already have the first step of collecting data complete. They store data in a database, and they use that data to inform customers of their previous transactions, conversations, and other information. The second step of organizing the data focuses on creating a foundation based on analytics. More specifically, it’s about enabling your data scientists and business analysts to do their job efficiently.

This step is where a secure metadata management platform like the IBM Watson Knowledge Catalog on the IBM Cloud Pak® for Data platform comes in. At its core, IBM Watson® Knowledge Catalog connects data and knowledge with the people who need to use it. Some typical use cases for Watson Knowledge Catalog include ensuring regulatory compliance, data quality management, and data delivery. In this tutorial, I focus on privacy and protection, with the goal of minimizing the chance for a data breach.

This hands-on tutorial shows you how to solve enterprise data governance problems on the IBM Cloud Pak for Data platform from the perspective of a data steward or data administrator. I explain how to use governance, data quality, and active policy management to help your organization protect and govern sensitive data, trace data lineage, and manage data lakes. This knowledge can help you quickly discover, curate, categorize, and share data assets, data sets, and analytical models with other members of your organization. Watson Knowledge Catalog serves as a single source of truth for data engineers, data stewards, data scientists, and business analysts to gain self-service access to data they can trust.

IBM Cloud Pak for Data login

You need the admin role to create a catalog, and you begin the tutorial by creating a catalog and loading data.

Set up catalog and data

Note: The default catalog is your enterprise catalog. It is created automatically after you install the Watson Knowledge Catalog service and is the only catalog to which advanced data curation tools apply. The default catalog is governed so that data protection rules are enforced. The information assets view shows additional properties of the assets in the default catalog to aid curation. Any subsequent catalogs that you create can be governed or ungoverned, do not have an information assets view, and supply basic data curation tools.

Create the catalog

Provision Watson Knowledge Catalog

If you haven’t started Watson Knowledge Catalog yet, you’ll need to provision it.

  1. Open Watson Knowledge Catalog by clicking Services at the upper right of the home page.

  2. Click Watson Knowledge Catalog in the Data governance section.

Open Watson Knowledge Catalog

  1. Click Open to launch Watson Knowledge Catalog.

  2. Choose Organize, then All catalogs from the menu on the left.

  3. Click either Create catalog or New Catalog from the Your catalogs page.

  4. Name your catalog (this tutorial uses CreditDataCatalog) and give it an optional description. Check Enforce data protection rules and click Create.

  5. Click OK on the pop-up menu that opens, then click Create.

Option 1: Add data assets

  1. From the Browse Assets tab, click here to add your data.

You could also click Add to Catalog + in the upper-right, and choose Local files.

  2. Clone the repository at https://github.com/horeaporutiu/wkc-tutorial-intelligent-loan, then browse to where you cloned the repository and go to /data/split/applicant_personal_data.csv. Click Open, add an optional description, and click Add.

Note: Stay in the catalog until loading is complete. If you leave the catalog, the incomplete asset is deleted.

The newly added applicant_personal_data.csv file appears under the Browse Assets tab of your catalog.

Option 2: Add connection

You can add a connection to a remote database, such as Db2® Warehouse on IBM Cloud®, Netezza® Performance Server, or MongoDB, by choosing Add to Catalog + > Connection.

Connect to Netezza Performance Server

  1. Before you create a connection to Netezza Performance Server, you should load the applicant data into the Netezza® Performance Server using the nzload CLI. To install the nzload CLI, follow the instructions.

Log in to your Netezza Performance Server console and run the create table SQL.

Then you can use the nzload CLI command to load the CSV data to your Netezza Performance Server database:

nzload -u <user> -pw <password> -host <host> -db <database> -t <table name> -delim ',' -df <csv file name>

If the nzload CLI is not supported (for example, on macOS), you can paste the insert statements from applicant_personal_data.sql into a SQL editor in the IBM Netezza console and run them. This might take a little longer than the nzload command.
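
If you ever need to produce such insert statements yourself from a CSV file, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name and the two sample columns are hypothetical, and every value is quoted as a string literal, which relies on the database casting values to the column types.

```python
import csv
import io
from typing import List


def csv_to_inserts(csv_text: str, table: str) -> List[str]:
    """Turn CSV text with a header row into one INSERT statement per data row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    columns = ", ".join(header)
    statements = []
    for row in reader:
        # Quote every value as a string literal and double any single quotes.
        values = ", ".join("'" + value.replace("'", "''") + "'" for value in row)
        statements.append(f"INSERT INTO {table} ({columns}) VALUES ({values});")
    return statements


# Hypothetical two-row sample; the real CSV file has its own columns.
sample = "CUSTOMERID,NAME\n1001,Ann O'Hara\n1002,Raj Patel\n"
for statement in csv_to_inserts(sample, "APPLICANT_PERSONAL_DATA"):
    print(statement)
```

For a real load, you would read the CSV file from disk instead of the inline sample and review the generated statements before running them.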

  1. Select global, then choose Netezza and CLI.

  2. Enter the connection details and click Create.

  3. The connection now shows up in the catalog.

  4. From the Add to Catalog drop-down, select Connected asset and click Select source. Choose the Netezza connection you created earlier and the table you want to use.

The data asset you want to use in the Watson Knowledge Catalog now shows up in the list of assets.

Connect to IBM Db2 on Cloud

  1. Click your remote database.

  2. Enter the connection details and click Create.

The connection now shows up in the catalog.

Option 3: Add virtualized data

Virtualized data can be added to the default catalog by someone with admin or editor access to that catalog.

  1. Choose Organize, then All catalogs from the menu on the left. Click Add to Catalog +, then Connected asset.

  2. Click Source, then Select source. Browse under DV to your schema (for example, UserXYZW), choose the joined table, and click Select.

A user can now add this to a project like any other asset from a catalog.

Add collaborators and control access

  1. Click Add Collaborator under the Access Control tab to give other users access to your catalog.

  2. Assign a role (admin, editor, or viewer) to a user by searching for the user, clicking the name, choosing a role, and clicking Add.

  3. Access data in the catalog by clicking the name of the data.

A preview of the data opens, with metadata and the first few rows.

You can click the Review tab and rate the data, as well as comment on it to provide feedback for your team.

Add categories

The fundamental abstraction in Watson Knowledge Catalog is the category, which is like a folder. You can add a category for your assets by choosing Organize > Data and AI Governance > Categories from the menu on the left.
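
To make the folder analogy concrete, here is a minimal sketch of categories as a nested tree. It is a conceptual model only, not how Watson Knowledge Catalog stores categories; the `Category` class and its methods are hypothetical, and the two category names are the examples used later in this tutorial.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    """A category that can hold subcategories, much like a folder."""
    name: str
    subcategories: List["Category"] = field(default_factory=list)

    def add(self, name: str) -> "Category":
        # Create a subcategory under this category and return it.
        child = Category(name)
        self.subcategories.append(child)
        return child

    def paths(self, prefix: str = "") -> List[str]:
        # Return folder-style paths for this category and everything below it.
        here = f"{prefix}/{self.name}" if prefix else self.name
        result = [here]
        for child in self.subcategories:
            result.extend(child.paths(here))
        return result


personal = Category("personal data")
personal.add("residence information")
print(personal.paths())
# ['personal data', 'personal data/residence information']
```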

You can then import the categories in a .csv format (option 1), or you can add categories manually (option 2).

Option 1: Import categories

  1. Click Import.

  2. Click Add file, navigate to where you cloned this repository, then choose data/wkc/glossary-organize-categories.csv.

  3. Under Select merge option, choose Replace all values and click Import.

You see the message “The import completed successfully” when the import is complete.

Using this option, you can import categories, business terms, classifications, and policies to populate your governance catalogs.

Option 2: Add category manually

  1. Click Create category.

  2. Name your category, such as personal data, give it an optional description, then click Save.

  3. If you click Create category again on the personal data category screen, you can create a subcategory, such as residence information.

  4. For the residence information subcategory, you can select a type, such as business term, to filter for all business terms associated with this subcategory. Currently, it is blank.

  5. You can also create classifications for assets (such as Confidential, Personally Identifiable Information, or Sensitive Personal Information) in a similar way by choosing Organize > Data and AI Governance > Classifications from the left menu.

  6. Click New classification from the drop-down menu, then select Create new classification. These classifications can then be added to your category as a type.

Add data classes

When you profile your assets, a data class is inferred from the contents where possible. You’ll see more on this later. You can also add your own data classes.
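
As a rough intuition for how a data class can be inferred from column contents, the sketch below matches sample values against patterns and picks the first class that covers most of them. This is an assumption-laden illustration, not the profiling algorithm that Watson Knowledge Catalog uses; the pattern list, the class names, and the 80% threshold are all hypothetical.

```python
import re
from typing import List, Optional

# Hypothetical data classes and patterns, checked in order (most specific first).
PATTERNS = [
    ("email address", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    ("integer", re.compile(r"-?\d+")),
    ("alphanumeric", re.compile(r"[A-Za-z0-9]+")),
]


def infer_data_class(values: List[str], threshold: float = 0.8) -> Optional[str]:
    """Return the first class whose pattern matches at least `threshold` of the values."""
    non_empty = [v.strip() for v in values if v.strip()]
    if not non_empty:
        return None
    for name, pattern in PATTERNS:
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            return name
    return None


print(infer_data_class(["ann@example.com", "raj@example.org"]))  # email address
print(infer_data_class(["12", "34", "x9"]))                      # alphanumeric
```

A value-pattern approach like this is also why an inferred class can be wrong for a given column, which is the reason the steps below let you override it.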

  1. Add a data class for your assets by choosing Organize > Data and AI Governance > Data classes > New data class > Create new data class from the left menu.

  2. Give your new data class a name (for example, alphanumeric) and an optional primary category or description, then click Save as draft.

After the data class is created, you can add stewards for this class, and associate classifications and business terms if they are available (you might not have any business terms yet). When you are ready, click Publish.

Now let’s add that data class to a column in your applicant_personal_data.csv asset.

  1. Go back to your CreditDataCatalog catalog and open it by selecting Organize > All catalogs > CreditDataCatalog. Under the Browse assets tab, click applicant_personal_data.csv to get the column/row preview. Scroll to the right to get to the customer ID column, then click the down arrow next to customer number and select View all.

  2. In the window that opens, search for your newly created alphanumeric data class, click it when it appears in the search results, then click Select.

Add business terms

You can use business terms to standardize definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise. You already saw how to create a category and make it a business term. You also can create the business term as its own entity.

  1. Choose Organize > Data and AI Governance > Business terms from the left menu.

  2. Click the New business term drop-down menu at the upper-right and choose Create new business term.

  3. Give the new business term a name, such as Contact Information, and an optional description, then click Save as draft.

  4. A window opens after the term is created. You can see a rich set of options for creating related terms and adding other metadata. For now, click Publish to make this term available to users of the platform.

  5. Add an optional comment and click Publish in the new window.

  6. Go back to your CreditDataCatalog catalog and open it to the column view by selecting Organize > All catalogs > CreditDataCatalog.

  7. Under the Browse assets tab, click applicant_personal_data.csv to get the column/row preview. Scroll right to get to the email column and click column information (it looks like an eye).

  8. In the window that opens, click the edit icon (it looks like a pencil) next to Business terms.

  9. Enter Contact Information under Business terms, and the term is searched for. Click the Contact Information term that is found, then click Apply.

Close that window after the term has been applied and repeat the same steps to add the contact information business term to the telephone column.

You can now search for these terms from within the platform. For example, go back to the top-level CreditDataCatalog, find the search bar with the placeholder text “What assets are you searching for?”, and enter your unique contact information term.

The applicant_personal_data.csv data set shows up because it contains columns that are tagged with the contact information business term.
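
Conceptually, this kind of search is a lookup from a business term to every asset that has at least one column tagged with it. The sketch below models that lookup with plain dictionaries. It is an illustration only: the column names, the second asset, and the `find_assets` function are hypothetical and do not reflect the platform's search implementation.

```python
from typing import Dict, List, Set

# asset name -> column name -> business terms assigned to that column
# (hypothetical contents for illustration)
catalog: Dict[str, Dict[str, Set[str]]] = {
    "applicant_personal_data.csv": {
        "EMAIL": {"Contact Information"},
        "TELEPHONE": {"Contact Information"},
        "AGE": set(),
    },
    "other_asset.csv": {"AMOUNT": set()},
}


def find_assets(catalog: Dict[str, Dict[str, Set[str]]], term: str) -> List[str]:
    """Return the assets that have at least one column tagged with the term."""
    wanted = term.casefold()  # compare case-insensitively
    return sorted(
        asset
        for asset, columns in catalog.items()
        if any(wanted == t.casefold() for terms in columns.values() for t in terms)
    )


print(find_assets(catalog, "contact information"))
# ['applicant_personal_data.csv']
```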

Add rules for policies

You can now create rules to control how a user can access the data. Create a business term called CustomerID and assign it to your customer ID column in the data set using the instructions given previously. See the following if you need details, but you might like to try it yourself first. (Skip to the “Adding a rule” section if you do not need a reminder.)

Create a business term review

  1. Choose Organize > Data and AI Governance > Business terms from the left menu.

  2. Click + Create Business term.

  3. Give the new business term the name CustomerID and add an optional description, then click Save as draft. In the next window, click Publish.

  4. Go back to your credit data catalog and open it to the column view. Select Organize > All catalogs > CreditDataCatalog.

  5. Under the Browse assets tab, click the applicant_personal_data.csv data set to get the column/row preview. Scroll right to get to the customer ID column, and click the column information icon (it looks like an eye).

  6. In the window that opens, click the edit icon (it looks like a pencil) next to Business terms.

  7. Enter CustomerID under Business terms, and the term is searched for. Click the CustomerID term that is found and click Apply.

Adding a rule

  1. Choose Organize > Data and AI Governance > Rules from the left menu.

  2. Click New rule > Create new rule.

  3. Choose Data protection rule for the new rule, then click Select the type of rule to create.

  4. Under Details, give your rule a name, set the type to Access, and add a business definition.

  5. Under Rule builder, in Condition1, specify the condition: if Business term Contains any CustomerID. For the action, choose to mask data in columns containing alphanumeric. Choose the tile for Substitute, which replaces each value with a non-identifiable hash. This obscures the actual customer ID, but allows actions like database joins to still work. Click Create.

  6. If you go back to your applicant_personal_data.csv asset in the catalog, the CustomerID column looks the same as before. However, a non-admin user sees the lock icon and the customer ID substituted with a hash value.

  7. Back in the CreditDataCatalog, under the applicant_personal_data.csv asset, go to the Overview tab and scroll to the age column. Click the down arrow, and you can see that the data class has been inferred as code.

  8. Change the classifier by clicking View all, and start typing Age. When this appears in the search, click Use, then click Close.

  9. You can build a rule to obfuscate the age column.

The data in that column is now replaced with similarly formatted data.
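
The Substitute method used for the customer ID rule keeps joins working because the same input always maps to the same masked value. The following is a minimal sketch of that idea, not the masking implementation in Watson Knowledge Catalog. The salted SHA-256 hash, the 12-character truncation, the `is_admin` flag, and the column names are all assumptions made for illustration, and the rule is simplified to "mask any column tagged with the CustomerID business term."

```python
import hashlib
from typing import Dict, List, Set


def substitute(value: str, salt: str = "demo-salt") -> str:
    """Replace a value with a deterministic, non-identifiable token."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]  # shortened for readability; this raises collision risk


def apply_rule(
    rows: List[Dict[str, str]],
    column_terms: Dict[str, Set[str]],
    is_admin: bool,
) -> List[Dict[str, str]]:
    """Mask columns tagged with the CustomerID term for non-admin users."""
    if is_admin:
        return rows  # in this sketch, admins see the original data
    masked = {col for col, terms in column_terms.items() if "CustomerID" in terms}
    return [
        {col: substitute(val) if col in masked else val for col, val in row.items()}
        for row in rows
    ]


rows = [{"CUSTOMERID": "A1", "AGE": "30"}, {"CUSTOMERID": "A1", "AGE": "41"}]
terms = {"CUSTOMERID": {"CustomerID"}, "AGE": set()}
for row in apply_rule(rows, terms, is_admin=False):
    print(row)
```

Both rows end up with the same masked CUSTOMERID, so a join on that column still lines up even though the real identifier is hidden.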

Finally, click lineage to see the history of changes you’ve made to the data asset.

You can see the events associated with the applicant_personal_data.csv data asset — for example, when you updated the age column with the age data class.

Conclusion

In this tutorial, you have explored tools to organize and govern data to ensure quality and searchability. Along the way, you’ve learned how to:

  • Set up the catalog and data
  • Add collaborators and control access
  • Add categories
  • Add data classes
  • Add business terms
  • Add rules for policies

This tutorial is part of the Modernizing your bank loan department series, a solution that shows how to automate and enhance loan transaction processes with AI. While this tutorial focuses on data preprocessing and governance, the other tutorials and code patterns in the solution focus on creating a machine learning model, explaining the model outcomes, and retraining the model to increase accuracy.