
Implement data governance to manage and secure clients’ data

Data breaches are not the way you want your company to make headlines, and a single data breach costs 3.9 million USD on average. With data becoming more of a competitive advantage, and with the amount of data produced worldwide projected to double from 2020 to 2023, organizing, managing, and securing data is more important than ever. It’s no surprise that you need tools that can organize and govern that data and ensure its quality and searchability.

As companies begin using their data with artificial intelligence (AI), many have already completed the first step: collecting data. They store data in a database, and they use that data to inform customers of their previous transactions, conversations, and other information. The second step, organizing the data, focuses on creating a foundation based on analytics. More specifically, it’s about enabling your data scientists and business analysts to do their jobs efficiently.

This step is where a secure metadata management platform like the IBM® Watson™ Knowledge Catalog on the IBM Cloud Pak® for Data platform comes in. At its core, Watson Knowledge Catalog connects data and knowledge with the people who need to use it. Some typical use cases for Watson Knowledge Catalog include ensuring regulatory compliance, data quality management, and data delivery. In this tutorial, I focus on privacy and protection, with the goal of minimizing the chance for a data breach.

This hands-on tutorial shows you how to solve the problems of enterprise data governance on the IBM Cloud Pak for Data platform from the perspective of the data steward or data administrator persona. I explain how to use governance, data quality, and active policy management to help your organization protect and govern sensitive data, trace data lineage, and manage data lakes. This knowledge can help you quickly discover, curate, categorize, and share data assets, data sets, and analytical models with other members of your organization. The catalog serves as a single source of truth for data engineers, data stewards, data scientists, and business analysts to gain self-service access to data they can trust.

Cloud Pak for Data login

You need the Admin role to create a catalog, and you begin the tutorial by creating a catalog and loading data.

Set up catalog and data

Note: The default catalog is your enterprise catalog. It is created automatically after you install the Watson Knowledge Catalog service and is the only catalog to which advanced data curation tools apply. The default catalog is governed so that data protection rules are enforced. The information assets view shows additional properties of the assets in the default catalog to aid curation. Any subsequent catalogs that you create can be governed or ungoverned, do not have an information assets view, and supply basic data curation tools.

Create the catalog

Provision Watson Knowledge Catalog

If you haven’t started Watson Knowledge Catalog yet, you’ll need to provision it.

  1. Open Watson Knowledge Catalog by clicking Services at the upper right of the home page.

    click services icon

  2. Click Watson Knowledge Catalog under the Data Governance section.

    open wkc

Open Watson Knowledge Catalog

  1. Click Open to launch Watson Knowledge Catalog.

    open wkc

  2. Choose Organize, then All catalogs from the menu on the left.

    open catalog menu

  3. Click either Create catalog or New Catalog from the Your catalogs page.

    create WKC catalog

  4. Name your catalog (later steps in this tutorial refer to it as CreditDataCatalog) and give it an optional description. Check Enforce data protection rules, and click Create.

    name and create wkc catalog

  5. Click OK on the pop-up menu that opens, then click Create.

    enforce data protection

Option 1: Add data assets

  1. From the Browse Assets tab, click here to add your data.

    click here to add assets

    You could also click Add to catalog + in the upper-right, and choose Local files.

    add local files to catalog

  2. Clone the following repository: https://github.com/horeaporutiu/wkc-tutorial-intelligent-loan. Browse to where you cloned the repository and go to /data/split/applicant_personal_data.csv. Click Open. Add an optional description, and click Add.

    click add for local files to catalog

Note: Stay in the catalog until loading is complete. If you leave the catalog, the incomplete asset is deleted.

The newly added applicant_personal_data.csv file appears under the Browse Assets tab of your catalog.

newly added data in catalog
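
If you want to check a file's columns locally before uploading it, a minimal Python sketch along the following lines might help. It assumes the CSV has a header row. The inline sample data is hypothetical and stands in for applicant_personal_data.csv; only the column names CustomerID, Email, Telephone, and Age come from this tutorial, and the real file may contain more columns.

```python
import csv
import io

# Hypothetical sample standing in for applicant_personal_data.csv.
# The values are invented; the real file's full column list is not shown here.
SAMPLE_CSV = """CustomerID,Email,Telephone,Age
C001,ana@example.com,555-0100,34
C002,raj@example.com,555-0101,51
"""


def preview_csv(stream, max_rows=5):
    """Return the header row and up to max_rows data rows from a CSV stream."""
    reader = csv.reader(stream)
    header = next(reader)
    rows = []
    for row in reader:
        if len(rows) >= max_rows:
            break
        rows.append(row)
    return header, rows


header, rows = preview_csv(io.StringIO(SAMPLE_CSV))
print(header)
for row in rows:
    print(row)
```

To run the same check against the cloned file, you could pass an open file object instead of the io.StringIO sample.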

Option 2: Add connection

  1. Add a connection to a remote database, for example, Db2 Warehouse in IBM Cloud, by choosing Add to catalog +, then Connection.

    add connection to catalog

  2. Click your remote database.

    chose db2 warehouse connection

  3. Enter the connection details, and click Create.

    enter db2 warehouse connection details

The connection now shows up in the catalog.

db2 warehouse connection shows up

Option 3: Add Virtualized Data

Virtualized data can be added to the Default catalog by someone with Administrator or Editor access to that catalog.

  1. Choose Organize, then All catalogs from the menu on the left. Click Add to Catalog +, then Connected asset.

    add connected asset

  2. Click Source, then Select source. Browse under DV to your schema (for example, UserXYZW), and choose the joined table. Click Select.

    select source

A user can now add this to a project like any other asset from a catalog.

Add collaborators and control access

  1. Under the Access Control tab, click Add Collaborator to give other users access to your catalog.

    give users access to the catalog

  2. Assign a role (Admin, Editor, or Viewer) to a user by searching for the user, clicking the name, choosing a role, and clicking Add.

    search for user and add as collaborator

  3. Access data in the catalog by clicking the name of the data.

    click data name to open

A preview of the data opens, with metadata and the first few rows.

preview of data

You can click the Review tab and rate the data as well as comment on it to provide feedback for your team.

review data

Add categories

The fundamental abstraction in Watson Knowledge Catalog is the category, which is like a folder.

You can add a category for your assets by choosing Organize, Data and AI Governance, then Categories from the menu on the left.

Add category

You can then import the categories in a .csv format (option 1), or you can add categories manually (option 2).

Option 1: Import categories

  1. Click Import.

    Import categories

  2. Click Add file, navigate to where you cloned this repository, and choose data/wkc/glossary-organize-categories.csv.

    Import csv

  3. Under Select merge option, choose Replace all values, and click Import.

    Import select merge option

You see the “The import completed successfully” message when the import is complete.

Import complete

This way, you can import categories, business terms, classifications, and policies to populate your governance catalogs.

Option 2: Add category manually

  1. Click Create category.

    organize data categories

  2. Name your category, such as Personal Data, give it an optional description, and then click Save.

    new category billing

  3. If you click Create category again on the Personal Data category screen, you can create a subcategory, such as Residence Information.

    sub category total charges

  4. For the Residence Information subcategory, you can select a Type, such as Business term, to filter for all business terms associated with this subcategory. Currently, the list is blank.

    select business term type

  5. You can also create classifications for assets, such as Confidential, Personally Identifiable Information, or Sensitive Personal Information, in a similar way by choosing Organize, Data and AI Governance, then Classifications from the left menu.

    select classification type

  6. Click the New classification drop-down menu, then select Create new classification. These classifications can then be added to your category as a Type.

    select classification type
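
One way to picture the folder-like category and subcategory relationship described above is as a simple tree. The following Python sketch is only an illustration of that idea. The Category class and its methods are hypothetical and are not part of any Watson Knowledge Catalog API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    """Hypothetical folder-like node: a category with optional subcategories."""
    name: str
    # repr/compare are disabled on parent to avoid infinite recursion
    # through the parent -> children -> parent cycle.
    parent: Optional["Category"] = field(default=None, repr=False, compare=False)
    children: List["Category"] = field(default_factory=list)

    def add_subcategory(self, name: str) -> "Category":
        """Create a child category under this one and return it."""
        child = Category(name=name, parent=self)
        self.children.append(child)
        return child

    def path(self) -> str:
        """Return the full path from the top-level category down to this one."""
        parts = []
        node: Optional["Category"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " / ".join(reversed(parts))


personal = Category("Personal Data")
residence = personal.add_subcategory("Residence Information")
print(residence.path())  # Personal Data / Residence Information
```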

Add data classes

When you profile your assets, a data class is inferred from the contents where possible. You’ll see more on this later. You can also add your own data classes.

  1. Add a data class for your assets by choosing Organize, Data and AI Governance, then Data classes from the left menu. Then, click New data class, then Create new data class.

    organize data classes

  2. Give your new data class a name, for example, alphanumeric, and an optional Primary category or description. Click Save as draft.

    new data class

After the data class is created, you can add Stewards for this class, and also associate classifications and business terms if they are available (you might not have any business terms yet). When you are ready, click Publish.

tools for data class

publish comment data class

Now, let’s add that data class to a column in your applicant_personal_data.csv asset.

  1. Go back to your CreditDataCatalog catalog and open it. Select Organize, All catalogs, then choose CreditDataCatalog. Under the Browse assets tab, click the applicant_personal_data.csv data set to get the column/row preview. Scroll to the right to get to the CustomerID column, then click the down arrow next to “Customer Number”, and select View all.

    change data class

  2. In the window that opens, search for your newly created data class, alphanumeric, and click it when it appears in the search results. Then, click Select.

    Set column to numerical data class
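
To build some intuition for how a data class might be inferred from a column's contents, here is a minimal Python sketch. It is not the profiling algorithm that Watson Knowledge Catalog uses, which is not described in this tutorial. The patterns, the 0.8 threshold, and the infer_data_class function are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, checked in order. These are illustrative only and
# are not the classifiers that Watson Knowledge Catalog ships with.
DATA_CLASS_PATTERNS = {
    "Email Address": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    # At least one letter and one digit, and nothing but letters and digits.
    "alphanumeric": re.compile(r"(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9]+"),
    "integer": re.compile(r"\d+"),
}


def infer_data_class(values, threshold=0.8):
    """Return the first class matching at least `threshold` of the non-empty
    values, or None if no class qualifies."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    for name, pattern in DATA_CLASS_PATTERNS.items():
        matches = sum(1 for v in non_empty if pattern.fullmatch(v))
        if matches / len(non_empty) >= threshold:
            return name
    return None


print(infer_data_class(["C001", "C002", "C003"]))               # alphanumeric
print(infer_data_class(["34", "51", "27"]))                     # integer
print(infer_data_class(["ana@example.com", "raj@example.com"]))  # Email Address
```

A threshold below 1.0 lets a column keep its class even when a few values are malformed, which is one plausible reason an inferred class can be wrong and need the manual override shown in the steps above.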

Add business terms

You can use business terms to standardize definitions of business concepts so that your data is described in a uniform and easily understood way across your enterprise.

You already saw how to create a category and make it a business term. You also can create the business term as its own entity.

  1. Choose Organize, Data and AI Governance, then Business terms from the left menu.

    organize Data Business terms

  2. Click the New business term drop-down menu at the upper-right, and choose Create new business term.

    create business term

  3. Give the new business term a name such as Contact Information and an optional description, and click Save as draft.

    name new business term

  4. A window opens after the term is created. You can see a rich set of options for creating related terms and adding other metadata. For now, click Publish to make this term available to users of the platform.

    publish business term

  5. Add an optional comment, and click Publish in the new window.

    verify publish business term

  6. Go back to your CreditDataCatalog catalog, and open it to the column view. Select Organize, All catalogs, and choose CreditDataCatalog.

  7. Under the Browse assets tab, click the applicant_personal_data.csv data set to get the column/row preview. Scroll right to get to the Email column, and click the Column information icon (it looks like an eye).

    choose TotalCharges column information

  8. In the window that opens, click the edit icon (it looks like a pencil) next to Business terms.

    edit business terms

  9. Enter Contact Information under Business terms to search for the term. Click the Contact Information term that is found, then click Apply.

    edit business terms

Close that window after the term has been applied.

Now, do the same steps to add the Contact Information business term to the Telephone column.

You can now search for these terms from within the platform. For example, go back to the top level of CreditDataCatalog and, in the search bar with the prompt “What assets are you searching for?”, enter your Contact Information term.

search using business terms

The applicant_personal_data.csv data set shows up because it contains columns that are tagged with the Contact Information business term.
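
The relationship between business terms, columns, and search results can be sketched in a few lines of Python. This is a hypothetical in-memory model for illustration only. The catalog dictionary, the second asset, and the search_by_term function are assumptions, not the platform's search implementation.

```python
# Hypothetical model: asset name -> column name -> set of business terms.
catalog = {
    "applicant_personal_data.csv": {
        "CustomerID": set(),
        "Email": {"Contact Information"},
        "Telephone": {"Contact Information"},
    },
    # Invented second asset with no tagged columns, for contrast.
    "other_asset.csv": {
        "Amount": set(),
    },
}


def search_by_term(catalog, term):
    """Return {asset: [columns]} for every asset that has at least one
    column tagged with the given business term."""
    hits = {}
    for asset, columns in catalog.items():
        tagged = sorted(col for col, terms in columns.items() if term in terms)
        if tagged:
            hits[asset] = tagged
    return hits


print(search_by_term(catalog, "Contact Information"))
# {'applicant_personal_data.csv': ['Email', 'Telephone']}
```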

Add rules for policies

You can now create rules to control how a user can access the data.

  1. Create a business term called CustomerID, and assign it to your CustomerID column in the data set using the instructions given previously. See the following if you need details, but try it yourself first. Skip to the Adding a rule section if you do not need a reminder.

Review: How to create a business term

  1. Choose Organize, Data and AI Governance, then Business terms from the left menu.

  2. Click + Create Business term.

  3. Give the new business term the name CustomerID, and add an optional description. Then, click Save as draft. In the next window, click Publish.

  4. Go back to your CreditDataCatalog catalog and open it to the column view. Select Organize, All catalogs, then choose CreditDataCatalog.

  5. Under the Browse assets tab, click the applicant_personal_data.csv data set to get the column/row preview. Scroll right to get to the CustomerID column, and click the Column information icon (it looks like an eye).

  6. In the window that opens, click the edit icon (it looks like a pencil) next to Business terms.

  7. Enter CustomerID under Business terms to search for the term. Click the CustomerID term that is found, and click Apply.

Adding a rule

  1. Choose Organize, Data and AI Governance, then Rules from the left menu.

    select rule

  2. Click New rule, Create new rule.

    create rule

  3. In the New rule window, under Select the type of rule to create, choose Data protection rule.

    data protection

  4. Under Details, give your rule a Name, Type = Access, and a Business definition.

  5. Under Rule builder, set Condition1 to: If Business term Contains any CustomerID. For the Action, choose to mask data in columns containing alphanumeric. Choose the tile for Substitute, which replaces each value with a non-identifiable hash. This obscures the actual CustomerID but still allows actions like database joins to work. Click Create.

    define rule for masking customerID

  6. Now, if you go back to the CustomerID column of your applicant_personal_data.csv asset in the catalog, it looks the same as before. However, a non-admin user sees the “lock” icon, and the CustomerID values are replaced with hash values.

    customerID is now masked

  7. Back in the CreditDataCatalog, under the applicant_personal_data.csv asset, go to the Overview tab and scroll to the Age column. Click the down arrow, and you can see that the data has been inferred to be classified as a Code.

    TotalCharges classified as Quantity

  8. Change the classifier by clicking View all, and start typing Age. When Age appears in the search results, click Use. Then, click Close.

    TotalCharges classified as Quantity

  9. You can build a rule to Obfuscate the Age column.

    TotalCharges obfuscate rule

    Now that column has data that is replaced with similarly formatted data.

    TotalCharges column obfuscated

  10. Finally, click lineage to see the history of changes you’ve made to the data asset.

    click lineage

You can see the events that are associated with the applicant_personal_data.csv data asset, for example, when you updated the Age column with the Age data class.

details
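
To illustrate why the Substitute method still allows joins while the Obfuscate method only preserves format, here is a minimal Python sketch. It is not the masking implementation that Watson Knowledge Catalog uses. The salted SHA-256 substitution, the character-by-character obfuscation, the function names, and the loans table are all assumptions chosen for illustration.

```python
import hashlib
import random
import string


def substitute(value: str, salt: str = "demo-salt") -> str:
    """Deterministically replace a value with a hash-derived token.
    The same input always gives the same token, so equality joins still work."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def obfuscate(value: str, rng: random.Random) -> str:
    """Replace digits with random digits and letters with random letters of
    the same case, keeping length and punctuation (similarly formatted data)."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)
    return "".join(out)


# Invented sample rows; the loans table is a hypothetical second data set.
applicants = [{"CustomerID": "C001", "Age": "34"}, {"CustomerID": "C002", "Age": "51"}]
loans = [{"CustomerID": "C001", "Amount": "1000"}]

rng = random.Random(0)  # seeded only to make the demo repeatable
masked_applicants = [
    {"CustomerID": substitute(r["CustomerID"]), "Age": obfuscate(r["Age"], rng)}
    for r in applicants
]
masked_loans = [
    {"CustomerID": substitute(r["CustomerID"]), "Amount": r["Amount"]}
    for r in loans
]

# The join on the substituted key finds the same match as on the original key.
joined = [
    (a, l)
    for a in masked_applicants
    for l in masked_loans
    if a["CustomerID"] == l["CustomerID"]
]
print(len(joined))  # 1
```

The design point is that substitution is deterministic, so masked keys remain comparable across data sets, whereas obfuscation is random and is useful only when downstream users need realistic-looking values rather than linkable ones.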

Conclusion

In this tutorial, you learned how to:

  • Set up the catalog and data
  • Add collaborators and control access
  • Add categories
  • Add data classes
  • Add business terms
  • Add rules for policies

This tutorial is part of the Modernizing your bank loan department series, a solution that shows how to automate and enhance loan transaction processes with AI. While this tutorial focuses on data preprocessing and governance, the other tutorials and code patterns in the solution focus on creating a machine learning model, explaining the model outcomes, and retraining the model to increase accuracy.