
Incorporate enterprise governance in your data

Governance is the process of curating, enriching, and controlling your data. You govern your data with governance artifacts, and you organize and control access to governance artifacts with categories. In a real life scenario, the governance artifacts are set up by the governance team of the organization.

In this tutorial, you will create the categories and the governance artifacts in IBM Cloud Pak for Data that are required for a synthetic patient healthcare dataset created using Synthea.

Learning objectives

In this tutorial, you will learn about and create these governance artifacts:

  • Categories
  • Business terms
  • Reference data
  • Data classes
  • Classifications
  • Policies
  • Governance rules

Prerequisites

Estimated time

This tutorial will take approximately 60 minutes to complete.

Categories

A category is used to organize your governance artifacts, just like a folder or a directory. Grouping governance artifacts into categories makes it easier to find them, control their visibility, and manage them. You can also use categories to specify which users can view and manage the artifacts within the category. A category can have subcategories, but a subcategory can have only one direct parent category.
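The single-parent rule can be pictured as a simple tree. The following is a minimal sketch in Python, not the product's data model: the `Category` class and the category names are hypothetical and only illustrate that each subcategory has exactly one direct parent.

```python
class Category:
    """Hypothetical in-memory model of a governance category."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent          # at most one direct parent
        self.subcategories = []

    def add_subcategory(self, name):
        # A new subcategory is always attached to exactly one parent.
        child = Category(name, parent=self)
        self.subcategories.append(child)
        return child

    def path(self):
        # Walk up the single-parent chain to build a folder-like path.
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " / ".join(reversed(parts))


top = Category("Healthcare Data")             # illustrative name
sub = top.add_subcategory("Patient records")  # illustrative name
print(sub.path())
```

Because each node stores one parent, every category has a single unambiguous path, much like a file in a directory tree.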

You must have the Manage governance categories user permission on IBM Cloud Pak for Data to create top-level categories.

To create a subcategory within a top-level category, you must have the Access governance artifacts and Manage governance categories user permissions on IBM Cloud Pak for Data. Additionally, you must have an Admin or Owner category collaborator role in the parent category.

NOTE: No subcategories can be created within the pre-defined [uncategorized] category.

You must have the Admin role to create categories.

Step 1. Download the categories file

Step 2. Import categories

  • Open a browser and navigate to your IBM Cloud Pak for Data instance. Log in as a user with admin privileges.

  • Go to the hamburger (☰) menu in the upper-left corner, expand Governance, and click Categories.

  • Click Add category > Import from file.

  • Click Add file and select the Healthcare_Data-category-csv-export.csv file you downloaded earlier, then click Next.

  • Select Replace all values, then click Import.

The categories will be imported from the file. Once the import is successful, you will see the Import summary modal that says, “The import completed successfully.” It will also show you that one new category was created and no errors were encountered. Click Close to go back.

Business terms

Business terms are used to standardize the definitions of business concepts so that the data is described in a uniform manner across the organization. Business terms can be used to annotate columns with different column names, all of which have the same type of data as defined by the business term.

You must have the Admin, Data Engineer, Data Steward, or Data Quality Analyst role to create business terms.

Step 1. Download the business terms file

Step 2. Import business terms

  • Go to the hamburger (☰) menu in the upper-left corner, expand Governance, and click Business terms.

  • Click Add business term, then click Import from file.

  • Click Add file and select the Healthcare_Data-glossary_terms-csv-export.csv file you downloaded earlier, then click Next.

  • Select Replace all values and click Import.

The business terms will be imported from the file. Once the import is successful, you will see the Import summary modal that says, “The import completed successfully.” It will also show you that 112 new drafts for business terms were created and no errors were encountered. Click Go to task.

Step 3. Publish business terms

  • You will be taken to the Task inbox where you can see that a Publish Business terms task has been assigned to you. Click Publish to publish the draft business terms.

Once the business terms have been published, you will see a notification that says that you have completed the task and that the admin has been notified.

Reference data

Reference data sets are used to provide logical groupings of code values such as product codes, country codes, or in the healthcare domain, condition codes and medication codes. These are typically sets of allowed values for data fields, which can be used as matching patterns for data classes and to assign business terms.

You must have the Access governance artifacts user permission on IBM Cloud Pak for Data in order to create, edit, or delete reference data sets. Additionally, you must also have the Admin, Owner, or Editor category collaborator roles in the primary category for the reference data set.

Step 1. Download the reference data files

Step 2. Import reference data sets

  • Go to the hamburger (☰) menu in the upper-left corner, expand Governance, and click Reference data.

  • Click Add reference data set > Import from file.

  • Click Add file and select the Healthcare_Data-reference_data-csv-export.csv file you downloaded earlier, then click Next.

  • Select Replace all values and click Import.

The reference data sets will be imported from the file. Once the import is successful, you will see the Import summary modal that says, “The import completed successfully.” It will also show you that 10 new drafts for reference data were created and no errors were encountered. Click Go to task.

Step 3. Populate the reference data sets with reference data

  • You will be taken to the Task inbox where you can see that a Publish Reference data sets task has been assigned to you. Look for the Encounter classes row in the table and click See details for it.

  • The Encounter classes reference data set will be loaded on the screen. Click on the overflow menu, then select Upload file.

  • Upload the Encounter classes-reference_data_set-csv-export.csv file from the extracted contents of Healthcare_Data-reference_data_sets.zip either by dragging and dropping the file or by browsing to the downloaded location, then click Next.

  • On the next screen, you need to map the columns in the CSV file to target columns. Select Code, Value, Description, Parent, and Related terms columns in the drop-downs for code, value, description, parent, and related terms, respectively. Click Save.

  • You will see a notification that says the file has successfully been submitted for import.

  • Go back to the task inbox by going to the upper-left hamburger (☰) menu and clicking on Task inbox. Repeat the process of populating the reference data sets for Encounter codes and Condition codes using the Encounter codes-reference_data_set-csv-export.csv and Condition codes-reference_data_set-csv-export.csv files, respectively.

NOTE: For the extended version of this tutorial, populate each of the reference data sets listed in the task. Use the following table to choose the csv file for each reference data set:

Reference data set      CSV file
Allergy codes           Allergy codes-reference_data_set-csv-export.csv
Careplan codes          Careplan codes-reference_data_set-csv-export.csv
Condition codes         Condition codes-reference_data_set-csv-export.csv
Encounter classes       Encounter classes-reference_data_set-csv-export.csv
Encounter codes         Encounter codes-reference_data_set-csv-export.csv
Immunization codes      Immunization codes-reference_data_set-csv-export.csv
Medication codes        Medication codes-reference_data_set-csv-export.csv
Observation codes       Observation codes-reference_data_set-csv-export.csv
Procedure codes         Procedure codes-reference_data_set-csv-export.csv
Provider specialties    Provider specialties-reference_data_set-csv-export.csv
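Every file name in the table follows the same pattern: the reference data set name followed by -reference_data_set-csv-export.csv. The sketch below uses that pattern to check that all ten files are present before you start uploading. The helper names and the folder path are assumptions for illustration and are not part of the tutorial's assets.

```python
from pathlib import Path

# Reference data set names, taken from the table above.
REFERENCE_DATA_SETS = [
    "Allergy codes",
    "Careplan codes",
    "Condition codes",
    "Encounter classes",
    "Encounter codes",
    "Immunization codes",
    "Medication codes",
    "Observation codes",
    "Procedure codes",
    "Provider specialties",
]


def csv_file_for(reference_data_set):
    """Build the CSV file name that the table pairs with a reference data set."""
    return f"{reference_data_set}-reference_data_set-csv-export.csv"


def missing_files(folder):
    """Return the expected CSV file names that are not found in `folder`."""
    folder = Path(folder)
    return [
        csv_file_for(name)
        for name in REFERENCE_DATA_SETS
        if not (folder / csv_file_for(name)).is_file()
    ]


# Hypothetical location of the extracted zip contents; adjust to your setup.
print(missing_files("Healthcare_Data-reference_data_sets"))
```

An empty list means every file in the table was found in the folder.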

Step 4. Publish reference data

  • Once the reference data sets have been populated, you can publish the reference data sets. Go back to the task inbox, and publish the reference data sets draft by clicking Publish.

  • Once the reference data sets have been published, you will see a notification that says that you have completed the task and that the admin has been notified.

Data classes

Data classes are used to describe the type of data contained in data assets — for example, data fields or table columns such as city, account number, or social security number. Watson Knowledge Catalog provides a set of predefined data classes. You can also create custom data classes and use matching logic such as lists of valid values, reference data, or regular expressions to specify how to classify data automatically. You can also associate related governance artifacts such as classifications and business terms to the data classes. The business terms are then suggested for the data assets when the data class is assigned to the data assets.

You must have the Access governance artifacts user permission on IBM Cloud Pak for Data in order to create, edit, or delete data classes. Additionally, you must also have the Admin, Owner, or Editor category collaborator roles in the primary category for the data class.

Step 1. Download the data classes file

Step 2. Import data classes

  • Go to the hamburger (☰) menu in the upper-left corner, expand Governance, and click Data classes.

  • Click Add data class > Import from file.

  • Click Add file and select the Healthcare_Data-data_class-csv-export.csv file downloaded earlier, then click Next.

  • Select Replace all values and click Import.

The data classes will be imported from the file. Once the import is successful, you will see the Import summary modal that says, “The import completed successfully.” It will also show you that 16 new drafts for data classes were created and no errors were encountered. Click Go to task.

Step 3. Add reference data (matching method) to data classes

  • You will be taken to the Task inbox where you can see that a Publish Data classes task has been assigned to you. Look for the Encounter class row in the table and click See details for it.

  • The Encounter class data class will be loaded on the screen. Click on the + (Add) button under Matching method toward the bottom of the screen.

  • In the modal window, select Match to reference data, then click Next.

  • Look for Encounter classes in the list of reference data sets. You can also search for it using the search bar. Click on the Encounter classes reference data set to select it, then click Next.

  • On the next screen, click Save.

  • You will see a notification that says the changes have been saved.

  • Go back to the task inbox by going to the upper-left hamburger (☰) menu and clicking on Task inbox. Repeat the steps to update the Encounter code and Condition code data classes and add the matching method to match using reference data. Choose the Encounter codes and Condition codes reference data sets, respectively.

NOTE: For the extended version of this tutorial, update the following data classes listed in the task to match using the given reference data sets:

Data class            Reference data set
Allergy code          Allergy codes
Careplan code         Careplan codes
Condition code        Condition codes
Encounter class       Encounter classes
Encounter code        Encounter codes
Immunization code     Immunization codes
Medication code       Medication codes
Observation code      Observation codes
Procedure code        Procedure codes
Provider specialty    Provider specialties
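Conceptually, matching a data class to reference data means checking whether a column's values appear among the codes of the reference data set. The sketch below illustrates that idea only. The function name, the sample codes, and the scoring by fraction are assumptions, and the product's actual matching logic and thresholds may differ.

```python
def match_fraction(column_values, reference_codes):
    """Return the fraction of non-empty column values found in the reference codes."""
    non_empty = [
        str(value).strip()
        for value in column_values
        if value is not None and str(value).strip() != ""
    ]
    if not non_empty:
        return 0.0
    hits = sum(1 for value in non_empty if value in reference_codes)
    return hits / len(non_empty)


# Illustrative codes only; the real values come from the uploaded
# Encounter classes reference data set.
encounter_classes = {"ambulatory", "emergency", "inpatient", "outpatient"}

# Hypothetical column sample: three known codes and one unknown value.
sample_column = ["ambulatory", "inpatient", "emergency", "unknown"]
print(match_fraction(sample_column, encounter_classes))
```

A column whose values mostly fall inside the reference data set is a good candidate for the corresponding data class, which is why the reference data sets are populated before the data classes are published.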

Step 4. Add list of values (matching method) to data classes

  • Back on the Task inbox, click See details for the Ethnicity (hispanic/non-hispanic) data class.

  • The Ethnicity (hispanic/non-hispanic) data class will be loaded on the screen. Click on the + (Add) button under Matching method toward the bottom of the screen.

  • In the modal window, select Match to list of valid values, then click Next.

  • Select List of valid values and type hispanic under List of valid values. Click Add valid value to add one more space to enter valid values, type in non hispanic, then click Next.

  • On the next screen, add [Ee]thnic|ETHNIC under Column name criteria and click Save.

  • You will see a notification that says the changes have been saved.

  • Go back to the task inbox by going to the upper-left hamburger (☰) menu and clicking on Task inbox. Repeat the process of updating the matching method to use a list of valid values for the Race data class.

  • Provide asian, black, native, other, and white as the valid values. Add [rR]ac(e|ial)|RAC(E|IAL) under Column name criteria.
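The two criteria configured in this step can be tried out locally before saving them. The sketch below checks column names against the column name criteria and column values against the list of valid values. It assumes that the column name expression may match anywhere in the name (Python's `re.search`) and that values are compared exactly; both are assumptions about the product's behavior, and the helper names are hypothetical.

```python
import re

# Valid values and column name criteria as entered in the steps above.
ETHNICITY_VALUES = {"hispanic", "non hispanic"}
ETHNICITY_COLUMN = re.compile(r"[Ee]thnic|ETHNIC")

RACE_VALUES = {"asian", "black", "native", "other", "white"}
RACE_COLUMN = re.compile(r"[rR]ac(e|ial)|RAC(E|IAL)")


def column_name_matches(pattern, column_name):
    """True if the criteria match anywhere in the column name (assumed semantics)."""
    return pattern.search(column_name) is not None


def all_values_valid(values, valid_values):
    """True if every value is exactly one of the valid values."""
    return all(value in valid_values for value in values)


print(column_name_matches(ETHNICITY_COLUMN, "ETHNICITY"))
print(column_name_matches(RACE_COLUMN, "patient_race"))
print(all_values_valid(["white", "asian", "other"], RACE_VALUES))
```

Under the substring assumption, a short expression such as the one for Race would also match an unrelated column name like bracelet_size, so it is worth testing the criteria against the real column names in your data.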

Step 5. Add regular expression (matching method) to data classes

  • Back on the Task inbox, click See details for the Passport data class.

  • The Passport data class will be loaded on the screen. Click on the + (Add) button under Matching method toward the bottom of the screen.

  • In the modal window, select Match to criteria in regular expression, then click Next.

  • Provide ^[A-Z0-9]{6,9}$ as the Match criteria for column value, then click Next.

  • On the next screen, provide the Column name criteria as [pP]assport|PASSPORT|[iI]d|ID and click Save.

  • You will see a notification that says the changes have been saved.

  • Go back to the task inbox by going to the upper-left hamburger (☰) menu and clicking on Task inbox.

  • Repeat the process of updating the matching method to use regular expressions for the data classes listed in the table below. Use the table to choose the regular expression and the column name matching criteria for each data class:

Data class  Regular expression  Column name matching criteria
Passport    ^[A-Z0-9]{6,9}$     [pP]assport|PASSPORT|[iI]d|ID
Numeric     ^\d+$
Timestamp   ^(19|20)[0-9]{2}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1])T([0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]Z$
UUID        ^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$
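Before entering the expressions from the table, it can help to try them against a few sample values. The sketch below does this with Python's `re` module. The sample values are made up for illustration, and Python's regular expression dialect may differ in small details from the engine the product uses.

```python
import re

# Column value expressions copied from the table above.
VALUE_PATTERNS = {
    "Passport": r"^[A-Z0-9]{6,9}$",
    "Numeric": r"^\d+$",
    "Timestamp": (
        r"^(19|20)[0-9]{2}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1])"
        r"T([0-1][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]Z$"
    ),
    "UUID": r"^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$",
}


def matching_classes(value):
    """Return the names of all data classes whose expression matches the value."""
    return [
        name
        for name, pattern in VALUE_PATTERNS.items()
        if re.search(pattern, value)
    ]


# Made-up sample values.
for sample in [
    "X1234567",
    "12345678",
    "2020-01-31T23:59:59Z",
    "123e4567-e89b-12d3-a456-426614174000",
]:
    print(sample, matching_classes(sample))
```

Running the samples shows that an eight-digit number satisfies both the Passport and the Numeric expressions, which is one reason the Passport data class is also given column name criteria. Note also that the UUID expression accepts lowercase hexadecimal digits only, and that the Timestamp expression checks the shape of the value but not whether the date exists on the calendar.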

Step 6. Publish data classes

  • Once the matching methods for all the data classes have been specified, you can publish the data classes. Go back to the Task inbox, and publish the data classes draft by clicking on Publish.

  • Once the data classes have been published, you will see a notification that says that you have completed the task and that the admin has been notified.

Classifications

Classifications are used to classify assets based on their level of sensitivity or confidentiality for the organization. Unlike data classes which include logic to match data values, classifications are more like labels.

Watson Knowledge Catalog includes three pre-defined, commonly used classifications:

  • Personally Identifiable Information is used for any data that can be used to distinguish one person from another and could potentially identify a specific individual.
  • Sensitive Personal Information is used for information relating to an individual with regard to racial or ethnic origin, political opinions, religious beliefs or other beliefs of a similar nature, trade union membership, physical or mental health or condition, sexual life, or any criminal or alleged criminal history of a person.
  • Confidential is used for any data that can cause significant and/or long-term harm to the institution and/or individuals whose data it is, if the data is inappropriately accessed.

You must have the Access governance artifacts user permission on IBM Cloud Pak for Data in order to create, edit, or delete classifications. Additionally, you must also have the Admin, Owner, or Editor category collaborator roles in the primary category for the classification.

NOTE: Watson Knowledge Catalog provides the ability to create or import user-defined classifications. However, for the Healthcare use case, you will only need the pre-defined classifications.

Policies

Policies are natural-language documents that describe the guidelines, regulations, standards, or procedures that your organization needs to follow to ensure that its data and information assets are properly managed and used. Each policy describes a governance subject area.

Policies can further contain multiple subpolicies, which pertain to a specific area within the broad definition of the parent policy. They can also reference rules that lay out specific details to make the data and asset resources compliant with corporate objectives. Policies can be organized in a hierarchy based on their meaning and relationships to each other. Policies can only be applied to data in relational data sets.

You must have the Access governance artifacts user permission on IBM Cloud Pak for Data in order to create, edit, or delete a policy. Additionally, you must also have the Admin, Owner, or Editor category collaborator roles in the primary category for the policy.

Step 1. Download the policies file

Step 2. Import policies

  • Go to the hamburger (☰) menu in the upper-left corner, expand Governance, and click Policies.

  • Click Add policy > Import from file.

  • Click Add file and select the Healthcare_Data-policy-csv-export.csv file downloaded earlier, then click Next.

  • Select Replace all values and click Import.

The policies will be imported from the file. Once the import is successful, you will see the Import summary modal that says, “The import completed successfully.” It will also show you that three new drafts for policies were created and no errors were encountered. Click Go to task.

Step 3. Publish policies

  • You will be taken to the Task inbox where you can see that a Publish Policies task has been assigned to you. Click Publish to publish the draft policies.

Once the policies have been published, you will see a notification that says that you have completed the task and that the admin has been notified.

Governance rules

Governance rules provide the business description of the required behavior or actions that need to be taken in order to implement a governance policy. Governance rules are descriptive rules written in natural language and cannot be enforced.

A governance rule essentially consists of a name and a text description. Like all other governance artifacts, it is housed within a primary category. It can reference governance policies as well as other rules. Related rule relationships are bidirectional, meaning if Governance_rule_1 is related to Governance_rule_2, then Governance_rule_2 is automatically related to Governance_rule_1.
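The bidirectional behavior of related rules can be pictured as a symmetric relation: recording the link once makes it visible from both sides. The sketch below is an illustration only; the `RuleRelations` class is hypothetical and does not reflect how the product stores relationships.

```python
from collections import defaultdict


class RuleRelations:
    """Hypothetical store for symmetric 'related rule' links."""

    def __init__(self):
        self._related = defaultdict(set)

    def relate(self, rule_a, rule_b):
        # Record the link in both directions so it is always symmetric.
        self._related[rule_a].add(rule_b)
        self._related[rule_b].add(rule_a)

    def related_to(self, rule):
        # Return a copy so callers cannot modify the internal state.
        return set(self._related.get(rule, set()))


relations = RuleRelations()
relations.relate("Governance_rule_1", "Governance_rule_2")
print(relations.related_to("Governance_rule_2"))
```

Only one call to `relate` is made, yet looking up either rule returns the other, which mirrors the behavior described above.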

To create a governance rule, you need to have the Access governance artifacts user permission on IBM Cloud Pak for Data. In addition, you must have an Admin, Owner, or Editor role in the primary category for the governance rule.

Step 1. Download the governance rules file

Step 2. Import governance rules

  • Go to the hamburger (☰) menu in the upper-left corner, expand Governance, and click Rules.

  • Click Add rule > Import from file.

  • Click Add file and select the Healthcare_Data-rule-csv-export.csv file downloaded earlier, then click Next.

  • Select Replace all values and click Import.

The rules will be imported from the file. Once the import is successful, you will see the Import summary modal that says, “The import completed successfully.” It will also show you that four new drafts for governance rules were created and no errors were encountered. Click Go to task.

Step 3. Add parent policy to governance rules

  • You will be taken to the Task inbox where you can see that a Publish Governance rules task has been assigned to you. Look for the Mask Sensitive personal information row in the table and click See details for it.

  • The Mask Sensitive personal information rule will be loaded on the screen. Click on the Add policy + button under Parent policies.

  • In the modal window, select Protect Sensitive Personal Information, then click Add.

  • You will see a notification that says the changes have been saved. Once the parent policy has been updated, go back to the task inbox by going to the upper-left hamburger (☰) menu and clicking Task inbox.

  • Repeat the process of adding the parent policy for the three remaining rules in the task. Select the Protect Personal Identifiable Information policy for each rule.

Step 4. Publish rules

  • Once the policies for all the rules have been specified, you can publish the rules. Go back to the Task inbox, and publish the rules draft by clicking on Publish.

  • Once the rules have been published, you will see a notification that says that you have completed the task and that the admin has been notified.

Summary

In this tutorial, you learned about the governance artifacts (categories, business terms, reference data sets, data classes, classifications, policies, and governance rules) and created the ones required for the healthcare use case. You can read the IBM Cloud Pak for Data documentation to learn more about governance artifacts.

This tutorial is part of the An introduction to the DataOps discipline series. To continue the series, take a look at the next tutorial titled Learn to discover data that resides in your data sources, to see how you can discover data assets within your data sources and how Watson Knowledge Catalog automatically assigns governance artifacts to these discovered assets.