In the Learn to discover data that resides in your data sources tutorial, you saw how data can be discovered using the data discovery feature of Watson Knowledge Catalog. In this tutorial, you will work with the discovered data and learn how to associate governance artifacts with these discovered assets. You will identify the primary keys for the data assets and perform relationship analysis in order to identify foreign key relationships between the assets. You will look at data quality analysis results and observe what affects the quality scores of data assets. You will also learn to enforce rules that you expect the data assets to follow.
In this tutorial, you will:
- Review and update data classes, business terms, and keys for the data assets
- Run relationship analysis to identify foreign key relationships between the data assets
- Review data quality dimensions violations
- Add rules to improve data quality
- Perform re-analysis of the data assets and observe changes in data quality
Prerequisites
- IBM Cloud Pak for Data
- Watson Knowledge Catalog on Cloud Pak for Data
- The steps in Learn to discover data that resides in your data sources must be completed.
This tutorial will take approximately 60 minutes to complete.
Step 1. Review and update data classes, business terms, and keys for the data assets
Log into your IBM Cloud Pak for Data instance.
Navigate to the upper-left hamburger (☰) menu, expand Governance, and click on Data quality. Click on the tile for your data quality project HealthcareAnalysis. This is the project that contains the results from the automated discovery process that was performed in Learn to discover data that resides in your data sources.
On the Data assets tab, click on PATIENTS to open the PATIENTS data asset.
The Columns tab for the PATIENTS data asset shows information about the columns in the asset. You can see that the analysis process has identified and auto-assigned data classes and business terms for some of the columns. To edit these values, click Edit, then click on the first column: ID.
Go to the Data classes tab. You will see that the data classes Text and UUID were identified for this column, and that both of these data classes have a confidence of 100%. Of these, the UUID data class was detected as the data class for this column during analysis and, therefore, you can see UUID listed under Detected Data Class. The data class was correctly identified in this case, but if it had not been, you could use the drop-down under Selected Data Class to choose the data class you wish to assign to this column.
Go to the Governance tab. You can see that a number of business terms have been suggested, but none have been assigned to the ID column. This is because none of the business terms has a confidence of over 80%. You can click on the checkmark next to Patient ID to assign this business term to the ID column. You can also use the search bar to search for terms in case they aren’t assigned or suggested. You can also click on the cross next to the other suggested terms to mark them as Rejected. Marking terms as Assigned or Rejected ensures that the terms remain as assigned or rejected after any future analyses of the data.
NOTE: While you can skip rejecting incorrectly suggested terms, as in the case of this column, rejecting incorrect terms is especially important when their confidence is over 80%, because such terms are automatically assigned to the column. In these cases, rejection ensures that the rejected terms are not assigned to this column on future analyses, even if the confidence remains over 80%.
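The assignment behavior described above can be sketched in plain Python. This is an illustration only: the 80% threshold comes from the text, but the function name, data shapes, and override logic are assumptions, not the product's actual API.

```python
AUTO_ASSIGN_THRESHOLD = 0.80  # confidence cutoff for auto-assignment (from the tutorial text)

def resolve_terms(suggestions, rejected=frozenset(), assigned=frozenset()):
    """Decide which suggested business terms end up assigned to a column.

    suggestions: dict mapping term name -> confidence (0.0 to 1.0).
    rejected/assigned: terms a user explicitly marked; these always
    override confidence-based suggestions on future analyses.
    """
    result = set(assigned)
    for term, confidence in suggestions.items():
        if term in rejected or term in assigned:
            continue  # explicit user decisions win over confidence
        if confidence > AUTO_ASSIGN_THRESHOLD:
            result.add(term)  # high confidence: auto-assigned
    return result
```

For example, a term suggested with 0.85 confidence is assigned automatically, but stays unassigned if it was previously rejected, which mirrors why rejecting high-confidence incorrect terms matters.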
Now that the business terms have been accepted and rejected, you can repeat the steps of correcting the data class and business terms for the other columns by going into the columns using the links on the left. Refer to the PATIENTS section of the Data asset annotations file to assign data classes and business terms to the columns in the PATIENTS data asset. Once you have completed the steps for all the remaining columns in the PATIENTS data asset, go back to the PATIENTS data asset using the breadcrumbs at the top.
Go to the Keys tab. You can see a list of columns that were identified as candidate keys for the PATIENTS data asset. Check the box against the ID column and click on Mark as selected. In the pop-up window, click Mark to confirm. This will make sure the system knows that the ID column is to be used as a primary key for the PATIENTS data asset. You can select the other identified candidate keys and click Mark as rejected in order to reject these keys. This will ensure that when you re-analyze the PATIENTS data asset in the future, the rejected keys will not be selected as candidate keys.
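The property that makes a column a candidate key can be sketched as a simple check (an illustration only, not how the product implements it): every value must be present and no value may repeat.

```python
def is_candidate_key(values):
    """Return True when a column qualifies as a candidate primary key:
    no NULL (None) entries and no duplicated values."""
    if any(v is None for v in values):
        return False  # a key column cannot contain NULLs
    return len(set(values)) == len(values)  # all values must be unique
```

A column like ID, where every patient record carries a distinct identifier, passes this check; a column with repeats or missing entries does not.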
Click Done to save all the data class, business term, and key updates you have made to the PATIENTS asset, then click Analyze to re-analyze the asset based on these updates.
In the pop-up window, choose the checkbox for Analyze data quality, then click Analyze.
The analysis score will now be updated. (NOTE: you may need to refresh the page to see the new analysis results.) You can see that the data classes and business terms you had assigned have been retained. The ID column now shows a key icon in front of it to indicate that this is the primary key for the data asset. In the left panel, you can see that the Data quality score for the PATIENTS data asset has dropped. In the left bottom panel, where the column names are listed, you can see that the quality score for BIRTHDATE and CITY has dropped, which caused the drop in the overall quality score.
Go back to the project using the breadcrumbs. Update the data classes, business terms, and keys for the ENCOUNTERS data asset. (NOTE: For the extended version of this tutorial, update the governance artifacts for all 12 data assets.) Use the Data asset annotations file to learn which data class and business term to assign for each column of each data asset. The file also lists the primary keys for each data asset. For each data asset, perform re-analysis once the updates have been made.
Step 2. Run relationship analysis to identify foreign key relationships between the data assets
After you have selected the primary keys for your tables, you can run Relationship analysis to identify foreign key relationships between the data assets. In your project, go to the Relationships tab. Click the checkboxes against ENCOUNTERS and PATIENTS to select them, then click Run analysis.
In the pop-up window, choose Key relationship analysis and click on Analyze.
Wait for a minute or two for the analysis to complete. Click on the refresh icon to refresh the page. You will be able to see a chart and a table both showing the relationship between the PATIENTS and the ENCOUNTERS asset.
Scroll all the way to the end of the table. You can see that this relationship has been set to status Selected. Click on the overflow menu for the record. You will see the options to set the status as Candidate or Rejected and the option to delete the key relationship. This relationship needs to be set as Selected (which it already is). For any other relationships you generate, you can mark them as Selected or Rejected using the options in the overflow menu. This will ensure that future key relationship analyses will remember what you had set.
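The core idea behind key relationship analysis can be sketched as a containment check: a child column is a likely foreign key when (nearly) all of its non-null values appear in the parent's primary key column. The function below is an illustration under that assumption; the match threshold and signature are invented for the sketch, not taken from the product.

```python
def is_foreign_key_candidate(child_values, parent_key_values, min_match=0.98):
    """Flag a child column as a likely foreign key into a parent's primary key.

    min_match is an assumed tolerance: real analyzers typically allow a few
    orphaned values rather than demanding a perfect subset.
    """
    non_null = [v for v in child_values if v is not None]
    if not non_null:
        return False  # nothing to match against
    parent = set(parent_key_values)
    matches = sum(1 for v in non_null if v in parent)
    return matches / len(non_null) >= min_match
```

For instance, a PATIENT column in ENCOUNTERS whose values all appear in PATIENTS.ID would be flagged, while a column full of unrelated values would not.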
NOTE: For the extended version of this tutorial, run relationship analysis between the other data assets. The following table lists all the foreign key relationships between the 12 data assets in the Healthcare-Data.zip archive. Ensure that you mark these relationships as Selected and all others as Rejected.
| Parent data asset | Primary key | Child data asset | Foreign key |
| --- | --- | --- | --- |
- Make sure to click on the Customize display tab to specify which relationships you want to display on the screen. Select Selected and Candidate in the filter drop-down, click the checkbox to Select all data assets, then click Customize display.
Step 3. Review data quality dimensions violations
Go to the Data assets tab and click on PATIENTS to open the PATIENTS data asset.
Go to the Data quality tab. The data quality analysis process identifies quality problems with your data by analyzing quality dimensions and lists the violations that it observes on the Data quality tab. The number of violations found against each data quality dimension as well as the delta (percentage change in the number of violations between the last two analyses) is shown on the screen.
The first dimension in the list — Data class violations — shows the values in a column that do not match the detected data class of that column. Each value that violates the class is identified as a violation. Click on Data class violations to expand it. The column names that contain data class violations are shown, along with the number of violations found for each column. Click on the first column name in the list: CITY.
A pop-up window is displayed, showing all the records from the PATIENTS data asset that have data class violations in the CITY column. Scroll left to right to see the entire record, and scroll up and down to see more records. You can also use the Download button to download the records as a CSV file. Close the pop-up window by clicking on the X on the top right.
Likewise, you can expand and see records for the other columns listed under data class violations, such as ADDRESS and BIRTHDATE.
Expand and look through the other data quality dimensions on the screen.
- Suspect values identifies values that do not seem to match the majority of the other values in the column because their characteristics are different.
- Inconsistent capitalization identifies values where the usage of uppercases and lowercases is not consistent.
- Data type violations identifies values that do not conform to the inferred data type in length, precision, or scale, or violate the specified/identified data type.
- Duplicated values identifies duplicated values in columns where most of the values are unique.
- Values out of range identifies outliers in a column’s data. Any value that does not fall between the minimum and maximum values specified for the column is identified.
- Suspect values in correlated columns identifies columns that are correlated with other columns (one column’s value can be predicted using the other) and then uses that information to identify records that do not have the same correlation.
- Missing values looks for missing values in a column that has been defined as a non-nullable column.
- Inconsistent representation of missing values looks for varying representations of missing data (NA, NULL, blank fields, for example). A column that contains both null values and empty values suggests that there is no standardized way to represent missing values.
- Format violations identifies values that match a format that has been specified as invalid for a particular column in an analysis.
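A few of the dimensions above can be sketched as simple column checks. These are illustrative stand-ins for the concepts, not the product's detection logic; function names and signatures are invented for the sketch.

```python
from collections import Counter

def missing_values(values, nullable=False):
    """Missing values: count NULL/empty entries; only counted as
    violations when the column is defined as non-nullable."""
    if nullable:
        return 0
    return sum(1 for v in values if v is None or v == "")

def duplicated_values(values):
    """Duplicated values: count the extra occurrences of repeated values
    (meaningful in columns where most values are expected to be unique)."""
    return sum(n - 1 for n in Counter(values).values() if n > 1)

def values_out_of_range(values, minimum, maximum):
    """Values out of range: count values outside the [minimum, maximum]
    bounds specified for the column."""
    return sum(1 for v in values if not (minimum <= v <= maximum))
```

Each count feeds the per-dimension violation totals shown on the Data quality tab.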
You can also ignore any of the dimensions. As a result, the dimension’s violations will not affect the quality analysis score. To ignore a dimension, go into the Edit mode by clicking on Edit, then toggle the Ignore button against the dimension. Click on Done (which is shown in place of the Edit button) to save the changes, then Analyze data quality again to view the change in the quality score.
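The effect of ignoring a dimension can be pictured with a toy scoring function. This is purely illustrative: the real Watson Knowledge Catalog scoring formula is not described here, and this sketch simply treats the score as the fraction of checks that passed across the dimensions still considered.

```python
def quality_score(dimension_violations, total_values, ignored=frozenset()):
    """Toy quality score: 1 minus the failure rate over non-ignored dimensions.

    dimension_violations: dict mapping dimension name -> violation count.
    total_values: number of values checked per dimension (assumed uniform here).
    ignored: dimensions toggled to Ignore, which drop out of the score entirely.
    """
    considered = {d: v for d, v in dimension_violations.items() if d not in ignored}
    if not considered or total_values == 0:
        return 1.0  # nothing left to fail against
    failed = sum(considered.values())
    return max(0.0, 1 - failed / (total_values * len(considered)))
```

Ignoring a dimension removes both its violations and its checks from the calculation, which is why the score changes after you re-analyze.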
Step 4. Add rules
You can also make use of rules to ensure that your data is of high quality. These include:
- Data rule definitions, which are used to develop rule logic to analyze data. These are used as the basis for data rules and quality rules.
- Rule set definitions, which are collections of data rule definitions.
- Data rules and Quality rules, which evaluate and validate specific conditions associated with your data sources by binding data rule definitions to physical data.
- Rule sets, which are collections of data rules.
- Automation rules, which can be used to automate the process of governing your data.
Import data rule definitions
Download the HealthcareAnalysis-rules.xml file.
Back in your data quality project, go to the Data rules tab and click on the Import rules and definitions button.
In the pop-up window, click on Add file and choose the HealthcareAnalysis-rules.xml file you downloaded earlier. Once the file has been uploaded, click Next.
On the next screen, you can see three checkboxes, each representing a data rule definition, in the uploaded file. Ensure that all three are checked and click Import.
You will see a notification that the assets were imported successfully. Click Close.
Back on the Data rules tab, click the Refresh button to refresh the list of rules and expand All. You can see that three new records are now available under All.
Two of the newly added records are data rule definitions that relate to a patient's date of death: one verifies that the value of Date of Death is less than or equal to today (that is, the Date of Death cannot be a date in the future), and the other verifies that the Date of Death occurs on or after the Date of Birth. The third record is a data rule definition that verifies that at least one of the driver's license number or the passport number of a patient has been provided.
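The logic of the three imported rule definitions can be sketched as per-record predicates. The treatment of NULL (None) values here is an assumption made for the sketch; the actual rule definitions in the XML file govern how missing values are handled.

```python
from datetime import date

def dod_lte_today(date_of_death, today=None):
    """DoD_lte_Today: the Date of Death cannot be a date in the future.
    A missing date of death is treated as passing (assumption)."""
    today = today or date.today()
    return date_of_death is None or date_of_death <= today

def dod_gte_dob(date_of_death, date_of_birth):
    """DoD_gte_DoB: the Date of Death must occur on or after the Date of Birth.
    A missing date of death is treated as passing (assumption)."""
    return date_of_death is None or date_of_death >= date_of_birth

def at_least_one_of_dl_passport_exists(drivers_license, passport):
    """At_least_one_of_DL_Passport_exists: both identifiers cannot be missing."""
    return drivers_license is not None or passport is not None
```

A record with a death date earlier than the birth date fails the second predicate, which is exactly the kind of record the data rule you build next will surface.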
Create data rule
You can now create a data rule using one of the imported data definitions.
Go back to the Data assets tab and click on the PATIENTS asset.
Click on the Rules tab and click Create rule +.
Select Data rule, then under Data rule definition, expand All and select the rule named DoD_gte_DoB. This rule definition says that the date of death of a patient should always be greater than or equal to the date of birth of the patient. Click Next.
Provide a name for the rule (DeathDate greater than or equal to BirthDate) and optional short and long descriptions. Click Next.
On the next screen, click Next.
Select the datebirth variable on the left. In the table on the right, expand your data source and look for the PATIENTS data asset. Select the BIRTHDATE column within the PATIENTS data asset. Click Bind on the left side of the screen. The PATIENTS.BIRTHDATE column will be bound to the datebirth variable as shown under Implemented bindings. Repeat the steps for the datedeath variable on the left and bind it to the DEATHDATE column of the PATIENTS data asset. Once both the variables have been bound, click Next.
On the next screen, click Next.
Output from a data rule can be saved to a customized table, wherein you can specify which rows and columns need to be included. On this screen, you can select the columns that you would like to include in the output table. The datedeath and datebirth variables are added automatically to the output table and should be visible under Selected output on the right side of the screen. On the left side of the screen, click on Columns tab, then find and select the ID, FIRST_NAME and LAST_NAME columns within the list. Click on Add to output. The three columns should now be listed under the Selected output on the right. Click Next.
On the next screen, click Next.
Click Test to test the rule. The rule is run on a sample of 100 rows, and the results are displayed on the screen. Toward the bottom of the screen, you can click on the Did not meet rule conditions tab, to view the records from the sample that did not satisfy the rule. Click on Save to save the data rule. (NOTE: It is possible that all 100 records in the sample meet the rule conditions.)
The newly added data rule will be displayed on the Rules tab of the PATIENTS data asset.
Run data rule
Scroll to the end and click on the overflow menu (the three vertical dots) for the data rule and click Run in the pop-up window.
Click on the Refresh icon to refresh the Run status of the data rule. Once the rule has completed running, the Run status will be updated to Successful, and you can see that 27 rows (or 4.61% of the total records) in the PATIENTS data asset have failed the rule. Click on the overflow menu icon and click on View run history.
Click on the checkbox next to the latest run (the first run in the list), then click View run details.
The 27 records that failed the rule will be displayed on the screen. You can see that the Date of Death for these records is not on or after the Date of Birth. You can click on the Download button to export the failed records to your local machine. You can specify which delimiter to use and whether the column headers should be included. You can also provide a start index and the number of rows starting at that index to export. Finally, click on PATIENTS in the breadcrumbs at the top of the screen to go back to the PATIENTS data asset.
Data rules also run as part of the quality analysis process, and the results from the data rules (records that do not satisfy the rule condition) affect the quality score for the data asset.
Create quality rule
Next, you will see how to create a Quality rule.
Back on the Rules tab of the PATIENTS data asset, click on Create rule +.
Select Quality rule, then under Data rule definition, expand All and select the rule named At_least_one_of_DL_Passport_exists. This rule definition says that at least one of the passport or the driver's license of the patient must be provided (both cannot be null). Click Next.
Select the passport variable on the left. In the table on the right, expand your data source and look for the PATIENTS data asset. Select the PASSPORT column within the PATIENTS data asset. Click Bind on the left side of the screen. The PATIENTS.PASSPORT column will be bound to the passport variable, as shown under Implemented bindings. Repeat the steps for the drivers_license variable on the left and bind it to the DRIVERS_LICENSE column of the PATIENTS data asset. Once both the variables have been bound, click Next.
Click Test to test the rule. The rule is run on a sample of 100 rows, and the results are displayed on the screen. Click Save to save the quality rule. (NOTE: It is possible that all 100 records in the sample meet the rule condition.)
The newly added quality rule will be displayed on the Rules tab of the PATIENTS data asset.
Quality rules, as opposed to data rules, can only be run as part of the quality analysis process.
Create automation rule
Next, you will create an automation rule. Automation rules help automate some of the tasks that you may need to run on your data to ensure that it is of the highest quality. This may include applying rule definitions, including data quality dimensions, or setting the data quality threshold.
The data quality project must be configured to run automation rules. Go to the HealthcareAnalysis project, then go to the Settings tab. Click on Data quality, then scroll all the way down to find the checkbox for Enable automation rules. Check the checkbox and click Save.
Earlier, you had imported the DoD_lte_Today data rule definition into the HealthcareAnalysis project. The rule states that the Date of Death should be less than or equal to today, that is, it cannot be a future date.
You will now create an automation rule that will add a quality rule using this rule definition to any asset that has the Patient Death Date business term assigned to it.
In order to use a data rule definition in an automation rule, it must first be published so that it is available for other users.
Go to the Data rules tab. Expand All, look for the DoD_lte_Today data rule definition and click on the overflow menu (three vertical dots) against it, then click Publish.
Click Publish in the pop-up window.
The data rule definition will be published, and you will be able to see a record for the data rule if you expand Published Rules and scroll all the way to the bottom of the page.
NOTE: If you wish, you can also publish the other rules and definitions under All to make them available to other users.
Navigate to the upper-left hamburger (☰) menu, expand Governance and click on Automation rules. Click on the Create automation rule + button.
Give your automation rule a name (Date of Death cannot be in the future), a description, and set the Status to Accepted. Setting the status to Accepted activates the rule, which means that the assets specified in the rule logic will be affected.
Next, you need to build the rule logic. Scroll down to the Rule logic canvas. Click on Conditions and in the menu that is displayed, click on the asset has the term (Select a term) assigned.
The selected condition is added on to the canvas on the screen. Drag and drop the condition against the if in the if-then logic statement that was already present on the canvas. You should hear a click sound when the condition locks in place. You can verify that the condition has been added to the if-then statement by clicking on the green if-then block and moving it around on the canvas. The condition should move along with the if-then block. Next, click on the magnifying glass within the condition to choose the business term.
A new pop-up window is displayed. Search for the Patient Death Date business term, then select the Patient Death Date business term from the list. Click Save.
Back on the canvas, you can see that the business term has been updated in the condition.
The next step is to add an action. Click on Actions in the left-hand menu and select bind the data rule definition.
The selected action is added on to the canvas on the screen. As before, drag and drop the action against the then in the if-then logic statement that was already present on the canvas. You should hear a click sound when the action locks in place. If you click on the green if-then block and move it around on the canvas, the condition as well as the action should move along with it. Next, click on the magnifying glass within the action to choose the data rule definition.
A new pop-up window is displayed. Search for the DoD_lte_Today data rule definition, then select the DoD_lte_Today data rule definition from the list. Click Save.
Back on the canvas, you can see that the data rule definition has been updated in the action. Scroll back up and click on Save.
A pop-up window is displayed. It gives information about the assets and workspaces that will be affected by the automation rule. Clicking on Show details opens a new browser page that lists the data assets and columns that will be affected. You will see that the DEATHDATE column in the PATIENTS data asset will be affected. Back on the pop-up window, click on Save to confirm that you want to save the automation rule.
You will be brought back to the Automation rules page, where you should now see a new record on the page for the automation rule that you created. If you don’t see the new record, click on the Refresh icon to refresh the list.
The automation rule will run as part of the data quality analysis process and will assign a new quality rule as per the DoD_lte_Today data rule definition to all columns across all data assets that have the Patient Death Date business term assigned.
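The if-then logic of the automation rule can be sketched as follows. This is an illustration of the condition/action pairing you built on the canvas, not the product's implementation; the data shapes and function name are invented for the sketch.

```python
def apply_automation_rule(columns, term="Patient Death Date",
                          rule_definition="DoD_lte_Today"):
    """Sketch of the automation rule's if-then logic:
    IF a column has the trigger business term assigned,
    THEN bind the data rule definition to it as a quality rule.

    columns: dict mapping column name -> set of assigned business terms.
    Returns the (column, rule definition) bindings that would be created.
    """
    return [(name, rule_definition)
            for name, terms in columns.items()
            if term in terms]
```

Run against the PATIENTS columns, only DEATHDATE carries the Patient Death Date term, so it is the only column that picks up the DoD_lte_Today quality rule, which matches the Show details preview you saw when saving the rule.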
Re-run data quality analysis
Once the data, quality, and automation rules have been added, you can re-run data quality analysis, which in turn runs these rules.
NOTE: Data rules can be run by themselves to generate an output table, as seen earlier. They are also run as part of the data quality analysis process. Quality rules are run only as part of the data analysis process. Automation rules are run as part of the data discovery and the data analysis process.
Navigate to the upper-left hamburger (☰) menu, expand Governance and click on Data quality. Click on the tile for your data quality project HealthcareAnalysis.
You will see that for the PATIENTS data asset, the Last analysis value is Outdated. Select the record for the PATIENTS data asset by clicking on the checkbox against it and click Analyze.
In the pop-up window, select Analyze data quality and click Analyze.
The data quality analysis is run, and the results should be available after a few minutes. You may need to click the Refresh icon. Once data analysis has completed, click on PATIENTS to go to the PATIENTS data asset.
Go to the Data quality tab of the PATIENTS data asset. You will see three new data quality dimensions in addition to the ones that you had looked through earlier. Notice that the overall quality score for the PATIENTS data asset has dropped by 2% due to these rules and the individual columns — BIRTHDATE, DEATHDATE, DRIVERS_LICENSE, and PASSPORT — also had a drop in quality score.
The first newly added data quality dimension — Rule violations: At_least_one_of_DL_Passport_exists — specifies the rule violations for the quality rule you created. The second one — Rule violations: DoD_gte_DoB — specifies the rule violations for the data rule you created. The third one — Rule violations: DoD_lte_Today — specifies the rule violations for the quality rule that was added by the automation rule.
As before, you can expand these new rule violation entries to see which records have failed the rules.
In this tutorial, you have looked at tools available on the IBM Cloud Pak for Data platform that can be used to improve the quality of your data. You learned how to assign governance artifacts to your data and how to run analysis to calculate the data quality score for your data. You identified and selected primary keys for your data assets and identified foreign key relationships between the data assets. You learned about the various kinds of rules that can be incorporated into your data quality project to ensure that your data is of the highest possible quality. You also looked through the different data quality dimensions to see what affects the quality score for a data asset.
This tutorial is part of the An introduction to the DataOps discipline series. To continue the series and learn more about how to add privacy features to protect some or all of your data, take a look at the next tutorial titled Protect your data using data privacy features.