Win $20,000. Help build the future of education. Answer the call. Learn more

Analyze discovered data to gain insights on the quality of your data

In the Learn to discover data that resides in your data sources tutorial, you saw how data can be discovered using the data discovery feature of Watson Knowledge Catalog. In this tutorial, you will work with the discovered data and learn how to associate governance artifacts with these discovered assets. You will identify the primary keys for the data assets and perform relationship analysis in order to identify foreign key relationships between the assets. You will look at data quality analysis results and observe what affects the quality scores of data assets. You will also learn to enforce rules that you expect the data assets to follow.

Learning objectives

In this tutorial, you will:

  • Review and update data classes, business terms, and keys for the data assets
  • Run relationship analysis to identify foreign key relationships between the data assets
  • Review data quality dimensions violations
  • Add rules to improve data quality
  • Perform re-analysis of the data assets and observe changes in data quality

Prerequisites

Estimated time

This tutorial will take approximately 60 minutes to complete.

Step 1. Review and update data classes, business terms, and keys for the data assets

  • Log into your IBM Cloud Pak for Data instance. CPD login

  • Navigate to the upper-left hamburger (☰) menu, expand Governance, and click on Data quality. Click on the tile for your data quality project HealthcareAnalysis. This is the project that contains the results from the automated discovery process that was performed in Learn to discover data that resides in your data sources. CPD open data quality project

  • On the Data assets tab, click on PATIENTS to open the PATIENTS data asset. Click PATIENTS asset

  • The Columns tab for the PATIENTS data asset shows information about the columns in the asset. You can see that the analysis process has identified and auto-assigned data classes and business terms for some of the columns. To edit these values, click Edit, then click on the first column: ID. Patients - edit annotations

  • Go to the Data classes tab. You will see that the data classes Text and UUID were identified for this column. Moreover, both these data classes have confidence of 100%. Of these, the UUID data class was detected as the data class for this column during analysis and, therefore, you can see UUID mentioned under Detected Data Class. The data class was correctly identified in this case, but if it is not, and if you wish to update the data class, you can use the drop-down under Selected Data Class to choose the data class you wish to assign to this column. Patients ID data class

  • Go to the Governance tab. You can see that a number of business terms have been suggested, but none have been assigned to the ID column. This is because none of the business terms has a confidence of over 80%. You can click on the checkmark next to Patient ID to assign this business term to the ID column. You can also use the search bar to search for terms in case they aren’t assigned or suggested. You can also click on the cross next to the other suggested terms to mark them as Rejected. Marking terms as Assigned or Rejected ensures that the terms remain as assigned or rejected after any future analyses of the data. Patients ID governance

NOTE: While you can skip rejecting incorrectly suggested terms, as in case of this column, rejection of incorrect terms is especially important when the confidence in those terms is over 80%, as a result of which they get automatically assigned to the column. In these cases, rejection will ensure that the rejected terms will not be assigned to this column on future analyses even if the confidence is over 80%.

  • Now that the business terms have been accepted and rejected, you can repeat the steps of correcting the data class and business terms for the other columns by going into the columns using the links on the left. Refer the PATIENTS section of the Data asset annotations file to assign data classes and business terms to the columns in the PATIENTS data asset. Once you have completed the steps for all the remaining columns in the PATIENTS data asset, go back to the PATIENTS data asset using the breadcrumbs at the top. Patients ID governance complete

  • Go to the Keys tab. You can see a list of columns that were identified as candidate keys for the PATIENTS data asset. Check the box against the ID column and click on Mark as selected. In the pop-up window, click Mark to confirm. This will make sure the system knows that the ID column is to be used as a primary key for the PATIENTS data asset. You can select the other identified candidate keys and click Mark as rejected in order to reject these keys. This will ensure that when you re-analyze the PATIENTS data asset in the future, the rejected keys will not be selected as candidate keys. Patients - select primary key

  • Click Done to save all the data class, business term, and key updates you have made to the PATIENTS asset, then click Analyze to re-analyze the asset based on these updates. Patients - reanalyze

  • In the pop-up window, choose the checkbox for Analyze data quality, then click Analyze. Patients - reanalyze confirm

  • The analysis score will now be updated. (NOTE: you may need to refresh the page to see the new analysis results.) You can see that the data classes and business terms you had assigned have been retained. The ID column now shows a key icon in front of it to indicate that this is the primary key for the data asset. In the left panel, you can see that the Data quality score for the PATIENTS data asset has dropped. In the left bottom panel, where the column names are listed, you can see that the quality score for BIRTHDATE and CITY has dropped, which caused the drop in the overall quality score. Patients - reanalysis results

  • Go back to the project using the breadcrumbs. Update the data classes, business terms, and keys for the ENCOUNTERS data asset. (NOTE: For the extended version of this tutorial, update the governance artifacts for all 12 data assets.) Use the Data asset annotations file to learn which data class and business term to assign for each column of each data asset. The file also lists the primary keys for each data asset. For each data asset, perform re-analysis once the updates have been made.

Step 2. Run relationship analysis to identify foreign key relationships between the data assets

  • After you have selected the primary keys for your tables, you can run Relationship analysis to identify foreign key relationships between the data assets. In your project, go to the Relationships tab. Click the checkboxes against ENCOUNTERS and PATIENTS to select them, then click Run analysis. Project - run relationship analysis

  • In the pop-up window, choose Key relationship analysis and click on Analyze. Project - run relationship analysis confirm

  • Wait for a minute or two for the analysis to complete. Click on the refresh icon to refresh the page. You will be able to see a chart and a table both showing the relationship between the PATIENTS and the ENCOUNTERS asset. Project - run relationship analysis result

  • Scroll all the way to the end of the table. You can see that this relationship has been set to status Selected. Click on the overflow menu for the record. You will see the options to set the status as Candidate or Rejected and the option to delete the key relationship. This relationship needs to be set as Selected (which it already is). For any other relationships you generate, you can mark them as Selected or Rejected using the options in the overflow menu. This will ensure that future key relationship analyses will remember what you had set. Project - run relationship analysis result overflow menu

NOTE: For the extended version of this tutorial, run relationship analysis between the other data assets. The following table lists all the foreign key relationships between all the 12 data assets in the Healthcare-Data.zip archive. Ensure that you mark these relationships as Selected and all others as Rejected.

Parent data asset Primary key Child data asset Foreign key
ENCOUNTERS ID ALLERGIES ENCOUNTER
PATIENTS ID ALLERGIES PATIENT
ENCOUNTERS ID CAREPLANS ENCOUNTER
PATIENTS ID CAREPLANS PATIENT
ENCOUNTERS ID CONDITIONS ENCOUNTER
PATIENTS ID CONDITIONS PATIENT
ORGANIZATIONS ID ENCOUNTERS ORGANIZATION
PATIENTS ID ENCOUNTERS PATIENT
PAYERS ID ENCOUNTERS PAYER
PROVIDERS ID ENCOUNTERS PROVIDER
ENCOUNTERS ID IMMUNIZATIONS ENCOUNTER
PATIENTS ID IMMUNIZATIONS PATIENT
ENCOUNTERS ID MEDICATIONS ENCOUNTER
PATIENTS ID MEDICATIONS PATIENT
PAYERS ID MEDICATIONS PAYER
ENCOUNTERS ID OBSERVATIONS ENCOUNTER
PATIENTS ID OBSERVATIONS PATIENT
ENCOUNTERS ID PROCEDURES ENCOUNTER
PATIENTS ID PROCEDURES PATIENT
ORGANIZATIONS ID PROVIDERS ORGANIZATION
  • Make sure to click on the Customize display tab to specify which relationships you want to display on the screen. Select Selected and Candidate in the filter drop-down, click the checkbox to Select all data assets, then click Customize display. Project - customize diplay for relationship analysis

Step 3. Review data quality dimensions violations

  • Go to the Data assets tab and click on PATIENTS to open the PATIENTS data asset. Project - click Patients asset

  • Go to the Data quality tab. The data quality analysis process identifies quality problems with your data by analyzing quality dimensions and lists the violations that it observes on the Data quality tab. The number of violations found against each data quality dimension as well as the delta (percentage change in the number of violations between the last two analyses) is shown on the screen. Patients - data quality tab

  • The first dimension in the list — Data class violations — shows the values in a column that do not match the detected data class of that column. Each value that violates the class is identified as a violation. Click on Data class violations to expand it. The column names that contain data class violations is shown, along with the number of violations found for each column. Click on the first column name in the list: CITY. Patients - data class violations

  • A pop-up window is displayed, showing all the records from the PATIENTS data asset that have data class violations in the CITY column. Scroll left to right to see the entire record, and scroll up and down to see more records. You can also use the Download button to download the records as a CSV file. Close the pop-up window by clicking on the X on the top right. Patients - city violations

  • Likewise, you can expand and see records for the other columns listed under data class violations, such as ADDRESS and BIRTHDATE.

  • Expand and look through the other data quality dimensions on the screen.

    • Suspect values identifies values that do not seem to match the majority of the other values in the column because their characteristics are different.
    • Inconsistent capitalization identifies values where the usage of uppercases and lowercases is not consistent.
    • Data type violations identifies values that do not conform to the inferred data type in length, precision, or scale, or violate the specified/identified data type.
    • Duplicated values identifies duplicated values in columns where most of the values are unique.
    • Values out of range identifies outliers in a column’s data. Any value that does not fall between the minimum and maximum values specified for the column is identified.
    • Suspect values in correlated columns identifies columns that are correlated with other columns (one column’s value can be predicted using the other) and then uses that information to identify records that do not have the same correlation.
    • Missing values looks for missing values in a column that has been defined as a non-nullable column.
    • Inconsistent representation of missing values looks for varying representations of missing data (NA, NULL, blank fields, for example). A column that contains both null values and empty values suggests that there is no standardized way to represent missing values.
    • Format violations identify values that match a format that has been specified as invalid for a particular column in an analysis.
  • You can also ignore any of the dimensions. As a result, the dimension’s violations will not affect the quality analysis score. To ignore a dimension, go into the Edit mode by clicking on Edit, then toggle the Ignore button against the dimension. Click on Done (which is shown in place of the Edit button) to save the changes, then Analyze data quality again to view the change in the quality score. Patients - ignore dimension

Step 4. Add rules

You can also make use of rules to ensure that your data is of high quality. These include:

  • Data rule definitions, which are used to develop rule logic to analyze data. These are used as the basis for data rules and quality rules.
  • Rule set definitions, which are collections of data rule definitions.
  • Data rules and Quality rules, which evaluate and validate specific conditions associated with your data sources by binding data rule definitions to physical data.
  • Rule sets, which are collections of data rules.
  • Automation rules, which can be used to automate the process of governing your data.

Import data rule definitions

  • Download the HealthcareAnalysis-rules.xml file.

  • Back in your data quality project, go to the Data rules tab and click on the Import rules and definitions button. Project - import data rules

  • In the pop-up window, click on Add file and choose the HealthcareAnalysis-rules.xml file you downloaded earlier. Once the file has been uploaded, click Next. Project - import data rules add file

  • On the next screen, you can see three checkboxes, each representing a data rule definition, in the uploaded file. Ensure that all three are checked and click Import. Project - import data rules import

  • You will see a notification that the assets were imported successfully. Click Close. Project - import data rules successful

  • Back on the Data rules tab, click the Refresh button to refresh the list of rules and expand All. You can see that three new records are now available under All. Project - import data rules result

  • Two of the newly added records relate to the date of death (of a patient). They include two data rule definitions, one of which verifies that the value of Date of Death is less than or equal to today (meaning, Date of Death cannot be a date in the future), and another verifies that the Date of Death occurs on or after the Date of Birth. The third new record is a data rule definition that verifies that at least the driver’s license number or the passport number of a patient has been provided.

Create data rule

  • You can now create a data rule using one of the imported data definitions.

  • Go back to the Data assets tab and click on the PATIENTS asset. Click Patients asset

  • Click on the Rules tab and click Create rule +. Patients create rule

  • Select Data rule, then under Data rule definition, expand All and select the rule named DoD_gte_DoB. This rule definition says that the date of death of a patient should always be greater than or equal to the date of birth of the patient. Click Next. Patients - create data rule

  • Provide a name for the rule (DeathDate greater than or equal to BirthDate), and optional short and long descriptions. Click Next. Patients - data rule name

  • On the next screen, click Next. Patients - data rule governance

  • Select the datebirth variable on the left. In the table on the right, expand your data source and look for the PATIENTS data asset. Select the BIRTHDATE column within the PATIENTS data asset. Click Bind on the left side of the screen. The PATIENTS.BIRTHDATE column will be bound to the datebirth variable as shown under Implemented bindings. Repeat the steps for the datedeath variable on the left and bind it to the DEATHDATE column of the PATIENTS data asset. Once both the variables have been bound, click Next. Patients - data rule binding

  • On the next screen, click Next. Patients - data rule joins

  • Output from a data rule can be saved to a customized table, wherein you can specify which rows and columns need to be included. On this screen, you can select the columns that you would like to include in the output table. The datedeath and datebirth variables are added automatically to the output table and should be visible under Selected output on the right side of the screen. On the left side of the screen, click on Columns tab, then find and select the ID, FIRST_NAME and LAST_NAME columns within the list. Click on Add to output. The three columns should now be listed under the Selected output on the right. Click Next. Patients - data rule output content

  • On the next screen, click Next. Patients - data rule output settings

  • Click Test to test the rule. The rule is run on a sample of 100 rows, and the results are displayed on the screen. Toward the bottom of the screen, you can click on the Did not meet rule conditions tab, to view the records from the sample that did not satisfy the rule. Click on Save to save the data rule. (NOTE: It is possible that all 100 records in the sample meet the rule conditions.) Patients - data rule save

  • The newly added data rule will be displayed on the Rules tab of the PATIENTS data asset. Patients - data rule saved

Run data rule

  • Scroll to the end and click on the overflow menu (the three vertical dots) for the data rule and click Run in the pop-up window. Patients - run data rule

  • Click on the Refresh icon to refresh the Run status of the data rule. Once the rule has completed running, the Run status will be updated to Successful, and you can see that 27 rows (or 4.61% of the total records) in the PATIENTS data asset have failed the rule. Click on the overflow menu icon and click on View run history. Patients - run data rule complete

  • Click on the checkbox next to the latest run (the first run in the list), then click View run details. Patients - run data rule view details

  • The 27 records that failed the rule will be displayed on the screen. You can see that the Date of Death for these records is not on or after the Date of Birth. You can click on the Download button to export the failed records to your local machine. You can specify which delimiter to use and whether the column headers should be included. You can also provide a start index and the number of rows starting at that index to export. Finally, click on PATIENTS in the breadcrumbs at the top of the screen to go back to the PATIENTS data asset. Patients - run data rule details download

  • Data rules also run as part of the quality analysis process, and the results from the data rules (records that do not satisfy the rule condition) affect the quality score for the data asset.

Create quality rule

  • Next, you will see how to create a Quality rule.

  • Back on the Rules tab of the PATIENTS data asset, click on Create rule +. Patients - create rule - 2

  • Select Quality rule, then under Data rule definition, expand All and select the rule named At_least_one_of_DL_Passport_exists. This rule definition says that at least the passport or the driver’s license of the patient must be provided (both cannot be null). Click Next. Patients - create quality rule

  • Select the passport variable on the left. In the table on the right, expand your data source and look for the PATIENTS data asset. Select the PASSPORT column within the PATIENTS data asset. Click Bind on the left side of the screen. The PATIENTS.PASSPORT column will be bound to the passport variable, as shown under Implemented bindings. Repeat the steps for the drivers_license variable on the left and bind it to the DRIVERS_LICENSE column of the PATIENTS data asset. Once both the variables have been bound, click Next. Patients - quality rule binding

  • Click Test to test the rule. The rule is run on a sample of 100 rows, and the results are displayed on the screen. Click Save to save the quality rule. (NOTE: It is possible that all 100 records in the sample meet the rule condition.) Patients - quality rule save

  • The newly added quality rule will be displayed on the Rules tab of the PATIENTS data asset. Patients - quality rule saved

  • Quality rules, as opposed to data rules, can only be run as part of the quality analysis process.

Create automation rule

  • Next, you will create an automation rule. Automation rules help automate some of the tasks that you may need to run on your data to ensure that it is of the highest quality. This may include applying rule definitions, including data quality dimensions, or setting the data quality threshold.

  • The data quality project must be configured to run automation rules. Go to the HealthcareAnalysis project, then go to the Settings tab. Click on Data quality, then scroll all the way down to find the checkbox for Enable automation rules. Check the checkbox and click Save. Project - enable automation rules

  • Earlier, you had imported the DoD_lte_Today data rule definition into the HealthcareAnalysis project. The rule states that the Date of Death should be less than or equal to today, that is, it cannot be a future date.

  • You will now create an automation rule that will add a quality rule using this rule definition to any asset that has the Patient Death Date business term assigned to it.

  • In order to use a data rule definition in an automation rule, it must first be published so that it is available for other users.

  • Go to the Data rules tab. Expand All, look for the DoD_lte_Today data rule definition and click on the overflow menu (three vertical dots) against it, then click Publish. Project - publish rule

  • Click Publish in the pop-up window. Project - confirm publish rule

  • The data rule definition will be published, and you will be able to see a record for the data rule if you expand Published Rules and scroll all the way to the bottom of the page. Project - rule published

NOTE: If you wish, you can also publish the other rules and definitions under All to make them available to other users.

  • Navigate to the upper-left hamburger (☰) menu, expand Governance and click on Automation rules. Click on the Create automation rule + button. CPD - create automation rule

  • Give your automation rule a name (Date of Death cannot be in the future), a description, and set the Status as Accepted. Setting the status to Accepted activates the rule, which means that the assets that are specified in the rule logic will be affected. CPD - automation rule name

  • Next, you need to build the rule logic. Scroll down to the Rule logic canvas. Click on Conditions and in the menu that is displayed, click on the asset has the term (Select a term) assigned. CPD - automation rule select condition

  • The selected condition is added on to the canvas on the screen. Drag and drop the condition against the if in the if-then logic statement that was already present on the canvas. You should hear a click sound when the condition locks in place. You can verify that the condition has been added to the if-then statement by clicking on the green if-then block and moving it around on the canvas. The condition should move along with the if-then block. Next, click on the magnifying glass within the condition to choose the business term. CPD - automation rule add condition

  • A new pop-up window is displayed. Search for the Patient Death Date business term, then select the Patient Death Date business term from the list. Click Save. CPD - automation rule condition business term

  • Back on the canvas, you can see that the business term has been updated in the condition. CPD - automation rule condition updated

  • The next step is to add an action. Click on Actions in the left-hand menu and select bind the data rule definition. CPD - automation rule select action

  • The selected action is added on to the canvas on the screen. As before, drag and drop the action against the then in the if-then logic statement that was already present on the canvas. You should hear a click sound when the action locks in place. If you click on the green if-then block and move it around on the canvas, the condition as well as the action should move along with it. Next, click on the magnifying glass within the action to choose the data rule definition. CPD - automation rule add action

  • A new pop-up window is displayed. Search for the DoD_lte_Today data rule definition, then select the DoD_lte_Today data rule definition from the list. Click Save. CPD - automation rule action rule def

  • Back on the canvas, you can see that the data rule definition has been updated in the action. Scroll back up and click on Save. CPD - automation rule save

  • A pop-up window is displayed. It gives information about the assets and workspaces that will be affected by the automation rule. Clicking on Show details opens a new browser page that lists the data assets and columns that will be affected. You will see that the DEATHDATE column in the PATIENTS data asset will be affected. Back on the pop-up window, click on Save to confirm that you want to save the automation rule. CPD - automation rule save confirm

  • You will be brought back to the Automation rules page, where you should now see a new record on the page for the automation rule that you created. If you don’t see the new record, click on the Refresh icon to refresh the list. CPD - automation rule created

  • The automation rule will run as part of the data quality analysis process and will assign a new quality rule as per the DoD_lte_Today data rule definition to all columns across all data assets that have the Patient Death Date business term assigned.

Re-run data quality analysis

  • Once the data, quality and automation rules have been added, you can re-run data analysis which in turn runs these rules.

NOTE: Data rules can be run by themselves to generate an output table, as seen earlier. They are also run as part of the data quality analysis process. Quality rules are run only as part of the data analysis process. Automation rules are run as part of the data discovery and the data analysis process.

  • Navigate to the upper-left hamburger (☰) menu, expand Governance and click on Data quality. Click on the tile for your data quality project HealthcareAnalysis. CPD open data quality project

  • You will see that for the PATIENTS data asset, the Last analysis value is Outdated. Select the record for the PATIENTS data asset by clicking on the checkbox against it and click Analyze. Project - run analysis for rules

  • In the pop-up window, select Analyze data quality and click Analyze. Project - run analysis for rules confirm

  • The data quality analysis is run, and the results should be available after a few minutes. You may need to click the Refresh icon. Once data analysis has completed, click on PATIENTS to go to the PATIENTS data asset. Project - go to Patients asset

  • Go to the Data quality tab of the PATIENTS data asset. You will see three new data quality dimensions in addition to the ones that you had looked through earlier. Notice that the overall quality score for the PATIENTS data asset has dropped by 2% due to these rules and the individual columns — BIRTHDATE, DEATHDATE, DRIVERS_LICENSE, and PASSPORT — also had a drop in quality score. Patients - data quality rule dimensions

  • The first newly added data quality dimension — Rule violations: At_least_one_of_DL_Passport_exists – specifies the rule violations for the quality rule you created. The second one — Rule violations: DoD_gte_DoB — specifies the rule violations for the data rule you created. The third one — Rule violations: DoD_lte_Today — specifies the rule violations for the quality rule that was added by the automation rule.

  • As before, you can expand these new rule violation entries to see which records have failed the rules. Patients - data quality rule view results 1 Patients - data quality rule view results 2

Summary

In this tutorial, you have looked at tools available on the IBM Cloud Pak for Data platform that can be used to improve the quality of your data. You learned how to assign governance artifacts to your data and how to run analysis to calculate the data quality score for your data. You identified and selected primary keys for your data assets and identified foreign key relationships between the data assets. You learned about the various kinds of rules that can be incorporated into your data quality project to ensure that your data is of the highest possible quality. You also looked through the different data quality dimensions to see what affects the quality score for a data asset.

This tutorial is part of the An introduction to the DataOps discipline series. To continue the series and learn more about how to add privacy features to protect some or all of your data, take a look at the next tutorial titled Protect your data using data privacy features.