Customize data quality assessment with automation rules
When cataloging data assets, it is vital to assess the quality of the data they contain. This can be a very time-consuming process.…
Data quality assessment evaluates data against a set of criteria. The criteria depend on the business context of the data: what information the data represents, how critical that information is, and what it is used for. Data quality dimensions and custom data quality rules should be used together to perform the assessment. Data profiling provides an initial view into the data, and data quality dimensions use that profiling information to assess quality, for example by checking for null values or for incorrect data types or formats. Data quality rules allow for more specific logic, such as checking that a date falls within a range.
Let’s consider two simple examples. We have a data set that contains two columns. Without knowing anything else, we can use missing values as a dimension for both columns. Profiling then provides additional insight: one column contains alphanumeric values, and the other contains dates. For the date column, we can now use invalid format to evaluate quality further. For both columns, we can also check whether duplicates exist, although duplicates may or may not indicate a quality issue. Business classification then shows that the first column is an email address and the second column is a date of birth. With this information, we can add a further check that the email addresses are formatted correctly.
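To make the dimensions above concrete, here is a minimal sketch in plain Python. It is not how Watson Knowledge Catalog computes its dimensions; the sample rows, the simplified email pattern, the assumed date format, and the helper names (is_missing, is_valid_date, column_report) are all assumptions made for illustration.

```python
import re
from datetime import datetime

# Hypothetical sample rows: (email, date_of_birth) as raw strings.
ROWS = [
    ("ana@example.com", "1990-04-12"),
    ("", "1985-13-40"),
    ("bob.example.com", "1978-11-02"),
    ("ana@example.com", None),
]

# Deliberately simple pattern for illustration; not a full email validator.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def is_missing(value):
    """A value counts as missing if it is None or blank."""
    return value is None or value.strip() == ""


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Check that the string parses as a date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def column_report(values, format_check):
    """Count missing values, format violations, and duplicates in a column."""
    present = [v for v in values if not is_missing(v)]
    return {
        "missing": len(values) - len(present),
        "invalid_format": sum(1 for v in present if not format_check(v)),
        "duplicates": len(present) - len(set(present)),
    }


emails = [row[0] for row in ROWS]
births = [row[1] for row in ROWS]

email_report = column_report(emails, lambda v: bool(EMAIL_PATTERN.match(v)))
birth_report = column_report(births, is_valid_date)

print(email_report)
print(birth_report)
```

Note that the duplicate count is only reported here, not scored: as discussed above, whether a repeated email or birth date is a problem depends on the business context.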
For a more customized view of quality, we’ll need to use additional information. For emails, we need to know the context. For example, if the emails are work emails for an enterprise’s employees, then we can add a rule that checks that each address uses the enterprise’s own domain, following the format email@example.com. However, if the emails belong to customers of the enterprise, that rule does not help evaluate quality.
For illustration, let’s assume that the Bank3 schema contains information about internal users. Therefore, we’ll add a check via the rule “InternalEmail/ValidDomain” to email columns in that schema. For all other email addresses, this rule will not be applied.
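The idea of applying a rule only when a condition holds can be sketched as follows. This is an illustration of the logic, not the product’s rule syntax or API: the Column type, the "email_address" data class label, the Bank1 schema, and the example.com domain are assumptions; only the Bank3 schema and the rule name come from the scenario above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    schema: str
    name: str
    data_class: str


RULE_NAME = "InternalEmail/ValidDomain"
INTERNAL_DOMAIN = "example.com"  # assumed internal domain, for illustration


def rules_for(column):
    """Return the extra rules to attach to a column.

    The condition mirrors the scenario: only email columns in the
    Bank3 schema get the internal-domain rule.
    """
    if column.schema == "Bank3" and column.data_class == "email_address":
        return [RULE_NAME]
    return []


def valid_domain(email, domain=INTERNAL_DOMAIN):
    """The rule itself: the address must end with the internal domain."""
    return email.lower().endswith("@" + domain)


internal = Column("Bank3", "EMAIL", "email_address")
external = Column("Bank1", "EMAIL", "email_address")  # hypothetical schema

print(rules_for(internal))
print(rules_for(external))
print(valid_domain("jo@example.com"), valid_domain("jo@gmail.com"))
```

Keeping the condition (rules_for) separate from the check (valid_domain) reflects the split described here: the condition decides where a rule applies, and the rule decides whether a value passes.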
When running an automated discovery, the automation rule is applied only to the data that matches its condition. In this particular dataset, the additional rule impacted the quality score significantly.
For date of birth, we’ll check for three dimensions (missing values, data type, and data class violations) and add a rule to check that the date is earlier than today’s date. We’ll also raise the quality score threshold from the default of 80 to 95.
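As a rough sketch of how a raised threshold changes the outcome, consider the following. The scoring here is a simplification we assume for illustration (the share of values that pass every check); it is not the formula the product uses, it folds the data class check into the date parse, and the sample values and the fixed "today" are made up.

```python
from datetime import date

DEFAULT_THRESHOLD = 80
DOB_THRESHOLD = 95  # raised threshold for date of birth


def passes(value, today):
    """Apply the checks in turn: missing value, data type, then the date rule."""
    if value is None or value.strip() == "":
        return False  # missing value
    try:
        parsed = date.fromisoformat(value)
    except ValueError:
        return False  # not a valid ISO date
    return parsed < today  # rule: date of birth must be before today


def quality_score(values, today):
    """Simplified score: percentage of values that pass every check."""
    if not values:
        return 100.0
    passed = sum(1 for v in values if passes(v, today))
    return 100.0 * passed / len(values)


today = date(2021, 1, 1)  # fixed date so the example is reproducible
values = ["1990-04-12"] * 9 + ["2090-01-01"]  # one birth date in the future

score = quality_score(values, today)
meets_default = score >= DEFAULT_THRESHOLD
meets_raised = score >= DOB_THRESHOLD

print(score, meets_default, meets_raised)
```

With one bad value in ten, the column would pass at the default threshold but fail at 95, which is the point of raising the threshold for a critical field.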
IBM Watson Knowledge Catalog on IBM Cloud Pak for Data provides a way to customize data quality evaluation using a variety of conditions.